Steps to Install Spark2 on the Cloudera Quickstart VM
Cloudera, one of the leading distributions of Hadoop, provides an easy-to-install Virtual Machine for getting started quickly on its platform. With it, you can get a single-node CDH cluster running within a virtual environment. You could use this VM for personal learning, for rapidly building applications on a dedicated cluster, or for many other purposes.
Out of the box, the Quickstart VM runs Spark 1.6.x, a very stable and reliable version of Spark. However, Spark 2.0 was released in 2016, bringing with it substantial improvements in features and performance. In this blog article, we’ll go step by step through how you can get Spark2 installed on your Quickstart VM.
Pre-Installation Steps
- You will need to install Java 8 on the Virtual Machine for Spark2 to be supported. You can follow these instructions: Upgrading to Java8 on the Cloudera Quickstart VM
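Before moving on, it can help to confirm that Java 8 is what the VM actually picks up. The sketch below is one possible check, not part of the original instructions. The helper name `is_java8` is hypothetical, and it assumes the usual `java -version` banner format (for example `java version "1.8.0_181"`).

```shell
# Succeeds if the given `java -version` banner reports Java 8 (1.8.x).
# Assumption: the banner contains a string like: version "1.8.0_181"
is_java8() {
    case "$1" in
        *'version "1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the VM (java prints its version to stderr, hence 2>&1):
#   is_java8 "$(java -version 2>&1)" && echo "Java 8 found" || echo "Java 8 not found"
```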
Installation Steps
1. Download and Install the VM
a. Navigate to https://www.cloudera.com/downloads/cdp-private-cloud-trial.html
b. Select the Platform you’d like the VM to run on and Download
c. Load the VM into your desired Platform
2. Configure the VM
Before starting the VM, set the following configurations:
— Set at least 8GB of RAM
— Set at least 2 CPUs
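Once the VM is running, you may want to verify that the settings above took effect. The following is a minimal sketch with a hypothetical helper, `meets_minimums`. It assumes that the memory Linux reports is slightly below the configured size, so the threshold is set to roughly 7.5 GB rather than a full 8 GB.

```shell
# Succeeds when the reported memory (in kB) and CPU count meet the minimums.
# Assumption: MemTotal reads a little below the configured RAM, so the
# memory threshold is roughly 7.5 GB instead of exactly 8 GB.
meets_minimums() {
    local mem_kb=$1
    local cpus=$2
    [ "$mem_kb" -ge 7500000 ] && [ "$cpus" -ge 2 ]
}

# Usage inside the VM (Linux):
#   mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
#   meets_minimums "$mem_kb" "$(nproc)" && echo "OK" || echo "Increase RAM or CPUs"
```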
3. Start up the VM
4. Start up Cloudera Manager (CM)
Once the VM starts up, navigate to the Desktop and Execute the “Launch Cloudera Express” script.
Note: This may take a while to run
Once complete, you should now be able to view the Cloudera Manager by opening up your web browser (within the VM) and navigating to:
http://quickstart.cloudera:7180
From your local machine, you can navigate to:
Default Credentials: cloudera/cloudera
5. Configure CM to use Parcels
Navigate to the Desktop and Execute the “Migrate to Parcels” script.
Note: This may take a while to run
You can validate that CM is now using parcels by logging into the Cloudera Manager Web UI. Right next to the cluster name, it should say: (CDH x.x.x, Parcels)
Note: You will need to restart all the services on the cluster after this. To do so, go to the Cloudera Manager Web UI, click on the button next to the Cluster Name, and click Start.
6. Select the Version of Spark2 you want to Install
Navigate here to get a full list of the Spark Versions that are available:
7. Install Spark2 CSD
a. Open a Command Line Terminal
b. Login as Root
$ sudo su
c. Navigate to the CSD Directory
$ cd /opt/cloudera/csd
d. Download the CSD (Replace CSD_URL with the URL you copied from step #6)
$ wget CSD_URL
e. Set Permissions and Ownership
$ chown cloudera-scm:cloudera-scm SPARK2_ON_YARN-x.x.x.clouderax.jar
$ chmod 644 SPARK2_ON_YARN-x.x.x.clouderax.jar
f. Restart CM Services
$ service cloudera-scm-server restart
g. Login to the Cloudera Manager Web UI
h. Restart the Cloudera Management Service
— Select Clusters > Cloudera Management Service
— Select Actions > Restart
i. Restart the Cluster Services
— Select Clusters > Cloudera QuickStart
— Select Actions > Restart
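Steps 7c through 7e can also be wrapped in a small script so that the jar name does not have to be typed by hand. This is only a sketch: the function names `csd_filename` and `stage_csd` are hypothetical, and it assumes the jar file name is the last path segment of the CSD URL.

```shell
# Derive the jar file name from the CSD URL (everything after the last '/').
csd_filename() {
    basename "$1"
}

# Download the CSD into the CSD directory, then set ownership and permissions.
# Must be run as root. The directory defaults to the one used in step 7c.
stage_csd() {
    local csd_url=$1
    local csd_dir=${2:-/opt/cloudera/csd}
    local jar
    jar=$(csd_filename "$csd_url")
    wget -P "$csd_dir" "$csd_url" &&
        chown cloudera-scm:cloudera-scm "$csd_dir/$jar" &&
        chmod 644 "$csd_dir/$jar"
}

# Usage (CSD_URL is the URL copied in step 6):
#   stage_csd "$CSD_URL" && service cloudera-scm-server restart
```

The remaining restarts (steps 7g through 7i) still happen in the Cloudera Manager Web UI.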
8. Install Spark2 Parcel
Complete Documentation on how to manage Parcels:
a. Login to the Cloudera Manager Web UI
b. Navigate to Hosts -> Parcels
c. Locate the SPARK2 parcel from the list
d. Under Actions, click Download and wait for it to download
e. Under Actions, click Distribute and wait for it to be distributed
f. Under Actions, click Activate and wait for it to be activated
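After activation, a quick check from the terminal can confirm that the parcel is in place. The sketch below uses a hypothetical helper, `parcel_present`. It assumes that an activated parcel shows up as /opt/cloudera/parcels/SPARK2, which is the path the smoke test later in this article relies on.

```shell
# Succeeds if a parcel with the given name exists under the parcels directory.
# Assumption: an activated parcel appears as <parcels_dir>/<NAME>.
parcel_present() {
    local name=$1
    local parcels_dir=${2:-/opt/cloudera/parcels}
    [ -e "$parcels_dir/$name" ]
}

# Usage on the VM:
#   parcel_present SPARK2 && echo "SPARK2 parcel is in place"
```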
9. Install Spark2 Service
a. Login to the Cloudera Manager Web UI
b. Click on the button next to the Cluster Name and select “Add Service”
c. Select “Spark 2” and click “Continue”
d. Select whichever set of dependencies you would like and click “Continue”
e. Select the one instance available as the History Server and the Gateway and click “Continue”
f. Leave the default configurations as is and click “Continue”
g. The service will now be added and then you will be taken back to the CM home
h. Click on the blue button next to the Spark2 service and click “Restart Stale Services”
i. Ensure the “Re-deploy client configuration” is checked and click “Restart Now”
10. Set Up Integration with Hive
a. Open a Command Line Terminal
b. Login as Root
$ sudo su
c. Create a symlink in the Spark2 Configurations to the hive-site.xml
$ ln -s /etc/hive/conf/hive-site.xml /etc/spark2/conf/hive-site.xml
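Running `ln -s` a second time fails because the link already exists. If you expect to repeat this step, a guarded version along these lines may be convenient. The function name `link_hive_site` is hypothetical, and the defaults are the two paths from the command above.

```shell
# Create the hive-site.xml symlink unless the correct link already exists.
# Fails if the source file is missing, or if the destination exists and is
# something other than the expected link (ln -s refuses to overwrite it).
link_hive_site() {
    local src=${1:-/etc/hive/conf/hive-site.xml}
    local dst=${2:-/etc/spark2/conf/hive-site.xml}
    if [ ! -f "$src" ]; then
        echo "missing: $src" >&2
        return 1
    fi
    if [ -L "$dst" ] && [ "$(readlink "$dst")" = "$src" ]; then
        return 0    # already linked, nothing to do
    fi
    ln -s "$src" "$dst"
}

# Usage (as root):
#   link_hive_site
```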
Testing
Smoke Test
$ MASTER=yarn /opt/cloudera/parcels/SPARK2/lib/spark2/bin/run-example SparkPi 100
Test Spark Shell
1. Start the Shell
$ spark2-shell
2. Execute processes
spark
spark.sql("select 1").show()
Test PySpark Shell
1. Start the Shell
$ pyspark2
2. Execute processes
spark
spark.sql("select 1").show()
Test Spark Submit
$ spark2-submit --class org.apache.spark.examples.SparkPi /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_2.11-*.jar
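The wildcard in the jar path only works if it matches exactly one file. As a defensive sketch, a hypothetical helper named `find_examples_jar` can resolve the jar first and print a clear error otherwise. The default directory is the one from the command above.

```shell
# Print the path of the single examples jar, or fail with a message when
# the glob matches no file or more than one file.
find_examples_jar() {
    local jars_dir=${1:-/opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars}
    # Re-use the function's positional parameters to hold the glob matches.
    set -- "$jars_dir"/spark-examples_2.11-*.jar
    if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
        echo "expected exactly one examples jar in $jars_dir" >&2
        return 1
    fi
    printf '%s\n' "$1"
}

# Usage:
#   jar=$(find_examples_jar) &&
#       spark2-submit --class org.apache.spark.examples.SparkPi "$jar"
```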
Test Integration with Hive
1. Start the Shell
$ spark2-shell
2. Execute processes
spark.sql("show databases").show()
Note: The output should show at least one database: default
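The same check can be run without opening an interactive shell by piping the statement into spark2-shell and scanning the output. This is a rough sketch: the helper `has_default_db` is hypothetical, and it assumes that `show()` prints each database as a table row delimited by `|` characters.

```shell
# Read shell output on stdin and succeed if a table row contains exactly
# the value "default" (for example: |     default|).
has_default_db() {
    grep -Eq '\|[[:space:]]*default[[:space:]]*\|'
}

# Usage on the VM:
#   echo 'spark.sql("show databases").show()' | spark2-shell 2>/dev/null | has_default_db \
#       && echo "Hive integration looks OK"
```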
Documentation
Installing or Upgrading CDS Powered by Apache Spark | 2.3.x | Cloudera Documentation