
Installing Spark2 on Cloudera’s Quickstart VM

By Robert Sanders - August 27, 2021
 

Steps to Install Spark2 on the Cloudera Quickstart VM

Cloudera, which provides one of the leading distributions of Hadoop, offers an easy-to-install virtual machine for getting started quickly on its platform. With it, you can easily get a single-node CDH cluster running within a virtual environment. You can use this VM for personal learning, for rapidly building applications on a dedicated cluster, or for many other purposes.

Out of the box, the VM runs Spark 1.6.x, a very stable and reliable version of Spark. However, Spark 2.0 was released in 2016, bringing with it exceptional improvements in features and performance. In this blog article, we’ll go step by step through how you can get Spark2 installed on your Quickstart VM.

Pre-Installation Steps

Installation Steps

1. Download and Install the VM

a. Navigate to https://www.cloudera.com/downloads/quickstart_vms.html

b. Select the Platform you’d like the VM to run on and Download

c. Load the VM into your desired Platform

2. Configure the VM

Before starting the VM, set the following configurations:

— Set at least 8GB of RAM

— Set at least 2 CPUs
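Once the VM has booted (step 3), the settings above can be confirmed from a terminal inside the guest. The following is a minimal sketch and not part of the original steps: the helper name `meets_minimum` is hypothetical, and the 7 GiB threshold is an assumption that allows for Linux reporting slightly less memory than the amount assigned to the VM.

```shell
#!/usr/bin/env bash
# Sketch: check that the guest sees roughly the RAM and CPU count from step 2.
# meets_minimum is a hypothetical helper name, not a Cloudera tool.

# $1 = total memory in kB, $2 = number of CPUs
meets_minimum() {
  local min_mem_kb=$((7 * 1024 * 1024))  # assumption: about 7 GiB visible when 8 GB is assigned
  local min_cpus=2
  [ "$1" -ge "$min_mem_kb" ] && [ "$2" -ge "$min_cpus" ]
}

# Read the values the Linux guest actually reports.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
cpus=$(nproc)

if meets_minimum "$mem_kb" "$cpus"; then
  echo "OK: ${mem_kb} kB RAM, ${cpus} CPUs"
else
  echo "Below the recommended minimum: ${mem_kb} kB RAM, ${cpus} CPUs"
fi
```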

3. Start up the VM

4. Start up Cloudera Manager (CM)

Once the VM starts up, navigate to the Desktop and execute the “Launch Cloudera Express” script.

Note: This may take a while to run

Once complete, you should now be able to view the Cloudera Manager by opening up your web browser (within the VM) and navigating to:

http://quickstart.cloudera:7180

From your local machine, you can navigate to the following address (this assumes port 7180 is forwarded from the VM to your host):

http://localhost:7180

Default Credentials: cloudera/cloudera
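Because the launch script can take a while, it may help to poll the Cloudera Manager address until it answers instead of refreshing the browser. This is a rough sketch only: `wait_for_url` is a hypothetical helper, and the retry count and delay are arbitrary defaults chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: poll a URL until it responds or the attempts run out.
# wait_for_url is a hypothetical helper; the defaults below are arbitrary.

# $1 = URL, $2 = max attempts (default 30), $3 = seconds between attempts (default 10)
wait_for_url() {
  local url="$1" attempts="${2:-30}" delay="${3:-10}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: quiet, -f: treat HTTP error statuses as failures, -o /dev/null: discard the body
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      echo "Reachable: $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Gave up waiting for $url" >&2
  return 1
}
```

Inside the VM, this could be called as `wait_for_url http://quickstart.cloudera:7180`.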

5. Configure CM to use Parcels

Navigate to the Desktop and Execute the “Migrate to Parcels” script.

Note: This may take a while to run

You can validate that CM is now using parcels by logging into the Cloudera Manager Web UI. Right next to the cluster name, it should say: (CDH x.x.x, Parcels)

Note: You will need to restart all the services on the cluster after this. To do so, go to the Cloudera Manager Web UI, click on the button next to the Cluster Name, and click Start.

6. Select the Version of Spark2 you want to Install

Navigate here to get a full list of the Spark Versions that are available:

7. Install Spark2 CSD

a. Open a Command Line Terminal

b. Login as Root

$ sudo su

c. Navigate to the CSD Directory

$ cd /opt/cloudera/csd

d. Download the CSD (Replace CSD_URL with the URL you copied from step #6)

$ wget CSD_URL

e. Set Permissions and Ownership

$ chown cloudera-scm:cloudera-scm SPARK2_ON_YARN-x.x.x.clouderax.jar
$ chmod 644 SPARK2_ON_YARN-x.x.x.clouderax.jar

f. Restart CM Services

$ service cloudera-scm-server restart

g. Login to the Cloudera Manager Web UI

h. Restart the Cloudera Management Service

— Select Clusters > Cloudera Management Service

— Select Actions > Restart

i. Restart the Cluster Services

— Select Clusters > Cloudera QuickStart

— Select Actions > Restart
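The command-line part of this step (7c through 7f) can be collected into a single function, which is convenient if you rebuild the VM often. This is a sketch under assumptions, not an official procedure: `csd_jar_name` and `install_csd` are hypothetical helper names, and the jar name is assumed to be the last path segment of the URL (no query string). The Cloudera Manager UI restarts in 7g through 7i still have to be done by hand.

```shell
#!/usr/bin/env bash
# Sketch of steps 7c-7f as one function; run as root.
# csd_jar_name and install_csd are hypothetical helper names.

# Derive the jar file name from the download URL (everything after the last "/").
csd_jar_name() {
  basename "$1"
}

# $1 = CSD URL from step 6
install_csd() {
  local url="$1"
  local csd_dir=/opt/cloudera/csd
  local jar
  jar=$(csd_jar_name "$url")

  cd "$csd_dir" || return 1
  wget "$url" || return 1

  # Same ownership and permissions as step 7e.
  chown cloudera-scm:cloudera-scm "$jar"
  chmod 644 "$jar"

  # Same restart as step 7f.
  service cloudera-scm-server restart
}

# Example (replace CSD_URL with the URL you copied in step 6):
# install_csd CSD_URL
```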

8. Install Spark2 Parcel

Complete Documentation on how to manage Parcels:

a. Login to the Cloudera Manager Web UI

b. Navigate to Hosts -> Parcels

c. Locate the SPARK2 parcel from the list

d. Under Actions, click Download and wait for it to download

e. Under Actions, click Distribute and wait for it to be distributed

f. Under Actions, click Activate and wait for it to be activated

9. Install Spark2 Service

a. Login to the Cloudera Manager Web UI

b. Click on the button next to the Cluster Name and select “Add Service”

c. Select “Spark 2” and click “Continue”

d. Select whichever set of dependencies you would like and click “Continue”

e. Select the one instance available as the History Server and the Gateway and click “Continue”

f. Leave the default configurations as is and click “Continue”

g. The service will now be added and then you will be taken back to the CM home

h. Click on the blue button next to the Spark2 service and click “Restart Stale Services”

i. Ensure the “Re-deploy client configuration” is checked and click “Restart Now”

10. Set Up Integration with Hive

a. Open a Command Line Terminal

b. Login as Root

$ sudo su

c. Create a symlink in the Spark2 Configurations to the hive-site.xml

$ ln -s /etc/hive/conf/hive-site.xml /etc/spark2/conf/hive-site.xml
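If you expect to repeat this step, the plain `ln -s` fails when the link already exists. The sketch below is one possible way to make it repeatable and to confirm where the link points. `link_hive_site` is a hypothetical helper name, and note that `-f` replaces whatever already sits at the destination path, so use it only if that is acceptable.

```shell
#!/usr/bin/env bash
# Sketch: create the hive-site.xml symlink idempotently and confirm its target.
# link_hive_site is a hypothetical helper name.

# $1 = source file, $2 = link path
link_hive_site() {
  local src="$1" dst="$2"
  [ -f "$src" ] || { echo "Missing source: $src" >&2; return 1; }
  # -s: symbolic, -f: replace an existing entry, -n: do not follow an existing link at dst
  ln -sfn "$src" "$dst" || return 1
  # Succeed only if the link now points at the source.
  [ "$(readlink "$dst")" = "$src" ]
}

# Example, run as root on the VM:
# link_hive_site /etc/hive/conf/hive-site.xml /etc/spark2/conf/hive-site.xml
```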

Testing

Smoke Test

$ MASTER=yarn /opt/cloudera/parcels/SPARK2/lib/spark2/bin/run-example SparkPi 100
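The SparkPi example prints a line beginning with "Pi is roughly" when it succeeds, which is easy to miss among the log messages. A small sketch for checking for that line follows. `has_pi_result` is a hypothetical helper name, and the example in the comment assumes Spark's default logging configuration, which writes log output to stderr.

```shell
#!/usr/bin/env bash
# Sketch: scan command output for SparkPi's result line.
# has_pi_result is a hypothetical helper name.

# Reads stdin; succeeds if a line such as "Pi is roughly 3.14..." is present.
has_pi_result() {
  grep -q "Pi is roughly"
}

# Example on the VM (assumes default logging, where log output goes to stderr):
# MASTER=yarn /opt/cloudera/parcels/SPARK2/lib/spark2/bin/run-example SparkPi 100 2>/dev/null \
#   | has_pi_result && echo "Smoke test passed"
```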

Test Spark Shell

1. Start the Shell

$ spark2-shell

2. Execute processes

spark
spark.sql("select 1").show()

Test PySpark Shell

1. Start the Shell

$ pyspark2

2. Execute processes

spark
spark.sql("select 1").show()

Test Spark Submit

$ spark2-submit --class org.apache.spark.examples.SparkPi /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_2.11-*.jar

Test Integration with Hive

1. Start the Shell

$ spark2-shell

2. Execute processes

spark.sql("show databases").show()

Note: The output should list at least one database: default

Author
Robert Sanders

Director of Big Data and Cloud Engineering for Clairvoyant LLC | Marathon Runner | Triathlete | Endurance Athlete
