Installing Apache Spark on Ubuntu

Apache Spark is an open-source framework for big data processing, used by data scientists and engineers to run computations over large volumes of data. Processing data at that scale has to be fast, so the engine doing the work must be efficient. Spark combines a DAG scheduler, in-memory caching, and an optimized query execution engine to process large datasets as quickly as possible.

Spark's core data structure is the RDD (Resilient Distributed Dataset): an immutable, distributed collection of objects. An RDD can hold any type of Python, Java, or Scala object, including user-defined classes. Much of Spark's popularity comes from its working mechanism, described next:

Apache Spark follows a master/slave architecture: a central coordinator, the “driver”, acts as the master, and its distributed workers, the “executors”, act as slaves. The third main component is the “Cluster Manager”, which, as the name indicates, manages drivers and executors. The cluster manager launches the executors and, in some cases, the drivers as well. Spark also ships with a built-in cluster manager that can launch Spark applications on the machines of a cluster. Spark has a number of notable features that explain why it is used for large-scale data processing; they are described below:

Features

Here are some distinctive features that make Apache Spark a better choice than its competitors:

Speed: As discussed above, Spark uses a DAG scheduler (which schedules jobs and determines a suitable location for each task), an optimized query execution engine, and supporting libraries to perform tasks efficiently and quickly.

Multi-Language Support: Apache Spark lets developers build applications in Java, Python, R, and Scala.

Real-Time Processing: Instead of working only on stored data, Spark can process data as it arrives and produce results in near real time.

Better Analytics: Spark ships with libraries for machine learning, SQL queries, and more. Its competitor, Hadoop MapReduce, offers only the Map and Reduce functions; this difference in analytical capability is another reason Spark is often preferred over MapReduce.

Given these features, this article walks you through installing Apache Spark on Ubuntu.

How to install Apache Spark on Ubuntu

This section walks you through installing Apache Spark on Ubuntu:

Step 1: Update the system and install Java

Before starting the installation itself, update the system's package index with the command below:

$ sudo apt update

After the update, install a Java environment with the command below; Spark runs on the Java Virtual Machine (JVM), so Java is required:

$ sudo apt install default-jdk
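
Before moving on, it can help to confirm that Java is actually reachable from your shell. The snippet below is a minimal sketch of such a check; the helper name have_cmd is made up for this illustration and is not part of Java or Spark.

```shell
# Report whether a command is available on PATH.
# (have_cmd is a hypothetical helper name used only in this sketch.)
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd java; then
    # java prints its version banner on stderr, so merge it into stdout
    java -version 2>&1 | head -n 1
else
    echo "java not found; rerun: sudo apt install default-jdk"
fi
```

If the first branch prints a version line, the Java step is complete.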

Step 2: Download and extract the Apache Spark archive

Once Java is installed, you are ready to download the Apache Spark archive. The following command downloads the 3.0.3 build of Spark (prebuilt for Hadoop 2.7) from the Apache archive:

$ wget https://archive.apache.org/dist/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz

Next, extract the downloaded archive with the following command:

$ tar xvf spark-3.0.3-bin-hadoop2.7.tgz

After that, move the extracted folder to the “/opt/” directory, renaming it to “spark”, with the command below:

$ sudo mv spark-3.0.3-bin-hadoop2.7/ /opt/spark
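
A quick way to check that the extract and move steps worked is to look for two of the scripts used later in this guide. This is only a sketch: the function name check_spark_home is invented for the example, and it assumes the usual layout in which spark-shell lives under bin/ and start-master.sh under sbin/.

```shell
# Return success only if DIR looks like an unpacked Spark distribution.
# (check_spark_home is a hypothetical helper name used only in this sketch.)
check_spark_home() {
    dir=$1
    [ -x "$dir/bin/spark-shell" ] && [ -x "$dir/sbin/start-master.sh" ]
}

if check_spark_home /opt/spark; then
    echo "Spark found in /opt/spark"
else
    echo "Spark not found in /opt/spark; check the extract and move steps"
fi
```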

With these steps complete, Apache Spark is downloaded and in place, but it will not work until you configure the Spark environment. The upcoming sections show how to configure and use Spark:

How to configure Spark environment

To do this, set a few environment variables in the configuration file “~/.profile”.

Open this file in your editor (nano in my case). The file belongs to your own user, so sudo is not needed:

$ nano ~/.profile

Add the following lines at the end of the file; once you are done, press “Ctrl+O” and then “Enter” to save the file (recent versions of nano also accept “Ctrl+S”), and “Ctrl+X” to exit:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3
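
If you would rather script this step than edit the file by hand, the sketch below appends each line only when it is not already present, so running it twice does not create duplicates. The names append_once and PROFILE are made up for this example; by default it writes to a scratch file so you can inspect the result before pointing it at the real “~/.profile”.

```shell
# Append LINE to FILE unless an identical line is already present.
# (append_once is a hypothetical helper name used only in this sketch.)
append_once() {
    line=$1
    file=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# PROFILE defaults to a scratch file so the sketch is safe to try;
# set PROFILE="$HOME/.profile" to edit the real file.
PROFILE=${PROFILE:-$(mktemp)}

# Single quotes keep $PATH and $SPARK_HOME unexpanded in the written lines.
append_once 'export SPARK_HOME=/opt/spark' "$PROFILE"
append_once 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' "$PROFILE"
append_once 'export PYSPARK_PYTHON=/usr/bin/python3' "$PROFILE"

cat "$PROFILE"
```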

Reload the file so that the changes take effect in the current shell:

$ source ~/.profile
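
To confirm that the variables took effect, you can print SPARK_HOME and check that both Spark directories are on PATH. The following is a small sketch of that check; path_contains is a name invented for the example.

```shell
# Succeed if DIR is one of the colon-separated entries of PATH.
# (path_contains is a hypothetical helper name used only in this sketch.)
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

echo "SPARK_HOME=${SPARK_HOME:-<unset>}"
for d in /opt/spark/bin /opt/spark/sbin; do
    if path_contains "$d"; then
        echo "$d is on PATH"
    else
        echo "$d is missing from PATH"
    fi
done
```

If either directory is reported as missing, recheck the lines added to “~/.profile” and reload the file.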

How to start standalone master server of Spark

Once the environment variables are set, you can start the standalone master server with the command below:

$ start-master.sh

Once the master has started, open its web interface by entering the following address in your browser's address bar (the interface is served over plain HTTP by default):

http://localhost:8080/

How to start slave/worker server of Spark

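The worker (slave) server can be started with the command below. Note that you need the URL of the master server to start a worker; it is shown at the top of the master's web interface. Here “adnan” is the hostname of my machine, so replace it with your own: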

$ start-slave.sh spark://adnan:7077
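
The master URL has the form spark://HOST:PORT. The sketch below assembles it from a host name and a port; master_url is a hypothetical helper, and its default host (the node name reported by uname -n) is an assumption that may differ from the name the master actually advertises. When in doubt, copy the URL from the master's web interface.

```shell
# Build a standalone master URL from a host name and port.
# Defaults: this machine's node name and 7077, the port used in this guide.
# (master_url is a hypothetical helper name used only in this sketch.)
master_url() {
    host=${1:-$(uname -n)}
    port=${2:-7077}
    printf 'spark://%s:%s\n' "$host" "$port"
}

master_url            # URL for a master on this machine (assumed host name)
master_url adnan      # the host name used in this guide
```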

Once the worker has started, open http://localhost:8080 again and you will see one worker listed in the “Workers” section. In my case the worker uses 1 core and 3.3 GB of RAM; by default a worker claims all available cores and all of the machine's memory minus 1 GB.

You can limit the number of cores a worker uses with the “-c” flag. For instance, the command below starts a worker with “0” processor cores (in practice you would pass a positive number, since a worker needs at least one core to run tasks):

$ start-slave.sh -c 0 spark://adnan:7077

You can see the changes by reloading the page (http://localhost:8080/).

Additionally, you can limit the memory of new workers with the “-m” flag. The command below starts a worker limited to 256 MB of memory:

$ start-slave.sh -m 256M spark://adnan:7077

The added worker with limited memory is visible in the web interface (http://localhost:8080/).
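
The “-c” and “-m” flags can also be given together. The sketch below only prints the resulting command after some basic sanity checks, so nothing is started until you run the printed line yourself. The function worker_cmd and its validation rules are assumptions made for this illustration, not part of Spark.

```shell
# Print (rather than run) a worker start command with both limits set.
# (worker_cmd is a hypothetical helper name used only in this sketch.)
worker_cmd() {
    cores=$1
    mem=$2
    url=$3
    # cores must be a non-empty string of digits
    case $cores in
        ''|*[!0-9]*) echo "cores must be a whole number" >&2; return 1 ;;
    esac
    # memory must be digits followed by M or G, e.g. 256M or 2G
    amount=${mem%[MG]}
    case $amount in
        "$mem"|''|*[!0-9]*) echo "memory must look like 256M or 2G" >&2; return 1 ;;
    esac
    printf 'start-slave.sh -c %s -m %s %s\n' "$cores" "$mem" "$url"
}

worker_cmd 2 512M spark://adnan:7077
```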

How to start/stop master and slave

You can start the master and the workers at once with the command below:

$ start-all.sh

Similarly, the command stated below will stop all instances at once:

$ stop-all.sh

To start only the master instance, use the following command:

$ start-master.sh

And to stop the running master:

$ stop-master.sh

How to run Spark Shell

Once the Spark environment is configured, you can launch the Spark shell with the command below; a successful launch also confirms that the installation works:

$ spark-shell
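
For a non-interactive check, you can feed a few lines of Scala to spark-shell on standard input. The sketch below writes a tiny script that builds an RDD from the numbers 1 to 100 and prints their sum, then runs it only if spark-shell is on PATH. The file name spark-smoke.scala and the variable SMOKE are arbitrary choices for this example; sc is the SparkContext that spark-shell creates at startup.

```shell
# Write a tiny Scala smoke test; the file name is arbitrary.
SMOKE=${TMPDIR:-/tmp}/spark-smoke.scala
cat > "$SMOKE" <<'EOF'
// sc is the SparkContext that spark-shell creates at startup
val rdd = sc.parallelize(1 to 100)
println("sum = " + rdd.sum())
EOF

# Feed the script to spark-shell only if the command is on PATH.
if command -v spark-shell >/dev/null 2>&1; then
    spark-shell < "$SMOKE"
else
    echo "spark-shell not on PATH; check ~/.profile"
fi
```

The shell exits on its own once it reaches the end of the input.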

How to run Python in Spark Shell

Spark also provides a Python shell. From your regular terminal prompt, run the following command to start it:

$ pyspark

Note: the above command will not work if you type it inside the Scala shell (Scala is the default language of spark-shell). Leave the Scala shell first by typing “:q” and pressing “Enter”, or by pressing “Ctrl+C”.

Conclusion

Apache Spark is an open-source unified analytics engine for big data processing. It comes with several libraries and is mostly used by data engineers and others who work with huge amounts of data. This article covered the installation of Apache Spark on Ubuntu and described the configuration of the Spark environment in detail. Starting workers with a limited number of cores and a specified amount of memory helps save resources while working with Spark.