How to Install and Configure Apache Hadoop on Ubuntu

Apache Hadoop is a free, open-source, Java-based software platform for storing and analyzing large datasets across clusters of machines. It keeps its data in the Hadoop Distributed File System (HDFS) and processes it with MapReduce. Hadoop is used for machine learning and data mining workloads, and it can coordinate work across multiple dedicated servers.

The primary components of Apache Hadoop are:

  • HDFS: A file system that is distributed over numerous nodes.
  • MapReduce: A framework for developing applications that process massive amounts of data.
  • Hadoop Common: A set of libraries and utilities needed by the other Hadoop modules.
  • Hadoop YARN: The resource management layer, which allocates cluster resources to applications.

Now, follow the steps below to install and configure Apache Hadoop on your Ubuntu system. So let’s start!

How to install Apache Hadoop on Ubuntu

First, open the Ubuntu terminal by pressing “CTRL+ALT+T”. You can also type “terminal” in the application search bar.

The next step is to update the system repositories:

$ sudo apt update

Now we will install Java on our Ubuntu system by writing out the following command in the terminal:

$ sudo apt install openjdk-11-jdk

Enter “y/Y” to permit the installation process to continue:

Now, verify the existence of the installed Java by checking its version:

$ java -version
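If you want a script to confirm the Java version rather than reading it by eye, the sketch below may help. It assumes the usual first line printed by “java -version” (for example: openjdk version "11.0.11" 2021-04-20); the function name “java_major_version” is our own and not part of any tool.

```shell
# Print the major Java version found in one line of `java -version` output.
# Covers the legacy scheme ("1.8.0_292" -> 8) and the current one ("11.0.11" -> 11).
java_major_version() (
    # Keep only the quoted version string, for example 11.0.11
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
    [ -n "$ver" ] || exit 1
    case "$ver" in
        1.*) ver=${ver#1.} ;;          # legacy scheme: drop the leading "1."
    esac
    printf '%s\n' "${ver%%[!0-9]*}"    # keep only the leading run of digits
)
```

Because “java -version” writes to standard error, a typical call would be: java_major_version "$(java -version 2>&1 | head -n 1)"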

We will create a separate user for running Apache Hadoop on our system by utilizing the “adduser” command:

$ sudo adduser hadoopuser

Enter the new user’s password, its full name, and other information. Type “y/Y” to confirm that the provided information is correct:

It’s time to switch from the current user to the newly created Hadoop user, which is “hadoopuser” in our case:

$ su - hadoopuser

Now, utilize the below-given command for generating private and public key pairs:

$ ssh-keygen -t rsa

Enter the file path where you want to save the key pair, or press Enter to accept the default. After this, set a passphrase for the Hadoop user’s key, or leave it empty if you want SSH logins to localhost to proceed without a prompt:

Next, append the public key to the SSH authorized_keys file:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

With the public key stored in the authorized_keys file, we will now change the file permissions to “640”. This means the “owner” of the file has read and write permissions, the “group” has read permission only, and “other users” have no permission:

$ chmod 640 ~/.ssh/authorized_keys
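With its default settings, the SSH server ignores an authorized_keys file that the group or other users can write to, so it can be worth confirming that the mode really changed. The following is a minimal sketch, assuming GNU “stat” as shipped with Ubuntu; the helper name “has_mode” is hypothetical.

```shell
# Succeed only if FILE ($1) has exactly the octal permission MODE ($2), e.g. 640.
# Assumes GNU stat (Ubuntu's default); `stat -c '%a'` prints the mode in octal.
has_mode() (
    actual=$(stat -c '%a' "$1") || exit 1
    [ "$actual" = "$2" ]
)
```

For example: has_mode ~/.ssh/authorized_keys 640 && echo "permissions look right"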

Now authenticate the localhost by writing out the following command:

$ ssh localhost

Use the wget command below to download the Hadoop release archive to your system:

$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
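Apache publishes a SHA-512 checksum alongside each release archive, so you may want to compare it with the file you downloaded before extracting it. The sketch below assumes the “sha512sum” tool from GNU coreutils; the function name “verify_sha512” is our own, and the expected digest has to be copied from the release page by hand.

```shell
# Succeed if the SHA-512 digest of FILE ($1) equals EXPECTED ($2).
# The comparison ignores letter case in the expected digest.
verify_sha512() (
    actual=$(sha512sum "$1" | cut -d ' ' -f 1)
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
)
```

A call would look like: verify_sha512 hadoop-3.3.0.tar.gz "digest-copied-from-the-release-page" && echo "checksum matches"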

Extract the downloaded “hadoop-3.3.0.tar.gz” file with the tar command:

$ tar -xvzf hadoop-3.3.0.tar.gz

You can also rename the extracted directory as we will do by executing the below-given command:

$ mv hadoop-3.3.0 hadoop

Now, configure Java environment variables for setting up Hadoop. For this, we will check out the location of our “JAVA_HOME” variable:

$ dirname $(dirname $(readlink -f $(which java)))
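The command above works from the inside out: “which java” finds the launcher, “readlink -f” follows its symbolic links to the real binary, and the two “dirname” calls strip the trailing “/bin/java”. The last step can be sketched on its own as follows; the function name “java_home_from_bin” is only illustrative.

```shell
# Given the resolved path of the java binary, print the directory two levels up,
# which is the value to use for JAVA_HOME.
java_home_from_bin() {
    dirname "$(dirname "$1")"
}
```

For a binary at /usr/lib/jvm/java-11-openjdk-amd64/bin/java, this prints /usr/lib/jvm/java-11-openjdk-amd64, which matches the path used in the next step.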

Open the “~/.bashrc” file in your “nano” text editor:

$ nano ~/.bashrc

Add the following paths in the opened “~/.bashrc” file:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/home/hadoopuser/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

After that, press “CTRL+O” to save the changes we made in the file, then “CTRL+X” to exit nano:

Now, write out the below-given command to load the new environment variables into the current shell:

$ source ~/.bashrc
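To confirm that the variables were picked up, you could check them one by one with “echo”, or use a small helper like the sketch below. It relies on “printenv”, which only sees exported variables (ours are all exported); the name “missing_vars” is hypothetical.

```shell
# Print the name of every listed environment variable that is unset or empty.
missing_vars() (
    for name in "$@"; do
        # printenv prints the value of an exported variable and fails if it is unset
        [ -n "$(printenv "$name")" ] || printf '%s\n' "$name"
    done
)
```

Running missing_vars JAVA_HOME HADOOP_HOME HADOOP_COMMON_LIB_NATIVE_DIR should print nothing when the “~/.bashrc” changes are active.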

The next thing we have to do is to open up the environment variable file of Hadoop:

$ nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

We have to set our “JAVA_HOME” variable in the Hadoop environment:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Again, press “CTRL+O” to save the file content.
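If you prefer not to edit the file by hand, the same line can be appended from the shell. The following is a minimal sketch that adds a line only when it is not already present, so running it twice does no harm; “ensure_line” is our own helper name.

```shell
# Append LINE ($2) to FILE ($1) unless an identical line is already there.
ensure_line() (
    file="$1"
    line="$2"
    # -q quiet, -x match the whole line, -F treat the text literally
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
)
```

For example: ensure_line "$HADOOP_HOME/etc/hadoop/hadoop-env.sh" 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64'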

How to configure Apache Hadoop on Ubuntu

Up to this point, we have installed Java and Hadoop, created a Hadoop user, and configured SSH key-based authentication. Now, we will show you how to configure Apache Hadoop on the Ubuntu system. The first step is to create two directories, namenode and datanode, inside the home directory of the Hadoop user:

$ mkdir -p ~/hadoopdata/hdfs/namenode

$ mkdir -p ~/hadoopdata/hdfs/datanode

We will update the Hadoop “core-site.xml” file by adding our hostname, so firstly, confirm your system hostname by executing this command:

$ hostname

Now, open up the “core-site.xml” file in your “nano” editor:

$ nano $HADOOP_HOME/etc/hadoop/core-site.xml

Our system hostname is “linuxhint-VBox”. Add the following lines, using your own system’s hostname, to the opened “core-site.xml” Hadoop file:

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://hadoop.linuxhint-VBox.com:9000</value>
        </property>
</configuration>

Press “CTRL+O” to save the file.
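Since the only part of this file that changes from machine to machine is the host name, the file can also be generated from the shell. The sketch below prints the same structure for a given host and port; the function name “core_site_xml” is hypothetical, and redirecting its output into “core-site.xml” overwrites whatever the file contained before.

```shell
# Print a core-site.xml body for the NameNode HOST ($1) and PORT ($2, default 9000).
core_site_xml() {
    cat <<EOF
<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://$1:${2:-9000}</value>
        </property>
</configuration>
EOF
}
```

For example: core_site_xml hadoop.linuxhint-VBox.com > "$HADOOP_HOME/etc/hadoop/core-site.xml"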

In the “hdfs-site.xml” file, we will set the replication factor and the directory paths of the “namenode” and “datanode”. Hadoop 3 names these two path properties “dfs.namenode.name.dir” and “dfs.datanode.data.dir”; the older names “dfs.name.dir” and “dfs.data.dir” are deprecated:

$ nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

<configuration>

        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>

        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:///home/hadoopuser/hadoopdata/hdfs/namenode</value>
        </property>

        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:///home/hadoopuser/hadoopdata/hdfs/datanode</value>
        </property>
</configuration>

Again, to write out the added code in the file, press “CTRL+O”.
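A typo in one of these files is easy to miss, so it can help to read a value back after saving. The following is a rough sketch that pulls one property out of a Hadoop configuration file with “awk”. It assumes each “name” and “value” element sits on a single line, as in the listings in this article; “get_property” is our own name for it.

```shell
# Print the value of property KEY ($2) from the Hadoop XML config FILE ($1).
# Assumes every <name>...</name> and <value>...</value> fits on one line.
get_property() {
    awk -v key="$2" '
        /<name>/ {
            n = $0
            gsub(/.*<name>|<\/name>.*/, "", n)     # keep only the text inside <name>
            found = (n == key)
        }
        /<value>/ && found {
            v = $0
            gsub(/.*<value>|<\/value>.*/, "", v)   # keep only the text inside <value>
            print v
            exit
        }
    ' "$1"
}
```

For example, get_property "$HADOOP_HOME/etc/hadoop/hdfs-site.xml" dfs.replication should print the replication factor that was just saved.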

Next, open up the “mapred-site.xml” file and add the below-given code in it:

$ nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
</configuration>

Press “CTRL+O” to save the changes you made into the file:

The last file that needs to be updated is the “yarn-site.xml”. Open this Hadoop file in the “nano” editor:

$ nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add the following lines to the “yarn-site.xml” file:

<configuration>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
</configuration>

We have to start the Hadoop cluster to operate Hadoop. For this, we will format our “namenode” first:

$ hdfs namenode -format
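Formatting is meant to be done once: running it again on a NameNode that already holds data wipes the file system metadata. A formatted NameNode directory contains a “current/VERSION” file, so a guard along the following lines can prevent an accidental second format. This is only a sketch, and “namenode_is_formatted” is a hypothetical helper name.

```shell
# Succeed if DIR ($1) already holds a formatted NameNode,
# which is recognisable by the current/VERSION file inside it.
namenode_is_formatted() {
    [ -f "$1/current/VERSION" ]
}
```

For example: namenode_is_formatted ~/hadoopdata/hdfs/namenode || hdfs namenode -format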

Now start the Hadoop cluster by writing out the below-given command in your terminal:

$ start-dfs.sh

While starting the Hadoop cluster, if you get the “Could not resolve hostname” error, then you have to specify the hostname in the “/etc/hosts” file:

$ sudo nano /etc/hosts
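Each entry in this file has the form of an IP address followed by one or more host names. To see whether the name used in “core-site.xml” is already listed, a check like the sketch below can be used; the function name “hosts_has_name” is our own.

```shell
# Succeed if NAME ($1) appears as a host name or alias in the hosts-format FILE ($2).
hosts_has_name() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                                       # drop comments
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }  # field 1 is the IP
        END { exit found ? 0 : 1 }
    ' "$2"
}
```

For example: hosts_has_name hadoop.linuxhint-VBox.com /etc/hosts || echo "hostname is not listed yet"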

Save the “/etc/hosts” file, and now you are all ready to start the Hadoop cluster:

$ start-dfs.sh

In the next step, we will start the “yarn” service of Hadoop:

$ start-yarn.sh

The execution of the above-given command will show you the following output:

To check the status of all services of Hadoop, execute the “jps” command in your terminal:

$ jps

The output shows that all services are running successfully:
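On a single-node setup like this one, “start-dfs.sh” and “start-yarn.sh” together should leave five daemons running: NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager. The sketch below reads the “jps” output and lists any of them that are absent; “missing_daemons” is a hypothetical helper name.

```shell
# Read `jps` output on stdin and print each expected Hadoop daemon that is absent.
missing_daemons() {
    awk '
        { seen[$2] = 1 }      # jps prints "<pid> <class name>" on each line
        END {
            n = split("NameNode DataNode SecondaryNameNode ResourceManager NodeManager", want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in seen)) print want[i]
        }
    '
}
```

Run it as: jps | missing_daemons. No output means that all five daemons were found. The match is made on the whole class name, so “SecondaryNameNode” is not mistaken for “NameNode”.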

Hadoop listens on ports 8088 and 9870, so you need to allow these ports through the firewall. The commands below use “firewall-cmd”, which belongs to firewalld and requires root privileges:

$ sudo firewall-cmd --permanent --add-port=9870/tcp

$ sudo firewall-cmd --permanent --add-port=8088/tcp

Now, reload the firewall settings:

$ sudo firewall-cmd --reload
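Before switching to the browser, it may be worth confirming that both web interfaces are actually listening. The following sketch filters the output of “ss -ltn” (listening TCP sockets, numeric ports) for a given port; the function name “port_listening” is our own.

```shell
# Read `ss -ltn` output on stdin; succeed if some socket is listening on PORT ($1).
port_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            n = split($4, part, ":")     # the local address column is "<addr>:<port>"
            if (part[n] == port) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}
```

For example: ss -ltn | port_listening 9870 && echo "NameNode web interface is listening"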

Now, open up your browser, and access your Hadoop “namenode” by entering your IP address with the port 9870:

Utilize the port “8088” with your IP address to access the Hadoop resource manager:

On the Hadoop web interface, you can look for the “Browse Directory” by scrolling down the opened web page as follows:

That was all about installing and configuring Apache Hadoop on the Ubuntu system. For stopping the Hadoop cluster, you have to stop the services of “yarn” and “namenode”:

$ stop-dfs.sh

$ stop-yarn.sh

Conclusion

Apache Hadoop is a freely available platform for managing, storing, and processing data for big data applications, and it operates on clustered servers. It provides a fault-tolerant distributed file system (HDFS) and allows parallel processing. In Hadoop, HDFS stores the data across the nodes, and the MapReduce model is used to process it. In this article, we have shown you the method for installing and configuring Apache Hadoop on your Ubuntu system.

About the author

Sharqa Hameed

I am a Linux enthusiast who loves to read every Linux blog on the internet. I hold a master's degree in computer science and am passionate about learning and teaching.