How to Install Hadoop on Ubuntu 20.04 – Step by Step Guide

Hadoop is a popular open-source framework that enables the distributed processing of large data sets across clusters of computers. If you’re working with big data, learning how to set up Hadoop can be invaluable. This guide will walk you through how to install Hadoop on Ubuntu 20.04, with detailed instructions and practical tips for a seamless setup.

Prerequisites

Before you begin, ensure that your system meets the following requirements:

  1. Operating System: Ubuntu 20.04 (64-bit).
  2. RAM: At least 4 GB for basic setups (8 GB or more recommended).
  3. Java: Hadoop requires Java Development Kit (JDK) 8 or later.
  4. Internet Connection: For downloading the necessary packages.

It’s also helpful to have some experience working in a Linux environment, especially if you are running on a VPS.
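If you want to double-check the basics before you start, the following commands show your Ubuntu release, available memory, and whether a Java runtime is already installed (output varies by machine):

lsb_release -d
free -h
java -version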

Step 1: Update Your System

Begin by updating your system packages to ensure everything is up to date. Open your terminal and run:

sudo apt update && sudo apt upgrade -y

This helps you avoid dependency problems during installation, including the common apt errors covered in the Troubleshooting section below.

Step 2: Install Java Development Kit (JDK)

Hadoop is written in Java, so installing JDK is a prerequisite.

Install OpenJDK 11:

Execute the following command to install OpenJDK 11:

sudo apt install openjdk-11-jdk -y

Verify Java Installation:

Check the Java version to confirm successful installation:

java -version

The output should display the Java version, for example: openjdk version "11.0.x".

Step 3: Download Hadoop

Download the latest stable version of Hadoop from the Apache Hadoop Official Website.

Example Command:

wget https://downloads.apache.org/hadoop/common/hadoop-x.x.x/hadoop-x.x.x.tar.gz

Replace x.x.x with the desired version number. At the time of writing, the latest version is 3.3.6.
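For example, assuming you want 3.3.6 and that it is still available on the main download mirror (older releases are moved to archive.apache.org), the command would look like this:

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz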

Step 4: Extract and Move Hadoop Files

Once downloaded, extract the tarball and move it to an appropriate directory, such as /usr/local/hadoop.

tar -xzvf hadoop-x.x.x.tar.gz

sudo mv hadoop-x.x.x /usr/local/hadoop
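Optionally, if you plan to run Hadoop as your regular (non-root) user on a single-node setup, it can help to give that user ownership of the installation directory; this is an extra convenience step, not something the steps above strictly require:

sudo chown -R $USER:$USER /usr/local/hadoop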

Step 5: Configure Environment Variables

You need to configure Hadoop’s environment variables by editing the ~/.bashrc file.

Add the Following Lines:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Apply the Changes:

source ~/.bashrc
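To confirm that the new PATH works, you can run Hadoop's built-in version command:

hadoop version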

Step 6: Configure Hadoop Core Files

Hadoop requires certain XML configuration files to define its core properties. Edit the following files within the $HADOOP_HOME/etc/hadoop directory:

1. core-site.xml

<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

2. hdfs-site.xml

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
</configuration>

3. mapred-site.xml

This file defines the MapReduce properties. In Hadoop 3.x it ships with the distribution; on older 2.x releases, where only a template exists, create it first:

cp mapred-site.xml.template mapred-site.xml

Add the following:

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

4. yarn-site.xml

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>
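One more file in the same directory matters for startup: $HADOOP_HOME/etc/hadoop/hadoop-env.sh. The start-up scripts may fail with a "JAVA_HOME is not set" error if it is not defined there. Assuming the OpenJDK 11 package from Step 2 (the exact path may differ on your system), add a line such as:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64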

Step 7: Format the NameNode

Before starting Hadoop, format the NameNode to initialize HDFS.

hdfs namenode -format

Step 8: Start Hadoop Services

Start the HDFS and YARN daemons with the provided scripts:

start-dfs.sh

start-yarn.sh

Verify that Hadoop is running by accessing its web interfaces:

  1. NameNode: http://localhost:9870
  2. ResourceManager: http://localhost:8088

Step 9: Verify Hadoop Installation

To confirm everything is functioning correctly, check the Java processes running for Hadoop:

jps

The output should display services like NameNode, DataNode, ResourceManager, and NodeManager.
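On a healthy single-node setup, the listing typically looks something like the sketch below; the process IDs are placeholders and will differ on your machine:

12345 NameNode
12456 DataNode
12567 SecondaryNameNode
12678 ResourceManager
12789 NodeManager
12890 Jps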

Troubleshooting Common Errors

1. apt errors:

If you encounter apt errors while installing Java or other packages, ensure that your /etc/apt/sources.list file is correctly configured and that the required repositories are enabled.
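If the sources look fine and the error persists, a few generic package-repair commands are often enough; treat them as a general troubleshooting sketch rather than a guaranteed fix:

sudo apt clean
sudo apt update
sudo apt --fix-broken install
sudo dpkg --configure -a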

2. Java not detected:

Double-check your environment variables and ensure Java is installed correctly.
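Two quick checks usually narrow this down, showing which Java binary is actually being resolved and what JAVA_HOME currently points to:

readlink -f $(which java)
echo $JAVA_HOME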

Conclusion

This guide covered how to install Hadoop on Ubuntu 20.04 step by step. By following these instructions, you should now have a fully functional Hadoop setup ready to handle distributed processing tasks. Whether you’re working on a Linux VPS or your local machine, these steps ensure a smooth setup process. For remote management of your server, consider DashRDP, a web-based RDP (Remote Desktop Protocol) solution. DashRDP simplifies accessing your server from anywhere, making it easier to manage and monitor your Hadoop environment efficiently.


FAQs

Can I install Hadoop without configuring SSH?

No, Hadoop requires SSH to be configured for inter-node communication. This is necessary for passwordless login between the master and worker nodes in a Hadoop cluster.
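For a single-node setup, passwordless SSH to localhost is sufficient. A minimal sketch, assuming OpenSSH is not yet installed and an RSA key with an empty passphrase is acceptable in your environment:

sudo apt install openssh-server openssh-client -y
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost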

What is Hadoop and why do I need it?

Hadoop is an open-source framework that allows for the distributed storage and processing of large data sets. It uses the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming model for processing. You need it if you’re working with big data, as it can scale across many machines to process and store data more efficiently than traditional systems.

How do I add more nodes to the Hadoop cluster?

To add more nodes, update the workers file (named slaves in Hadoop 2.x) located in $HADOOP_HOME/etc/hadoop/ with the hostnames or IP addresses of the new nodes. Then copy the Hadoop configuration files to the new nodes, start the HDFS and YARN services on them, and confirm that the NameNode recognizes the new DataNodes.
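For illustration, the workers file is just a plain list with one node per line; the hostnames below are hypothetical placeholders for your own machines:

worker1.example.com
worker2.example.com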