How to Set Up a 3-Node Hadoop 3.x Cluster

This article is a step-by-step guide to installing and setting up a 3-node Hadoop 3.x cluster.

Prerequisites

Concepts

It is helpful, and recommended, to read about ‘NameNode’ and ‘DataNode’ before starting this guide. Though not mandatory, it is also good to have a basic understanding of the terms below:

  • Cluster
  • YARN
  • NodeManager
  • ResourceManager
  • SSH

System Requirements

  • Java JDK must be installed
  • User should have sufficient permissions to create folders/directories

Steps to Setup Hadoop 3.x Cluster

Assume that we are planning to set up a Hadoop 3.x cluster with the following three systems. You can create a cluster with any number of nodes; for simplicity we demonstrate a 3-node cluster.

forkedblog.hdfs.node.master.site (IP Address: 192.168.1.101)
forkedblog.hdfs.node.follower1.site (IP Address: 192.168.1.102)
forkedblog.hdfs.node.follower2.site (IP Address: 192.168.1.103)

Setup JAVA_HOME

First, verify that Java is installed on all nodes using the following command:

forkedblog$ java -version

You should see output similar to the below. If not, please install the Java JDK.

java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

Second, set up the JAVA_HOME variable on all nodes. You can export it in the terminal, but that is short-lived and does not apply to newly opened terminals. The recommended way is to add the export statement to the .bashrc file:

export JAVA_HOME=/usr/local/jdk1.8.0_181
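
For example, assuming the JDK is installed at /usr/local/jdk1.8.0_181 (adjust the path for your system), you can append the export to .bashrc and verify it like this:

forkedblog$ echo 'export JAVA_HOME=/usr/local/jdk1.8.0_181' >> ~/.bashrc
forkedblog$ echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc
forkedblog$ source ~/.bashrc
forkedblog$ echo $JAVA_HOME
/usr/local/jdk1.8.0_181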

Create User

Create a system user account on the master and all follower nodes to run the HDFS installation (run these commands as root or with sudo).

forkedblog$ useradd hdfs
forkedblog$ passwd hdfs
Changing password for user hdfs.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.

Configure hosts

For each node to communicate with the other nodes using host names, edit the /etc/hosts file on every node to add the IP addresses of the three servers, then save. Don’t forget to replace the sample IPs with your own:

forkedblog$ vi /etc/hosts
192.168.1.101 forkedblog.hdfs.node.master.site
192.168.1.102 forkedblog.hdfs.node.follower1.site
192.168.1.103 forkedblog.hdfs.node.follower2.site
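
To confirm that name resolution works, you can run a quick check from any node (the hostnames are the samples used in this guide; replace them with yours):

forkedblog$ ping -c 1 forkedblog.hdfs.node.follower1.site # Should resolve and reply
forkedblog$ ping -c 1 forkedblog.hdfs.node.follower2.site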

Setup Key Based Login

Hadoop follows a master/follower configuration. Out of the 3 nodes, select one to be the master. We need to ensure that the master node can communicate with the follower nodes without being asked to authenticate every time; this is achieved by sharing the master node’s public key with the follower nodes.

forkedblog$ su hdfs
hdfs$ ssh-keygen -t rsa # To generate RSA keys
hdfs$ ssh-copy-id -i ~/.ssh/id_rsa.pub hdfs@forkedblog.hdfs.node.master.site
hdfs$ ssh-copy-id -i ~/.ssh/id_rsa.pub hdfs@forkedblog.hdfs.node.follower1.site
hdfs$ ssh-copy-id -i ~/.ssh/id_rsa.pub hdfs@forkedblog.hdfs.node.follower2.site
hdfs$ chmod 0600 ~/.ssh/authorized_keys
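
To verify that key-based login works, connect from the master node to each node; it should log in and print the hostname without prompting for a password (hostnames are the samples used in this guide):

hdfs$ ssh hdfs@forkedblog.hdfs.node.follower1.site hostname
hdfs$ ssh hdfs@forkedblog.hdfs.node.follower2.site hostname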

Create Directories

The Namenode and Datanodes require directories to store data. We need three directories to be defined for HDFS:

  1. dfs.namenode.name.dir – Path on the local filesystem where the NameNode stores the namespace and transactions logs persistently.
  2. dfs.datanode.data.dir – Comma separated list of paths on the local filesystem of a DataNode where the DataNode should store its blocks.
  3. dfs.journalnode.edits.dir – Path on the local filesystem where the JournalNode daemon will store its local state

In Master Node:

forkedblog$ mkdir -p /data/namenode
forkedblog$ mkdir -p /data/datanode
forkedblog$ mkdir -p /data/journalnode
forkedblog$ chown -R hdfs:hdfs /data

In Follower Nodes:

forkedblog$ mkdir -p /data/datanode
forkedblog$ chown -R hdfs:hdfs /data
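
As a quick sanity check on each node, confirm that the directories exist and are owned by the hdfs user:

forkedblog$ ls -ld /data /data/* # The owner and group columns should show hdfs hdfs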

Download Hadoop

We are now done with the system setup. The next step is to download the Hadoop binaries from the Apache Hadoop website. The ‘wget’ command can be used to download the binaries when working from the CLI. Note that the binaries need to be downloaded on the master as well as on all follower nodes.

We install and run Hadoop from the path /opt.

forkedblog$ cd /opt
forkedblog$ wget http://apache.mirrors.lucidnetworks.net/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz

After downloading, extract the tar file, rename the extracted folder to hadoop, and change the ownership to hdfs.

forkedblog$ tar -xvf hadoop-3.2.0.tar.gz
forkedblog$ mv hadoop-3.2.0 hadoop
forkedblog$ chown -R hdfs:hdfs /opt/hadoop
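
Optionally, you may also want the ‘hdfs’ and ‘start-*.sh’ commands available from any directory. A minimal sketch, assuming the /opt/hadoop path used in this guide (append to the hdfs user’s .bashrc on every node):

export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin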

Update Configuration – Master

HDFS configuration is defined in core-site.xml and hdfs-site.xml. The other configuration files are not needed for this setup; their default configurations are enough. The core-site.xml, hdfs-site.xml, hadoop-env.sh and workers files are all located in ‘etc/hadoop’, which in our case resolves to the full path ‘/opt/hadoop/etc/hadoop’.

In hdfs-site.xml

Update the properties below (note: we created the corresponding directories in a previous step)

  1. dfs.namenode.name.dir
  2. dfs.datanode.data.dir
  3. dfs.journalnode.edits.dir
  4. dfs.replication (the number of replicas kept of each data block)

Open the file and edit it as below

<configuration>
   <property> 
      <name>dfs.namenode.name.dir</name> 
      <value>/data/namenode</value> 
      <final>true</final> 
   </property> 

   <property> 
      <name>dfs.datanode.data.dir</name> 
      <value>/data/datanode</value> 
      <final>true</final> 
   </property>

   <property> 
      <name>dfs.journalnode.edits.dir</name> 
      <value>/data/journalnode</value> 
      <final>true</final> 
   </property> 

   <property> 
      <name>dfs.replication</name> 
      <value>2</value> 
   </property> 
</configuration>

In core-site.xml

  1. fs.defaultFS – the default HDFS URL, hdfs://forkedblog.hdfs.node.master.site:9090
  2. dfs.permissions – set to false here to disable HDFS permission checking for this simple setup

Open core-site.xml and edit it as below

<configuration>
   <property> 
      <name>fs.defaultFS</name> 
      <value>hdfs://forkedblog.hdfs.node.master.site:9090/</value> 
   </property>

   <property> 
      <name>dfs.permissions</name> 
      <value>false</value> 
   </property> 
</configuration>

Update workers file

hdfs$ pwd
/opt/hadoop
hdfs$ vi etc/hadoop/workers
forkedblog.hdfs.node.follower1.site
forkedblog.hdfs.node.follower2.site
:wq! # To save and exit

You can also add the master node to this file so that it acts as a worker node, but that is not recommended when you expect heavier workloads.

Update hadoop-env.sh

Update the properties below in hadoop-env.sh with the username that should run each daemon (see the example after this list)

  1. export HDFS_NAMENODE_USER=hdfs
  2. export HDFS_DATANODE_USER=hdfs
  3. export HDFS_SECONDARYNAMENODE_USER=hdfs
  4. export YARN_RESOURCEMANAGER_USER=hdfs
  5. export YARN_NODEMANAGER_USER=hdfs
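
For example, assuming the installation path /opt/hadoop used in this guide, the lines can be appended to hadoop-env.sh in one step:

hdfs$ cat >> /opt/hadoop/etc/hadoop/hadoop-env.sh <<'EOF'
export HDFS_NAMENODE_USER=hdfs
export HDFS_DATANODE_USER=hdfs
export HDFS_SECONDARYNAMENODE_USER=hdfs
export YARN_RESOURCEMANAGER_USER=hdfs
export YARN_NODEMANAGER_USER=hdfs
EOF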

Update Configuration – All nodes

Update hadoop-env.sh

The JAVA_HOME property needs to be updated with the actual Java installation path (the JDK path) in the hadoop-env.sh file. This needs to be done on all nodes.

Find the line with ‘export JAVA_HOME=‘ and add the Java JDK path after the equals sign.
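
For example, with the sample JDK path used earlier in this guide (adjust it to your actual JDK location), the line would look like this:

export JAVA_HOME=/usr/local/jdk1.8.0_181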

Format NameNode – In Master node

Format the namenode on the master node using the command below. The ‘hdfs’ command is available inside the ‘bin’ folder of the Hadoop installation.

hdfs$ cd /opt/hadoop/bin
hdfs$ ./hdfs namenode -format

Start Hadoop Services

Start the Hadoop services using the command below from the ‘sbin’ folder on the master node.

hdfs$ pwd
/opt/hadoop/sbin
hdfs$ ./start-all.sh
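
To confirm that the daemons are running, you can use the ‘jps’ tool that ships with the JDK on each node (the process IDs in its output will differ):

hdfs$ jps # Lists running Java processes
# On the master node you should see NameNode, SecondaryNameNode and ResourceManager.
# On each follower node you should see DataNode and NodeManager (plus Jps itself).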

NameNode UI

We have not configured a port for the Namenode web server, so it will start on its default port. To find the port, grep the namenode log file on the master node.

hdfs$ pwd
/opt/hadoop/logs
hdfs$ grep "Starting Web-server for hdfs at" hadoop-hdfs-namenode-*.log
INFO org.apache.hadoop.hdfs.DFSUtil: Starting Web-server for hdfs at: http://0.0.0.0:9870

You can check this via URL – http://forkedblog.hdfs.node.master.site:9870

[Screenshot: NameNode home page – Hadoop 3.x]

Datanode UI

We have not configured a port for the Datanode web server, so it will start on its default port. To find the port, grep the datanode log file on any follower node.

hdfs$ pwd
/opt/hadoop/logs
hdfs$ grep "DatanodeHttpServer: Listening HTTP traffic" hadoop-hdfs-datanode-*.log
INFO org.apache.hadoop.hdfs.server.datanode.web.DatanodeHttpServer: Listening HTTP traffic on /0.0.0.0:9864

You can check this via URL – http://forkedblog.hdfs.node.follower1.site:9864 or http://forkedblog.hdfs.node.follower2.site:9864

[Screenshot: DataNode home page – Hadoop 3.x]
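
Alternatively, you can confirm from the master node that both DataNodes have registered with the NameNode by using the dfsadmin report (assuming the /opt/hadoop install path used in this guide):

hdfs$ /opt/hadoop/bin/hdfs dfsadmin -report

The report should list both follower nodes as live datanodes.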

That’s it, the Hadoop 3.x cluster setup is complete. Now you can move your files into HDFS storage using the ‘hdfs dfs -copyFromLocal’ command.
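
A minimal example, using a hypothetical local file /tmp/sample.txt and the /opt/hadoop path from this guide:

hdfs$ /opt/hadoop/bin/hdfs dfs -mkdir -p /user/hdfs # Create a home directory in HDFS
hdfs$ /opt/hadoop/bin/hdfs dfs -copyFromLocal /tmp/sample.txt /user/hdfs/
hdfs$ /opt/hadoop/bin/hdfs dfs -ls /user/hdfs # Verify the file was copied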

When we issue the start-all command, Hadoop by default also starts the SecondaryNameNode (on the master node), ResourceManager (on the master node) and NodeManager (on the worker nodes) services. As noted above, these services are started with their default configurations. Let’s learn how to configure these services in upcoming articles.

Thanks for reading. Please let me know your thoughts in comments section.
