
Hadoop Installation and

Map-Reduce Programming

by
A K Chakravarthy
Assistant Professor
Department of Information Technology



Hadoop can be installed in your systems in three different modes:

Local Standalone mode
Pseudo distributed mode
Fully distributed mode



Local Standalone mode:

This is the default mode. In this mode, all the components of Hadoop, such as NameNode, DataNode, JobTracker and TaskTracker, run in a single Java process.



Pseudo-distributed mode:

In this mode, a separate JVM is spawned for each of the Hadoop components and they communicate across network sockets, effectively giving a fully functioning minicluster on a single host.



Fully Distributed mode:

In this mode, Hadoop is spread across multiple machines, some of which will be general-purpose workers and others will be dedicated hosts for components such as NameNode and JobTracker.



Environment Setup for Hadoop

Hadoop is supported on the GNU/Linux platform and its flavours (such as Ubuntu). Therefore, we have to install a Linux operating system for setting up the Hadoop environment.

In case you have an OS other than Linux, you can install VirtualBox and run Linux inside it.



• Before installation of VirtualBox, install the Microsoft Visual C++ Redistributable

• Download Ubuntu (ubuntu-16.04.7-desktop-amd64)

• Install Ubuntu using VirtualBox



I. Local Standalone mode



Step-1: Creating a User in Ubuntu:
At the beginning, it is recommended to create a separate user for Hadoop, to isolate the Hadoop file system from the Unix file system.

In addition to this, since we will need to prepare a cluster later, first create a group and then a user in that group.



Follow the steps given below to create a group and a user in that group:

$ clear
$ sudo addgroup aec_viper_group
$ sudo adduser --ingroup aec_viper_group aec_viper_user

The password is ‘aec’ in both cases.



$ sudo gedit /etc/sudoers

Add the following line (after %sudo ALL=(ALL:ALL) ALL):

%aec_viper_group ALL=(ALL:ALL) ALL



Change the user from the existing user to aec_viper_user:

LOGOUT of the present user

LOGIN with the newly created user
“aec_viper_user”

After this, the terminal prompt should start with
aec_viper_user@....... (and pwd should show /home/aec_viper_user)



Check whether Java is installed or not:
$ java -version

If not:
$ sudo apt-get install default-jre
$ sudo apt-get install default-jdk
(An Internet connection is a must.)



Step-2: SSH Setup and Key Generation in Ubuntu:

SSH setup is required to carry out different operations on a cluster, such as starting and stopping daemons and distributed shell operations. To authenticate different users of Hadoop, a public/private key pair is provided for the Hadoop user and shared with the other users.



$ ssh localhost
After typing this you should be able to connect to localhost and see output like the screenshot on the slide.
Otherwise, use the following commands:
$ sudo apt-get install openssh-server
$ sudo apt-get install vsftpd
(An Internet connection is a must.)

$ ssh localhost



The following commands are used to:
• generate a key pair using SSH,
• copy the public key from id_rsa.pub to authorized_keys, and
• give the owner read and write permissions on the authorized_keys file.

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
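
As a quick check (not in the original slides), ssh localhost should now log in without prompting for a password:

$ ssh localhost    # should no longer ask for a password
$ exit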



Step-3: Installing Java
All of you, type this command:
$ uname -i

This reports the machine architecture: you will get either x86_64, i686, or something else. It decides which Java path to use below (Java itself is already installed on your computer).



Then open a text editor (use the search in Ubuntu) and create a file abc.txt.

Those who got x86_64, type these two lines into abc.txt:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin

Those who got other than x86_64, type these two lines into abc.txt:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export PATH=$PATH:$JAVA_HOME/bin



Step-4: Download Hadoop

As Hadoop has already been downloaded on one computer in this lab, you need to copy that download onto your computer. Use the following commands to do this.

$ pwd    (you should be in the home directory of the Hadoop user)

$ wget ftp://172.168.10.168/Downloads/hadoop-2.7.2.tar.gz --user=lenova --password=nitw

By doing this, the downloaded Hadoop software, which is in tar form, will be copied into the home directory of the Hadoop user.



Step-5: Untar the Hadoop archive

To untar (like unzipping) the Hadoop archive in the home directory, use the following commands.

$ tar zxf hadoop-2.7.2.tar.gz

$ ls -lrt



Step-6: Setting the Hadoop Path

Just like the Java path, we now set the Hadoop path. Open the file abc.txt, in which you have already typed two lines, and add the following two lines:

export HADOOP_HOME=/home/aec_viper_user/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin



Step-7: Updating the bashrc file
To make the environment changes permanent, i.e. to have the Java path and the Hadoop path work correctly in all terminals, we have to save the changes to the bashrc file of the current user.
bashrc is a hidden file, so to open it we refer to it as .bashrc.
Use the following commands to do this.
$ nano .bashrc
Once .bashrc is open:
1. go to the end of the file,
2. paste the 4 lines from abc.txt into this .bashrc file,
3. save the .bashrc file.
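
If you prefer not to paste the lines by hand, the same result can be obtained by appending the scratch file directly; a minimal alternative, assuming abc.txt contains exactly the four export lines:

$ cat abc.txt >> ~/.bashrc    # append the Java and Hadoop export lines to .bashrc
$ tail -n 4 ~/.bashrc         # confirm the four lines are now at the end of the file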



Step-8: Now use the command

$ source ~/.bashrc

(This command is to refresh the terminal with updated bashrc)



Step-9: Verify that everything has been done properly, i.e. check that the path settings are reflected.

$ echo $JAVA_HOME

$ echo $HADOOP_HOME

$ echo $PATH    (It will show both the Java path and the Hadoop path)



Step-10: Now check whether Hadoop is working or not (just like java -version):
$ hadoop version

Now Hadoop in Local Standalone mode is ready.
Now we run a program on Hadoop from the examples already provided.

Let's have an input directory into which we push a few files; our requirement is to count the total number of words in those files. To calculate the total number of words, we do not need to write our own MapReduce program, because the provided .jar file contains an implementation of word count. You can try other examples using the same .jar file; just issue the following commands to check the MapReduce programs supported by the hadoop-mapreduce-examples-2.7.2.jar file.



Step-11: Copy the examples jar into the working directory.

$ cp $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar .
(The trailing dot is a must; it means “copy into the present working directory”.)

This command copies the examples jar into the present working directory, i.e. the home directory of the Hadoop user.

$ ls -lrt



Step-12: Now we will execute the wordcount program.

For this, we need an input directory from which the input will be taken by the program, and the output will be written to another directory.

So first create a directory named input_dir and copy some text files into this directory.

$ mkdir input_dir
$ cp $HADOOP_HOME/*.txt input_dir
$ cd input_dir
$ ls -lrt
$ cd ..    // to come back to the home directory
Step-13: Now, use the following command to execute the program:

$ hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount input_dir output_dir

Here, jar is the keyword (just like in java), hadoop-mapreduce-examples-2.7.2.jar is the jar file name in the current directory, wordcount is the program to be run, input_dir is the input directory and output_dir is the output directory.



Step-14: To see the output

$ cd output_dir
$ ls -lrt
$ cat part-r-00000
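
The part-r-00000 file lists each word and its count, separated by a tab. Purely for illustration (the actual words and counts depend on the .txt files copied into input_dir), the output looks something like:

Apache    14
Hadoop    23
License   7
...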



Assignment-1



II. Pseudo-distributed mode



Step-1: Setting Up Hadoop
You can set the Hadoop environment variables by appending the following lines to .bashrc and then saving it:

export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export PATH=$PATH:$HADOOP_HOME/sbin



Step-2: Now use the command

$ source ~/.bashrc

(This command is to refresh the terminal with updated bashrc)



Step-3: Hadoop Configuration

You can find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. It is required to make changes in those configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop
$ pwd



In order to develop Hadoop programs in Java, you have to reset the Java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

$ nano hadoop-env.sh

Write this statement at the end of the file.

Those who got x86_64 (uname -i), type:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Those who got other than x86_64, type:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386



You also need to configure the following files:
core-site.xml
hdfs-site.xml
yarn-site.xml
mapred-site.xml



Step-4: Configuring core-site.xml

The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers. Open core-site.xml and add the following properties between the <configuration> and </configuration> tags.

$ nano core-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>



Step-5: Configuring hdfs-site.xml:

$ nano hdfs-site.xml

The hdfs-site.xml file contains information such as the replication value, the namenode path and the datanode path on your local file system, i.e. the place where you want to store the Hadoop infrastructure.

Let us assume the following data:
dfs.replication (data replication value) = 1
namenode path = /home/aec_viper_user/hadoopinfra/hdfs/namenode
datanode path = /home/aec_viper_user/hadoopinfra/hdfs/datanode

(Here aec_viper_user is the user name; hadoopinfra/hdfs/namenode and hadoopinfra/hdfs/datanode are the directories created for the HDFS file system.)
<configuration>

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.name.dir</name>

<value>file:///home/aec_viper_user/hadoopinfra/hdfs/namenode</value>
</property>

<property>
<name>dfs.data.dir</name>

<value>file:///home/aec_viper_user/hadoopinfra/hdfs/datanode</value>
</property>

</configuration>



Step-6: Configuring yarn-site.xml

$ nano yarn-site.xml

This file is used to configure YARN for Hadoop. Open the yarn-site.xml file and add the following properties between the <configuration> and </configuration> tags in this file.

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>



Step-7: Configuring mapred-site.xml

This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of this file named mapred-site.xml.template. First of all, copy mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml

Then open mapred-site.xml and add the following properties between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Step-8: Verifying Hadoop Installation

Go to the home directory and format the NameNode.

$ cd

$ hdfs namenode -format



Step-9: Starting Hadoop dfs and yarn

Go to the home directory.

$ start-dfs.sh

$ start-yarn.sh



Step-10: Verifying the Hadoop daemons

$ jps
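
If the daemons have started correctly, jps typically lists entries like the following (the process IDs will differ on your machine):

2481 NameNode
2613 DataNode
2798 SecondaryNameNode
2964 ResourceManager
3090 NodeManager
3371 Jps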



Before proceeding further, you need to make sure that Hadoop is working fine. Just issue the following command:
$ hadoop version
If everything is fine with your setup, you should see the Hadoop version information printed.

This means your Hadoop pseudo-distributed mode setup is working. (Recall that, by default, Hadoop is configured to run in non-distributed mode on a single machine; the configuration above is what switches it to pseudo-distributed mode.)
Word-count Program Execution in the HDFS Environment

When you installed Hadoop in Standalone mode, the data used to run the program came from the local file system.

But now, in Pseudo-distributed mode, we will see how to put data into HDFS and get data out of HDFS, so that we get the feel of working with Hadoop storage.



$ hdfs dfs -mkdir hdfs://localhost:9000/acetinput
$ hdfs dfs -ls hdfs://localhost:9000/

$ hdfs dfs -put /../..file1.txt hdfs://localhost:9000/acetinput
$ hdfs dfs -put /../..file2.txt hdfs://localhost:9000/acetinput
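
To confirm that the files actually landed in HDFS, a quick check (not in the original slides):

$ hdfs dfs -ls hdfs://localhost:9000/acetinput    # should list file1.txt and file2.txt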



Step-13: Now, use the following command to execute the program:

$ hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount hdfs://localhost:9000/acetinput hdfs://localhost:9000/acetoutput

As before, jar is the keyword (just like in java), hadoop-mapreduce-examples-2.7.2.jar is the jar file name in the current directory, wordcount is the program to be run, hdfs://localhost:9000/acetinput is the input directory and hdfs://localhost:9000/acetoutput is the output directory.



Step-14: To see the output

$ hdfs dfs -cat hdfs://localhost:9000/acetoutput/part-r-00000



III. Fully-Distributed Mode



Prerequisites

Configuring the Pseudo-distributed mode of Hadoop on each machine.
(Here we have used six systems, each with Hadoop installed in Pseudo-distributed mode.)

Stop all the processes running in all the six systems by using the command:

$ stop-all.sh    (in all the six systems)



ALL THE SLAVE NODES MUST
REMAIN IDLE UNLESS
SPECIFIED!



Networking

Update /etc/hosts on all machines. Put the aliases against the IP addresses of all the machines. Here we are creating a cluster of 6 machines: one is the master, one is the secondary master, and the other 4 are slaves.

But I will be showing this on one system; afterwards, from this same system, I will access all the remaining 5 systems and update the data, i.e. the secondarymaster and the slave nodes need not do anything for the time being.
$ sudo gedit /etc/hosts
Add the following lines at the end of this file (for a six-node cluster):

(IP Address)      (hostname)   (alias)
192.168.192.104   selab104     master
192.168.192.105   selab105     secondarymaster
192.168.192.101   selab101     slave1
192.168.192.102   selab102     slave2
192.168.192.103   selab103     slave3
192.168.192.106   selab106     slave4

Note: To know the hostname of your computer: $ hostname
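
Once /etc/hosts has been updated, a quick way to confirm that the aliases resolve is to ping each one (an illustrative check, assuming the six entries above):

$ ping -c 1 master
$ ping -c 1 secondarymaster
$ ping -c 1 slave1      # repeat for slave2, slave3 and slave4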



$ sudo gedit /etc/hosts
Add the following lines at the end of this file (for a six-node cluster), and comment out the line carrying the system's own hostname, i.e. put a '#' symbol before it.
Note: To know the hostname of your computer: $ hostname

Fully Distributed Mode (/etc/hosts):
1. Comment out the host name of the system, i.e. put a '#' symbol before it:
127.0.0.1 localhost
# 127.0.1.1 selab114    (112/113/115/116/118)

2. In addition to the existing contents, add these lines at the end:
(IP Address)      (hostname)   (alias)
192.168.192.114   selab114     master
192.168.192.112   selab112     secondarymaster
192.168.192.113   selab113     slave1
192.168.192.115   selab115     slave2
192.168.192.116   selab116     slave3
192.168.192.118   selab118     slave4


Process to connect to the other 5 systems
SSH Access

The ‘nitw_viper_user’ user on the master must be able to connect:

1. To its own user account on the master: $ ssh master in this context.

2. To the ‘nitw_viper_user’ user account on the slaves via a password-less SSH login.

Add the master's public SSH key to each node using the following commands:

$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub nitw_viper_user@secondarymaster
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub nitw_viper_user@slave1
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub nitw_viper_user@slave2
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub nitw_viper_user@slave3
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub nitw_viper_user@slave4
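
To verify that the password-less login works to every node, a small loop such as the following can be used (a sketch, assuming the aliases defined in /etc/hosts):

$ for host in master secondarymaster slave1 slave2 slave3 slave4; do
      ssh nitw_viper_user@$host hostname    # each hostname should print without a password prompt
  done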



Process to connect to the other 5 systems…

Connect with user nitw_viper_user from the master to the user account nitw_viper_user on all the other nodes, in 6 different (Ubuntu) terminals:

1. From master to master:          nitw_viper_user@selab104:~$ ssh master
2. From master to secondarymaster: nitw_viper_user@selab104:~$ ssh secondarymaster
   which results in nitw_viper_user@selab105:~$
3. From master to slave1:          nitw_viper_user@selab104:~$ ssh slave1
   which results in nitw_viper_user@selab101:~$
4. From master to slave2:          nitw_viper_user@selab104:~$ ssh slave2
   which results in nitw_viper_user@selab102:~$
5. From master to slave3:          nitw_viper_user@selab104:~$ ssh slave3
   which results in nitw_viper_user@selab103:~$
6. From master to slave4:          nitw_viper_user@selab104:~$ ssh slave4
   which results in nitw_viper_user@selab106:~$


Screenshot to show all the nodes
accessed from the Master node



Now update the /etc/hosts of the secondarymaster and the slave nodes (directly from the master node).



Now we need to inform Hadoop of the master name:

Create a file named masters (this file does not exist as of now) in the /home/hadoop-2.7.2/etc/hadoop directory, and add the name of the secondarymaster in the masters file.

(This has to be done on all the systems: master + secondarymaster + slaves, but here I will be accessing all the systems from my own system and doing it.)

nitw_viper_user@selab104:~$ cd /home/hadoop-2.7.2/etc/hadoop
nitw_viper_user@selab104:~$ sudo gedit masters

Add the following line:
secondarymaster
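
Because the master already has password-less SSH to every node, the masters file can also be written on the remaining five systems from the master itself; a possible shortcut (a sketch, assuming the same Hadoop path on every node; sudo may prompt for each node's password):

$ for host in secondarymaster slave1 slave2 slave3 slave4; do
      ssh -t $host "echo secondarymaster | sudo tee /home/hadoop-2.7.2/etc/hadoop/masters"
  done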
Now, we have to do the same process on all the remaining 5 systems (I will be doing it without touching those 5 systems, i.e. through ‘ssh’, already connected):

nitw_viper_user@selab105:~$

nitw_viper_user@selab101:~$    nitw_viper_user@selab102:~$

nitw_viper_user@selab103:~$    nitw_viper_user@selab106:~$



Now we need to inform Hadoop of the slave names

Edit the file named slaves (this file already exists) in the /home/hadoop-2.7.2/etc/hadoop directory, and add the names of the slaves in the slaves file.

(This has to be done on all the systems: master + secondarymaster + slaves, but here I will be accessing all the systems from my own system and doing it.)



nitw_viper_user@selab104:~$ cd /home/hadoop-2.7.2/etc/hadoop
nitw_viper_user@selab104:~$ sudo gedit slaves
Add the following lines
slave1
slave2
slave3
slave4



Now, we have to do the same process on all the remaining 5 systems (I will be doing it without touching those 5 systems, i.e. through ‘ssh’, already connected):

nitw_viper_user@selab105:~$

nitw_viper_user@selab101:~$    nitw_viper_user@selab102:~$

nitw_viper_user@selab103:~$    nitw_viper_user@selab106:~$



Now, edit the ‘core-site.xml’ (all machines)

nitw_viper_user@selab104:~/home/hadoop-2.7.2/etc/hadoop$ sudo gedit core-site.xml

nitw_viper_user@selab105(101,102,103,106):~/home/hadoop-2.7.2/etc/hadoop$ sudo gedit core-site.xml

Change the fs.default.name parameter (in conf/core-site.xml), which specifies the NameNode (the HDFS master) host and port.

/home/hadoop-2.7.2/etc/hadoop/core-site.xml (ALL machines, i.e. master as well as slaves)

Pseudo Mode (core-site.xml):
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>

Fully Distributed Mode (core-site.xml):
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>



Now, edit the ‘hdfs-site.xml’ (all machines)

nitw_viper_user@selab104:~/home/hadoop-2.7.2/etc/hadoop$ sudo gedit hdfs-site.xml

nitw_viper_user@selab105(101,102,103,106):~/home/hadoop-2.7.2/etc/hadoop$ sudo gedit hdfs-site.xml

Change the dfs.replication parameter (in conf/hdfs-site.xml), which specifies the default block replication. We have 4 data nodes (the four slaves) available, so we set dfs.replication to 3. (You can have any number, but 3 is optimal.)

conf/hdfs-site.xml (ALL machines)

Pseudo Mode (hdfs-site.xml):
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/nitw_viper_user/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/nitw_viper_user/hadoopinfra/hdfs/datanode</value>
</property>

Fully Distributed Mode (hdfs-site.xml):
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/nitw_viper_user/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/nitw_viper_user/hadoopinfra/hdfs/datanode</value>
</property>
<property>
<name>dfs.secondary.http-address</name>
<value>selab123:50090</value>
<description>hostname:portnumber (the hostname of the secondarymaster)</description>
</property>






Now, edit the ‘yarn-site.xml’ (all machines)

nitw_viper_user@selab104:~/home/hadoop-2.7.2/etc/hadoop$ sudo gedit yarn-site.xml

nitw_viper_user@selab105(101,102,103,106):~/home/hadoop-2.7.2/etc/hadoop$ sudo gedit yarn-site.xml

Besides the mapreduce_shuffle auxiliary service already set in pseudo mode, the fully distributed configuration points the NodeManagers at the ResourceManager running on the master (its address, scheduler address and resource-tracker address), as shown below.
conf/yarn-site.xml (ALL machines)

Pseudo Mode (yarn-site.xml):
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

Fully Distributed Mode (yarn-site.xml):
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8031</value>
</property>



Now, edit the ‘mapred-site.xml’ (all machines)

nitw_viper_user@selab104:~/home/hadoop-2.7.2/etc/hadoop$ sudo gedit mapred-site.xml

nitw_viper_user@selab105(101,102,103,106):~/home/hadoop-2.7.2/etc/hadoop$ sudo gedit mapred-site.xml

The MapReduce framework stays set to yarn as in pseudo mode; in addition, the fully distributed configuration limits the number of map and reduce tasks per node and sets the child JVM heap sizes, as shown below.
Pseudo Mode (mapred-site.xml):
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

Fully Distributed Mode (mapred-site.xml):
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>6</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>6</value>
</property>
<property>
<name>mapred.map.child.java.opts</name>
<value>-Xmx512m</value>
</property>
<property>
<name>mapred.reduce.child.java.opts</name>
<value>-Xmx512m</value>
</property>
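
Instead of editing core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml by hand on every machine, the finished files can also be copied from the master to the other nodes; a sketch, assuming the Hadoop user can write to the configuration directory on every node:

$ cd /home/hadoop-2.7.2/etc/hadoop
$ for host in secondarymaster slave1 slave2 slave3 slave4; do
      scp core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml $host:/home/hadoop-2.7.2/etc/hadoop/
  done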



Formatting the HDFS filesystem via the NameNode
$ cd
$ hdfs namenode -format



Starting the multi-node cluster

The cluster is started by running the following commands on the master:

nitw_viper_user@selab104:~$ cd /home/hadoop-2.7.2

nitw_viper_user@selab104:/home/hadoop-2.7.2$ sbin/start-all.sh

To check the daemons that are running, run jps on the master and on the slaves.



HDFS Part (Storage):
• The NameNode daemon is started on the master,
• the SecondaryNameNode is started on the secondarymaster, and
• DataNode daemons are started on all the slaves.

YARN Part (Processing):
• The ResourceManager is started on the master, and
• NodeManager daemons are started on all the slaves.
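
With the cluster up, jps on each class of node typically shows something like the following (process IDs will differ):

nitw_viper_user@selab104:~$ jps     # master
4321 NameNode
4587 ResourceManager
4876 Jps

nitw_viper_user@selab105:~$ jps     # secondarymaster
3210 SecondaryNameNode
3388 Jps

nitw_viper_user@selab101:~$ jps     # slave1 (similarly on slave2, slave3, slave4)
2987 DataNode
3104 NodeManager
3255 Jps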



$ hdfs dfs -mkdir hdfs://master:9000/acetinput1
$ hdfs dfs -ls hdfs://master:9000/

$ hdfs dfs -put /../..file1.txt hdfs://master:9000/acetinput1
$ hdfs dfs -put /../..file2.txt hdfs://master:9000/acetinput1
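
The same word-count example can then be run against these HDFS paths (assuming the examples jar copied earlier is still in the home directory; the output directory name acetoutput1 is only an illustration):

$ hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount hdfs://master:9000/acetinput1 hdfs://master:9000/acetoutput1
$ hdfs dfs -cat hdfs://master:9000/acetoutput1/part-r-00000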



Map-Reduce Programming

by
Dr. U.S.N. Raju
Asst. Professor, Dept. of CS&E,
N.I.T. Warangal



OpenCV Installation

by
Dr. U.S.N. Raju
Asst. Professor, Dept. of CS&E,
N.I.T. Warangal



Copy the following code into a file named opencv.sh:

version="$(wget -q -O - http://sourceforge.net/projects/opencvlibrary/files/opencv-unix | egrep -m1 -o '\"[0-9](\.[0-9]+)+' | cut -c2-)"
echo "Installing OpenCV" $version
mkdir OpenCV
cd OpenCV
echo "Removing any pre-installed ffmpeg and x264"
sudo apt-get -qq remove ffmpeg x264 libx264-dev
echo "Installing Dependencies"
sudo apt-get -qq install libopencv-dev build-essential checkinstall cmake pkg-config yasm libjpeg-dev libjasper-dev libavcodec-dev libavformat-dev libswscale-dev libdc1394-22-dev libxine-dev libgstreamer0.10-dev libgstreamer-plugins-base0.10-dev libv4l-dev python-dev python-numpy libtbb-dev libqt4-dev libgtk2.0-dev libfaac-dev libmp3lame-dev libopencore-amrnb-dev libopencore-amrwb-dev
echo "Downloading OpenCV" $version
wget -O OpenCV-$version.zip http://sourceforge.net/projects/opencvlibrary/files/opencv-unix/$version/opencv-"$version".zip/download
echo "Installing OpenCV" $version
unzip OpenCV-$version.zip
cd opencv-$version
mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D WITH_TBB=ON -D BUILD_NEW_PYTHON_SUPPORT=ON -D WITH_V4L=ON -D INSTALL_C_EXAMPLES=ON -D INSTALL_PYTHON_EXAMPLES=ON -D BUILD_EXAMPLES=ON -D WITH_QT=ON -D WITH_OPENGL=ON ..
make -j2
sudo checkinstall
sudo sh -c 'echo "/usr/local/lib" > /etc/ld.so.conf.d/opencv.conf'
sudo ldconfig
echo "OpenCV" $version "ready to be used"



Then change the permissions of the file with
$ chmod 777 opencv.sh
Then execute the script file: $ ./opencv.sh
Now open the terminal and type $ python
Then, at the Python prompt, enter:
>>> import cv2
If this gives an error, as shown below,



perform the steps as shown in the image below.
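
Once the import succeeds, one quick way to confirm which OpenCV version was installed (a minimal check, not from the original slides):

$ python -c "import cv2; print(cv2.__version__)"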



WebHDFS REST API

We use the REST API of HDFS, known as WebHDFS, for handling images and videos in Hadoop. To enable it, add the following property to the hdfs-site.xml file:

Configuring hdfs-site.xml

<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>



Creating a Directory in HDFS

To create a directory named “input” in HDFS, use the following command after starting Hadoop:

$ hdfs dfs -mkdir /user/nitw_cvhd_user2/input

Copying an image into HDFS

To copy an image named “input_image.jpg” into the “input” directory in HDFS, use the following command:

$ hdfs dfs -copyFromLocal <image_path> /user/nitw_cvhd_user2/input/input_image.jpg

Opening an image in HDFS using WebHDFS

An image named “input_image.jpg” which is in the directory named “input” in HDFS can now be opened using the following URL:

http://localhost:50070/webhdfs/v1/user/nitw_cvhd_user2/input/input_image.jpg?op=OPEN
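
The same OPEN operation can also be exercised from the command line with curl; -L is needed because WebHDFS first redirects the client to a DataNode (illustrative, assuming the default NameNode HTTP port 50070):

$ curl -L "http://localhost:50070/webhdfs/v1/user/nitw_cvhd_user2/input/input_image.jpg?op=OPEN" -o input_image.jpg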
Creating 8 Bitmap images for an image in HDFS

The file named bitmap_hdfs.py has code for calculating the 8 bitmap images of a given image, where the input image is taken from HDFS and the output images are written to the local file system.

The following image shows the contents of the input directory.



Creating 8 Bitmap images for an image in HDFS

The following images show the contents of the output directory before and after putting the output images into HDFS.



Thank you …

