Installation and Configuration System Tool For Hadoop
Micro-Project Report on
"Installation and configuration system/tool for Hadoop"
By
1) Magar Vikram Abasaheb [2114660125]
2) Lahakar Vinay Balasaheb [2114660146]
3) Adhav Rushikesh Sunil [2114660127]
4) Jadhav Hrutik Sanjay [2114660120]
Guided By
Prof. DARODE S.S
CERTIFICATE
This is to certify that the project work entitled
"Installation and configuration system/tool for Hadoop"
submitted by
1) Magar Vikram Abasaheb [2114660125]
2) Lahakar Vinay Balasaheb [2114660146]
3) Adhav Rushikesh Sunil [2114660127]
4) Jadhav Hrutik Sanjay [2114660120]
in partial fulfillment of the Diploma in Computer Engineering has been satisfactorily carried out under my
guidance, as per the requirements of the Maharashtra State Board of Technical Education, Mumbai, during the
academic year 2022-2023.
Date:
Place:
An endeavor over a long period can be successful only with the advice and
guidance of many well-wishers.
We would also like to thank the staff of the Mechanical Department for their
generous guidance.
Last but not least, we would like to thank our friends for their help in every
way towards the success of this project report.
1. Introduction
Big Data is definitely one of the most important terms nowadays, and it will remain so as technology
develops. As the term refers to very large collections of data sets, one immediately thinks about the
different possible ways to manage and process this Big Data.[8] Research on the topic has revealed
several issues related to the handling of such large and complex data sets. Indeed, many data processing
systems have proven to have limitations and to be unable to handle Big Data, including standard
database management systems, relational database management systems and object-relational
database management systems. [8]
Big Data is currently the subject of many studies because of its significance. Indeed,
very important areas, such as large companies and research centers, are collecting huge amounts of data.
These companies and research centers, in addition to many others, play a significant
role in our daily lives, which makes it necessary to find a way for the data they produce
to be structured and stored safely and effectively. Therefore, this capstone delivers
an installation and configuration tool for Hadoop. Hadoop is a very helpful technology enabler
for parallel processing platforms that use the MapReduce algorithm.
The idea behind my capstone project is to develop a Linux-based application to help
Hadoop users install and configure the framework without having to deal with the overhead of
modifying many files and running many commands in the terminal.
1.1 Project Purpose
This capstone project, called "Installation configuration system/tool for Hadoop", is in
partial fulfillment of the requirements of the Bachelor of Science in Computer Science at Al
Akhawayn University. The main purpose behind this project is to test and ensure that I am a
ready engineer who can design and develop sophisticated and complex systems based on all the
competencies I acquired throughout my journey at the university. This project is the result of
my work during this whole semester under the supervision of Dr. Nasser Assem.
1.2 Motivation
My capstone project is an extension of the research project that I performed last
semester, which consisted of analyzing the performance of HDFS within a cluster using a
benchmark. The benchmark measures the execution time of creating a large sample of random
data, as well as the time needed to delete it using a cluster.
Before going through this process, I had to install Hadoop on my machines. Installing
the framework took several days because of the many errors I faced and needed to fix.
The other issue was the shortage of resources and documentation available while
trying to install the framework.
Concerning the societal implications of my application, it will help Big Data users and
developers, especially Hadoop users, to get the framework configured in less time than normally
needed. In fact, a novice user of Hadoop would need several days to get it
installed and configured; this application helps them perform the task in 2 to 5
minutes.
Technical considerations
The application respects this aspect, as it does not involve complex
technologies that the user cannot deal with. Moreover, the user does not need prior
technical knowledge of Linux technologies in order to interact with the system.
Environmental considerations
My application does not have a direct relation to this kind of consideration. Indirectly,
however, Big Data is used in many environmental sectors such as smart grids.
Ethical considerations
The subject of my application is ethical, and the research materials that I relied on are
all cited in order to respect engineering ethics. Moreover, I will make sure to contact the Hadoop
project to allow me to share this application with other people in a legal manner.
Political & legal consideration
Actually, my application does not have a relationship with any political or legal sides of our
society.
Economic
My application is free for every user and will be shared on Ubuntu software center.
2. Theoretical Baselines
These organizations may face hundreds of gigabytes or terabytes of data and need to
consider the processing time and data management options. Big Data is characterized by the
3Vs: Volume, Velocity and Variety.
Figure 2.1.1: Big Data 3Vs
Big Data analytics is closely tied to cloud computing environments because the analysis of a
large set of data in real time requires a platform such as Hadoop in order to process data
across distributed clusters using MapReduce.[8]
2.3 Hadoop
Margaret Rouse defines Hadoop as a free, Java-based programming framework that
supports the processing of large data sets in a distributed computing environment [9]. The Apache
Software Foundation is the main sponsor of this project.
2.4 Hadoop Distributed File System
HDFS is a distributed file system that offers high-performance access to data across
clusters. It has become a key tool for managing Big Data analytics programs. The file system
is designed to be fault-tolerant: HDFS facilitates data transfer and enables the system to keep
running even if a node fails, which decreases the risk of overall
failure.[8]
HDFS breaks data down and distributes it across the nodes of the cluster,
allowing for parallel processing. Moreover, the data is copied several times and each copy is
placed in at least two different server racks. Hence, if the system fails to find data in one node, it
can be retrieved from a different rack so that data processing continues while the system
recovers from the failure. HDFS is built on a Master and Slave
architecture within a cluster. [8]
Moreover, the Namenode receives a periodic Heartbeat and a Blockreport from each
Datanode within the cluster. Receipt of a Heartbeat indicates that the Datanode is in good health,
while a Blockreport contains a list of all blocks on that Datanode.[10]
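As an aside (this is a general Hadoop 1.x administration command, not one taken from the report), the Datanodes that are currently reporting to the Namenode can be listed once the cluster is running with:
hadoop dfsadmin -report
The output shows, for each Datanode, its capacity, its usage and the time of its last contact with the Namenode.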
As shown in Figure 2.4.1, the NameNode manages the file system operations,
whereas the DataNode is responsible for managing data storage on each node.
2.5 MapReduce
MapReduce is a programming pattern that was designed by Google in
2004. The idea behind MapReduce is to split data into chunks that will be processed in
parallel. The output of the mapping process is directed as input to the Reduce part so that it can
be gathered at the end.[8]
The Map task consists of five phases: reading, mapping, collecting, spilling
and merging. The reading phase entails reading the data from HDFS and creating the key-value
pairs. The mapping phase involves executing the map function in order to generate the map-
output data. During the collection phase, the map-output data is put in a buffer before
spilling. In the spilling phase, data is compressed and written to local disk. At the end of the
process, the merging phase integrates all spill files into a single output file on each node.[8]
The Reduce task comprises four further phases: shuffling, merging, reducing and writing.
During the shuffling phase, the map-output data is transferred to the reduce node and
decompressed. The merging phase assembles the outputs coming from
different mappers before the reduce phase. The reducing phase calls a reduce function to return the
final output. The last step consists of writing the output data back to HDFS. [8]
MapReduce allows developers to use library routines to create programs that need a
parallel processing framework without having to worry about intra-cluster communication and
failure handling, since the framework is also fault-tolerant.[8]
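To make this concrete, a common way to try MapReduce on the Hadoop 1.2.1 release used in this report is the word-count example shipped with the distribution. The local file input.txt and the HDFS directories input and output below are illustrative names, not taken from the report:
hadoop fs -mkdir input
hadoop fs -put input.txt input
hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount input output
hadoop fs -cat output/part-*
The first two commands copy a local file into HDFS, the third runs the Map and Reduce phases described above, and the last prints the aggregated word counts.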
3. Methodology
3.1 Process Model
Deciding which development model to follow in this project was not difficult, as I knew
exactly what the output of my project would be at the end of the semester, given the constraints
of time, implementation difficulty and other factors. I decided to use the Waterfall
model for several reasons. I already had my list of requirements finalized and I was sure that
I would not have to iterate on the requirements. Finally, the Waterfall model is suitable for developing
my application because it allows me to keep each step of the development clear and separate
from the others.
Figure 3.1.1: Waterfall model
4. Functional Requirements
The system "Installation configuration system/tool for Hadoop" deals with two main
parts. In the first part, the application installs the major components of Hadoop. In the second part, the
application configures the required files and checks whether all processes are running
successfully.
5. Non-Functional Requirements
The program should be:
This window will guide the user towards better use of the application, and it also lists the
prerequisites needed to start installing the program, including a 64-bit machine.
User friendly:
The application consists of dialogs, and the user only needs to click the Next button
until Hadoop is installed.
Figure 4.2.3 shows the steps of downloading Hadoop from the server. The system
then extracts the file inside the Hadoop-tool folder. The last step is to give the user ownership of
the folder so that changes can be made to it, including creating a temporary file inside it.
Figure 4.2.3: Sequence diagram for installing the Hadoop
As shown in Figure 4.2.4, the last step before running Hadoop consists of formatting the
Namenode. After starting the cluster, the user needs to call jps to check whether all processes are
running successfully.
Figure 4.2.4: Sequence diagram for running the Hadoop
Table History:
This table allows the user to track the date and time at which the installation of
Hadoop was started. Using this table, the user can continue installing the program from the point
where it was stopped before being aborted.
Table Record:
This table allows the user to keep track of the operations that were performed before aborting the
installation.
Before starting to work on my project, I had to find an environment that would allow me
to create a Linux application in an easy manner. After many investigations and extensive
research, I chose to work with "Quickly". It is a Python-based development
environment that allows the creation, packaging and sharing of Linux applications on the Ubuntu
Software Center. I used MySQL Server to manage my database, from the creation of tables
to the different operations and routines that I needed during the implementation.
Python supports multiple programming paradigms and ships with useful facilities such as
regular expressions.
MySQL Server is the second most widely used relational database management system.
Once I had a clear and detailed scenario of what was supposed to be done in the back office
of my application, and after choosing the right development model (the Waterfall
model) and the different technology enablers to be used, the next step
was the implementation of the application. This part of the development
was the most exhausting, as it took me three months of effort and research to finalize the
implementation.
After many long nights of work and design, the results I was hoping to achieve were
finally visible. In the end, I found that all the functional and non-functional
requirements that I drew up at the beginning of the project were met within the period
agreed upon.
8. Future work
This application is just the beginning and the first generation of the "Installation
configuration system/tool for Hadoop". I was able to meet all the functional and non-
functional requirements of the project. In the future, I am planning to make this application
available for every type of machine, including both 32-bit and 64-bit architectures. At the moment
the tool works only for a single-node cluster, so I will also work on making the system
suitable for multi-node clusters.
Moreover, I can extend this application to include the installation of Hive or any similar
framework belonging to Apache. As already mentioned, I will try to contact the Hadoop maintainers
to allow me to share this application legally and possibly get some support from them.
In addition, I will make the interface of my application more colorful. I tried to add
images to the interface, but it does not work for the time being. I will do more reading about
"Quickly" to get this done during the upcoming months.
All these suggestions are ones that I am seriously considering working on in the future,
as the time available during the semester did not allow me to work on all these aspects. However, as I am
very interested in this idea, I will keep up the work to turn all these ideas into reality.
9. Conclusion
At the end of this report, I find that developing this application was a successful experience
for me, since it allowed me to acquire different skills related to both personal and professional fields.
Personally, I expanded my knowledge about Hadoop, its components and the way it works.
First, I mastered the long process required to install Hadoop on a Linux machine. I also learned how to
find the best research methodology.
Professionally, I had the chance to work with a professional supervisor who took into
consideration and respected my own ideas and way of tackling problems. During the last three
months, I challenged myself to reach my goal and to manage the stress of the work. In addition, I
experienced new techniques of implementation and design, and I was able to move out of my comfort
zone of working on small applications to a more professional level of development.
10. Appendix
10.1 Prerequisites
- Installing Sun Java version 7
As shown in Figure 1, the first prerequisite for installing Hadoop is to get JDK
version 7 installed, which improves performance, scalability and administration according
to the Oracle website. To download and install version 7 of the JDK, I used the following command:
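The command itself is not reproduced in this copy of the report. A plausible equivalent on Ubuntu, assuming the OpenJDK 7 package (which matches the JAVA_HOME path used later in this appendix) rather than the Sun/Oracle packages, would be:
sudo apt-get install openjdk-7-jdk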
Depending on the machine, the system installs either the normal 32-bit version or the amd64 version.
Figure 1: JDK being installed
- Configuring OpenSSH
Figure 2: SSH server being installed
As shown in Figure 3, the OpenSSH client component is being installed. To install the
OpenSSH client applications on an Ubuntu system, I used the following command:
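Here as well, the exact command is not reproduced; the usual Ubuntu packages (the server shown in Figure 2 and the client shown in Figure 3) would be installed with:
sudo apt-get install openssh-server
sudo apt-get install openssh-client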
An SSH key permits authentication between two hosts without the need for a password. SSH
key authentication uses both a private key and a public key.
To generate the keys, I used the following command:
ssh-keygen -t rsa
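The report does not show the next step, but for passwordless login on a single-node setup the generated public key is typically appended to the local authorized keys file, for example:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys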
Figure 4: SSH keygen being generated
10.2 Hadoop
- Installation
o Download
As shown in Figure 5, I adopted version 1.2.1 of Hadoop since it is known as a stable version
of this framework. The mirror link is used for downloading it as follows:
wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
o Extraction
As shown in Figure 6, the Hadoop tar file is extracted inside the Hadoop-tool folder. This
operation takes around 60 seconds to complete. To extract the file, I used the following
command:
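The command is not reproduced in this copy of the report; a plausible form, assuming the archive was downloaded as shown above and that the Hadoop-tool folder already exists in the home directory, would be:
tar -xzf hadoop-1.2.1-bin.tar.gz -C ~/Hadoop-tool/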
- Update ~/.bashrc
According to the Linux website, “The shell program /bin/bash (hereafter referred to as just
"the shell") uses a collection of startup files to help create an environment. Each file has a
specific use and may affect login and interactive environments differently. The files in
the /etc directory generally provide global settings. If an equivalent file exists in your home
directory it may override the global settings.”
As shown in Figure 7, I added the following lines to the end of the .bashrc file so that the
environment recognizes the Hadoop and Java binaries.
export HADOOP_HOME=~/Hadoop-tool/Hadoop-1.2.1
export JAVA_HOME=/usr/lib/jvm/openjdk-7-jdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin
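The report does not mention it, but these variables only take effect in newly opened shells; they can be applied to the current shell with:
source ~/.bashrc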
- Configuration
hadoop-env.sh
This configuration consists of defining the environment variable for the JDK by updating the
corresponding line to:
export JAVA_HOME=/usr/lib/jvm/openjdk-7-jdk-amd64
core-site.xml
This part starts with the creation of a temporary directory inside the Hadoop folder and giving the user
the right permissions and ownership; a sketch of the corresponding commands is given below.
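These commands are not reproduced in the report. A plausible sequence, assuming the folder layout used elsewhere in this appendix and a tmp sub-directory chosen purely for illustration, would be:
mkdir -p ~/Hadoop-tool/Hadoop-1.2.1/tmp
sudo chown -R $USER:$USER ~/Hadoop-tool/Hadoop-1.2.1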
It then defines the path of the Hadoop folder by adding the following lines
inside the configuration tag:
<property>
<name>hadoop.tmp.dir</name>
<value>~/Hadoop-tool/Hadoop-1.2.1</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
After making all these modifications, the application shows the message presented in
Figure 8:
Figure 8: Configuration of Hadoop files
Before starting a Hadoop cluster, it is necessary to format the Namenode using the following
command.
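The command does not appear in this copy of the report; for Hadoop 1.2.1 installed under the path used above, it takes the standard form:
~/Hadoop-tool/Hadoop-1.2.1/bin/hadoop namenode -format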
Reminder: if you format the Namenode, you will lose all data currently in HDFS.
Figure 9 shows the output of formatting the Namenode; you need to check for a message saying
that the Namenode has been successfully formatted. The cluster is then started with the following command:
sudo ~/Hadoop-tool/Hadoop-1.2.1/bin/start-all.sh
As shown in Figure 10, the Namenode, Datanode, Jobtracker and Tasktracker have started on
the machine.
As shown in Figure 11, the jps tool is used to check whether the Hadoop processes are running
correctly.
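For reference (these values are illustrative and not taken from Figure 11), a healthy single-node Hadoop 1.x cluster typically reports something similar to:
jps
2401 NameNode
2532 DataNode
2660 SecondaryNameNode
2781 JobTracker
2903 TaskTracker
3015 Jps
The process identifiers will differ on every machine; what matters is that the Hadoop daemons are all listed alongside Jps itself.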
Figure 11: Checking JPS