2024-2025
A
Project Report
On
HADOOP Configuration
CERTIFICATE
This is to certify that
Prof. P. V. Dhurve
Guide External
We would like to deeply thank the various people who, during the several months this endeavour lasted, provided us with useful and helpful assistance. Without their care and consideration, this project would likely not have matured.
First, we would like to thank our project guide and Head of Department, Prof. M. M. Kulkarni Sir, for his guidance and interest. His guidance reflects an expertise we certainly do not master ourselves. We also thank him for his patience throughout the cross-reviewing process, which constitutes a rather difficult balancing act.
Second, we would like to thank our subject teacher Prof. P. V. Dhurve and all the staff members of the Computer Department for providing admirable feedback and insights whenever we discussed the project with them. We also extend our thanks to the lab assistants who guided us in the implementation of our project.
We would also like to extend our special thanks to our Principal, Prof. V. P. Nikhade, for his encouragement and words of wisdom.
Finally, we express our deepest gratitude to our families and friends, who encouraged us from the beginning and provided insightful reviews that helped make this project successful.
CONTENTS:
1. Abstract
2. Introduction
3. Project Objective
4. Features of Hadoop
5. Modules of Hadoop
6. Installation of Hadoop
7. Conclusion
ABSTRACT:
What is Hadoop?
Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very large in volume. Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more.
Here are the key components and features of Hadoop:
1. HDFS (Hadoop Distributed File System)
2. MapReduce
3. YARN (Yet Another Resource Negotiator)
4. Hadoop Common
a) Utilities and Libraries: Hadoop Common includes a set of utilities and libraries needed by the other Hadoop modules. This includes file system APIs, configuration management, and basic I/O operations.
INTRODUCTION:
Applications built using Hadoop run on large data sets distributed across clusters of commodity computers. Commodity computers are cheap and widely available, and they are mainly useful for achieving greater computational power at low cost.
Just as data resides in the local file system of a personal computer, in Hadoop data resides in a distributed file system called the Hadoop Distributed File System (HDFS). The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) that contain the data. This computational logic is nothing but a compiled version of a program written in a high-level language such as Java, and such a program processes the data stored in HDFS.
In the age of big data, organizations are generating and processing vast amounts of data every
day. Traditional data management systems struggle to handle such large-scale data due to
limitations in storage, processing power, and scalability. Apache Hadoop is an open-source
framework that provides a solution to these challenges by enabling the distributed storage and
processing of massive datasets across a cluster of computers.
Hadoop is designed to handle big data in a cost-effective, scalable, and fault-tolerant manner.
The Hadoop framework is based on a simple yet powerful idea: distributing data across
multiple machines and processing it in parallel. This approach allows for the handling of data
that would otherwise be impossible or too expensive to manage with traditional database
systems.
PROJECT OBJECTIVE:
a) Learn the process of setting up a basic Hadoop cluster, either on local machines or using virtual environments such as Hadoop on Docker.
b) Configure HDFS and YARN for data storage and resource management, respectively.
c) Understand how data is stored in the Hadoop Distributed File System (HDFS) and how to perform basic file operations such as uploading, reading, and managing data within HDFS (see the example commands after this list).
d) Explore the fault-tolerant nature of HDFS and its ability to replicate data across multiple nodes.
e) Get familiar with additional tools in the Hadoop ecosystem such as Hive, Pig, and HBase (optional), and understand how they complement Hadoop in solving different data-related challenges.
f) Understand how Hadoop can scale to process petabytes of data and handle hardware failures gracefully through replication and fault-tolerance mechanisms.
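The basic HDFS file operations mentioned above can be practised from the command line once a cluster is up and running (see the installation section below). The directory and file names here are only examples:
Command: hdfs dfs -mkdir -p /user/hadoop/input        (create a directory in HDFS)
Command: hdfs dfs -put localfile.txt /user/hadoop/input/        (upload a local file into HDFS)
Command: hdfs dfs -ls /user/hadoop/input        (list the directory contents)
Command: hdfs dfs -cat /user/hadoop/input/localfile.txt        (read the file back from HDFS)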
Features Of ‘Hadoop’:
When working with Apache Hadoop in a microproject, it’s essential to understand its key
features and how they contribute to its power in managing and processing big data. Below are
the main features of Hadoop that will be highlighted and utilized throughout the project:
3. Scalability
a) Horizontal Scalability: Hadoop is designed to scale horizontally, meaning you can add
more nodes to the cluster as your data grows. This ensures that Hadoop can efficiently
handle petabytes of data.
b) Elasticity: The Hadoop system can dynamically scale resources up or down, based on
the size and complexity of the data being processed.
4. Fault Tolerance
a) Data Replication: HDFS ensures fault tolerance by replicating data blocks across multiple nodes. Even if some nodes fail, the data is still available from other replicas.
b) Automatic Recovery: If a node or task fails, Hadoop automatically recovers by redistributing tasks and data, ensuring minimal disruption in processing.
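As a quick way to see replication in action, the fsck utility reports how many replicas each block of a file has and on which nodes they reside. The file path below is only an example, reusing the directory from the HDFS commands shown earlier:
Command: hdfs fsck /user/hadoop/input/localfile.txt -files -blocks -locations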
Modules of Hadoop:
1) HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of that paper. It states that files are broken into blocks and stored on nodes across the distributed architecture.
2) YARN: Yet Another Resource Negotiator, used for job scheduling and for managing the cluster.
3) MapReduce: A framework that helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result (see the example run after this list).
4) Hadoop Common: The Java libraries used to start Hadoop, which are also used by the other Hadoop modules.
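To make the Map and Reduce phases concrete, Hadoop ships with a word-count example job that can be run once the single-node cluster from the installation section below is running. The jar path, the version number (2.7.3), and the HDFS directories are assumptions for illustration; adjust them to match your installation:
Command: hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /user/hadoop/input /user/hadoop/output
Command: hdfs dfs -cat /user/hadoop/output/part-r-00000
Here the Map tasks emit a (word, 1) pair for every word in the input files, and the Reduce tasks sum the values for each word, so the final output lists each word with its total count.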
Installation of Hadoop:
Step 1: Download the Java 8 package and save the file in your home directory.
Step 2: Extract the Java Tar File.
Command: tar -xvf jdk-8u101-linux-i586.tar.gz
Fig: Hadoop Installation – Untarring the Java Package
Step 5: Add the Hadoop and Java paths to the bash file (.bashrc). Open the .bashrc file and add the Hadoop and Java paths as shown below.
Command: vi .bashrc
Fig: Hadoop Installation – Setting Environment Variable
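Since the figure is not reproduced here, a typical set of .bashrc entries is sketched below. The install directories are assumptions (jdk1.8.0_101 matches the tarball extracted in Step 2; hadoop-2.7.3 is an example Hadoop directory); adjust them to wherever the archives were actually extracted:
export JAVA_HOME=$HOME/jdk1.8.0_101
export HADOOP_HOME=$HOME/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin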
To apply all these changes in the current terminal, execute the source command. To make sure that Java and Hadoop have been properly installed on your system and can be accessed through the terminal, execute the java -version and hadoop version commands.
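The corresponding commands, in the same convention used throughout this report, are:
Command: source .bashrc
Command: java -version
Command: hadoop version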
Fig: Hadoop Installation – Checking Java Version
Command: ls
Step 7: Open core-site.xml and edit the property mentioned below inside the <configuration> tag:
core-site.xml informs the Hadoop daemon where the NameNode runs in the cluster. It contains the core Hadoop configuration settings, such as the I/O settings that are common to HDFS and MapReduce.
Command: vi core-site.xml
Fig: Hadoop Installation – Configuring core-site.xml
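As the figure is not reproduced here, a typical minimal property for a single-node setup is sketched below. The property name and the hdfs://localhost:9000 address are the commonly used defaults in this kind of setup (newer Hadoop releases prefer the name fs.defaultFS); adjust the host and port to your environment:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>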
Step 8: Open hdfs-site.xml and edit the property mentioned below inside the <configuration> tag:
hdfs-site.xml contains the configuration settings of the HDFS daemons (i.e. the NameNode, DataNode, and Secondary NameNode). It also includes the replication factor and block size of HDFS.
Command: vi hdfs-site.xml
Fig: Hadoop Installation – Configuring hdfs-site.xml
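Again, since the figure is not shown, a typical single-node entry is sketched below. A replication factor of 1 is the usual choice for a single-node cluster because there is only one DataNode to hold replicas; the value here is an assumption for illustration:
<?xml version="1.0"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>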
Step 9: Open the mapred-site.xml file and edit the property mentioned below inside the <configuration> tag:
mapred-site.xml contains the configuration settings for MapReduce applications, such as the number of JVMs that can run in parallel, the size of the mapper and reducer processes, the CPU cores available to a process, etc.
In some cases, the mapred-site.xml file is not present by default, so it has to be created from the mapred-site.xml.template file.
Fig: Hadoop Installation – Configuring mapred-site.xml
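The template copy and a typical minimal property are sketched below (run from the Hadoop configuration directory). Telling MapReduce to use YARN as its execution framework is the standard setting for this kind of setup:
Command: cp mapred-site.xml.template mapred-site.xml
Command: vi mapred-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>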
Step 10: Open yarn-site.xml and edit the property mentioned below inside the <configuration> tag:
yarn-site.xml contains the configuration settings of the ResourceManager and NodeManager, such as the memory available to applications and the auxiliary services (for example, the MapReduce shuffle service) that the NodeManager must run.
Command: vi yarn-site.xml
Fig: Hadoop Installation – Configuring yarn-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
CONCLUSION:
Congratulations, you have successfully installed a single-node Hadoop cluster in one go. In our
next blog of the Hadoop Tutorial Series, we will be covering how to install Hadoop on a multi-
node cluster as well.
Now that you have understood how to install Hadoop, check out the Hadoop admin course by
Edureka, a trusted online learning company with a network of more than 250,000 satisfied
learners spread across the globe. The Edureka Big Data Engineer Course helps learners become
experts in HDFS, Yarn, MapReduce, Pig, Hive, HBase, Oozie, Flume, and Sqoop using real-
time use cases on Retail, Social Media, Aviation, Tourism, Finance domains.