
Nashik Gramin Shikshan Prasarak Mandal’s

BRAHMA VALLEY COLLEGE OF TECHNICAL EDUCATION, ANJANERI, NASHIK
DEPARTMENT OF COMPUTER TECHNOLOGY

PROJECT WORK BOOK


YEAR 2024-2025
Project Title: HADOOP Configuration.

Sr. No. Student Name Seat No. Enrolment No.


1. Shreyash Jagan Chavan 420974 23611480169
2. Aditya Bhausaheb Jadhav 420976 23611480171
3. Vishal Anil Chaudhari 420973 23611480168

Project Guide: Prof. P. V. Dhurve


MAHARASHTRA STATE
BOARD OF TECHNICAL EDUCATION (MSBTE), MUMBAI

Brahma Valley College of Technical Education, Anjaneri, Nashik


Department of Computer Technology

2024-2025

A
Project Report
On
HADOOP Configuration

Shreyash Jagan Chavan


Aditya Bhausaheb Jadhav
Vishal Anil Chaudhari

Under the guidance of


Prof. P. V. Dhurve
Brahma Valley College of Technical Education,
Anjaneri, Nashik

Department of Computer Technology

CERTIFICATE
This is to certify that

Student Name: Shreyash Jagan Chavan


Aditya Bhausaheb Jadhav
Vishal Anil Chaudhari

have successfully completed their project on “HADOOP Configuration” at Brahma
Valley College of Technical Education, Anjaneri, Nashik, in partial fulfilment of the
Diploma course in Computer Technology in the academic year 2024-2025.

Prof. P. V. Dhurve
Guide External

Prof. M. M. Kulkarni Prof. V.P. Nikhade


Head of the Department Principal
Brahma Valley College of Technical Education,
Anjaneri, Nashik

Department of Computer Technology

CERTIFICATE

This is to certify that Shreyash Jagan Chavan from the Computer Technology

Department has successfully completed his Project on “HADOOP Configuration” at

Brahma Valley College of Technical Education, Anjaneri, Nashik, in partial fulfilment

of the Diploma course in Computer Technology in the academic year 2024-2025.

Prof. P. V. Dhurve
Guide External

Prof. M. M. Kulkarni Prof. V.P. Nikhade


Head of the Department Principal
ACKNOWLEDGEMENT

We would like to deeply thank the various people who, during the several months this
endeavour lasted, provided us with useful and helpful assistance. Without their care and
consideration, this project would likely not have matured.

First, we would like to thank our Head of Department, Prof. M. M. Kulkarni, for his
guidance and interest. His guidance reflects expertise we certainly do not master ourselves.
We also thank him for his patience throughout the cross-reviewing, which constitutes a rather
difficult balancing act.

Second, we would like to thank our project guide, Prof. P. V. Dhurve, and all the staff
members of the Computer Technology Department for providing us with their admirable
feedback and insights whenever we discussed our project with them. We also extend our
thanks to the lab assistants who guided us in the implementation of our project.

We would like to extend our special thanks to our Principal, Prof. V. P. Nikhade, for his
encouragement and words of wisdom.

Finally, we express our deepest gratitude to our families and friends, who encouraged us
from the beginning and provided us with their insightful reviews to help make our project
successful.

Shreyash Jagan Chavan


Aditya Bhausaheb Jadhav
Vishal Anil Chaudhari
INDEX

Sr. No.   Topic                         Page No.
1. Abstract 1

2. Introduction 2

3. Project Objective 3

4. Features of Hadoop 4

5. Module of Hadoop 5

6. Installation of Hadoop 6-12

7. Conclusion 13

8. Reference 14
ABSTRACT:

What is Hadoop?
Hadoop is an open-source framework from Apache used to store, process, and analyze data
that is very large in volume. Hadoop is written in Java and is not an OLAP (online analytical
processing) system; it is used for batch/offline processing. It is used by Facebook, Yahoo,
Google, Twitter, LinkedIn, and many more.
Here are the key components and features of Hadoop:

1. Hadoop Distributed File System (HDFS)


1. Storage Layer: HDFS is a distributed file system designed to store large amounts of
data across multiple machines. It breaks large files into smaller blocks and distributes
them across the cluster for parallel storage. Each block is replicated to ensure fault
tolerance.
2. Fault Tolerance: HDFS automatically replicates data (typically three copies) to ensure
reliability even if nodes fail.
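As an illustration, once a cluster is running, a file can be copied into HDFS and listed with the HDFS shell (the directory and file names below are only examples):

Command: hdfs dfs -mkdir -p /user/hadoop/input
Command: hdfs dfs -put sample.txt /user/hadoop/input
Command: hdfs dfs -ls /user/hadoop/input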

2. MapReduce

1. Processing Layer: MapReduce is a programming model used for processing and
generating large datasets. It divides tasks into two phases:
a) Map: The input data is divided into smaller sub-tasks (mappers) that are
processed in parallel across the cluster.
b) Reduce: The results of the mappers are aggregated and processed in a final
"reduce" phase to produce the output.
2. This model allows for the parallel processing of vast amounts of data across many
machines in a Hadoop cluster.
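As a simple illustration, the example jar bundled with Hadoop can run a word count over files already stored in HDFS (the jar path below assumes the Hadoop 2.7.3 package extracted in the home directory, and the input/output directories are only examples):

Command: hadoop jar hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /user/hadoop/input /user/hadoop/output
Command: hdfs dfs -cat /user/hadoop/output/part-r-00000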

3. YARN (Yet Another Resource Negotiator)

1. Resource Management: YARN is the resource management layer in Hadoop. It
manages and schedules the resources in the cluster, allowing multiple applications to
run concurrently. YARN allocates resources to different tasks (MapReduce, Spark,
etc.) based on their requirements.
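Once YARN is running, the applications it is managing can be checked from the command line, or through the ResourceManager web UI (on port 8088 of the ResourceManager node by default):

Command: yarn application -list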

4. Hadoop Common

1. Utilities and Libraries: Hadoop Common includes a set of utilities and libraries
needed by other Hadoop modules. This includes file system APIs, configuration
management, and basic I/O operations.

INTRODUCTION:

Apache Hadoop is an open-source software framework used to develop data processing
applications which are executed in a distributed computing environment.

Applications built using HADOOP are run on large data sets distributed across clusters of
commodity computers. Commodity computers are cheap and widely available. These are
mainly useful for achieving greater computational power at low cost.

Similar to data residing in the local file system of a personal computer, in Hadoop, data
resides in a distributed file system called the Hadoop Distributed File System (HDFS). The
processing model is based on the ‘Data Locality’ concept, wherein computational logic is sent
to the cluster nodes (servers) containing the data. This computational logic is nothing but a
compiled version of a program written in a high-level language such as Java. Such a program
processes data stored in Hadoop HDFS.

In the age of big data, organizations are generating and processing vast amounts of data every
day. Traditional data management systems struggle to handle such large-scale data due to
limitations in storage, processing power, and scalability. Apache Hadoop is an open-source
framework that provides a solution to these challenges by enabling the distributed storage and
processing of massive datasets across a cluster of computers.

Hadoop is designed to handle big data in a cost-effective, scalable, and fault-tolerant manner.
The Hadoop framework is based on a simple yet powerful idea: distributing data across
multiple machines and processing it in parallel. This approach allows for the handling of data
that would otherwise be impossible or too expensive to manage with traditional database
systems.

PROJECT OBJECTIVE:

1. Understand the Basics of Hadoop:

a) Gain a clear understanding of the core components of Hadoop: Hadoop
Distributed File System (HDFS), MapReduce, and YARN.
b) Explore how Hadoop can efficiently store and process vast amounts of data in a
distributed environment.

2. Set Up a Hadoop Cluster:

a) Learn the process of setting up a basic Hadoop cluster, either on local machines or
using virtual environments like Hadoop on Docker.
b) Configure HDFS and YARN for data storage and resource management,
respectively.

3. Perform Data Storage and Retrieval in HDFS:

a) Understand how data is stored in the Hadoop Distributed File System (HDFS) and
how to perform basic file operations such as uploading, reading, and managing data
within HDFS.
b) Explore the fault-tolerant nature of HDFS and its ability to replicate data across
multiple nodes.

4. Implement a Basic MapReduce Program:

a) Develop and execute a simple MapReduce program to process data stored in
HDFS.
b) Learn how the MapReduce programming model works by breaking down a task
into Map and Reduce phases, and how Hadoop distributes the computation across
a cluster.

5. Explore Hadoop Ecosystem Tools:

a) Get familiar with additional tools in the Hadoop ecosystem like Hive, Pig, and
HBase (optional), and understand how they complement Hadoop in solving
different data-related challenges.

6. Demonstrate Hadoop’s Scalability and Fault Tolerance:

a) Understand how Hadoop can scale to process petabytes of data and handle hardware
failures gracefully through replication and fault tolerance mechanisms.

Features of Hadoop:
When working with Apache Hadoop in a microproject, it’s essential to understand its key
features and how they contribute to its power in managing and processing big data. Below are
the main features of Hadoop that will be highlighted and utilized throughout the project:

1. Distributed Storage (HDFS)

a) Scalability: The Hadoop Distributed File System (HDFS) allows data to be
distributed across many machines in a cluster, ensuring that even vast amounts of data
can be stored and managed efficiently.
b) Fault Tolerance: HDFS automatically replicates data blocks across multiple nodes
(typically three copies), ensuring that data is preserved even in the event of hardware
failures.
c) Data Locality: Hadoop runs processing tasks on the nodes where the data blocks are
stored, which improves performance by minimizing the need to move data across the network.
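The distribution of blocks and replicas, and the overall health of the file system, can be inspected with the following commands (the path is only an example):

Command: hdfs fsck /user/hadoop/input -files -blocks -locations
Command: hdfs dfsadmin -report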

2. Parallel Data Processing (MapReduce)

a) MapReduce Framework: Hadoop’s core processing engine, MapReduce, divides
tasks into smaller, parallelizable sub-tasks. It enables efficient processing of large
datasets by splitting the workload across multiple nodes in the cluster.
b) Map Phase: The input data is broken into smaller chunks, processed in parallel by
multiple mappers, and transformed into key-value pairs.
c) Reduce Phase: The results from the mappers are aggregated and combined to produce
the final output.

3. Scalability

a) Horizontal Scalability: Hadoop is designed to scale horizontally, meaning you can add
more nodes to the cluster as your data grows. This ensures that Hadoop can efficiently
handle petabytes of data.
b) Elasticity: The Hadoop system can dynamically scale resources up or down, based on
the size and complexity of the data being processed.

4. Fault Tolerance and Reliability

a) Data Replication: HDFS ensures fault tolerance by replicating data blocks across
multiple nodes. Even if some nodes fail, the data is still available from other replicas.
b) Automatic Recovery: If a node or task fails, Hadoop automatically recovers by
redistributing tasks and data, ensuring minimal disruption in processing.
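For instance, the replication factor of a file already stored in HDFS can be changed, and the change waited on, with the following command (the factor and path are only examples):

Command: hdfs dfs -setrep -w 3 /user/hadoop/input/sample.txt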

Modules of Hadoop:
1) HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was
developed on the basis of it. It states that files are broken into blocks and stored on nodes
over the distributed architecture.

2) YARN: Yet Another Resource Negotiator is used for job scheduling and for managing the
cluster.

3) MapReduce: This is a framework which helps Java programs do parallel computation on
data using key-value pairs. The Map task takes input data and converts it into a data set which
can be computed as key-value pairs. The output of the Map task is consumed by the Reduce
task, and then the output of the reducer gives the desired result.

4) Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.

Installation of Hadoop:
Step 1: Download the Java 8 package. Save this file in your home directory.
Step 2: Extract the Java Tar File.
Command: tar -xvf jdk-8u101-linux-i586.tar.gz

Fig: Hadoop Installation – Extracting Java Files

Step 3: Download the Hadoop 2.7.3 Package.


Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

Fig: Hadoop Installation – Downloading Hadoop

Step 4: Extract the Hadoop tar File.


Command: tar -xvf hadoop-2.7.3.tar.gz

Fig: Hadoop Installation – Extracting Hadoop Files .

Step 5: Add the Hadoop and Java paths in the bash file (.bashrc). Open the
.bashrc file. Now, add the Hadoop and Java paths as shown below.


Command: vi .bashrc
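The entries added to .bashrc typically look like the following (a minimal sketch; the exact paths depend on where Java and Hadoop were extracted, here assumed to be the home directory):

export JAVA_HOME=$HOME/jdk1.8.0_101
export HADOOP_HOME=$HOME/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin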

Fig: Hadoop Installation – Setting Environment Variable

Then, save the bash file and close it.

For applying all these changes to the current Terminal, execute the source command.

Command: source .bashrc

Fig: Hadoop Installation – Refreshing environment variables

To make sure that Java and Hadoop have been properly installed on your system and can
be accessed through the Terminal, execute the java -version and hadoop version
commands.

Command: java -version

Fig: Hadoop Installation – Checking Java Version

Command: hadoop version

Fig: Hadoop Installation – Checking Hadoop Version

Step 6: Edit the Hadoop Configuration files.


Command: cd hadoop-2.7.3/etc/hadoop/

Command: ls

All the Hadoop configuration files are located in the hadoop-2.7.3/etc/hadoop directory, as
you can see in the snapshot below:

Fig: Hadoop Installation – Hadoop Configuration Files

Step 7: Open core-site.xml and edit the property mentioned below inside the
configuration tag:
core-site.xml informs the Hadoop daemon where the NameNode runs in the cluster. It
contains configuration settings of the Hadoop core, such as I/O settings that are
common to HDFS and MapReduce.

Command: vi core-site.xml

Fig: Hadoop Installation – Configuring core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

Step 8: Edit hdfs-site.xml and edit the property mentioned below inside the configuration
tag:
hdfs-site.xml contains configuration settings of the HDFS daemons (i.e., the NameNode,
DataNode, and Secondary NameNode). It also includes the replication factor and block size
of HDFS.

Command: vi hdfs-site.xml

Fig: Hadoop Installation – Configuring hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
</configuration>

Step 9: Edit the mapred-site.xml file and edit the property mentioned below inside the
configuration tag:
mapred-site.xml contains configuration settings of the MapReduce application, such as the
number of JVMs that can run in parallel, the size of the mapper and the reducer
processes, the CPU cores available for a process, etc.

In some cases, mapred-site.xml file is not available. So, we have to create the mapred-
site.xml file using mapred-site.xml template.

Command: cp mapred-site.xml.template mapred-site.xml

Command: vi mapred-site.xml

Fig: Hadoop Installation – Configuring mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Step 10: Edit yarn-site.xml and edit the property mentioned below inside the
configuration tag:
yarn-site.xml contains configuration settings of the ResourceManager and NodeManager,
such as the application memory management size, the operations needed on programs and
algorithms, etc.


Command: vi yarn-site.xml

Fig: Hadoop Installation – Configuring yarn-site.xml

<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

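Before starting the cluster, it is also common to set JAVA_HOME in etc/hadoop/hadoop-env.sh. After that, as a rough sketch of the typical remaining steps for a single-node setup (paths assumed as above), the NameNode is formatted once and the HDFS and YARN daemons are started; jps should then list the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager, and the NameNode web UI is reachable on port 50070 by default:

Command: cd hadoop-2.7.3
Command: bin/hdfs namenode -format
Command: sbin/start-dfs.sh
Command: sbin/start-yarn.sh
Command: jps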
CONCLUSION:

With the steps above completed, a single-node Hadoop cluster has been successfully installed
and configured. The same configuration approach can be extended to install Hadoop on a
multi-node cluster, where HDFS and YARN are distributed across several machines.

This project gave us a working understanding of the core Hadoop components, HDFS, YARN,
and MapReduce, and of how tools from the wider Hadoop ecosystem, such as Pig, Hive, HBase,
Oozie, Flume, and Sqoop, build on this foundation for real-world use cases in domains such as
retail, social media, aviation, tourism, and finance.
REFERENCE:

BOOKS:
1. A Guide to Measuring and Monitoring Project Performance by Harold Kerzner
2. Advanced Database Systems by Nabil R. Adam, Bhagvan
3. Database Systems: Design, Implementation, and Management by Peter Rob

WEBSITES:
1. https://www.geeksforgeeks.org/DBMS
2. https://www.tutorialspoint.com
3. https://data-flair.training/blogs/best-data-mining-books/amp/
4. https://www.guru99.com/learn-hadoop-in-10-minutes.html
