
Hadoop

Hadoop is an open-source software framework for storing large amounts of
data and performing computation on it.

The framework is written mainly in Java, with some native code in C and
some shell scripts.

Hadoop is overseen by the Apache Software Foundation and is used for
storing and processing huge datasets on clusters of commodity hardware.

There are mainly two problems with big data:

 The first is storing such a huge amount of data.
 The second is processing that stored data.

Traditional approaches such as an RDBMS are not sufficient because of the
sheer volume and heterogeneity of the data. Hadoop addresses this problem of
big data, i.e. storing and processing it, with some extra capabilities. Its two
core components are the Hadoop Distributed File System (HDFS) and Yet
Another Resource Negotiator (YARN).

History:

Hadoop was started by Doug Cutting and Mike Cafarella in 2002, when they
began working on the Apache Nutch project.

The Apache Nutch project aimed to build a search engine system that could
index a billion pages. After a lot of research on Nutch, they concluded that
such a system would cost around half a million dollars in hardware, plus a
monthly running cost of roughly $30,000, which was very expensive.

Co-founder Doug Cutting later named Hadoop after his son's toy elephant.

In 2003, they came across a paper published by Google that described the
architecture of Google's distributed file system, GFS (Google File System).

In 2004, Google published another paper, on the MapReduce technique, which
described how to process such large datasets. For Doug Cutting and Mike
Cafarella, this paper was the other half of the solution they needed for the
Nutch project.

Hadoop itself was created under the Apache Software Foundation in 2006,
building on the Google papers that described the Google File System (GFS)
and the MapReduce programming model.

In January 2008, Yahoo released Hadoop as an open-source project to the
Apache Software Foundation (ASF). In July 2008, the ASF successfully tested
a 4,000-node cluster with Hadoop.

In 2009, Hadoop successfully sorted a petabyte (PB) of data in less than 17
hours while handling billions of searches and indexing millions of web pages.
That year Doug Cutting left Yahoo and joined Cloudera to take on the
challenge of spreading Hadoop to other industries.

In December 2011, the Apache Software Foundation released Apache Hadoop
version 1.0.

In August 2013, version 2.0.6 became available.

Apache Hadoop version 3.0 was released in December 2017.
Modules/Components of Hadoop

 HDFS (Hadoop Distributed File System): HDFS is used for storage. It is
designed to run on commodity (inexpensive) hardware, following a distributed
file system design, and it favours storing data in large blocks rather than in
many small blocks. Files are broken into blocks and stored on nodes across
the distributed architecture (a small worked example follows this list).
 YARN (Yet Another Resource Negotiator) is used for job scheduling and for
managing the cluster.
 MapReduce: a framework that lets Java programs perform parallel
computation on data using key-value pairs. The Map task takes input data and
converts it into a dataset of key-value pairs; the Reduce task consumes the
output of the Map task, and the reducer's output gives the desired result.
 Hadoop Common: These Java libraries are used to start Hadoop and are
used by other Hadoop modules.
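
As a rough illustration of the block model mentioned in the HDFS item above,
here is a small back-of-the-envelope sketch in Java. It assumes the default
128 MB block size and replication factor of 3 used by Hadoop 2.x and later;
the 1 GB file size is just a hypothetical figure.

// Back-of-the-envelope block arithmetic for the HDFS model described above,
// assuming the default 128 MB block size and replication factor of 3.
public class BlockMath {
    public static void main(String[] args) {
        long fileSizeMb = 1024;   // a hypothetical 1 GB file
        long blockSizeMb = 128;   // HDFS default block size (Hadoop 2.x+)
        int replication = 3;      // HDFS default replication factor

        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb; // ceiling division
        System.out.println("blocks: " + blocks);                        // 8 blocks
        System.out.println("stored replicas: " + blocks * replication); // 24 block copies
    }
}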

Hadoop architecture

The architecture mainly consists of four components:

 HDFS(Hadoop Distributed File System)
 MapReduce
 YARN(Yet Another Resource Negotiator)
 Common Utilities or Hadoop Common
1. HDFS

HDFS (Hadoop Distributed File System) is used for storage. It is designed to
run on commodity (inexpensive) hardware, following a distributed file system
design. HDFS favours storing data in large blocks rather than in many small
blocks.

NameNode (Master):
The NameNode works as the master in a Hadoop cluster and guides the
DataNodes (slaves). It mainly stores the metadata, i.e. the data about the data.
The metadata includes transaction logs that keep track of user activity in the
cluster, as well as file names, sizes, and information about the location (block
numbers, block IDs) of the data on the DataNodes.

DataNode (Slave):
DataNodes work as slaves and are mainly used for storing the data in a Hadoop
cluster; the number of DataNodes can range from one to 500 or even more.
The more DataNodes there are, the more data the cluster can store, so it is
advisable for DataNodes to have high storage capacity in order to hold a large
number of file blocks.
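
To make the NameNode/DataNode split concrete, here is a minimal Java sketch
using the HDFS FileSystem API. It is only an illustrative sketch: the NameNode
address hdfs://namenode:9000 and the path /demo/sample.txt are placeholders,
and it assumes the Hadoop client libraries are on the classpath. Writing the
file goes through the NameNode for metadata while the blocks land on
DataNodes, and getFileBlockLocations asks the NameNode which hosts hold each
block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/sample.txt");         // placeholder path

        // Write a small file; the NameNode records its metadata,
        // the DataNodes store the actual blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Ask the NameNode where the file's blocks live.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("block hosts: " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}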
2. MapReduce
MapReduce is essentially a programming model that runs on top of the YARN
framework. Its major feature is performing distributed processing in parallel
across a Hadoop cluster, which is what makes Hadoop so fast; when dealing with
big data, serial processing is no longer of any use.
MapReduce has two main tasks, divided phase-wise: the Map phase runs first,
followed by the Reduce phase, as in the word-count sketch below.
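
The classic word-count example shows the two phases on a small scale. The
sketch below uses the standard org.apache.hadoop.mapreduce API; the input and
output paths are taken from the command line and are placeholders, and class
names such as TokenMapper and SumReducer are illustrative choices, not part of
Hadoop itself.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: turn each input line into (word, 1) key-value pairs.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // placeholder input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // placeholder output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The Map phase emits a (word, 1) pair for every word it sees; the framework
then groups the pairs by key, and the Reduce phase sums the counts for each
word to produce the final result.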

Job Tracker:
 The Job Tracker is the component that manages all activities related to
MapReduce jobs in the Hadoop cluster, such as initiating tasks, managing
resources, and killing tasks.
 It acts as the master for all Task Tracker processes in MapReduce, just as
the NameNode is the master for all DataNode processes in HDFS.
 The Job Tracker has all the information about the Task Tracker nodes and
their health at any given point in time.
Task Tracker:
 The Task Tracker is a process running on every slave node alongside the
DataNode process.
 It updates the Job Tracker about resource and task status on the machine on
which it is running.
3. YARN
YARN stands for "Yet Another Resource Negotiator". It was introduced in
Hadoop 2.0 to remove the Job Tracker bottleneck that was present in
Hadoop 1.0.
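
As a small illustration of YARN's role as the cluster's resource negotiator,
the following Java sketch queries the ResourceManager for the NodeManagers
that are currently running. It assumes the yarn-client library is on the
classpath and that yarn-site.xml points at a reachable cluster; it only reads
cluster state and does not submit any application.

import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesDemo {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
        yarn.start();

        // Ask the ResourceManager which NodeManagers are currently running.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " containers=" + node.getNumContainers());
        }
        yarn.stop();
    }
}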

4. Hadoop Common

Hadoop Common (also called the common utilities) is the set of Java libraries
and files needed by all the other components in a Hadoop cluster. These
utilities are used by HDFS, YARN, and MapReduce to run the cluster.
