Unit III
History of Hadoop
Hadoop is an open-source software framework used for
storing and processing Big Data in a distributed manner
on large clusters of commodity hardware. Hadoop is a
top-level project of the Apache Software Foundation (ASF)
and is licensed under the Apache License.
Hadoop is written in the Java programming language.
Hadoop was developed by Doug Cutting and Mike J. Cafarella.
Inspired by Google's work, Hadoop builds on technologies
such as the MapReduce programming model and the Google
File System (GFS).
It is optimized to handle massive quantities of data, whether
structured, semi-structured or unstructured, using commodity
hardware, that is, relatively inexpensive computers.
It is designed to scale from a single server to thousands of
machines, each offering local computation and storage, and it
supports processing large collections of data in a distributed
computing environment.
Hadoop Architecture
Apache Hadoop offers a scalable, flexible and reliable
distributed computing framework for Big Data, built on a
cluster of systems that each contribute storage capacity and
local computing power by leveraging commodity hardware.
Hadoop runs applications using the MapReduce programming
model, in which the data is processed in parallel on different
cluster nodes.
In short, the Hadoop framework makes it possible to develop
applications that run on clusters of computers and perform
complete statistical analysis of huge amounts of data.
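A minimal sketch of such an application is the classic word-count example, written here against the org.apache.hadoop.mapreduce API; the class names (TokenizerMapper, IntSumReducer) are illustrative and not defined elsewhere in these notes.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs in parallel on different nodes, emitting (word, 1) pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: aggregates the counts for each word across all mappers.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```

Each map task processes one split of the input in parallel with the others, and the reduce tasks aggregate the intermediate (word, count) pairs; the driver that actually submits such a job is sketched later in this unit.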
Hadoop follows a master-slave architecture for the
transformation and analysis of large datasets using the
Hadoop MapReduce paradigm.
The three core Hadoop components that play a vital
role in the Hadoop architecture are:
Hadoop Distributed File System (HDFS)
Hadoop MapReduce
Yet Another Resource Negotiator (YARN)
Hadoop Distributed File System (HDFS):
For every job submitted for execution in the system, there is one
JobTracker, which resides on the NameNode, and there are multiple
TaskTrackers, which reside on the DataNodes.
A job is divided into multiple tasks, which are then
run on multiple DataNodes in the cluster.
It is the responsibility of the JobTracker to coordinate
this activity by scheduling tasks to run on different
DataNodes.
Execution of an individual task is then looked after by a
TaskTracker, which resides on every DataNode
executing part of the job.
The TaskTracker's responsibility is to send progress
reports to the JobTracker.
In addition, the TaskTracker periodically sends a
'heartbeat' signal to the JobTracker to notify it of the
current state of the system.
Thus the JobTracker keeps track of the overall progress of
each job. In the event of a task failure, the JobTracker
can reschedule the task on a different TaskTracker.
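From the client's point of view, a job is simply configured and submitted; the framework then splits it into tasks, schedules them on the DataNodes, and handles failures as described above. A minimal driver sketch for the word-count example given earlier, with input and output paths that are purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire in the map and reduce classes from the earlier sketch.
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Illustrative HDFS paths; the framework splits the input into tasks
        // and schedules them on the nodes that hold the data.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submit the job and wait; failed tasks are rescheduled automatically.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the JobTracker/TaskTracker pair belongs to classic (MRv1) MapReduce; on YARN-based clusters the same client code is used, with the ResourceManager and NodeManagers taking over resource management and scheduling.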
Functional Components:
Disadvantages of the Capacity Scheduler:
• More complex
• Not easy to configure for everyone
3. Fair Scheduler
• The Fair Scheduler is very similar to the Capacity
Scheduler, and job priority is also taken into
consideration.
• With the Fair Scheduler, YARN applications can share
the resources of a large Hadoop cluster, and these
resources are allocated dynamically, so there is no need
to reserve capacity in advance.
• The resources are distributed in such a manner that all
applications within the cluster get, on average, an equal
share of resources over time.
• By default, the Fair Scheduler makes scheduling decisions
on the basis of memory, but it can be configured to take
CPU into account as well, as sketched below.
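As a rough illustration, the active YARN scheduler is selected through the yarn.resourcemanager.scheduler.class property, which is normally set in yarn-site.xml on the ResourceManager. The sketch below expresses the same settings through Hadoop's Configuration API purely to show the relevant keys; the allocation-file path is an assumption.

```java
import org.apache.hadoop.conf.Configuration;

public class FairSchedulerConfigSketch {
    public static void main(String[] args) {
        // Normally these properties live in yarn-site.xml on the ResourceManager;
        // they are set programmatically here only to show the relevant keys.
        Configuration conf = new Configuration();

        // Switch YARN to the Fair Scheduler.
        conf.set("yarn.resourcemanager.scheduler.class",
                "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");

        // Optional: point the scheduler at a fair-share allocation file
        // (queue definitions, weights, etc.). The path shown is illustrative.
        conf.set("yarn.scheduler.fair.allocation.file",
                "/etc/hadoop/fair-scheduler.xml");

        System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
    }
}
```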