Unit III

History of Hadoop
Hadoop is an open-source software framework used for storing and processing Big Data in a distributed manner on large clusters of commodity hardware. Hadoop is developed under the Apache Software Foundation (ASF), is written in the Java programming language, and is one of the top-level Apache projects.
Hadoop was developed by Doug Cutting and Mike Cafarella, drawing inspiration from Google's MapReduce programming model and the Google File System (GFS).
• It is optimized to handle massive quantities of data that may be structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers.
• It is designed to scale from a single server to thousands of machines, each offering local computation and storage, and it supports large collections of data sets in a distributed computing environment.
Hadoop Architecture
• Apache Hadoop offers a scalable, flexible and reliable distributed computing framework for Big Data on clusters of commodity hardware, each system contributing storage capacity and local computing power.
• Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes.
• In short, the Hadoop framework makes it possible to develop applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data.
• Hadoop follows a Master-Slave architecture for the transformation and analysis of large datasets using the Hadoop MapReduce paradigm.
The 3 important Hadoop core components that
play a vital role in the Hadoop architecture are:
Hadoop Distributed File System (HDFS)
Hadoop MapReduce
Yet Another Resource Negotiator (YARN)
Hadoop Distributed File System (HDFS):
• HDFS runs on top of the existing file systems on each node in a Hadoop cluster.
• HDFS is a block-structured file system where each file is divided into blocks of a pre-determined size.
• Data in a Hadoop cluster is broken down into smaller units (called blocks) and distributed throughout the cluster. Each block is duplicated twice (for a total of three copies), with the two replicas stored on two nodes in a rack elsewhere in the cluster.
• Since the data has a default replication factor of three, it is highly available and fault-tolerant.
• If a copy is lost (because of machine failure, for example), HDFS will automatically re-replicate it elsewhere in the cluster, ensuring that the threefold replication factor is maintained.
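The block size and replication factor are ordinary HDFS configuration properties (dfs.blocksize and dfs.replication). A minimal sketch of reading them through the Configuration API is shown below, assuming the Hadoop configuration files are on the classpath; the fallback values used here are the usual defaults (three replicas, 128 MB blocks) in recent Hadoop releases.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsDefaults {
    public static void main(String[] args) {
        // Loads core-default.xml / core-site.xml from the classpath.
        Configuration conf = new Configuration();
        // hdfs-site.xml supplies the HDFS-specific properties, if present.
        conf.addResource("hdfs-site.xml");

        // Default replication factor: three copies of every block.
        String replication = conf.get("dfs.replication", "3");
        // Default block size in bytes: 128 MB in recent releases.
        String blockSize = conf.get("dfs.blocksize", "134217728");

        System.out.println("dfs.replication = " + replication);
        System.out.println("dfs.blocksize   = " + blockSize);
    }
}
```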
Features of HDFS:
• It is suitable for distributed storage and processing.
• Hadoop provides a command-line interface to interact with HDFS.
• The built-in servers of the NameNode and DataNode help users easily check the status of the cluster.
• It provides streaming access to file system data.
• HDFS provides file permissions and authentication.
Basic terminology:
• Node: A node is simply a computer. This is typically non-enterprise, commodity hardware for nodes that contain data.
• Rack: A collection of nodes is called a rack. A rack is a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch. Network bandwidth between any two nodes in the same rack is greater than the bandwidth between two nodes on different racks.
• Cluster: A Hadoop cluster (or just cluster from now on) is a collection of racks.
• File blocks: Blocks are simply the smallest contiguous locations on your hard drive where data is stored. In general, in any file system, data is stored as a collection of blocks.
Components of HDFS:
• HDFS is a block-structured file system where each file is divided into blocks of a pre-determined size.
• These blocks are stored across a cluster of one or several machines.
• Apache Hadoop HDFS architecture follows a Master/Slave architecture, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes).
• HDFS can be deployed on a broad spectrum of machines that support Java.
• Though one can run several DataNodes on a single machine, in the practical world these DataNodes are spread across various machines.
NameNode:
• NameNode is the master node in the Apache Hadoop HDFS architecture that maintains and manages the blocks present on the DataNodes (slave nodes).
• NameNode is a highly available server that manages the File System Namespace and controls access to files by clients.
• The HDFS architecture is built in such a way that user data never resides on the NameNode: the NameNode holds only metadata, while the data itself resides on the DataNodes.
Functions of NameNode:
• It is the master daemon that maintains and manages the DataNodes (slave nodes).
• It manages the file system namespace.
• It records the metadata of all the files stored in the cluster, e.g. the location of blocks stored, the size of the files, permissions, hierarchy, etc.
• There are two files associated with the metadata:
  o FsImage: contains the complete state of the file system namespace since the start of the NameNode.
  o EditLogs: contains all the recent modifications made to the file system with respect to the most recent FsImage.
• It records each change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
• It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive.
• It keeps a record of all the blocks in HDFS and of the nodes on which these blocks are located.
• The NameNode is also responsible for maintaining the replication factor of all the blocks.
• In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage and manages the communication traffic to the DataNodes.
DataNode:
• DataNodes are the slave nodes in HDFS.
• Unlike the NameNode, a DataNode is commodity hardware, that is, an inexpensive system which is not of high quality or high availability.
• The DataNode is a block server that stores the data in the local file system, such as ext3 or ext4.
Functions of DataNode:
• The actual data is stored on the DataNodes.
• DataNodes perform read and write operations on the file system, as per client request.
• They also perform operations such as block creation, deletion, and replication according to the instructions of the NameNode.
• They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.
Secondary NameNode:
• It is a separate physical machine which acts as a helper to the NameNode.
• It performs periodic checkpoints.
• It communicates with the NameNode and takes snapshots of the metadata, which helps minimize downtime and loss of data.
• The Secondary NameNode works concurrently with the primary NameNode as a helper daemon.
Functions of Secondary NameNode:
• The Secondary NameNode constantly reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
• It is responsible for combining the EditLogs with the FsImage from the NameNode.
• It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage.
• The new FsImage is copied back to the NameNode and is used the next time the NameNode is started.
• Hence, the Secondary NameNode performs regular checkpoints in HDFS. Therefore, it is also called the CheckpointNode.
HDFS Architecture:
• As described under Components of HDFS, the architecture follows a Master/Slave pattern: a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes).
• HDFS can be deployed on a broad spectrum of machines that support Java; though one can run several DataNodes on a single machine, in practice these DataNodes are spread across various machines.
Hadoop Streaming:
• It is a utility or feature that comes with the Hadoop distribution and allows developers or programmers to write Map-Reduce programs in different programming languages like Ruby, Perl, Python, C++, etc.
• We can use any language that can read from standard input (STDIN), like keyboard input, and write using standard output (STDOUT).
• Although the Hadoop framework is completely written in Java, programs for Hadoop do not necessarily need to be coded in Java. The Hadoop Streaming feature has been available since Hadoop version 0.14.1.
• The core of the flow is a basic MapReduce job. It begins with an Input Reader, which is responsible for reading the input data and producing a list of key-value pairs.
• We can read data in .csv format, in delimited format, from a database table, as image data (.jpg, .png), as audio data, etc.
• The only requirement for reading all these types of data is that we have to create a particular input format for that data with these input readers.
• The input reader contains the complete logic about the data it is reading. Suppose we want to read an image; then we have to specify the logic in the input reader so that it can read the image data and finally generate key-value pairs for that image data.
• If we are reading image data, we can generate a key-value pair for each pixel, where the key is the location of the pixel and the value is its colour value (0-255) for a coloured image.
• This list of key-value pairs is fed to the Map phase. The Mapper works on each of these key-value pairs and generates intermediate key-value pairs, which are fed to the Reducer after shuffling and sorting; the final output produced by the Reducer is written to HDFS. This is how a simple Map-Reduce job works.
• Now let's see how we can use different languages like Python, C++ and Ruby with Hadoop for execution.
• We can run such an arbitrary language by running it as a separate process. For that, we create our external mapper and run it as an external, separate process.
• These external map processes are not part of the basic MapReduce flow. The external mapper takes input from STDIN and produces output on STDOUT.
• As the key-value pairs are passed to the internal mapper, the internal mapper process sends them to the external mapper, where we have written our code in some other language such as Python, with the help of STDIN.
• The external mapper processes these key-value pairs, generates intermediate key-value pairs with the help of STDOUT, and sends them back to the internal mapper.
• The Reducer works the same way. Once the intermediate key-value pairs have been processed through the shuffle and sort phase, they are fed to the internal reducer, which sends these pairs to the external reducer process (running separately) with the help of STDIN, gathers the output generated by the external reducer with the help of STDOUT, and finally stores the output in HDFS.
• This is how Hadoop Streaming works on Hadoop; it is available in Hadoop by default. We are simply utilizing this feature by providing our own external mapper and reducer. This shows how powerful a feature Hadoop Streaming is: anyone can write code in the language of their choice.
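To make the STDIN/STDOUT contract concrete, here is a minimal sketch of an external word-count mapper. Streaming is normally used with scripting languages such as Python, but any executable that reads lines from STDIN and writes tab-separated key-value pairs to STDOUT will do; for consistency with the rest of these notes the sketch is written in Java, and the class name StreamWordCountMapper is just an illustrative choice, not something defined by Hadoop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// A stand-alone executable mapper for Hadoop Streaming.
// It reads input lines from STDIN and emits "word<TAB>1" pairs on STDOUT.
public class StreamWordCountMapper {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // Key and value are separated by a tab, one pair per line.
                    System.out.println(word + "\t1");
                }
            }
        }
    }
}
```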
Java Interfaces to HDFS:
• A Java interface is used for accessing Hadoop's file system.
• In order to interact with Hadoop's file system programmatically, Hadoop provides multiple Java classes.
• The package org.apache.hadoop.fs contains classes useful for manipulating files in Hadoop's file system.
• An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
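A minimal sketch of the org.apache.hadoop.fs API is shown below: it writes a small file to HDFS and then lists its parent directory. The NameNode address hdfs://localhost:9000 and the path /user/demo/hello.txt are placeholder assumptions, not values prescribed by these notes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsJavaApiDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; usually picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Create a file in HDFS and write a line of text to it.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");
        }

        // List the directory that contains the new file.
        for (FileStatus status : fs.listStatus(file.getParent())) {
            System.out.println(status.getPath() + "  replication=" + status.getReplication());
        }
        fs.close();
    }
}
```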
Read operation in HDFS:
A data read request is served by HDFS, the NameNode and the DataNodes. Let's call the reader a client.
1. The client initiates a read request by calling the open() method of a FileSystem object; it is an object of type DistributedFileSystem.
2. This object connects to the NameNode and gets metadata information such as the locations of the blocks of the file. Note that these addresses are of the first few blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of each block are returned.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream contains a DFSInputStream, which takes care of interactions with the DataNodes and the NameNode. The client then invokes the read() method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
5. Data is read in the form of streams, with the client invoking read() repeatedly. This read() process continues until it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.
7. Once the client has finished reading, it calls the close() method.
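The read path described above maps directly onto a short piece of client code: open() returns an FSDataInputStream, repeated read() calls stream the blocks from the DataNodes, and close() releases the connections. A minimal sketch follows; the input path /user/demo/hello.txt is a placeholder assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // DistributedFileSystem on an HDFS cluster

        Path file = new Path("/user/demo/hello.txt");
        byte[] buffer = new byte[4096];

        // Steps 1-4: open() contacts the NameNode and returns an FSDataInputStream.
        try (FSDataInputStream in = fs.open(file)) {
            int bytesRead;
            // Steps 5-6: read() streams data block by block from the DataNodes.
            while ((bytesRead = in.read(buffer)) > 0) {
                System.out.write(buffer, 0, bytesRead);
            }
        } // Step 7: close() is called automatically by try-with-resources.
        fs.close();
    }
}
```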
MapReduce:
• MapReduce is a data processing tool which is used to process data in parallel in a distributed form.
• MapReduce is a paradigm which has two phases: the mapper phase and the reducer phase.
• In the Mapper, the input is given in the form of a key-value pair.
• The output of the Mapper is fed to the Reducer as input.
• The Reducer runs only after the Mapper is over.
• The Reducer also takes input in key-value format, and the output of the Reducer is the final output.
MapReduce can be implemented through its components:
1. Architectural components
   Job Tracker
   Task Trackers
2. Functional components
   Mapper (Map)
   Combiner (Shuffler)
   Reducer (Reduce)
Architectural Components:
• The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:
1. A Job Tracker: acts as the master (responsible for complete execution of the submitted job).
2. Multiple Task Trackers: act as slaves, each of them performing part of the job.
• For every job submitted for execution in the system, there is one Job Tracker, which resides on the Name Node, and multiple Task Trackers, which reside on the Data Nodes.
• A job is divided into multiple tasks which are then run on multiple Data Nodes in the cluster.
• It is the responsibility of the Job Tracker to coordinate this activity by scheduling tasks to run on different Data Nodes.
• Execution of an individual task is then looked after by the Task Tracker, which resides on every Data Node executing part of the job.
• The Task Tracker's responsibility is to send progress reports to the Job Tracker.
• In addition, the Task Tracker periodically sends a 'heartbeat' signal to the Job Tracker to notify it of the current state of the system.
• Thus the Job Tracker keeps track of the overall progress of each job. In the event of a task failure, the Job Tracker can reschedule it on a different Task Tracker.
Functional Components:
A job (a complete unit of work) is submitted to the master; Hadoop divides the job into two phases, the map phase and the reduce phase. In between Map and Reduce there is a small phase called Shuffle & Sort.
• Map tasks
• Reduce tasks
Map Phase:
• This is the very first phase in the execution of a map-reduce program.
• In this phase, the data in each split is passed to a mapping function to produce output values.
• The map takes a key/value pair as input.
• The key is a reference to the input; the value is the data set on which to operate.
• The map function applies the business logic to every value in the input.
• The output produced by the map is called the intermediate output.
• The output of the map is stored on local disk, from where it is shuffled to the reduce nodes.
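To make the map phase concrete, below is a minimal mapper sketch for the classic word-count example, using the org.apache.hadoop.mapreduce API. Word count is only an illustrative choice: for every input line the mapper emits a (word, 1) pair as intermediate output.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line of text.
// Output key: a word; output value: the count 1 (the intermediate output).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}
```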
Reduce Phase:
• In MapReduce, Reduce takes intermediate key/value pairs as input and processes the output of the mapper.
• The key/value pairs provided to Reduce are sorted by key.
• Usually, in the reducer, we do aggregation or summation-style computation.
• The framework supplies the values for a given key to the user-defined Reduce function.
• Reduce produces the final output as a list of key/value pairs.
• This final output is stored in HDFS and replicated as usual.
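The matching reducer sketch for the word-count example simply sums the counts for each word; keys arrive sorted and grouped, and the result is written to HDFS as the final output. Again, the class name is illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: a word and the list of 1s emitted by the mappers for that word.
// Output: the word and its total count (the final output written to HDFS).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));  // aggregate per key
    }
}
```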
Shuffling
• This phase consumes the output of the mapping phase. Its task is to consolidate the relevant records from the mapping phase output.
Job Scheduling in Hadoop
• In Hadoop, we can receive multiple jobs from different clients to perform.
• The Map-Reduce framework is used to perform multiple tasks in parallel in a typical Hadoop cluster to process large datasets at a fast rate.
• This Map-Reduce framework is responsible for scheduling and monitoring the tasks given by different clients in a Hadoop cluster.
• However, this method of scheduling jobs was used prior to Hadoop 2.
• In Hadoop 2, we have YARN (Yet Another Resource Negotiator). In YARN we have separate daemons for performing job scheduling, monitoring, and resource management: the Application Master, the Node Manager, and the Resource Manager, respectively.
• Here, the Resource Manager is the master daemon responsible for tracking and providing the resources required by any application within the cluster, and the Node Manager is the slave daemon which monitors and keeps track of the resources used by an application and sends feedback to the Resource Manager.
• The Scheduler and the Applications Manager are the two major components of the Resource Manager. The Scheduler in YARN is totally dedicated to scheduling the jobs; it cannot track the status of an application. On the basis of required resources, the Scheduler schedules the jobs.
There are mainly 3 types of Schedulers in Hadoop:
1. FIFO (First In First Out) Scheduler
2. Capacity Scheduler
3. Fair Scheduler
These schedulers are essentially algorithms used to schedule tasks in a Hadoop cluster when requests are received from different clients.
• A job queue is nothing but the collection of the various tasks that we have received from our various clients. The tasks are available in the queue, and we need to schedule them on the basis of our requirements.
1. FIFO Scheduler
• As the name suggests, FIFO means First In First Out, so the task or application that comes first is served first.
• This is the default scheduler used in Hadoop.
• The tasks are placed in a queue and are performed in their submission order.
• In this method, once a job is scheduled, no intervention is allowed.
• So sometimes a high-priority process has to wait for a long time, since the priority of the task does not matter in this method.
Advantages:
• No need for configuration
• First come, first served
• Simple to execute
Disadvantages:
• Priority of a task does not matter, so high-priority jobs need to wait
• Not suitable for a shared cluster
2. Capacity Scheduler
• In the Capacity Scheduler we have multiple job queues for scheduling our tasks.
• The Capacity Scheduler allows multiple occupants to share a large Hadoop cluster.
• In the Capacity Scheduler, for each job queue we provide some slots or cluster resources for performing job operations.
• Each job queue has its own slots to perform its tasks.
• If we have tasks to perform in only one queue, the tasks of that queue can also access the slots of other queues while those are free; when a new task enters some other queue, the jobs running in its borrowed slots are replaced so that the queue can run its own jobs in its own slots.
• The Capacity Scheduler also provides a level of abstraction to know which occupant is utilizing more cluster resources or slots, so that a single user or application does not take a disproportionate or unnecessary share of slots in the cluster.
• The Capacity Scheduler mainly contains 3 types of queues, root, parent, and leaf, which are used to represent the cluster, an organization or subgroup, and the point of application submission, respectively.
Advantages:
• Best for working with multiple clients or priority jobs in a Hadoop cluster
• Maximizes throughput in the Hadoop cluster
Disadvantages:
• More complex
• Not easy to configure for everyone
3. Fair Scheduler
• The Fair Scheduler is very similar to the Capacity Scheduler. The priority of the job is taken into consideration.
• With the help of the Fair Scheduler, YARN applications can share the resources of a large Hadoop cluster, and these resources are maintained dynamically, so there is no need to reserve capacity in advance.
• The resources are distributed in such a manner that all applications within a cluster get an equal amount of time.
• The Fair Scheduler makes scheduling decisions on the basis of memory; we can also configure it to work with CPU.
• As mentioned, it is similar to the Capacity Scheduler, but the major thing to notice is that in the Fair Scheduler, whenever a high-priority job arises in the same queue, the task is processed in parallel by taking over some portion of the already dedicated slots.
Advantages:
• Resources assigned to each application depend upon its priority.
• It can limit the number of concurrently running tasks in a particular pool or queue.
Disadvantages:
• Configuration is required.
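Which of these schedulers is active is controlled by the yarn.resourcemanager.scheduler.class property, normally set in yarn-site.xml on the ResourceManager. Below is a minimal sketch of inspecting that property through the YARN Configuration API; the property name and scheduler class names are real, but reading them from a small standalone program like this is only for illustration.

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SchedulerCheck {
    public static void main(String[] args) {
        // Loads yarn-default.xml and yarn-site.xml from the classpath.
        YarnConfiguration conf = new YarnConfiguration();

        // The configured scheduler, e.g. the CapacityScheduler (the usual default in
        // recent Apache releases) or
        // org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.
        String scheduler = conf.get("yarn.resourcemanager.scheduler.class",
                "(not configured)");
        System.out.println("Active scheduler: " + scheduler);
    }
}
```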
Shuffling
• The process of transferring data from the mappers to the reducers is known as shuffling, i.e. the process by which the system performs the sort and transfers the map output to the reducer as input.
• The MapReduce shuffle phase is therefore necessary for the reducers; otherwise, they would not have any input (or input from every mapper).
• As shuffling can start even before the map phase has finished, it saves some time and completes the tasks in less time.
Sorting
• The MapReduce framework automatically sorts the keys generated by the mapper.
• Thus, before the reducer starts, all intermediate key-value pairs are sorted by key and not by value.
• It does not sort the values passed to each reducer; they can be in any order.
• Sorting in a MapReduce job helps the reducer easily distinguish when a new reduce task should start, which saves time for the reducer.
• The Reducer in MapReduce starts a new reduce task when the next key in the sorted input data is different from the previous one.
• Each reduce task takes key-value pairs as input and generates key-value pairs as output.
Developing a Map Reduce Application
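A minimal sketch of a driver that wires the word-count mapper and reducer sketches above into a runnable MapReduce application is shown below. The input and output paths are taken from the command line, and the class name WordCountJob is an illustrative choice.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(WordCountMapper.class);      // map phase
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation
        job.setReducerClass(WordCountReducer.class);    // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory

        // Submit the job to the cluster and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```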
