Big Data Analytics(BDA)

GTU #3170722

Unit-3

Hadoop
 Outline
• History of Hadoop
• Features of Hadoop
• Advantages and Disadvantages of Hadoop
• Hadoop Eco System
• Hadoop vs SQL
• Hadoop Components
• Use case of Hadoop
• Processing data with Hadoop
• YARN Components
• YARN Architecture
• YARN MapReduce Application
• Execution Flow
• YARN workflow
• Anatomy of MapReduce Program,
• Input Splits,
• Relation between Input Splits and HDFS
History of Hadoop
History of Hadoop
 Hadoop is an open-source software framework for
storing and processing large datasets ranging in
size from gigabytes to petabytes.
 Hadoop was developed at the Apache Software
Foundation.
 In 2008, Hadoop defeated the supercomputers and became the fastest system on the planet for sorting terabytes of data.
 There are basically two components in Hadoop:
1. Hadoop Distributed File System (HDFS):
 It allows you to store data of various formats across a cluster.
2. YARN:
 Used for resource management in Hadoop. It allows parallel processing over the data stored across HDFS.
Basics of Hadoop
 Hadoop is an open-source software framework for storing data and
running applications on clusters of commodity hardware.
 It provides massive storage for any kind of data, enormous processing
power and the ability to handle virtually limitless concurrent tasks or jobs.
 Unlike data residing in the local file system of a personal computer, in Hadoop, data resides in a distributed file system called the Hadoop Distributed File System (HDFS).
 The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) containing the data.
 This computational logic is nothing but a compiled version of a program written in a high-level language such as Java.
 Such a program processes data stored in Hadoop HDFS.
Features of Hadoop
Features of Hadoop
 Open source
 Hadoop is open-source, which means it is free to use.
 Highly scalable Cluster
 A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel. The number of these machines or nodes can be increased or decreased as per the enterprise's requirements. Traditional relational database management systems cannot be scaled out in this way to handle very large amounts of data.
 Fault Tolerance is Available
 By default, Hadoop makes 3 copies of each file block and stores them on different nodes. This replication factor is configurable and can be changed by editing the replication property in the hdfs-site.xml file (a small code sketch follows this slide).
 High Availability is Provided
 A highly available Hadoop cluster runs two or more NameNodes, i.e. an Active NameNode and a Passive (Standby) NameNode.
 Cost-Effective
 Hadoop is open-source and uses cost-effective commodity hardware, which provides a cost-efficient model, unlike traditional relational databases that require expensive hardware to deal with Big Data.
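As a rough illustration of the replication feature above (a hedged sketch, not from the slides; it assumes a reachable cluster and a hypothetical file path), the replication factor can be inspected and changed from Java with the standard FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same setting that hdfs-site.xml controls; applies to files created by this client.
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt"); // hypothetical path
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication factor: " + current);
        // Ask the NameNode to re-replicate this file to 2 copies.
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}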
Features of Hadoop
 Hadoop Provides Flexibility
 Hadoop is designed in such a way that it can deal with any kind of dataset, like structured (MySQL data), semi-structured (XML, JSON) and unstructured (images and videos), very efficiently.
 Easy to Use
 Hadoop is easy to use since developers need not worry about any of the distributed processing work; it is managed by Hadoop itself.
 Hadoop uses Data Locality
 The concept of Data Locality is used to make Hadoop processing fast. In the data locality concept, the computation logic is moved near the data rather than moving the data to the computation logic. Since moving data on HDFS is the costliest operation, data locality minimizes the bandwidth utilization in the system.
 Provides Faster Data Processing
 Hadoop uses a distributed file system to manage its storage i.e. HDFS(Hadoop
Distributed File System).
 Support for Multiple Data Formats
 Hadoop supports multiple data formats like CSV, JSON, Avro, and many more, making it easy to work with a variety of data sources.
Features of Hadoop
 High Processing Speed
 Hadoop’s distributed processing model allows it to process large amounts of data at
high speeds. This is achieved by distributing data across multiple nodes and
processing it in parallel. As a result, Hadoop can process data much faster than
traditional database systems.
 Machine Learning Capabilities
 Hadoop offers machine learning capabilities through its ecosystem tools like Mahout,
which is a library for creating scalable machine learning applications. With these
tools, data analysts and developers can build machine learning models to analyze
and process large datasets.
 Integration with Other Tools
 Hadoop integrates with other popular tools like Apache Spark, Apache Flink, and
Apache Storm, making it easier to build data processing pipelines.
 Secure
 Hadoop provides built-in security features like authentication, authorization, and
encryption.
 Community Support
 Hadoop has a large community of users and developers who contribute to its continued development, documentation, and support.
Advantages and
Disadvantages of Hadoop
Advantage & Disadvantage of Hadoop
Advantages:
 Varied Data Sources
 Cost-effective
 Performance
 Fault-Tolerant
 Highly Available
 Low Network Traffic
 High Throughput
 Open Source
 Scalable
 Ease of use
 Compatibility
 Multiple Languages Supported
Disadvantages:
 Issue With Small Files
 Vulnerable By Nature
 Processing Overhead
 Supports Only Batch Processing
 Iterative Processing
 Security
Advantage & Disadvantage of Hadoop
Advantages:
 Cost-effective
 Flexibility
 Scalability
 Speed
 Fault-Tolerance
 High Throughput
 Low Network Traffic
 Open Source
 Ease of use
 Compatibility
 Multiple Languages Supported
Disadvantages:
 Issue With Small Files
 Vulnerability
 High Processing Overhead
 Supports Only Batch Processing
 Lack of Security
 Low Performance in Small-Data Environments
Why Hadoop Required?
Why Hadoop Required? - Traditional Restaurant Scenario
Why Hadoop Required? - Traditional Scenario
Why Hadoop Required? - Distributed Processing Scenario
Why Hadoop Required? - Distributed Processing Scenario Failure
Why Hadoop Required? - Solution to Restaurant Problem
Why Hadoop Required? - Hadoop in Restaurant Analogy
 Hadoop functions in a similar fashion as
Bob’s restaurant.
 As the food shelf is distributed in Bob’s
restaurant, similarly, in Hadoop, the
data is stored in a distributed fashion
with replications, to provide fault
tolerance.
 For parallel processing, the data is first processed by the slaves, where it is stored, to produce intermediate results, and then those intermediate results are merged by the master node to produce the final result.
Hadoop Eco System
Hadoop Ecosystem
 Hadoop is a framework that can process large data sets in the form of
clusters.
 As a framework, Hadoop is composed of multiple modules that are
compatible with a large technology ecosystem.
 The Hadoop ecosystem is a platform or suite that provides various
services to solve big data problems.
 It includes the Apache project and various commercial tools and solutions.
 Hadoop has four main elements, namely HDFS, MapReduce, YARN and
Hadoop Common.
 Most tools or solutions are used to supplement or support these core
elements.
 All of these tools work together to provide services such as data ingestion, analysis, storage, and maintenance.
Hadoop Ecosystem
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
Hadoop Ecosystem Distribution
HDFS
 Hadoop Distributed File System is the core component, or you can say,
the backbone of Hadoop Ecosystem.
 HDFS makes it possible to store different types of large data sets (i.e. structured, unstructured and semi-structured data).
 HDFS creates a level of abstraction over the resources, from where we can see the whole HDFS as a single unit.
 It helps us in storing our data across various nodes and maintaining the
log file about the stored data (metadata).
 HDFS has two main components: NAMENODE and DATANODE.
Yarn
 Consider YARN as the brain of your Hadoop Ecosystem.
 It performs all your processing activities by allocating resources and
scheduling tasks.
 It is a type of resource negotiator, as the name suggests, YARN is a
negotiator that helps manage all resources in the cluster.
 In short, it performs scheduling and resource allocation for the Hadoop system.
 It consists of three main components, namely:
1. Resource Manager: allocates resources for the applications running on the system.
2. Node Manager: handles the allocation of resources such as CPU, memory and bandwidth on each machine, and reports back to the resource manager.
3. Application Manager: acts as an interface between the resource manager and the node manager, and negotiates resources according to the requirements of the two.
Map Reduce
 MapReduce is the core component of processing in a Hadoop Ecosystem
as it provides the logic of processing.
 MapReduce is a software framework which helps in writing applications
that processes large data sets using distributed and parallel algorithms
inside Hadoop environment.
 So, by making use of distributed and parallel algorithms, MapReduce makes it possible to carry the processing logic to the data and helps to write applications which transform big data sets into manageable ones.
 MapReduce uses two functions, namely Map() and Reduce(), whose tasks are (a minimal code sketch follows this slide):
 The Map() function performs actions like filtering, grouping and sorting.
 The Reduce() function aggregates and summarizes the results produced by the Map() function.
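As a minimal sketch of what these two functions look like with the standard org.apache.hadoop.mapreduce API (a word-count example; the class names are our own, not from the slides):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): filters/splits each input line into (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // intermediate key-value pair
            }
        }
    }
}

// Reduce(): aggregates and summarizes the counts produced for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // e.g. (Hadoop, 3)
    }
}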
Pig
 Pig was developed by Yahoo. It uses the Pig Latin language, which is a query-based language similar to SQL.
 It is a platform for structuring data flows, and for processing and analyzing massive data sets.
 Pig is responsible for executing commands and processing all MapReduce activities in the background. After processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM.
 Pig helps simplify programming and optimization and is therefore an important part of the Hadoop ecosystem.
 Pig works as follows: first the load command loads the data; then we perform various functions on it such as grouping, filtering, joining and sorting.
 At last, either you can dump the data on the screen, or you can store the
result back in HDFS.
Hive
 With the help of SQL methodology and an SQL-like interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
HIVE + SQL = HQL
 It is highly scalable because it supports real-time processing and batch
processing.
 In addition, Hive supports all SQL data types, making query processing
easier.
 Similar to the query processing framework, HIVE also has two
components: JDBC driver and HIVE Command-Line.
 JDBC is used with the ODBC driver to set up connection and data storage
permissions whereas HIVE command line facilitates query processing.
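As a hedged sketch of the JDBC access mentioned above (host, port, database and table names are placeholders; it assumes HiveServer2 and the hive-jdbc driver on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (older driver versions need this explicitly).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hive-host:10000/default"; // hypothetical host/port
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) { // hypothetical table
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}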
Mahout
 Mahout allows machine learnability to a system or application.
 Machine learning, as its name suggests, can help systems develop
themselves based on certain patterns, user / environment interactions, or
algorithm-based fundamentals.
 It provides various libraries or functions, such as collaborative filtering,
clustering, and classification, which are all machine learning concepts.
 It allows us to invoke algorithms as per our need with the help of its own libraries.
Apache Spark
 Apache Spark is a framework for real time data analytics in a distributed
computing environment.
 Spark is written in Scala and was originally developed at the University of California, Berkeley.
 It executes in-memory computations to increase the speed of data processing over MapReduce.
 It is up to 100x faster than Hadoop MapReduce for large-scale data processing by exploiting in-memory computations and other optimizations. Therefore, it requires higher processing power than MapReduce.
 It is better for real-time processing, while Hadoop is designed to store
unstructured data and perform batch processing on it.
 When we combine the capabilities of Apache Spark with the low-cost
operations of Hadoop on basic hardware, we get the best results.
 So, many companies use Spark and Hadoop together to process and
analyze big data stored in HDFS.
Apache HBase
 HBase is an open source non-relational distributed database. In other
words, it is a NoSQL database.
 It supports all types of data, which is why it can handle anything in the
Hadoop ecosystem.
 It is based on Google's Big-Table, which is a distributed storage system
designed to handle large data sets.
 HBase is designed to run on HDFS and provide features similar to Big-
Table.
 It provides us with a fault-tolerant way of storing sparse data, which is
common in most big data use cases.
 HBase is written in Java, and HBase applications can be written in REST,
Avro, and Thrift API.
 For example, you have billions of customer emails and you need to find out the number of customers who have used the word 'complaint' in their emails (a client-side sketch of this follows the slide).
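To ground that example, here is a hedged sketch using the HBase Java client; the table name, column family and qualifier are assumptions, and the substring check is done client-side for simplicity (a server-side filter would scale better):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ComplaintCounter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table emails = conn.getTable(TableName.valueOf("customer_emails"))) { // assumed table
            byte[] cf = Bytes.toBytes("msg");   // assumed column family
            byte[] col = Bytes.toBytes("body"); // assumed qualifier
            long count = 0;
            try (ResultScanner scanner = emails.getScanner(new Scan().addColumn(cf, col))) {
                for (Result r : scanner) {
                    String body = Bytes.toString(r.getValue(cf, col));
                    if (body != null && body.toLowerCase().contains("complaint")) {
                        count++; // one email mentioning "complaint"
                    }
                }
            }
            System.out.println("Emails containing 'complaint': " + count);
        }
    }
}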
Zookeeper
 There was a huge issue of management of coordination
and synchronization among the resources or the
components of Hadoop which resulted in inconsistency.
 Before Zookeeper, it was very difficult and time consuming
to coordinate between different services in Hadoop
Ecosystem.
 The services earlier had many problems with interactions
like common configuration while synchronizing data.
 Even if the services are configured, changes in the
configurations of the services make it complex and difficult
to handle.
 The grouping and naming was also a time-consuming
factor.
 Zookeeper overcame all the problems by performing
synchronization, inter-component based communication,
grouping, and maintenance.
Oozie
 Oozie simply performs the task of a scheduler, thus scheduling jobs and
binding them together as a single unit.
 There are two kinds of Oozie jobs:
1. Oozie workflow jobs
2. Oozie coordinator jobs
 Oozie workflow jobs are those that need to be executed in a sequentially ordered manner.
 Oozie Coordinator jobs are those that are triggered when some data or
external stimulus is given to it.
Hadoop vs SQL
Hadoop vs SQL
Feature | Hadoop | SQL
Technology | Modern | Traditional
Volume | Usually in petabytes | Usually in gigabytes
Operations | Storage, processing, retrieval and pattern extraction from data | Storage, processing, retrieval and pattern mining of data
Fault Tolerance | Hadoop is highly fault tolerant | SQL has good fault tolerance
Storage | Stores data as key-value pairs, tables, hash maps etc. in distributed systems | Stores structured data in tabular format with a fixed schema in the cloud
Scaling | Linear | Non-linear
Providers | Cloudera, Hortonworks, AWS etc. provide Hadoop systems | Well-known industry leaders in SQL systems are Microsoft, SAP, Oracle etc.
Data Access | Batch-oriented data access | Interactive and batch-oriented data access
Cost | Open source; systems can be scaled out cost-effectively | Licensed and costly to buy a SQL server; moreover, if the system runs out of storage, additional charges also arise
Hadoop vs SQL
Feature | Hadoop | SQL
Optimization | Stores data in HDFS and processes it through MapReduce with huge optimization techniques | Does not have any advanced optimization techniques
Structure | Dynamic schema, capable of storing and processing log data, real-time data, images, videos, sensor data etc. (both structured and unstructured) | Static schema, capable of storing data (fixed schema) in tabular format only (structured)
Data Update | Write data once, read data multiple times | Read and write data multiple times
Integrity | Low | High
Interaction | Hadoop uses JDBC (Java Database Connectivity) to communicate with SQL systems to send and receive data | SQL systems can read and write data to Hadoop systems
Hardware | Uses commodity hardware | Uses proprietary hardware
Training | Learning Hadoop, for entry-level as well as seasoned professionals, is moderately hard | Learning SQL is easy even for entry-level professionals
Hadoop Distributed File
System
Hadoop Distributed File System
 Hadoop File System was developed using distributed file system design.
 It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware.
 HDFS holds very large amount of data and provides easier access. To store
such huge data, the files are stored across multiple machines.
 These files are stored in redundant fashion to rescue the system from
possible data losses in case of failure.
 HDFS also makes applications available for parallel processing.
 Features of HDFS
 It is suitable for the distributed storage and processing.
 Hadoop provides a command interface to interact with HDFS.
 The built-in servers of namenode and datanode help users to easily check the status
of cluster.
 Streaming access to file system data.
 HDFS provides file permissions and authentication.
HDFS Master-Slave Architecture
 HDFS follows the master-slave architecture and it has the following
elements.
Hadoop Core Components - HDFS
 Name Node: NameNode represents every file and directory that is used in the namespace.
 Data Node: DataNode helps you to manage the state of an HDFS node and allows you to interact with the blocks.
 Secondary Node: It constantly reads all the file systems and metadata from the RAM of the NameNode and writes them to the hard disk.
Name Node
 The namenode is the commodity hardware that contains the GNU/Linux
operating system and the namenode software.
 It is a software that can be run on commodity hardware.
 The system having the namenode acts as the master server and it does
the following tasks −
 It also executes file system operations such as renaming, closing, and opening files
and directories.
 It records each and every change that takes place to the file system metadata.
 It regularly receives a Heartbeat and a block report from all the DataNodes in the
cluster to ensure that the DataNodes are alive
 It keeps a record of all the blocks in the HDFS and DataNode in which they are
stored.
Data Node
 It is the slave daemon/process which runs on each slave machine.
 The actual data is stored on DataNodes.
 It is responsible for serving read and write requests from the clients.
 It is also responsible for creating blocks, deleting
blocks and replicating the same based on the decisions taken by the
NameNode.
 It sends heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.
Secondary Node
 The Secondary NameNode works concurrently
with the primary NameNode as a helper
daemon/process.
 It is one which constantly reads all the file
systems and metadata from the RAM of the
NameNode and writes it into the hard disk or
the file system.
 It is responsible for combining the EditLogs with the FsImage from the NameNode.
 It downloads the EditLogs from the
NameNode at regular intervals and applies to
FsImage.
 The new FsImage is copied back to the
NameNode, which is used whenever the
NameNode is started the next time.
Block
 A block is the minimum amount of data that HDFS can read or write.
 HDFS blocks are 128 MB by default, and this is configurable.
 Files in HDFS are broken into block-sized chunks, which are stored as independent units.
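A small sketch (assuming the standard FileSystem API and a hypothetical file path) that reports the block size recorded for a file and the resulting number of blocks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/sample.txt"); // hypothetical file
        FileStatus status = fs.getFileStatus(file);
        // Block size is recorded per file; 128 MB (134217728 bytes) is the usual default.
        System.out.println("Block size (bytes): " + status.getBlockSize());
        long blocks = (long) Math.ceil((double) status.getLen() / status.getBlockSize());
        System.out.println("Number of blocks  : " + blocks);
        fs.close();
    }
}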
HDFS Architecture
Hadoop Cluster
 To Improve the Network
Performance
 The communication between nodes
residing on different racks is directed
via switch.
 In general, you will find greater network bandwidth between machines in the same rack than between machines residing in different racks.
 Rack awareness helps to reduce write traffic between different racks, thus providing better write performance.
 Also, you will be gaining increased
read performance because you are
using the bandwidth of multiple racks.
 To Prevent Loss of Data
 We don't have to worry about the data even if an entire rack fails, because replicas of each block are placed on different racks (rack awareness).
HDFS Write Architecture
 An HDFS client wants to write a file (call it example.txt) of size 248 MB.
 Assume that the system block size is configured for 128 MB (the default).
 So, the client will divide the file into 2 blocks – one of 128 MB (Block A) and the other of 120 MB (Block B).
HDFS Write Architecture – Cont.
 Writing Process Steps:
 At first, the HDFS client will reach out to the NameNode for a Write Request against
the two blocks, say, Block A & Block B.
 The NameNode will then grant the client the write permission and will provide the IP
addresses of the DataNodes where the file blocks will be copied eventually.
 The selection of IP addresses of DataNodes is purely randomized based on
availability, replication factor and rack awareness.
 The replication factor is set to default i.e. 3. Therefore, for each block the NameNode
will be providing the client a list of (3) IP addresses of DataNodes. The list will be
unique for each block.
 For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}
 For Block B, list B = {IP of DataNode 3, IP of DataNode 7, IP of DataNode 9}
 Each block will be copied in three different DataNodes to maintain the replication
factor consistent throughout the cluster.
 Now the whole data copy process will happen in three stages:
 Set up of Pipeline
 Data streaming and replication
 Shutdown of Pipeline (Acknowledgement stage)
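From the client's point of view, all three stages are hidden behind an ordinary output stream; a hedged sketch of writing a file to HDFS (the path is hypothetical):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path("/user/demo/output.txt"); // hypothetical path
        // create() asks the NameNode for block/DataNode placement; the returned stream
        // pushes packets through the DataNode pipeline and waits for acknowledgements.
        try (FSDataOutputStream out = fs.create(target, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}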
HDFS – Write Pipeline
Data Streaming and Replication
Shutdown of Pipeline or Acknowledgement stage
Processing data with Hadoop
Processing data with Hadoop using Map Reduce
 MapReduce and HDFS are the two major components of Hadoop which make it so powerful and efficient to use.
 MapReduce is a programming model used for efficient processing in parallel over large data-sets in a distributed manner.
 The data is first split and then combined to produce the final result.
 The libraries for MapReduce have been written in many programming languages, with various optimizations.
 The purpose of MapReduce in Hadoop is to map each job and then reduce it to equivalent tasks, so as to lessen the overhead on the cluster network and reduce the processing power required.
 The MapReduce task is mainly divided into two phases Map Phase and
Reduce Phase.
Processing data with Hadoop using Map Reduce
Processing data with Hadoop using Map Reduce
Components of MapReduce Architecture:
 Client: The MapReduce client is the one who brings the Job to the
MapReduce for processing. There can be multiple clients available that
continuously send jobs for processing to the Hadoop MapReduce Manager.
 Job: The MapReduce Job is the actual work that the client wanted to do
which is comprised of so many smaller tasks that the client wants to
process or execute.
 Hadoop MapReduce Master: It divides the particular job into
subsequent job-parts.
 Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.
 Input Data: The data set that is fed to the MapReduce for processing.
 Output Data: The final result is obtained after the processing.
Processing data with Hadoop using Map Reduce
 The MapReduce task is mainly divided into 2 phases i.e. Map phase and
Reduce phase.
 Map: As the name suggests its main use is to map the input data in key-
value pairs. The input to the map may be a key-value pair where the key can
be the id of some kind of address and value is the actual value that it
keeps. The Map() function will be executed in its memory repository on
each of these input key-value pairs and generates the intermediate key-
value pair which works as input for the Reducer or Reduce() function.
 Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-value pairs as per the reducer algorithm written by the developer.
 How Job tracker and the task tracker deal with MapReduce:
 Job Tracker: The work of Job tracker is to manage all the resources and all the
jobs across the cluster and also to schedule each map on the Task Tracker
running on the same data node since there can be hundreds of data nodes
available in the cluster.
Processing data with Hadoop using Map Reduce
Example of Map Reduce Algorithm
YARN Components & Architecture
YARN
Limitations of the Map Reduce Model
 Combining/Joining multiple datasets
 Possible but not elegant
 May require multiple jobs
 Custom composable workflow
 Requires multiple Jobs
 Plumbing between Jobs is manual
 Sort-Merge between MapReduce jobs is mandatory
 Fault tolerance mechanisms are rigid
 Need for alternate programming models
 One of the motivations for YARN
YARN Components
 Yarn Components
 Client
 Resource Manager
 Scheduler
 Application Manager
 Node Manager
 Application Master
 Container
 Apache Yarn – “Yet Another Resource Negotiator” is the resource
management layer of Hadoop.
 The Yarn was introduced in Hadoop 2.x. Yarn allows different data
processing engines like graph processing, interactive processing, stream
processing as well as batch processing to run and process data stored in HDFS
(Hadoop Distributed File System).
 Apart from resource management, Yarn also does job Scheduling.
 Yarn extends the power of Hadoop to other evolving technologies, so they can take advantage of HDFS (the most reliable and popular storage system on the planet) and the economics of the Hadoop cluster.
YARN Architecture
 Hadoop 1.0 Architecture
 Hadoop 2.0 Architecture
YARN Architecture
 Client:
 It submits map-reduce jobs.
 Resource Manager (RM):
 It is the master daemon of YARN.
 RM manages the global assignments of resources (CPU and memory) among all the
applications.
 Whenever it receives a processing request, it forwards it to the corresponding node
manager and allocates resources for the completion of the request accordingly.
 It has two major components:
 Scheduler:
 The scheduler is responsible for allocating the resources to the running applications.
 It is a pure scheduler, which means it does not perform other tasks such as monitoring or tracking, and it does not guarantee a restart if a task fails.
 Application Manager:
 It manages running Application Masters in the cluster, i.e. it is responsible for starting application masters and for monitoring and restarting them on different nodes in case of failure.
YARN Architecture
 Node Manager (NM):
 It is the slave daemon of Yarn.
 Node Manager (NM) is responsible for containers, monitoring their resource usage and reporting the same to the Resource Manager (RM).
 It manages the user processes on that machine.
 The Yarn Node Manager also tracks the health of the node on which it is running.
 It monitors resource usage, performs log management and also kills a container based on directions from the resource manager.
 It is also responsible for creating the container process and starting it at the request of the Application Master.
 Shuffle is a typical auxiliary service provided by the NMs for MapReduce applications on YARN.
 Application Master (AM):
 One application master runs per application.
 It negotiates resources from the resource manager and works with the node
manager.
 It Manages the application life cycle.
 The application master requests the container from the node manager by sending a Container Launch Context (CLC), which includes everything the application needs in order to run.
YARN Architecture
 Container:
 It is a collection of physical resources such as RAM, CPU cores and disk on a single
node.
 The containers are invoked by Container Launch Context(CLC) which is a record
that contains information such as environment variables, security tokens,
dependencies etc.
 Application workflow:
1. Client submits an application.
2. The Resource Manager allocates a container to start the Application Manager.
3. The Application Manager registers itself with the Resource Manager.
4. The Application Manager negotiates containers from the Resource Manager.
5. The Application Manager notifies the Node Manager to launch containers.
6. Application code is executed in the container.
7. Client contacts the Resource Manager/Application Manager to monitor the application's status.
8. Once the processing is complete, the Application Manager un-registers with the Resource Manager.
Current Map Reduce vs. YARN architecture
Hadoop MapReduce:
 Job Tracker (Master): resource management; job lifecycle management; scheduling, progress monitoring, fault tolerance
 Task Tracker (per node): launch tasks; report status to the Job Tracker
Hadoop YARN:
 Resource Manager (Master): resource management; scheduling
 Application Master (per app): job lifecycle management
 Node Manager (per node): launch containers; monitor resource usage; report to the RM
 MapReduce itself is an application on YARN
YARN Features
 YARN gained popularity because of the following features-
 Scalability:
 The scheduler in Resource manager of YARN architecture allows Hadoop to extend
and manage thousands of nodes and clusters.
 Compatibility:
 YARN supports the existing map-reduce applications without disruptions thus making
it compatible with Hadoop 1.0 as well.
 Cluster Utilization:
 YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
 Multi-tenancy:
 It allows multiple engine access thus giving organizations a benefit of multi-tenancy.
YARN MapReduce Application
YARN MapReduce Application
 MapReduce is a programming model for building applications which can
process big data in parallel on multiple nodes.
 It provides analytical abilities for analysis of large amount of complex
data.
 The traditional model is not suitable for processing large amounts of data, which cannot be accommodated by standard database servers.
 Google solved this problem using the MapReduce algorithm.
 MapReduce is a distributed data processing algorithm, introduced by Google.
 It is influenced by the functional programming model. In a cluster environment, the MapReduce algorithm is used to process large volumes of data efficiently, reliably and in parallel.
 It uses a divide-and-conquer approach to process large volumes of data.
 It divides the input task into manageable sub-tasks that execute in parallel.
MapReduce Architecture
Example:
Input:
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad

Output of MapReduce task:
bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1
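A hedged sketch of a driver that would produce output like this, reusing the hypothetical WordCountMapper/WordCountReducer sketched earlier (input and output paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional local aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        // waitForCompletion() submits the job and polls its progress, as described
        // later in the Anatomy of MapReduce Program section.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}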
How MapReduce Works?
 MapReduce divides a task into small parts and assigns them to many
computers.
 The results are collected at one place and integrated to form the result
dataset.
 The MapReduce algorithm contains two important tasks:
1. Map - Splits & Mapping
2. Reduce - Shuffling, Reducing
 The Map task takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key-value pairs).
 The Reduce task takes the output from the Map as an input and combines
those data tuples (key-value pairs) into a smaller set of tuples.
 The reduce task is always performed after the map job.
Map(Splits & Mapping) & Reduce(Shuffling, Reducing)
How MapReduce Works? – Cont.
 The complete execution
process (execution of Map and
Reduce tasks, both) is
controlled by two types of
entities called:
1. Job Tracker: Acts like
a master (responsible for
complete execution of
submitted job)
2. Multiple Task Trackers: Acts
like slaves, each of them
performing the job.
 For every job submitted for execution in the system, there is one Jobtracker that resides on the Namenode, and there are multiple tasktrackers which reside on the Datanodes.
How MapReduce Works? – Cont.
 A job is divided into multiple tasks which are then run onto multiple data
nodes in a cluster.
 It is the responsibility of job tracker to coordinate the activity by
scheduling tasks to run on different data nodes.
 Execution of an individual task is then looked after by the task tracker, which resides on every data node executing part of the job.
 The task tracker's responsibility is to send the progress report to the job tracker.
 In addition, the task tracker periodically sends a 'heartbeat' signal to the Jobtracker so as to notify it of the current state of the system.
 Thus job tracker keeps track of the overall progress of each job. In the
event of task failure, the job tracker can reschedule it on a different task
tracker.
MapReduce Algorithm
MapReduce Algorithm – Cont.
 Input Phase
 We have a Record Reader that translates each record in an input file and sends the
parsed data to the mapper in the form of key-value pairs.
 Map
 It is a user-defined function, which takes a series of key-value pairs and processes
each one of them to generate zero or more key-value pairs.
 Intermediate Keys
 The key-value pairs generated by the mapper are known as intermediate keys.
 Combiner
 A combiner is a type of local Reducer that groups similar data from the map phase
into identifiable sets.
 It takes the intermediate keys from the mapper as input and applies a user-defined
code to aggregate the values in a small scope of one mapper.
 It is not a part of the main MapReduce algorithm; it is optional.
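As a small illustration (reusing the hypothetical WordCountReducer from the earlier sketch), a combiner is simply registered in the driver; it is safe for word count because partial sums can be combined in any order:

// Optional map-side aggregation; the reducer class doubles as the combiner here.
job.setCombinerClass(WordCountReducer.class);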
MapReduce Algorithm – Cont.
 Shuffle and Sort
 The Reducer task starts with the Shuffle and Sort step.
 It downloads the grouped key-value pairs onto the local machine, where the Reducer
is running.
 The individual key-value pairs are sorted by key into a larger data list.
 The data list groups the equivalent keys together so that their values can be iterated
easily in the Reducer task.
 Reducer
 The Reducer takes the grouped key-value paired data as input and runs a Reducer
function on each one of them.
 Here, the data can be aggregated, filtered, and combined in a number of ways, and it
requires a wide range of processing.
 Once the execution is over, it gives zero or more key-value pairs to the final step.
 Output Phase
 In the output phase, we have an output formatter that translates the final key-value
pairs from the Reducer function and writes them onto a file using a record writer.
MapReduce Feature
 Scalability
 Flexibility
 Security & Authentication
 Cost Effective Solution
 Fast
Anatomy of MapReduce Program
Anatomy of MapReduce Program
 There are five independent entities:
 The client, which submits the MapReduce job.
 The YARN resource manager, which coordinates the allocation of
compute resources on the cluster.
 The YARN node managers, which launch and monitor the compute
containers on machines in the cluster.
 The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
 The distributed file system, which is used for sharing job files between
the other entities.
Anatomy of MapReduce Program – Cont.
 Job Submission :
 The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it.
 Having submitted the job, waitForCompletion() polls the job's progress once per second and reports the progress to the console if it has changed since the last report.
 When the job completes successfully, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.
 The job submission process implemented by JobSubmitter does the following:
 Asks the resource manager for a new application ID, used for the MapReduce job ID.
 Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
 Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
 Copies the resources needed to run the job, including the job JAR file, the configuration, and the computed input splits, to the shared filesystem in a directory named after the job ID.
Anatomy of MapReduce Program – Cont.
 Job Initialization :
 When the resource manager receives a call to its submitApplication() method, it
hands off the request to the YARN scheduler.
 The scheduler allocates a container, and the resource manager then launches the
application master’s process there, under the node manager’s management.
 The application master for MapReduce jobs is a Java application whose main class is
MRAppMaster.
 It initializes the job by creating a number of bookkeeping objects to keep track of the
job’s progress, as it will receive progress and completion reports from the tasks.
 It retrieves the input splits computed in the client from the shared filesystem.
 It then creates a map task object for each split, as well as a number of reduce task objects determined by the mapreduce.job.reduces property (set by the setNumReduceTasks() method on Job), as in the snippet below.
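For reference, a hedged snippet of how a driver (such as the hypothetical WordCountDriver above) fixes the number of reduce tasks; only the reduce count is set explicitly, since one map task is created per input split:

// Either use the Job API directly...
job.setNumReduceTasks(4);
// ...or set the equivalent configuration property.
job.getConfiguration().setInt("mapreduce.job.reduces", 4);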
 Task Assignment:
 If the job does not qualify for running as an uber task, then the application master requests containers for all the map and reduce tasks in the job from the resource manager.
 Requests for map tasks are made first, and with a higher priority than those for reduce tasks, since all the map tasks must complete before the sort phase of the reduce can start.
Anatomy of MapReduce Program – Cont.
 Task Execution:
 Once a task has been assigned resources for a container on a particular node by the
resource manager’s scheduler, the application master starts the container by
contacting the node manager.
 The task is executed by a Java application whose main class is YarnChild. Before it
can run the task, it localizes the resources that the task needs, including the job
configuration and JAR file, and any files from the distributed cache.
 Finally, it runs the map or reduce task.

 Streaming:
 Streaming runs special map and reduce tasks for the purpose of launching the user
supplied executable and communicating with it.
 The Streaming task communicates with the process (which may be written in any
language) using standard input and output streams.
 During execution of the task, the Java process passes input key-value pairs to the external process, which runs them through the user-defined map or reduce function and passes the output key-value pairs back to the Java process.
 From the node manager’s point of view, it is as if the child process ran the map or
reduce code itself.
Anatomy of MapReduce Program – Cont.
 Progress and status updates :
 MapReduce jobs are long running batch jobs, taking anything from tens of seconds
to hours to run.
 A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g. running, successfully completed, failed), the progress of maps and reduces, the values of the job's counters, and a status message or description (which may be set by user code).
 When a task is running, it keeps track of its progress (i.e. the proportion of the task completed).
 For map tasks, this is the proportion of the input that has been processed.
 For reduce tasks, it’s a little more complex, but the system can still estimate the
proportion of the reduce input processed.
 It does this by dividing the total progress into three parts, corresponding to the three
phases of the shuffle.
 As the map or reduce task runs, the child process communicates with its parent
application master through the umbilical interface.
 The task reports its progress and status (including counters) back to its application
master, which has an aggregate view of the job, every three seconds over the
umbilical interface.
Anatomy of MapReduce Program – Cont.
How status updates are propagated through the MapReduce system
Anatomy of MapReduce Program – Cont.
 Job Completion:
 When the application master receives a notification that the last task for a job is
complete, it changes the status for the job to Successful.
 Then, when the Job polls for status, it learns that the job has completed successfully,
so it prints a message to tell the user and then returns from
the waitForCompletion().
 Finally, on job completion, the application master and the task containers clean up their working state, and the OutputCommitter's commitJob() method is called.
 Job information is archived by the job history server to enable later interrogation by
users if desired.
Input Splits
 When a MapReduce job is run to process input data, one of the things the Hadoop framework does is divide the input data into smaller chunks; these chunks are referred to as input splits in Hadoop.
 For each input split Hadoop creates one map task to process records in that input split.
That is how parallelism is achieved in Hadoop framework. For example if a MapReduce
job calculates that input data is divided into 8 input splits, then 8 mappers will be created
to process those input splits.
 Example (a small sketch of this arithmetic follows):
 There are three files of size 128 KB, 129 MB and 255 MB.
 Input splits depend on the block size of the cluster. Let's assume the block size of the cluster is 64 MB. Depending upon the block size, the files are split accordingly.
 1. 128 KB file size: 128 KB / 64 MB = 1 block = 1 input split
 2. 129 MB file size: 129 MB / 64 MB = 3 blocks = 3 input splits
 3. 255 MB file size: 255 MB / 64 MB = 4 blocks = 4 input splits
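A simplified sketch of this split-size arithmetic (it mirrors the usual FileInputFormat-style calculation, not the exact Hadoop source); with a 64 MB block size it reproduces the 1/3/4 splits above:

public class SplitMath {
    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB, as assumed in the example
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        long[] fileSizes = {128L * 1024, 129L * 1024 * 1024, 255L * 1024 * 1024};
        for (long size : fileSizes) {
            long splits = (long) Math.ceil((double) size / splitSize);
            System.out.println(size + " bytes -> " + splits + " input split(s)");
        }
    }
}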
Relation between Input Splits and HDFS Blocks
 HDFS Block-
 Block is a continuous location on the hard drive where data is stored.
 In general, File System stores data as a collection of blocks.
 In the same way, HDFS stores each file as blocks.
 The Hadoop application is responsible for distributing the data block across multiple nodes.

 Input Split in Hadoop-


 The data to be processed by an individual Mapper is represented by Input Split.
 The split is divided into records and each record (which is a key-value pair) is processed by the map.
 The number of map tasks is equal to the number of Input Splits.
 Initially, the data for MapReduce task is stored in input files and input files typically reside in HDFS.
 Input Format is used to define how these input files are split and read.
 Input Format is responsible for creating Input Split.
Relation between Input Splits and HDFS Blocks
 MapReduce InputSplit vs Block Size in Hadoop:
 Block –
 The default size of the HDFS block is 128 MB which we can configure as per our requirement.
 All blocks of the file are of the same size except the last block, which can be of same size or smaller.
 The files are split into 128 MB blocks and then stored into Hadoop FileSystem.

 InputSplit –
 By default, split size is approximately equal to block size.
 InputSplit is user defined and the user can control split size based on the size of data in MapReduce program.

 Data Representation in Hadoop Blocks vs InputSplit-

 Block –
 It is the physical representation of data.
 It contains the minimum amount of data that can be read or written.

 InputSplit –
 It is the logical representation of data present in the block.
 It is used during data processing in MapReduce program or other processing techniques.
 Input Split doesn’t contain actual data, but a reference to the data.