Comprehensive Guide to Hadoop Basics
GTU #3170722
Unit-3
Hadoop
Outline
• History of Hadoop
• Features of Hadoop
• Advantages and Disadvantages of Hadoop
• Hadoop Ecosystem
• Hadoop vs SQL
• Hadoop Components
• Use case of Hadoop
• Processing data with Hadoop
• YARN Components
• YARN Architecture
• YARN MapReduce Application
• Execution Flow
• YARN workflow
• Anatomy of MapReduce Program
• Input Splits
• Relation between Input Splits and HDFS
History of Hadoop
Hadoop is an open-source software framework for
storing and processing large datasets ranging in
size from gigabytes to petabytes.
Hadoop was developed at the Apache Software
Foundation.
In 2008, Hadoop beat the existing supercomputer-based record and became the
fastest system on the planet at sorting a terabyte of data.
There are basically two components in Hadoop:
1. Hadoop Distributed File System (HDFS):
It allows you to store data of various formats across a cluster.
2. YARN (Yet Another Resource Negotiator):
It handles resource management in Hadoop and allows parallel processing over
the data stored across HDFS.
Basics of Hadoop
Hadoop is an open-source software framework for storing data and
running applications on clusters of commodity hardware.
It provides massive storage for any kind of data, enormous processing
power and the ability to handle virtually limitless concurrent tasks or jobs.
While data on a personal computer resides in the local file system, data in
Hadoop resides in a distributed file system called the
Hadoop Distributed File System (HDFS).
The processing model is based on the 'Data Locality' concept, wherein
computational logic is sent to the cluster nodes (servers) that contain the data.
This computational logic is simply a compiled version of a program written in a
high-level language such as Java.
Such a program processes data stored in Hadoop HDFS.
Features of Hadoop
Open source
Hadoop is open-source, which means it is free to use.
Highly Scalable Cluster
A large amount of data is divided across multiple inexpensive machines in a cluster
and processed in parallel. The number of these machines or nodes can be increased
or decreased as per the enterprise's requirements. Traditional Relational Database
Management Systems cannot be scaled to handle such large amounts of data.
Fault Tolerance is Available
By default, Hadoop makes 3 copies of each file block and stores them on different
nodes. This replication factor is configurable and can be changed via the
replication property in the HDFS configuration file (hdfs-site.xml).
High Availability is Provided
A highly available Hadoop cluster has two or more NameNodes, i.e. an active
NameNode and a passive (standby) NameNode.
Cost-Effective
Hadoop is open-source and uses cost-effective commodity hardware, which provides a
low-cost solution for storing and processing big data.
Features of Hadoop
Hadoop Provides Flexibility
Hadoop is designed in such a way that it can deal with any kind of dataset, like
structured (MySQL data), semi-structured (XML, JSON) and unstructured (images and
videos), very efficiently.
Easy to Use
Hadoop is easy to use since developers need not worry about the distributed
processing work; it is managed by Hadoop itself.
Hadoop Uses Data Locality
The concept of Data Locality is used to make Hadoop processing fast. In the data
locality concept, the computation logic is moved close to the data rather than moving
the data to the computation. Moving data across HDFS is costly, so the data locality
concept minimizes bandwidth utilization in the system.
Provides Faster Data Processing
Hadoop uses a distributed file system to manage its storage, i.e. HDFS (Hadoop
Distributed File System).
Support for Multiple Data Formats
Hadoop supports multiple data formats like CSV, JSON, Avro, and many more, making
it flexible for handling data from different sources.
Features of Hadoop
High Processing Speed
Hadoop’s distributed processing model allows it to process large amounts of data at
high speeds. This is achieved by distributing data across multiple nodes and
processing it in parallel. As a result, Hadoop can process data much faster than
traditional database systems.
Machine Learning Capabilities
Hadoop offers machine learning capabilities through its ecosystem tools like Mahout,
which is a library for creating scalable machine learning applications. With these
tools, data analysts and developers can build machine learning models to analyze
and process large datasets.
Integration with Other Tools
Hadoop integrates with other popular tools like Apache Spark, Apache Flink, and
Apache Storm, making it easier to build data processing pipelines.
Secure
Hadoop provides built-in security features like authentication, authorization, and
encryption.
Community Support
Hadoop has a large community of users and developers who contribute to its ongoing
development and support.
Advantages and
Disadvantages of Hadoop
Advantage & Disadvantage of Hadoop
Why is Hadoop Required? - Traditional Restaurant Scenario
Why is Hadoop Required? - Traditional Scenario
Why is Hadoop Required? - Distributed Processing Scenario
Why is Hadoop Required? - Distributed Processing Scenario Failure
Why is Hadoop Required? - Solution to the Restaurant Problem
Why is Hadoop Required? - Hadoop in Restaurant Analogy
Hadoop functions in a similar fashion to
Bob's restaurant.
Just as the food shelf is distributed in
Bob's restaurant, in Hadoop the data is
stored in a distributed fashion, with
replication, to provide fault tolerance.
For parallel processing, the data is first
processed by the slave nodes where it is
stored, producing intermediate results,
and those intermediate results are then
merged by the master node to produce the
final result.
Hadoop Eco System
Hadoop Ecosystem
Hadoop is a framework that can process large data sets in the form of
clusters.
As a framework, Hadoop is composed of multiple modules that are
compatible with a large technology ecosystem.
The Hadoop ecosystem is a platform or suite that provides various
services to solve big data problems.
It includes Apache projects and various commercial tools and solutions.
Hadoop has four main elements, namely HDFS, MapReduce, YARN and
Hadoop Common.
Most tools or solutions are used to supplement or support these core
elements.
All of these tools work together to provide services such as data
ingestion, analysis, storage, and maintenance.
Hadoop Ecosystem
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query based processing of data services
HBase: NoSQL Database
Mahout, Spark MLLib: Machine Learning algorithm libraries
Solr, Lucene: Searching and Indexing
Zookeeper: Managing cluster
Oozie: Job Scheduling
Hadoop Ecosystem Distribution
HDFS
Hadoop Distributed File System is the core component, or you can say,
the backbone of Hadoop Ecosystem.
HDFS makes it possible to store different types of large data sets (i.e. structured,
semi-structured and unstructured data).
HDFS creates a level of abstraction over the resources, from where we can see
the whole HDFS as a single unit.
It helps us in storing our data across various nodes and maintaining the
log file about the stored data (metadata).
HDFS has two main components: NAMENODE and DATANODE.
Yarn
Think of YARN as the brain of your Hadoop ecosystem.
It performs all your processing activities by allocating resources and
scheduling tasks.
It is a type of resource negotiator; as the name suggests, YARN is a
negotiator that helps manage all resources in the cluster.
In short, it performs scheduling and resource allocation for the Hadoop
system.
It consists of three main components, namely,
1. Resource Manager has the right to allocate resources for the applications
on the system.
2. Node Manager is responsible for allocating resources such as CPU,
memory, bandwidth and so on for each machine, and reports resource usage
to the Resource Manager.
3. Application Manager acts as an interface between the Resource Manager
and the Node Manager, and negotiates resources according to the requirements
of the applications.
Map Reduce
MapReduce is the core component of processing in a Hadoop Ecosystem
as it provides the logic of processing.
MapReduce is a software framework which helps in writing applications
that processes large data sets using distributed and parallel algorithms
inside Hadoop environment.
So, by making use of distributed and parallel algorithms, MapReduce carries the
processing logic to the data and helps write applications that transform big
data sets into manageable ones.
MapReduce uses two functions, namely Map() and Reduce(), whose tasks are:
The Map() function performs actions like filtering, grouping and sorting, while the
Reduce() function aggregates and summarizes the results produced by the Map()
function.
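As a concrete illustration, below is a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce API: the Map() function tokenizes each input line into (word, 1) pairs, and the Reduce() function sums the counts for each word. Class names are illustrative, not part of the original slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): tokenizes each input line and emits (word, 1) intermediate pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate key-value pair
        }
    }
}

// Reduce(): aggregates the counts emitted by the mappers for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}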
Pig
Pig was developed by Yahoo. It uses the Pig Latin language, which
is a query-based language similar to SQL.
It is a platform for structuring the data flows, processing and analyzing
massive data sets.
Pig is responsible for executing commands and processing all MapReduce
activities in the background. After processing, pig stores the result in
HDFS.
The Pig Latin language is specially designed for this framework and runs on Pig
Runtime, just like the way Java runs on the JVM.
Pig helps simplify programming and optimization and is therefore an
important part of the Hadoop ecosystem.
Pig works as follows: first the load command loads the data; then we perform
various functions on it like grouping, filtering, joining, sorting, etc.
At last, either you can dump the data on the screen, or you can store the
result back in HDFS.
Hive
With the help of SQL methodology and interface, HIVE
performs reading and writing of large data sets.
However, its query language is called HQL (Hive Query Language):
HIVE + SQL = HQL.
It is highly scalable because it supports real-time processing and batch
processing.
In addition, Hive supports all SQL data types, making query processing
easier.
Similar to the query processing framework, HIVE also has two
components: JDBC driver and HIVE Command-Line.
The JDBC driver, together with the ODBC driver, is used to set up connections and
data storage permissions, whereas the HIVE command line facilitates query processing.
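As a hedged sketch of the JDBC path mentioned above, a Java client can connect to HiveServer2 through Hive's JDBC driver and submit HQL. The host, port, credentials and the employees table below are placeholders, not part of the original slides.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Hive's JDBC driver class; HiveServer2 listens on port 10000 by default.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hostname, database and credentials below are placeholders.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "user", "");
        Statement stmt = con.createStatement();
        // HQL looks like SQL; "employees" is a hypothetical table.
        ResultSet rs = stmt.executeQuery(
                "SELECT dept, COUNT(*) FROM employees GROUP BY dept");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}

Hive compiles the submitted HQL into underlying jobs (MapReduce, Tez or Spark, depending on the configured execution engine), so the client only ever sees a familiar SQL-style interface.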
Mahout
Mahout enables machine learning for systems or applications.
Machine learning, as its name suggests, can help systems develop
themselves based on certain patterns, user / environment interactions, or
algorithm-based fundamentals.
It provides various libraries or functions, such as collaborative filtering,
clustering, and classification, which are all machine learning concepts.
It allows us to invoke algorithms as per our needs with the help of its own
libraries.
Apache Spark
Apache Spark is a framework for real time data analytics in a distributed
computing environment.
Spark is written in Scala and was originally developed at the
University of California, Berkeley.
It executes in-memory computations to increase speed of data processing
over Map-Reduce.
It can be up to 100x faster than Hadoop MapReduce for large-scale data processing
by exploiting in-memory computation and other optimizations; consequently, it
requires more memory and processing power than MapReduce.
It is better for real-time processing, while Hadoop is designed to store
unstructured data and perform batch processing on it.
When we combine the capabilities of Apache Spark with the low-cost
operations of Hadoop on basic hardware, we get the best results.
So, many companies use Spark and Hadoop together to process and
analyze big data stored in HDFS.
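To make the "Spark and Hadoop together" point concrete, here is a minimal word-count sketch using Spark's Java API (Spark 2.x style) that reads its input from HDFS and writes the result back to HDFS; the paths are illustrative.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark word count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Spark reads directly from HDFS and keeps intermediate data in memory.
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input");  // illustrative path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.saveAsTextFile("hdfs:///user/demo/output");               // illustrative path
        }
    }
}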
Apache HBase
HBase is an open source non-relational distributed database. In other
words, it is a NoSQL database.
It supports all types of data, which is why it can handle anything in the
Hadoop ecosystem.
It is based on Google's Big-Table, which is a distributed storage system
designed to handle large data sets.
HBase is designed to run on HDFS and provide features similar to Big-
Table.
It provides us with a fault-tolerant way of storing sparse data, which is
common in most big data use cases.
HBase itself is written in Java, and HBase applications can access it through REST,
Avro, and Thrift APIs.
For example, you have billions of customer emails and you need to find
out the number of customers who have used the word "complaint" in their
emails.
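A hedged sketch of that email example using the HBase 1.x-style Java client API: scan a hypothetical customer_emails table and keep only rows whose message body contains the word "complaint". The table, column family and qualifier names are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class ComplaintEmailCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             // "customer_emails" is a hypothetical table name.
             Table emails = conn.getTable(TableName.valueOf("customer_emails"))) {
            Scan scan = new Scan();
            // Keep only rows whose msg:body column contains the word "complaint".
            scan.setFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("msg"), Bytes.toBytes("body"),
                    CompareOp.EQUAL, new SubstringComparator("complaint")));
            long count = 0;
            try (ResultScanner results = emails.getScanner(scan)) {
                for (Result r : results) {
                    count++;
                }
            }
            System.out.println("Emails mentioning 'complaint': " + count);
        }
    }
}

In practice, a count over billions of rows would usually be pushed down to a MapReduce job or an HBase coprocessor rather than streamed through a single client-side scan.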
Zookeeper
There was a huge issue of management of coordination
and synchronization among the resources or the
components of Hadoop which resulted in inconsistency.
Before Zookeeper, it was very difficult and time consuming
to coordinate between different services in Hadoop
Ecosystem.
The services earlier had many problems with interactions
like common configuration while synchronizing data.
Even if the services are configured, changes in the
configurations of the services make it complex and difficult
to handle.
Grouping and naming were also time-consuming factors.
Zookeeper overcame all the problems by performing
synchronization, inter-component based communication,
grouping, and maintenance.
Oozie
Oozie simply performs the task of a scheduler, thus scheduling jobs and
binding them together as a single unit.
There are two kinds of jobs:
1. Oozie workflow jobs
2. Oozie coordinator jobs
Oozie workflow jobs are those that need to be executed in a sequentially
ordered manner.
Oozie coordinator jobs are those that are triggered when some data or an
external stimulus is given to them.
Hadoop vs SQL
Technology: Hadoop is a modern technology; SQL is a traditional technology.
Volume: Hadoop usually deals with petabytes of data; SQL usually deals with gigabytes.
Operations: Hadoop offers storage, processing, retrieval and pattern extraction from data; SQL offers storage, processing, retrieval and pattern mining of data.
Fault Tolerance: Hadoop is highly fault tolerant; SQL has good fault tolerance.
Storage: Hadoop stores data as key-value pairs, tables, hash maps etc. in distributed systems; SQL stores structured data in tabular format with a fixed schema.
Scaling: Hadoop scales linearly; SQL scales non-linearly.
Providers: Cloudera, Hortonworks, AWS etc. provide Hadoop systems; well-known industry leaders in SQL systems are Microsoft, SAP, Oracle etc.
Data Access: Hadoop provides batch-oriented data access; SQL provides interactive and batch-oriented data access.
Cost: Hadoop is open source and systems can be scaled cost-effectively; SQL servers are licensed and cost a fortune to buy, and additional charges emerge if the system runs out of storage.
Hadoop vs SQL
Optimization: Hadoop stores data in HDFS and processes it through MapReduce with huge optimization techniques; SQL does not have any advanced optimization techniques.
Structure: Hadoop uses a dynamic schema, capable of storing and processing log data, real-time data, images, videos, sensor data etc. (both structured and unstructured); SQL uses a static schema, capable of storing data (fixed schema) in tabular format only (structured).
Data Update: Hadoop writes data once and reads it multiple times; SQL reads and writes data multiple times.
Integrity: Hadoop offers low integrity; SQL offers high integrity.
Interaction: Hadoop uses JDBC (Java Database Connectivity) to communicate with SQL systems to send and receive data; SQL systems can read and write data to Hadoop systems.
Hardware: Hadoop uses commodity hardware; SQL uses proprietary hardware.
Training: Learning Hadoop is moderately hard for entry-level as well as seasoned professionals; learning SQL is easy even for entry-level professionals.
Hadoop Distributed File System
Hadoop File System was developed using distributed file system design.
It runs on commodity hardware. Unlike other distributed systems, HDFS
is highly fault-tolerant and designed using low-cost hardware.
HDFS holds a very large amount of data and provides easier access. To store
such huge data, the files are stored across multiple machines.
These files are stored in redundant fashion to rescue the system from
possible data losses in case of failure.
HDFS also makes applications available for parallel processing.
Features of HDFS
It is suitable for the distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of namenode and datanode help users to easily check the status
of cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Master-Slave Architecture
HDFS follows the master-slave architecture and it has the following
elements.
Hadoop Core Components - HDFS
Name Node
The namenode is the commodity hardware that contains the GNU/Linux
operating system and the namenode software.
It is a software that can be run on commodity hardware.
The system having the namenode acts as the master server and it performs the
following tasks:
It executes file system operations such as renaming, closing, and opening files
and directories.
It records each and every change that takes place to the file system metadata.
It regularly receives a Heartbeat and a block report from all the DataNodes in the
cluster to ensure that the DataNodes are alive
It keeps a record of all the blocks in the HDFS and DataNode in which they are
stored.
Data Node
It is the slave daemon/process which runs on each slave machine.
The actual data is stored on DataNodes.
It is responsible for serving read and write requests from the clients.
It is also responsible for creating blocks, deleting
blocks and replicating the same based on the decisions taken by the
NameNode.
It sends heartbeats to the NameNode periodically to report the overall
health of HDFS, by default, this frequency is set to 3 seconds.
Secondary Node
The Secondary NameNode works concurrently
with the primary NameNode as a helper
daemon/process.
It is one which constantly reads all the file
systems and metadata from the RAM of the
NameNode and writes it into the hard disk or
the file system.
It is responsible for combining the
EditLogs with the FsImage
from the NameNode.
It downloads the EditLogs from the
NameNode at regular intervals and applies to
FsImage.
The new FsImage is copied back to the
NameNode, which is used whenever the
NameNode is started the next time.
Block
A block is the minimum amount of data that HDFS can read or write.
HDFS blocks are 128 MB by default, and this is configurable.
Files in HDFS are broken into block-sized chunks, which are stored as
independent units.
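The block size and replication of an existing HDFS file can be inspected from Java through the FileSystem API. The sketch below is illustrative and expects the path of an existing HDFS file as its command-line argument.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);              // path of an existing HDFS file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size : " + status.getBlockSize());    // 128 MB by default
        System.out.println("Replication: " + status.getReplication());  // 3 by default

        // Each BlockLocation describes one block-sized chunk and the DataNodes holding it.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(loc);
        }
    }
}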
HDFS Architecture
Hadoop Cluster
To Improve the Network
Performance
The communication between nodes
residing on different racks is directed
via switch.
In general, you will find greater
network bandwidth between machines
in the same rack than the machines
residing in different rack.
It helps reduce write traffic
between different racks and thus
provides better write performance.
Also, you will be gaining increased
read performance because you are
using the bandwidth of multiple racks.
To Prevent Loss of Data
We don't have to worry about the data
even if an entire rack fails, because block
replicas are placed on more than one rack (rack awareness).
HDFS Write Architecture
Suppose an HDFS client wants to write a file of size 248 MB.
Assume that the system block size is configured for 128 MB (default).
So, the client will divide the file into 2 blocks: one of
128 MB (Block A) and the other of 120 MB (Block B).
HDFS Write Architecture – Cont.
Writing Process Steps:
At first, the HDFS client will reach out to the NameNode for a Write Request against
the two blocks, say, Block A & Block B.
The NameNode will then grant the client the write permission and will provide the IP
addresses of the DataNodes where the file blocks will be copied eventually.
The DataNodes are selected based on availability, replication factor and rack
awareness.
The replication factor is set to the default, i.e. 3. Therefore, for each block the
NameNode will provide the client a list of three IP addresses of DataNodes. The list
will be unique for each block.
For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}
For Block B, list B = {IP of DataNode 3, IP of DataNode 7, IP of DataNode 9}
Each block will be copied to three different DataNodes to keep the replication
factor consistent throughout the cluster.
Now the whole data copy process will happen in three stages:
Set up of Pipeline
Data streaming and replication
Shutdown of Pipeline (Acknowledgement stage)
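From the client's point of view, all of these stages are hidden behind a single write call. A minimal sketch using the FileSystem API (the target path is illustrative):

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads fs.defaultFS from core-site.xml
        FileSystem fs = FileSystem.get(conf);

        // The client only calls create()/write(); block splitting, pipeline setup,
        // replication to 3 DataNodes and acknowledgements happen behind this API.
        Path target = new Path("/user/demo/sample.txt");   // illustrative path
        try (FSDataOutputStream out = fs.create(target, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Wrote " + fs.getFileStatus(target).getLen() + " bytes");
    }
}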
HDFS – Write Pipeline
Data Streaming and Replication
Shutdown of Pipeline or Acknowledgement stage
Processing data with Hadoop
Processing data with Hadoop using Map Reduce
MapReduce and HDFS are the two major components of Hadoop which
make it so powerful and efficient to use.
MapReduce is a programming model used for efficient processing in
parallel over large data-sets in a distributed manner.
The data is first split and then combined to produce the final result.
Libraries for MapReduce have been written in many programming languages,
with various optimizations.
The purpose of MapReduce in Hadoop is to map each job into smaller, equivalent
tasks and then reduce their outputs, which lowers the overhead on the cluster
network and the processing load on any single node.
The MapReduce task is mainly divided into two phases Map Phase and
Reduce Phase.
Processing data with Hadoop using Map Reduce
Components of MapReduce Architecture:
Client: The MapReduce client is the one who brings the Job to the
MapReduce for processing. There can be multiple clients available that
continuously send jobs for processing to the Hadoop MapReduce Manager.
Job: The MapReduce Job is the actual work that the client wants to do,
which is composed of many smaller tasks that the client wants to
process or execute.
Hadoop MapReduce Master: It divides the particular job into
subsequent job-parts.
Job-Parts: The tasks or sub-jobs that are obtained after dividing the main
job. The results of all the job-parts are combined to produce the final output.
Input Data: The data set that is fed to the MapReduce for processing.
Output Data: The final result is obtained after the processing.
Processing data with Hadoop using Map Reduce
The MapReduce task is mainly divided into 2 phases i.e. Map phase and
Reduce phase.
Map: As the name suggests, its main use is to map the input data into key-value
pairs. The input to the map is itself a key-value pair, where the key can be an
identifier such as an address and the value is the actual data it holds. The Map()
function is executed on each of these input key-value pairs and generates
intermediate key-value pairs, which work as input for the Reducer, i.e. the
Reduce() function.
Reduce: The intermediate key-value pairs that work as input for the Reducer
are shuffled, sorted, and sent to the Reduce() function. The Reducer
aggregates or groups the data based on its key as per the
reducer algorithm written by the developer.
How the Job Tracker and the Task Tracker deal with MapReduce:
Job Tracker: The work of the Job Tracker is to manage all the resources and all the
jobs across the cluster and also to schedule each map task on a Task Tracker
running on the same data node, since there can be hundreds of data nodes
available in the cluster.
Task Tracker: The Task Tracker works as a slave: it executes the map and reduce
tasks assigned to it by the Job Tracker on its node and reports progress back to
the Job Tracker.
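Below is a minimal driver for the word-count job sketched earlier, using the org.apache.hadoop.mapreduce.Job API (in Hadoop 2 with YARN, the ResourceManager and ApplicationMaster take over the roles that the Job Tracker and Task Tracker play in this description). Input and output paths are supplied on the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);      // Map phase (see earlier sketch)
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation
        job.setReducerClass(WordCountReducer.class);    // Reduce phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}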
Processing data with Hadoop using Map Reduce
Example of MapReduce Algorithm
YARN Components & Architecture
YARN
Limitations of the Map Reduce Model
Combining/Joining multiple datasets
Possible but not elegant
May require multiple jobs
Custom composable workflow
Requires multiple Jobs
Plumbing between Jobs is manual
Sort-merge between Map and Reduce is mandatory
Fault tolerance mechanism is rigid
Application manager:
It manages running Application Masters in the cluster, i.e., it is responsible for starting
application masters and for monitoring and restarting them on different nodes in case of
failure.
YARN Architecture
Node Manager (NM):
It is the slave daemon of Yarn.
Node Manager (NM) is responsible for containers, monitoring their resource usage and
reporting the same to the Resource Manager (RM).
It manages the user processes on that machine.
The YARN Node Manager also tracks the health of the node on which it is running.
It monitors resource usage, performs log management and also kills a container
based on directions from the resource manager.
It is also responsible for creating the container process and starting it at the request
of the Application Master.
The MapReduce shuffle is a typical auxiliary service provided by the NMs for MapReduce
applications on YARN.
Application Master (AM):
One application master runs per application.
It negotiates resources from the resource manager and works with the node
manager.
It manages the application life cycle.
The application master requests the container from the node manager by sending a
Container Launch Context (CLC), which includes everything the application needs in
order to run.
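As a small, hedged illustration of the Resource Manager / Node Manager relationship, the sketch below uses the YarnClient API to ask the ResourceManager for the node reports that NodeManagers feed it through their heartbeats; it assumes a reachable cluster configured via yarn-site.xml.

import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnNodes {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Each NodeReport reflects NodeManager heartbeat data: total vs. used resources.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + "  capability=" + node.getCapability()
                    + "  used=" + node.getUsed());
        }
        yarn.stop();
    }
}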
YARN Architecture
Container:
It is a collection of physical resources such as RAM, CPU cores and disk on a single
node.
The containers are invoked by Container Launch Context(CLC) which is a record
that contains information such as environment variables, security tokens,
dependencies etc.
Task Assignment:
If the job does not qualify for running as an uber task, then the application master
requests containers for all the map and reduce tasks in the job from the resource
manager .
Requests for map tasks are made first and with a higher priority than those for
reduce tasks.
MapReduce Feature
Task Execution:
Once a task has been assigned resources for a container on a particular node by the
resource manager’s scheduler, the application master starts the container by
contacting the node manager.
The task is executed by a Java application whose main class is YarnChild. Before it
can run the task, it localizes the resources that the task needs, including the job
configuration and JAR file, and any files from the distributed cache.
Finally, it runs the map or reduce task.
Streaming:
Streaming runs special map and reduce tasks for the purpose of launching the user
supplied executable and communicating with it.
The Streaming task communicates with the process (which may be written in any
language) using standard input and output streams.
During execution of the task, the Java process passes input key-value pairs to the
external process, which runs them through the user-defined map or reduce function and
passes the output key-value pairs back to the Java process.
From the node manager’s point of view, it is as if the child process ran the map or
reduce code itself.
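Since a Streaming executable can be written in any language, the sketch below stays in Java: a streaming-style mapper that reads one record per line from standard input and writes tab-separated key-value pairs to standard output, which is the contract described above. It is an illustrative sketch, not taken from the original slides.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class StreamingWordMapper {
    public static void main(String[] args) throws Exception {
        // Hadoop Streaming feeds input records on stdin, one line per record,
        // and expects "key<TAB>value" lines on stdout.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");
                }
            }
        }
    }
}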
MapReduce Feature
Progress and status updates :
MapReduce jobs are long running batch jobs, taking anything from tens of seconds
to hours to run.
A job and each of its tasks have a status, which includes such things as the state of
the job or task (e.g. running, successfully completed, failed), the progress of maps
and reduces, the values of the job's counters, and a status message or description
(which may be set by user code).
When a task is running, it keeps track of its progress (i.e. the proportion of the
task completed).
For map tasks, this is the proportion of the input that has been processed.
For reduce tasks, it’s a little more complex, but the system can still estimate the
proportion of the reduce input processed.
It does this by dividing the total progress into three parts, corresponding to the three
phases of the shuffle.
As the map or reduce task runs, the child process communicates with its parent
application master through the umbilical interface.
The task reports its progress and status (including counters) back to its application
master, which has an aggregate view of the job, every three seconds over the
umbilical interface.
Input Splits and HDFS Blocks
Block –
It is the physical representation of data.
It contains the minimum amount of data that can be read or written.
InputSplit –
It is the logical representation of the data present in the block.
By default, the split size is approximately equal to the block size.
InputSplit is user defined, and the user can control the split size based on the size of the data in the MapReduce program (see the sketch below).
It is used during data processing in a MapReduce program or other processing techniques.
An InputSplit doesn't contain the actual data, but a reference to the data.
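As referenced above, a hedged sketch of controlling the split size from a MapReduce driver using the FileInputFormat helpers; the 64 MB / 16 MB values are arbitrary examples, not recommendations from the original slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "custom split size");

        // By default the split size works out to the HDFS block size (128 MB).
        // Capping the maximum split size at 64 MB roughly doubles the number of
        // map tasks for the same input; the data itself is not re-blocked.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        FileInputFormat.setMinInputSplitSize(job, 16L * 1024 * 1024);

        // ... set mapper/reducer classes and input/output paths as in the earlier driver ...
    }
}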