HADOOP

Hadoop
⚫ Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models.
⚫ It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
⚫ The framework is written in Java and was originally developed by Doug Cutting in 2005.
⚫ The Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers.
Hadoop
⚫ Hadoop uses MapReduce and the Google File System technologies as its foundation.
⚫ Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different nodes.
⚫ It can perform complete statistical analysis on huge amounts of data.
⚫ Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop
⚫ Unstructured data such as log files, Twitter feeds,
media files, data from the internet in general is
becoming more and more relevant to businesses.
⚫ Every day, a large amount of unstructured data is getting dumped into our machines. The major
challenge is not to store large data sets in our systems
but to retrieve and analyze this kind of big data in the
organizations.
Hadoop
⚫ Hadoop is a framework that has the ability to store and analyze data present in different machines at different locations very quickly and in a very cost-effective manner.
⚫ It uses the concept of MapReduce, which enables it to divide a query into small parts and process them in parallel.
Key Advantages
⚫ Stores Data in its native format
⚫ Scalable
⚫ Cost Effective
⚫ Resilient to failure
⚫ Flexibility
⚫ Fast
Features of Hadoop
⚫ It is optimized to handle massive quantities of structured, semi-structured, and unstructured data, using commodity hardware, that is, relatively inexpensive computers.
⚫ It replicates its data across multiple computers so that if one goes down, the data can still be processed from another machine that stores its replica.
⚫ It complements On-Line Transaction Processing (OLTP) and On-Line Analytical Processing (OLAP). However, it is not a replacement for a relational database management system.
⚫ It is not good when work cannot be parallelized or when there are dependencies within the data.
⚫ It is not good for processing small files. It works best with huge data files and data sets.
What is Hadoop used for?
Hadoop Versions

⚫ Hadoop 1.0
⚫ Hadoop 2.0
Hadoop 2.0 Architecture
Hadoop Common

⚫ Hadoop Common is the set of common utilities that support the other Hadoop modules.
⚫ It is essentially a collection of Java utilities, providing the basic Java libraries used throughout Hadoop.
Hadoop Distributed File System

⚫ Hadoop Distributed File System (HDFS) is a file system that provides reliable data storage and access across all the nodes in a Hadoop cluster. It links together the file systems on many local nodes to create a single file system.
⚫ Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the various nodes in the cluster.
⚫ This way, the map and reduce functions can be executed on smaller subsets of your larger data sets, and this provides the scalability that is needed for big data processing. This powerful feature is made possible through HDFS.
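To make the single-file-system view concrete, here is a minimal sketch that reads a file stored in HDFS through Hadoop's Java FileSystem API. The path /user/demo/input.txt is a made-up placeholder; the cluster address comes from the configuration files on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: open a file stored in HDFS and print its contents line by line.
public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads cluster settings (core-site.xml etc.) from the classpath
        FileSystem fs = FileSystem.get(conf);        // connects to the configured file system (HDFS)
        Path file = new Path("/user/demo/input.txt"); // placeholder path, not from the slides
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);            // blocks are fetched from DataNodes transparently
            }
        }
    }
}

The client code never deals with individual blocks or nodes; HDFS presents the distributed storage as one file system.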
MapReduce

⚫ MapReduce is a programming framework of Hadoop suitable for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable, fault-tolerant manner.
⚫ MapReduce is the heart of Hadoop. It is this
programming paradigm that allows for massive
scalability across hundreds or thousands of servers in a
Hadoop cluster.
⚫ The MapReduce concept is fairly simple to
understand for those who are familiar with clustered
scale-out data processing solutions.
Yet Another Resource Negotiator
⚫ YARN is a cluster management technology. It is one of
the key features in second-generation Hadoop.
⚫ It is the next-generation MapReduce, which assigns CPU,
memory and storage to applications running on a
Hadoop cluster.
⚫ It enables application frameworks other than MapReduce
to run on Hadoop, opening up a wealth of possibilities.
⚫ Part of the core Hadoop project, YARN is the architectural
center of Hadoop that allows multiple data processing
engines such as interactive SQL, real-time streaming, data
science and batch processing to handle data stored in a
single platform.
Core Components Of Hadoop
⚫ Hadoop Components
⚫ HDFS (Hadoop Distributed File System)
⚫ MR (MapReduce)
⚫ Hadoop Layers
⚫ Storage Layer (HDFS)
⚫ Processing/Computing Layer (MapReduce)
HDFS
⚫ HDFS is a subproject of the Apache Hadoop project.
⚫ It is designed to provide a fault-tolerant file system that runs on commodity hardware.
⚫ HDFS uses a master/slave architecture to process the data.
⚫ HDFS is a distributed file system that provides high-throughput access to data.
⚫ It provides a limited interface for managing the file system, which allows it to scale and provide high throughput.
Hadoop Clusters
⚫ A Hadoop cluster is a special type of computational cluster designed for storing and analyzing vast amounts of unstructured data in a distributed computing environment. These clusters run on low-cost commodity computers.
Architecture
⚫ An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
⚫ In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.
⚫ Whenever files/data need to be stored in HDFS, the files are internally divided into blocks, and these blocks are assigned to DataNodes.
⚫ The DataNodes then perform read, write, block creation, and deletion operations as instructed by the NameNode.
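As an illustrative sketch of this division of labour, the Java FileSystem API can be asked which DataNodes hold the blocks of a file; that block-to-node mapping is exactly the metadata the NameNode maintains. The file path below is a made-up placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: list which hosts (DataNodes) store each block of a file.
public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/bigfile.dat")); // placeholder path
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen()); // metadata served by the NameNode
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                + " length " + block.getLength()
                + " hosts " + String.join(",", block.getHosts()));
        }
    }
}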
HDFS NODES
⚫ Three types of nodes

⚫ NAME NODE
⚫ SECONDARY NAME NODE
⚫ DATA NODE
Name Node/Master Node
⚫ NameNode is the master of the system. It maintains the
name system (directories and files) and manages the
blocks which are present on the DataNodes.
⚫ Tasks: storing and managing file system metadata, mapping blocks to DataNodes, performing file system operations, and regulating client access to files.
⚫ It also keeps track of the location of data on the DataNodes.
Name Node
⚫ NameNode does NOT store the files but only the file's
metadata.
⚫ NameNode oversees the health of DataNode and
coordinates access to the data stored in DataNode
⚫ The NameNode keeps track of all file-system-related information, such as:
⚫ Which section of a file is saved in which part of the cluster
⚫ The last access time for each file
⚫ User permissions, i.e., which users have access to a file
⚫ It also executes file system operations such as
renaming, closing, and opening files and directories.
⚫ The NameNode is the centerpiece of an HDFS file
system. It keeps the directory tree of all files in the file
system, and tracks where across the cluster the file data
is kept. It does not store the data of these files itself.
⚫ The NameNode stores metadata (the number of blocks, which rack and which DataNode the data is stored on, and other details) about the data being stored in DataNodes, whereas the DataNodes store the actual data.
Secondary Name Node
⚫ Don't get confused by the name "Secondary". The Secondary NameNode is NOT a backup or high-availability node for the NameNode.
⚫ The job of the Secondary NameNode is to contact the NameNode periodically, after a certain time interval (by default one hour).
⚫ The NameNode keeps all file system metadata in RAM and, while running, has no capability of its own to merge that metadata into a checkpoint on disk.
⚫ So if the NameNode crashes, everything held in RAM is lost and you do not have an up-to-date backup of the file system metadata.
SNN
⚫ What the Secondary NameNode does is contact the NameNode every hour (by default) and pull a copy of the metadata information out of the NameNode.
⚫ It merges this information into a clean checkpoint file and sends it back to the NameNode, while keeping a copy for itself.
⚫ Hence the Secondary NameNode is not a backup; rather, it does the job of housekeeping.
⚫ In case of NameNode failure, the saved metadata can be used to rebuild it easily.
SNN Data Formats

⚫ fsimage
⚫ A snapshot of the file system when the NameNode started
⚫ Edit logs
⚫ The sequence of changes made to the file system after the NameNode started
Data Node
⚫ On every slave node in the cluster there is a DataNode. These nodes manage the data storage of their system.
⚫ DataNodes perform read/write operations on the file system, as per client requests.
⚫ They also perform operations such as block creation, deletion, and replication according to the instructions of the NameNode.
Data Node
⚫ A DataNode connects to the NameNode; spinning
until that service comes up.
⚫ It then responds to requests from the NameNode
for filesystem operations.

⚫ Slave nodes are the majority of machines in a Hadoop cluster and are responsible to:
⚫ Store the data
⚫ Process the computation
Blocks
⚫ Generally the user data is stored in the files of HDFS. The
file in a file system will be divided into one or more
segments and/or stored in individual data nodes.
⚫ These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be changed as needed in the HDFS configuration.
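As a hedged sketch of changing that setting per job, the code below writes a file with a 128 MB block size by overriding the dfs.blocksize property in the client configuration (dfs.block.size is the older Hadoop 1.x name of the same property); the output path is a made-up placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: create a file whose blocks are 128 MB instead of the default.
public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // applies to files created through this configuration
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/large-output.dat"))) { // placeholder path
            out.writeUTF("data written in 128 MB blocks");
        }
    }
}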
Data Node
⚫ A DataNode also continuously sends heartbeats to the NameNode to ensure connectivity between the NameNode and the DataNode.
⚫ If the NameNode does not receive a heartbeat from a particular DataNode for about 10 minutes, it considers that DataNode to be dead/out of service and initiates replication of the blocks that were hosted on that DataNode onto other DataNodes.
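In Hadoop 2.x this timeout is configurable: each DataNode sends a heartbeat every dfs.heartbeat.interval seconds (3 by default), and the NameNode marks a DataNode dead after an interval derived from that value and dfs.namenode.heartbeat.recheck-interval (5 minutes by default), which works out to roughly 10.5 minutes.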
2 Daemons
⚫ JobTracker and TaskTracker
⚫ The JobTracker provides connectivity between Hadoop and your application. When you submit code to the cluster, the JobTracker creates the execution plan by deciding which tasks to assign to which nodes, and it monitors all the running tasks.
Job Tracker
⚫ When a task fails, it automatically reschedules the task on a different node after a predefined number of retries. The JobTracker is a master daemon responsible for executing the overall MapReduce job; there is a single JobTracker per Hadoop cluster.
⚫ It is responsible for scheduling tasks on the TaskTrackers, monitoring them, and re-executing failed tasks.
Task Tracker
⚫ This daemon is responsible for executing the individual tasks assigned by the JobTracker.
⚫ There is a single TaskTracker per slave node; it spawns multiple Java virtual machines to handle multiple MapReduce tasks in parallel.
Job Tracker
⚫ When the JobTracker fails to receive a heartbeat from a TaskTracker, the JobTracker assumes that the TaskTracker has failed and resubmits the task to another available node in the cluster.
⚫ Once the client submits a job to the JobTracker, it partitions the job and assigns the various map and reduce tasks to the TaskTrackers in the cluster.
Map Reduce
⚫ Software framework
⚫ Helps you to process massive amounts of data in
parallel.
⚫ Heart of Hadoop
⚫ Java API
⚫ Set of services for managing the task
⚫ Map – performs a transformation
⚫ Reduce – performs an aggregation
Map Reduce
⚫ MapReduce is a framework using which we can write
applications to process huge amounts of data, in parallel,
on large clusters of commodity hardware in a reliable
manner.
⚫ MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map
⚫ Applies an operation to a piece of data
⚫ Produces some intermediate output
⚫ Map tasks process these independent chunks in parallel. The output produced by the map tasks serves as intermediate data and is stored on the local disk of that server.
Map
⚫ Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
⚫ The output of the mappers is automatically shuffled and sorted by the framework.
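As an illustrative sketch using the standard Hadoop Java MapReduce API, a minimal word-count Mapper emits the intermediate pair (word, 1) for every word in an input line. The class and variable names here are invented for the example.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: word-count Mapper that turns each line into (word, 1) pairs.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit an intermediate key-value pair
        }
    }
}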
Reduce
⚫ Consolidates the intermediate outputs from the map steps.
⚫ Provides final output.
⚫ The Reduce task takes the output from a map as an input and combines these data tuples into a smaller set of tuples.
⚫ Reduce task is always performed after the map job.
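Continuing the hypothetical word-count sketch, a minimal Reducer sums the 1s emitted by the mappers for each word and produces one consolidated (word, total) pair.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: word-count Reducer that sums the counts for each word.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum)); // one consolidated pair per key
    }
}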
Reduce
⚫ But once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change.
How MapReduce Works?

⚫ The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
⚫ The Reduce task takes the output from the Map as an
input and combines those data tuples (key-value pairs)
into a smaller set of tuples.
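For example (an illustrative input, not from the original material), in a word-count job the Map task would turn the line "to be or not to be" into the pairs (to,1), (be,1), (or,1), (not,1), (to,1), (be,1), and the Reduce task would combine them into (to,2), (be,2), (or,1), (not,1).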
Phases
⚫ Input Phase − Here we have a Record Reader that
translates each record in an input file and sends the parsed
data to the mapper in the form of key-value pairs.
⚫ Map − Map is a user-defined function, which takes a
series of key-value pairs and processes each one of them
to generate zero or more key-value pairs.
⚫ Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
Phases
⚫ Combiner − A combiner is a type of local Reducer
that groups similar data from the map phase into
identifiable sets. It takes the intermediate keys from the
mapper as input and applies a user-defined code to
aggregate the values in a small scope of one mapper. It
is not a part of the main MapReduce algorithm; it is
optional.
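For instance, if one mapper emits (hello,1) three times, a combiner running on that mapper's node can collapse them into a single (hello,3) before the data is sent over the network to the reducer; in the hypothetical word-count sketches above, the Reducer class itself could be reused as the combiner because summing is associative.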
Phases
⚫ Reducer − The Reducer takes the grouped key-value
paired data as input and runs a Reducer function on
each one of them. Here, the data can be aggregated,
filtered, and combined in a number of ways, and it
requires a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs
to the final step.
Output
⚫ Output Phase − In the output phase, we have an
output formatter that translates the final key-value
pairs from the Reducer function and writes them onto a
file using a record writer.
Keywords
⚫ Tokenize − Tokenizes the tweets into maps of tokens
and writes them as key-value pairs.
⚫ Filter − Filters unwanted words from the maps of
tokens and writes the filtered maps as key-value pairs.
⚫ Count − Generates a token counter per word.
⚫ Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
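As a purely hypothetical illustration of this pipeline: the tweet "learning hadoop and loving hadoop" would be tokenized into pairs such as (learning,1), (hadoop,1), (and,1), (loving,1), (hadoop,1); the filter step would drop unwanted words like "and"; the count step would produce a counter per word (hadoop 2, learning 1, loving 1); and the aggregate step would combine these per-mapper counters into overall totals.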
Map Reduce API

⚫ JobContext Interface
⚫ Job Class
⚫ Mapper Class
⚫ Reducer Class
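To show how the classes listed above fit together, here is a minimal, hypothetical driver for the word-count sketches from the Map and Reduce slides; it uses the Job class to configure the Mapper, Reducer, combiner, and input/output paths, and then submits the job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: driver that wires the word-count Mapper and Reducer into a job.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // Job class: configures and submits the job
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);        // Mapper class from the earlier sketch
        job.setCombinerClass(WordCountReducer.class);     // optional local aggregation
        job.setReducerClass(WordCountReducer.class);      // Reducer class from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this would typically be packaged into a jar and launched with the hadoop jar command, passing the input and output HDFS paths as arguments.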
HDFS READ FILE
HDFS WRITES A FILE
MR
MR VS YARN
