Hadoop 1
Hadoop
⚫ Hadoop is an open-source framework that allows one to
store and process big data in a distributed environment
across clusters of computers using simple
programming models.
⚫ It is designed to scale up from single servers to
thousands of machines, each offering local
computation and storage.
⚫ The framework is written in Java and was originally
developed by Doug Cutting in 2005.
⚫ The Hadoop framework application works in an
environment that provides distributed storage and
computation across clusters of computers.
Hadoop
⚫ Hadoop uses MapReduce and Google File System (GFS)
technologies as its foundation.
⚫ Hadoop runs applications using the MapReduce (MR)
algorithm, where the data is processed in parallel
across different nodes.
⚫ It can perform complete statistical analysis on huge
amounts of data.
Hadoop
⚫ Unstructured data such as log files, Twitter feeds,
media files, and data from the internet in general is
becoming more and more relevant to businesses.
⚫ Every day, a large amount of unstructured data is
dumped into our machines. The major challenge is
not storing large data sets in our systems but
retrieving and analyzing this kind of big data in
organizations.
Hadoop
⚫ Hadoop is a framework that can store and analyze
data spread across different machines in different
locations very quickly and in a very cost-effective
manner.
⚫ It uses the concept of MapReduce, which enables it to
divide a query into small parts and process them in
parallel.
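The divide-and-merge idea described above can be sketched in plain Java (this is a hypothetical illustration, not the Hadoop API): the input is split into small parts, each part is processed in parallel, and the partial results are merged.

```java
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical sketch of the MapReduce idea: split one query over a data
// set into small parts, process the parts in parallel, merge the results.
public class DivideAndMerge {
    static int maxOf(List<Integer> data, int parts) {
        int chunk = (data.size() + parts - 1) / parts;  // ceiling division
        return IntStream.range(0, parts)
            .parallel()                                 // each part runs in parallel
            .map(i -> data.subList(Math.min(i * chunk, data.size()),
                                   Math.min((i + 1) * chunk, data.size()))
                          .stream().mapToInt(Integer::intValue)
                          .max().orElse(Integer.MIN_VALUE))
            .max().orElseThrow();                       // merge the partial maxima
    }

    public static void main(String[] args) {
        System.out.println(maxOf(List.of(7, 3, 42, 9, 15, 28), 3));  // prints 42
    }
}
```

The same pattern — partition, process independently, combine — is what Hadoop applies across machines rather than threads.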
Key Advantages
⚫ Stores data in its native format
⚫ Scalable
⚫ Cost-effective
⚫ Resilient to failure
⚫ Flexible
⚫ Fast
Features of Hadoop
⚫ It is optimized to handle massive quantities of structured,
semi-structured, and unstructured data using commodity
hardware, i.e., relatively inexpensive computers.
⚫ It replicates its data across multiple computers so that if
one goes down, the data can still be processed from
another machine that stores a replica.
⚫ It complements On-Line Transaction Processing (OLTP)
and On-Line Analytical Processing (OLAP). However, it is
not a replacement for a relational database management
system.
⚫ It is not good when work cannot be parallelized or when
there are dependencies within the data.
⚫ It is not good for processing small files. It works best with
huge data files and data sets.
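The replication feature listed above can be sketched as follows. This is a simplified, hypothetical model (round-robin placement, a fixed replication factor of 3), not HDFS's actual rack-aware placement policy:

```java
import java.util.*;

// Hypothetical sketch of HDFS-style replication: each block is stored on
// several nodes, so losing one node does not make the data unreadable.
public class ReplicationDemo {
    static final int REPLICATION = 3;

    // block id -> the nodes holding a replica (simple round-robin placement)
    static Map<String, List<String>> placeBlocks(List<String> blocks, List<String> nodes) {
        Map<String, List<String>> placement = new HashMap<>();
        for (int i = 0; i < blocks.size(); i++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++)
                replicas.add(nodes.get((i + r) % nodes.size()));
            placement.put(blocks.get(i), replicas);
        }
        return placement;
    }

    // A block is still readable if any replica lives on a healthy node.
    static boolean readable(List<String> replicas, Set<String> deadNodes) {
        return replicas.stream().anyMatch(n -> !deadNodes.contains(n));
    }

    public static void main(String[] args) {
        var placement = placeBlocks(List.of("blk_1", "blk_2"),
                                    List.of("node1", "node2", "node3", "node4"));
        // node1 goes down; every block still has live replicas
        System.out.println(readable(placement.get("blk_1"), Set.of("node1")));  // prints true
    }
}
```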
What is Hadoop used for?
Hadoop Versions
⚫ Hadoop 1.0
⚫ Hadoop 2.0
Hadoop 2.0 Architecture
HDFS Components
⚫ NAME NODE
⚫ SECONDARY NAME NODE
⚫ DATA NODE
Name Node/Master Node
⚫ NameNode is the master of the system. It maintains the
name system (directories and files) and manages the
blocks which are present on the DataNodes.
⚫ Tasks: store and manage file system metadata,
map blocks to DataNodes, perform file
system operations, and regulate client access
to files.
⚫ It also keeps track of the location of data on
DataNodes.
Name Node
⚫ NameNode does NOT store the files but only the file's
metadata.
⚫ NameNode oversees the health of DataNode and
coordinates access to the data stored in DataNode
⚫ The NameNode keeps track of all file-system-related
information, such as:
⚫ Which section of a file is saved in which part of the
cluster
⚫ The last access time for each file
⚫ User permissions, i.e., which users have access to a
file
⚫ It also executes file system operations such as
renaming, closing, and opening files and directories.
⚫ The NameNode is the centerpiece of an HDFS file
system. It keeps the directory tree of all files in the file
system, and tracks where across the cluster the file data
is kept. It does not store the data of these files itself.
⚫ The NameNode stores metadata (the number of blocks,
on which rack and which DataNode the data is stored, and
other details) about the data being stored in
DataNodes, whereas the DataNodes store the actual
data.
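The two mappings the NameNode maintains (file → blocks, block → DataNodes) can be sketched in plain Java. This is a hypothetical in-memory model for illustration, not Hadoop's actual implementation; the class and method names are invented:

```java
import java.util.*;

// Hypothetical sketch of the NameNode's role: it holds only metadata
// (which blocks make up a file, and which DataNodes hold each block),
// never the file contents themselves.
public class NameNodeMetadata {
    // file path -> ordered list of block ids
    private final Map<String, List<String>> fileToBlocks = new HashMap<>();
    // block id -> DataNodes holding a replica
    private final Map<String, List<String>> blockToNodes = new HashMap<>();

    void addFile(String path, Map<String, List<String>> blocks) {
        fileToBlocks.put(path, new ArrayList<>(blocks.keySet()));
        blockToNodes.putAll(blocks);
    }

    // A client asks the NameNode where a file's blocks live, then reads
    // the actual bytes directly from those DataNodes.
    List<List<String>> locateBlocks(String path) {
        List<List<String>> locations = new ArrayList<>();
        for (String block : fileToBlocks.getOrDefault(path, List.of()))
            locations.add(blockToNodes.get(block));
        return locations;
    }

    public static void main(String[] args) {
        NameNodeMetadata nn = new NameNodeMetadata();
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        blocks.put("blk_1", List.of("dn1", "dn2", "dn3"));
        nn.addFile("/logs/app.log", blocks);
        System.out.println(nn.locateBlocks("/logs/app.log"));
    }
}
```

Note that the "data" the NameNode returns is only a list of node addresses; the payload never passes through it.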
Secondary Name Node
⚫ Don't be confused by the name "Secondary": the Secondary
NameNode is NOT a backup or high-availability node for
the NameNode.
⚫ The job of the Secondary NameNode is to contact the
NameNode periodically, after a certain time interval (by
default 1 hour).
⚫ The NameNode keeps all filesystem metadata in RAM
and, while running, does not merge that metadata into
its on-disk image itself.
⚫ So if the NameNode crashes, everything held only in
RAM is lost, and without a recent checkpoint there is
no up-to-date copy of the filesystem metadata.
SNN
⚫ What the Secondary NameNode does is contact the
NameNode every hour and pull a copy of the metadata
out of the NameNode.
⚫ It merges this information into a clean, compacted
file and sends it back to the NameNode,
while keeping a copy for itself.
⚫ Hence the Secondary NameNode is not a backup; rather,
it does the job of housekeeping.
⚫ In case of NameNode failure, the saved metadata can
be used to rebuild it easily.
SNN Data Formats
⚫ fsimage
⚫ A snapshot of the file system at the time the NameNode
started.
⚫ Edit log
⚫ The sequence of changes made to the file system after the
NameNode started.
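The relationship between the fsimage and the edit log — and the merge the Secondary NameNode performs — can be sketched as follows. This is a hypothetical simplification (paths as strings, only CREATE/DELETE operations), not Hadoop's actual on-disk formats:

```java
import java.util.*;

// Hypothetical sketch of checkpointing: start from the fsimage snapshot,
// replay the edit log, and the result is the current file-system state.
// This merge is the "housekeeping" the Secondary NameNode performs.
public class Checkpoint {
    // Edit-log entries of the (invented) form "CREATE /path" or "DELETE /path".
    static Set<String> replay(Set<String> fsimage, List<String> editLog) {
        Set<String> state = new TreeSet<>(fsimage);
        for (String entry : editLog) {
            String[] op = entry.split(" ", 2);
            if (op[0].equals("CREATE")) state.add(op[1]);
            else if (op[0].equals("DELETE")) state.remove(op[1]);
        }
        return state;  // becomes the new fsimage; the edit log can then be truncated
    }

    public static void main(String[] args) {
        Set<String> image = Set.of("/a", "/b");
        List<String> edits = List.of("CREATE /c", "DELETE /a");
        System.out.println(replay(image, edits));  // prints [/b, /c]
    }
}
```

Replaying a long edit log on every restart is slow, which is why periodically folding it into a fresh snapshot matters.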
Data Node
⚫ On every node in a cluster, there is a DataNode. These
nodes manage the data storage of their system.
⚫ Datanodes perform read-write operations on the file
systems, as per client request.
⚫ They also perform operations such as block creation,
deletion, and replication according to the
instructions of the namenode
Data Node
⚫ A DataNode connects to the NameNode at startup,
retrying until that service comes up.
⚫ It then responds to requests from the NameNode
for filesystem operations.
MapReduce API Classes
⚫ JobContext Interface
⚫ Job Class
⚫ Mapper Class
⚫ Reducer Class
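The Mapper/Reducer contract listed above can be illustrated with a minimal in-memory word count. This is a hypothetical sketch modeled on the shape of the API, not the real `org.apache.hadoop.mapreduce` classes: a Mapper emits (key, value) pairs, the framework groups values by key (the shuffle), and a Reducer folds each group.

```java
import java.util.*;
import java.util.function.BiConsumer;

// Hypothetical in-memory sketch of the Mapper/Reducer contract
// (not the real org.apache.hadoop.mapreduce API).
public class WordCount {
    interface Mapper<IN, K, V> { void map(IN input, BiConsumer<K, V> emit); }
    interface Reducer<K, V, OUT> { OUT reduce(K key, List<V> values); }

    static Map<String, Integer> run(List<String> lines) {
        Mapper<String, String, Integer> mapper =
            (line, emit) -> { for (String w : line.split("\\s+")) emit.accept(w, 1); };
        Reducer<String, Integer, Integer> reducer =
            (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum();

        // The "shuffle" step: group the emitted values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            mapper.map(line, (k, v) ->
                grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));

        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, vs) -> result.put(k, reducer.reduce(k, vs)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hadoop stores data", "hadoop processes data")));
        // prints {data=2, hadoop=2, processes=1, stores=1}
    }
}
```

In real Hadoop, the Job class wires a Mapper and Reducer together and the framework runs the shuffle across machines; here everything happens in one process for clarity.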
HDFS READ FILE
HDFS WRITES A FILE
MR
MR VS YARN