DM Hadoop Architecture

Hadoop uses the MapReduce programming algorithm and framework to process large datasets in parallel across clusters. It consists of four main components - MapReduce, HDFS, YARN, and common utilities. MapReduce breaks data into key-value pairs, maps functions to process them in parallel, and reduces the results. HDFS provides fault-tolerant storage across clusters with a master NameNode and slave DataNodes. YARN manages resources and scheduling.

Hadoop works on the MapReduce programming algorithm that was introduced by Google. Today, many big-brand companies use Hadoop in their organizations to deal with big data, e.g. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists of 4 components:

● MapReduce

● HDFS (Hadoop Distributed File System)

● YARN (Yet Another Resource Negotiator)

● Common Utilities or Hadoop Common

MapReduce is a programming model and processing framework that runs on top of the YARN framework. The major feature of MapReduce is that it performs distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop work so fast. When you are dealing with big data, serial processing is no longer practical. MapReduce has mainly 2 tasks, which are divided phase-wise:

● The Map() function breaks the input data blocks into tuples, which are nothing but key-value pairs.
● The Reduce() function then combines these tuples (key-value pairs) based on their key, forms a new set of tuples, and performs operations such as sorting, summation, etc. (a minimal word-count sketch follows this list).
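
To make the two phases concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class name and input/output paths are illustrative; the Mapper emits (word, 1) tuples and the Reducer sums the counts for each key.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): turns each input line into (word, 1) key-value tuples.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    // Reduce(): receives (word, [1, 1, ...]) after shuffle/sort and sums the list.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}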

Map Task:

● RecordReader: The purpose of the RecordReader is to break the input into records. It is responsible for providing key-value pairs to the Map() function. The key is actually the record's locational information (its offset in the input) and the value is the data associated with it.

● Map: A map is nothing but a user-defined function whose work is to process the tuples obtained from the RecordReader.

● Combiner: The combiner is used for grouping (locally aggregating) the data in the map workflow.

● Partitioner: The partitioner is responsible for fetching the key-value pairs generated in the mapper phase. The hashcode of each key is computed, and the partitioner takes its modulus with the number of reducers (key.hashCode() % (number of reducers)) to decide which reducer receives each pair, as sketched below.
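
As an illustration of the key.hashCode() % (number of reducers) rule, here is a minimal sketch of a custom partitioner for the word-count job above, written against the org.apache.hadoop.mapreduce.Partitioner API. The class name is illustrative; Hadoop's built-in HashPartitioner does essentially the same thing by default.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each (word, count) pair to a reducer based on the key's hashcode.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative,
        // then take the modulus with the number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be wired into a job with job.setPartitionerClass(WordPartitioner.class) together with job.setNumReduceTasks(n).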

Reduce Task

● Shuffle and Sort: The task of the reducer starts with this step. The process in which the mapper generates the intermediate key-value pairs and transfers them to the reducer task is known as shuffling. Using the shuffling process, the system can sort the data by its key. Shuffling begins once some of the map tasks are done, which is why it is a faster process: it does not wait for the completion of every task performed by the mapper.

● Reduce: The main task of the reduce phase is to gather the tuples generated by the map phase and then perform some sorting and aggregation on those key-value pairs depending on their key, as illustrated below.
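
For example, if the map tasks emit (cat, 1), (dog, 1) and (cat, 1), the shuffle and sort step groups these pairs by key and delivers (cat, [1, 1]) and (dog, [1]) to the reducer, which then sums each list into (cat, 2) and (dog, 1), exactly as in the word-count sketch above.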


HDFS (Hadoop Distributed File System) is utilized for storage in Hadoop. It is mainly designed to work on commodity hardware (inexpensive devices), using a distributed file system design. HDFS is designed in such a way that it prefers storing the data in a few large blocks rather than in many small data blocks.

HDFS provides fault tolerance and high availability to the storage layer and the other devices present in the Hadoop cluster. The data storage nodes in HDFS are:

● NameNode (Master)

● DataNode (Slave)

● The NameNode is mainly used for storing the metadata, i.e. the data about the data. Metadata includes the transaction logs that keep track of the user's activity in a Hadoop cluster.

● DataNodes work as slaves. DataNodes are mainly utilized for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. A sketch of how a client sees blocks spread across DataNodes follows below.
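
As a rough sketch of this division of labor, the snippet below (class name and input path are illustrative) uses the Hadoop FileSystem API to ask the NameNode for a file's block layout; each returned BlockLocation lists the DataNodes that hold a replica of that block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);                 // an existing file in HDFS
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers with the block layout for the whole file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));  // DataNodes holding replicas
        }
        fs.close();
    }
}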

File Block in HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into multiple blocks of 128 MB each by default, and you can also change the block size manually.
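
The default comes from the dfs.blocksize property, which can be overridden cluster-wide in hdfs-site.xml or per client at write time. A minimal sketch of the latter (the 256 MB value and the output path are just examples) could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithCustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.blocksize is the standard HDFS setting; here it is raised from 128 MB to 256 MB.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // Files created through this client configuration are split into 256 MB blocks.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {  // path is illustrative
            out.writeUTF("written with a 256 MB block size");
        }
        fs.close();
    }
}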

YARN (Yet Another Resource Negotiator) is the framework on which MapReduce works. YARN performs 2 operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in a Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which job is important, which job has higher priority, the dependencies between jobs, and other information such as job timing. The resource manager is used to manage all the resources that are made available for running the Hadoop cluster.
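
A MapReduce job expresses the resources it asks YARN for through standard job properties. A minimal sketch follows; the class name is illustrative and the memory values are arbitrary examples, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ResourceHints {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Memory (in MB) for the containers YARN allocates to each map and reduce task.
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
        // Memory for the MapReduce ApplicationMaster container that negotiates with YARN.
        conf.setInt("yarn.app.mapreduce.am.resource.mb", 1536);
        return Job.getInstance(conf, "job with explicit resource requests");
    }
}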

Features of YARN

● Multi-Tenancy

● Scalability

● Cluster-Utilization

● Compatibility

Hadoop Common, or the common utilities, is nothing but the Java library and Java files needed by all the other components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop Common works on the assumption that hardware failure in a Hadoop cluster is common, so it needs to be handled automatically in software by the Hadoop framework.
