DM Hadoop Architecture

Hadoop uses the MapReduce programming algorithm and framework to process large datasets in parallel across clusters. It consists of four main components - MapReduce, HDFS, YARN, and common utilities. MapReduce breaks data into key-value pairs, maps functions to process them in parallel, and reduces the results. HDFS provides fault-tolerant storage across clusters with a master NameNode and slave DataNodes. YARN manages resources and scheduling.

Hadoop works on the MapReduce programming algorithm that was introduced by Google. Today, many big-brand companies use Hadoop in their organizations to deal with big data, e.g. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists of 4 components:

● MapReduce

● HDFS (Hadoop Distributed File System)

● YARN (Yet Another Resource Negotiator)

● Common Utilities or Hadoop Common

MapReduce is a programming model and processing framework that runs on top of the YARN framework. The major feature of MapReduce is that it performs distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop work so fast. When you are dealing with big data, serial processing is no longer practical. MapReduce has mainly 2 tasks, which are divided phase-wise:

● The Map() function breaks the input data blocks into tuples, which are nothing but key-value pairs.
● The Reduce() function then combines these tuples (key-value pairs) based on their key, forms a new set of tuples, and performs operations such as sorting, summation, etc. (a minimal word-count sketch follows this list).
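
To make the two phases concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class name and input/output paths are illustrative; the Mapper emits (word, 1) tuples and the Reducer sums the counts for each key.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): turns each input line into (word, 1) key-value tuples.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    // Reduce(): receives (word, [1, 1, ...]) after shuffle/sort and sums the list.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}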

Map Task:

● RecordReader: The purpose of the RecordReader is to break the input into records. It is responsible for providing key-value pairs to the Map() function. The key is actually the record's locational information (its offset in the input) and the value is the data associated with it.

● Map: A map is nothing but a user-defined function whose work is to process the tuples obtained from the RecordReader.

● Combiner: The combiner is used for grouping (locally aggregating) the data in the map workflow.

● Partitioner: The partitioner is responsible for fetching the key-value pairs generated in the mapper phase. The hashcode of each key is computed, and the partitioner takes its modulus with the number of reducers (key.hashCode() % (number of reducers)) to decide which reducer receives each pair, as sketched below.
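
As an illustration of the key.hashCode() % (number of reducers) rule, here is a minimal sketch of a custom partitioner for the word-count job above, written against the org.apache.hadoop.mapreduce.Partitioner API. The class name is illustrative; Hadoop's built-in HashPartitioner does essentially the same thing by default.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each (word, count) pair to a reducer based on the key's hashcode.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative,
        // then take the modulus with the number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be wired into a job with job.setPartitionerClass(WordPartitioner.class) together with job.setNumReduceTasks(n).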

Reduce Task

● Shuffle and Sort: The task of the reducer starts with this step. The process in which the mapper generates the intermediate key-value pairs and transfers them to the reducer task is known as shuffling. Using the shuffling process, the system can sort the data by its key. Shuffling begins once some of the map tasks are done, which is why it is a faster process: it does not wait for the completion of every task performed by the mapper.

● Reduce: The main task of the reduce phase is to gather the tuples generated by the map phase and then perform some sorting and aggregation on those key-value pairs depending on their key, as illustrated below.
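
For example, if the map tasks emit (cat, 1), (dog, 1) and (cat, 1), the shuffle and sort step groups these pairs by key and delivers (cat, [1, 1]) and (dog, [1]) to the reducer, which then sums each list into (cat, 2) and (dog, 1), exactly as in the word-count sketch above.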


HDFS (Hadoop Distributed File System) is utilized for storage in Hadoop. It is mainly designed to work on commodity hardware (inexpensive devices), using a distributed file system design. HDFS is designed in such a way that it prefers storing the data in a few large blocks rather than in many small data blocks.

HDFS provides fault tolerance and high availability to the storage layer and the other devices present in the Hadoop cluster. The data storage nodes in HDFS are:

● NameNode (Master)

● DataNode (Slave)

● The NameNode is mainly used for storing the metadata, i.e. the data about the data. Metadata includes the transaction logs that keep track of the user's activity in a Hadoop cluster.

● DataNodes work as slaves. DataNodes are mainly utilized for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. A sketch of how a client sees blocks spread across DataNodes follows below.
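
As a rough sketch of this division of labor, the snippet below (class name and input path are illustrative) uses the Hadoop FileSystem API to ask the NameNode for a file's block layout; each returned BlockLocation lists the DataNodes that hold a replica of that block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);                 // an existing file in HDFS
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers with the block layout for the whole file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));  // DataNodes holding replicas
        }
        fs.close();
    }
}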

File Block in HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into multiple blocks of 128 MB each by default, and you can also change the block size manually.
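
The default comes from the dfs.blocksize property, which can be overridden cluster-wide in hdfs-site.xml or per client at write time. A minimal sketch of the latter (the 256 MB value and the output path are just examples) could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithCustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.blocksize is the standard HDFS setting; here it is raised from 128 MB to 256 MB.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // Files created through this client configuration are split into 256 MB blocks.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {  // path is illustrative
            out.writeUTF("written with a 256 MB block size");
        }
        fs.close();
    }
}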

YARN (Yet Another Resource Negotiator) is the framework on which MapReduce works. YARN performs 2 operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in a Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which job is important, which job has higher priority, the dependencies between jobs, and other information such as job timing. The resource manager is used to manage all the resources that are made available for running the Hadoop cluster.
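
A MapReduce job expresses the resources it asks YARN for through standard job properties. A minimal sketch follows; the class name is illustrative and the memory values are arbitrary examples, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ResourceHints {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Memory (in MB) for the containers YARN allocates to each map and reduce task.
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
        // Memory for the MapReduce ApplicationMaster container that negotiates with YARN.
        conf.setInt("yarn.app.mapreduce.am.resource.mb", 1536);
        return Job.getInstance(conf, "job with explicit resource requests");
    }
}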

Features of YARN

● Multi-Tenancy

● Scalability

● Cluster-Utilization

● Compatibility

Hadoop Common, or the common utilities, is nothing but the Java library and Java files needed by all the other components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop Common works on the assumption that hardware failure in a Hadoop cluster is common, so it needs to be handled automatically in software by the Hadoop framework.
