Unit 2 Notes BDA


1. What is HDFS? Explain the HDFS architecture in detail.

HDFS is the most important component of the Hadoop ecosystem.


HDFS is the primary storage system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data. It is a distributed file system that runs on commodity hardware. The default configuration works for many installations; large clusters usually need additional tuning. Users and applications interact directly with HDFS through shell-like commands.
HDFS Components:
There are two major components of Hadoop HDFS: the NameNode and the DataNode. Let us now discuss these HDFS components.
i. NameNode
It is also known as the Master node. The NameNode does not store the actual data or dataset. It stores metadata, i.e. the number of blocks, their locations, the rack and the DataNode on which the data is stored, and other details. This metadata consists of files and directories.
Tasks of HDFS NameNode
• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as naming, closing and opening files and directories.
ii. DataNode
It is also known as the Slave node. The HDFS DataNode is responsible for storing the actual data in HDFS. The DataNode performs read and write operations as per the requests of the clients. Each block replica on a DataNode consists of two files on the local file system: the first file holds the data, and the second records the block's metadata, which includes checksums for the data. At startup, each DataNode connects to its corresponding NameNode and performs a handshake, during which the namespace ID and the software version of the DataNode are verified. If a mismatch is found, the DataNode shuts down automatically.
Tasks of HDFS DataNode
• The DataNode performs operations such as block replica creation, deletion and replication according to the instructions of the NameNode.
• The DataNode manages the data storage of the system.
**Advantages:**
HDFS offers several advantages, including:
- High fault tolerance due to data replication.
- Scalability for handling large datasets.
- High throughput for both reading and writing data.
- Streaming data access, suitable for batch processing.
- Data locality: It places computation close to the data,
reducing network traffic.

**Limitations:**
HDFS is optimized for batch processing and is not suitable for
low-latency or real-time access patterns.
In summary, HDFS is a distributed file system designed to
store and manage large datasets across a cluster of machines.
Its architecture ensures fault tolerance, data replication, and
efficient data storage and retrieval, making it a core
component of the Hadoop ecosystem.
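As a purely illustrative sketch of how a client talks to HDFS, the snippet below uses Hadoop's Java FileSystem API: the NameNode is consulted for metadata while the DataNodes serve the actual block data. The path /tmp/hdfs-demo.txt and the sample text are arbitrary choices, not something prescribed by these notes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Create a small file; the NameNode records the metadata while the
    // DataNodes store the actual block replicas
    Path file = new Path("/tmp/hdfs-demo.txt");   // illustrative path
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back through the same FileSystem handle
    try (BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
    fs.close();
  }
}
```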

2. Write a short note on MapReduce & explain the MapReduce process with an example.

Traditional systems tend to use a centralized server for storing and retrieving data. Such a huge amount of data cannot be accommodated by standard database servers. Also, centralized systems create too much of a bottleneck while processing multiple files simultaneously. Google came up with MapReduce to solve such bottleneck issues. MapReduce divides the task into small parts and processes each part independently by assigning it to a different system. After all the parts are processed and analyzed, the output of each computer is collected in one single location, and an output dataset is then prepared for the given problem.

MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment. MapReduce consists of two distinct tasks: Map and Reduce.
• As the name MapReduce suggests, the reducer phase takes
place after the mapper phase has been completed.
• So, the first is the map job, where a block of data is read
and processed to produce key-value pairs as intermediate
outputs.
• The output of a Mapper or map job (key-value pairs) is
input to the Reducer.
• The reducer receives the key-value pair from multiple map
jobs.
• Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
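As a concrete illustration of this flow (a standard word-count example, not taken from these notes), suppose the input consists of the two lines "deer bear river" and "car car river". Each map task emits the intermediate pairs (deer, 1), (bear, 1), (river, 1) and (car, 1), (car, 1), (river, 1). After shuffling and sorting, the reducer receives, for example, (car, [1, 1]) and (river, [1, 1]) and produces the final key-value pairs (bear, 1), (car, 2), (deer, 1), (river, 1).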
Let us now understand more about MapReduce and its components. MapReduce majorly has the following three classes:
Mapper Class
The first stage in data processing using MapReduce is the Mapper Class. Here, the RecordReader processes each input record and generates the corresponding key-value pair. Hadoop stores this intermediate mapper output on the local disk.
• Input Split: It is the logical representation of data. It represents a block of work that contains a single map task in the MapReduce program.
• RecordReader: It interacts with the Input Split and converts the obtained data into key-value pairs.
Reducer Class
The intermediate output generated by the Mapper is fed to the Reducer, which processes it and generates the final output, which is then saved in HDFS.
Driver Class
The major component in a MapReduce job is the Driver Class. It is responsible for setting up a MapReduce job to run in Hadoop. Here we specify the names of the Mapper and Reducer classes along with the data types and their respective job names.
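To tie the three classes together, here is a minimal word-count job written against the org.apache.hadoop.mapreduce API. It is a sketch of the classic example rather than code from these notes; class names such as TokenizerMapper and IntSumReducer are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper class: the RecordReader hands each line to map() as (offset, line);
  // we emit (word, 1) as the intermediate key-value pairs
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer class: receives (word, [1, 1, ...]) and emits (word, total)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver class: wires the Mapper, Reducer, data types and job name together
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Such a job would be run with, for example, `hadoop jar wordcount.jar WordCount /input /output`, where the input and output paths are placeholders.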

In Big Data Analytics, MapReduce plays a crucial role. When it is combined with HDFS, we can use MapReduce to handle Big Data. The basic unit of information used by MapReduce is a key-value pair. All data, whether structured or unstructured, needs to be translated to key-value pairs before it is passed through the MapReduce model. Splitting such a job across many machines in a traditional way raises several challenges, which MapReduce is designed to address:
1. Critical path problem: It is the amount of time taken to finish the job without delaying the next milestone or actual completion date. If any of the machines delays its part of the job, the whole work gets delayed.
2. Reliability problem: What if any of the machines working with a part of the data fails? Managing this failover becomes a challenge.
3. Equal split issue: How will I divide the data into smaller chunks so that each machine gets an even part of the data to work with? In other words, how to divide the data equally so that no individual machine is overloaded or underutilized.
4. The single split may fail: If any of the machines fails to provide its output, I will not be able to calculate the result. So, there should be a mechanism to ensure the fault tolerance of the system.
5. Aggregation of the result: There should be a mechanism to
aggregate the result generated by each of the machines to
produce the final output.

6. What is YARN in Hadoop? Explain the YARN architecture in detail with its components.

YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker that was present in Hadoop 1.0. YARN was described as a "redesigned Resource Manager" at the time of its launch, but it has since evolved into a large-scale distributed operating system used for Big Data processing.
The YARN architecture basically separates the resource management layer from the processing layer. The responsibility of the Hadoop 1.0 Job Tracker is now split between the Resource Manager and the per-application Application Master.
YARN Features: YARN gained popularity because of the
following features-
• Scalability: The scheduler in Resource manager of YARN
architecture allows Hadoop to extend and manage thousands
of nodes and clusters.
• Compatibility: YARN supports the existing map-reduce
applications without disruptions thus making it compatible
with Hadoop 1.0 as well.
• Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
• Multi-tenancy: It allows multiple engine access thus giving
organizations a benefit of multi-tenancy.

The main components of the YARN architecture include:
• Client: It submits map-reduce jobs.
• Resource Manager: It is the master daemon of YARN and is
responsible for resource assignment and management
among all the applications. Whenever it receives a processing
request, it forwards it to the corresponding node manager
and allocates resources for the completion of the request
accordingly. It has two major components:
• Scheduler: It performs scheduling based on the resource requirements of the applications and the available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking, and it does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
• Application Manager: It is responsible for accepting the application and negotiating the first container from the Resource Manager. It also restarts the Application Master container if it fails.

• Node Manager: It takes care of an individual node in a Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep up with the Resource Manager: it registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management and also kills a container based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
• Application Master: An application is a single job submitted to the framework. The Application Master is responsible for negotiating resources with the Resource Manager and for tracking the status and monitoring the progress of a single application. It asks the Node Manager to launch containers by sending it a Container Launch Context (CLC), which includes everything the application needs to run. Once the application is started, it sends health reports to the Resource Manager from time to time.
• Container: It is a collection of physical resources such as RAM, CPU cores and disk on a single node. Containers are invoked via the Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.
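To make the submission path described above concrete, here is a minimal sketch using the org.apache.hadoop.yarn.client.api.YarnClient API. It only shows the Client to Resource Manager hand-off: the client obtains an application id, fills in an ApplicationSubmissionContext with a Container Launch Context and a resource request, and submits it. The application name, the echo command and the 512 MB / 1 vcore request are illustrative assumptions; a real job would start a proper Application Master in that first container.

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the Resource Manager configured in yarn-site.xml
    Configuration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the Resource Manager for a new application id
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("demo-app");   // placeholder name

    // Container Launch Context: the command (plus environment, local resources
    // and security tokens in a real job) for the Application Master container
    ContainerLaunchContext clc = Records.newRecord(ContainerLaunchContext.class);
    clc.setCommands(Collections.singletonList(
        "echo hello-yarn 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"));
    appContext.setAMContainerSpec(clc);

    // Physical resources (RAM in MB, vcores) requested for that container
    appContext.setResource(Resource.newInstance(512, 1));

    // Submit: the Application Manager accepts it and negotiates the first container
    ApplicationId appId = yarnClient.submitApplication(appContext);
    System.out.println("Submitted application " + appId);
  }
}
```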
