Unit 2 Notes BDA


1. What is HDFS? Explain the HDFS architecture in detail.

HDFS is the most important component of the Hadoop ecosystem.


HDFS is the primary storage system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data. It is a distributed file system that runs on commodity hardware. The default configuration works for many installations; large clusters usually need additional tuning. Users and applications interact directly with HDFS through shell-like commands.
HDFS Components:
There are two major components of Hadoop HDFS: the NameNode and the DataNode. Let us now discuss these HDFS components.
i. NameNode
It is also known as the Master node. The NameNode does not store the actual data or dataset. It stores metadata, i.e. the number of blocks, their locations, the rack and the DataNode on which the data is stored, and other details. This metadata consists of files and directories.
Tasks of HDFS NameNode
• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as naming, closing and opening files and directories.
ii. DataNode
It is also known as the Slave node. The HDFS DataNode is responsible for storing the actual data in HDFS. The DataNode performs read and write operations as per the requests of the clients. Each block replica on a DataNode consists of two files on the local file system: the first file holds the data, and the second records the block's metadata, which includes checksums for the data. At startup, each DataNode connects to its corresponding NameNode and performs a handshake, during which the namespace ID and the software version of the DataNode are verified. If a mismatch is found, the DataNode shuts down automatically.
Tasks of HDFS DataNode
• The DataNode performs operations such as block replica creation, deletion and replication according to the instructions of the NameNode.
• The DataNode manages the data storage of the system.
**Advantages:**
HDFS offers several advantages, including:
- High fault tolerance due to data replication.
- Scalability for handling large datasets.
- High throughput for both reading and writing data.
- Streaming data access, suitable for batch processing.
- Data locality: It places computation close to the data,
reducing network traffic.

**Limitations:**
HDFS is optimized for batch processing and is not suitable for
low-latency or real-time access patterns.
In summary, HDFS is a distributed file system designed to
store and manage large datasets across a cluster of machines.
Its architecture ensures fault tolerance, data replication, and
efficient data storage and retrieval, making it a core
component of the Hadoop ecosystem.
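As a purely illustrative sketch of how a client talks to HDFS, the snippet below uses Hadoop's Java FileSystem API: the NameNode is consulted for metadata while the DataNodes serve the actual block data. The path /tmp/hdfs-demo.txt and the sample text are arbitrary choices, not something prescribed by these notes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Create a small file; the NameNode records the metadata while the
    // DataNodes store the actual block replicas
    Path file = new Path("/tmp/hdfs-demo.txt");   // illustrative path
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back through the same FileSystem handle
    try (BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
    fs.close();
  }
}
```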

2. Write a short note on MapReduce & explain the MapReduce process with an example.

Traditional systems tend to use a centralized server for storing and retrieving data. Such a huge amount of data cannot be accommodated by standard database servers. Also, centralized systems create too much of a bottleneck while processing multiple files simultaneously. Google came up with MapReduce to solve such bottleneck issues. MapReduce divides the task into small parts and processes each part independently by assigning it to a different system. After all the parts are processed and analyzed, the output of each computer is collected in one single location, and an output dataset is then prepared for the given problem.

MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment. MapReduce consists of two distinct tasks: Map and Reduce.
• As the name MapReduce suggests, the reducer phase takes
place after the mapper phase has been completed.
• So, the first is the map job, where a block of data is read
and processed to produce key-value pairs as intermediate
outputs.
• The output of a Mapper or map job (key-value pairs) is
input to the Reducer.
• The reducer receives the key-value pair from multiple map
jobs.
• Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
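As a concrete illustration of this flow (a standard word-count example, not taken from these notes), suppose the input consists of the two lines "deer bear river" and "car car river". Each map task emits the intermediate pairs (deer, 1), (bear, 1), (river, 1) and (car, 1), (car, 1), (river, 1). After shuffling and sorting, the reducer receives, for example, (car, [1, 1]) and (river, [1, 1]) and produces the final key-value pairs (bear, 1), (car, 2), (deer, 1), (river, 1).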
Let us now understand more about MapReduce and its components. MapReduce majorly has the following three classes:
Mapper Class
The first stage in data processing using MapReduce is the Mapper Class. Here, the RecordReader processes each input record and generates the corresponding key-value pair. Hadoop stores this intermediate mapper output on the local disk.
• Input Split: It is the logical representation of data. It represents a block of work that contains a single map task in the MapReduce program.
• RecordReader: It interacts with the Input Split and converts the obtained data into key-value pairs.
Reducer Class
The intermediate output generated by the Mapper is fed to the Reducer, which processes it and generates the final output, which is then saved in HDFS.
Driver Class
The major component in a MapReduce job is the Driver Class. It is responsible for setting up a MapReduce job to run in Hadoop. Here we specify the names of the Mapper and Reducer classes along with the data types and their respective job names.
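To tie the three classes together, here is a minimal word-count job written against the org.apache.hadoop.mapreduce API. It is a sketch of the classic example rather than code from these notes; class names such as TokenizerMapper and IntSumReducer are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper class: the RecordReader hands each line to map() as (offset, line);
  // we emit (word, 1) as the intermediate key-value pairs
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer class: receives (word, [1, 1, ...]) and emits (word, total)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver class: wires the Mapper, Reducer, data types and job name together
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Such a job would be run with, for example, `hadoop jar wordcount.jar WordCount /input /output`, where the input and output paths are placeholders.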

In Big Data Analytics, MapReduce plays a crucial role. When it is combined with HDFS, we can use MapReduce to handle Big Data. The basic unit of information used by MapReduce is a key-value pair. All data, whether structured or unstructured, needs to be translated to key-value pairs before it is passed through the MapReduce model. Splitting such a job across many machines in a traditional way raises several challenges, which MapReduce is designed to address:
1. Critical path problem: It is the amount of time taken to finish the job without delaying the next milestone or actual completion date. If any of the machines delays its part of the job, the whole work gets delayed.
2. Reliability problem: What if any of the machines working with a part of the data fails? Managing this failover becomes a challenge.
3. Equal split issue: How will I divide the data into smaller chunks so that each machine gets an even part of the data to work with? In other words, how to divide the data equally so that no individual machine is overloaded or underutilized.
4. The single split may fail: If any of the machines fails to provide its output, I will not be able to calculate the result. So, there should be a mechanism to ensure the fault tolerance of the system.
5. Aggregation of the result: There should be a mechanism to
aggregate the result generated by each of the machines to
produce the final output.

6. What is YARN in Hadoop? Explain the YARN architecture in detail with its components.

YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker that was present in Hadoop 1.0. YARN was described as a "redesigned Resource Manager" at the time of its launch, but it has since evolved into a large-scale distributed operating system used for Big Data processing.
The YARN architecture basically separates the resource management layer from the processing layer. The responsibility of the Hadoop 1.0 Job Tracker is now split between the Resource Manager and the per-application Application Master.
YARN Features: YARN gained popularity because of the
following features-
• Scalability: The scheduler in Resource manager of YARN
architecture allows Hadoop to extend and manage thousands
of nodes and clusters.
• Compatibility: YARN supports the existing map-reduce
applications without disruptions thus making it compatible
with Hadoop 1.0 as well.
• Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
• Multi-tenancy: It allows multiple engine access thus giving
organizations a benefit of multi-tenancy.

The main components of the YARN architecture include:
• Client: It submits map-reduce jobs.
• Resource Manager: It is the master daemon of YARN and is
responsible for resource assignment and management
among all the applications. Whenever it receives a processing
request, it forwards it to the corresponding node manager
and allocates resources for the completion of the request
accordingly. It has two major components:
• Scheduler: It performs scheduling based on the resource requirements of the applications and the available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking, and it does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
• Application Manager: It is responsible for accepting the application and negotiating the first container from the Resource Manager. It also restarts the Application Master container if it fails.

• Node Manager: It takes care of an individual node in a Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep up with the Resource Manager: it registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management and also kills a container based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
• Application Master: An application is a single job submitted to the framework. The Application Master is responsible for negotiating resources with the Resource Manager and for tracking the status and monitoring the progress of a single application. It asks the Node Manager to launch containers by sending it a Container Launch Context (CLC), which includes everything the application needs to run. Once the application is started, it sends health reports to the Resource Manager from time to time.
• Container: It is a collection of physical resources such as RAM, CPU cores and disk on a single node. Containers are invoked via the Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.
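To make the submission path described above concrete, here is a minimal sketch using the org.apache.hadoop.yarn.client.api.YarnClient API. It only shows the Client to Resource Manager hand-off: the client obtains an application id, fills in an ApplicationSubmissionContext with a Container Launch Context and a resource request, and submits it. The application name, the echo command and the 512 MB / 1 vcore request are illustrative assumptions; a real job would start a proper Application Master in that first container.

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the Resource Manager configured in yarn-site.xml
    Configuration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the Resource Manager for a new application id
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("demo-app");   // placeholder name

    // Container Launch Context: the command (plus environment, local resources
    // and security tokens in a real job) for the Application Master container
    ContainerLaunchContext clc = Records.newRecord(ContainerLaunchContext.class);
    clc.setCommands(Collections.singletonList(
        "echo hello-yarn 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"));
    appContext.setAMContainerSpec(clc);

    // Physical resources (RAM in MB, vcores) requested for that container
    appContext.setResource(Resource.newInstance(512, 1));

    // Submit: the Application Manager accepts it and negotiates the first container
    ApplicationId appId = yarnClient.submitApplication(appContext);
    System.out.println("Submitted application " + appId);
  }
}
```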
