
BDA UNIT-2

1) What is Hadoop? What are the requirements of the Hadoop framework? Explain in detail.
A) Hadoop is a framework that uses distributed storage and
parallel processing to store and manage big data. It is the
software most used by data analysts to handle big data, and
its market size continues to grow.
Hadoop has three core components: HDFS, MapReduce, and
YARN. Together these form the Hadoop framework architecture.

 HDFS (Hadoop Distributed File System):


It is Hadoop's data storage system. Since the data sets are huge,
it stores them across a distributed cluster. Data is split into
blocks of 128 MB each (the default block size). HDFS consists of
a NameNode and DataNodes: there is only one active NameNode,
but there can be many DataNodes.
Features:
 The storage is distributed to handle a large data pool.
 Distribution, combined with replication, increases data
availability.
 It is fault-tolerant: if one block replica is lost, the replicas on
other nodes take over.
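As a quick illustration of working with HDFS storage, the standard hdfs dfs shell commands below load and inspect a file (the paths are hypothetical placeholders):

  hdfs dfs -mkdir -p /user/alice/input        # create a directory in HDFS
  hdfs dfs -put data.txt /user/alice/input/   # copy a local file into HDFS
  hdfs dfs -ls /user/alice/input              # list the directory
  hdfs dfs -cat /user/alice/input/data.txt    # read the file back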
 MapReduce:
The MapReduce framework is the processing unit. All data is
distributed and processed in parallel. A MasterNode distributes
data among the SlaveNodes; the SlaveNodes do the processing
and send the results back to the MasterNode.
Features:
 Consists of two phases, the Map phase and the Reduce phase.
 Processes big data faster, with multiple nodes working in
parallel (see the worked example under Question 5).
 YARN (Yet Another Resource Negotiator):
It is the resource management unit of the Hadoop framework.
Data stored in HDFS can be processed with the help of YARN by
different data processing engines (batch, interactive, and stream
processing).
Features:
 It acts like an operating system for the cluster, managing the
resources used to process data stored on HDFS.
 It helps schedule tasks so that no single system is overloaded.

Advantages of Hadoop for Big Data


 Speed. Hadoop’s concurrent MapReduce processing over HDFS
lets users run complex queries on large datasets quickly.
 Diversity. Hadoop’s HDFS can store different data formats, like
structured, semi-structured, and unstructured.
 Cost-Effective. Hadoop is an open-source data framework.
 Resilient. Data stored in a node is replicated in other cluster
nodes, ensuring fault tolerance.
 Scalable. Since Hadoop functions in a distributed environment,
you can easily add more servers.
2) What are the advantages of Hadoop? How do you analyze
data in Hadoop?
A) Advantages of Hadoop for Big Data
 Speed. Hadoop’s concurrent MapReduce processing over HDFS
lets users run complex queries on large datasets quickly.
 Diversity. Hadoop’s HDFS can store different data formats, like
structured, semi-structured, and unstructured.
 Cost-Effective. Hadoop is an open-source data framework.
 Resilient. Data stored in a node is replicated in other cluster
nodes, ensuring fault tolerance.
 Scalable. Since Hadoop functions in a distributed environment,
you can easily add more servers.
(To analyze data in Hadoop: first load the data into HDFS, then
process it with MapReduce. On the processed output we perform
aggregations, pattern finding, and filtering, and finally visualize
the results using Power BI or Tableau.)
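A minimal sketch in plain Python of the aggregation and filtering step described above, applied to the tab-separated output a MapReduce job typically writes (the file name part-r-00000 follows Hadoop's usual output naming; the threshold of 10 is an arbitrary choice):

  from collections import Counter

  counts = Counter()
  with open("part-r-00000") as f:            # job output copied out of HDFS
      for line in f:
          key, value = line.rstrip("\n").split("\t")
          counts[key] += int(value)

  # Filtering: keep only keys that occur at least 10 times
  frequent = {k: v for k, v in counts.items() if v >= 10}
  print(len(frequent), "keys occur at least 10 times")

  # Aggregation: the 5 most frequent keys overall
  for key, value in counts.most_common(5):
      print(key, value)

The resulting numbers can then be exported (e.g., as CSV) into Power BI or Tableau for visualization.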
3) Define HDFS. Describe Name node, Datanode, Job Tracker
and Task Tracker.
A) HDFS (Hadoop Distributed File System):
It is Hadoop's data storage system. Since the data sets are huge,
it stores them across a distributed cluster. Data is split into
blocks of 128 MB each (the default block size). HDFS consists of
a NameNode and DataNodes: there is only one active NameNode,
but there can be many DataNodes.
Features:
 The storage is distributed to handle a large data pool.
 Distribution, combined with replication, increases data
availability.
 It is fault-tolerant: if one block replica is lost, the replicas on
other nodes take over.
1. NameNode
 Definition: The master node in HDFS (Hadoop Distributed File
System).
 Function: Manages the file system namespace (like directories
and file names) and metadata (like where data blocks are
stored).
 Example: When you save a file, NameNode doesn’t store the
file data but keeps the info about which DataNodes have the
file blocks.

2. DataNode
 Definition: The worker node in HDFS.
 Function: Stores the actual data blocks of files.
 Example: When you open a file, DataNodes send the real file
data to you as directed by the NameNode.

3. JobTracker
 Definition: The master node for processing jobs in Hadoop 1's
MapReduce framework.
 Function: Distributes tasks (map and reduce jobs) to different
nodes and monitors their progress.
 Example: When you submit a data processing job, JobTracker
divides it into smaller tasks and assigns them to TaskTrackers.

4. TaskTracker
 Definition: The worker node that runs tasks as assigned by the
JobTracker.
 Function: Executes map and reduce tasks, reports status and
progress to the JobTracker.
 Example: Each TaskTracker runs the actual computation on the
data blocks stored on its own machine.
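To see this division of roles in practice, the standard hdfs fsck command (the path is a hypothetical placeholder) reports which blocks make up a file and which DataNodes hold each replica, i.e., exactly the metadata the NameNode maintains:

  hdfs fsck /user/alice/input/data.txt -files -blocks -locations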

4) Describe the Design Principles of Hadoop.


A) The Design Principles of Hadoop are the core ideas that guide
how Hadoop is built and functions.
1. Scalability
 Hadoop is designed to scale horizontally – you can add more
machines to handle more data and processing.
 It can run on hundreds or thousands of cheap computers
(nodes).

2. Fault Tolerance
 Hadoop handles hardware failures automatically.
 Data is replicated (by default 3 times) across multiple nodes
using HDFS.
 If a node fails, Hadoop continues working using the replicated
data.
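The replication factor and block size are ordinary configuration settings; a minimal hdfs-site.xml sketch, using the standard property names and their common default values:

  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>3</value>          <!-- each block is stored on 3 DataNodes -->
    </property>
    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>  <!-- 128 MB blocks, HDFS's default -->
    </property>
  </configuration>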

3. Data Locality Optimization


 Hadoop moves computation to data, not the other way around.
 This reduces network traffic and increases speed because
processing happens where the data is stored.

4. High Throughput
 Designed for batch processing of large datasets.
 It uses the MapReduce model for parallel processing, which
enables fast and efficient data analysis.

5. Simplicity of Programming
 Programmers write Map and Reduce functions, and Hadoop
handles the rest (like splitting tasks and managing failures).
 Abstracts complex operations, making big data processing
easier.

6. Diversity
 Hadoop can process structured, semi-structured, and
unstructured data (text, images, videos, logs, etc.).
 No strict schema is required for storing data in HDFS.

7. Cost-Effective
 Open-source framework, so no license cost.
5) Explain MapReduce. Demonstrate the working of its various
phases with an appropriate example and diagram.

A) MapReduce is Hadoop's programming model for processing
large data sets in parallel. A job passes through three phases:
1. Map phase: the input is split into chunks, and each mapper
turns its split into intermediate (key, value) pairs. For word
count, the mapper emits (word, 1) for every word it reads.
2. Shuffle and sort: the framework groups the intermediate pairs
by key, so all values for the same key reach the same reducer.
3. Reduce phase: each reducer aggregates the values for a key
and writes the final output. For word count, the reducer sums
the 1s to produce (word, total).
Example: for the input "cat dog cat", the map phase emits
(cat,1), (dog,1), (cat,1); after shuffle and sort the reducer
receives cat → [1,1] and dog → [1]; the final output is (cat,2),
(dog,1). As a diagram, the flow is:
Input → Split → Map → Shuffle & Sort → Reduce → Output.
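A minimal word-count sketch in Python, written in the mapper/reducer style used with Hadoop Streaming (the file names mapper.py and reducer.py are our own choice):

  # mapper.py - Map phase: emit (word, 1) for every word on stdin
  import sys

  for line in sys.stdin:
      for word in line.split():
          print(word + "\t1")

  # reducer.py - Reduce phase: sum the counts for each word.
  # Relies on the shuffle/sort step feeding it lines grouped by key.
  import sys

  current, total = None, 0
  for line in sys.stdin:
      word, count = line.rstrip("\n").split("\t")
      if word != current:
          if current is not None:
              print(current + "\t" + str(total))
          current, total = word, 0
      total += int(count)
  if current is not None:
      print(current + "\t" + str(total))

The pipeline can be tested locally, with sort standing in for the shuffle phase:

  cat input.txt | python mapper.py | sort | python reducer.py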
6) Discuss Hadoop YARN with diagram and explain.
A) YARN (Yet Another Resource Negotiator):
It is the resource management unit of the Hadoop framework.
Data stored in HDFS can be processed with the help of YARN by
different data processing engines (batch, interactive, and stream
processing).
Features:
 It acts like an operating system for the cluster, managing the
resources used to process data stored on HDFS.
 It helps schedule tasks so that no single system is overloaded.
Components of YARN:

Client
 Submits MapReduce jobs to the system.

Resource Manager (Master of YARN)


 Manages and allocates resources for all jobs in the cluster.
 Has two parts:
o Scheduler: Assigns resources based on availability, using
strategies like Fair or Capacity scheduling.
o Application Manager: Accepts job requests and starts the
Application Master. Restarts it if it fails.
Node Manager (One per Node)
 Manages resources and applications on each individual node.
 Starts and stops containers as instructed.

Application Master (One per Job)


 Controls one specific job.
 Talks to the Resource Manager to get resources.
 Talks to the Node Manager to launch containers.
 Monitors job progress and reports back.

Container
 A bundle of resources (like CPU and memory) on a node.
 Runs a part of the application.
 Launched using Container Launch Context (CLC), which
includes settings and files needed to run.
Working of YARN:
1. The client submits an application to the Resource Manager.
2. The Resource Manager allocates a container and launches the
Application Master in it.
3. The Application Master registers with the Resource Manager
and negotiates further containers for the job's tasks.
4. Node Managers launch those containers, which run the tasks.
5. The Application Master monitors progress, and when the job
finishes it deregisters and releases its resources.
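A few standard YARN shell commands that let you observe this in practice (the application id shown is a hypothetical placeholder):

  yarn node -list            # Node Managers in the cluster and their state
  yarn application -list     # running applications and their status
  yarn logs -applicationId application_1700000000000_0001   # logs for one job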
7)Compare RDBMS with Hadoop
A)
RDBMS vs HADOOP
1) RDBMS stores data in tables with rows and columns; Hadoop
stores data in a distributed file system (HDFS).
2) RDBMS is suitable for small to moderate amounts of structured
data; Hadoop handles massive volumes of structured, semi-
structured, and unstructured data.
3) RDBMS runs on a single server or a small cluster; Hadoop runs
across thousands of machines (clusters).
4) RDBMS uses SQL for querying data; Hadoop uses MapReduce,
Hive, or Pig for data processing.
5) RDBMS scales by upgrading the same server (vertical scaling);
Hadoop scales by adding more machines (horizontal scaling).
6) RDBMS processes small datasets fast; Hadoop is optimized for
batch processing of large datasets.
7) RDBMS is expensive; Hadoop is cost-efficient.
8) RDBMS is less fault-tolerant; Hadoop is highly fault-tolerant.

8) Hadoop architecture and components.
A) Same as the framework architecture described in Question 1
(HDFS, MapReduce, and YARN).


9) Difference between Hadoop 1 and Hadoop 2, and their
working.
A)
Hadoop 1 vs Hadoop 2
1) Hadoop 1 used only MapReduce for processing; Hadoop 2
supports MapReduce plus other processing models like Spark.
2) In Hadoop 1 the Job Tracker handled both resource and job
management; Hadoop 2 introduced YARN to separate them.
3) Hadoop 1 had limited scalability; Hadoop 2 improved it.
4) Hadoop 1 had a single point of failure in the Job Tracker; the
YARN architecture of Hadoop 2 avoids it.
5) Hadoop 1 used fixed slots for Map and Reduce tasks; Hadoop 2
allocates resources dynamically via containers.
6) Hadoop 1 was suitable only for batch processing; Hadoop 2
supports batch, real-time, and interactive processing.

Working of Hadoop 1 (MapReduce-based)


1. Client submits a job.
2. Job Tracker divides it into smaller tasks and assigns them to
Task Trackers.
3. Task Trackers run Map & Reduce tasks and report progress.
4. If any node fails, Job Tracker reschedules that task.
Working of Hadoop 2 (YARN-based)
1. Client submits an application.
2. Resource Manager assigns a container to launch the
Application Master.
3. Application Master negotiates with Resource Manager to get
more containers.
4. Node Managers run tasks inside containers.
5. YARN separates resource handling (Resource Manager) and job
execution (Application Master) for better performance and
flexibility.
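In Hadoop 2 this routing of jobs through YARN is selected by one standard configuration property; a minimal mapred-site.xml sketch using the standard property name and value:

  <configuration>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>  <!-- run MapReduce on YARN, not the classic JobTracker -->
    </property>
  </configuration>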

10) Limitations and ecosystem of Hadoop 1 and Hadoop 2.


A) Limitations of Hadoop 1
1. Only supports MapReduce processing.
2. A single JobTracker manages all jobs.
3. Limited scalability.
4. Single point of failure (the JobTracker).
5. Fixed task slots for Map and Reduce.
6. Batch-only processing.
Hadoop 1 – Ecosystem Components
 HDFS – Distributed storage system.
 MapReduce – Core data processing engine.
 Hive – SQL-like query language for MapReduce.
 Pig – Scripting platform for analyzing large data sets.
 Sqoop – Transfers data between Hadoop and RDBMS.
 Flume – Collects and transports logs/data into HDFS.
 Oozie – Workflow scheduler for jobs.
 Mahout – Builds scalable machine learning algorithms.
Hadoop 2 – Limitations
 Complex architecture compared to Hadoop 1.
 YARN adds overhead in terms of configuration and
management.
 Still batch-oriented at the core (though supports more models).
 Requires proper tuning and monitoring for high performance.
Hadoop 2 – Ecosystem Components
 HDFS – Distributed storage system.
 MapReduce – Core data processing engine.
 Hive – SQL-like query language for MapReduce.
 Pig – Scripting platform for analyzing large data sets.
 Sqoop – Transfers data between Hadoop and RDBMS.
 Flume – Collects and transports logs/data into HDFS.
 Oozie – Workflow scheduler for jobs.
 Mahout – Builds scalable machine learning algorithms.
 YARN – Resource management layer (replaces JobTracker).
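As a concrete taste of one ecosystem tool, a typical Sqoop import that copies an RDBMS table into HDFS (the connection string, table name, and target directory are hypothetical placeholders):

  sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --table orders \
    --target-dir /user/data/orders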
