
BDA UNIT-2

1) What is Hadoop? What are the requirements of the Hadoop framework? Explain in detail.
A) Hadoop is a framework that uses distributed storage and
parallel processing to store and manage big data. It is the
software most used by data analysts to handle big data, and
its market size continues to grow.
Hadoop has three core components: HDFS, MapReduce, and
YARN. Together these form the Hadoop framework architecture.

 HDFS (Hadoop Distributed File System):


It is Hadoop's data storage system. Since the data sets are huge,
it stores them across a distributed cluster. Data is split into
blocks of 128 MB each (the default block size). HDFS consists of
a NameNode and DataNodes: there is only one active NameNode,
but there can be many DataNodes.
Features:
 The storage is distributed to handle a large data pool.
 Distribution, combined with replication, increases data
availability.
 It is fault-tolerant: if one block replica is lost, the replicas on
other nodes take over.
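As a quick illustration of working with HDFS storage, the standard hdfs dfs shell commands below load and inspect a file (the paths are hypothetical placeholders):

  hdfs dfs -mkdir -p /user/alice/input        # create a directory in HDFS
  hdfs dfs -put data.txt /user/alice/input/   # copy a local file into HDFS
  hdfs dfs -ls /user/alice/input              # list the directory
  hdfs dfs -cat /user/alice/input/data.txt    # read the file back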
 MapReduce:
The MapReduce framework is the processing unit. All data is
distributed and processed in parallel. A MasterNode distributes
data among the SlaveNodes; the SlaveNodes do the processing
and send the results back to the MasterNode.
Features:
 Consists of two phases, the Map phase and the Reduce phase.
 Processes big data faster, with multiple nodes working in
parallel (see the worked example under Question 5).
 YARN (Yet Another Resource Negotiator):
It is the resource management unit of the Hadoop framework.
Data stored in HDFS can be processed with the help of YARN by
different data processing engines (batch, interactive, and stream
processing).
Features:
 It acts like an operating system for the cluster, managing the
resources used to process data stored on HDFS.
 It helps schedule tasks so that no single system is overloaded.

Advantages of Hadoop for Big Data


 Speed. Hadoop’s concurrent MapReduce processing over HDFS
lets users run complex queries on large datasets quickly.
 Diversity. Hadoop’s HDFS can store different data formats, like
structured, semi-structured, and unstructured.
 Cost-Effective. Hadoop is an open-source data framework.
 Resilient. Data stored in a node is replicated in other cluster
nodes, ensuring fault tolerance.
 Scalable. Since Hadoop functions in a distributed environment,
you can easily add more servers.
2) What are the advantages of Hadoop? How do you analyze
data in Hadoop?
A) Advantages of Hadoop for Big Data
 Speed. Hadoop’s concurrent MapReduce processing over HDFS
lets users run complex queries on large datasets quickly.
 Diversity. Hadoop’s HDFS can store different data formats, like
structured, semi-structured, and unstructured.
 Cost-Effective. Hadoop is an open-source data framework.
 Resilient. Data stored in a node is replicated in other cluster
nodes, ensuring fault tolerance.
 Scalable. Since Hadoop functions in a distributed environment,
you can easily add more servers.
(To analyze data in Hadoop: first load the data into HDFS, then
process it with MapReduce. On the processed output we perform
aggregations, pattern finding, and filtering, and finally visualize
the results using Power BI or Tableau.)
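A minimal sketch in plain Python of the aggregation and filtering step described above, applied to the tab-separated output a MapReduce job typically writes (the file name part-r-00000 follows Hadoop's usual output naming; the threshold of 10 is an arbitrary choice):

  from collections import Counter

  counts = Counter()
  with open("part-r-00000") as f:            # job output copied out of HDFS
      for line in f:
          key, value = line.rstrip("\n").split("\t")
          counts[key] += int(value)

  # Filtering: keep only keys that occur at least 10 times
  frequent = {k: v for k, v in counts.items() if v >= 10}
  print(len(frequent), "keys occur at least 10 times")

  # Aggregation: the 5 most frequent keys overall
  for key, value in counts.most_common(5):
      print(key, value)

The resulting numbers can then be exported (e.g., as CSV) into Power BI or Tableau for visualization.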
3) Define HDFS. Describe Name node, Datanode, Job Tracker
and Task Tracker.
A) HDFS (Hadoop Distributed File System):
It is Hadoop's data storage system. Since the data sets are huge,
it stores them across a distributed cluster. Data is split into
blocks of 128 MB each (the default block size). HDFS consists of
a NameNode and DataNodes: there is only one active NameNode,
but there can be many DataNodes.
Features:
 The storage is distributed to handle a large data pool.
 Distribution, combined with replication, increases data
availability.
 It is fault-tolerant: if one block replica is lost, the replicas on
other nodes take over.
1. NameNode
 Definition: The master node in HDFS (Hadoop Distributed File
System).
 Function: Manages the file system namespace (like directories
and file names) and metadata (like where data blocks are
stored).
 Example: When you save a file, NameNode doesn’t store the
file data but keeps the info about which DataNodes have the
file blocks.

2. DataNode
 Definition: The worker node in HDFS.
 Function: Stores the actual data blocks of files.
 Example: When you open a file, DataNodes send the real file
data to you as directed by the NameNode.

3. JobTracker
 Definition: The master node for processing jobs in Hadoop 1's
MapReduce framework.
 Function: Distributes tasks (map and reduce jobs) to different
nodes and monitors their progress.
 Example: When you submit a data processing job, JobTracker
divides it into smaller tasks and assigns them to TaskTrackers.

4. TaskTracker
 Definition: The worker node that runs tasks as assigned by the
JobTracker.
 Function: Executes map and reduce tasks, reports status and
progress to the JobTracker.
 Example: Each TaskTracker runs the actual computation on the
data blocks stored on its own machine.
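To see this division of roles in practice, the standard hdfs fsck command (the path is a hypothetical placeholder) reports which blocks make up a file and which DataNodes hold each replica, i.e., exactly the metadata the NameNode maintains:

  hdfs fsck /user/alice/input/data.txt -files -blocks -locations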

4) Describe the Design Principles of Hadoop.


A) The Design Principles of Hadoop are the core ideas that guide
how Hadoop is built and functions.
1. Scalability
 Hadoop is designed to scale horizontally – you can add more
machines to handle more data and processing.
 It can run on hundreds or thousands of cheap computers
(nodes).

2. Fault Tolerance
 Hadoop handles hardware failures automatically.
 Data is replicated (by default 3 times) across multiple nodes
using HDFS.
 If a node fails, Hadoop continues working using the replicated
data.
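The replication factor and block size are ordinary configuration settings; a minimal hdfs-site.xml sketch, using the standard property names and their common default values:

  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>3</value>          <!-- each block is stored on 3 DataNodes -->
    </property>
    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>  <!-- 128 MB blocks, HDFS's default -->
    </property>
  </configuration>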

3. Data Locality Optimization


 Hadoop moves computation to data, not the other way around.
 This reduces network traffic and increases speed because
processing happens where the data is stored.

4. High Throughput
 Designed for batch processing of large datasets.
 It uses the MapReduce model for parallel processing, which
enables fast and efficient data analysis.

5. Simplicity of Programming
 Programmers write Map and Reduce functions, and Hadoop
handles the rest (like splitting tasks and managing failures).
 Abstracts complex operations, making big data processing
easier.

6. Diversity
 Hadoop can process structured, semi-structured, and
unstructured data (text, images, videos, logs, etc.).
 No strict schema is required for storing data in HDFS.

7. Cost-Effective
 Open-source framework, so no license cost.
5) Explain MapReduce. Demonstrate the working of its various
phases with an appropriate example and diagram.

A) MapReduce is Hadoop's programming model for processing
large data sets in parallel. A job passes through three phases:
1. Map phase: the input is split into chunks, and each mapper
turns its split into intermediate (key, value) pairs. For word
count, the mapper emits (word, 1) for every word it reads.
2. Shuffle and sort: the framework groups the intermediate pairs
by key, so all values for the same key reach the same reducer.
3. Reduce phase: each reducer aggregates the values for a key
and writes the final output. For word count, the reducer sums
the 1s to produce (word, total).
Example: for the input "cat dog cat", the map phase emits
(cat,1), (dog,1), (cat,1); after shuffle and sort the reducer
receives cat → [1,1] and dog → [1]; the final output is (cat,2),
(dog,1). As a diagram, the flow is:
Input → Split → Map → Shuffle & Sort → Reduce → Output.
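A minimal word-count sketch in Python, written in the mapper/reducer style used with Hadoop Streaming (the file names mapper.py and reducer.py are our own choice):

  # mapper.py - Map phase: emit (word, 1) for every word on stdin
  import sys

  for line in sys.stdin:
      for word in line.split():
          print(word + "\t1")

  # reducer.py - Reduce phase: sum the counts for each word.
  # Relies on the shuffle/sort step feeding it lines grouped by key.
  import sys

  current, total = None, 0
  for line in sys.stdin:
      word, count = line.rstrip("\n").split("\t")
      if word != current:
          if current is not None:
              print(current + "\t" + str(total))
          current, total = word, 0
      total += int(count)
  if current is not None:
      print(current + "\t" + str(total))

The pipeline can be tested locally, with sort standing in for the shuffle phase:

  cat input.txt | python mapper.py | sort | python reducer.py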
6) Discuss Hadoop YARN with diagram and explain.
A) YARN (Yet Another Resource Negotiator):
It is the resource management unit of the Hadoop framework.
Data stored in HDFS can be processed with the help of YARN by
different data processing engines (batch, interactive, and stream
processing).
Features:
 It acts like an operating system for the cluster, managing the
resources used to process data stored on HDFS.
 It helps schedule tasks so that no single system is overloaded.
Components of YARN:

Client
 Submits MapReduce jobs to the system.

Resource Manager (Master of YARN)


 Manages and allocates resources for all jobs in the cluster.
 Has two parts:
o Scheduler: Assigns resources based on availability, using
strategies like Fair or Capacity scheduling.
o Application Manager: Accepts job requests and starts the
Application Master. Restarts it if it fails.
Node Manager (One per Node)
 Manages resources and applications on each individual node.
 Starts and stops containers as instructed.

Application Master (One per Job)


 Controls one specific job.
 Talks to the Resource Manager to get resources.
 Talks to the Node Manager to launch containers.
 Monitors job progress and reports back.

Container
 A bundle of resources (like CPU and memory) on a node.
 Runs a part of the application.
 Launched using Container Launch Context (CLC), which
includes settings and files needed to run.
Working of YARN:
1. The client submits an application to the Resource Manager.
2. The Resource Manager allocates a container and launches the
Application Master in it.
3. The Application Master registers with the Resource Manager
and negotiates further containers for the job's tasks.
4. Node Managers launch those containers, which run the tasks.
5. The Application Master monitors progress, and when the job
finishes it deregisters and releases its resources.
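A few standard YARN shell commands that let you observe this in practice (the application id shown is a hypothetical placeholder):

  yarn node -list            # Node Managers in the cluster and their state
  yarn application -list     # running applications and their status
  yarn logs -applicationId application_1700000000000_0001   # logs for one job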
7)Compare RDBMS with Hadoop
A)
RDBMS vs HADOOP
1) RDBMS stores data in tables with rows and columns; Hadoop
stores data in a distributed file system (HDFS).
2) RDBMS is suitable for small to moderate amounts of structured
data; Hadoop handles massive volumes of structured, semi-
structured, and unstructured data.
3) RDBMS runs on a single server or a small cluster; Hadoop runs
across thousands of machines (clusters).
4) RDBMS uses SQL for querying data; Hadoop uses MapReduce,
Hive, or Pig for data processing.
5) RDBMS scales by upgrading the same server (vertical scaling);
Hadoop scales by adding more machines (horizontal scaling).
6) RDBMS processes small datasets fast; Hadoop is optimized for
batch processing of large datasets.
7) RDBMS is expensive; Hadoop is cost-efficient.
8) RDBMS is less fault-tolerant; Hadoop is highly fault-tolerant.

8) Hadoop architecture and components.
A) Same as the framework architecture described in Question 1
(HDFS, MapReduce, and YARN).


9) Difference between Hadoop 1 and Hadoop 2, and their
working.
A)
Hadoop 1 vs Hadoop 2
1) Hadoop 1 used only MapReduce for processing; Hadoop 2
supports MapReduce plus other processing models like Spark.
2) In Hadoop 1 the Job Tracker handled both resource and job
management; Hadoop 2 introduced YARN to separate them.
3) Hadoop 1 had limited scalability; Hadoop 2 improved it.
4) Hadoop 1 had a single point of failure in the Job Tracker; the
YARN architecture of Hadoop 2 avoids it.
5) Hadoop 1 used fixed slots for Map and Reduce tasks; Hadoop 2
allocates resources dynamically via containers.
6) Hadoop 1 was suitable only for batch processing; Hadoop 2
supports batch, real-time, and interactive processing.

Working of Hadoop 1 (MapReduce-based)


1. Client submits a job.
2. Job Tracker divides it into smaller tasks and assigns them to
Task Trackers.
3. Task Trackers run Map & Reduce tasks and report progress.
4. If any node fails, Job Tracker reschedules that task.
Working of Hadoop 2 (YARN-based)
1. Client submits an application.
2. Resource Manager assigns a container to launch the
Application Master.
3. Application Master negotiates with Resource Manager to get
more containers.
4. Node Managers run tasks inside containers.
5. YARN separates resource handling (Resource Manager) and job
execution (Application Master) for better performance and
flexibility.
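In Hadoop 2 this routing of jobs through YARN is selected by one standard configuration property; a minimal mapred-site.xml sketch using the standard property name and value:

  <configuration>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>  <!-- run MapReduce on YARN, not the classic JobTracker -->
    </property>
  </configuration>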

10) Limitations and ecosystem of Hadoop 1 and Hadoop 2.


A) Limitations of Hadoop 1
1. Only supports MapReduce processing.
2. A single JobTracker manages all jobs.
3. Limited scalability.
4. Single point of failure (the JobTracker).
5. Fixed task slots for Map and Reduce.
6. Batch-only processing.
Hadoop 1 – Ecosystem Components
 HDFS – Distributed storage system.
 MapReduce – Core data processing engine.
 Hive – SQL-like query language for MapReduce.
 Pig – Scripting platform for analyzing large data sets.
 Sqoop – Transfers data between Hadoop and RDBMS.
 Flume – Collects and transports logs/data into HDFS.
 Oozie – Workflow scheduler for jobs.
 Mahout – Builds scalable machine learning algorithms.
Hadoop 2 – Limitations
 Complex architecture compared to Hadoop 1.
 YARN adds overhead in terms of configuration and
management.
 Still batch-oriented at the core (though supports more models).
 Requires proper tuning and monitoring for high performance.
Hadoop 2 – Ecosystem Components
 HDFS – Distributed storage system.
 MapReduce – Core data processing engine.
 Hive – SQL-like query language for MapReduce.
 Pig – Scripting platform for analyzing large data sets.
 Sqoop – Transfers data between Hadoop and RDBMS.
 Flume – Collects and transports logs/data into HDFS.
 Oozie – Workflow scheduler for jobs.
 Mahout – Builds scalable machine learning algorithms.
 YARN – Resource management layer (replaces JobTracker).
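As a concrete taste of one ecosystem tool, a typical Sqoop import that copies an RDBMS table into HDFS (the connection string, table name, and target directory are hypothetical placeholders):

  sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --table orders \
    --target-dir /user/data/orders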
