BDA Unit 2
2. DataNode
Definition: The worker node in HDFS.
Function: Stores the actual data blocks of files.
Example: When you open a file, DataNodes send the real file
data to you as directed by the NameNode.
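A minimal Python sketch of this read path, assuming toy NameNode and DataNode classes (the names, attributes, and methods below are illustrative, not the real HDFS client API):

```python
# Toy model of an HDFS read: the NameNode only answers "which DataNodes hold
# block X?", while the DataNodes hold the actual bytes.
# All class and attribute names here are illustrative, not the real HDFS API.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}                      # block id -> bytes stored on this node

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    def __init__(self):
        self.block_map = {}                   # block id -> DataNodes holding it

    def locate(self, block_id):
        return self.block_map[block_id]


# Client side: ask the NameNode where each block lives, then read the data
# directly from the DataNodes.
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["file1-block0"] = b"hello "
dn2.blocks["file1-block1"] = b"world"

nn = NameNode()
nn.block_map = {"file1-block0": [dn1], "file1-block1": [dn2]}

data = b"".join(nn.locate(b)[0].read(b) for b in ["file1-block0", "file1-block1"])
print(data.decode())                          # -> hello world
```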
3. JobTracker
Definition: The master node for processing jobs (used in the
MapReduce framework).
Function: Distributes tasks (map and reduce jobs) to different
nodes and monitors their progress.
Example: When you submit a data processing job, JobTracker
divides it into smaller tasks and assigns them to TaskTrackers.
4. TaskTracker
Definition: The worker node that runs tasks as assigned by the
JobTracker.
Function: Executes map and reduce tasks, reports status and
progress to the JobTracker.
Example: Each TaskTracker runs the actual computation on the
data blocks stored on its own machine.
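A rough Python sketch of this master/worker split; run_job, the task function, and the tracker names are made-up for illustration and are not the Hadoop (MRv1) API:

```python
# Toy JobTracker/TaskTracker flow: the "JobTracker" splits a job into tasks
# and assigns them round-robin; each "TaskTracker" runs its tasks and reports
# the results back. Purely illustrative, not the real Hadoop (MRv1) API.

def run_job(input_splits, task_fn, task_trackers):
    # JobTracker: assign one task per input split, spread across trackers.
    assignments = {tt: [] for tt in task_trackers}
    for i, split in enumerate(input_splits):
        assignments[task_trackers[i % len(task_trackers)]].append(split)

    # TaskTrackers: execute their tasks and report (tracker, result) pairs.
    reports = []
    for tracker, splits in assignments.items():
        for split in splits:
            reports.append((tracker, task_fn(split)))
    return reports

def count_errors(text):
    return text.split().count("error")

splits = ["error warn info", "info info error", "warn warn warn"]
print(run_job(splits, count_errors, ["tt1", "tt2"]))
# -> [('tt1', 1), ('tt1', 0), ('tt2', 1)]
```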
2. Fault Tolerance
Hadoop handles hardware failures automatically.
Data is replicated (by default 3 times) across multiple nodes
using HDFS.
If a node fails, Hadoop continues working using the replicated
data.
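A small Python sketch of the replication idea above; write_block, read_block, and the node names are invented for illustration, not HDFS code:

```python
import random

# Toy view of HDFS replication: every block is written to 3 different nodes,
# so losing one node still leaves two readable copies.
# Node names and the helper functions are illustrative, not HDFS code.

REPLICATION_FACTOR = 3
nodes = {f"node{i}": {} for i in range(1, 6)}     # node name -> {block id: data}

def write_block(block_id, data):
    # Place copies of the block on 3 distinct nodes.
    for name in random.sample(sorted(nodes), REPLICATION_FACTOR):
        nodes[name][block_id] = data

def read_block(block_id, failed_nodes=()):
    # Any surviving replica is good enough to serve the read.
    for name, store in nodes.items():
        if name not in failed_nodes and block_id in store:
            return store[block_id]
    raise IOError("all replicas lost")

write_block("blk_001", b"sensor readings")
one_holder = next(n for n in nodes if "blk_001" in nodes[n])
print(read_block("blk_001", failed_nodes={one_holder}))   # still readable
```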
4. High Throughput
Designed for batch processing of large datasets.
It uses the MapReduce model for parallel processing, which
enables fast and efficient data analysis.
5. Simplicity of Programming
Programmers write Map and Reduce functions, and Hadoop
handles the rest (like splitting tasks and managing failures).
Abstracts complex operations, making big data processing
easier.
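A minimal Python sketch of what "just write Map and Reduce" looks like, using a made-up max-temperature-per-year example (map_fn and reduce_fn are illustrative; real Hadoop jobs are usually written as Java Mapper/Reducer classes):

```python
# The programmer only supplies these two functions; splitting the input,
# shuffling the intermediate pairs and re-running failed tasks is left to
# the framework. Example problem: maximum temperature recorded per year.

def map_fn(line):
    # One input line like "1950,22" -> emit the pair (year, temperature).
    year, temp = line.split(",")
    yield year, int(temp)

def reduce_fn(year, temperatures):
    # All temperatures for one year arrive together -> emit (year, maximum).
    yield year, max(temperatures)
```

Everything else (reading input splits, grouping the pairs by year, retrying failed tasks) would be handled by the framework.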
6. Diversity
Hadoop can process structured, semi-structured, and
unstructured data (text, images, videos, logs, etc.).
No strict schema is required for storing data in HDFS.
7. Cost-Effective
Open-source framework, so no license cost.
5) Explain MapReduce. Demonstrate the working of the various
phases of MapReduce with an appropriate example and
diagram.
A) MapReduce processes large datasets in parallel in three phases: a Map phase that converts each input split into intermediate (key, value) pairs, a Shuffle and Sort phase that groups all values with the same key, and a Reduce phase that combines each group into the final output.
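As a hedged illustration, the following is a local Python simulation of the three phases on a small word-count input; it only mimics the idea and is not Hadoop code:

```python
from collections import defaultdict

# Local simulation of the three MapReduce phases on word count.
# Input splits -> MAP -> (word, 1) pairs -> SHUFFLE/SORT (group by key)
# -> REDUCE -> (word, total). Illustrative only, not Hadoop code.

splits = ["deer bear river", "car car river", "deer car bear"]

# Map phase: one call per input split, emitting (key, value) pairs.
mapped = [(word, 1) for split in splits for word in split.split()]

# Shuffle & sort phase: group all values belonging to the same key.
grouped = defaultdict(list)
for word, count in sorted(mapped):
    grouped[word].append(count)

# Reduce phase: combine the grouped values into the final result.
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)   # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

In a real cluster the same phases run in parallel on many nodes, with map tasks scheduled close to the data blocks they read.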
6) Discuss Hadoop YARN with a diagram and explain.
A) YARN (Yet Another Resource Negotiator):
It is the resource management unit of the Hadoop framework.
The data stored in HDFS can be processed with the help of YARN
using different data processing engines, such as batch, interactive,
and stream processing, which makes many kinds of data analysis
possible on the same cluster.
Features:
It acts as an operating system for the data stored on HDFS,
managing the cluster's resources.
It helps to schedule tasks so that no single system is overloaded.
Components of YARN:
Client
Submits MapReduce jobs to the system.
Container
A bundle of resources (like CPU and memory) on a node.
Runs a part of the application.
Launched using a Container Launch Context (CLC), which
includes the settings and files needed to run (see the sketch below).
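A minimal Python sketch of the container idea, assuming a made-up allocate_container helper and invented node sizes (not the YARN scheduler API):

```python
# Toy picture of YARN scheduling: a container is a slice of one node's CPU and
# memory, and the scheduler refuses requests that would overload any node.
# Node names, numbers, and allocate_container are illustrative, not YARN APIs.

free_resources = {
    "node1": {"vcores": 8, "mem_mb": 16384},
    "node2": {"vcores": 4, "mem_mb": 8192},
}

def allocate_container(vcores, mem_mb):
    for node, free in free_resources.items():
        if free["vcores"] >= vcores and free["mem_mb"] >= mem_mb:
            free["vcores"] -= vcores
            free["mem_mb"] -= mem_mb
            return {"node": node, "vcores": vcores, "mem_mb": mem_mb}
    return None    # nothing fits -> the request waits instead of overloading a node

print(allocate_container(2, 4096))    # fits on node1
print(allocate_container(16, 2048))   # too large for any node -> None
```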
Working of YARN:
1. The client submits an application (for example, a MapReduce job) to the Resource Manager.
2. The Resource Manager allocates a container on a node and starts the Application Master for that job in it.
3. The Application Master registers with the Resource Manager and requests further containers for the job's tasks.
4. Node Managers launch the containers and run the tasks, reporting their status.
5. The Application Master monitors progress and, when the job finishes, deregisters from the Resource Manager and the containers are released.
7) Compare RDBMS with Hadoop.
A)
RDBMS | HADOOP
1) Stores data in tables with rows and columns. | 1) Stores data in a distributed file system (HDFS).
2) Suitable for small to moderate amounts of structured data. | 2) Handles massive volumes of structured, semi-structured, and unstructured data.
3) Runs on a single server or a small cluster. | 3) Runs across thousands of machines (clusters).
4) Uses SQL for querying data. | 4) Uses MapReduce, Hive, or Pig for data processing.
5) Scaling is done by upgrading the same server (vertical scaling). | 5) Scaling is done by adding more machines (horizontal scaling).
6) Data processing is fast for small datasets. | 6) Optimized for batch processing of large datasets.
7) Expensive due to licensing and high-end hardware. | 7) Cost-efficient, running on open-source software and commodity hardware.
8) Less fault-tolerant. | 8) Highly fault-tolerant.