Big Data Unit 2 Notes
Hadoop was created by Doug Cutting and Mike Cafarella in 2005, inspired by Google's MapReduce and
Google File System (GFS) papers.
Named after the toy elephant of Cutting's son, Hadoop was initially developed to support distributed
processing of large datasets across clusters of commodity hardware.
The project was later contributed to the Apache Software Foundation and became an open-source
platform for big data processing and analytics.
Apache Hadoop:
Apache Hadoop is an open-source framework for distributed storage and processing of big data.
It provides a scalable and fault-tolerant ecosystem for storing, processing, and analyzing large datasets
across clusters of computers.
Key components of Apache Hadoop include Hadoop Distributed File System (HDFS) for storage and
MapReduce for processing.
HDFS (Hadoop Distributed File System):
HDFS is a distributed file system designed to store very large files, accessed in a streaming (write-once,
read-many) fashion, across multiple nodes in a Hadoop cluster.
It provides high throughput access to application data and ensures fault tolerance by replicating data
across multiple nodes.
HDFS follows a master-slave architecture with NameNode as the master and DataNodes as slaves.
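As a hedged sketch of how an application talks to HDFS through the NameNode and DataNodes, the snippet below writes and reads a small file with the Java FileSystem API; the file path is a made-up example, and the cluster address is assumed to come from the usual core-site.xml/hdfs-site.xml configuration on the classpath.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml if present
        FileSystem fs = FileSystem.get(conf);          // handle to the default file system (HDFS here)

        Path file = new Path("/user/demo/notes.txt");  // hypothetical path

        // Write a small file; HDFS splits it into blocks and replicates each block across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back; the NameNode supplies block locations, the DataNodes stream the bytes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```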
Components of Hadoop:
HDFS (Hadoop Distributed File System): Stores data across a distributed cluster of machines.
MapReduce: Provides a programming model and processing framework for distributed processing of
large datasets.
YARN (Yet Another Resource Negotiator): Manages resources and schedules tasks across the Hadoop
cluster.
Hadoop Common: Contains libraries and utilities used by other Hadoop modules.
Hadoop Ecosystem: Includes various tools and frameworks built on top of Hadoop for specific tasks like
data ingestion, processing, and analysis.
Data Format:
Hadoop supports a wide range of data formats: optimized binary formats such as Avro (row-oriented) and
Parquet (columnar), semi-structured formats such as JSON and XML, and unstructured plain text.
It can handle data in any format, but optimized formats like Avro and Parquet improve performance and
storage efficiency through compact encoding, compression, and embedded schemas.
Analyzing Data with Hadoop:
Hadoop enables analyzing large datasets using the MapReduce programming model, which breaks
processing into map and reduce phases.
Map tasks process input data and emit intermediate key-value pairs, which are then shuffled, sorted,
and aggregated by reduce tasks to produce the final output.
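A minimal sketch of the two phases, using the classic word-count example rather than anything specific to these notes: the mapper emits a (word, 1) pair per token, and after the shuffle and sort the reducer sums the counts for each word. Class names are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every token in a line of input.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }
}

// Reduce phase: values for the same word arrive grouped after the shuffle and sort.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));  // final output: (word, total)
    }
}
```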
Scaling Out:
Hadoop allows scaling out by adding more nodes to the cluster to accommodate increasing data
volumes and processing demands.
It redistributes data and reschedules work away from failed or overloaded nodes, so operation stays
smooth as the cluster grows.
Hadoop Streaming:
Hadoop Streaming enables running non-Java programs (e.g., Python or Perl scripts) as MapReduce jobs
by passing records to and from the mapper and reducer processes through their standard input and
output streams.
Hadoop Ecosystem:
The Hadoop ecosystem consists of various tools and frameworks that extend the functionality of Hadoop
for specific use cases, such as Hive (SQL-like querying), Pig (data-flow scripting), HBase (NoSQL storage),
Sqoop and Flume (data ingestion), and Oozie (workflow scheduling).
MapReduce:
MapReduce is a programming model and processing framework for parallel processing of large datasets
across distributed clusters.
Each job runs a map phase followed by a reduce phase, with a shuffle-and-sort step in between that
moves intermediate key-value pairs from the mappers to the reducers and groups them by key.
Developing a MapReduce Application:
Developing a MapReduce application involves implementing map and reduce functions to process input
data and produce output.
Hadoop provides APIs for writing MapReduce applications in Java, Python (using Hadoop Streaming),
and other programming languages.
Anatomy of a MapReduce Job Run:
A MapReduce job consists of multiple map and reduce tasks, which are executed across nodes in the
Hadoop cluster.
Input data is divided into splits, processed by map tasks in parallel, and then shuffled, sorted, and
aggregated by reduce tasks.
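A minimal driver sketch showing how such a job is assembled and submitted; it assumes the hypothetical WordCountMapper and WordCountReducer from the earlier sketch and takes the input and output paths from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // job name is arbitrary
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);       // map phase
        job.setReducerClass(WordCountReducer.class);     // reduce phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input is divided into splits
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1); // submit and wait for the cluster to finish
    }
}
```

It would be packaged as a jar and launched with something like hadoop jar wordcount.jar WordCountDriver /input/path /output/path (jar name and paths are illustrative).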
Failures:
Hadoop handles failures gracefully: a failed map or reduce task is automatically re-executed on another
healthy node, up to a configurable number of attempts, before the job as a whole is marked as failed.
Data replication in HDFS ensures fault tolerance by storing multiple copies of data across the cluster.
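As a hedged illustration, the snippet below shows configuration properties that govern this behavior; the property names (dfs.replication, mapreduce.map.maxattempts, mapreduce.reduce.maxattempts) are standard Hadoop 2+ settings, and the values shown are simply their defaults.

```java
import org.apache.hadoop.conf.Configuration;

public class FaultToleranceSettings {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();

        // Number of copies HDFS keeps of each block (default is 3).
        conf.setInt("dfs.replication", 3);

        // How many times a failed map or reduce task is retried on another node
        // before the whole job is declared failed (default is 4).
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        return conf;
    }
}
```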
Job Scheduling:
YARN (Yet Another Resource Negotiator) is responsible for scheduling applications and managing cluster
resources, which it hands out as containers of memory and CPU.
Its pluggable schedulers (FIFO, Capacity, and Fair) allocate resources based on application requirements
and cluster availability to optimize resource utilization.
Shuffle and Sort:
The shuffle and sort phase transfers intermediate key-value pairs from map tasks to reduce tasks, sorts
them by key, and groups values with the same key for aggregation.
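Which reduce task receives a given key is decided by a partitioner (HashPartitioner by default). Below is a minimal sketch of a custom partitioner, reusing the Text/IntWritable types from the word-count example; the class name and partitioning rule are made up for illustration, and it would be registered with job.setPartitionerClass(FirstLetterPartitioner.class).

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reduce task receives a given intermediate key. All pairs with the
// same key land in the same partition, so the shuffle delivers them to a single
// reducer, where they are sorted by key and grouped for aggregation.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String word = key.toString();
        if (numReduceTasks == 0 || word.isEmpty()) {
            return 0;
        }
        // Route words by their (lower-cased) first character.
        return Character.toLowerCase(word.charAt(0)) % numReduceTasks;
    }
}
```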
Task Execution:
Map and reduce tasks are executed on individual nodes in the Hadoop cluster, with each task processing
a portion of the input data in parallel.
MapReduce Types:
The map and reduce functions have general type signatures: map takes a (K1, V1) pair and emits a list of
(K2, V2) pairs, and reduce takes (K2, list(V2)) and emits a list of (K3, V3) pairs; the map output types must
match the reduce input types.
MapReduce also supports different input and output formats, including text, sequence files, and custom
formats.
Input and output formats define how data is read from input sources and written to output destinations.
Input Formats:
Input formats determine how input data is split and processed by map tasks.
Hadoop provides built-in input formats for various data sources, for example TextInputFormat for plain
text files, SequenceFileInputFormat for sequence files, and DBInputFormat for database tables.
Output Formats:
Hadoop provides built-in output formats for writing results to different destinations, for example
TextOutputFormat and SequenceFileOutputFormat for files in HDFS and DBOutputFormat for databases.
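A hedged sketch of selecting formats on a job: TextInputFormat and SequenceFileOutputFormat are built-in classes, while the helper class and the paths are illustrative assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatConfig {
    public static void configure(Job job) throws IOException {
        // Input format: how input files are split and turned into (key, value) records.
        // TextInputFormat produces one record per line: (byte offset, line text).
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // hypothetical path

        // Output format: how the job's (key, value) pairs are written out.
        // SequenceFileOutputFormat writes a compact binary sequence file to HDFS.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // hypothetical path

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}
```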
MapReduce Features:
Fault tolerance: Hadoop provides fault tolerance mechanisms to handle node failures and ensure job
completion.
Scalability: Hadoop scales horizontally by adding more nodes to the cluster to accommodate increasing
data volumes and processing demands.
Data locality: Hadoop schedules tasks on the nodes where their input data is stored to minimize data
transfer over the network.
Task parallelism: MapReduce executes map and reduce tasks in parallel across multiple nodes in the
cluster; a combiner (sketched after this list) can further reduce the data moved during the shuffle.
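One common way to exploit data locality and parallelism further is a combiner, which pre-aggregates map output on the map side so that less data crosses the network during the shuffle. The sketch below reuses the hypothetical WordCountReducer from earlier, which is safe for word count because addition is associative and commutative; combiners that do not satisfy this can change results.

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerConfig {
    public static void addCombiner(Job job) {
        // Run the reduce logic on each map task's local output before the shuffle,
        // so the network carries one pre-summed count per word per map task.
        job.setCombinerClass(WordCountReducer.class);
    }
}
```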
Real-world MapReduce Notes:
In real-world scenarios, MapReduce is used for processing large datasets in various domains, including
web search, social media analytics, log processing, and recommendation systems.
MapReduce jobs can be tuned for performance through choices such as input/output formats, data
partitioning, map-output compression, and the number of reduce tasks (a tuning sketch appears at the
end of this section).
Monitoring and debugging relied on the JobTracker and TaskTracker in Hadoop 1; in Hadoop 2 and later,
the YARN ResourceManager, NodeManagers, and the MapReduce Job History Server expose job execution
details and performance metrics through their web UIs.
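As a hedged sketch of such tuning, the snippet below sets a few common knobs; the property names are standard Hadoop 2+ settings, but the values are purely illustrative, not recommendations.

```java
import org.apache.hadoop.mapreduce.Job;

public class TuningConfig {
    public static void tune(Job job) {
        // Reduce-side parallelism: more reducers means more (smaller) output files.
        job.setNumReduceTasks(8);                                          // illustrative value

        // Compress map output to shrink the data moved during the shuffle.
        job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);

        // Per-map-task sort buffer for the shuffle, in MB (default is 100).
        job.getConfiguration().setInt("mapreduce.task.io.sort.mb", 256);   // illustrative value
    }
}
```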