Introduction to Hadoop

Hadoop is an open-source, distributed processing framework designed to handle large datasets across clusters of commodity hardware. It provides a reliable, scalable, and cost-effective solution for storing and processing massive amounts of data.

Why Hadoop?

Hadoop addresses the challenges posed by Big Data, which is characterized by its
volume, velocity, and variety. Traditional systems struggle to handle such data
efficiently. Hadoop offers:

 Scalability: Easily scales to handle petabytes and exabytes of data by adding more nodes to the cluster.

 Fault Tolerance: Data is replicated across multiple nodes, ensuring data availability even if some nodes fail.

 Cost-Effectiveness: Utilizes commodity hardware, reducing infrastructure costs.

 Flexibility: Supports various data types and processing paradigms.

Why Not RDBMS?

Relational Database Management Systems (RDBMS) are designed for structured data and transactional processing. They face limitations when dealing with Big Data:

 Scalability Limitations: Scaling an RDBMS vertically (adding more resources to a single server) becomes expensive and eventually reaches a limit.

 Schema Dependency: RDBMS require a predefined schema, making it difficult to handle unstructured or semi-structured data.

 Cost: Enterprise-grade RDBMS can be expensive to license and maintain.

 Processing Speed: Complex queries on large datasets can be slow in an RDBMS.

RDBMS vs. Hadoop

Feature         | RDBMS                              | Hadoop
Data Structure  | Structured                         | Structured, Semi-structured, Unstructured
Scalability     | Vertical (limited)                 | Horizontal (highly scalable)
Data Volume     | Limited                            | Massive (petabytes, exabytes)
Processing      | Transactional, Complex Queries     | Batch Processing, Data Analytics
Schema          | Predefined                         | Schema-on-Read
Hardware        | Specialized, Expensive             | Commodity Hardware
Use Cases       | Transaction Processing, Reporting  | Big Data Analytics, Data Warehousing

History of Hadoop

 2002: Doug Cutting and Mike Cafarella started the Nutch project, an open-
source web search engine.

 2003: Google published the Google File System (GFS) paper, which
inspired the development of HDFS.

 2004: Google published the MapReduce paper, which inspired the development of the MapReduce framework.

 2006: Hadoop was split out of Nutch and became an Apache subproject.

 2008: Hadoop became a top-level Apache project.

 Present: The Hadoop ecosystem continues to evolve with new components and improvements.

Hadoop Overview

Hadoop consists of several core components:


 HDFS (Hadoop Distributed File System): A distributed file system that
stores large datasets across a cluster of machines.

 YARN (Yet Another Resource Negotiator): A resource management and job scheduling framework.

 MapReduce: A programming model for processing large datasets in parallel.

 Hadoop Common: Provides common utilities and libraries used by other Hadoop modules.

Use Cases of Hadoop

 Log Processing: Analyzing web server logs, application logs, and security
logs.

 Data Warehousing: Building large-scale data warehouses for business intelligence.

 Recommendation Systems: Building recommendation engines for e-commerce and content platforms.

 Fraud Detection: Detecting fraudulent transactions in financial institutions.

 Social Media Analytics: Analyzing social media data to understand trends and sentiment.

HDFS (Hadoop Distributed File System)

HDFS is a distributed file system designed to store large datasets reliably and
efficiently. Key features include:

 Data Replication: Data is replicated across multiple nodes to ensure fault tolerance.

 Block Size: Data is divided into blocks (typically 128MB) and stored across
the cluster.

 Namenode: Manages the file system namespace and metadata.

 Datanode: Stores the actual data blocks.

 High Throughput: Optimized for sequential data access.
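
To make the ideas above concrete, here is a minimal sketch of a client writing and reading an HDFS file through Hadoop's Java FileSystem API. The path /user/demo/hello.txt is a placeholder invented for the example, and the NameNode address is assumed to come from core-site.xml on the classpath.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS (the NameNode address) from core-site.xml.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits large files into blocks and replicates them.
            Path file = new Path("/user/demo/hello.txt"); // placeholder path
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back and copy its contents to stdout.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }

            fs.close();
        }
    }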

Processing Data with Hadoop

Hadoop processes data using the MapReduce programming model. This involves
dividing the data into smaller chunks, processing them in parallel using mappers,
and then aggregating the results using reducers.
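
As a concrete illustration, the sketch below shows a typical job driver that wires the two phases together: it names the mapper and reducer classes and points the job at input and output locations. The class names TokenizerMapper and IntSumReducer refer to the word-count sketches under the Mapper and Reducer headings later in this guide, and the HDFS paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            // Mapper and reducer classes (sketched in the MapReduce section below).
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);

            // Types of the final (key, value) pairs written to HDFS.
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Placeholder HDFS paths for the input data and the job output.
            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }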

Managing Resources and Applications with Hadoop YARN (Yet Another Resource Negotiator)

YARN is the resource management layer in Hadoop. It allows multiple applications to run on the same cluster, sharing resources dynamically.

 ResourceManager: Manages the cluster resources and allocates them to applications.

 NodeManager: Manages the resources on individual nodes and executes tasks.

 ApplicationMaster: Manages the lifecycle of an application and coordinates its tasks.

Introduction to MapReduce Programming

Introduction

MapReduce is a programming model for processing large datasets in parallel. It involves two main phases: Map and Reduce.

Mapper

The Mapper phase transforms input data into key-value pairs. It processes each
input record and emits one or more key-value pairs.

 Input: Input data is read from HDFS.

 Processing: The mapper function is applied to each input record.

 Output: Key-value pairs are emitted.
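
As an illustration of these three steps, here is a minimal word-count mapper using the org.apache.hadoop.mapreduce API. The class name and the word-count task itself are example choices, not something prescribed by the notes above.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input: (byte offset, line of text). Output: (word, 1) for every token in the line.
    public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // emit one key-value pair per word
            }
        }
    }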

Reducer

The Reducer phase aggregates and summarizes the data based on the keys emitted
by the mappers. It receives the output of the mappers, sorts it by key, and then
applies a reduce function to each key and its associated values.

 Input: Key-value pairs from the mappers.

 Processing: The reducer function is applied to each key and its associated
values.

 Output: Final results are written to HDFS.
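
Continuing the word-count illustration from the Mapper section, a matching reducer sketch simply sums the counts it receives for each word; the framework has already grouped and sorted the pairs by key before reduce() is called.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Input: (word, [1, 1, 1, ...]). Output: (word, total count).
    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result); // final (word, count) pair written to HDFS
        }
    }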

Combiner

A Combiner is an optional optimization that performs local aggregation of the mapper output before it is sent to the reducers. This reduces the amount of data that needs to be transferred across the network.

 Functionality: Performs local aggregation on the mapper output.

 Benefits: Reduces network traffic and improves performance.

 Placement: Runs on the same node as the mapper.
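
In the word-count sketches above, the reduce logic (summing integers) is associative and commutative, so the same IntSumReducer class can double as the combiner. The fragment below would go in the driver sketch shown earlier; reusing a reducer this way is only safe when local aggregation gives the same result as global aggregation.

    // Run IntSumReducer on each mapper's local output before the shuffle,
    // so fewer (word, count) pairs have to cross the network.
    job.setCombinerClass(IntSumReducer.class);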

Partitioner

The Partitioner determines which reducer will receive a given key-value pair from
the mapper. It ensures that all key-value pairs with the same key are sent to the
same reducer.

 Functionality: Distributes the mapper output to the reducers.

 Default Partitioner: Uses a hash function to distribute the keys evenly.

 Custom Partitioner: Can be implemented to control the distribution of keys.
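
As an illustrative sketch (the class name and the routing rule are invented for the example, and it assumes the job runs with two reducers), a custom partitioner could route words by their first letter instead of by hash:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Sends words starting with a-m to reducer 0 and everything else to reducer 1,
    // while still keeping all occurrences of the same word on the same reducer.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String word = key.toString().toLowerCase();
            char first = word.isEmpty() ? 'z' : word.charAt(0);
            int bucket = (first >= 'a' && first <= 'm') ? 0 : 1;
            // The modulo keeps the result in range even if fewer reducers are configured.
            return bucket % numPartitions;
        }
    }

It would be enabled in the driver with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(2).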

Searching

MapReduce can be used for searching large datasets. The mapper can filter the data
based on a search query, and the reducer can aggregate the results.
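
As a rough sketch of that idea (the class name and the configuration key search.query are invented for the example), a grep-style mapper can emit only the records that contain a search term; if no aggregation is needed, the job can even run with zero reducers.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits only the input lines that contain the search term; all other lines are dropped.
    public class SearchMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        private String query;

        @Override
        protected void setup(Context context) {
            // The search term is passed in through the job configuration (hypothetical key).
            query = context.getConfiguration().get("search.query", "");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().contains(query)) {
                context.write(value, NullWritable.get());
            }
        }
    }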

Sorting

Hadoop automatically sorts the mapper output by key before it is sent to the
reducers. This makes it easy to perform sorting operations on large datasets.

Compression

Compression can be used to reduce the storage space and network bandwidth
required for Hadoop jobs. Hadoop supports various compression codecs, such as
Gzip, LZO, and Snappy.

 Benefits: Reduces storage costs, improves network performance.

 Considerations: Compression and decompression can add overhead to the processing time.
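
As a sketch of how compression is typically switched on (these are standard Hadoop configuration keys and codec classes, though codec availability such as Snappy depends on how the cluster was built), the following fragment would go in a job driver like the one sketched earlier:

    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Compress the intermediate map output to cut shuffle traffic
    // (Snappy trades compression ratio for speed).
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
                  SnappyCodec.class, CompressionCodec.class);

    // Compress the final reducer output written to HDFS with Gzip.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);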

This study guide provides a foundational understanding of Hadoop and MapReduce. Remember to review your lecture notes, assignments, and practice problems to further solidify your knowledge. Good luck with your final exams!
