1. INTRODUCING HADOOP
Big Data refers to the massive amounts of structured, semi-structured, and
unstructured data generated daily from various sources. Its major
characteristics are:
Volume: The sheer amount of data produced (e.g., NYSE generates 1.5 billion
shares and trade data daily, Facebook stores 2.7 billion comments and likes
daily, Google processes about 24 petabytes of data daily).
Variety: Data comes in many formats—text, videos, logs, images, etc.—from
varied sources.
Velocity: The speed at which data is generated and needs to be processed (e.g.,
Twitter users tweet 300,000 times per minute, email users send 200 million
messages per minute).
Value: The potential insights and business advantages that can be extracted
from analyzing this data.
2. WHY HADOOP
Hadoop can handle huge amounts of different types of data quickly. This is
why it is so widely adopted for big data tasks.
Low Cost:
Hadoop is open-source and uses inexpensive, easily available hardware to
store large volumes of data.
Computing Power:
Hadoop processes very large volumes of data by using many computers
(nodes) together. More nodes = more processing power.
Scalability:
If you need more processing, you simply add more nodes. Administration
is easy as the system grows.
Storage Flexibility:
You don’t need to pre-process data to store it. Hadoop lets you store any
type, size, or format (structured, unstructured, free-form) of data.
Inherent Data Protection:
Hadoop protects data by creating multiple copies across nodes. If one node
fails, work is redirected to others, so data and processes continue without
interruption.
How Does Hadoop Work? (Figure 5.3):
Hadoop uses a distributed file system on low-cost hardware. Here’s how it manages data:
Distributes and Duplicates Data:
Breaks down data files into chunks (for example, “25–30” could be one chunk)
and duplicates each chunk across several nodes. This ensures reliability and quick
access.
Parallel Processing:
Each node processes its own chunk of data at the same time (in parallel), speeding
up computation.
Automatic Failover:
If a node fails, Hadoop automatically reassigns its tasks to working nodes, keeping
the system running smoothly.
In Summary:
Hadoop is popular because it's affordable, powerful, flexible, scalable, and fault-
tolerant. It is designed to handle the biggest data problems by distributing both storage
and computation across many inexpensive machines, making processing of vast and
varied data not just possible—but efficient and reliable.
3. WHY NOT RDBMS FOR BIG DATA?
Poor fit for large files:
RDBMS is not suitable for storing and processing large files, images, and
videos. These types of data are increasingly common in big data scenarios.
Not ideal for advanced analytics:
RDBMS systems struggle with advanced analytics and machine
learning workloads, often required in modern data applications.
Cost Issues:
As data grows, the cost per GB of storage in RDBMS rises steeply.
The image and the explanation (Figure 5.4) illustrate that as you scale up
(from GBs to TBs to PBs), the cost per GB (€ to €€ to €€€) increases
significantly.
Scaling an RDBMS to petabyte levels becomes prohibitively expensive.
Scaling Model (Scale-Up):
Traditional RDBMS scale by adding more powerful (and expensive) hardware,
not by adding more cheap computers. This is called "scaling up" and is costly.
As you increase the amount of data (from GBs to TBs to PBs), the cost per GB goes up
drastically, shown by € → €€ → €€€.
The arrow labeled "Scale Up" shows that this increasing cost is tied to the way RDBMS
systems are traditionally scaled for big data—by investing in larger, more powerful machines.
4. RDBMS VERSUS HADOOP
PARAMETERS: RDBMS versus HADOOP
System: RDBMS is a Relational Database Management System; Hadoop is a node-based flat structure.
Data: RDBMS is suitable for structured data; Hadoop is suitable for structured and unstructured
data and supports a variety of data formats in real time, such as XML, JSON, and text-based flat
file formats.
Processing: RDBMS supports OLTP; Hadoop supports analytical, big data processing.
Choice: RDBMS is the choice when the data needs consistent relationships; Hadoop is the choice for
big data processing, which does not require any consistent relationships between data.
Processor: RDBMS needs expensive hardware or high-end processors to store huge volumes of data;
in a Hadoop cluster, a node requires only a processor, a network card, and a few hard drives.
Cost: RDBMS costs around $10,000 to $14,000 per terabyte of storage; Hadoop costs around $4,000
per terabyte of storage.
5. DISTRIBUTED COMPUTING CHALLENGES
5.1. Hardware Failure
In distributed systems, multiple servers or hard disks are networked together,
increasing the likelihood of hardware failure.
For instance, a typical hard disk might fail once in three years, but with 1,000
disks, failures are a daily possibility.
Main problem: How to retrieve data if a hardware component fails?
Hadoop’s Solution: Replication Factor
Replication Factor (RF): This denotes the number of copies of each data
block stored across the network.
Example: If RF=2, there are two copies of each data block on different
servers.
Why it matters: If one node fails, data can still be accessed from the other
copy, ensuring fault-tolerance and reliability.
The system constantly maintains the set number of replicas, so if one is lost
due to hardware failure, it is automatically recreated elsewhere.
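To make the idea concrete, here is a minimal Java sketch (using Hadoop's FileSystem API) that sets and then reads back the replication factor of a file; the HDFS path is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/sample/test.txt"); // illustrative HDFS path

        // Ask the NameNode to keep 3 copies of every block of this file (RF = 3).
        fs.setReplication(file, (short) 3);

        // The replication factor is part of the metadata tracked by the NameNode.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor of " + file + " = " + status.getReplication());

        fs.close();
    }
}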
5.2. How to Process This Gigantic Store of Data?
Distributed systems store data across many networked machines. The
challenge is integrating all these pieces for efficient processing.
Main problem: How to process large volumes of distributed data as a unified
dataset?
Hadoop’s Solution: MapReduce Programming
MapReduce is a programming model to process data across multiple
machines.
It lets developers write programs that handle massive data sets by
dividing the tasks among many nodes and aggregating the results.
This enables efficient computation and data integration, even when data is
distributed across a huge cluster.
Summary:
Distributed computing faces challenges of hardware failure and massive,
distributed data processing. Hadoop addresses these using Replication
Factor for fault-tolerant storage and MapReduce for parallel data processing,
making it highly suitable for big data environments.
6. HISTORY OF HADOOP
Creator:
Hadoop was created by Doug Cutting, who is also known for developing
Apache Lucene (a popular text search library).
Origins:
Hadoop began as part of the Apache Nutch project, which was an
open-source web search engine.
2002:
Doug Cutting and Mike Cafarella started working on Nutch.
2003-2004:
Google published papers on Google File System (GFS) and
MapReduce. These concepts influenced Hadoop's architecture.
2005:
Doug Cutting added Distributed File System (DFS) and MapReduce
to Nutch.
2006:
Yahoo hired Doug Cutting, and Hadoop was spun out from Nutch as
a separate project.
2008:
Cloudera, a major Hadoop vendor, was founded.
2009:
Doug Cutting joined Cloudera.
7. HADOOP OVERVIEW
Hadoop is an open-source software framework designed for storing and
processing massive amounts of data across clusters of inexpensive, commodity
hardware. It accomplishes two major goals:
Massive data storage
Faster data processing (in parallel)
7.1 Key Aspects of Hadoop
1. Open-Source Software:
Free to download, use, and contribute to.
2. Framework Approach:
Provides all necessary tools and programs to develop and execute distributed
data processing.
3. Distributed:
Data is divided and stored across multiple connected computers (nodes).
Processing tasks are distributed and performed in parallel, improving speed.
4. Massive Storage:
Stores colossal volumes of data using low-cost hardware.
5. Faster Processing:
Handles large-scale data processing in parallel for quick results.
7.2 Hadoop Components
Core Components
1. HDFS (Hadoop Distributed File System):
Main storage system which splits and distributes data across
multiple nodes.
Built-in redundancy ensures data durability.
2. MapReduce:
Programming model for processing and computing large
datasets across distributed nodes in a parallel fashion.
Hadoop Ecosystem (Supporting Projects)
HIVE: Data warehousing, SQL-like queries
PIG: High-level scripting for data analysis
SQOOP: Data transfer between Hadoop and RDBMS
HBASE: NoSQL database on Hadoop
FLUME: Collecting and loading log data
OOZIE: Workflow scheduling
MAHOUT: Machine learning
7.3 Conceptual Layers of Hadoop
Data Storage Layer:
Handles storing huge amounts of data (HDFS).
Data Processing Layer:
Processes data in parallel to extract insights (MapReduce).
[Diagram of conceptual layers: 'Data Storage' and 'Data Processing'. Figure 5.9]
7.4 High-Level Architecture
Hadoop uses a Master-Slave architecture:
Master Node (NameNode):
Manages file system namespace and controls access (where to find what data).
Responsible for coordination, storage partitioning.
Slave Nodes (DataNodes):
Actually store the data and perform computation (processing tasks).
Each node runs components for:
Computation (MapReduce)
Storage (HDFS)
[High-level architecture diagram showing one master node and multiple slave nodes,
each handling both computation and storage. Figure 5.10]
8. USE CASE OF HADOOP
8.1 ClickStream Data
ClickStream data refers to the sequence of mouse clicks made by users as they
navigate websites.
By analyzing this data, businesses can understand customer purchasing
behavior.
Marketers use ClickStream analysis to optimize product pages, promotional
content, and other aspects of their websites to improve business outcomes.
Why Use Hadoop for ClickStream Data?
Three Key Benefits:
Integration with Other Data Sources
Hadoop enables joining ClickStream data with other business data sources such
as:
CRM (Customer Relationship Management) - customer history, preferences
Customer demographics - age, location, etc.
Sales data - purchases, conversions
Advertising campaign data - ad impressions, clicks, conversions
This comprehensive integration delivers deeper insights into customer behavior.
Cost-Effective Scalability
Hadoop’s scalable architecture allows storage of years of clickstream
data with minimal incremental cost.
This means you can perform historical and year-over-year analysis without the
rising storage costs that traditional systems may incur.
Flexible Analysis with Hive and Pig
Business analysts can use Apache Hive or Apache Pig to:
Organize clickstream data by user session
Refine and preprocess it
Feed it into visualization or advanced analytics tools
These tools make it much easier to extract valuable patterns from massive
clickstream datasets.
Benefit and description:
Joins ClickStream with CRM & sales data: enables combined analysis for deeper insights.
Stores years of data inexpensively: facilitates long-term analysis without high incremental cost.
Analysis using Hive or Pig: streamlines data organization and analytics.
9. HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
What is HDFS?
HDFS (Hadoop Distributed File System) is the primary storage system of
Hadoop, designed for storing and managing large volumes of data reliably and
efficiently across multiple nodes.
Key Features of HDFS
1. Storage Component of Hadoop:
HDFS is responsible for storing all the data in a Hadoop cluster.
2. Distributed File System:
It spreads data across multiple computers (nodes) and allows parallel reading
and writing, improving speed and fault tolerance.
3. Modeled after Google File System:
The design draws inspiration from Google’s GFS, using similar ideas like
block storage and metadata management.
4. Optimized for High Throughput:
HDFS uses large block sizes and tries to move computation closer to where
data resides (“data locality”) for efficient processing of big files.
5. Replication for Fault Tolerance:
Each file is split into blocks (default 64 MB) and each block is replicated
(default: 3 copies) across the cluster. This ensures data is safe even if
nodes fail.
6. Automatic Block Re-Replication:
If a node fails, HDFS automatically re-copies (“re-replicates”) its
blocks to healthy nodes to maintain the required replication factor.
7. Handles Very Large Files:
HDFS excels at storing and processing files that are gigabytes or
terabytes in size.
8. Sits on Top of Existing File Systems:
It operates over native OS file systems like ext3/ext4, managing
how and where files are stored and retrieved.
Architectural Highlights
Block Structured:
Files are broken into blocks. Each block is replicated and stored on different
nodes for fault tolerance and efficient access.
Default Block Size:
Each block is typically 64MB (can be configured).
Default Replication Factor:
By default, each block has 3 copies on different nodes.
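Both defaults can also be overridden per file at creation time; the sketch below (path and values illustrative) passes an explicit 64 MB block size and a replication factor of 3 through the FileSystem.create() overload.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/sample/blocksize-demo.txt"); // illustrative HDFS path
        long blockSize = 64L * 1024 * 1024; // 64 MB, the default mentioned above
        short replication = 3;              // default replication factor
        int bufferSize = 4096;

        // Create the file with an explicit block size and replication factor.
        FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize);
        out.writeBytes("hello hdfs\n");
        out.close();

        fs.close();
    }
}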
9.1 HDFS Daemons
When a file of size 192 MB is stored, it is split into three
blocks of 64MB each. Each block is replicated three times (default setting) and
distributed across different nodes.
9.1.1 NameNode: Manages all file system metadata (file names, block locations,
permissions). It ensures files are split and replicated as per configuration.
The NameNode is the master server in the HDFS architecture. Its primary role
is to manage the file system namespace. Here are its key responsibilities:
Manages Metadata: Keeps records of the directory structure, file names,
permissions, and the location of blocks within the cluster.
Stores the Namespace: The complete mapping of file names to blocks and
blocks to DataNodes is called the file system namespace. This is stored in a file
called the FsImage.
Tracks Transactions: Every change to the metadata (like creating, deleting, or
renaming files) is logged in an EditLog (transaction log).
Rack Awareness: Uses rack IDs to identify DataNodes in different racks,
optimizing data placement and network utilization.
No Data Storage: The NameNode does NOT store the actual data—only the
metadata.
Single Point of Management: Typically, there is one active NameNode per
cluster (for high availability, a standby may exist).
9.1.2 DataNodes: Store the actual data blocks on their local disks.
If a DataNode fails, HDFS automatically makes new replicas to
maintain the set replication factor.
The DataNodes are the worker nodes in HDFS and are responsible
for storing the actual file data in the form of blocks. Key points about
DataNodes include:
Stores Data Blocks: Each DataNode stores a set of blocks on its
local disks. The files are split into blocks (default: 64MB), and each
block is replicated as per the replication factor (default: 3).
Handles Read/Write Requests: DataNodes serve client read and
write requests by accessing the physical blocks.
Block Reports: Regularly send block reports and heartbeats to
the NameNode to confirm availability and block status.
Re-Replication: If a DataNode fails or a block is lost, the HDFS
system uses other DataNodes’ replicas to restore the lost blocks
automatically.
Scalability and Fault Tolerance: Data blocks are spread and
replicated across many DataNodes, ensuring fault tolerance and
improved performance.
Together, the NameNode and DataNodes form the backbone of
HDFS, allowing for distributed, fault-tolerant, and highly
scalable data storage and retrieval.
9.1.3 Secondary NameNode:
Purpose:
The Secondary NameNode periodically takes a snapshot of HDFS
metadata (the information that tells Hadoop which data blocks belong to which
file, where they are stored, etc.) at intervals specified in the Hadoop
configuration.
Functionality:
Unlike its name might suggest, the Secondary NameNode is not a backup
NameNode. Instead, it helps the main NameNode by:
Merging the FsImage (a persistent snapshot of HDFS metadata) and
the EditLog (which records all recent changes to metadata).
Producing a new, up-to-date FsImage file, which helps prevent the
EditLog from growing too large and keeps recovery times manageable in
case of failure.
Memory Requirements:
The memory requirements for the Secondary NameNode are similar to those
of the main NameNode.
Recommendation: It is best practice to run the NameNode and Secondary
NameNode on different machines to avoid overloading one machine.
Disaster Recovery:
If the main NameNode fails, the Secondary NameNode can be manually
used to help restore the cluster—but not automatically.
The Secondary NameNode does not continuously record real-time HDFS
metadata changes; it only has the latest snapshot from its last merge
operation.
9.2 Anatomy of File Read
Steps in HDFS File Read
1. Client Initiates Read (Open):
The client (using the HDFS Client API) wants to read a file and
calls open() on the DistributedFileSystem.
2. Get Block Locations (NameNode):
The DistributedFileSystem communicates with the NameNode to request
the location of the file’s data blocks.
The NameNode responds with the addresses of the DataNodes that store
replicas of each block.
3. Input Stream Initialization:
The DistributedFileSystem returns a FSDataInputStream object to the
client.
This stream enables the client to read from the file. It contains information
about which DataNodes hold the required blocks.
4. Reading Data from DataNodes:
The client uses the FSDataInputStream to call read().
For the first file block, the client connects to the nearest (most optimal)
DataNode containing that block.
5. Streaming Blocks:
The client repeatedly calls read() to stream the data from the connected
DataNode.
When it reaches the end of the current block, the client closes the
connection to that DataNode.
The input stream then moves to the next block, connecting to the best
DataNode for that block, and the process continues until the whole file is
read.
6. Closing Connection:
After the file is completely read, the client calls close() on
the FSDataInputStream to close the connection and end the operation.
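The same sequence, seen from code: a minimal sketch of an HDFS read using the FileSystem API (the path is illustrative); open() performs steps 1-3 and the copy loop performs steps 4-6.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        // FileSystem.get() returns the DistributedFileSystem described above.
        FileSystem fs = FileSystem.get(new Configuration());

        // Steps 1-3: open() asks the NameNode for block locations and returns
        // an FSDataInputStream that knows which DataNodes hold each block.
        FSDataInputStream in = fs.open(new Path("/sample/test.txt")); // illustrative path
        try {
            // Steps 4-5: read() streams the data block by block from the DataNodes.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Step 6: close() ends the read operation.
            IOUtils.closeStream(in);
        }
        fs.close();
    }
}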
9.3 Anatomy of File Write
Steps in HDFS File Write:
1. Client Requests File Creation
o The HDFS Client (running inside Client JVM) asks the DistributedFileSystem to
create a new file.
2. NameNode Involvement
o The DistributedFileSystem contacts the NameNode to create a file entry (metadata
only, no data yet).
o The NameNode checks if the file already exists, user permissions, etc.
o If valid, it creates an empty file entry in its namespace.
3. Write Starts
o The client starts writing data to the file using FSDataOutputStream.
4. Data Written in Packets → Sent to DataNodes
o The data is split into packets.
o These packets are sent to a pipeline of DataNodes (as per replication
factor, usually 3).
o Example: Client → DataNode1 → DataNode2 → DataNode3.
5. Acknowledging Packets (Ack)
Each DataNode stores the packet and then sends an acknowledgment back
through the pipeline:
DataNode3 → DataNode2 → DataNode1 → Client.
This ensures all replicas have safely stored the data before moving to the
next packet.
6. Close Operation
After finishing all writes, the client calls close() on FSDataOutputStream.
7. Completion Notification
Finally, the DistributedFileSystem informs the NameNode that the file
write is complete.
The NameNode updates its metadata to mark the file as closed and ready
for use.
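The corresponding write path in code: a minimal sketch using the FileSystem API (path and content illustrative); create() covers steps 1-2, the packetized pipeline happens behind the scenes during the write (steps 3-5), and close() triggers steps 6-7.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Steps 1-2: create() asks the NameNode to add an empty file entry to its namespace.
        FSDataOutputStream out = fs.create(new Path("/sample/write-demo.txt")); // illustrative path
        try {
            // Steps 3-5: the bytes are buffered into packets and pushed down the
            // DataNode pipeline; acknowledgments flow back before the next packet.
            out.writeBytes("hello hdfs pipeline\n");
        } finally {
            // Steps 6-7: close() flushes the remaining packets, and the
            // DistributedFileSystem tells the NameNode the file is complete.
            out.close();
        }
        fs.close();
    }
}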
9.4 Steps of the Default Replica Placement Strategy
First Replica
Placed on the same node as the client writing the data. This optimizes for
write performance since one copy is immediately local.
Second Replica
Placed on a node that is on a different rack from the first.
Racks are physical groupings of nodes; having replicas on different racks
protects data from rack-level failures (like a network switch or power
outage).
Third Replica
Placed on the same rack as the second replica, but on a different
node within that rack. This balances fault tolerance and network usage.
Key Points of the Strategy
Pipeline Creation:
Once replica locations are determined, Hadoop sets up a pipeline between
these nodes for data transfer and replication during writes.
Reliability:
By placing one replica on the local node, one on a remote rack, and one more
on the same remote rack, Hadoop ensures that data is protected from both
node and rack failures, while also reducing cross-rack network traffic.
In summary, this strategy ensures a balance of reliability (protection from
failures) and network efficiency by distributing replicas intelligently across
nodes and racks within the Hadoop cluster.
9.5 Working with HDFS Commands
Common HDFS (Hadoop Distributed File System) commands and their objectives.
1. Listing Directories and Files at the Root of HDFS
Objective: Get the list of directories and files at the root of HDFS.
Command:
hadoop fs -ls /
-ls → list files and directories.
/ → root directory in HDFS.
2. Listing All Directories and Files Recursively
Objective: Get the list of all directories and files in HDFS recursively.
Command:
hadoop fs -ls -R /
-R → recursive (goes inside subdirectories too).
This shows all files and folders in HDFS.
3. Creating a Directory
Objective: Create a directory (e.g., sample) in HDFS.
Command:
hadoop fs -mkdir /sample
-mkdir → make directory.
/sample → directory name in HDFS.
Creates a new directory named /sample.
4. Copying a File from Local to HDFS
Objective: Copy a file from the local file system to HDFS.
Command:
hadoop fs -put /root/sample/test.txt /sample/test.txt
-put → upload file from local filesystem to HDFS.
First path → local file location.
Second path → destination in HDFS.
Uploads test.txt from the local path to the /sample directory in HDFS.
5. Copying a File from HDFS to Local
Objective: Copy a file from HDFS to the local file system.
Command:
hadoop fs -get /sample/test.txt /root/sample/test.txt
-get → download file from HDFS to local filesystem.
First path → HDFS file.
Second path → local destination.
Downloads test.txt from HDFS to your local machine.
6. Copy from Local Using copyFromLocal
Objective: Copy a file from local file system to HDFS using copyFromLocal.
Command:
hadoop fs -copyFromLocal /root/sample/test.txt /sample/test.txt
Another way to upload a local file to HDFS.
-copyFromLocal works like -put.
7. Copy to Local Using copyToLocal
Objective: Copy a file from HDFS to local file system using copyToLocal.
Command:
hadoop fs -copyToLocal /sample/test.txt /root/sample/test.txt
-copyToLocal works like -get.
Another command to download a file from HDFS to local.
8. Displaying Contents of a File
Objective: Display the contents of a file stored in HDFS on the console.
Command:
hadoop fs -cat /sample/test.txt
-cat → shows file content on the terminal.
Outputs the contents of test.txt from HDFS.
9. Copying a File within HDFS
Objective: Copy a file from one directory to another within HDFS.
Command:
hadoop fs -cp /sample/test.txt /sample1
-cp → copy within HDFS.
First path → source file.
Second path → destination directory.
Copies test.txt to /sample1.
10. Removing a Directory
Objective: Remove a directory from HDFS.
Command:
hadoop fs -rm -r /sample1
-rm → remove file/directory.
-r → recursive (needed for directories).
Deletes the /sample1 directory and all its contents recursively.
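The same operations can also be performed programmatically. Below is a sketch of FileSystem-API equivalents for a few of the shell commands above; the paths are the illustrative ones used in the examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // hadoop fs -ls /
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }

        // hadoop fs -mkdir /sample
        fs.mkdirs(new Path("/sample"));

        // hadoop fs -put /root/sample/test.txt /sample/test.txt
        fs.copyFromLocalFile(new Path("/root/sample/test.txt"), new Path("/sample/test.txt"));

        // hadoop fs -get /sample/test.txt /root/sample/test.txt
        fs.copyToLocalFile(new Path("/sample/test.txt"), new Path("/root/sample/test.txt"));

        // hadoop fs -rm -r /sample1
        fs.delete(new Path("/sample1"), true);

        fs.close();
    }
}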
9.6 Special Features of HDFS
1. Data Replication
What it is: HDFS automatically makes multiple copies (replicas) of every data
block.
Why it matters: Clients (applications accessing HDFS) do not need to keep
track of where each piece of data is stored. If you need to read or write data,
the system automatically directs the operation to the nearest available copy
(replica) to ensure high performance and reliability.
Benefit: This ensures that even if some machines fail, the data remains safe
and quickly accessible.
2. Data Pipeline
What it is: When a client writes data to HDFS, it doesn't write to all replicas at
once. Instead, it sends the data block to the first DataNode (storage server) in a
list called a "pipeline."
How it works:
The first DataNode writes the block, then passes it to the second
DataNode, which in turn passes it to the third, and so on, until all replicas
receive the data.
This process continues for all blocks of the file, making sure every replica
is written in order.
Benefit: This pipeline process allows for efficient and reliable replication of
blocks, improves write performance, and maintains data integrity across the
system.
10. PROCESSING DATA WITH HADOOP
MapReduce is a software framework that processes massive amounts of data in
parallel by splitting the input data into independent chunks.
10.1 MapReduce Daemons
The two main roles in the MapReduce framework are:
1. JobTracker (Master):
JobTracker is the boss in Hadoop’s MapReduce.
When you give Hadoop some code (a job), the JobTracker:
Makes a plan for how the work will be split.
Assigns small tasks to different computers (nodes).
Keeps an eye on all tasks.
If something fails, it gives the work to another node.
There is only one JobTracker in the whole cluster.
2. TaskTracker (Slave):
TaskTracker is the worker in Hadoop’s MapReduce.
Every computer (node) has one TaskTracker.
The TaskTracker:
Gets tasks from the JobTracker and runs them.
Starts multiple JVMs (small programs) to do work in parallel.
Sends a heartbeat message regularly to the JobTracker to say “I’m alive and
working.”
If the JobTracker doesn’t hear from it, it thinks the TaskTracker has failed.
Phases in MapReduce:
Map: Converts inputs into key-value pairs, with each map task operating in parallel.
Reduce: Combines outputs from the mappers and generates a reduced result,
aggregating the map outputs.
Example (word count):
Input: A list of words in a file.
Map output:
(cat, 1)
(dog, 1)
(cat, 1)
Reduce input:
(cat, [1,1])
(dog, [1])
Reduce output:
(cat, 2)
(dog, 1)
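To see the two phases end to end, here is a toy, single-machine Java sketch of the same cat/dog example; Hadoop distributes these steps across many nodes, but the shape of map, shuffle, and reduce is the same.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapReduceOnOneMachine {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("cat", "dog", "cat");

        // "Map" phase: emit a (word, 1) pair for every word.
        List<String[]> pairs = new ArrayList<>();
        for (String w : words) {
            pairs.add(new String[] {w, "1"});
        }

        // Shuffle + "Reduce" phase: group the pairs by key and sum the 1s.
        Map<String, Integer> counts = new HashMap<>();
        for (String[] pair : pairs) {
            counts.merge(pair[0], Integer.parseInt(pair[1]), Integer::sum);
        }

        System.out.println(counts); // {cat=2, dog=1}
    }
}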
10.2 How Does MapReduce Work?
MapReduce is a programming model for processing large amounts of data in parallel
across a distributed cluster (multiple computers). It breaks the data analysis task into two
main phases:
1. Map Phase → Takes input data, processes it, and converts it into key-value pairs.
2. Reduce Phase → Aggregates (combines) and summarizes the output from the map
phase.
The Components in the Diagram
1. Client
The user/application that submits a job (task to be executed).
2. JobTracker (Master Node)
Receives the job from the client.
Divides the job into smaller tasks.
Assigns tasks to TaskTrackers on different worker nodes.
Monitors the progress and handles failures.
3. TaskTracker (Worker Node)
Executes the tasks (Map and Reduce) assigned by the JobTracker.
Reports the task status (success/failure) back to the JobTracker.
4. Map Tasks
Process raw input data.
Transform it into intermediate key-value pairs.
Example: Counting words → ("apple", 1), ("banana", 1).
5. Reduce Tasks
Collect all key-value pairs from the Map tasks.
Aggregate them to give the final result.
Example: Combine counts → ("apple", 5), ("banana", 3).
10.3 MapReduce Example
Objective
Count the number of occurrences of each word in a large collection of text files
using Hadoop MapReduce.
Key Components
Driver Class
Sets up and configures the job (specifies mapper, reducer, input/output paths, etc.)
Example: WordCounter class
Mapper Class
Reads lines of text, splits them into words, and outputs each word as a key with
value 1.
Example: WordCounterMap class
Reducer Class
Sums up the counts for each word key to get the total occurrences.
Example: (Reducer code not shown here, but typically combines values for keys)
What this driver does:
Creates a Hadoop job.
Attaches the Mapper (WordCounterMap) and Reducer (WordCounterRed).
Sets input file and output folder.
Submits job to Hadoop framework for execution.
1) This is a Hadoop MapReduce program to count the number of times each
word appears in a text file.
WordCounter.java: DRIVER PROGRAM
package wordcount; // Defines the Java package (package name illustrative).
The program imports Hadoop classes:
import java.io.IOException; // Needed for input/output error handling.
For example, if Hadoop fails to read/write files, IOException is thrown.
import org.apache.hadoop.fs.Path; // Hadoop uses its own special class → Path.
In Hadoop, we don’t use normal Java File.
Instead, we use Hadoop’s Path class to point to HDFS files.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
Hadoop doesn’t use normal Java types (int, String) because they are not
serializable by default.
IntWritable = Hadoop’s version of int
Text = Hadoop’s version of String
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
Job is used to set everything about the MapReduce program (mapper, reducer, input/output,
etc.).
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
Base classes for writing Mapper and Reducer.
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
These are ready-made input/output classes.
FileInputFormat → Reads files from HDFS as input.
TextInputFormat → Treats each line of text as one record.
FileOutputFormat → Writes job’s result to HDFS.
TextOutputFormat → Writes output as plain text.
public class WordCounter {
Defines the main driver class of the program.
public static void main (String [] args) throws Exception {
•Entry point of the program (when you run it).
•throws Exception means: if something goes wrong (I/O, Hadoop error), program exits.
Job job = new Job();
•Creates a job configuration object.
•Hadoop uses this to know what to run.
[Link]("wordcounter");
Helpful for logs and debugging.
When you run Hadoop jobs, you’ll see the job name in tracking UI.
job.setJarByClass(WordCounter.class);
Tells Hadoop which JAR file contains your program.
It uses this class (WordCounter) to locate the JAR.
job.setMapperClass(WordCounterMap.class);
job.setReducerClass(WordCounterRed.class);
•Tells Hadoop which Mapper and Reducer to use.
•These are user-defined classes (you’ll write them separately).
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
•Defines the output data type from the Reducer.
•Output will be → (word, count)
•word → Text
•count → IntWritable
FileInputFormat.addInputPath(job, new Path("/sample/test.txt"));
FileOutputFormat.setOutputPath(job, new Path("/sample/wordcount"));
Tells Hadoop where to read input from (/sample/test.txt)
And where to write results (/sample/wordcount)
Both are HDFS paths (not local file system).
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
job.waitForCompletion(true) → Submits the job to Hadoop and waits until it finishes.
If job is successful → exit with code 0.
If job fails → exit with code 1.
What Happens When You Run It
Input file (test.txt) contains some text.
Mapper splits lines into words → emits (word, 1).
Hadoop shuffles/sorts words.
Reducer adds up counts → emits (word, total_count).
Output is saved in /sample/wordcount/part-r-00000.
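Putting the fragments above together, a consolidated version of the driver might look like this; the package name and the /sample/test.txt input path are illustrative, as in the walkthrough.

package wordcount; // package name illustrative

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCounter {
    public static void main(String[] args) throws Exception {
        Job job = new Job();                        // job configuration object
        job.setJobName("wordcounter");              // name shown in the tracking UI
        job.setJarByClass(WordCounter.class);       // locate the JAR containing this class

        job.setMapperClass(WordCounterMap.class);   // user-defined Mapper
        job.setReducerClass(WordCounterRed.class);  // user-defined Reducer

        job.setOutputKeyClass(Text.class);          // output key: the word
        job.setOutputValueClass(IntWritable.class); // output value: the count

        FileInputFormat.addInputPath(job, new Path("/sample/test.txt"));    // HDFS input (illustrative)
        FileOutputFormat.setOutputPath(job, new Path("/sample/wordcount")); // HDFS output folder

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}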
2) WordCounterMap.java: Map Class of the MapReduce program for Word Count.
package wordcount; // package name illustrative
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCounterMap extends Mapper<LongWritable, Text, Text,
IntWritable> {
Defines the Mapper class.
<LongWritable, Text, Text, IntWritable> means:
Input key → LongWritable (position of line in file).
Input value → Text (the actual line of text).
Output key → Text (the word).
Output value → IntWritable (the count = 1).
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
•This is the map method that Hadoop calls automatically for each line of the input.
•Parameters:
•key → line offset in file (not usually used in word count).
•value → the actual text line.
•context → used to emit output (key-value pairs).
•throws IOException, InterruptedException → handles errors that may occur during
processing.
String[] words = value.toString().split(",");
•value.toString() → Convert the Hadoop Text object into a normal Java String.
•.split(",") → Split the line into words, using comma (,) as a separator.
So if a line is: "apple,banana,apple" → it becomes an array ["apple", "banana",
"apple"].
for (String word : words) {
Loop through each word in the array.
Example: For ["apple", "banana", "apple"], it loops 3 times.
context.write(new Text(word), new IntWritable(1));
}
}
}
Emits the key-value pair (word, 1) for each word found.
Example:
"apple" → (apple, 1)
"banana" → (banana, 1)
"apple" → (apple, 1)
These outputs are sent to Hadoop’s Shuffle & Sort phase before going to the Reducer.
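For reference, here are the mapper fragments assembled into one class (package name illustrative):

package wordcount; // package name illustrative

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCounterMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the comma-separated line into words and emit (word, 1) for each one.
        String[] words = value.toString().split(",");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}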
3) WordCounterRed.java: Reduce Class
package wordcount; // package name illustrative
Defines that this class belongs to the package wordcount.
Packages are like folders in Java used to organize classes.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCounterRed extends Reducer<Text, IntWritable, Text, IntWritable>
{
Defines the Reducer class.
<Text, IntWritable, Text, IntWritable> means:
Input key → Text (the word).
Input values → IntWritable (list of counts, usually many 1s).
Output key → Text (the word).
Output value → IntWritable (the final count).
@Override
protected void reduce(Text word, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
This is the reduce method that Hadoop calls automatically for each unique key
(word).
Parameters:
word → one unique word (e.g., "apple").
values → all the counts (e.g., [1, 1, 1, 1]).
context → used to emit final results (word, total_count).
Integer count = 0;
Initialize a local counter variable.
This will be used to sum up all the counts for the current word.
for (IntWritable val : values) {
count += val.get();
}
Loops through the list of counts (values).
val.get() converts Hadoop’s IntWritable into a normal int.
Adds each count to count.
Example: for "apple" with values [1,1,1], after loop → count = 3.
context.write(word, new IntWritable(count));
}
}
Emits the final result for that word.
Example: (apple, 3)
This gets written to the output file in HDFS.
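And the reducer fragments assembled into one class (package name illustrative):

package wordcount; // package name illustrative

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCounterRed extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted for this word by the mappers.
        int count = 0;
        for (IntWritable val : values) {
            count += val.get();
        }
        context.write(word, new IntWritable(count)); // e.g., (apple, 3)
    }
}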
SQL versus MapReduce:
Access: SQL supports interactive and batch access; MapReduce supports batch only.
Structure: SQL requires a static schema; MapReduce works with a dynamic schema.
Updates: SQL data is read and written many times; MapReduce data is written once and read many times.
Integrity: SQL offers high integrity; MapReduce offers low integrity.
Scalability: SQL scales nonlinearly; MapReduce scales linearly.
11. MANAGING RESOURCES AND APPLICATIONS WITH HADOOP
YARN (YET ANOTHER RESOURCE NEGOTIATOR)
What is Hadoop YARN?
YARN is a core component introduced in Hadoop 2.x.
Stands for Yet Another Resource Negotiator.
It’s responsible for cluster resource management and enables running
multiple types of applications (not just MapReduce) on Hadoop.
YARN allows Hadoop to support various processing models: batch, interactive,
streaming, graph, and more.
11.1 Limitations of Hadoop 1.0 Architecture
Single NameNode Bottleneck: In Hadoop 1.0, only one NameNode manages
the entire file system's metadata for the Hadoop cluster. If it becomes
overloaded or fails, it can disrupt the whole cluster.
Restricted Processing Model: Hadoop 1.0 is built mainly for batch-oriented
processing using MapReduce. It doesn’t support interactive, online, or
streaming workloads.
Not Suitable for Interactive or Advanced Analysis: The MapReduce model
is poorly suited for real-time, interactive analytics, or in-memory machine
learning.
Resource Utilization Issues: Resource slots are rigidly divided into "map"
and "reduce" slots. Sometimes, map slots are full while reduce slots sit idle,
and vice versa, resulting in inefficient resource use.
MapReduce Constrained: MapReduce is responsible for both resource
management and data processing, leading to limited flexibility and scalability.
11.3 Hadoop 2: HDFS
Components:
(a) Namespace: Handles files/directories, manages creation, modification, and
deletion.
(b) Blocks Storage Service: Handles data blocks, replication, and storage
management.
HDFS 2 Features:
Horizontal Scalability: Uses federation of multiple independent NameNodes,
allowing more clusters to scale out and store bigger datasets.
High Availability: Passive Standby NameNode enables a backup NameNode
that can take over automatically if the active one fails, preventing single points
of failure.
How It Works:
Multiple NameNodes, each with its own namespace, share
the DataNodes for block storage—enabling parallel
management and failover.
All namespace changes are saved to shared storage. If the
Active NameNode fails, the Passive Standby is updated and
quickly takes over.
11.4 Hadoop 2 YARN: Taking Hadoop Beyond Batch
YARN lets us store all data in one place and interact with it in multiple ways to
get predictable performance and quality of service. YARN was originally architected
at Yahoo.
Active NameNode: This is the primary NameNode that handles all client requests
(read/write operations) in HDFS.
Passive NameNode (Standby): This is a backup NameNode that remains in
standby mode. It does not serve requests directly but is kept updated.
Shared Edit Logs: Both Active and Passive NameNodes share the edit log storage.
The Active NameNode writes changes (metadata updates) into the shared log.
The Passive NameNode continuously reads these logs to stay synchronized with
the Active NameNode.
Purpose: This setup ensures high availability (HA). If the Active NameNode fails,
the Passive can quickly take over without major downtime.
Figure 5.27: Hadoop 1.x vs Hadoop 2.x
This figure shows the architectural difference between Hadoop
versions.
Hadoop 1.x
Consists of two main components:
1. MapReduce: Handles both cluster resource management and data
processing.
2. HDFS (Hadoop Distributed File System): Provides redundant and
reliable storage.
Limitation: MapReduce was tightly coupled with resource
management → it could only run MapReduce jobs, no support for
other processing models.
Hadoop 2.x
Enhanced architecture by separating cluster management from data
processing:
1. YARN (Yet Another Resource Negotiator): A new resource manager
that handles cluster management and job scheduling.
2. MapReduce: Still present but now only for data processing.
3. Others: Hadoop 2.x supports other data processing engines like
Spark, Tez, etc., not just MapReduce.
4. HDFS: Continues to provide redundant and reliable storage.
Advantages of Hadoop 2.x:
Supports multiple frameworks beyond MapReduce (e.g., Spark,
Storm).
Better resource utilization through YARN.
Scalability and flexibility improved.
This diagram is about Hadoop YARN (Yet Another Resource Negotiator).
It shows how YARN manages different kinds of applications in a Hadoop cluster.
1. Applications Supported
At the top, different kinds of applications are shown that can run on Hadoop with the
help of YARN:
Batch (MR) → Traditional MapReduce batch jobs.
Interactive (TEZ) → Framework for fast interactive query processing (used in Hive,
Pig).
Online (HBASE) → Real-time, online database processing with HBase.
Streaming (Storm) → Stream data processing in real time.
In-Memory (Spark) → In-memory analytics for very fast data computation.
Others (Graph, Search) → Specialized applications like graph processing
(e.g., Giraph) or search engines.
YARN Architecture Diagram:
Components:
Client – Submits jobs to YARN.
ResourceManager (RM) – The master that manages resources
across the cluster.
NodeManager (NM) – Runs on each node and manages
containers on that node.
ApplicationMaster (AppMstr) – Created per application;
negotiates resources from RM and works with NodeManagers.
Container – A resource allocation (CPU, memory, etc.) where
tasks run.
Flow of Execution (Steps in Diagram):
1. Job Submission –
The client submits the application/job to the ResourceManager.
2. Start ApplicationMaster –
The RM contacts a NodeManager to start an ApplicationMaster
for that job in a container.
3. Resource Request –
The ApplicationMaster communicates with the
ResourceManager, asking for resources (containers) to run tasks.
4. Resource Allocation –
The ResourceManager allocates containers across various
NodeManagers.
5. Launching Containers –
The ApplicationMaster communicates with the NodeManagers to
launch containers.
6. Task Execution –
Containers run the actual tasks (map, reduce, or other processing
logic).
7. Client Communication –
The client communicates with the ResourceManager or directly
with the ApplicationMaster for status updates.
[Link] & Completion –
The ApplicationMaster monitors tasks, reports progress to the
ResourceManager, and releases resources once the job completes.
Example with WordCount (MapReduce job)
You submit a WordCount job.
YARN creates an ApplicationMaster just for that job.
That AppMaster:
Asks RM → “I need containers for mappers and reducers.”
RM allocates them on available nodes.
AppMaster tells NodeManagers on those nodes → “Run mapper here, run
reducer there.”
Collects progress, retries failures if needed.
When the job is done → AppMaster shuts down.
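A hedged sketch of how a client driver is pointed at YARN: the mapreduce.framework.name property selects YARN as the execution framework, and the ResourceManager address shown is illustrative (in practice both usually come from the cluster's configuration files). It reuses the WordCounterMap/WordCounterRed classes sketched earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import wordcount.WordCounterMap;
import wordcount.WordCounterRed;

public class SubmitOnYarn {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run MapReduce on YARN instead of the Hadoop 1.x JobTracker framework.
        conf.set("mapreduce.framework.name", "yarn");
        // ResourceManager address (host:port illustrative; normally read from yarn-site.xml).
        conf.set("yarn.resourcemanager.address", "resourcemanager.example.com:8032");

        Job job = Job.getInstance(conf, "wordcount-on-yarn");
        job.setJarByClass(SubmitOnYarn.class);
        job.setMapperClass(WordCounterMap.class);
        job.setReducerClass(WordCounterRed.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/sample/test.txt"));         // illustrative
        FileOutputFormat.setOutputPath(job, new Path("/sample/wordcount-yarn")); // illustrative

        // Submission starts an ApplicationMaster for this job (step 2 above), which then
        // negotiates containers for the map and reduce tasks with the ResourceManager.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}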