1. INTRODUCING HADOOP
Big Data refers to the massive amounts of structured, semi-structured, and
unstructured data generated daily from various sources. Its major
characteristics are:
Volume: The sheer amount of data produced (e.g., NYSE generates 1.5 billion
shares and trade data daily, Facebook stores 2.7 billion comments and likes
daily, Google processes about 24 petabytes of data daily).
Variety: Data comes in many formats—text, videos, logs, images, etc.—from
varied sources.
Velocity: The speed at which data is generated and needs to be processed (e.g.,
Twitter users tweet 300,000 times per minute, email users send 200 million
messages per minute).
Value: The potential insights and business advantages that can be extracted
from analyzing this data.
2. WHY HADOOP
Hadoop can handle huge amounts of different types of data quickly. This is
why it is so widely adopted for big data tasks.
Low Cost:
Hadoop is open-source and uses inexpensive, easily available hardware to
store large volumes of data.
Computing Power:
Hadoop processes very large volumes of data by using many computers
(nodes) together. More nodes = more processing power.
Scalability:
If you need more processing, you simply add more nodes. Administration
is easy as the system grows.
Storage Flexibility:
You don’t need to pre-process data to store it. Hadoop lets you store any
type, size, or format (structured, unstructured, free-form) of data.
Inherent Data Protection:
Hadoop protects data by creating multiple copies across nodes. If one node
fails, work is redirected to others, so data and processes continue without
interruption.
How Does Hadoop Work? (Figure 5.3):
Hadoop uses a distributed file system on low-cost hardware. Here’s how it manages data:
Distributes and Duplicates Data:
Breaks down data files into chunks (for example, “25–30” could be one chunk)
and duplicates each chunk across several nodes. This ensures reliability and quick
access.
Parallel Processing:
Each node processes its own chunk of data at the same time (in parallel), speeding
up computation.
Automatic Failover:
If a node fails, Hadoop automatically reassigns its tasks to working nodes, keeping
the system running smoothly.
In Summary:
Hadoop is popular because it's affordable, powerful, flexible, scalable, and fault-
tolerant. It is designed to handle the biggest data problems by distributing both storage
and computation across many inexpensive machines, making processing of vast and
varied data not just possible—but efficient and reliable.
3. WHY NOT RDBMS FOR BIG DATA?
Poor fit for large files:
RDBMS is not suitable for storing and processing large files, images, and
videos. These types of data are increasingly common in big data scenarios.
Not ideal for advanced analytics:
RDBMS systems struggle with advanced analytics and machine
learning workloads, often required in modern data applications.
Cost Issues:
As data grows, the cost per GB of storage in RDBMS rises steeply.
The image and the explanation (Figure 5.4) illustrate that as you scale up
(from GBs to TBs to PBs), the cost per GB (€ to €€ to €€€) increases
significantly.
Scaling an RDBMS to petabyte levels becomes prohibitively expensive.
Scaling Model (Scale-Up):
Traditional RDBMS scale by adding more powerful (and expensive) hardware,
not by adding more cheap computers. This is called "scaling up" and is costly.
As you increase the amount of data (from GBs to TBs to PBs), the cost per GB goes up
drastically, shown by € → €€ → €€€.
The arrow labeled "Scale Up" shows that this increasing cost is tied to the way RDBMS
systems are traditionally scaled for big data—by investing in larger, more powerful machines.
4. RDBMS VERSUS HADOOP
PARAMETERS: RDBMS versus HADOOP
System: RDBMS is a Relational Database Management System; Hadoop is a node-based flat structure.
Data: RDBMS is suitable for structured data; Hadoop is suitable for structured and unstructured
data and supports a variety of data formats in real time, such as XML, JSON, and text-based flat
file formats.
Processing: RDBMS supports OLTP; Hadoop supports analytical, big data processing.
Choice: RDBMS is the choice when the data needs consistent relationships; Hadoop is the choice for
big data processing, which does not require any consistent relationships between data.
Processor: RDBMS needs expensive hardware or high-end processors to store huge volumes of data;
in a Hadoop cluster, a node requires only a processor, a network card, and a few hard drives.
Cost: RDBMS costs around $10,000 to $14,000 per terabyte of storage; Hadoop costs around $4,000
per terabyte of storage.
5. DISTRIBUTED COMPUTING CHALLENGES
5.1. Hardware Failure
In distributed systems, multiple servers or hard disks are networked together,
increasing the likelihood of hardware failure.
For instance, a typical hard disk might fail once in three years, but with 1,000
disks, failures are a daily possibility.
Main problem: How to retrieve data if a hardware component fails?
Hadoop’s Solution: Replication Factor
Replication Factor (RF): This denotes the number of copies of each data
block stored across the network.
Example: If RF=2, there are two copies of each data block on different
servers.
Why it matters: If one node fails, data can still be accessed from the other
copy, ensuring fault-tolerance and reliability.
The system constantly maintains the set number of replicas, so if one is lost
due to hardware failure, it is automatically recreated elsewhere.
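To make the idea concrete, here is a minimal Java sketch (using Hadoop's FileSystem API) that sets and then reads back the replication factor of a file; the HDFS path is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/sample/test.txt"); // illustrative HDFS path

        // Ask the NameNode to keep 3 copies of every block of this file (RF = 3).
        fs.setReplication(file, (short) 3);

        // The replication factor is part of the metadata tracked by the NameNode.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor of " + file + " = " + status.getReplication());

        fs.close();
    }
}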
5.2. How to Process This Gigantic Store of Data?
Distributed systems store data across many networked machines. The
challenge is integrating all these pieces for efficient processing.
Main problem: How to process large volumes of distributed data as a unified
dataset?
Hadoop’s Solution: MapReduce Programming
MapReduce is a programming model to process data across multiple
machines.
It lets developers write programs that handle massive data sets by
dividing the tasks among many nodes and aggregating the results.
This enables efficient computation and data integration, even when data is
distributed across a huge cluster.
Summary:
Distributed computing faces challenges of hardware failure and massive,
distributed data processing. Hadoop addresses these using Replication
Factor for fault-tolerant storage and MapReduce for parallel data processing,
making it highly suitable for big data environments.
6. HISTORY OF HADOOP
Creator:
Hadoop was created by Doug Cutting, who is also known for developing
Apache Lucene (a popular text search library).
Origins:
Hadoop began as part of the Apache Nutch project, which was an
open-source web search engine.
2002:
Doug Cutting and Mike Cafarella started working on Nutch.
2003-2004:
Google published papers on Google File System (GFS) and
MapReduce. These concepts influenced Hadoop's architecture.
2005:
Doug Cutting added Distributed File System (DFS) and MapReduce
to Nutch.
2006:
Yahoo hired Doug Cutting, and Hadoop was spun out from Nutch as
a separate project.
2008:
Cloudera, a major Hadoop vendor, was founded.
2009:
Doug Cutting joined Cloudera.
7. HADOOP OVERVIEW
Hadoop is an open-source software framework designed for storing and
processing massive amounts of data across clusters of inexpensive, commodity
hardware. It accomplishes two major goals:
Massive data storage
Faster data processing (in parallel)
7.1 Key Aspects of Hadoop
1. Open-Source Software:
Free to download, use, and contribute to.
2. Framework Approach:
Provides all necessary tools and programs to develop and execute distributed
data processing.
3. Distributed:
Data is divided and stored across multiple connected computers (nodes).
Processing tasks are distributed and performed in parallel, improving speed.
4. Massive Storage:
Stores colossal volumes of data using low-cost hardware.
5. Faster Processing:
Handles large-scale data processing in parallel for quick results.
7.2 Hadoop Components
Core Components
1. HDFS (Hadoop Distributed File System):
Main storage system which splits and distributes data across
multiple nodes.
Built-in redundancy ensures data durability.
2. MapReduce:
Programming model for processing and computing large
datasets across distributed nodes in a parallel fashion.
Hadoop Ecosystem (Supporting Projects)
HIVE: Data warehousing, SQL-like queries
PIG: High-level scripting for data analysis
SQOOP: Data transfer between Hadoop and RDBMS
HBASE: NoSQL database on Hadoop
FLUME: Collecting and loading log data
OOZIE: Workflow scheduling
MAHOUT: Machine learning
7.3 Conceptual Layers of Hadoop
Data Storage Layer:
Handles storing huge amounts of data (HDFS).
Data Processing Layer:
Processes data in parallel to extract insights (MapReduce).
[Diagram of conceptual layers: 'Data Storage' and 'Data Processing'. Figure 5.9]
7.4 High-Level Architecture
Hadoop uses a Master-Slave architecture:
Master Node (NameNode):
Manages file system namespace and controls access (where to find what data).
Responsible for coordination, storage partitioning.
Slave Nodes (DataNodes):
Actually store the data and perform computation (processing tasks).
Each node runs components for:
Computation (MapReduce)
Storage (HDFS)
[High-level architecture diagram showing one master node and multiple slave nodes,
each handling both computation and storage. Figure 5.10]
8. USE CASE OF HADOOP
8.1 ClickStream Data
ClickStream data refers to the sequence of mouse clicks made by users as they
navigate websites.
By analyzing this data, businesses can understand customer purchasing
behavior.
Marketers use ClickStream analysis to optimize product pages, promotional
content, and other aspects of their websites to improve business outcomes.
Why Use Hadoop for ClickStream Data?
Three Key Benefits:
Integration with Other Data Sources
Hadoop enables joining ClickStream data with other business data sources such
as:
CRM (Customer Relationship Management) - customer history, preferences
Customer demographics - age, location, etc.
Sales data - purchases, conversions
Advertising campaign data - ad impressions, clicks, conversions
This comprehensive integration delivers deeper insights into customer behavior.
Cost-Effective Scalability
Hadoop’s scalable architecture allows storage of years of clickstream
data with minimal incremental cost.
This means you can perform historical and year-over-year analysis without the
rising storage costs that traditional systems may incur.
Flexible Analysis with Hive and Pig
Business analysts can use Apache Hive or Apache Pig to:
Organize clickstream data by user session
Refine and preprocess it
Feed it into visualization or advanced analytics tools
These tools make it much easier to extract valuable patterns from massive
clickstream datasets.
Benefit and description:
Joins ClickStream with CRM & sales data: enables combined analysis for deeper insights.
Stores years of data inexpensively: facilitates long-term analysis without high incremental cost.
Analysis using Hive or Pig: streamlines data organization and analytics.
9. HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
What is HDFS?
HDFS (Hadoop Distributed File System) is the primary storage system of
Hadoop, designed for storing and managing large volumes of data reliably and
efficiently across multiple nodes.
Key Features of HDFS
1. Storage Component of Hadoop:
HDFS is responsible for storing all the data in a Hadoop cluster.
2. Distributed File System:
It spreads data across multiple computers (nodes) and allows parallel reading
and writing, improving speed and fault tolerance.
3. Modeled after Google File System:
The design draws inspiration from Google’s GFS, using similar ideas like
block storage and metadata management.
4. Optimized for High Throughput:
HDFS uses large block sizes and tries to move computation closer to where
data resides (“data locality”) for efficient processing of big files.
5. Replication for Fault Tolerance:
Each file is split into blocks (default 64 MB) and each block is replicated
(default: 3 copies) across the cluster. This ensures data is safe even if
nodes fail.
6. Automatic Block Re-Replication:
If a node fails, HDFS automatically re-copies (“re-replicates”) its
blocks to healthy nodes to maintain the required replication factor.
7. Handles Very Large Files:
HDFS excels at storing and processing files that are gigabytes or
terabytes in size.
8. Sits on Top of Existing File Systems:
It operates over native OS file systems like ext3/ext4, managing
how and where files are stored and retrieved.
Architectural Highlights
Block Structured:
Files are broken into blocks. Each block is replicated and stored on different
nodes for fault tolerance and efficient access.
Default Block Size:
Each block is typically 64MB (can be configured).
Default Replication Factor:
By default, each block has 3 copies on different nodes.
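Both defaults can also be overridden per file at creation time; the sketch below (path and values illustrative) passes an explicit 64 MB block size and a replication factor of 3 through the FileSystem.create() overload.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/sample/blocksize-demo.txt"); // illustrative HDFS path
        long blockSize = 64L * 1024 * 1024; // 64 MB, the default mentioned above
        short replication = 3;              // default replication factor
        int bufferSize = 4096;

        // Create the file with an explicit block size and replication factor.
        FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize);
        out.writeBytes("hello hdfs\n");
        out.close();

        fs.close();
    }
}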
9.1 HDFS Daemons
When a file of size 192 MB is stored, it is split into three
blocks of 64MB each. Each block is replicated three times (default setting) and
distributed across different nodes.
9.1.1 NameNode: Manages all file system metadata (file names, block locations,
permissions). It ensures files are split and replicated as per configuration.
The NameNode is the master server in the HDFS architecture. Its primary role
is to manage the file system namespace. Here are its key responsibilities:
Manages Metadata: Keeps records of the directory structure, file names,
permissions, and the location of blocks within the cluster.
Stores the Namespace: The complete mapping of file names to blocks and
blocks to DataNodes is called the file system namespace. This is stored in a file
called the FsImage.
Tracks Transactions: Every change to the metadata (like creating, deleting, or
renaming files) is logged in an EditLog (transaction log).
Rack Awareness: Uses rack IDs to identify DataNodes in different racks,
optimizing data placement and network utilization.
No Data Storage: The NameNode does NOT store the actual data—only the
metadata.
Single Point of Management: Typically, there is one active NameNode per
cluster (for high availability, a standby may exist).
9.1.2 DataNodes: Store the actual data blocks on their local disks.
If a DataNode fails, HDFS automatically makes new replicas to
maintain the set replication factor.
The DataNodes are the worker nodes in HDFS and are responsible
for storing the actual file data in the form of blocks. Key points about
DataNodes include:
Stores Data Blocks: Each DataNode stores a set of blocks on its
local disks. The files are split into blocks (default: 64MB), and each
block is replicated as per the replication factor (default: 3).
Handles Read/Write Requests: DataNodes serve client read and
write requests by accessing the physical blocks.
Block Reports: Regularly send block reports and heartbeats to
the NameNode to confirm availability and block status.
Re-Replication: If a DataNode fails or a block is lost, the HDFS
system uses other DataNodes’ replicas to restore the lost blocks
automatically.
Scalability and Fault Tolerance: Data blocks are spread and
replicated across many DataNodes, ensuring fault tolerance and
improved performance.
Together, the NameNode and DataNodes form the backbone of
HDFS, allowing for distributed, fault-tolerant, and highly
scalable data storage and retrieval.
9.1.3 Secondary NameNode:
Purpose:
The Secondary NameNode periodically takes a snapshot of HDFS
metadata (the information that tells Hadoop which data blocks belong to which
file, where they are stored, etc.) at intervals specified in the Hadoop
configuration.
Functionality:
Unlike its name might suggest, the Secondary NameNode is not a backup
NameNode. Instead, it helps the main NameNode by:
Merging the FsImage (a persistent snapshot of HDFS metadata) and
the EditLog (which records all recent changes to metadata).
Producing a new, up-to-date FsImage file, which helps prevent the
EditLog from growing too large and keeps recovery times manageable in
case of failure.
Memory Requirements:
The memory requirements for the Secondary NameNode are similar to those
of the main NameNode.
Recommendation: It is best practice to run the NameNode and Secondary
NameNode on different machines to avoid overloading one machine.
Disaster Recovery:
If the main NameNode fails, the Secondary NameNode can be manually
used to help restore the cluster—but not automatically.
The Secondary NameNode does not continuously record real-time HDFS
metadata changes; it only has the latest snapshot from its last merge
operation.
9.2 Anatomy of File Read
Steps in HDFS File Read
1. Client Initiates Read (Open):
The client (using the HDFS Client API) wants to read a file and
calls open() on the DistributedFileSystem.
2. Get Block Locations (NameNode):
The DistributedFileSystem communicates with the NameNode to request
the location of the file’s data blocks.
The NameNode responds with the addresses of the DataNodes that store
replicas of each block.
3. Input Stream Initialization:
The DistributedFileSystem returns a FSDataInputStream object to the
client.
This stream enables the client to read from the file. It contains information
about which DataNodes hold the required blocks.
4. Reading Data from DataNodes:
The client uses the FSDataInputStream to call read().
For the first file block, the client connects to the nearest (most optimal)
DataNode containing that block.
5. Streaming Blocks:
The client repeatedly calls read() to stream the data from the connected
DataNode.
When it reaches the end of the current block, the client closes the
connection to that DataNode.
The input stream then moves to the next block, connecting to the best
DataNode for that block, and the process continues until the whole file is
read.
6. Closing Connection:
After the file is completely read, the client calls close() on
the FSDataInputStream to close the connection and end the operation.
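The same sequence, seen from code: a minimal sketch of an HDFS read using the FileSystem API (the path is illustrative); open() performs steps 1-3 and the copy loop performs steps 4-6.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        // FileSystem.get() returns the DistributedFileSystem described above.
        FileSystem fs = FileSystem.get(new Configuration());

        // Steps 1-3: open() asks the NameNode for block locations and returns
        // an FSDataInputStream that knows which DataNodes hold each block.
        FSDataInputStream in = fs.open(new Path("/sample/test.txt")); // illustrative path
        try {
            // Steps 4-5: read() streams the data block by block from the DataNodes.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Step 6: close() ends the read operation.
            IOUtils.closeStream(in);
        }
        fs.close();
    }
}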
9.3 Anatomy of File Write
Steps in HDFS File Write:
1. Client Requests File Creation
o The HDFS Client (running inside Client JVM) asks the DistributedFileSystem to
create a new file.
2. NameNode Involvement
o The DistributedFileSystem contacts the NameNode to create a file entry (metadata
only, no data yet).
o The NameNode checks if the file already exists, user permissions, etc.
o If valid, it creates an empty file entry in its namespace.
3. Write Starts
o The client starts writing data to the file using FSDataOutputStream.
4. Data Written in Packets → Sent to DataNodes
o The data is split into packets.
o These packets are sent to a pipeline of DataNodes (as per replication
factor, usually 3).
o Example: Client → DataNode1 → DataNode2 → DataNode3.
5. Acknowledging Packets (Ack)
Each DataNode stores the packet and then sends an acknowledgment back
through the pipeline:
DataNode3 → DataNode2 → DataNode1 → Client.
This ensures all replicas have safely stored the data before moving to the
next packet.
6. Close Operation
After finishing all writes, the client calls close() on FSDataOutputStream.
7. Completion Notification
Finally, the DistributedFileSystem informs the NameNode that the file
write is complete.
The NameNode updates its metadata to mark the file as closed and ready
for use.
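The corresponding write path in code: a minimal sketch using the FileSystem API (path and content illustrative); create() covers steps 1-2, the packetized pipeline happens behind the scenes during the write (steps 3-5), and close() triggers steps 6-7.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Steps 1-2: create() asks the NameNode to add an empty file entry to its namespace.
        FSDataOutputStream out = fs.create(new Path("/sample/write-demo.txt")); // illustrative path
        try {
            // Steps 3-5: the bytes are buffered into packets and pushed down the
            // DataNode pipeline; acknowledgments flow back before the next packet.
            out.writeBytes("hello hdfs pipeline\n");
        } finally {
            // Steps 6-7: close() flushes the remaining packets, and the
            // DistributedFileSystem tells the NameNode the file is complete.
            out.close();
        }
        fs.close();
    }
}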
9.4 Steps of the Default Replica Placement Strategy
First Replica
Placed on the same node as the client writing the data. This optimizes for
write performance since one copy is immediately local.
Second Replica
Placed on a node that is on a different rack from the first.
Racks are physical groupings of nodes; having replicas on different racks
protects data from rack-level failures (like a network switch or power
outage).
Third Replica
Placed on the same rack as the second replica, but on a different
node within that rack. This balances fault tolerance and network usage.
Key Points of the Strategy
Pipeline Creation:
Once replica locations are determined, Hadoop sets up a pipeline between
these nodes for data transfer and replication during writes.
Reliability:
By placing one replica on the local node, one on a remote rack, and one more
on the same remote rack, Hadoop ensures that data is protected from both
node and rack failures, while also reducing cross-rack network traffic.
In summary, this strategy ensures a balance of reliability (protection from
failures) and network efficiency by distributing replicas intelligently across
nodes and racks within the Hadoop cluster.
9.5 Working with HDFS Commands
Common HDFS (Hadoop Distributed File System) commands and their objectives.
1. Listing Directories and Files at the Root of HDFS
Objective: Get the list of directories and files at the root of HDFS.
Command:
hadoop fs -ls /
-ls → list files and directories.
/ → root directory in HDFS.
2. Listing All Directories and Files Recursively
Objective: Get the list of all directories and files in HDFS recursively.
Command:
hadoop fs -ls -R /
-R → recursive (goes inside subdirectories too).
This shows all files and folders in HDFS.
3. Creating a Directory
Objective: Create a directory (e.g., sample) in HDFS.
Command:
hadoop fs -mkdir /sample
-mkdir → make directory.
/sample → directory name in HDFS.
Creates a new directory named /sample.
4. Copying a File from Local to HDFS
Objective: Copy a file from the local file system to HDFS.
Command:
hadoop fs -put /root/sample/test.txt /sample/test.txt
-put → upload file from local filesystem to HDFS.
First path → local file location.
Second path → destination in HDFS.
Uploads test.txt from the local path to the /sample directory in HDFS.
5. Copying a File from HDFS to Local
Objective: Copy a file from HDFS to the local file system.
Command:
hadoop fs -get /sample/test.txt /root/sample/test.txt
-get → download file from HDFS to local filesystem.
First path → HDFS file.
Second path → local destination.
Downloads test.txt from HDFS to your local machine.
6. Copy from Local Using copyFromLocal
Objective: Copy a file from local file system to HDFS using copyFromLocal.
Command:
hadoop fs -copyFromLocal /root/sample/test.txt /sample/test.txt
Another way to upload a local file to HDFS.
-copyFromLocal works like -put.
7. Copy to Local Using copyToLocal
Objective: Copy a file from HDFS to local file system using copyToLocal.
Command:
hadoop fs -copyToLocal /sample/test.txt /root/sample/test.txt
-copyToLocal works like -get.
Another command to download a file from HDFS to local.
8. Displaying Contents of a File
Objective: Display the contents of a file stored in HDFS on the console.
Command:
hadoop fs -cat /sample/test.txt
-cat → shows file content on the terminal.
Outputs the contents of test.txt from HDFS.
9. Copying a File within HDFS
Objective: Copy a file from one directory to another within HDFS.
Command:
hadoop fs -cp /sample/test.txt /sample1
-cp → copy within HDFS.
First path → source file.
Second path → destination directory.
Copies test.txt to /sample1.
10. Removing a Directory
Objective: Remove a directory from HDFS.
Command:
hadoop fs -rm -r /sample1
-rm → remove file/directory.
-r → recursive (needed for directories).
Deletes the /sample1 directory and all its contents recursively.
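The same operations can also be performed programmatically. Below is a sketch of FileSystem-API equivalents for a few of the shell commands above; the paths are the illustrative ones used in the examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // hadoop fs -ls /
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }

        // hadoop fs -mkdir /sample
        fs.mkdirs(new Path("/sample"));

        // hadoop fs -put /root/sample/test.txt /sample/test.txt
        fs.copyFromLocalFile(new Path("/root/sample/test.txt"), new Path("/sample/test.txt"));

        // hadoop fs -get /sample/test.txt /root/sample/test.txt
        fs.copyToLocalFile(new Path("/sample/test.txt"), new Path("/root/sample/test.txt"));

        // hadoop fs -rm -r /sample1
        fs.delete(new Path("/sample1"), true);

        fs.close();
    }
}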
9.6 Special Features of HDFS
1. Data Replication
What it is: HDFS automatically makes multiple copies (replicas) of every data
block.
Why it matters: Clients (applications accessing HDFS) do not need to keep
track of where each piece of data is stored. If you need to read or write data,
the system automatically directs the operation to the nearest available copy
(replica) to ensure high performance and reliability.
Benefit: This ensures that even if some machines fail, the data remains safe
and quickly accessible.
2. Data Pipeline
What it is: When a client writes data to HDFS, it doesn't write to all replicas at
once. Instead, it sends the data block to the first DataNode (storage server) in a
list called a "pipeline."
How it works:
The first DataNode writes the block, then passes it to the second
DataNode, which in turn passes it to the third, and so on, until all replicas
receive the data.
This process continues for all blocks of the file, making sure every replica
is written in order.
Benefit: This pipeline process allows for efficient and reliable replication of
blocks, improves write performance, and maintains data integrity across the
system.
10. PROCESSING DATA WITH HADOOP
MapReduce is a software framework that processes massive amounts of data in
parallel by splitting the input data into independent chunks.
10.1 MapReduce Daemons
The two main roles in the MapReduce framework are:
1. JobTracker (Master):
JobTracker is the boss in Hadoop’s MapReduce.
When you give Hadoop some code (a job), the JobTracker:
Makes a plan for how the work will be split.
Assigns small tasks to different computers (nodes).
Keeps an eye on all tasks.
If something fails, it gives the work to another node.
There is only one JobTracker in the whole cluster.
2. TaskTracker (Slave):
TaskTracker is the worker in Hadoop’s MapReduce.
Every computer (node) has one TaskTracker.
The TaskTracker:
Gets tasks from the JobTracker and runs them.
Starts multiple JVMs (small programs) to do work in parallel.
Sends a heartbeat message regularly to the JobTracker to say “I’m alive and
working.”
If the JobTracker doesn’t hear from it, it thinks the TaskTracker has failed.
Phases in MapReduce:
Map: Converts inputs into key-value pairs, with each map task operating in parallel.
Reduce: Combines outputs from the mappers and generates a reduced result,
aggregating the map outputs.
Example (word count):
Input: A list of words in a file.
Map output:
(cat, 1)
(dog, 1)
(cat, 1)
Reduce input:
(cat, [1,1])
(dog, [1])
Reduce output:
(cat, 2)
(dog, 1)
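To see the two phases end to end, here is a toy, single-machine Java sketch of the same cat/dog example; Hadoop distributes these steps across many nodes, but the shape of map, shuffle, and reduce is the same.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapReduceOnOneMachine {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("cat", "dog", "cat");

        // "Map" phase: emit a (word, 1) pair for every word.
        List<String[]> pairs = new ArrayList<>();
        for (String w : words) {
            pairs.add(new String[] {w, "1"});
        }

        // Shuffle + "Reduce" phase: group the pairs by key and sum the 1s.
        Map<String, Integer> counts = new HashMap<>();
        for (String[] pair : pairs) {
            counts.merge(pair[0], Integer.parseInt(pair[1]), Integer::sum);
        }

        System.out.println(counts); // {cat=2, dog=1}
    }
}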
10.2 How Does MapReduce Work?
MapReduce is a programming model for processing large amounts of data in parallel
across a distributed cluster (multiple computers). It breaks the data analysis task into two
main phases:
1. Map Phase → Takes input data, processes it, and converts it into key-value pairs.
2. Reduce Phase → Aggregates (combines) and summarizes the output from the map
phase.
The Components in the Diagram
1. Client
The user/application that submits a job (task to be executed).
2. JobTracker (Master Node)
Receives the job from the client.
Divides the job into smaller tasks.
Assigns tasks to TaskTrackers on different worker nodes.
Monitors the progress and handles failures.
3. TaskTracker (Worker Node)
Executes the tasks (Map and Reduce) assigned by the JobTracker.
Reports the task status (success/failure) back to the JobTracker.
4. Map Tasks
Process raw input data.
Transform it into intermediate key-value pairs.
Example: Counting words → ("apple", 1), ("banana", 1).
5. Reduce Tasks
Collect all key-value pairs from the Map tasks.
Aggregate them to give the final result.
Example: Combine counts → ("apple", 5), ("banana", 3).
10.3 MapReduce Example
Objective
Count the number of occurrences of each word in a large collection of text files
using Hadoop MapReduce.
Key Components
Driver Class
Sets up and configures the job (specifies mapper, reducer, input/output paths, etc.)
Example: WordCounter class
Mapper Class
Reads lines of text, splits them into words, and outputs each word as a key with
value 1.
Example: WordCounterMap class
Reducer Class
Sums up the counts for each word key to get the total occurrences.
Example: (Reducer code not shown here, but typically combines values for keys)
What this driver does:
Creates a Hadoop job.
Attaches the Mapper (WordCounterMap) and Reducer (WordCounterRed).
Sets input file and output folder.
Submits job to Hadoop framework for execution.
1) This is a Hadoop MapReduce program to count the number of times each
word appears in a text file.
WordCounter.java: DRIVER PROGRAM
package wordcount; // Defines the Java package (package name illustrative).
The program imports Hadoop classes:
import java.io.IOException; // Needed for input/output error handling.
For example, if Hadoop fails to read/write files, IOException is thrown.
import org.apache.hadoop.fs.Path; // Hadoop uses its own special class → Path.
In Hadoop, we don’t use normal Java File.
Instead, we use Hadoop’s Path class to point to HDFS files.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
Hadoop doesn’t use normal Java types (int, String) because they are not
serializable by default.
IntWritable = Hadoop’s version of int
Text = Hadoop’s version of String
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
Job is used to set everything about the MapReduce program (mapper, reducer, input/output,
etc.).
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
Base classes for writing Mapper and Reducer.
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
These are ready-made input/output classes.
FileInputFormat → Reads files from HDFS as input.
TextInputFormat → Treats each line of text as one record.
FileOutputFormat → Writes job’s result to HDFS.
TextOutputFormat → Writes output as plain text.
public class WordCounter {
Defines the main driver class of the program.
public static void main (String [] args) throws Exception {
•Entry point of the program (when you run it).
•throws Exception means: if something goes wrong (I/O, Hadoop error), program exits.
Job job = new Job();
•Creates a job configuration object.
•Hadoop uses this to know what to run.
[Link]("wordcounter");
Helpful for logs and debugging.
When you run Hadoop jobs, you’ll see the job name in tracking UI.
job.setJarByClass(WordCounter.class);
Tells Hadoop which JAR file contains your program.
It uses this class (WordCounter) to locate the JAR.
job.setMapperClass(WordCounterMap.class);
job.setReducerClass(WordCounterRed.class);
•Tells Hadoop which Mapper and Reducer to use.
•These are user-defined classes (you’ll write them separately).
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
•Defines the output data type from the Reducer.
•Output will be → (word, count)
•word → Text
•count → IntWritable
FileInputFormat.addInputPath(job, new Path("/sample/test.txt"));
FileOutputFormat.setOutputPath(job, new Path("/sample/wordcount"));
Tells Hadoop where to read input from (/sample/test.txt)
And where to write results (/sample/wordcount)
Both are HDFS paths (not local file system).
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
job.waitForCompletion(true) → Submits the job to Hadoop and waits until it finishes.
If job is successful → exit with code 0.
If job fails → exit with code 1.
What Happens When You Run It
Input file (test.txt) contains some text.
Mapper splits lines into words → emits (word, 1).
Hadoop shuffles/sorts words.
Reducer adds up counts → emits (word, total_count).
Output is saved in /sample/wordcount/part-r-00000.
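Putting the fragments above together, a consolidated version of the driver might look like this; the package name and the /sample/test.txt input path are illustrative, as in the walkthrough.

package wordcount; // package name illustrative

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCounter {
    public static void main(String[] args) throws Exception {
        Job job = new Job();                        // job configuration object
        job.setJobName("wordcounter");              // name shown in the tracking UI
        job.setJarByClass(WordCounter.class);       // locate the JAR containing this class

        job.setMapperClass(WordCounterMap.class);   // user-defined Mapper
        job.setReducerClass(WordCounterRed.class);  // user-defined Reducer

        job.setOutputKeyClass(Text.class);          // output key: the word
        job.setOutputValueClass(IntWritable.class); // output value: the count

        FileInputFormat.addInputPath(job, new Path("/sample/test.txt"));    // HDFS input (illustrative)
        FileOutputFormat.setOutputPath(job, new Path("/sample/wordcount")); // HDFS output folder

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}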
2) WordCounterMap.java: Map Class of the MapReduce program for Word Count.
package wordcount; // package name illustrative
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCounterMap extends Mapper<LongWritable, Text, Text,
IntWritable> {
Defines the Mapper class.
<LongWritable, Text, Text, IntWritable> means:
Input key → LongWritable (position of line in file).
Input value → Text (the actual line of text).
Output key → Text (the word).
Output value → IntWritable (the count = 1).
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
•This is the map method that Hadoop calls automatically for each line of the input.
•Parameters:
•key → line offset in file (not usually used in word count).
•value → the actual text line.
•context → used to emit output (key-value pairs).
•throws IOException, InterruptedException → handles errors that may occur during
processing.
String[] words = value.toString().split(",");
•value.toString() → Convert the Hadoop Text object into a normal Java String.
•.split(",") → Split the line into words, using comma (,) as a separator.
So if a line is: "apple,banana,apple" → it becomes an array ["apple", "banana",
"apple"].
for (String word : words) {
Loop through each word in the array.
Example: For ["apple", "banana", "apple"], it loops 3 times.
context.write(new Text(word), new IntWritable(1));
}
}
}
Emits the key-value pair (word, 1) for each word found.
Example:
"apple" → (apple, 1)
"banana" → (banana, 1)
"apple" → (apple, 1)
These outputs are sent to Hadoop’s Shuffle & Sort phase before going to the Reducer.
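For reference, here are the mapper fragments assembled into one class (package name illustrative):

package wordcount; // package name illustrative

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCounterMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the comma-separated line into words and emit (word, 1) for each one.
        String[] words = value.toString().split(",");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}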
3) WordCounterRed.java: Reduce Class
package wordcount; // package name illustrative
Defines that this class belongs to the package wordcount.
Packages are like folders in Java used to organize classes.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCounterRed extends Reducer<Text, IntWritable, Text, IntWritable>
{
Defines the Reducer class.
<Text, IntWritable, Text, IntWritable> means:
Input key → Text (the word).
Input values → IntWritable (list of counts, usually many 1s).
Output key → Text (the word).
Output value → IntWritable (the final count).
@Override
protected void reduce(Text word, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
This is the reduce method that Hadoop calls automatically for each unique key
(word).
Parameters:
word → one unique word (e.g., "apple").
values → all the counts (e.g., [1, 1, 1, 1]).
context → used to emit final results (word, total_count).
Integer count = 0;
Initialize a local counter variable.
This will be used to sum up all the counts for the current word.
for (IntWritable val : values) {
count += val.get();
}
Loops through the list of counts (values).
val.get() converts Hadoop’s IntWritable into a normal int.
Adds each count to count.
Example: for "apple" with values [1,1,1], after loop → count = 3.
context.write(word, new IntWritable(count));
}
}
Emits the final result for that word.
Example: (apple, 3)
This gets written to the output file in HDFS.
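And the reducer fragments assembled into one class (package name illustrative):

package wordcount; // package name illustrative

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCounterRed extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted for this word by the mappers.
        int count = 0;
        for (IntWritable val : values) {
            count += val.get();
        }
        context.write(word, new IntWritable(count)); // e.g., (apple, 3)
    }
}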
SQL versus MapReduce:
Access: SQL supports interactive and batch access; MapReduce supports batch only.
Structure: SQL requires a static schema; MapReduce works with a dynamic schema.
Updates: SQL data is read and written many times; MapReduce data is written once and read many times.
Integrity: SQL offers high integrity; MapReduce offers low integrity.
Scalability: SQL scales nonlinearly; MapReduce scales linearly.
11. MANAGING RESOURCES AND APPLICATIONS WITH HADOOP
YARN (YET ANOTHER RESOURCE NEGOTIATOR)
What is Hadoop YARN?
YARN is a core component introduced in Hadoop 2.x.
Stands for Yet Another Resource Negotiator.
It’s responsible for cluster resource management and enables running
multiple types of applications (not just MapReduce) on Hadoop.
YARN allows Hadoop to support various processing models: batch, interactive,
streaming, graph, and more.
11.1 Limitations of Hadoop 1.0 Architecture
Single NameNode Bottleneck: In Hadoop 1.0, only one NameNode manages
the entire file system's metadata for the Hadoop cluster. If it becomes
overloaded or fails, it can disrupt the whole cluster.
Restricted Processing Model: Hadoop 1.0 is built mainly for batch-oriented
processing using MapReduce. It doesn’t support interactive, online, or
streaming workloads.
Not Suitable for Interactive or Advanced Analysis: The MapReduce model
is poorly suited for real-time, interactive analytics, or in-memory machine
learning.
Resource Utilization Issues: Resource slots are rigidly divided into "map"
and "reduce" slots. Sometimes, map slots are full while reduce slots sit idle,
and vice versa, resulting in inefficient resource use.
MapReduce Constrained: MapReduce is responsible for both resource
management and data processing, leading to limited flexibility and scalability.
11.3 Hadoop 2: HDFS
Components:
(a) Namespace: Handles files/directories, manages creation, modification, and
deletion.
(b) Blocks Storage Service: Handles data blocks, replication, and storage
management.
HDFS 2 Features:
Horizontal Scalability: Uses federation of multiple independent NameNodes,
allowing more clusters to scale out and store bigger datasets.
High Availability: Passive Standby NameNode enables a backup NameNode
that can take over automatically if the active one fails, preventing single points
of failure.
How It Works:
Multiple NameNodes, each with its own namespace, share
the DataNodes for block storage—enabling parallel
management and failover.
All namespace changes are saved to shared storage. If the
Active NameNode fails, the Passive Standby is updated and
quickly takes over.
11.4 Hadoop 2 YARN: Taking Hadoop Beyond Batch
YARN lets us store all data in one place and interact with it in multiple ways to
get predictable performance and quality of service. YARN was originally architected
at Yahoo.
Active NameNode: This is the primary NameNode that handles all client requests
(read/write operations) in HDFS.
Passive NameNode (Standby): This is a backup NameNode that remains in
standby mode. It does not serve requests directly but is kept updated.
Shared Edit Logs: Both Active and Passive NameNodes share the edit log storage.
The Active NameNode writes changes (metadata updates) into the shared log.
The Passive NameNode continuously reads these logs to stay synchronized with
the Active NameNode.
Purpose: This setup ensures high availability (HA). If the Active NameNode fails,
the Passive can quickly take over without major downtime.
Figure 5.27: Hadoop 1.x vs Hadoop 2.x
This figure shows the architectural difference between Hadoop
versions.
Hadoop 1.x
Consists of two main components:
1. MapReduce: Handles both cluster resource management and data
processing.
2. HDFS (Hadoop Distributed File System): Provides redundant and
reliable storage.
Limitation: MapReduce was tightly coupled with resource
management → it could only run MapReduce jobs, no support for
other processing models.
Hadoop 2.x
Enhanced architecture by separating cluster management from data
processing:
1. YARN (Yet Another Resource Negotiator): A new resource manager
that handles cluster management and job scheduling.
2. MapReduce: Still present but now only for data processing.
3. Others: Hadoop 2.x supports other data processing engines like
Spark, Tez, etc., not just MapReduce.
4. HDFS: Continues to provide redundant and reliable storage.
Advantages of Hadoop 2.x:
Supports multiple frameworks beyond MapReduce (e.g., Spark,
Storm).
Better resource utilization through YARN.
Scalability and flexibility improved.
This diagram is about Hadoop YARN (Yet Another Resource Negotiator).
It shows how YARN manages different kinds of applications in a Hadoop cluster.
1. Applications Supported
At the top, different kinds of applications are shown that can run on Hadoop with the
help of YARN:
Batch (MR) → Traditional MapReduce batch jobs.
Interactive (TEZ) → Framework for fast interactive query processing (used in Hive,
Pig).
Online (HBASE) → Real-time, online database processing with HBase.
Streaming (Storm) → Stream data processing in real time.
In-Memory (Spark) → In-memory analytics for very fast data computation.
Others (Graph, Search) → Specialized applications like graph processing
(e.g., Giraph) or search engines.
YARN Architecture Diagram:
Components:
Client – Submits jobs to YARN.
ResourceManager (RM) – The master that manages resources
across the cluster.
NodeManager (NM) – Runs on each node and manages
containers on that node.
ApplicationMaster (AppMstr) – Created per application;
negotiates resources from RM and works with NodeManagers.
Container – A resource allocation (CPU, memory, etc.) where
tasks run.
Flow of Execution (Steps in Diagram):
1. Job Submission –
The client submits the application/job to the ResourceManager.
2. Start ApplicationMaster –
The RM contacts a NodeManager to start an ApplicationMaster
for that job in a container.
3. Resource Request –
The ApplicationMaster communicates with the
ResourceManager, asking for resources (containers) to run tasks.
4. Resource Allocation –
The ResourceManager allocates containers across various
NodeManagers.
5. Launching Containers –
The ApplicationMaster communicates with the NodeManagers to
launch containers.
6. Task Execution –
Containers run the actual tasks (map, reduce, or other processing
logic).
7. Client Communication –
The client communicates with the ResourceManager or directly
with the ApplicationMaster for status updates.
[Link] & Completion –
The ApplicationMaster monitors tasks, reports progress to the
ResourceManager, and releases resources once the job completes.
Example with WordCount (MapReduce job)
You submit a WordCount job.
YARN creates an ApplicationMaster just for that job.
That AppMaster:
Asks RM → “I need containers for mappers and reducers.”
RM allocates them on available nodes.
AppMaster tells NodeManagers on those nodes → “Run mapper here, run
reducer there.”
Collects progress, retries failures if needed.
When the job is done → AppMaster shuts down.
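A hedged sketch of how a client driver is pointed at YARN: the mapreduce.framework.name property selects YARN as the execution framework, and the ResourceManager address shown is illustrative (in practice both usually come from the cluster's configuration files). It reuses the WordCounterMap/WordCounterRed classes sketched earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import wordcount.WordCounterMap;
import wordcount.WordCounterRed;

public class SubmitOnYarn {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run MapReduce on YARN instead of the Hadoop 1.x JobTracker framework.
        conf.set("mapreduce.framework.name", "yarn");
        // ResourceManager address (host:port illustrative; normally read from yarn-site.xml).
        conf.set("yarn.resourcemanager.address", "resourcemanager.example.com:8032");

        Job job = Job.getInstance(conf, "wordcount-on-yarn");
        job.setJarByClass(SubmitOnYarn.class);
        job.setMapperClass(WordCounterMap.class);
        job.setReducerClass(WordCounterRed.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/sample/test.txt"));         // illustrative
        FileOutputFormat.setOutputPath(job, new Path("/sample/wordcount-yarn")); // illustrative

        // Submission starts an ApplicationMaster for this job (step 2 above), which then
        // negotiates containers for the map and reduce tasks with the ResourceManager.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}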