BIG Data_Unit_2

The document provides an overview of Hadoop, including its history, architecture, and key components such as HDFS and Map-Reduce. It discusses the evolution of Hadoop from its origins inspired by Google's research, its scalability, fault tolerance, and the rich ecosystem of tools surrounding it. Additionally, it covers data formats used in Hadoop, the functionality of Hadoop Streaming and Pipes, and the Map-Reduce framework's role in processing large datasets.


United Institute of Management

Department of Computer Application


Subject Name: Big Data Subject Code: KCA022(Elective-2)
UNIT 2

Hadoop: History of Hadoop, Apache Hadoop, the Hadoop Distributed File System, components of Hadoop, data format, analysing data with Hadoop, scaling out, Hadoop Streaming, Hadoop Pipes, Hadoop Ecosystem.
Map-Reduce: Map-Reduce framework and basics, how Map-Reduce works, developing a Map-Reduce application, unit tests with MRUnit, test data and local tests, anatomy of a Map-Reduce job run, failures, job scheduling, shuffle and sort, task execution, Map-Reduce types, input formats, output formats, Map-Reduce features, real-world Map-Reduce.

History of Hadoop
Early Origins: The origins of Hadoop can be traced back to the early 2000s, when Doug Cutting
and Mike Cafarella, two software engineers, were working on an open-source web search
engine project called Nutch. They were confronted with the challenge of processing and
indexing vast amounts of web data efficiently.

Google's Influence: Around the same time, Google had published two groundbreaking
papers – one on the Google File System (GFS) and the other on the MapReduce
programming model. These papers presented a revolutionary approach to handling and
processing large datasets in a distributed computing environment.

Hadoop's Birth: Inspired by Google's research papers, Cutting and Cafarella recognized the potential of these concepts and began working on an open-source implementation within Nutch. In 2006, this work was spun out as a separate project named Hadoop, after Cutting's son's toy elephant. Hadoop aimed to solve the challenges of distributed storage and processing of enormous volumes of data.

Becoming an Apache Project: Hadoop quickly gained attention and began to evolve as a community-driven project. It started as an Apache subproject in 2006 and became a top-level project under the Apache Software Foundation in 2008. This step was pivotal in fostering a collaborative environment for development and innovation.

Key Components: Hadoop's core components included the Hadoop Distributed File System
(HDFS) for distributed storage and the Map-Reduce programming model for distributed data
processing. These components formed the backbone of Hadoop's architecture.

Rapid Growth: Hadoop's adaptability and scalability made it a popular choice for
organizations dealing with big data challenges. Companies like Yahoo, Facebook, and
LinkedIn adopted Hadoop to manage and analyze massive datasets. The Hadoop ecosystem
continued to expand with the addition of various projects and tools.
Contributions from the Community: Hadoop's growth was driven by contributions from a
vibrant open-source community. It became a powerful platform for various data-related tasks,
including data storage, batch processing, and real-time analytics.

Ongoing Development: Hadoop has continued to evolve with each new release. Over time,
it introduced features like YARN (Yet Another Resource Negotiator) to better support various
processing engines, not just Map-Reduce.



Apache Hadoop

Apache Hadoop is an open-source, Java-based framework that has revolutionized the way
large datasets are stored, processed, and analyzed. It was created to address the challenges of
managing vast volumes of data in a distributed computing environment. Here's an overview
of Apache Hadoop:

Distributed Data Processing: Hadoop is designed to distribute and process data across
clusters of commodity hardware. This distributed approach allows it to handle large datasets
that cannot fit on a single machine.

Scalability: Hadoop offers a scalable architecture. As data volumes grow, you can simply add
more machines to the cluster to handle the increased load. This horizontal scaling makes it
suitable for big data applications.

Fault Tolerance: Hadoop is inherently fault-tolerant. It can recover from hardware failures
because it stores multiple copies of data across the cluster. If one machine fails, the data is
still accessible from other nodes.

Hadoop Distributed File System (HDFS): HDFS is Hadoop's primary storage system. It
divides large files into smaller blocks and distributes these blocks across the cluster. HDFS
provides high throughput and data redundancy.

Map-Reduce: Map-Reduce is a programming model for processing and generating large datasets, introduced by Google and popularized by Hadoop's open-source implementation. It divides a task into smaller sub-tasks and processes them in parallel across the cluster. This approach is well-suited for batch processing tasks like data analysis.

YARN (Yet Another Resource Negotiator): YARN is a resource management layer that
separates the job scheduling and cluster resource management functions, allowing Hadoop to
support various data processing engines beyond Map-Reduce.

Hadoop Ecosystem: Hadoop has a rich ecosystem of related projects and tools. These
include Apache Pig, Apache Hive, Apache HBase, and Apache Spark, among others. Each
tool is designed to address specific big data challenges, such as data processing, querying,
and real-time analytics.

Open Source Community: Hadoop is maintained and developed by an active open-source community. This collaborative environment has led to continuous improvements and innovations.

Use Cases: Hadoop is used in a wide range of applications, including data warehousing, log
processing, recommendation systems, fraud detection, and more. It is particularly valuable
for organizations dealing with large and diverse datasets.

Commercial Distributions: Several vendors offer commercial distributions of Hadoop, providing additional features, support, and management tools. Cloudera, Hortonworks, and MapR are some of the well-known vendors in this space.



Challenges: While Hadoop offers numerous advantages, it also presents challenges,
including complexity, the need for skilled administrators and developers, and the need to
design and tune clusters for optimal performance.

The Hadoop Distributed File System (HDFS)


The Hadoop Distributed File System (HDFS) is a pivotal component of the Apache Hadoop
framework, designed for the storage and management of large-scale data. It's a distributed file
system that allows Hadoop to efficiently store and access enormous volumes of data. Here's
an overview of HDFS:

Distributed Storage: HDFS divides large files into smaller blocks (typically 128 MB or 256
MB in size) and distributes these blocks across a cluster of commodity hardware. This
distributed storage approach enables Hadoop to handle datasets that are too large to fit on a
single machine.

Redundancy and Fault Tolerance: HDFS is built for fault tolerance. It maintains multiple
copies of each data block across different nodes in the cluster. If a node or block becomes
unavailable due to hardware failure, HDFS can still retrieve the data from other copies,
ensuring data availability and integrity.

Master-Slave Architecture: HDFS follows a master-slave architecture. It comprises two primary types of nodes:

NameNode: This is the master node that manages the file system's namespace and metadata.
It keeps track of the structure of the file system and where data blocks are located.
DataNode: These are the slave nodes responsible for storing and managing the actual data
blocks. They report to the NameNode, informing it about the health and availability of data
blocks.
Data Replication: HDFS replicates data blocks to ensure fault tolerance. Typically, it creates
three copies of each block by default. This replication factor can be configured to meet
specific requirements, balancing fault tolerance and storage capacity.

High Throughput: HDFS is optimized for high throughput rather than low-latency access. It
is well-suited for batch processing tasks like data analysis and large-scale data storage.

Write-Once, Read-Many Model: HDFS follows a write-once, read-many model, making it ideal for data that is primarily written once and then read multiple times. This model simplifies data consistency and replication.

Consistency and Coherency: HDFS keeps all replicas of a block consistent and provides a simple coherency model suited to its write-once, read-many design: data written to a file becomes visible to readers once the file is closed (or the output stream is explicitly flushed).



Scalability: HDFS is highly scalable. As the data volume grows, you can scale out by adding
more DataNodes to the cluster, accommodating larger datasets and higher processing loads.

Support for Streaming Data: HDFS is optimized for streaming access patterns, which are
common in big data processing. It is less suitable for random reads and writes.
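
To make the storage model concrete, the following minimal sketch uses the Java FileSystem API to write and re-read a small file and to change its replication factor. The cluster URI and the path are placeholders for illustration; in a real deployment the file system address normally comes from core-site.xml on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS normally comes from core-site.xml; set here only for illustration.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        // Write a small file; HDFS splits it into blocks and replicates them.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back through the same API.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }

        // The replication factor of an existing file can also be changed explicitly.
        fs.setReplication(file, (short) 3);
    }
}
```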

Components of Hadoop
Apache Hadoop is composed of various components that work together to enable distributed
data storage and processing. These components form the core of the Hadoop ecosystem.
Here's an overview of the key components:

Hadoop Distributed File System (HDFS): HDFS is the primary storage component of
Hadoop. It divides large files into smaller blocks and stores multiple copies of these blocks
across the cluster. HDFS is designed for fault tolerance and high throughput data access.

MapReduce: MapReduce is a programming model and processing engine that allows for
distributed data processing. It breaks down tasks into smaller sub-tasks and processes them in
parallel across the cluster. While MapReduce is an essential component, newer processing
engines like Apache Spark have become popular alternatives.

Yet Another Resource Negotiator (YARN): YARN is the resource management layer of
Hadoop. It separates the job scheduling and cluster resource management functions, allowing
Hadoop to support various data processing engines beyond MapReduce. YARN enhances the
flexibility and efficiency of cluster resource utilization.

Hadoop Common: Hadoop Common includes the shared utilities and libraries used by other
Hadoop modules. It provides a set of common functions, configurations, and tools that
facilitate the operation of Hadoop clusters.

Hadoop Distributed Copy (DistCp): DistCp is a tool used for efficiently copying large
amounts of data within or between Hadoop clusters. It is particularly useful for migrating
data or creating backups.

Hadoop Archive (HAR): HAR is a file archiving tool used to bundle and compress multiple
small files into a single Hadoop archive file. This helps in reducing the number of files and
improving the efficiency of storage and processing.

Ambari: Apache Ambari is a management and monitoring tool for Hadoop clusters. It
provides a web-based interface for cluster provisioning, managing, and monitoring,
simplifying cluster administration tasks.

Hadoop High Availability (HA): Hadoop HA is a feature that ensures continuous availability of HDFS by providing redundant NameNodes. In the event of a NameNode failure, the standby NameNode can take over, minimizing downtime.

Other Ecosystem Projects: In addition to these core components, Hadoop has a rich ecosystem of related projects and tools. Some notable projects include Apache Hive for data warehousing, Apache Pig for data processing, Apache HBase for NoSQL database capabilities, and Apache Spark for in-memory data processing, among others.

Data Format
Data format refers to the structure and organization of data, which determines how it can be
stored, processed, and interpreted by computer systems. In the context of big data and
Hadoop, data format is crucial for efficient storage and analysis. Here are some common data
formats used in this domain:

Structured Data: Structured data is highly organized and follows a specific schema, often in
tabular form. Examples include data stored in relational databases, spreadsheets, or CSV
(Comma-Separated Values) files. Structured data is easy to query and analyze.

Unstructured Data: Unstructured data lacks a specific format and doesn't fit neatly into
tables or rows. It includes text documents, images, videos, and social media posts. Analyzing
unstructured data often requires natural language processing and machine learning
techniques.

Semi-Structured Data: Semi-structured data falls between structured and unstructured data.
It has some level of organization, typically using tags, labels, or hierarchies. Examples
include JSON (JavaScript Object Notation), XML (eXtensible Markup Language), and log
files.



Binary Data: Binary data represents information in a format that isn't human-readable. It's
often used for encoding images, audio, or any data that doesn't need to be directly interpreted
by humans. In Hadoop, binary formats can be efficient for certain data types.

Avro: Apache Avro is a popular data serialization system. It provides a compact, efficient,
and self-describing data format. Avro is often used in Hadoop for data storage and exchange.

Parquet: Apache Parquet is a columnar storage format that is optimized for big data processing. It stores data in a highly compressed, columnar layout, making it efficient for analytical queries.

ORC (Optimized Row Columnar): ORC is a file format for Hadoop that provides a
lightweight, efficient, and self-describing structure for data storage. It's commonly used in
conjunction with Hadoop and Hive for data warehousing.

SequenceFile: SequenceFile is a binary file format used in Hadoop that can store data in key-
value pairs. It's optimized for large data volumes and is often used for intermediate data
storage in Map-Reduce jobs.
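
As an illustration, a sequence file of key-value pairs can be written with the SequenceFile.Writer API. The output path and the sample records below are purely illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/sample.seq");   // hypothetical output path

        // Write a few Text/IntWritable pairs into a sequence file.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("hadoop"), new IntWritable(1));
            writer.append(new Text("mapreduce"), new IntWritable(2));
        }
    }
}
```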

Thrift and Protocol Buffers: Thrift and Protocol Buffers are binary serialization formats
used for efficient data exchange between systems. They provide a compact and efficient way
to serialize structured data.

Delimited Text: Delimited text formats, such as CSV, TSV (Tab-Separated Values), and other
custom delimiters, are widely used for structured data storage. They are easy to work with
and can be processed using Hadoop tools.

Hadoop streaming

Hadoop Streaming uses UNIX standard streams as the interface between Hadoop and your program, so you can write MapReduce programs in any language that can read from standard input and write to standard output. Hadoop offers several mechanisms to support non-Java development; the primary ones are Hadoop Pipes, which provides a native C++ interface to Hadoop, and Hadoop Streaming, which permits any program that uses standard input and output to be used for map tasks and reduce tasks.
With this utility, one can create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.
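
Streaming mappers and reducers are usually short scripts, but any executable that follows the stdin/stdout contract can be used. The sketch below, written in Java only to stay consistent with the rest of these notes, reads lines from standard input and emits tab-separated key-value pairs on standard output, which is what Hadoop Streaming expects from a mapper; such a program would be passed to the streaming jar via the -mapper option, together with the mandatory -input and -output options described below.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// A word-count mapper usable with Hadoop Streaming: it reads raw lines from
// stdin and writes "word<TAB>1" records to stdout.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");   // key <TAB> value
                }
            }
        }
    }
}
```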



Features of Hadoop Streaming
Some of the key features associated with Hadoop Streaming are as follows:

• Hadoop Streaming ships as part of the standard Hadoop distribution.
• It makes it easy to write Map-Reduce programs in languages other than Java.
• Hadoop Streaming supports almost all common programming languages, such as Python, C++, Ruby, Perl, etc.
• The Hadoop Streaming framework itself runs on Java, but the mapper and reducer code can be written in any of the languages mentioned above.
• The Hadoop Streaming process uses Unix streams that act as the interface between Hadoop and the Map-Reduce programs.
• Hadoop Streaming accepts various streaming command options; the two mandatory ones are -input (input directory or file name) and -output (output directory name).

[Figure: Hadoop Streaming architecture]

Hadoop Pipes
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Hadoop Streaming, which uses standard I/O to communicate with the map and reduce code, Pipes uses sockets as the channel over which the task tracker communicates with the process running the C++ map or reduce function. JNI is not used.



Map-Reduce framework and basics

MapReduce is the processing engine of Hadoop that processes and computes large
volumes of data. It is one of the most common engines used by Data Engineers to
process Big Data. It allows businesses and other organizations to run calculations
to:

• Determine the price for their products that yields the highest profits

• Know precisely how effective their advertising is and where they should
spend their ad dollars

• Make weather predictions

• Mine web clicks, sales records purchased from retailers, and Twitter
trending topics to determine what new products the company should
produce in the upcoming season

Before MapReduce, these calculations were complicated. Now, programmers can tackle problems like these with relative ease. Data scientists have coded complex algorithms into frameworks so that programmers can use them.

Companies no longer need an entire department of Ph.D. scientists to model data, nor do they need a supercomputer to process large sets of data, as MapReduce runs across a network of low-cost commodity machines.

There are two phases in the MapReduce programming model:

1. Mapping

2. Reducing

The following sections discuss both of these phases.

Mapping and Reducing

A mapper class handles the mapping phase; it maps the data present in different
datanodes. A reducer class handles the reducing phase; it aggregates and reduces
the output of different datanodes to generate the final output.



Data stored on multiple machines passes through the mapping phase. The final output is obtained after the data is shuffled, sorted, and reduced.

Input Data

Hadoop accepts data in various formats and stores it in HDFS. This input data is
worked upon by multiple map tasks.

Map Tasks

Map reads the data, processes it, and generates key-value pairs. The number of map
tasks depends upon the input file and its format.

Typically, a file in a Hadoop cluster is broken down into blocks, each with a default
size of 128 MB. Depending upon the size, the input file is split into multiple chunks. A
map task then runs for each chunk. The mapper class has mapper functions that
decide what operation is to be performed on each chunk.



Reduce Tasks

In the reducing phase, a reducer class performs operations on the data generated
from the map tasks through a reducer function. It shuffles, sorts, and aggregates the
intermediate key-value pairs (tuples) into a set of smaller tuples.

Output

The smaller set of tuples is the final output and gets stored in HDFS.
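
To make the two phases concrete, here is a minimal word-count sketch using the standard org.apache.hadoop.mapreduce API; the class names are illustrative, and the two classes would normally live in separate source files.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word of every input line.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: sum the counts for each word after shuffle and sort.
// (Shown in the same listing for brevity; normally each class has its own file.)
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```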

Let us now look at the MapReduce workflow.

MapReduce Workflow

The MapReduce workflow proceeds as follows:

• The input data that needs to be processed using MapReduce is stored in HDFS. The processing can be done on a single file or a directory that has multiple files.



• The input format defines the input specification and how the input files
would be split and read.

• The input split logically represents the data to be processed by an individual mapper.

• RecordReader communicates with the input split and converts the data
into key-value pairs suitable to be read by the mapper.

• The mapper works on the key-value pairs and produces an intermediate output, which goes for further processing.

• The combiner is a mini reducer that performs local aggregation on the key-value pairs generated by the mapper.

• The partitioner decides how outputs from combiners are sent to the reducers (a sketch of a custom partitioner follows this list).

• The output of the partitioner is shuffled and sorted. This output is fed as
input to the reducer.

• The reducer receives each intermediate key together with the list of all its values and combines them into the final output key-value pairs.

• The RecordWriter writes these output key-value pairs from the reducer to the output files.

• The output data gets stored in HDFS.
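
The partitioning step above can be customized by supplying a Partitioner implementation. The sketch below simply reproduces what the default hash partitioner already does for the word-count types used in these notes; it is shown only to illustrate the hook.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reducer receives each intermediate (word, count) pair.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```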

The next section covers the architecture of MapReduce.

MapReduce Architecture

The MapReduce architecture works as follows:



There is a client program or an API which intends to process the data. It submits the
job to the job tracker (resource manager in the case of Hadoop YARN framework).

Hadoop v1 had a job tracker as master, leading the Task Trackers. In Hadoop v2:

• Job tracker was replaced with ResourceManager

• Task tracker was replaced with NodeManager

The ResourceManager assigns the job to the NodeManagers, which then handle the processing on every node. Once an application is submitted to run on the YARN processing framework, it is handled by the ResourceManager.

The data stored in HDFS is broken down into one or more splits, depending on the input format. One or more map tasks, each running within a container on a node, work on these input splits.

Each map task uses a certain amount of RAM, and the data that then goes through the reducing phase also uses RAM and CPU cores. Internally, the framework takes care of deciding the number of reducers, performing the mini (combiner) reduce, and reading and processing the data from multiple DataNodes.

This is how the MapReduce programming model makes parallel processing work.
Finally, the output is generated and gets stored in HDFS.
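
Tying the pieces together, a driver program configures the job and submits it to the cluster. The sketch below assumes the illustrative WordCountMapper, WordCountReducer, and WordPartitioner classes from earlier in these notes, and takes the HDFS input and output paths from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper, combiner (mini reducer), partitioner and reducer from the sketches above.
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setPartitionerClass(WordPartitioner.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output (must not exist)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```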



1. MRUnit for Unit Testing:

MRUnit is a Java library specifically designed for unit testing Hadoop MapReduce jobs.
It allows you to write unit tests for your Mapper, Reducer, and Driver classes without the
need to run them on an actual Hadoop cluster.
MRUnit provides a set of APIs for creating test cases, setting input data, and verifying the
output, making it an essential tool for validating your MapReduce code.
2. Creating Test Data:

To perform unit tests, you'll need sample input data that mimics the data your MapReduce job
will process.
You can create test data as input for your Mapper and Reducer by using text files, key-value
pairs, or other relevant data structures.
Ensure that your test data covers various scenarios and edge cases that your MapReduce job
should handle.
3. Writing Unit Tests:

Develop JUnit test cases for your Mapper, Reducer, and Driver classes using MRUnit.
Set up the test environment, including configuring input data and expected output.
Invoke your Mapper and Reducer classes as part of the test case, providing the test data as
input.
Assert that the actual output matches the expected output using MRUnit assertions.
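
A minimal MRUnit test might look like the sketch below, assuming the illustrative WordCountMapper and WordCountReducer classes from earlier in these notes are available on the test classpath (in the same package).

```java
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountTest {

    @Test
    public void mapperEmitsOnePerWord() throws Exception {
        // Feed one line to the mapper and assert the expected (word, 1) pairs.
        MapDriver.newMapDriver(new WordCountMapper())
                .withInput(new LongWritable(0), new Text("big data big"))
                .withOutput(new Text("big"), new IntWritable(1))
                .withOutput(new Text("data"), new IntWritable(1))
                .withOutput(new Text("big"), new IntWritable(1))
                .runTest();
    }

    @Test
    public void reducerSumsCounts() throws Exception {
        // Feed grouped values to the reducer and assert the aggregated count.
        ReduceDriver.newReduceDriver(new WordCountReducer())
                .withInput(new Text("big"),
                        Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("big"), new IntWritable(2))
                .runTest();
    }
}
```
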
4. Running Local Tests:

Local tests using MRUnit are executed within your development environment (e.g., your
local development machine) and do not require a Hadoop cluster.
Running local tests is faster and more efficient for iterative development and debugging.
Local tests help you identify and resolve issues with your MapReduce code before deploying
it to a Hadoop cluster, reducing the risk of job failures and data processing errors in a
production environment.
5. Benefits of Local Tests:

Local tests allow you to validate the correctness of your MapReduce logic in a controlled
environment.
They help you catch and fix errors early in the development cycle, which is essential for
maintaining the reliability and efficiency of your MapReduce jobs.
Local tests are an integral part of the test-driven development (TDD) approach, where you
write tests before implementing the code.
6. Continuous Integration:

To ensure that unit tests are executed automatically and consistently as part of your
development workflow, consider integrating them into a continuous integration (CI) pipeline.
CI tools like Jenkins, Travis CI, or CircleCI can be configured to run unit tests for your
MapReduce jobs whenever changes are committed to the code repository.
In summary, unit testing with MRUnit and local tests are essential practices for validating the
correctness and functionality of your Hadoop MapReduce jobs. By creating test data, writing
unit tests, and running them locally, you can identify and address issues early in the
development process, leading to more robust and reliable MapReduce applications.



Anatomy of Hadoop MapReduce Execution:

Once we submit a MapReduce job, the system goes through a series of life-cycle phases:

1. Job Submission Phase
2. Job Initialization Phase
3. Task Assignment Phase
4. Task Execution Phase
5. Progress Update Phase
6. Failure Recovery
To run an MR program, Hadoop uses the command 'yarn jar client.jar job-class HDFS-input HDFS-output-directory', where yarn is the utility and jar is the command; client.jar and the job class name are written by the developer. When we execute this on the terminal, YARN initiates a set of actions.

1. Job Submission Phase
The JobSubmitter invokes getNewApplicationId() on the Resource Manager to get a new application id for the job. It then validates the input and output paths given by the user, checking that the input exists and that the output directory does not already exist. The JobSubmitter contacts the NameNode to get the metadata and computes the input splits, i.e. how many blocks of data are needed. It also creates a shared directory inside the Resource Manager with the corresponding application id. The JobSubmitter then creates a job.xml file containing the details of the map-reduce job (job class name, input, output, InputFormat, OutputFormat, number of splits, mapper and reducer classes, jar file name, etc.) and copies job.xml to the shared directory. Ten copies of client.jar are created; the first copy is kept in the shared directory and the remaining nine copies are kept on configurable nodes in the network.

1. Generate a new application id.
2. Validate the user input.
3. Gather metadata from the NameNode.
4. Create the job.xml file.
5. Create ten copies of client.jar.
6. Create a shared directory (job.xml, client.jar).
7. Keep nine copies of client.jar on DataNodes.
2. Job Initialization
Once the job is submitted by the JobSubmitter, control moves from the edge node to the cluster side. The Resource Manager contains a component called the Application Manager, which maintains a queue of application ids submitted by clients; the default queue name is 'default'. The applications in the queue are scheduled by the YARN Scheduler. To execute the job, an Application Master is required, which the YARN Scheduler cannot create itself, so a request is sent to the Resource Manager to create the Application Master. The Resource Manager contacts the Node Managers (NMs) in the cluster to check the container-0 specification (minimum 2 GB, maximum 20 GB); any Node Manager that satisfies the container-0 specification responds to the RM with a positive signal. In newer generations of Hadoop, the NMs continuously send heartbeats to the RM along with their specifications, so the RM can select an NM without extra network overhead.

1. Add the application id to the default queue.
2. The Job Scheduler starts scheduling.
3. Check with the NMs for the container-0 specification.
4. An NM replies with its own specification.
5. Select and register the NM, and create the AppMaster for execution.
6. Pull the shared information and create the required mapper ids.
3. Task Assignment Phase
One of the advantages of the Hadoop framework is that the data splits are distributed across the network on multiple data nodes. Phase three therefore deals with deploying each mapper task on the data node that holds the corresponding input split, so that the burden of processing is spread across multiple nodes. The Application Master uses the data-locality concept: it contacts the RM to negotiate resources, submits the information about the data nodes that satisfy the data-locality criteria, and keeps this information. The AM then initiates YARN child creation on each specified data node, using the same application id as the task id, submits the mapper id to be executed, and creates a temporary local directory on that data node. With this, phase three (task assignment) comes to an end.

4. Task Execution Phase
At this point, control is transferred to the particular data nodes that host the YARN child. Each YARN child contacts the shared directory on the RM to get the related information and copies the files from the RM's shared directory (client.jar, job.xml, etc.) into the locally created temporary directory. This is the main theme of the anatomy of MapReduce in Hadoop: pushing the process to the data, thereby making data locality possible.

5. Progress Update
Once execution starts on the distributed data nodes, progress has to be sent back to the modules that initiated it. The mapper and reducer JVM execution environments send progress reports to the corresponding AM periodically (every 1 second). The AM accumulates progress from all MR tasks and communicates it to the client only when the progress changes, at most every 3 seconds. In turn, the client sends a completion-status request every 5 seconds. Once all the tasks are completed, the AM cleans up the temporary directory and sends the response with the results to the client. The intermediate output of the mappers and reducers is deleted only when all the tasks have received the completion response; otherwise, if a reducer fails, it still needs the output from the mapper programs.
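
The same progress information is also available programmatically. A driver that submits the job asynchronously can poll it roughly as follows; the polling interval is arbitrary, and the Job is assumed to be fully configured elsewhere (for example, by the word-count driver shown earlier).

```java
import org.apache.hadoop.mapreduce.Job;

public class ProgressPollingExample {
    // Submits a configured Job and prints map/reduce progress until completion.
    static boolean runWithProgress(Job job) throws Exception {
        job.submit();                       // non-blocking submission
        while (!job.isComplete()) {
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);             // poll every few seconds (arbitrary)
        }
        return job.isSuccessful();
    }
}
```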

6. Failure Recovery
Hadoop keeps a trace of user and system operations using the FSImage and edit logs in the NameNode. The Secondary NameNode's checkpoint backup mechanism provides a backup technique for Hadoop in case the NameNode goes down.



Map Reduce types

MapReduce is a programming model and processing technique designed for processing and
generating large datasets that can be parallelized across a distributed cluster of computers. It
consists of two main phases: the Map phase and the Reduce phase. While the basic
MapReduce model remains the same, there are different types of MapReduce programs based
on the nature of the tasks they perform. Here are a few types:

Batch Processing MapReduce:

Description: This is the most common type of MapReduce job. It involves processing a large
volume of data in a batch mode, where the entire dataset is processed and analyzed.

Real-time MapReduce:
Description: Unlike batch processing, real-time MapReduce processes data as it arrives,
providing results in near real-time.

Iterative MapReduce:
Description: Some algorithms, like machine learning algorithms, require multiple iterations
over the same dataset. Iterative MapReduce allows for the reuse of intermediate results
between iterations, reducing redundant computations.

Interactive MapReduce:
Description: In interactive MapReduce, the system allows users to query and interact with the
data in a more dynamic and ad-hoc manner compared to traditional batch processing.

Graph Processing using MapReduce:


Description: This type focuses on processing and analyzing graph-structured data efficiently,
such as finding connections and patterns in social networks.

Distributed MapReduce:
Description: This refers to MapReduce jobs that are executed on a distributed cluster of
machines to handle large-scale data processing tasks.

Machine Learning with MapReduce:


Description: MapReduce can be used to parallelize and distribute machine learning
algorithms, enabling the processing of large datasets in a scalable manner.
Use Cases: Training and processing large-scale machine learning models.



FileInputFormat in Hadoop
It is the base class for all file-based InputFormats. Hadoop FileInputFormat specifies input
directory where data files are located. When we start a Hadoop job, FileInputFormat is
provided with a path containing files to read. FileInputFormat will read all files and divides
these files into one or more InputSplits.

TextInputFormat
It is the default InputFormat of MapReduce. TextInputFormat treats each line of each input
file as a separate record and performs no parsing. This is useful for unformatted data or line-
based records like log files.

• Key – It is the byte offset of the beginning of the line within the file (not whole
file just one split), so it will be unique if combined with the file name.
• Value – It is the contents of the line, excluding line terminators.
KeyValueTextInputFormat
It is similar to TextInputFormat in that it also treats each line of input as a separate record. While TextInputFormat treats the entire line as the value, KeyValueTextInputFormat breaks the line itself into key and value at a tab character ('\t'). Here the key is everything up to the tab character, while the value is the remaining part of the line after the tab character.
SequenceFileInputFormat
Hadoop SequenceFileInputFormat is an InputFormat which reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs. Sequence files are block-compressed and provide direct serialization and deserialization of several arbitrary data types (not just text). Here both key and value are user-defined.
SequenceFileAsTextInputFormat
Hadoop SequenceFileAsTextInputFormat is another form of SequenceFileInputFormat which converts the sequence file's keys and values to Text objects. The conversion is performed by calling toString() on the keys and values. This InputFormat makes sequence files suitable input for Streaming.



SequenceFileAsBinaryInputFormat

Hadoop SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat with which we can extract the sequence file's keys and values as opaque binary objects.
NLineInputFormat
With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input, depending on the size of the split and the length of the lines. If we want each mapper to receive a fixed number of lines of input, we use NLineInputFormat. Like TextInputFormat, its keys are the byte offsets of the lines and its values are the contents of the lines.
DBInputFormat

Hadoop DBInputFormat is an InputFormat that reads data from a relational database using JDBC. As it doesn't have partitioning capabilities, we need to be careful not to swamp the database we are reading from by running too many mappers. It is therefore best for loading relatively small datasets, perhaps for joining with large datasets from HDFS using MultipleInputs. Here the key is a LongWritable while the value is a DBWritable.
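
Input formats are selected in the driver. The sketch below shows a few representative settings; the paths are placeholders, and the separator property name is the one used by KeyValueTextInputFormat in Hadoop 2.x and later.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class InputFormatConfigExample {
    static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Use a comma instead of the default tab to split each line into key and value.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "input format demo");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/in"));   // hypothetical path

        // Alternatively, NLineInputFormat gives each mapper a fixed number of lines:
        // job.setInputFormatClass(NLineInputFormat.class);
        // NLineInputFormat.setNumLinesPerSplit(job, 1000);
        return job;
    }
}
```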

Output Format in MapReduce


The Hadoop OutputFormat checks the output specification of the job and determines how a RecordWriter implementation is used to write output to the output files. The following output formats are commonly used.

TextOutputFormat
The default reducer output format is TextOutputFormat, which writes (key, value) pairs on individual lines of text files. Its keys and values can be of any type, since TextOutputFormat turns them into strings by calling toString() on them. Each key-value pair is separated by a tab character, which can be changed using the mapreduce.output.textoutputformat.separator property. KeyValueTextInputFormat can be used for reading these output text files, since it breaks lines into key-value pairs based on a configurable separator.
SequenceFileOutputFormat
This output format writes sequence files as its output. It is a useful intermediate format between MapReduce jobs: it rapidly serializes arbitrary data types to the file, and the corresponding SequenceFileInputFormat deserializes the file into the same types, presenting the data to the next mapper in the same manner as it was emitted by the previous reducer. Sequence files are also compact and readily compressible. Compression is controlled by the static methods on SequenceFileOutputFormat.

SequenceFileAsBinaryOutputFormat
It is a variant of SequenceFileOutputFormat which writes keys and values to a sequence file in binary format.

MapFileOutputFormat
It is another form of FileOutputFormat in Hadoop Output Format, which is used to write
output as map files. The key in a MapFile must be added in order, so we need to ensure that
reducer emits keys in sorted order.

MultipleOutputs
It allows writing data to files whose names are derived from the output keys and values, or in
fact from an arbitrary string.

LazyOutputFormat
Sometimes FileOutputFormat will create output files even if they are empty. LazyOutputFormat is a wrapper OutputFormat which ensures that an output file is created only when the first record is emitted for a given partition.

DBOutputFormat
DBOutputFormat in Hadoop is an Output Format for writing to relational databases and HBase. It sends the reduce output to a SQL table. It accepts key-value pairs where the key has a type implementing DBWritable. The returned RecordWriter writes only the key to the database using a batch SQL statement.
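
Output formats are likewise configured in the driver. The sketch below wraps TextOutputFormat in LazyOutputFormat so that empty part files are not created, registers a named output through MultipleOutputs, and changes the key-value separator; the names and types are illustrative.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatConfigExample {
    static void configureOutputs(Job job) {
        // Wrap TextOutputFormat so part files are created only when records are written.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        // Register an extra named output; a reducer can write to it via MultipleOutputs.
        MultipleOutputs.addNamedOutput(job, "summary",
                TextOutputFormat.class, Text.class, IntWritable.class);

        // The default tab separator of TextOutputFormat can also be changed:
        job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");
    }
}
```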



Key Features of MapReduce

Scalability
MapReduce can scale to process vast amounts of data by distributing tasks across a large
number of nodes in a cluster. This allows it to handle massive datasets, making it suitable for
Big Data applications.

Fault Tolerance

MapReduce incorporates built-in fault tolerance to ensure the reliable processing of data. It
automatically detects and handles node failures, rerunning tasks on available nodes as
needed.

Data Locality
MapReduce takes advantage of data locality by processing data on the same node where it is
stored, minimizing data movement across the network and improving overall performance.

Simplicity
The MapReduce programming model abstracts away many complexities associated with
distributed computing, allowing developers to focus on their data processing logic rather than
low-level details.

Cost-Effective Solution
Hadoop's scalable architecture and MapReduce programming framework make storing and
processing extensive data sets very economical.

Parallel Programming
Jobs are divided into independent tasks that can be executed simultaneously, so these distributed tasks can be performed by multiple processors in parallel. As a result of this parallel processing, programs run faster and each machine handles only a share of the overall work.

Application Of MapReduce
Entertainment: Hadoop MapReduce helps discover the most popular movies based on what users like and what they have watched, mainly by analysing their logs and clicks.

E-commerce: Numerous e-commerce providers, such as Amazon, Walmart, and eBay, use the MapReduce programming model to identify favourite items based on customers' preferences or purchasing behaviour. This includes building product recommendation mechanisms for e-commerce catalogues and analysing website records, purchase histories, user interaction logs, etc.

Data Warehouse: We can utilize MapReduce to analyze large data volumes in data
warehouses while implementing specific business logic for data insights.

Fraud Detection: Hadoop and MapReduce are used in financial enterprises, including banks, insurance providers, and payment companies, for fraud detection, pattern identification, and business metrics derived through transaction analysis.

KEY POINTS
History of Hadoop:

Hadoop's history is rooted in a project called Nutch, an open-source web search engine
developed by Doug Cutting and Mike Cafarella.



Inspired by Google's MapReduce and Google File System (GFS), Cutting and Cafarella
created Hadoop, naming it after a stuffed toy elephant.
In 2006, Hadoop became an Apache project, fostering a robust open-source community.
Hadoop's evolution led to various versions and components, solidifying its role in Big Data
processing.
Apache Hadoop:

Apache Hadoop is an open-source, Java-based framework for distributed storage and processing of large datasets.
It provides a scalable and fault-tolerant platform for Big Data applications.
Key components include Hadoop Distributed File System (HDFS) and the Map-Reduce
programming model.
The Hadoop Distributed File System (HDFS):

HDFS is Hadoop's primary storage system, designed for handling large datasets.
It breaks data into blocks (typically 128MB or 256MB) and replicates them across the cluster
for fault tolerance.
HDFS follows a master-slave architecture with a NameNode managing metadata and
DataNodes storing data.
Components of Hadoop:

NameNode: Manages the file system namespace and metadata.
DataNode: Stores the actual data blocks and reports to the NameNode.
Resource Manager: Manages and allocates cluster resources.
Node Manager: Monitors resource usage on each DataNode.
Job Tracker (deprecated): Used in earlier versions to manage Map-Reduce jobs.
Task Tracker (deprecated): Monitors and executes tasks in older Hadoop versions.
Data Format:

MapReduce processes data as key-value pairs: records are read as keys with associated values, which facilitates parallel processing.
Hadoop supports structured, semi-structured, and unstructured data through formats such as delimited text, Avro, Parquet, ORC, and SequenceFile.
Analyzing Data with Hadoop:

Hadoop's core processing model is Map-Reduce.
Mappers process and map data into key-value pairs.
Reducers aggregate and process the mapped data to produce results.
Hadoop can handle structured and unstructured data, making it versatile.
Scaling Out:

Hadoop's scalability allows the addition of more commodity hardware to handle growing data
volumes.
Scaling out horizontally involves adding more machines to the cluster.
This architecture accommodates vast datasets and complex computations.
Hadoop Streaming:

Hadoop Streaming allows users to create and run Map-Reduce jobs with any executable or
script.
It's an interface for writing Map and Reduce functions in various programming languages.
Hadoop Pipes:

Hadoop Pipes is a C++ library for creating Map-Reduce applications.
It enables C++ programs to be used as Mapper and Reducer functions.
Hadoop Ecosystem:

The Hadoop ecosystem extends beyond HDFS and Map-Reduce, encompassing various
projects and tools.
Ecosystem components include Hive, Pig, HBase, Spark, and others, each designed for
specific tasks in the Big Data pipeline.
