BIG Data_Unit_2
Hadoop: History of Hadoop, Apache Hadoop, the Hadoop Distributed File System,
components of Hadoop, data format, analysing data with Hadoop, scaling out, Hadoop
streaming, Hadoop pipes, Hadoop Ecosystem.
Map-Reduce: Map-Reduce framework and basics, how Map-Reduce works, developing a
Map-Reduce application, unit tests with MRUnit, test data and local tests, anatomy of a Map-
Reduce job run, failures, job scheduling, shuffle and sort, task execution, Map-Reduce types,
input formats, output formats, Map-Reduce features, real-world Map-Reduce.
History of Hadoop
Early Origins: The origins of Hadoop can be traced back to the early 2000s, when Doug
Cutting and Mike Cafarella, two software engineers, were working on an open-source web
search engine project called Nutch. They were confronted with the challenge of processing
and indexing vast amounts of web data efficiently.
Google's Influence: Around the same time, Google had published two groundbreaking
papers – one on the Google File System (GFS) and the other on the MapReduce
programming model. These papers presented a revolutionary approach to handling and
processing large datasets in a distributed computing environment.
Hadoop's Birth: Inspired by Google's research papers, Cutting and Cafarella recognized the
potential of these concepts and began working on an open-source implementation. In 2005,
they introduced Hadoop, naming it after Cutting's son's toy elephant. Hadoop aimed to solve
the challenges of distributed storage and processing of enormous volumes of data.
Becoming an Apache Project: Hadoop quickly gained attention and began to evolve as a
community-driven project. It entered the Apache Software Foundation as a Lucene subproject
in 2006 and became a top-level Apache project in 2008. This step was pivotal in fostering a
collaborative environment for development and innovation.
Key Components: Hadoop's core components included the Hadoop Distributed File System
(HDFS) for distributed storage and the Map-Reduce programming model for distributed data
processing. These components formed the backbone of Hadoop's architecture.
Rapid Growth: Hadoop's adaptability and scalability made it a popular choice for
organizations dealing with big data challenges. Companies like Yahoo, Facebook, and
LinkedIn adopted Hadoop to manage and analyze massive datasets. The Hadoop ecosystem
continued to expand with the addition of various projects and tools.
Contributions from the Community: Hadoop's growth was driven by contributions from a
vibrant open-source community. It became a powerful platform for various data-related tasks,
including data storage, batch processing, and real-time analytics.
Ongoing Development: Hadoop has continued to evolve with each new release. Over time,
it introduced features like YARN (Yet Another Resource Negotiator) to better support various
processing engines, not just Map-Reduce.
Apache Hadoop
Apache Hadoop is an open-source, Java-based framework that has revolutionized the way
large datasets are stored, processed, and analyzed. It was created to address the challenges of
managing vast volumes of data in a distributed computing environment. Here's an overview
of Apache Hadoop:
Distributed Data Processing: Hadoop is designed to distribute and process data across
clusters of commodity hardware. This distributed approach allows it to handle large datasets
that cannot fit on a single machine.
Scalability: Hadoop offers a scalable architecture. As data volumes grow, you can simply add
more machines to the cluster to handle the increased load. This horizontal scaling makes it
suitable for big data applications.
Fault Tolerance: Hadoop is inherently fault-tolerant. It can recover from hardware failures
because it stores multiple copies of data across the cluster. If one machine fails, the data is
still accessible from other nodes.
Hadoop Distributed File System (HDFS): HDFS is Hadoop's primary storage system. It
divides large files into smaller blocks and distributes these blocks across the cluster. HDFS
provides high throughput and data redundancy.
YARN (Yet Another Resource Negotiator): YARN is a resource management layer that
separates the job scheduling and cluster resource management functions, allowing Hadoop to
support various data processing engines beyond Map-Reduce.
Hadoop Ecosystem: Hadoop has a rich ecosystem of related projects and tools. These
include Apache Pig, Apache Hive, Apache HBase, and Apache Spark, among others. Each
tool is designed to address specific big data challenges, such as data processing, querying,
and real-time analytics.
Use Cases: Hadoop is used in a wide range of applications, including data warehousing, log
processing, recommendation systems, fraud detection, and more. It is particularly valuable
for organizations dealing with large and diverse datasets.
The Hadoop Distributed File System (HDFS)
Distributed Storage: HDFS divides large files into smaller blocks (typically 128 MB or 256
MB in size) and distributes these blocks across a cluster of commodity hardware. This
distributed storage approach enables Hadoop to handle datasets that are too large to fit on a
single machine.
Redundancy and Fault Tolerance: HDFS is built for fault tolerance. It maintains multiple
copies of each data block across different nodes in the cluster. If a node or block becomes
unavailable due to hardware failure, HDFS can still retrieve the data from other copies,
ensuring data availability and integrity.
NameNode: This is the master node that manages the file system's namespace and metadata.
It keeps track of the structure of the file system and where data blocks are located.
DataNode: These are the slave nodes responsible for storing and managing the actual data
blocks. They report to the NameNode, informing it about the health and availability of data
blocks.
Data Replication: HDFS replicates data blocks to ensure fault tolerance. Typically, it creates
three copies of each block by default. This replication factor can be configured to meet
specific requirements, balancing fault tolerance and storage capacity.
High Throughput: HDFS is optimized for high throughput rather than low-latency access. It
is well-suited for batch processing tasks like data analysis and large-scale data storage.
Consistency and Coherency: HDFS follows a write-once-read-many model and keeps all
replicas of a block consistent, so every reader sees the same data once a file has been written
and closed.
Support for Streaming Data: HDFS is optimized for streaming access patterns, which are
common in big data processing. It is less suitable for random reads and writes.
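The sketch below is a minimal, illustrative example of how a client program talks to HDFS through the Java FileSystem API. It assumes the Hadoop client libraries and a valid cluster configuration (core-site.xml/hdfs-site.xml) are on the classpath; the file path used is purely hypothetical.

    // A minimal sketch of writing and reading a file through the HDFS Java API.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // picks up fs.defaultFS from the config files
            FileSystem fs = FileSystem.get(conf);          // client handle to HDFS
            Path file = new Path("/user/demo/sample.txt"); // hypothetical path

            // Write: the client streams data; HDFS splits it into blocks and replicates them.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hdfs");
            }

            // Read: the NameNode supplies block locations, the data comes from DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }

The write call streams data that HDFS breaks into blocks and replicates across DataNodes; the read call first obtains block locations from the NameNode, as described above.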
Components of Hadoop
Apache Hadoop is composed of various components that work together to enable distributed
data storage and processing. These components form the core of the Hadoop ecosystem.
Here's an overview of the key components:
Hadoop Distributed File System (HDFS): HDFS is the primary storage component of
Hadoop. It divides large files into smaller blocks and stores multiple copies of these blocks
across the cluster. HDFS is designed for fault tolerance and high throughput data access.
MapReduce: MapReduce is a programming model and processing engine that allows for
distributed data processing. It breaks down tasks into smaller sub-tasks and processes them in
parallel across the cluster. While MapReduce is an essential component, newer processing
engines like Apache Spark have become popular alternatives.
Yet Another Resource Negotiator (YARN): YARN is the resource management layer of
Hadoop. It separates the job scheduling and cluster resource management functions, allowing
Hadoop to support various data processing engines beyond MapReduce. YARN enhances the
flexibility and efficiency of cluster resource utilization.
Hadoop Common: Hadoop Common includes the shared utilities and libraries used by other
Hadoop modules. It provides a set of common functions, configurations, and tools that
facilitate the operation of Hadoop clusters.
Hadoop Distributed Copy (DistCp): DistCp is a tool used for efficiently copying large
amounts of data within or between Hadoop clusters. It is particularly useful for migrating
data or creating backups.
Hadoop Archive (HAR): HAR is a file archiving tool used to bundle and compress multiple
small files into a single Hadoop archive file. This helps in reducing the number of files and
improving the efficiency of storage and processing.
Ambari: Apache Ambari is a management and monitoring tool for Hadoop clusters. It
provides a web-based interface for cluster provisioning, managing, and monitoring,
simplifying cluster administration tasks.
Other Ecosystem Projects: In addition to these core components, Hadoop has a rich
ecosystem of related projects and tools. Some notable projects include Apache Hive for data
warehousing, Apache Pig for data-flow scripting, Apache HBase for NoSQL storage, and
Apache Spark for in-memory processing.
Data Format
Data format refers to the structure and organization of data, which determines how it can be
stored, processed, and interpreted by computer systems. In the context of big data and
Hadoop, data format is crucial for efficient storage and analysis. Here are some common data
formats used in this domain:
Structured Data: Structured data is highly organized and follows a specific schema, often in
tabular form. Examples include data stored in relational databases, spreadsheets, or CSV
(Comma-Separated Values) files. Structured data is easy to query and analyze.
Unstructured Data: Unstructured data lacks a specific format and doesn't fit neatly into
tables or rows. It includes text documents, images, videos, and social media posts. Analyzing
unstructured data often requires natural language processing and machine learning
techniques.
Semi-Structured Data: Semi-structured data falls between structured and unstructured data.
It has some level of organization, typically using tags, labels, or hierarchies. Examples
include JSON (JavaScript Object Notation), XML (eXtensible Markup Language), and log
files.
Avro: Apache Avro is a popular data serialization system. It provides a compact, efficient,
and self-describing data format. Avro is often used in Hadoop for data storage and exchange.
Parquet: Apache Parquet is a columnar storage format that is optimized for big data
processing. It stores data in a highly compressed, column-oriented layout, making it efficient
for analytical queries.
ORC (Optimized Row Columnar): ORC is a file format for Hadoop that provides a
lightweight, efficient, and self-describing structure for data storage. It's commonly used in
conjunction with Hadoop and Hive for data warehousing.
SequenceFile: SequenceFile is a binary file format used in Hadoop that can store data in key-
value pairs. It's optimized for large data volumes and is often used for intermediate data
storage in Map-Reduce jobs.
Thrift and Protocol Buffers: Thrift and Protocol Buffers are binary serialization formats
used for efficient data exchange between systems. They provide a compact and efficient way
to serialize structured data.
Delimited Text: Delimited text formats, such as CSV, TSV (Tab-Separated Values), and other
custom delimiters, are widely used for structured data storage. They are easy to work with
and can be processed using Hadoop tools.
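As a small illustration of one of the binary formats above, the following sketch writes key-value records to a SequenceFile through the standard SequenceFile.Writer API. The output path and the records written are purely illustrative, and Hadoop client libraries are assumed to be on the classpath.

    // A minimal sketch of writing key-value pairs to a SequenceFile.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/user/demo/counts.seq");   // hypothetical path

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(IntWritable.class))) {
                writer.append(new Text("hadoop"), new IntWritable(1));     // one key-value record
                writer.append(new Text("mapreduce"), new IntWritable(2));  // another record
            }
        }
    }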
Hadoop streaming
Hadoop Streaming uses UNIX standard streams as the interface between Hadoop and your
program, so you can write MapReduce programs in any language that can read from standard
input and write to standard output. Hadoop offers several ways to support non-Java
development. The primary mechanisms are Hadoop Pipes, which gives a native C++ interface
to Hadoop, and Hadoop Streaming, which permits any program that uses standard input and
output to be used for map tasks and reduce tasks.
With this utility, one can create and run MapReduce jobs with any executable or script as the
mapper and/or the reducer.
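Because the streaming contract is simply "read lines from standard input, write key<TAB>value lines to standard output", a streaming mapper can be written in any language. The sketch below illustrates that contract in plain Java, with no Hadoop classes at all; the class name is hypothetical, and such a program would typically be passed to the hadoop-streaming JAR via its -mapper option (with a counterpart passed via -reducer).

    // A minimal sketch of the Hadoop Streaming mapper contract:
    // read lines from standard input, emit "key<TAB>value" lines on standard output.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class StreamingWordCountMapper {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        System.out.println(word + "\t" + 1);   // key<TAB>value
                    }
                }
            }
        }
    }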
Hadoop Pipes
It is the name of the C++ interface to Hadoop MapReduce. Unlike Hadoop Streaming, which
uses standard I/O to communicate with the map and reduce code, Pipes uses sockets as the
channel over which the task tracker communicates with the process running the C++ map or
reduce function. JNI is not used.
MapReduce
MapReduce is the processing engine of Hadoop that processes and computes large
volumes of data. It is one of the most common engines used by Data Engineers to
process Big Data. It allows businesses and other organizations to run calculations
to:
• Determine the price for their products that yields the highest profits
• Know precisely how effective their advertising is and where they should
spend their ad dollars
• Mine web clicks, sales records purchased from retailers, and Twitter
trending topics to determine what new products the company should
produce in the upcoming season
A MapReduce job runs in two phases:
1. Mapping
2. Reducing
The following part discusses both of these phases.
A mapper class handles the mapping phase; it maps the data present in different
datanodes. A reducer class handles the reducing phase; it aggregates and reduces
the output of different datanodes to generate the final output.
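As a concrete sketch of these two classes, the classic word-count example below uses the org.apache.hadoop.mapreduce API; the class names are illustrative and are reused by later sketches in these notes.

    // A minimal word-count sketch of the mapper and reducer classes described above.
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map phase: called once per input line, emits (word, 1) pairs.
        public static class WordCountMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: receives (word, [1, 1, ...]) and emits (word, total).
        public static class WordCountReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }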
Input Data
Hadoop accepts data in various formats and stores it in HDFS. This input data is
worked upon by multiple map tasks.
Map Tasks
Map reads the data, processes it, and generates key-value pairs. The number of map
tasks depends upon the input file and its format.
Typically, a file in a Hadoop cluster is broken down into blocks, each with a default
size of 128 MB. Depending upon the size, the input file is split into multiple chunks. A
map task then runs for each chunk. The mapper class has mapper functions that
decide what operation is to be performed on each chunk.
In the reducing phase, a reducer class performs operations on the data generated
from the map tasks through a reducer function. It shuffles, sorts, and aggregates the
intermediate key-value pairs (tuples) into a set of smaller tuples.
Output
The smaller set of tuples is the final output and gets stored in HDFS.
Let us look at the MapReduce workflow in the next section.
MapReduce Workflow
• RecordReader communicates with the input split and converts the data
into key-value pairs suitable to be read by the mapper.
• The mapper works on the key-value pairs and produces an intermediate output,
which goes for further processing.
• The partitioner decides which reducer each intermediate key-value pair
(optionally pre-aggregated by a combiner) is sent to.
• The output of the partitioner is shuffled and sorted. This output is fed as
input to the reducer.
• The reducer aggregates all the intermediate values for each intermediate key
and produces the final output key-value pairs.
• The RecordWriter writes these output key-value pairs from the reducer to the
output files.
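A minimal driver sketch that wires this workflow together is shown below. It assumes the WordCountMapper and WordCountReducer classes sketched earlier, and it reuses the reducer as a combiner (the "mini reduce"), which is valid here because summing counts is associative.

    // A minimal driver sketch that configures and submits the job.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCount.WordCountMapper.class);
            job.setCombinerClass(WordCount.WordCountReducer.class); // local "mini reduce"
            job.setReducerClass(WordCount.WordCountReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist

            System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait
        }
    }

The job is run with two arguments: an input directory in HDFS and an output directory that does not yet exist.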
The next section covers the architecture of MapReduce.
MapReduce Architecture
Hadoop v1 had a JobTracker as the master, coordinating the TaskTrackers. In Hadoop v2, the
ResourceManager assigns the job to the NodeManagers, which then handle the processing on
every node. Once an application to be run on the YARN processing framework is submitted,
it is handled by the ResourceManager.
The data stored in HDFS is broken down into one or more splits, depending on the input
format. One or more map tasks, running within containers on the nodes, work on these input
splits.
Each map task consumes RAM within its container, and the reduce phase likewise consumes
RAM and CPU cores. Internally, the framework decides the number of reducers, performs a
mini reduce (combine), and reads and processes the data from multiple data nodes.
This is how the MapReduce programming model makes parallel processing work.
Finally, the output is generated and gets stored in HDFS.
Unit Tests with MRUnit
1. MRUnit Overview:
MRUnit is a Java library specifically designed for unit testing Hadoop MapReduce jobs.
It allows you to write unit tests for your Mapper, Reducer, and Driver classes without the
need to run them on an actual Hadoop cluster.
MRUnit provides a set of APIs for creating test cases, setting input data, and verifying the
output, making it an essential tool for validating your MapReduce code.
2. Creating Test Data:
To perform unit tests, you'll need sample input data that mimics the data your MapReduce job
will process.
You can create test data as input for your Mapper and Reducer by using text files, key-value
pairs, or other relevant data structures.
Ensure that your test data covers various scenarios and edge cases that your MapReduce job
should handle.
3. Writing Unit Tests:
Develop JUnit test cases for your Mapper, Reducer, and Driver classes using MRUnit.
Set up the test environment, including configuring input data and expected output.
Invoke your Mapper and Reducer classes as part of the test case, providing the test data as
input.
Assert that the actual output matches the expected output using MRUnit assertions.
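The following is a minimal MRUnit sketch for testing a word-count mapper such as the one shown earlier. It assumes MRUnit and JUnit are on the test classpath; the class names are illustrative.

    // A minimal MRUnit sketch testing the word-count mapper with the new-API MapDriver.
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Before;
    import org.junit.Test;

    public class WordCountMapperTest {
        private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

        @Before
        public void setUp() {
            mapDriver = MapDriver.newMapDriver(new WordCount.WordCountMapper());
        }

        @Test
        public void mapperEmitsOnePerWord() throws Exception {
            mapDriver.withInput(new LongWritable(0), new Text("big data big"))
                     .withOutput(new Text("big"), new IntWritable(1))
                     .withOutput(new Text("data"), new IntWritable(1))
                     .withOutput(new Text("big"), new IntWritable(1))
                     .runTest();   // compares actual output against the expected pairs
        }
    }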
4. Running Local Tests:
Local tests using MRUnit are executed within your development environment (e.g., your
local development machine) and do not require a Hadoop cluster.
Running local tests is faster and more efficient for iterative development and debugging.
Local tests help you identify and resolve issues with your MapReduce code before deploying
it to a Hadoop cluster, reducing the risk of job failures and data processing errors in a
production environment.
5. Benefits of Local Tests:
Local tests allow you to validate the correctness of your MapReduce logic in a controlled
environment.
They help you catch and fix errors early in the development cycle, which is essential for
maintaining the reliability and efficiency of your MapReduce jobs.
Local tests are an integral part of the test-driven development (TDD) approach, where you
write tests before implementing the code.
6. Continuous Integration:
To ensure that unit tests are executed automatically and consistently as part of your
development workflow, consider integrating them into a continuous integration (CI) pipeline.
CI tools like Jenkins, Travis CI, or CircleCI can be configured to run unit tests for your
MapReduce jobs whenever changes are committed to the code repository.
In summary, unit testing with MRUnit and local tests are essential practices for validating the
correctness and functionality of your Hadoop MapReduce jobs. By creating test data, writing
unit tests, and running them locally, you can identify and address issues early in the
development process, leading to more robust and reliable MapReduce applications.
Anatomy of a MapReduce Job Run
Once we submit a MapReduce job, the system goes through a series of life-cycle phases: job
submission, job initialization, task assignment, task execution, progress updates, and
completion or failure handling. The later phases are described below:
5. Progress Update
Once execution has started on the distributed data nodes, progress has to be sent back to the
modules that initiated the job. The Mapper and Reducer JVM execution environments send a
progress report to the corresponding ApplicationMaster (AM) periodically (every second). The
AM accumulates progress from all MapReduce tasks and communicates it to the client, but only
when the progress has changed, every 3 seconds. In turn, the client polls for completion every
5 seconds. Once all the tasks are completed, the AM cleans up the temporary directory and sends
the response to the client with the results. The mappers' and reducers' intermediate output is
deleted only when all the tasks have received the completion response; otherwise, if a reducer
fails, it still needs the output from the mapper tasks.
6. Failure Recovery
Hadoop keeps a trace of user and system operations by using the FSImage and edit logs in the
NameNode. The Secondary NameNode checkpoint mechanism provides a recovery technique
for Hadoop in case the NameNode goes down.
MapReduce Types
MapReduce is a programming model and processing technique designed for processing and
generating large datasets that can be parallelized across a distributed cluster of computers. It
consists of two main phases: the Map phase and the Reduce phase. While the basic
MapReduce model remains the same, there are different types of MapReduce programs based
on the nature of the tasks they perform. Here are a few types:
Batch MapReduce:
Description: This is the most common type of MapReduce job. It involves processing a large
volume of data in batch mode, where the entire dataset is processed and analyzed.
Real-time MapReduce:
Description: Unlike batch processing, real-time MapReduce processes data as it arrives,
providing results in near real-time.
Iterative MapReduce:
Description: Some algorithms, like machine learning algorithms, require multiple iterations
over the same dataset. Iterative MapReduce allows for the reuse of intermediate results
between iterations, reducing redundant computations.
Interactive MapReduce:
Description: In interactive MapReduce, the system allows users to query and interact with the
data in a more dynamic and ad-hoc manner compared to traditional batch processing.
Distributed MapReduce:
Description: This refers to MapReduce jobs that are executed on a distributed cluster of
machines to handle large-scale data processing tasks.
Input Formats
TextInputFormat
It is the default InputFormat of MapReduce. TextInputFormat treats each line of each input
file as a separate record and performs no parsing. This is useful for unformatted data or line-
based records like log files.
• Key – It is the byte offset of the beginning of the line within the file (not the whole
file, just one split), so it is unique when combined with the file name.
• Value – It is the contents of the line, excluding line terminators.
KeyValueTextInputFormat
It is similar to TextInputFormat in that it also treats each line of input as a separate record.
However, while TextInputFormat treats the entire line as the value, KeyValueTextInputFormat
breaks the line itself into a key and a value at a tab character ('\t'). Here the key is everything
up to the tab character, while the value is the remaining part of the line after the tab character.
SequenceFileInputFormat
Hadoop SequenceFileInputFormat is an InputFormat which reads sequence files. Sequence
files are binary files that store sequences of binary key-value pairs. Sequence files can be
block-compressed and provide direct serialization and deserialization of several arbitrary data
types (not just text). Here both key and value are user-defined.
SequenceFileAsTextInputFormat
Hadoop SequenceFileAsTextInputFormat is another form of SequenceFileInputFormat
which converts the sequence file keys and values to Text objects. The conversion is performed
by calling toString() on the keys and values. This InputFormat makes sequence files suitable
input for Streaming.
DBInputFormat
Hadoop DBInputFormat is an InputFormat that reads data from a relational database using
JDBC. Because it has no partitioning capabilities, we need to be careful not to swamp the
database by running too many mappers against it. So it is best for loading relatively small
datasets, perhaps for joining with large datasets from HDFS using MultipleInputs. Here the
key is a LongWritable while the value is a DBWritable.
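The input format is chosen in the driver. The short sketch below selects KeyValueTextInputFormat and overrides its separator character; the separator property name shown is the Hadoop 2.x name and should be verified against the cluster's Hadoop version.

    // A minimal sketch of selecting an InputFormat in the driver;
    // a Job object like the driver shown earlier is assumed.
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

    public class InputFormatConfig {
        public static void configure(Job job) {
            // Split each line into key and value at the first separator character
            // (tab by default; overridden to ',' here purely as an illustration).
            job.getConfiguration().set(
                    "mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
            job.setInputFormatClass(KeyValueTextInputFormat.class);
        }
    }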
Output Formats
TextOutputFormat
The default Hadoop reducer Output Format is TextOutputFormat, which writes (key, value)
pairs on individual lines of text files. Its keys and values can be of any type, since
TextOutputFormat turns them into strings by calling toString() on them. Each key-value
pair is separated by a tab character, which can be changed using the
mapreduce.output.textoutputformat.separator property.
SequenceFileAsBinaryOutputFormat
It is a form of SequenceFileOutputFormat which writes keys and values to a sequence file in
raw binary format.
MapFileOutputFormat
It is another form of FileOutputFormat in Hadoop, used to write output as map files. The keys
in a MapFile must be added in order, so we need to ensure that the reducer emits keys in
sorted order.
MultipleOutputs
It allows writing data to files whose names are derived from the output keys and values, or in
fact from an arbitrary string.
LazyOutputFormat
Sometimes FileOutputFormat creates output files even if they are empty. LazyOutputFormat
is a wrapper OutputFormat which ensures that the output file is created only when a record is
emitted for a given partition.
DBOutputFormat
DBOutputFormat in Hadoop is an Output Format for writing to relational databases
and HBase. It sends the reduce output to a SQL table. It accepts key-value pairs, where the
key has a type extending DBWritable. The returned RecordWriter writes only the key to the
database with a batch SQL query.
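A short driver-side sketch combining two of the formats above is given below: LazyOutputFormat so that empty part files are not created, and MultipleOutputs to register an extra named output. The named output "summary" is purely illustrative, and a Job object like the driver shown earlier is assumed.

    // A minimal sketch of output-format configuration in the driver.
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class OutputFormatConfig {
        public static void configure(Job job) {
            // Only create an output file for a partition once a record is emitted.
            LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

            // Register an additional named output ("summary" is an illustrative name);
            // a reducer can then write to it through a MultipleOutputs instance.
            MultipleOutputs.addNamedOutput(job, "summary",
                    TextOutputFormat.class, Text.class, IntWritable.class);
        }
    }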
MapReduce Features
Scalability
MapReduce can scale to process vast amounts of data by distributing tasks across a large
number of nodes in a cluster. This allows it to handle massive datasets, making it suitable for
Big Data applications.
Fault Tolerance
If a task or node fails, the framework automatically re-runs the failed task on another node,
so a job can complete despite hardware failures.
Data Locality
MapReduce takes advantage of data locality by processing data on the same node where it is
stored, minimizing data movement across the network and improving overall performance.
Simplicity
The MapReduce programming model abstracts away many complexities associated with
distributed computing, allowing developers to focus on their data processing logic rather than
low-level details.
Cost-Effective Solution
Hadoop's scalable architecture and MapReduce programming framework make storing and
processing extensive data sets very economical.
Parallel Programming
A job is divided into independent tasks so that they can execute simultaneously on multiple
processors. Because these distributed tasks run in parallel, each processor handles only a
portion of the work and programs complete faster.
Applications of MapReduce
Entertainment: Hadoop MapReduce helps streaming and media services discover the most
popular movies and recommend content based on what users like and watch, mainly by
analyzing their viewing logs and clicks.
E-commerce: Numerous e-commerce providers, such as Amazon, Walmart, and eBay, use
the MapReduce programming model to identify the most popular products based on
customers' preferences or purchasing behaviour.
This includes building product recommendation mechanisms for e-commerce catalogues and
analyzing website records, purchase history, and user interaction logs.
Data Warehouse: We can utilize MapReduce to analyze large data volumes in data
warehouses while implementing specific business logic for data insights.
Fraud Detection: Hadoop and MapReduce are used in financial industries, including
organizations such as banks, insurance providers, and payment processors, for fraud
detection, pattern identification, and business metrics through transaction analysis.
KEY POINTS
History of Hadoop:
Hadoop's history is rooted in a project called Nutch, an open-source web search engine
developed by Doug Cutting and Mike Cafarella.
Hadoop Distributed File System (HDFS):
HDFS is Hadoop's primary storage system, designed for handling large datasets.
It breaks data into blocks (typically 128MB or 256MB) and replicates them across the cluster
for fault tolerance.
HDFS follows a master-slave architecture with a NameNode managing metadata and
DataNodes storing data.
Components of Hadoop:
The core components are HDFS for storage, MapReduce for processing, and YARN for
resource management.
Scaling Out:
Hadoop's scalability allows the addition of more commodity hardware to handle growing data
volumes.
Scaling out horizontally involves adding more machines to the cluster.
This architecture accommodates vast datasets and complex computations.
Hadoop Streaming:
Hadoop Streaming allows users to create and run Map-Reduce jobs with any executable or
script.
It's an interface for writing Map and Reduce functions in various programming languages.
Hadoop Pipes:
Hadoop Pipes is the C++ interface to Hadoop MapReduce; it communicates with the C++
map and reduce code over sockets rather than standard I/O or JNI.
Hadoop Ecosystem:
The Hadoop ecosystem extends beyond HDFS and Map-Reduce, encompassing various
projects and tools.
Ecosystem components include Hive, Pig, HBase, Spark, and others, each designed for
specific tasks in the Big Data pipeline.