MODULE 3

Apache Hadoop is a collection of open-source software utilities that facilitate
using a network of many computers to solve problems involving massive amounts
of data and computation. It provides a software framework for distributed
storage and processing of big data using the MapReduce programming model.
Originally designed for computer clusters built from commodity hardware, which is
still the common use case, it has also found use on clusters of higher-end
hardware. All the modules in Hadoop are designed with a fundamental assumption
that hardware failures are common occurrences and should be automatically
handled by the framework.
The core of Apache Hadoop consists of a storage part, known as Hadoop
Distributed File System (HDFS), and a processing part which is a MapReduce
programming model. Hadoop splits files into large blocks and distributes them
across nodes in a cluster. It then transfers packaged code into nodes to process the
data in parallel. This approach takes advantage of data locality, where nodes
manipulate the data they have access to. This allows the dataset to
be processed faster and more efficiently than it would be in a more
conventional supercomputer architecture that relies on a parallel file system where
computation and data are distributed via high-speed networking.
The base Apache Hadoop framework is composed of the following modules:

• Hadoop Common – contains libraries and utilities needed by other Hadoop
modules;
• Hadoop Distributed File System (HDFS) – a distributed file system that
stores data on commodity machines, providing very high aggregate bandwidth
across the cluster;
• Hadoop YARN – (introduced in 2012) a platform responsible for managing
computing resources in clusters and using them for scheduling users'
applications;
• Hadoop MapReduce – an implementation of the MapReduce programming
model for large-scale data processing.

1. Hadoop History
Hadoop began with Doug Cutting and Mike Cafarella in 2002, when they started
working on the Apache Nutch project. Apache Nutch was an effort to build a
search engine system that could index one billion pages. After a lot of research on
Nutch, they concluded that such a system would cost around half a million dollars
in hardware, along with a monthly running cost of approximately $30,000, which
was very expensive. They realized that their project architecture would not be
capable of handling billions of pages on the web, so they began looking for a
feasible solution that could reduce the implementation cost as well as solve the
problem of storing and processing large datasets.
In 2003, they came across a paper published by Google that described the
architecture of Google's distributed file system, called GFS (Google File System),
for storing very large data sets. They realized that this paper could solve their
problem of storing the very large files being generated by the web crawling and
indexing processes. But this paper was only half the solution to their problem.
In 2004, Google published another paper, on the MapReduce technique, which was
the solution for processing those large datasets. This paper was the other half of
the solution for Doug Cutting and Mike Cafarella's Nutch project. Both techniques
(GFS and MapReduce) existed only on paper at Google; Google had not released an
implementation of either. Doug Cutting knew from his work on Apache Lucene (a
free and open-source information retrieval software library, originally written in
Java by Doug Cutting in 1999) that open source is a great way to spread technology
to more people. So, together with Mike Cafarella, he started implementing
Google's techniques (GFS and MapReduce) as open source in the Apache Nutch
project.
In 2005, Cutting found that Nutch was limited to 20-to-40-node clusters. He soon
realized two problems:
(a) Nutch would not achieve its potential until it ran reliably on larger clusters,
(b) and that looked impossible with just two people (Doug Cutting and Mike
Cafarella).
The engineering task in the Nutch project was much bigger than he had realized, so
he started looking for a job with a company interested in investing in their efforts.
He found Yahoo!, which had a large team of engineers eager to work on the
project.
So in 2006, Doug Cutting joined Yahoo, taking the Nutch project with him. He
wanted to provide the world with an open-source, reliable, scalable computing
framework, with the help of Yahoo. At Yahoo he first separated the distributed
computing parts from Nutch and formed a new project, Hadoop (he chose the name
Hadoop because it was the name of a yellow toy elephant owned by his son, was
easy to pronounce, and was a unique word). He then wanted to make Hadoop work
well on thousands of nodes, so he continued developing Hadoop based on GFS and
MapReduce.
In 2007, Yahoo successfully tested Hadoop on a 1000-node cluster and started
using it.
In January 2008, Yahoo released Hadoop as an open-source project to the
ASF (Apache Software Foundation). In July 2008, the Apache Software
Foundation successfully tested a 4000-node cluster with Hadoop.
In 2009, Hadoop was successfully used to sort a petabyte (PB) of data in less
than 17 hours, handling billions of searches and indexing millions of web pages.
Doug Cutting then left Yahoo and joined Cloudera to take on the challenge of
spreading Hadoop to other industries.
In December 2011, the Apache Software Foundation released Apache Hadoop
version 1.0. Version 2.0.6 followed in August 2013, and Apache Hadoop version
3.0 was released in December 2017.
2. HDFS Overview
The Hadoop Distributed File System was developed using a distributed file system
design. It runs on commodity hardware. Unlike other distributed systems, HDFS is
highly fault-tolerant and designed using low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such
huge data, the files are stored across multiple machines. These files are stored in a
redundant fashion to rescue the system from possible data losses in case of failure.
HDFS also makes applications available for parallel processing.

Features of HDFS

• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the name node and data node help users easily check
the status of the cluster.
• It provides streaming access to file system data.
• HDFS provides file permissions and authentication.

HDFS Architecture

Given below is the architecture of a Hadoop File System. HDFS follows the
master-slave architecture and it has the following elements.
Namenode
The name node is the commodity hardware that contains the GNU/Linux
operating system and the name node software. It is software that can run on
commodity hardware. The system having the name node acts as the master server
and does the following tasks −
• Manages the file system namespace.
• Regulates client’s access to files.
• It also executes file system operations such as renaming, closing, and
opening files and directories.
Data node
The data node is commodity hardware having the GNU/Linux operating system
and data node software. For every node (commodity hardware/system) in a
cluster, there will be a data node. These nodes manage the data storage of their
system.
• Datanodes perform read-write operations on the file systems, as per client
request.
• They also perform operations such as block creation, deletion, and
replication according to the instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. A file in the file system is
divided into one or more segments that are stored in individual data nodes.
These file segments are called blocks. In other words, the minimum amount of
data that HDFS can read or write is called a block. The default block size is
64 MB, but it can be changed as needed in the HDFS configuration.
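
To make the idea of blocks and HDFS access concrete, here is a minimal sketch (not from the original text) that writes a file through the Hadoop FileSystem Java API and requests a non-default block size; the path, replication factor, and block size used here are illustrative assumptions.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt");  // hypothetical HDFS path
        long blockSize = 128L * 1024 * 1024;            // request 128 MB blocks instead of the default

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, blockSize);
        out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        out.close();

        System.out.println("Block size used: " + fs.getFileStatus(file).getBlockSize());
        fs.close();
    }
}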
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity
hardware components, failure of components is frequent. Therefore HDFS should
have mechanisms for quick and automatic fault detection and recovery.
Huge datasets − HDFS should scale to hundreds of nodes per cluster to manage
applications having huge datasets.
Hardware at data − A requested task can be done efficiently when the
computation takes place near the data. Especially where huge datasets are
involved, this reduces the network traffic and increases the throughput.
3. COMPONENTS OF HADOOP
Apache Pig: software for analysing large data sets, consisting of a high-level
language similar to SQL for expressing data analysis programs, coupled with
infrastructure for evaluating these programs. It contains a compiler that produces
sequences of MapReduce programs.

HBase: a non-relational columnar distributed database designed to run on top of
the Hadoop Distributed File System (HDFS). It is written in Java and modelled after
Google's Big Table. HBase is an example of a NoSQL data store.

Hive: a data warehousing application that provides an SQL interface and
relational model. The Hive infrastructure is built on top of Hadoop and helps in
providing summarization, query, and analysis.

Cascading: a software abstraction layer for Hadoop, intended to hide the
underlying complexity of MapReduce jobs. Cascading allows users to create and
execute data processing workflows on Hadoop clusters using any JVM-based
language.

Avro: a data serialization system and data exchange service. It is basically
used in Apache Hadoop. These services can be used together as well as
independently.

Big Top: It is used for packaging and testing the Hadoop ecosystem.

Oozie: Oozie is a Java-based web application that runs in a Java servlet container.
Oozie uses a database to store the definition of a workflow, which is a collection of
actions. It manages Hadoop jobs.
There are many advantages of Hadoop: the Hadoop framework allows the user to
quickly write and test distributed systems. It is efficient, and it automatically
distributes the data and work across the machines and, in turn, utilizes the
underlying parallelism of the CPU cores. Hadoop does not rely on hardware to
provide fault tolerance and high availability (FTHA); rather, the Hadoop library
itself has been designed to detect and handle failures at the application layer.
4. ANALYSING BIG DATA WITH HADOOP
Big data analytics is the process of examining large amounts of data of different
types, or big data, in an effort to uncover hidden patterns, unknown
correlations, and other useful information.

Advantages of Big Data Analysis

Big data analysis allows market analysts, researchers and business users to
develop deep insights from the available data, resulting in numerous business
advantages. Business users are able to make a precise analysis of the data and
the key early indicators from this analysis can mean fortunes for the business.
Some of the exemplary use cases are as follows:
• Whenever users browse travel portals, shopping sites, search flights,
hotels or add a particular item into their cart, then Ad Targeting
companies can analyze this wide variety of data and activities and can
provide better recommendations to the user regarding offers, discounts
and deals based on the user browsing history and product history.
• In the telecommunications space, if customers are moving from one
service provider to another, then by analyzing huge volumes of call
data records, the various issues faced by the customers can be
unearthed. Issues could be as wide-ranging as a significant increase in
call drops or network congestion problems. Based on the analysis of
these issues, it can be identified whether a telecom company needs
to place a new tower in a particular urban area, or whether it needs to
revise its marketing strategy for a particular region where a new player
has come up. That way customer churn can be proactively minimized.

Hadoop Data Analysis Technologies

Let's have a look at the existing open-source Hadoop data analysis
technologies used to analyze the huge volumes of data being generated very frequently.
MapReduce
MapReduce is a framework using which we can write applications to process
huge amounts of data, in parallel, on large clusters of commodity hardware in a
reliable manner.
MapReduce is a processing technique and a programming model for distributed
computing based on Java. The MapReduce algorithm contains two important
tasks, namely Map and Reduce. Map takes a set of data and converts it into
another set of data, where individual elements are broken down into tuples
(key/value pairs). The reduce task takes the output from a map as its
input and combines those data tuples into a smaller set of tuples. As the sequence
of the name MapReduce implies, the reduce task is always performed after the
map job.
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing
primitives are called mappers and reducers. Decomposing a data processing
application into mappers and reducers is sometimes nontrivial. But, once we write
an application in the MapReduce form, scaling the application to run over
hundreds, thousands, or even tens of thousands of machines in a cluster is merely
a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
The Algorithm
• Generally the MapReduce paradigm is based on sending the computation to
where the data resides.
• MapReduce program executes in three stages, namely map stage, shuffle
stage, and reduce stage.
o Map stage − The map or mapper’s job is to process the input data.
Generally the input data is in the form of file or directory and is
stored in the Hadoop file system (HDFS). The input file is passed to
the mapper function line by line. The mapper processes the data and
creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage
and the Reduce stage. The Reducer’s job is to process the data that
comes from the mapper. After processing, it produces a new set of
output, which will be stored in the HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
• The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the
nodes.
• Most of the computing takes place on nodes with data on local disks that
reduces the network traffic.
• After completion of the given tasks, the cluster collects and reduces the data
to form an appropriate result, and sends it back to the Hadoop server.

Inputs and Outputs (Java Perspective)


The MapReduce framework operates on <key, value> pairs, that is, the framework
views the input to the job as a set of <key, value> pairs and produces a set of
<key, value> pairs as the output of the job, conceivably of different types.
The key and the value classes have to be serializable by the framework
and hence need to implement the Writable interface. Additionally, the key classes
have to implement the WritableComparable interface to facilitate sorting by the
framework. The input and output types of a MapReduce job are − (Input) <k1, v1> →
map → <k2, v2> → reduce → <k3, v3> (Output).

Map: input <k1, v1>, output list(<k2, v2>)

Reduce: input <k2, list(v2)>, output list(<k3, v3>)
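
As a concrete illustration of these types (this example is not from the original text and the class names are hypothetical), here is a minimal word-count sketch in Java: the mapper turns each input line into <word, 1> pairs, and the reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: (k1, v1) = (byte offset, line of text) -> list of (k2, v2) = (word, 1)
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);              // emit <word, 1>
        }
    }
}

// Reducer: (k2, list(v2)) = (word, [1, 1, ...]) -> (k3, v3) = (word, total count)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // emit <word, sum>
    }
}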

Terminology
• PayLoad − Applications implement the Map and the Reduce functions, and
form the core of the job.
• Mapper − Mapper maps the input key/value pairs to a set of intermediate
key/value pair.
• NamedNode − Node that manages the Hadoop Distributed File System
(HDFS).
• DataNode − Node where data is presented in advance before any
processing takes place.
• MasterNode − Node where JobTracker runs and which accepts job requests
from clients.
• SlaveNode − Node where Map and Reduce program runs.
• JobTracker − Schedules jobs and tracks the jobs assigned to the Task Tracker.
• Task Tracker − Tracks the task and reports status to JobTracker.
• Job − A program is an execution of a Mapper and Reducer across a dataset.
• Task − An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt − A particular instance of an attempt to execute a task on a
SlaveNode.
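
Tying several of these terms together, here is a minimal driver sketch (illustrative only, and assuming the hypothetical WordCountMapper and WordCountReducer classes sketched earlier) that configures a Job and submits it to the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");    // the Job: a Mapper and Reducer across a dataset
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);        // PayLoad: the Map function
        job.setReducerClass(WordCountReducer.class);      // PayLoad: the Reduce function
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input data in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        // Submitting the job hands it to the cluster's scheduler (the JobTracker in classic
        // MapReduce, the ResourceManager under YARN), which assigns tasks to slave nodes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}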

Apache Pig
Apache Pig is an abstraction over MapReduce. It is a tool/platform used
to analyze large sets of data, representing them as data flows. Pig is generally
used with Hadoop; we can perform all the data manipulation operations in
Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig
Latin. This language provides various operators using which programmers can
develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig
Latin language. All these scripts are internally converted to Map and Reduce
tasks. Apache Pig has a component known as Pig Engine that accepts the Pig
Latin scripts as input and converts those scripts into MapReduce jobs.
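
As an illustrative sketch (not from the original text), the following Java program uses Pig's embedded mode via the PigServer class to register a few Pig Latin statements, which the Pig Engine then compiles into MapReduce jobs; the input and output paths are hypothetical.

import org.apache.pig.PigServer;

public class EmbeddedPigWordCount {
    public static void main(String[] args) throws Exception {
        // "mapreduce" execution mode runs the generated jobs on the Hadoop cluster;
        // "local" mode can be used for testing against the local file system.
        PigServer pig = new PigServer("mapreduce");

        // Each registerQuery adds one Pig Latin statement to the logical plan.
        pig.registerQuery("lines = LOAD '/user/demo/input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // store() triggers execution: the plan is compiled into MapReduce jobs
        // and the result is written to HDFS.
        pig.store("counts", "/user/demo/wordcount-out");
        pig.shutdown();
    }
}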
Why Do We Need Apache Pig?
Programmers who are not comfortable with Java often struggle to work with
Hadoop, especially while performing MapReduce tasks. Apache Pig is a boon
for all such programmers.
• Using Pig Latin, programmers can perform MapReduce tasks easily
without having to type complex code in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of
the code. For example, an operation that would require you to type 200 lines
of code (LoC) in Java can be done by typing as few as 10 LoC in
Apache Pig. Ultimately Apache Pig reduces the development time by
almost a factor of 16.
• Pig Latin is SQL-like language and it is easy to learn Apache Pig when you
are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like
joins, filters, ordering, etc. In addition, it also provides nested data types
like tuples, bags, and maps that are missing from MapReduce.
Features of Pig
Apache Pig comes with the following features −
• Rich set of operators − It provides many operators to perform operations
like join, sort, filter, etc.
• Ease of programming − Pig Latin is similar to SQL and it is easy to write a
Pig script if you are good at SQL.
• Optimization opportunities − The tasks in Apache Pig optimize their
execution automatically, so the programmers need to focus only on
semantics of the language.
• Extensibility − Using the existing operators, users can develop their own
functions to read, process, and write data.
• UDF’s − Pig provides the facility to create User-defined Functions in other
programming languages such as Java and invoke or embed them in Pig
Scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.
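
As an illustrative sketch of the UDF facility mentioned above (the class name and logic are hypothetical, not part of the original text), a user-defined function in Java extends Pig's EvalFunc class; once the containing jar is registered in a Pig script, the function can be invoked like a built-in operator.

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A simple Pig UDF that upper-cases a chararray field.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;    // Pig treats a null return as a null field value
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

In a Pig script, such a function would typically be made available with a REGISTER statement pointing at the jar and then called on a field, for example UpperCase(name).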
Apache Pig Vs MapReduce
Listed below are the major differences between Apache Pig and MapReduce.

• Apache Pig is a data flow language, whereas MapReduce is a data processing
paradigm.
• Apache Pig is a high-level language, whereas MapReduce is low level and rigid.
• Performing a join operation in Apache Pig is pretty simple, whereas it is quite
difficult in MapReduce to perform a join operation between datasets.
• Any novice programmer with a basic knowledge of SQL can work conveniently
with Apache Pig, whereas exposure to Java is a must to work with MapReduce.
• Apache Pig uses a multi-query approach, thereby reducing the length of the
code to a great extent, whereas MapReduce requires almost 20 times more
lines of code to perform the same task.
• With Apache Pig there is no need for compilation; on execution, every Apache
Pig operator is converted internally into a MapReduce job, whereas MapReduce
jobs have a long compilation process.

Apache Pig Vs SQL


Listed below are the major differences between Apache Pig and SQL.

• Pig Latin is a procedural language, whereas SQL is a declarative language.
• In Apache Pig, the schema is optional; we can store data without designing a
schema (values are addressed positionally as $0, $1, etc.), whereas a schema is
mandatory in SQL.
• The data model in Apache Pig is nested relational, whereas the data model
used in SQL is flat relational.
• Apache Pig provides limited opportunity for query optimization, whereas there
is more opportunity for query optimization in SQL.

In addition to the above differences, Apache Pig Latin −

• Allows splits in the pipeline.
• Allows developers to store data anywhere in the pipeline.
• Declares execution plans.
• Provides operators to perform ETL (Extract, Transform, and Load)
functions.
Apache Pig Vs Hive
Both Apache Pig and Hive are used to create MapReduce jobs. And in some
cases, Hive operates on HDFS in a similar way Apache Pig does. In the following
table, we have listed a few significant points that set Apache Pig apart from Hive.

• Apache Pig uses a language called Pig Latin, which was originally created at
Yahoo; Hive uses a language called HiveQL, which was originally created at
Facebook.
• Pig Latin is a data flow language; HiveQL is a query processing language.
• Pig Latin is a procedural language and fits in the pipeline paradigm; HiveQL is
a declarative language.
• Apache Pig can handle structured, unstructured, and semi-structured data;
Hive is mostly for structured data.

Applications of Apache Pig

Apache Pig is generally used by data scientists for performing tasks involving ad-
hoc processing and quick prototyping. Apache Pig is used −
• To process huge data sources such as web logs.
• To perform data processing for search platforms.
• To process time-sensitive data loads.
Apache Pig – History
In 2006, Apache Pig was developed as a research project at Yahoo, especially to
create and execute MapReduce jobs on every dataset. In 2007, Apache Pig was
open sourced via Apache incubator. In 2008, the first release of Apache Pig came
out. In 2010, Apache Pig graduated as an Apache top-level project.
Apache Pig - Architecture
The language used to analyze data in Hadoop using Pig is known as Pig Latin. It
is a high-level data processing language which provides a rich set of data types and
operators to perform various operations on the data.
To perform a particular task using Pig, programmers need to write a
Pig script in the Pig Latin language and execute it using any of the
execution mechanisms (Grunt shell, UDFs, Embedded). After execution, these
scripts go through a series of transformations applied by the Pig Framework
to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and
thus, it makes the programmer’s job easy. The architecture of Apache Pig is
shown below.

Apache Pig Components


As shown in the figure, there are various components in the Apache Pig
framework. Let us take a look at the major components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the
script, does type checking, and other miscellaneous checks. The output of the
parser will be a DAG (directed acyclic graph), which represents the Pig Latin
statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the
data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out the
logical optimizations such as projection and pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce
jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order, and
these MapReduce jobs are executed on Hadoop, producing the desired results.
Pig Latin Data Model
The data model of Pig Latin is fully nested and it allows complex non-atomic
datatypes such as map and tuple. Given below is the diagrammatical
representation of Pig Latin’s data model.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as
an Atom. It is stored as a string and can be used as a string and a number. int, long,
float, double, chararray, and bytearray are the atomic values of Pig. A piece of
data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple, the fields
can be of any type. A tuple is similar to a row in a table of RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-
unique) is known as a bag. Each tuple can have any number of fields (flexible
schema). A bag is represented by ‘{}’. It is similar to a table in RDBMS, but
unlike a table in RDBMS, it is not necessary that every tuple contain the same
number of fields or that the fields in the same position (column) have the same
type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as inner bag.
Example − {Raja, 30, {9848022338, [email protected],}}
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type
chararray and should be unique. The value might be of any type. It is represented
by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no
guarantee that tuples are processed in any particular order).
Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
It resides on top of Hadoop to summarize Big Data, and makes querying and
analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation
took it up and developed it further as an open source under the name Apache
Hive. It is used by different companies. For example, Amazon uses it in Amazon
Elastic MapReduce.
Hive is not

• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Features of Hive

• It stores schema in a database and processed data into HDFS.
• It is designed for OLAP.
• It provides an SQL-type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
Architecture of Hive
The following component diagram depicts the architecture of Hive:
This component diagram contains different units. The following table describes
each unit:

Unit Name Operation

User Interface Hive is a data warehouse infrastructure software that can create
interaction between user and HDFS. The user interfaces that Hive
supports are Hive Web UI, Hive command line, and Hive HD
Insight (In Windows server).

Meta Store Hive chooses respective database servers to store the schema or
Metadata of tables, databases, columns in a table, their data types,
and HDFS mapping.

HiveQL Process Engine HiveQL is similar to SQL for querying schema information in the
Metastore. It is one of the replacements for the traditional approach of
writing a MapReduce program. Instead of writing a MapReduce program in
Java, we can write a query for the MapReduce job and process it.

Execution Engine The conjunction part of the HiveQL Process Engine and MapReduce is the
Hive Execution Engine. The execution engine processes the query and
generates results in the same way MapReduce does. It uses the flavor
of MapReduce.

HDFS or HBASE Hadoop distributed file system or HBASE are the data storage
techniques to store data into file system.

Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:

Step Operation
No.

1 Execute Query

The Hive interface such as Command Line or Web UI sends query to Driver (any
database driver such as JDBC, ODBC, etc.) to execute.

2 Get Plan

The driver takes the help of the query compiler, which parses the query to check
the syntax and the query plan, i.e., the requirements of the query.

3 Get Metadata

The compiler sends metadata request to Metastore (any database).

4 Send Metadata

Metastore sends metadata as a response to the compiler.


5 Send Plan

The compiler checks the requirement and resends the plan to the driver. Up to here,
the parsing and compiling of a query is complete.

6 Execute Plan

The driver sends the execute plan to the execution engine.

7 Execute Job

Internally, the execution of the job is a MapReduce job. The execution engine
sends the job to the JobTracker, which is in the Name node, and it assigns this job
to the TaskTracker, which is in the Data node. Here, the query executes the
MapReduce job.

7.1 Metadata Ops

Meanwhile, during execution, the execution engine can execute metadata operations
with the Metastore.

8 Fetch Result

The execution engine receives the results from Data nodes.

9 Send Results

The execution engine sends those resultant values to the driver.

10 Send Results

The driver sends the results to Hive Interfaces.
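
To make the "Execute Query" step concrete, here is an illustrative sketch (not from the original text) of a client submitting HiveQL through the HiveServer2 JDBC driver; the connection URL, credentials, and table are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint and database
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();

        // DDL: the schema goes to the Metastore, the data files live in HDFS
        stmt.execute("CREATE TABLE IF NOT EXISTS employees (id INT, name STRING, salary DOUBLE) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

        // HiveQL query: the driver, compiler, and execution engine turn this into jobs on the cluster
        ResultSet rs = stmt.executeQuery("SELECT name, salary FROM employees WHERE salary > 50000");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }

        rs.close();
        stmt.close();
        con.close();
    }
}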

5. Hadoop streaming
Hadoop streaming is a utility that comes with the Hadoop distribution. This utility
allows you to create and run Map/Reduce jobs with any executable or script as the
mapper and/or the reducer. The utility will create a Map/Reduce job, submit the
job to an appropriate cluster, and monitor the progress of the job until it
completes.
When a script is specified for mappers, each mapper task launches the script as
a separate process when the mapper is initialized. As the mapper task runs, it
converts its inputs into lines and feeds the lines to the standard input (STDIN) of
the process. In the meantime, the mapper collects the line-oriented outputs from
the standard output (STDOUT) of the process and converts each line into a
key/value pair, which is collected as the output of the mapper. By default, the
prefix of a line up to the first tab character is the key and the rest of the line
(excluding the tab character) is the value. If there is no tab character in the
line, then the entire line is considered the key and the value is null. However,
this can be customized as per one's needs.
When a script is specified for reducers, each reducer task launches the script as
a separate process when the reducer is initialized. As the reducer task runs, it
converts its input key/value pairs into lines and feeds the lines to the standard
input (STDIN) of the process. In the meantime, the reducer collects the line-
oriented outputs from the standard output (STDOUT) of the process and converts
each line into a key/value pair, which is collected as the output of the reducer. By
default, the prefix of a line up to the first tab character is the key and the rest of
the line (excluding the tab character) is the value. However, this can be
customized as per specific requirements.
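
The line-oriented protocol described above (raw lines on STDIN, tab-separated key/value pairs on STDOUT) can be implemented in any language. As an illustrative sketch (the class name is hypothetical), here is a word-count mapper written as a plain Java program that could be packaged and passed to the streaming utility through its -mapper option:

import java.io.BufferedReader;
import java.io.InputStreamReader;

// A streaming mapper: reads raw input lines from STDIN and writes
// "key<TAB>value" lines to STDOUT, as the streaming utility expects.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // Everything before the first tab is the key; the rest is the value.
                    System.out.println(word + "\t" + 1);
                }
            }
        }
    }
}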
