Unit 3
History of Hadoop, the Hadoop Distributed File System, Components of Hadoop, Analyzing the Data with
Hadoop, Scaling Out, Hadoop Streaming, Design of HDFS, Java interfaces to HDFS Basics, Developing a
Map Reduce Application, How Map Reduce Works, Anatomy of a Map Reduce Job run, Failures, Job
Scheduling, Shuffle and Sort, Task execution, Map Reduce Types and Formats, Map Reduce Features
Hadoop environment.
---------------------------------------------------------------------------------------------------------------------
HISTORY OF HADOOP
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search
library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the
Lucene project.
Nutch was started in 2002, and a working crawler and search system quickly emerged. However,
its creators realized that their architecture wouldn’t scale to the billions of pages on the Web.
In 2004, Nutch's developers set about writing an open source implementation, the Nutch
Distributed Filesystem (NDFS), modeled on the Google File System (GFS) described in a paper published in 2003.
In 2004, Google published the paper that introduced MapReduce to the world. Early in 2005, the
Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all
the major Nutch algorithms had been ported to run using MapReduce and NDFS.
In February 2006, the project moved out of Nutch to form an independent subproject of Lucene called
Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and
the resources to turn Hadoop into a system that ran at web scale. This was demonstrated in February
2008, when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.
In January 2008, Hadoop was made its own top-level project at Apache, confirming its success
and its diverse, active community. By this time, Hadoop was being used by many other companies
besides Yahoo!, such as Last.fm, Facebook, and the New York Times.
In one well-publicized feat, the New York Times used Amazon's EC2 compute cloud to crunch
through 4 terabytes of scanned archives from the paper, converting them to PDFs for the Web. The
processing took less than 24 hours to run using 100 machines.
In April 2008, Hadoop broke a world record to become the fastest system to sort an entire
terabyte of data. Running on a 910-node cluster, Hadoop sorted 1 terabyte in 209 seconds (just under
3.5 minutes), beating the previous year’s winner of 297 seconds.
In November of the same year, Google reported that its MapReduce implementation sorted 1
terabyte in 68 seconds. Then, in April 2009, it was announced that a team at Yahoo! had used Hadoop to
sort 1 terabyte in 62 seconds.
The trend since then has been to sort even larger volumes of data at ever faster rates.
Today, Hadoop is widely used in mainstream enterprises. Hadoop’s role as a general purpose
storage and analysis platform for big data has been recognized by the industry, and this fact is reflected
in the number of products that use or incorporate Hadoop in some way. Commercial Hadoop support is
available from large, established enterprise vendors, including EMC, IBM, Microsoft, and Oracle, as well
as from specialist Hadoop companies such as Cloudera, Hortonworks, and MapR.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed
Filesystem (sometimes informally referred to as DFS). HDFS is Hadoop's flagship filesystem, but Hadoop also provides a general-purpose filesystem abstraction
that integrates with other storage systems (such as the local filesystem and Amazon S3).
COMPONENTS OF HADOOP
Hadoop is a framework that uses distributed storage and parallel processing to store and manage
big data. It is the software most used by data analysts to handle big data, and its market size continues to
grow.
1. Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit.
2. Hadoop MapReduce - MapReduce is the processing unit.
3. Hadoop YARN - Yet Another Resource Negotiator (YARN) is the resource management unit.
I. HADOOP HDFS:
The Hadoop Distributed File System (HDFS) is Hadoop's storage layer. HDFS is hosted on multiple
servers; data is divided into blocks based on file size, and these blocks are then distributed and
stored across the slave machines.
NameNode
The NameNode is the master server. In a non-high-availability cluster, there can be only one
NameNode. In a high-availability cluster, there are two NameNodes, and in that case no secondary
NameNode is needed.
The NameNode holds metadata about the DataNodes: the blocks they store, their locations, the size of
each block, and so on. It also carries out filesystem operations, such as opening, closing, and renaming files
and directories.
Secondary NameNode
The secondary NameNode maintains a copy of the filesystem metadata on disk by periodically
merging the namespace image with the edit log. If the NameNode fails, this checkpoint can be used to rebuild a new NameNode.
In a high-availability cluster, there are two NameNodes: active and standby. The standby
NameNode performs a role similar to that of the secondary NameNode.
Hadoop Cluster
In a rack-aware cluster, nodes are placed in racks and each rack has its own rack
switch. Rack switches are connected to a core switch, which ensures that a switch failure will not render a
rack unavailable.
HDFS read and write mechanisms are parallel activities. To read or write a file in HDFS, a client
must interact with the namenode. The namenode checks the client's privileges and grants
permission to read or write the data blocks.
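The Java interface to HDFS is the org.apache.hadoop.fs.FileSystem class. The sketch below shows one way a client might read a file; it assumes a reachable cluster, and the namenode address and file path are placeholders rather than values from this unit. Behind the scenes, the FileSystem object asks the namenode for block locations and then streams the data from the datanodes.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal sketch: copy an HDFS file to standard output.
// The namenode address and file path are placeholders.
public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode-host:9000/user/hadoop/sample.txt";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf); // client-side handle; talks to the namenode
    InputStream in = null;
    try {
      in = fs.open(new Path(uri)); // namenode supplies block locations; data comes from datanodes
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}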
Data Node
Datanodes store and maintain the blocks. While there is only one namenode, there can be
multiple datanodes. Datanodes serve read and write requests and create, delete, and replicate blocks
when instructed by the namenode.
The datanodes read, write, process, and replicate the data. They also send periodic heartbeat signals to the
namenode (every 3 seconds by default) to report their status.
II. HADOOP YARN:
Hadoop YARN (Yet Another Resource Negotiator) is the cluster resource management layer of
Hadoop and is responsible for resource allocation and job scheduling. YARN is the middle layer between
HDFS and MapReduce in the Hadoop architecture.
Resource Manager
Resource Manager manages the resource allocation in the cluster and tracks how many resources
are available in the cluster and each node manager’s contribution.
a) Scheduler: allocates resources to the various running applications and schedules them based on the
applications' resource requirements; it does not monitor or track the status of the applications.
b) Application Manager: accepts job submissions from clients and restarts application masters in
case of failure.
Application Master
Application Master manages the resource needs of individual applications and interacts with the
scheduler to acquire the required resources. It connects with the node manager to execute and monitor
tasks.
Node Manager
The Node Manager tracks running jobs and sends signals (heartbeats) to the resource manager to
report the status of its node. It also monitors each container's resource utilization.
Container
A container is a bundle of resources such as RAM, CPU, and network bandwidth. The container
grants an application the right to use a specific amount of those resources.
III. MAPREDUCE:
Map Phase
In the map phase, data stored in HDFS blocks is read, processed, and turned into key-value pairs.
A map task is responsible for processing one or more splits of the input.
Reduce Phase
The reduce phase receives the key-value pairs from the map phase, aggregates them into smaller
sets, and produces the output. Shuffling and sorting occur on the reduce side before the reduce
function is applied.
The mapper handles the input data, running the map function on each input split (these runs are
known as map tasks). There can be one or more map tasks, depending on the size of the file. The data is then sorted,
shuffled, and moved to the reduce phase, where a reduce function aggregates the data and produces the
output.
The input data is stored in the HDFS and read using an input format.
The file is split into multiple chunks based on the size of the file and the input format.
The record reader reads the data from the input splits and forwards this information to the mapper.
The mapper breaks the records in every chunk into a list of data elements (or key-value pairs).
The combiner works on the intermediate data created by the map tasks and acts as a mini reducer to
reduce the data.
The partitioner decides which reduce task each key-value pair is sent to; the number of reduce tasks is a setting on the job (a sketch of wiring a combiner and partitioner into a job follows this list).
The data is then sorted and shuffled based on their key-value pairs and sent to the reduce function.
The output data is then written to HDFS in the output format specified for the job.
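To make the combiner and partitioner steps above concrete, here is a small, hypothetical sketch using the Job API. MaxTemperatureReducer refers to the reducer written later in this unit; YearPartitioner is an illustrative class that mimics Hadoop's default hash partitioning and is not something defined in the original material.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: chooses a reduce task for each (year, temperature) pair
// by hashing the key, which is what Hadoop's default HashPartitioner also does.
public class YearPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text year, IntWritable temperature, int numReduceTasks) {
    return (year.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

// In the driver, these pieces are attached to the job:
//   job.setCombinerClass(MaxTemperatureReducer.class); // mini-reduce run on map output
//   job.setPartitionerClass(YearPartitioner.class);    // which reduce task gets each key
//   job.setNumReduceTasks(2);                          // number of reduce tasks is a job setting

Using the reducer as a combiner works for this example because taking a maximum is both associative and commutative.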
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.
Each phase has key-value pairs as input and output, the types of which may be chosen by the
programmer.
The programmer also specifies two functions: the map function and the reduce function.
The input to our map phase is the raw NCDC (National Climatic Data Center) weather data. We choose a text
input format that gives us each line in the dataset as a text value. The key is the offset of the
beginning of the line from the beginning of the file.
Our map function is simple: it pulls out the year and the air temperature, because these are the
only fields we are interested in. The reduce function can then do its work on this data: finding the maximum
temperature for each year.
To visualize the way the map works, consider the following sample lines of input data:
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the following key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
The map function extracts the year and the air temperature from each line and emits
them as its output:
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before being sent to
the reduce function.
This processing sorts and groups the key-value pairs by key, so the reduce function sees the
following input:
(1949, [111, 78])
(1950, [0, 22, −11])
All the reduce function has to do now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year.
JAVA MAP REDUCE
Expressing the MaxTemperature example in Java requires three things:
1. a map function
2. a reduce function
3. some code to run the job (the driver)
I. The map function is represented by the Mapper class, which declares an abstract map() method.
The following code shows the implementation of our map function:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int MISSING = 9999;
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt does not accept a leading plus sign
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}
CODE EXPLANATION
The Mapper class is a generic type, with four formal type parameters that specify the input key, input
value, output key, and output value types of the map function.
Rather than using built-in Java types, Hadoop provides its own set of basic types that are
optimized for network serialization. These are found in the org.apache.hadoop.io package.
The map() method also provides an instance of Context to write the output to. In this case, we write
the year as a Text object and the temperature wrapped in an IntWritable.
We write an output record only if the temperature is present and the quality code indicates the
temperature reading is OK.
II. The reduce function is similarly defined using the Reducer class:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE; // track the highest temperature seen for this year
    for (IntWritable value : values) { maxValue = Math.max(maxValue, value.get()); }
    context.write(key, new IntWritable(maxValue));
  }
}
CODE EXPLANATION
Again, four formal type parameters are used to specify the input and output types, this time for
the reduce function. The input types of the reduce function must match the output types of the map
function: Text and IntWritable. And in this case, the output types of the reduce function are Text and
IntWritable, for a year and its maximum temperature, which we find by iterating through the
temperatures and comparing each with a record of the highest found so far.
III. The third piece of code runs the MapReduce job:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
CODE EXPLANATION
1. A Job object forms the specification of the job and gives you control over how the job is run.
2. When we run this job on a Hadoop cluster, we will package the code into a JAR file (which Hadoop
will distribute around the cluster). For this purpose, we can pass a class in the Job ’s setJarByClass()
method, which Hadoop will use to locate the relevant JAR file by looking for the JAR file containing
this class.
3. An input path is specified by calling the static addInputPath() method on FileInputFormat, and it can be a
single file, a directory, or a file pattern.
4. The output path is specified by the static setOutputPath() method on FileOutputFormat. It is the directory
into which the output files from the reduce function are written; the directory should not exist before the
job runs, or Hadoop will refuse to run it.
5. Next, we specify the map and reduce types to use via the setMapperClass() and setReducerClass()
methods.
6. The setOutputKeyClass() and setOutputValueClass() methods control the output types for the
reduce function
7. Now we are ready to run the job. The waitForCompletion() method on Job submits the job and waits
for it to finish.
8. The return value of the waitForCompletion() method is a Boolean indicating success ( true ) or failure
( false ), which we translate into the program’s exit code of 0 or 1.
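Once these three classes are compiled and packaged into a JAR, the job can be launched from the command line with the hadoop jar command. The JAR name and the input/output paths below are placeholders for illustration:

hadoop jar max-temperature.jar MaxTemperature input/ncdc/sample.txt output

Hadoop uses the class passed to setJarByClass() to find the JAR containing the job classes and ships it to the cluster; the results appear as files in the output directory, which must not already exist.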
SCALING OUT
HADOOP STREAMING
Hadoop provides an API that allows you to write your map and reduce functions in languages other
than Java.
Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program,
so you can use any language that can read standard input and write to standard output to write your
MapReduce program.
Map input data is passed over standard input to your map function, which processes it line by
line and writes lines to standard output. A map output key-value pair is written as a single tab-delimited line.
Input to the reduce function is in the same format—a tab-separated key-value pair—passed over
standard input.
The reduce function reads lines from standard input, which the framework guarantees are sorted by
key, and writes its results to standard output.
A MapReduce program for finding maximum temperatures by year, written in Ruby for Hadoop Streaming:
#!/usr/bin/env ruby
STDIN.each_line do |line|
  val = line
  year, temp, q = val[15, 4], val[87, 5], val[92, 1]
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end
CODE EXPLANATION: This program iterates over lines from standard input by executing a block for each
line from STDIN (a global constant of type IO ). The block pulls out the relevant fields from each input
line and, if the temperature is valid, writes the year and the temperature separated by a tab character, \t
, to standard output (using puts).
#!/usr/bin/env ruby
last_key, max_val = nil, -1000000
STDIN.each_line do |line|
  key, val = line.split("\t")
  if last_key && last_key != key
    puts "#{last_key}\t#{max_val}"
    last_key, max_val = key, val.to_i
  else
    last_key, max_val = key, [max_val, val.to_i].max
  end
end
puts "#{last_key}\t#{max_val}" if last_key
CODE EXPLANATION
Again, the program iterates over lines from standard input, but this time we have to store some
state as we process each key group. In this case, the keys are the years, and we store the last key seen
and the maximum temperature seen so far for that key. The MapReduce framework ensures that the
keys are ordered, so we know that if a key is different from the previous one, we have moved into a new
key group.
For each line, we pull out the key and value. Then, if we’ve just finished a group ( last_key &&
last_key != key ), we write the key and the maximum temperature for that group, separated by a tab
character, before resetting the maximum temperature for the new key. If we haven’t just finished a
group, we just update the maximum temperature for the current key.
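To run these two Ruby scripts as an actual MapReduce job, Hadoop's Streaming JAR is used. The JAR location varies between Hadoop releases, and the paths and script names below are placeholders, so treat this as a sketch of the command rather than an exact invocation:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files max_temperature_map.rb,max_temperature_reduce.rb \
  -input input/ncdc/sample.txt \
  -output output \
  -mapper max_temperature_map.rb \
  -reducer max_temperature_reduce.rb

The -files option ships the scripts to the cluster so that each task can execute them locally.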
DESIGN OF HDFS
HDFS is a filesystem designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware. Let’s examine this statement in more detail:
Very large files
"Very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in
size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once,
read-many-times pattern. A dataset is typically generated or copied from source, and then various
analyses are performed on that dataset over time. Each analysis will involve a large proportion of the
dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
Commodity hardware
Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of
commodity hardware (commonly available hardware) for which the chance of node failure across the
cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable
interruption to the user in the case of such failure.
There are applications for which HDFS does not work so well. Below are the areas where
HDFS is not a good fit today:
Low-latency data access
Applications that require low-latency access to data, in the tens-of-milliseconds range, will not
work well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this may be
at the expense of latency. HBase is currently a better choice for low-latency access.
Lots of small files
Because the namenode holds the filesystem metadata in memory, the number of files that can be stored in the
filesystem is limited by the amount of memory on the namenode. As a rule of thumb, each file, directory,
and block takes about 150 bytes. So, for example, one million files, each occupying one block, amounts to
two million objects (one million file entries plus one million blocks), which need at least
2,000,000 x 150 bytes = 300 MB of memory. Although storing millions of files is feasible, billions is beyond
the capability of current hardware.
Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the end of the file, in
append-only fashion. There is no support for multiple writers or for modifications at arbitrary offsets in
the file.
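As a small illustration of this single-writer, append-only model, the sketch below creates a file and then re-opens it for append through the FileSystem API. It assumes a cluster where append is enabled, and the namenode address and path are placeholders.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of HDFS's write-once, append-only model; the URI and path are placeholders.
public class SingleWriterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000/"), conf);
    Path path = new Path("/user/hadoop/events.log");

    FSDataOutputStream out = fs.create(path); // a single writer holds the lease on the file
    out.writeBytes("first record\n");         // bytes can only be added at the end of the file
    out.close();

    FSDataOutputStream more = fs.append(path); // re-open at the end; no updates at arbitrary offsets
    more.writeBytes("second record\n");
    more.close();
  }
}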