Unit 3

BIG DATA ANALYTICS

INTRODUCTION TO HADOOP

History of Hadoop, the Hadoop Distributed File System, Components of Hadoop, Analyzing the Data with
Hadoop, Scaling Out, Hadoop Streaming, Design of HDFS, Java interfaces to HDFS Basics, Developing a
Map Reduce Application, How Map Reduce Works, Anatomy of a Map Reduce Job run, Failures, Job
Scheduling, Shuffle and Sort, Task execution, Map Reduce Types and Formats, Map Reduce Features,
Hadoop environment.

---------------------------------------------------------------------------------------------------------------------

HISTORY OF HADOOP
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search
library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the
Lucene project.

Nutch was started in 2002, and a working crawler and search system quickly emerged. However,
its creators realized that their architecture wouldn’t scale to the billions of pages on the Web.

In 2004, Nutch’s developers set about writing an open source implementation of Google’s
distributed filesystem, the Nutch Distributed Filesystem (NDFS), modeled on the Google File System
(GFS) paper published in 2003.

In 2004, Google published the paper that introduced MapReduce to the world. Early in 2005, the
Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all
the major Nutch algorithms had been ported to run using MapReduce and NDFS.

In February 2006, NDFS and the MapReduce implementation moved out of Nutch to form an
independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!,
which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale.
This was demonstrated in February 2008, when Yahoo! announced that its production search index was
being generated by a 10,000-core Hadoop cluster.

In January 2008, Hadoop was made its own top-level project at Apache, confirming its success
and its diverse, active community. By this time, Hadoop was being used by many other companies
besides Yahoo!, such as Last.fm, Facebook, and the New York Times.

In one well-publicized feat, the New York Times used Amazon’s EC2 compute cloud to crunch
through 4 terabytes of scanned archives from the paper, converting them to PDFs for the Web. The
processing took less than 24 hours to run using 100 machines.

In April 2008, Hadoop broke a world record to become the fastest system to sort an entire
terabyte of data. Running on a 910-node cluster, Hadoop sorted 1 terabyte in 209 seconds (just under
3.5 minutes), beating the previous year’s winner of 297 seconds.

In November of the same year, Google reported that its MapReduce implementation sorted 1
terabyte in 68 seconds. Then, in April 2009, it was announced that a team at Yahoo! had used Hadoop to
sort 1 terabyte in 62 seconds.

The trend since then has been to sort even larger volumes of data at ever faster rates.

Today, Hadoop is widely used in mainstream enterprises. Hadoop’s role as a general purpose
storage and analysis platform for big data has been recognized by the industry, and this fact is reflected
in the number of products that use or incorporate Hadoop in some way. Commercial Hadoop support is
available from large, established enterprise vendors, including EMC, IBM, Microsoft, and Oracle, as well
as from specialist Hadoop companies such as Cloudera, Hortonworks, and MapR.

THE HADOOP DISTRIBUTED FILE SYSTEM


When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary
to partition it across a number of separate machines. Filesystems that manage the storage across a
network of machines are called distributed filesystems. Since they are network based, all the
complications of network programming have to be dealt with, making distributed filesystems more
complex than regular disk filesystems. For example, one of the biggest challenges is making the
filesystem tolerate node failure without suffering data loss.

Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed
Filesystem (informally referred to as DFS). HDFS is Hadoop’s flagship filesystem, but Hadoop actually has
a general-purpose filesystem abstraction that also integrates with other storage systems (such as the
local filesystem and Amazon S3).

COMPONENTS OF HADOOP
Hadoop is a framework that uses distributed storage and parallel processing to store and manage
big data. It is the software most used by data analysts to handle big data, and its market size continues to
grow.

There are three components of Hadoop:

1. Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit.

2. Hadoop MapReduce - Hadoop MapReduce is the processing unit.

3. Hadoop YARN - Yet Another Resource Negotiator (YARN) is a resource management unit.

I. HADOOP HDFS:

The Hadoop Distributed File System (HDFS) is Hadoop’s storage layer. HDFS is hosted on multiple
servers; data is divided into blocks based on file size, and these blocks are then distributed and stored
across the slave machines.

There are three components of the Hadoop Distributed File System:

1. NameNode (a.k.a. master node): holds the filesystem metadata in RAM and on disk

2. Secondary NameNode: holds a copy of the NameNode’s metadata on disk

3. Slave node (DataNode): holds the actual data in the form of blocks

NameNode

NameNode is the master server. In a non-high-availability cluster, there can be only one
NameNode. In a high-availability cluster, there are two NameNodes, in which case no secondary
NameNode is needed.

The NameNode holds metadata about the various DataNodes: their locations, the size of each
block, and so on. It also executes filesystem namespace operations, such as opening, closing, and
renaming files and directories.

Secondary NameNode

The secondary NameNode server is responsible for maintaining a copy of the NameNode metadata
on disk. Its main purpose is to provide a checkpoint from which a new NameNode can be created in case
of failure.

In a high-availability cluster, there are two NameNodes: active and standby. The secondary
NameNode performs a function similar to that of the standby NameNode.

Hadoop Cluster - Rack Based Architecture

In a rack-aware Hadoop cluster, nodes are placed in racks, and each rack has its own rack
switch. Rack switches are connected to a core switch, which ensures that a single switch failure will not
render a rack unavailable.

HDFS Read and Write Mechanism

HDFS read and write mechanisms are parallel activities: a file’s blocks can be read from or written
to many datanodes at once. To read or write a file in HDFS, a client must first interact with the
namenode. The namenode checks the client’s privileges and grants permission to read or write the data
blocks.
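
The namenode/datanode interaction described above is hidden behind Hadoop’s Java FileSystem
API. The following is a minimal, hedged sketch of a client writing and then reading a small file; it
assumes a reachable HDFS configured via core-site.xml, and the path it uses is only a placeholder.

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch of an HDFS client; the FileSystem API hides the namenode/datanode interaction.
public class HdfsReadWriteSketch {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // the client contacts the namenode here

    Path path = new Path("/user/demo/sample.txt"); // placeholder path

    // Write: the namenode assigns datanodes and the client streams block data to them.
    try (OutputStream out = fs.create(path)) {
      out.write("hello hdfs\n".getBytes("UTF-8"));
    }

    // Read: the namenode returns block locations and the client reads from the datanodes.
    try (InputStream in = fs.open(path)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}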

Data Node

Datanodes store and maintain the blocks. While there is only one namenode, there can be
multiple datanodes. Datanodes are responsible for serving data blocks when requested by clients or the
namenode.

The data nodes read, write, process, and replicate the data. They also send heartbeat signals to
the namenode at short, regular intervals (every 3 seconds by default). These signals report the status of
the data node.

II. HADOOP YARN:

Hadoop YARN (Yet Another Resource Negotiator) is the cluster resource management layer of
Hadoop and is responsible for resource allocation and job scheduling. YARN is the middle layer between
HDFS and MapReduce in the Hadoop architecture.

The elements of YARN include:

1. Resource Manager (one per cluster)

2. Application Master (one per application)

3. Node Managers (one per node)

Resource Manager

The Resource Manager manages resource allocation in the cluster: it tracks how many resources
are available in the cluster and each node manager’s contribution.

It has two main components:

a) Scheduler: allocates resources to the various running applications, scheduling purely based on the
requirements of each application; it does not monitor or track the status of the applications

b) Application Manager: accepts job submissions from clients and restarts application masters in
case of failure

Application Master

The Application Master manages the resource needs of an individual application and interacts with
the scheduler to acquire the required resources. It works with the node managers to execute and
monitor tasks.

Node Manager

The Node Manager tracks the containers running on its node and sends heartbeat signals to the
resource manager to report the node’s status. It also monitors each container’s resource utilization.

Container

A container is a bundle of resources on a single node, such as RAM, CPU, and network bandwidth.
The container grants an application the right to use a specific amount of those resources.
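
As a small illustration, the YARN API describes a container’s resource allotment with a Resource
record covering memory and virtual cores (network bandwidth is not part of this record). The sketch
below is hedged: the numbers are arbitrary, and it only shows how the allotment is described, not how
a container is actually requested from the resource manager.

import org.apache.hadoop.yarn.api.records.Resource;

// Sketch only: how a container's resource allotment is described in the YARN API.
public class ContainerResourceSketch {

  public static void main(String[] args) {
    Resource allotment = Resource.newInstance(1024, 2); // 1024 MB of memory, 2 virtual cores
    System.out.println(allotment);                      // prints the memory/vcore allotment
    System.out.println("vcores: " + allotment.getVirtualCores());
  }
}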

The interaction between these YARN components is summarized in the YARN architecture diagram (not
reproduced here).

III. MAPREDUCE:

MapReduce is a framework for distributed, parallel processing of large volumes of data.
MapReduce programs can be written in a number of programming languages.

It has two main phases: Map Phase and Reduce Phase.

Map Phase

In the map phase, the input data (stored as blocks) is read, processed, and turned into key-value
pairs. A map task is responsible for running the map function on one or more splits of the input.

Reduce Phase

The reduce phase receives the key-value pairs from the map phase, aggregates them into smaller
sets, and produces the output. Processes such as shuffling and sorting take place before the reduce
function is applied.

The mapper function handles the input data and runs a function on every input split (known as
map tasks). There can be one or multiple map tasks based on the size of the file. Data is then sorted,
shuffled, and moved to the reduce phase, where a reduce function aggregates the data and provides the
output.

MapReduce Job Execution

 The input data is stored in the HDFS and read using an input format.

 The file is split into multiple chunks based on the size of the file and the input format.

 The default chunk size is 128 MB but can be customized.

 The record reader reads the data from the input splits and forwards this information to the mapper.

 The mapper breaks the records in every chunk into a list of data elements (or key-value pairs).

 The combiner works on the intermediate data created by the map tasks and acts as a mini reducer to
reduce the data.

 The partitioner decides which reduce task each intermediate key-value pair is sent to; the number of
reduce tasks is configured for the job.

 The data is then sorted and shuffled by key and sent to the reduce function.

 The output data is then written to HDFS using the output format configured for the job (a job-
configuration sketch showing how these pieces are wired together follows this list).
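
The following is a hedged configuration sketch showing how the pieces named above (input format,
mapper, combiner, partitioner, number of reduce tasks, reducer, and output format) are wired into a
single job. It reuses the MaxTemperatureMapper and MaxTemperatureReducer classes developed later
in this unit (taking a maximum is safe to apply as a combiner); the input and output paths and the
reduce-task count are placeholders, not values prescribed by this unit.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

// Sketch of wiring the execution steps listed above into a single job.
public class JobWiringSketch {

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "job wiring sketch");
    job.setJarByClass(JobWiringSketch.class);

    job.setInputFormatClass(TextInputFormat.class);      // how input files are split and read
    FileInputFormat.addInputPath(job, new Path(args[0]));

    job.setMapperClass(MaxTemperatureMapper.class);      // one map task per input split
    job.setCombinerClass(MaxTemperatureReducer.class);   // optional "mini reducer" on map output
    job.setPartitionerClass(HashPartitioner.class);      // routes each key to a reduce task
    job.setNumReduceTasks(2);                            // reduce-task count is a job setting

    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setOutputFormatClass(TextOutputFormat.class);    // how results are written to HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}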

ANALYZING THE DATA WITH HADOOP


Hadoop is a framework that uses distributed storage and parallel processing to store and manage
big data. To take advantage of the parallel processing that Hadoop provides, a query is expressed as a
MapReduce job. After some local, small-scale testing, we will be able to run it on a cluster of machines.

Map and Reduce

 MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.

 Each phase has key-value pairs as input and output, the types of which may be chosen by the
programmer.

 The programmer also specifies two functions: the map function and the reduce function.

 The input to our map phase is the raw NCDC (National Climatic Data Center) data. We choose a text
input format that gives us each line in the dataset as a text value. The key is the offset of the
beginning of the line from the beginning of the file.

 Our map function is simple. We pull out the year and the air temperature, because these are the
only fields we are interested in. The reduce function can then do its work on them: finding the
maximum temperature for each year.

 To visualize the way the map works, consider the following sample lines of input data:

0067011990999991950051507004...9999999N9+00001+99999999999...

0043011990999991950051512004...9999999N9+00221+99999999999...

0043011990999991950051518004...9999999N9-00111+99999999999...

0043012650999991949032412004...0500001N9+01111+99999999999...

0043012650999991949032418004...0500001N9+00781+99999999999...

 These lines are presented to the map function as key-value pairs:

(0,0067011990999991950051507004...9999999N9+00001+99999999999...)

(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)

(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)

(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)

(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)

 The map function extracts the year and the air temperature from each line and emits them as its
output:

(1950, 0)

(1950, 22)

(1950, −11)

(1949, 111)

(1949, 78)

 The output from the map function is processed by the MapReduce framework before being sent to
the reduce function.

 This processing sorts and groups the key-value pairs by key. So, the reduce function sees the
following input:

(1949, [111, 78])

(1950, [0, 22, −11])

 All the reduce function has to do now is iterate through the list and pick up the maximum reading:

(1949, 111)

(1950, 22)

 This is the final output: the maximum global temperature recorded in each year.

 The whole data flow, from map through sort and shuffle to reduce, is illustrated in the MapReduce
logical data flow figure (not reproduced here).

JAVA MAP REDUCE

Having seen how MapReduce works conceptually, we can now express the same program in Java code.

Three things are needed:

1. a map function

2. a reduce function

3. and some code to run the job

I. The map function is represented by the Mapper class, which provides a map() method for us to
override. The following code shows the implementation of our map function.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    String year = line.substring(15, 19);

    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }

    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

CODE EXPLANATION

 The Mapper class is a generic type, with four formal type parameters that specify the input key, input
value, output key, and output value types of the map function.

 For the present example,

1. the input key is a long integer offset,

2. the input value is a line of text,

3. the output key is a year,

4. and the output value is an air temperature (an integer).

Rather than using built-in Java types, Hadoop provides its own set of basic types that are
optimized for network serialization. These are found in the org.apache.hadoop.io package (a small
sketch of these Writable types appears after this explanation).

 We use the substring() method to extract the columns we are interested in.

 The map() method also provides an instance of Context to write the output to. In this case, we write
the year as a Text object and wrap the temperature in an IntWritable.

 We write an output record only if the temperature is present and the quality code indicates the
temperature reading is OK.
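
As a small, hedged illustration of the Writable types mentioned above (it is not part of the
MapReduce job itself), the following standalone snippet shows how LongWritable, Text, and IntWritable
wrap ordinary Java values:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Standalone illustration of Hadoop's Writable wrapper types.
public class WritableTypesSketch {

  public static void main(String[] args) {
    LongWritable offset = new LongWritable(106L);   // wraps a Java long (e.g. a line offset)
    Text year = new Text("1950");                   // wraps a UTF-8 string
    IntWritable temp = new IntWritable(22);         // wraps a Java int

    // get() / toString() recover the underlying Java values.
    System.out.println(offset.get() + " " + year.toString() + " " + temp.get());

    // Writables are mutable, which lets Hadoop reuse objects between records.
    temp.set(-11);
    System.out.println(temp.get());
  }
}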

II. The reduce function is similarly defined using the Reducer class.

The following code shows the Reducer for the maximum temperature example.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}

CODE EXPLANATION

Again, four formal type parameters are used to specify the input and output types, this time for
the reduce function. The input types of the reduce function must match the output types of the map
function: Text and IntWritable. In this case, the output types of the reduce function are also Text and
IntWritable, for a year and its maximum temperature, which we find by iterating through the
temperatures and comparing each with a record of the highest found so far.

The third piece of code runs the MapReduce job

Application to find the maximum temperature in the weather dataset

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

CODE EXPLANATION

1. A Job object forms the specification of the job and gives you control over how the job is run.

2. When we run this job on a Hadoop cluster, we will package the code into a JAR file (which Hadoop
will distribute around the cluster). For this purpose, we can pass a class in the Job’s setJarByClass()
method, which Hadoop will use to locate the relevant JAR file by looking for the JAR containing this
class.

3. An input path is specified by calling the addInputPath() method on FileInputFormat, and it can be a
single file, a directory, or a file pattern.

4. The output path is specified by the static setOutputPath() method on FileOutputFormat.

5. Next, we specify the map and reduce types to use via the setMapperClass() and setReducerClass()
methods.

6. The setOutputKeyClass() and setOutputValueClass() methods control the output types for the
reduce function

7. Now we are ready to run the job. The waitForCompletion() method on Job submits the job and waits
for it to finish.

8. The return value of the waitForCompletion() method is a Boolean indicating success ( true ) or failure
( false ), which we translate into the program’s exit code of 0 or 1.

SCALING OUT

HADOOP STREAMING
 Hadoop provides an API that allows you to write your map and reduce functions in languages other
than Java.

 Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program,
so you can use any language that can read standard input and write to standard output to write your
MapReduce program.

 Map input data is passed over standard input to your map function, which processes it line by
line and writes lines to standard output.

 A map output key-value pair is written as a single tab-delimited line.

 Input to the reduce function is in the same format—a tab-separated key-value pair—passed over
standard input.

 The reduce function reads lines from standard input, which the framework guarantees are sorted by
key, and writes its results to standard output.

Map function for finding the maximum temperature by year, written in Ruby for Streaming:

#!/usr/bin/env ruby

STDIN.each_line do |line|
  val = line
  year, temp, q = val[15,4], val[87,5], val[92,1]
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end

CODE EXPLANATION: This program iterates over lines from standard input by executing a block for each
line from STDIN (a global constant of type IO ). The block pulls out the relevant fields from each input
line and, if the temperature is valid, writes the year and the temperature separated by a tab character, \t
, to standard output (using puts).

Reduce function for maximum temperature in Ruby

#!/usr/bin/env ruby

last_key, max_val = nil, -1000000

STDIN.each_line do |line|
  key, val = line.split("\t")
  if last_key && last_key != key
    puts "#{last_key}\t#{max_val}"
    last_key, max_val = key, val.to_i
  else
    last_key, max_val = key, [max_val, val.to_i].max
  end
end

puts "#{last_key}\t#{max_val}" if last_key

CODE EXPLANATION

Again, the program iterates over lines from standard input, but this time we have to store some
state as we process each key group. In this case, the keys are the years, and we store the last key seen
and the maximum temperature seen so far for that key. The MapReduce framework ensures that the
keys are ordered, so we know that if a key is different from the previous one, we have moved into a new
key group.

For each line, we pull out the key and value. Then, if we’ve just finished a group ( last_key &&
last_key != key ), we write the key and the maximum temperature for that group, separated by a tab
character, before resetting the maximum temperature for the new key. If we haven’t just finished a
group, we just update the maximum temperature for the current key.

DESIGN OF HDFS

HDFS is a filesystem designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware. Let’s examine this statement in more detail:

Very large files

“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in
size. There are Hadoop clusters running today that store petabytes of data.

Streaming data access

HDFS is built around the idea that the most efficient data processing pattern is a write-once,
read-many-times pattern. A dataset is typically generated or copied from source, and various analyses
are then performed on that dataset over time. Each analysis will involve a large proportion of the
dataset, so the time to read the whole dataset is more important than the latency in reading the first
record.

Commodity hardware

Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of
commodity hardware (commonly available hardware) for which the chance of node failure across the
cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable
interruption to the user in the case of such failure.

There are also applications for which HDFS does not work so well. Below are the areas where
HDFS is not a good fit today:

Low-latency data access

Applications that require low-latency access to data, in the tens of milliseconds range, will not
work well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this
may come at the expense of latency. HBase is currently a better choice for low-latency access.

Lots of small files

Because the namenode holds the filesystem metadata in memory, the number of files that can be
stored in HDFS is limited by the amount of memory on the namenode. As a rule of thumb, each file,
directory, and block takes about 150 bytes. So, for example, one million files, each occupying one block,
would need at least 300 MB of memory (one file object plus one block object per file, roughly
2 × 150 bytes × 1,000,000 ≈ 300 MB). Although storing millions of files is feasible, billions is beyond the
capability of current hardware.

Multiple writers, arbitrary file modifications

Files in HDFS may be written to by a single writer at a time. Writes are always made at the end of
the file, in append-only fashion. There is no support for multiple writers or for modifications at arbitrary
offsets in the file.
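
To make the single-writer, append-only model concrete, here is a hedged sketch of appending to an
existing HDFS file through the Java API. It assumes the target file already exists and that the underlying
filesystem supports append (HDFS does; some other FileSystem implementations throw
UnsupportedOperationException); the path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: a single writer appending to the end of an existing HDFS file.
public class HdfsAppendSketch {

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/user/demo/log.txt");   // placeholder: must already exist

    // HDFS allows one writer at a time, and writes go only at the end of the file.
    try (FSDataOutputStream out = fs.append(path)) {
      out.write("another line\n".getBytes("UTF-8"));
    }
    // There is no API for modifying data at arbitrary offsets within the file.
  }
}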
