BDA Module 2 - Notes PDF
Module – 1:
Hadoop Distributed File System Basics, Running
Example Programs and Benchmarks, Hadoop
MapReduce Framework, MapReduce Programming
Module – 2:
Essential Hadoop Tools, Hadoop YARN
Applications, Managing Hadoop with Apache
Ambari, Basic Hadoop Administration Procedures
Module – 3
Business Intelligence Concepts and Application, Data
Warehousing, Data Mining, Data Visualization
Module – 4
Decision Trees, Regression, Artificial Neural Networks,
Cluster Analysis, Association Rule Mining
Module – 5
Text Mining, Naïve-Bayes Analysis, Support Vector
Machines, Web Mining, Social Network Analysis
Text Books:
1. Douglas Eadline, "Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem", 1st Edition, Pearson Education, 2016. ISBN-13: 978-9332570351
2. Anil Maheshwari, "Data Analytics", 1st Edition, McGraw Hill Education, 2017. ISBN-13: 978-9352604180
Reference Books:
1) Tom White, "Hadoop: The Definitive Guide", 4th Edition, O'Reilly Media, 2015. ISBN-13: 978-9352130672
2) Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich, "Professional Hadoop Solutions", 1st Edition, Wrox Press, 2014. ISBN-13: 978-8126551071
3) Eric Sammer, "Hadoop Operations: A Guide for Developers and Administrators", 1st Edition, O'Reilly Media, 2012. ISBN-13: 978-9350239261
DEFINING BIG DATA
Big data is a field that treats ways to analyze,
systematically extract information from, or
otherwise deal with data sets that are too large or
complex to be dealt with by traditional data-
processing application software.
Such large-volume data processing is often measured in petabytes.
CHARACTERISTICS THAT DEFINE BIG DATA
■ Volume: Large volumes clearly define Big Data. In
some cases, the sheer size of the data makes it
impossible to evaluate by more conventional means.
■ Variety: Data may come from a variety of sources and
not necessarily be related to other data sources.
■ Velocity: The term velocity in this context refers to how
fast the data can be generated and processed.
■ Variability: Data may be highly variable, incomplete,
and inconsistent.
■ Complexity: Relationships between data sources may
not be entirely clear and not amenable to traditional
relational methods.
HADOOP DISTRIBUTED FILE SYSTEM DESIGN
FEATURES
The Hadoop Distributed File System (HDFS) was designed for
Big Data processing.
Although capable of supporting many users simultaneously,
HDFS is not designed as a true parallel file system.
Rather, the design assumes a large file write-once/ read-many
model that enables other optimizations and relaxes many of
the concurrence and coherence overhead requirements of a
true parallel file system.
HDFS rigorously restricts data writing to one user at a time. All
additional writes are append-only; there is no random
writing to HDFS files.
Bytes are always appended to the end of a stream, and byte
streams are guaranteed to be stored in the order written.
HDFS is designed for data streaming where large
amounts of data are read from disk in bulk. The HDFS
block size is typically 64MB or 128MB.
A principal design aspect of Hadoop MapReduce is the
emphasis on moving the computation to the data rather
than moving the data to the computation.
Finally, Hadoop clusters assume node (and even rack)
failure will occur at some point. To deal with this situation,
HDFS has a redundant design that can tolerate system
failure and still provide the data needed by the compute
part of the program.
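The state of this redundancy can be checked from the command line. As an illustrative sketch (the path / and the chosen report options are just one common invocation), fsck reports each file's blocks, their locations, and their replication status:

$ hdfs fsck / -files -blocks -locations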
IMPORTANT ASPECTS OF HDFS
The write-once/read-many design is intended to facilitate
streaming reads.
Files may be appended, but random seeks are not
permitted. There is no caching of data.
Converged data storage and processing happen on the
same server nodes.
"Moving computation is cheaper than moving data."
A reliable file system maintains multiple copies of data.
The design is a master/slave architecture in which the
master (NameNode) manages the file system
namespace and regulates access to files by clients.
File system namespace operations such as opening, closing, and
renaming files and directories are all handled by the NameNode.
[Figure: the various system roles in an HDFS deployment]
When a client writes data, it first communicates with the
NameNode and requests to create a file.
The NameNode determines how many blocks are
needed and provides the client with the Datanodes that
will store the data.
As part of the storage process, the data blocks are
replicated after they are written to the assigned node.
To read data, the client requests a file from the
NameNode, which returns the best DataNodes from
which to read the data.
The client then accesses the data directly from the
DataNodes.
Once the metadata has been delivered to the client,
the NameNode steps back and lets the conversation
between the client and the DataNodes proceed.
While data transfer is progressing, the NameNode
also monitors the Datanodes by listening for
heartbeats sent from Datanodes
The lack of a heartbeat signal indicates a potential
node failure. In such a case, the NameNode will route
around the failed DataNode and begin re-replicating
the now-missing blocks.
The mappings between data blocks and the
physical DataNodes are not kept in persistent
storage on the NameNode.
For performance reasons, the NameNode stores all
metadata in memory.
Upon startup, each DataNode provides a block
report (which it keeps in persistent storage) to the
NameNode.
The block reports are sent every 10 heartbeats.
(The interval between reports is a configurable
property.)
The reports enable the NameNode to keep an up-
to-date account of all data blocks in the cluster.
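The report interval mentioned above is an ordinary HDFS configuration property, so it can be inspected with the hdfs getconf utility. A minimal sketch, assuming the standard Hadoop 2 property name dfs.blockreport.intervalMsec:

$ hdfs getconf -confKey dfs.blockreport.intervalMsec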
In almost all Hadoop deployments, there is a
SecondaryNameNode (now called the CheckpointNode).
It is not an active failover node and cannot replace the
primary NameNode in case of its failure.
The purpose of the SecondaryNameNode is to perform
periodic checkpoints that evaluate the status of the
NameNode.
The NameNode also has two disk files that track changes to the
metadata:
An image of the file system state when the NameNode was
started. This file begins with fsimage_* and is used only at
startup by the NameNode.
A series of modifications made to the file system after the
NameNode was started. These files begin with edits_* and reflect the
changes made after the fsimage file was read.
The SecondaryNameNode periodically downloads
fsimage and edits files, joins them into a new fsimage,
and uploads the new fsimage file to the NameNode.
Thus, when the NameNode restarts, the fsimage file
is reasonably up-to-date and requires only the edit
logs to be applied since the last checkpoint.
If the SecondaryNameNode were not running, a
restart of the NameNode could take a prohibitively
long time due to the number of changes to the file
system
Various roles in HDFS can be summarized as follows:
HDFS provides a single namespace that is managed by
the NameNode
Data is redundantly stored on Datanodes; there is no
data on the NameNode
The SecondaryNameNode performs checkpoints of the
NameNode's file system state but is not a failover node.
HDFS NAMENODE HIGH AVAILABILITY
One of the NameNode machines is in the Active state, and
the other is in the Standby state.
Like a single NameNode cluster, the Active NameNode is
responsible for all client HDFS operations in the cluster.
The Standby NameNode maintains enough state to provide
a fast failover (if required).
To guarantee the file system state is preserved, both the
Active and Standby NameNodes receive block reports from
the Datanodes.
The Active node also sends all file system edits to a quorum
of Journal nodes.
At least three physically separate JournalNode daemons
are required, because edit log modifications must be written
to a majority of the JournalNodes.
This design will enable the system to tolerate the failure of
a single JournalNode machine.
The Standby node continuously reads the edits from
the JournalNodes to ensure its namespace is
synchronized with that of the Active node.
In the event of an Active NameNode failure, the
Standby node takes over as the Active NameNode.
Apache ZooKeeper is used to monitor the NameNode health.
ZooKeeper is a highly available service for maintaining
small amounts of coordination data, notifying clients of
changes in that data, and monitoring clients for failures.
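As an illustration, the state of each NameNode in an HA pair can be queried with the hdfs haadmin tool; the service IDs nn1 and nn2 below are example names that would be defined in hdfs-site.xml:

$ hdfs haadmin -getServiceState nn1
$ hdfs haadmin -getServiceState nn2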
HDFS USER COMMANDS
Usage: hdfs [--config confdir] COMMAND

Command             Usage
dfs                 run a file system command on the file systems supported in Hadoop
namenode -format    format the DFS file system
secondarynamenode   run the DFS secondary namenode
journalnode         run the DFS journalnode
cacheadmin          configure the HDFS cache
General HDFS Commands
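For example, listing a user's HDFS home directory and reporting the installed HDFS version can be done as follows (the /user/hdfs path is illustrative):

$ hdfs dfs -ls /user/hdfs
$ hdfs version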
Make a Directory in HDFS
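For example, to create a directory named stuff under the user's home directory (the directory name is illustrative):

$ hdfs dfs -mkdir stuff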
Copy Files to HDFS
If a full path is not supplied, your home directory is
assumed.
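For example, copying a local file named test into the stuff directory created above (both names are illustrative):

$ hdfs dfs -put test stuff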
Copy Files from HDFS
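For example, copying the file back from HDFS to the local file system under a new name (names are illustrative):

$ hdfs dfs -get stuff/test test-local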
Delete a File within HDFS
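For example, removing the file copied above (the name is illustrative; depending on the cluster's trash settings, the file may be moved to a .Trash directory rather than deleted immediately):

$ hdfs dfs -rm stuff/test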
USING HDFS IN PROGRAMS
When using Java, reading from and writing to
Hadoop DFS is no different from the corresponding
operations with other file systems.
To be able to read from or write to HDFS, you need
to create a Configuration object and pass
configuration parameters to it using Hadoop
configuration files.
package org.myorg;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public void addFile(String source, String dest) throws IOException {
  Configuration conf = new Configuration();

  // Conf object will read the HDFS configuration parameters from these XML files.
  conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
  conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

  FileSystem fileSystem = FileSystem.get(conf);

  // Get the filename out of the file path
  String filename = source.substring(source.lastIndexOf('/') + 1, source.length());

  // Create the destination path including the filename.
  if (dest.charAt(dest.length() - 1) != '/') {
    dest = dest + "/" + filename;
  } else {
    dest = dest + filename;
  }
  // System.out.println("Adding file to " + dest);

  // Check if the file already exists
  Path path = new Path(dest);
  if (fileSystem.exists(path)) {
    System.out.println("File " + dest + " already exists");
    return;
  }

  // Create a new file and write data to it.
  FSDataOutputStream out = fileSystem.create(path);
  InputStream in = new BufferedInputStream(new FileInputStream(new File(source)));

  byte[] b = new byte[1024];
  int numBytes = 0;
  while ((numBytes = in.read(b)) > 0) {
    out.write(b, 0, numBytes);
  }

  // Close all the file descriptors
  in.close();
  out.close();
  fileSystem.close();
}
public void readFile(String file) throws IOException {
  Configuration conf = new Configuration();
  conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));

  FileSystem fileSystem = FileSystem.get(conf);
  Path path = new Path(file);
  if (!fileSystem.exists(path)) {
    System.out.println("File " + file + " does not exist");
    return;
  }

  FSDataInputStream in = fileSystem.open(path);

  String filename = file.substring(file.lastIndexOf('/') + 1, file.length());
  OutputStream out = new BufferedOutputStream(new FileOutputStream(new File(filename)));

  byte[] b = new byte[1024];
  int numBytes = 0;
  while ((numBytes = in.read(b)) > 0) {
    out.write(b, 0, numBytes);
  }

  in.close();
  out.close();
  fileSystem.close();
}
public void deleteFile(String file) throws IOException {
  Configuration conf = new Configuration();
  conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));

  FileSystem fileSystem = FileSystem.get(conf);
  Path path = new Path(file);
  if (!fileSystem.exists(path)) {
    System.out.println("File " + file + " does not exist");
    return;
  }

  fileSystem.delete(new Path(file), true);
  fileSystem.close();
}
public void mkdir(String dir) throws IOException {
  Configuration conf = new Configuration();
  conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));

  FileSystem fileSystem = FileSystem.get(conf);
  Path path = new Path(dir);
  if (fileSystem.exists(path)) {
    System.out.println("Dir " + dir + " already exists");
    return;
  }

  fileSystem.mkdirs(path);
  fileSystem.close();
}
RUNNING BASIC HADOOP BENCHMARKS
RUNNING THE TERASORT TEST
The terasort benchmark sorts a specified amount of
randomly generated data.
This benchmark provides combined testing of the
HDFS and MapReduce layers of a Hadoop cluster.
A full terasort benchmark run consists of the following
three steps:
1. Generating the input data via the teragen program.
2. Running the actual terasort benchmark on the input data.
3. Validating the sorted output data via the teravalidate program.
In general, each row is 100 bytes long; thus the
total amount of data written is 100 times the
number of rows specified as part of the benchmark
(i.e., to write 100GB of data, use 1 billion rows).
The input and output directories need to be
specified in HDFS.
The following sequence of commands will run the
benchmark for 50GB of data as user hdfs.

1. Run teragen to generate rows of random data to sort.

$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teragen 500000000 /user/hdfs/TeraGen-50GB
2. Run terasort to sort the database.

$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
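3. Validate the sorted output via teravalidate. A sketch of the invocation, following the same pattern as the previous steps (the report directory name TeraValid-50GB is illustrative):

$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teravalidate /user/hdfs/TeraSort-50GB /user/hdfs/TeraValid-50GB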
RUNNING THE TESTDFSIO BENCHMARK
Hadoop also includes an HDFS benchmark application
called TestDFSIO.
The TestDFSIO benchmark is a read and write test for
HDFS.
That is, it will write or read a number of files to and from HDFS.
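A sketch of a typical run, assuming 16 files of roughly 1GB each (the values are illustrative, and the tests jar name varies slightly between distributions): first write the files, then read them back, then clean up.

$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 16 -fileSize 1000
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 16 -fileSize 1000
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean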
MAPREDUCE MODEL
The MapReduce computation model provides a
very powerful tool for many applications
There are two stages:
A Mapping Stage & A Reducing Stage.
In the mapping stage, a mapping procedure is
applied to input data
Map takes a set of data and converts it into another
set of data, where individual elements are broken
down into tuples (key/value pairs).
Reduce tasks shuffle and reduce the data.
For instance, assume you need to count how many
times the name “Kutuzov” appears in the novel War
and Peace.
One solution is to gather 20 friends and give them
each a section of the book to search. Each friend counts the
occurrences in their section (the mapping stage), and the
individual counts are then combined into a final total (the
reducing stage).
SIMPLE MAPPER SCRIPT
#!/bin/bash
while read line ; do
  for token in $line; do
    if [ "$token" = "Kutuzov" ] ; then
      echo "Kutuzov,1"
    elif [ "$token" = "Petersburg" ] ; then
      echo "Petersburg,1"
    fi
  done
done
SIMPLE REDUCER SCRIPT
#!/bin/bash
kcount=0
pcount=0
while read line ; do
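# A minimal sketch of the remaining reducer logic, assuming the mapper's
# "Kutuzov,1" / "Petersburg,1" output format shown above:
  if [ "$line" = "Kutuzov,1" ] ; then
    let kcount=kcount+1
  elif [ "$line" = "Petersburg,1" ] ; then
    let pcount=pcount+1
  fi
done
echo "Kutuzov,$kcount"
echo "Petersburg,$pcount"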
Formally, the mapper and reducer functions are both
defined with respect to data structured in (key, value)
pairs.
The mapper takes one pair of data with a type in one
data domain, and returns a list of pairs in a different
domain:
Map(key1, value1) → list(key2, value2)
The reducer is then applied to each key and its associated
list of values, producing a list of values in turn:
Reduce(key2, list(value2)) → list(value3)
FAULT TOLERANCE
The design of MapReduce makes it possible to easily
recover from the failure of one or many map processes.
For example, should a server fail, the map tasks that
were running on that machine could easily be restarted
on another working server because there is no
dependence on any other map task.
In functional language terms, the map tasks "do not
share state" with other mappers.
If a failure does occur, the application will run more slowly
because work needs to be redone, but it will complete.
SPECULATIVE EXECUTION
When one part of a MapReduce process runs slowly because of
bottlenecks or failures, it ultimately slows down everything else,
because the application cannot complete until all processes are finished.
It is possible to start a copy of a running map process without
disturbing any other running mapper processes.
For example, suppose that as most of the map tasks are
coming to a close, the ApplicationMaster notices that some
are still running and schedules redundant copies of the
remaining jobs on less busy or free servers.
Should the secondary processes finish first, the other first
processes are then terminated (or vice versa). This process is
known as speculative execution.
Module 1 at a glance