
Big Data Analytics

Module – 1:
Hadoop Distributed File System Basics, Running
Example Programs and Benchmarks, Hadoop
MapReduce Framework, MapReduce Programming
Module – 2:
Essential Hadoop Tools, Hadoop YARN
Applications, Managing Hadoop with Apache
Ambari, Basic Hadoop Administration Procedures

Module – 3
Business Intelligence Concepts and Application, Data
Warehousing, Data Mining, Data Visualization

Module – 4
Decision Trees, Regression, Artificial Neural Networks,
Cluster Analysis, Association Rule Mining

Module – 5
Text Mining, Naïve-Bayes Analysis, Support Vector
Machines, Web Mining, Social Network Analysis

Text Books:
1. Douglas Eadline, "Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem", 1st Edition, Pearson Education, 2016. ISBN-13: 978-9332570351
2. Anil Maheshwari, "Data Analytics", 1st Edition, McGraw Hill Education, 2017. ISBN-13: 978-9352604180

 Reference Books:
1) Tom White, "Hadoop: The Definitive Guide", 4th Edition, O'Reilly Media, 2015. ISBN-13: 978-9352130672
2) Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich, "Professional Hadoop Solutions", 1st Edition, Wrox Press, 2014. ISBN-13: 978-8126551071
3) Eric Sammer, "Hadoop Operations: A Guide for Developers and Administrators", 1st Edition, O'Reilly Media, 2012. ISBN-13: 978-9350239261

DEFINING BIG DATA
 Big data is a field that treats ways to analyze,
systematically extract information from, or
otherwise deal with data sets that are too large or
complex to be dealt with by traditional data-
processing application software.
 Large-volume data processing is involved, often measured in petabytes.
CHARACTERISTICS THAT DEFINE BIG DATA
 ■ Volume: Large volumes clearly define Big Data. In
some cases, the sheer size of the data makes it
impossible to evaluate by more conventional means.
 ■ Variety: Data may come from a variety of sources and not necessarily be related to other data sources.
 ■ Velocity: The term velocity in this context refers to how
fast the data can be generated and processed.
 ■ Variability: Data may be highly variable, incomplete,
and inconsistent.
 ■ Complexity: Relationships between data sources may
not be entirely clear and not amenable to traditional
relational methods.
HADOOP DISTRIBUTED FILE SYSTEM DESIGN
FEATURES
 The Hadoop Distributed File System (HDFS) was designed for
Big Data processing.
 Although capable of supporting many users simultaneously,
HDFS is not designed as a true parallel file system.
 Rather, the design assumes a large file write-once/read-many model that enables other optimizations and relaxes many of the concurrency and coherency overhead requirements of a true parallel file system.
 HDFS rigorously restricts data writing to one user at a time. All additional writes are append-only, and there is no random writing to HDFS files.
 Bytes are always appended to the end of a stream, and byte streams are guaranteed to be stored in the order written.
 HDFS is designed for data streaming where large
amounts of data are read from disk in bulk. The HDFS
block size is typically 64MB or 128MB.
 A principal design aspect of Hadoop MapReduce is the emphasis on moving the computation to the data rather than moving the data to the computation.
 Finally, Hadoop clusters assume node (and even rack)
failure will occur at some point. To deal with this situation,
HDFS has a redundant design that can tolerate system
failure and still provide the data needed by the compute
part of the program.
IMPORTANT ASPECTS OF HDFS
 The write-once/read-many design is intended to facilitate
streaming reads.
 Files may be appended, but random seeks are not
permitted. There is no caching of data.
 Converged data storage and processing happen on the
same server nodes.
 "Moving computation is cheaper than moving data."
 A reliable file system maintains multiple copies of data across the cluster. Consequently, failure of a single node (or even a rack in a large cluster) will not bring down the file system.
 A specialized file system is used, which is not designed
for general use.
HDFS COMPONENTS
 The design of HDFS is based on two types of nodes: a
NameNode and multiple Datanodes.
 In a basic design, a single NameNode manages all the
metadata needed to store and retrieve the actual data
from the Datanodes.
 No data is actually stored on the NameNode.
 For a minimal Hadoop installation, there needs to be a single NameNode daemon and a single DataNode daemon running on at least one machine.

 The design is a master/slave architecture in which the
master (NameNode) manages the file system
namespace and regulates access to files by clients.
 File system namespace operations such as opening, closing, and renaming files and directories are all managed by the NameNode.
 The NameNode also determines the mapping of blocks
to DataNodes and handles DataNode failures.
 The slaves (DataNodes) are responsible for serving
read and write requests from the file system to the
clients.
 The NameNode manages block creation, deletion, and
replication.
[Figure: Various system roles in an HDFS deployment]
 When a client writes data, it first communicates with the
NameNode and requests to create a file.
 The NameNode determines how many blocks are
needed and provides the client with the Datanodes that
will store the data.
 As part of the storage process, the data blocks are replicated after they are written to the assigned node.
 After the DataNode acknowledges that the file block replication is complete, the client closes the file and informs the NameNode that the operation is complete.
 To Read data, the client requests a file from the
NameNode, which returns the best Datanodes from
which to read the data.
 The client then accesses the data directly from the
Datanodes
 Once the metadata has been delivered to the client, the NameNode steps back and lets the conversation between the client and the DataNodes proceed.
 While data transfer is progressing, the NameNode
also monitors the Datanodes by listening for
heartbeats sent from Datanodes
 The lack of a heartbeat signal indicates a potential node failure. In such a case, the NameNode will route around the failed DataNode and begin re-replicating the now-missing blocks.
 The mappings between data blocks and the
physical DataNodes are not kept in persistent
storage on the NameNode.
 For performance reasons, the NameNode stores all
metadata in memory.
 Upon startup, each DataNode provides a block
report (which it keeps in persistent storage) to the
NameNode.
 The block reports are sent every 10 heartbeats.
(The interval between reports is a configurable
property.)
 The reports enable the NameNode to keep an up-
to-date account of all data blocks in the cluster
 In almost all Hadoop deployments, there is a
SecondaryNameNode(now called CheckPointNode).
 It is not an active failover node and cannot replace the
primary NameNode in case of its failure.
 The purpose of the SecondaryNameNode is to perform periodic checkpoints that evaluate the status of the NameNode.
 It also has two disk files that track changes to the
metadata:
An image of the file system state when the NameNode was
started. This file begins with fsimage_* and is used only at
startup by the NameNode.
A series of modifications done to the file system after starting the NameNode. These files begin with edit_* and reflect the changes made after the fsimage file was read.
 The SecondaryNameNode periodically downloads
fsimage and edits files, joins them into a new fsimage,
and uploads the new fsimage file to the NameNode.
 Thus, when the NameNode restarts, the fsimage file
is reasonably up-to-date and requires only the edit
logs to be applied since the last checkpoint.
 If the SecondaryNameNode were not running, a
restart of the NameNode could take a prohibitively
long time due to the number of changes to the file
system

Various roles in HDFS can be summarized as follows:

 HDFS uses a master/slave model designed for large file reading/streaming.
 The NameNode is a metadata server or "data traffic cop."
 HDFS provides a single namespace that is managed by the NameNode.
 Data is redundantly stored on DataNodes; there is no data on the NameNode.
 The SecondaryNameNode performs checkpoints of the NameNode file system's state but is not a failover node.

HDFS BLOCK REPLICATION

 When HDFS writes a file, it is replicated across the cluster


 The amount of replication is based on the value of dfs.replication in the hdfs-site.xml file.
 The default value can be overruled with the hdfs dfs -setrep command.
 For Hadoop clusters containing more than eight DataNodes, the replication value is usually set to 3.
 In addition, the HDFS default block size is often 64MB; in a typical operating system, the block size is 4KB or 8KB.
 The HDFS default block size is not the minimum block size, however. If a 20KB file is written to HDFS, it will create a block that is approximately 20KB in size. (The underlying file system may have a minimal block size that increases the actual file size.)
 If a file of size 80MB is written to HDFS, a 64MB block and a 16MB block will be created.
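For example, a sketch of changing the replication factor of an existing file with the -setrep option (the path and replication value below are illustrative; -w waits for the change to complete):

$ hdfs dfs -setrep -w 2 /user/hdfs/war-and-peace.txt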
HDFS SAFE MODE
 When the NameNode starts, it enters a read-only safe
mode where blocks cannot be replicated or deleted.
 Safe Mode enables the NameNode to perform two
important processes:
 1. The previous file system state is reconstructed by
loading the fsimage file into memory and replaying the
edit log.
 2. The mapping between blocks and data nodes is
created by waiting for enough of the DataNodes to
register so that at least one copy of the data is
available.
Not all Datanodes are required to register before HDFS exits
from Safe Mode.
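Safe Mode can also be inspected or toggled manually by an administrator; a sketch using the standard dfsadmin option:

$ hdfs dfsadmin -safemode get
$ hdfs dfsadmin -safemode enter
$ hdfs dfsadmin -safemode leave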
HADOOP YARN
 Apache Hadoop YARN is the resource management
and job scheduling technology in the open source
Hadoop distributed processing framework.

 YARN stands for Yet Another Resource Negotiator

 The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons.
 The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
 An application is either a single job or a DAG (Directed Acyclic Graph) of jobs.
RACK AWARENESS
 Rack awareness deals with data locality.
 The main design goals of Hadoop MapReduce is to move the
computation to the data.
 Hadoop cluster will exhibit three levels of data locality:
1. Data resides on the local machine (best)
2. Data resides in the same rack (better)
3. Data resides in a different rack (good)
 When the YARN scheduler is assigning MapReduce
containers to work as mappers, it will try to place the
container first on the local machine, then on the same rack,
and finally on another rack
 HDFS can be made rack-aware by using a user-derived script that enables the master node to map the network topology of the cluster; a sketch of such a script follows.
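A minimal sketch of a topology script (the net.topology.script.file.name property in core-site.xml is the standard way to register it; the script path and the IP-to-rack mapping below are illustrative assumptions):

#!/bin/bash
# topology.sh - print one rack name for each host/IP argument Hadoop passes in.
# Registered via the net.topology.script.file.name property in core-site.xml.
for host in "$@"; do
  case $host in
    10.0.1.*) echo "/rack1" ;;          # hosts in the first rack (illustrative)
    10.0.2.*) echo "/rack2" ;;          # hosts in the second rack (illustrative)
    *)        echo "/default-rack" ;;   # fallback rack
  esac
done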
NameNode High Availability

 One of the NameNode machines is in the Active state, and
the other is in the Standby state.
 Like a single NameNode cluster, the Active NameNode is
responsible for all client HDFS operations in the cluster.
 The Standby NameNode maintains enough state to provide
a fast failover (if required).
 To guarantee the file system state is preserved, both the
Active and Standby NameNodes receive block reports from
the Datanodes.
 The Active node also sends all file system edits to a quorum of JournalNodes.
 At least three physically separate JournalNode daemons
are required, because edit log modifications must be written
to a majority of the JournalNodes.
 This design will enable the system to tolerate the failure of a single JournalNode machine.
 The Standby node continuously reads the edits from
the JournalNodes to ensure its namespace is
synchronized with that of the Active node.
 In the event of an Active NameNode failure, the Standby node reads all remaining edits from the JournalNodes before promoting itself to the Active state.
 To prevent confusion between NameNodes, the JournalNodes allow only one NameNode to be a writer at a time.
 During failover, the NameNode that is chosen to become active takes over the role of writing to the JournalNodes.
 A Secondary-NameNode is not required in the HA
configuration because the Standby node also performs
the tasks of the Secondary NameNode.
 Apache ZooKeeper is used to monitor the NameNode health.
 ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures.
 HDFS failover relies on ZooKeeper for failure detection
and for Standby to Active NameNode election.

HDFS USER COMMANDS
 Usage: hdfs [--config confdir] COMMAND

where COMMAND is one of:

Command             Usage
dfs                 run a file system command on the file systems supported in Hadoop
namenode -format    format the DFS file system
secondarynamenode   run the DFS secondary namenode
namenode            run the DFS namenode
journalnode         run the DFS journalnode
cacheadmin          configure the HDFS cache
crypto              configure HDFS encryption zones
storagepolicies     get all the existing block storage policies
version             print the version
zkfc                run the ZK Failover Controller daemon
datanode            run a DFS datanode
dfsadmin            run a DFS admin client
haadmin             run a DFS HA admin client
fsck                run a DFS file system checking utility
balancer            run a cluster balancing utility
nfs3                run an NFS version 3 gateway
General HDFS Commands

The version of HDFS can be found from the version option
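For example:

$ hdfs version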

List Files in HDFS


To list the files in the root HDFS directory
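For example, to list the root directory:

$ hdfs dfs -ls /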

Make a Directory in HDFS
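For example (the directory name stuff is illustrative):

$ hdfs dfs -mkdir stuff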

Copy Files to HDFS
If a full path is not supplied, your home directory is
assumed.

 The file transfer can be confirmed by using the -ls command.
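For example (the local file test and the HDFS directory stuff are illustrative names):

$ hdfs dfs -put test stuff
$ hdfs dfs -ls stuff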

Copy Files from HDFS

 Files can be copied back to your local file system using the -get (or -copyToLocal) option.
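For example (names illustrative):

$ hdfs dfs -get stuff/test test-local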

Copy Files within HDFS
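Files can be copied within HDFS using the -cp option; for example (names illustrative):

$ hdfs dfs -cp stuff/test test.hdfs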

Delete a File within HDFS

• When the fs.trash.interval option is set to a non-zero value in core-site.xml, all deleted files are moved to the user's .Trash directory.
• This can be avoided by including the -skipTrash option.
$ hdfs dfs -rm -skipTrash stuff/test
Deleted stuff/test
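With the trash enabled, a plain delete instead moves the file to .Trash (path illustrative):

$ hdfs dfs -rm stuff/test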
Get an HDFS Status Report

 Regular users can get an abbreviated HDFS status report using the dfsadmin -report option.
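For example:

$ hdfs dfsadmin -report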

USING HDFS IN PROGRAMS
 When using Java, reading from and writing to
Hadoop DFS is no different from the corresponding
operations with other file systems
 To be able to read from or write to HDFS, you need
to create a Configuration object and pass
configuration parameters to it using Hadoop
configuration files

package org.myorg;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSClient {

    public HDFSClient() {
    }
    public void addFile(String source, String dest) throws IOException {
        Configuration conf = new Configuration();

        // The conf object will read the HDFS configuration parameters
        // from these XML files.
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

        FileSystem fileSystem = FileSystem.get(conf);

        // Get the filename out of the file path.
        String filename = source.substring(source.lastIndexOf('/') + 1,
                source.length());

        // Create the destination path including the filename.
        if (dest.charAt(dest.length() - 1) != '/') {
            dest = dest + "/" + filename;
        } else {
            dest = dest + filename;
        }

        // Check if the file already exists.
        Path path = new Path(dest);
        if (fileSystem.exists(path)) {
            System.out.println("File " + dest + " already exists");
            return;
        }

        // Create a new file and write data to it.
        FSDataOutputStream out = fileSystem.create(path);
        InputStream in = new BufferedInputStream(new FileInputStream(
                new File(source)));

        byte[] b = new byte[1024];
        int numBytes = 0;
        while ((numBytes = in.read(b)) > 0) {
            out.write(b, 0, numBytes);
        }

        // Close all the file descriptors.
        in.close();
        out.close();
        fileSystem.close();
    }
    public void readFile(String file) throws IOException {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));

        FileSystem fileSystem = FileSystem.get(conf);
        Path path = new Path(file);
        if (!fileSystem.exists(path)) {
            System.out.println("File " + file + " does not exist");
            return;
        }

        FSDataInputStream in = fileSystem.open(path);

        // Write the HDFS file to a local file of the same name.
        String filename = file.substring(file.lastIndexOf('/') + 1,
                file.length());
        OutputStream out = new BufferedOutputStream(new FileOutputStream(
                new File(filename)));

        byte[] b = new byte[1024];
        int numBytes = 0;
        while ((numBytes = in.read(b)) > 0) {
            out.write(b, 0, numBytes);
        }

        in.close();
        out.close();
        fileSystem.close();
    }
    public void deleteFile(String file) throws IOException {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));

        FileSystem fileSystem = FileSystem.get(conf);
        Path path = new Path(file);
        if (!fileSystem.exists(path)) {
            System.out.println("File " + file + " does not exist");
            return;
        }

        // The second argument requests recursive deletion.
        fileSystem.delete(new Path(file), true);
        fileSystem.close();
    }
    public void mkdir(String dir) throws IOException {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));

        FileSystem fileSystem = FileSystem.get(conf);
        Path path = new Path(dir);
        if (fileSystem.exists(path)) {
            System.out.println("Dir " + dir + " already exists");
            return;
        }

        fileSystem.mkdirs(path);
        fileSystem.close();
    }
}
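A minimal usage sketch follows; the driver class and the file paths are assumptions for illustration, not part of the original listing, and assume the Hadoop configuration files referenced above.

package org.myorg;

import java.io.IOException;

// Hypothetical driver for the HDFSClient class above; paths are illustrative.
public class HDFSClientDriver {
    public static void main(String[] args) throws IOException {
        HDFSClient client = new HDFSClient();
        client.mkdir("/user/hdfs/stuff");                     // create an HDFS directory
        client.addFile("/tmp/test.txt", "/user/hdfs/stuff");  // copy a local file into HDFS
        client.readFile("/user/hdfs/stuff/test.txt");         // copy it back to the local working directory
        client.deleteFile("/user/hdfs/stuff/test.txt");       // remove it from HDFS
    }
}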
RUNNING BASIC HADOOP BENCHMARKS

 Many Hadoop benchmarks can provide insight into


cluster performance.
 The best benchmarks are always those that reflect real
application performance.
 The two benchmarks, terasort and TestDFSIO, provide a good sense of how well your Hadoop installation is operating.
 These benchmarks are designed for full Hadoop cluster
installations
 These tests assume a multi-disk HDFS environment

RUNNING THE TERASORT TEST
 The terasort benchmark sorts a specified amount of
randomly generated data.
 This benchmark provides combined testing of the
HDFS and MapReduce layers of a Hadoop cluster.
 A full terasort benchmark run consists of the following
three steps:
1. Generating the input data via the teragen program.
2. Running the actual terasort benchmark on the input data.
3. Validating the sorted output data via the teravalidate program.

 In general, each row is 100 bytes long; thus the
total amount of data written is 100 times the
number of rows specified as part of the benchmark
(i.e., to write 100GB of data, use 1 billion rows).
 The input and output directories need to be
specified in HDFS.
 The following sequence of commands will run the
benchmark for 50GB of data as user hdfs.
1. Run teragen to generate rows of random data to
sort.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teragen 500000000 /user/hdfs/TeraGen-50GB

2. Run terasort to sort the database
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB

3. Run teravalidate to validate the sort.


$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teravalidate /user/hdfs/TeraSort-50GB /user/hdfs/TeraValid-50GB

 To report results, the time for the actual sort (terasort) is measured and the benchmark rate in megabytes/second (MB/s) is calculated.
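For example, if the 50GB (50,000MB) sort itself took 1,000 seconds, the reported rate would be roughly 50,000 MB / 1,000 s = 50 MB/s (the timing here is purely illustrative).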

RUNNING THE TESTDFSIO BENCHMARK
 Hadoop also includes an HDFS benchmark application
called TestDFSIO.
 The TestDFSIO benchmark is a read and write test for
HDFS.
 That is, it will write or read a number of files to and from HDFS and is designed in such a way that it will use one map task per file.
 The file size and number of files are specified as
command-line arguments. Similar to the terasort
benchmark, we should run this test as user hdfs.
 In addition to TestDFSIO, NNBench (load testing the NameNode) and MRBench (load testing the MapReduce framework) are commonly used Hadoop benchmarks.
The steps to run TestDFSIO are as follows:
1. Run TestDFSIO in write mode and create data.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 16 -fileSize 1000

2. Run TestDFSIO in read mode.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 16 -fileSize 1000

3. Clean up the TestDFSIO data.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean

MAPREDUCE MODEL
 The MapReduce computation model provides a
very powerful tool for many applications
 There are two stages:
A Mapping Stage & A Reducing Stage.
 In the mapping stage, a mapping procedure is
applied to input data
 Map takes a set of data and converts it into another
set of data, where individual elements are broken
down into tuples (key/value pairs).
 Reduce tasks shuffle and reduce the data.

 For instance, assume you need to count how many
times the name “Kutuzov” appears in the novel War
and Peace.
 One solution is to gather 20 friends and give them each a section of the book to search. This step is the map stage.
 The reduce phase happens when everyone is done
counting and you sum the total as your friends tell
you their counts.

SIMPLE MAPPER SCRIPT

#!/bin/bash
while read line ; do
  for token in $line; do
    if [ "$token" = "Kutuzov" ] ; then
      echo "Kutuzov,1"
    elif [ "$token" = "Petersburg" ] ; then
      echo "Petersburg,1"
    fi
  done
done
SIMPLE REDUCER SCRIPT

#!/bin/bash
kcount=0
pcount=0
while read line ; do
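  # The reducer script breaks off here in these notes. What follows is a
  # plausible completion (a sketch, not from the source), assuming the loop
  # consumes the "Kutuzov,1" and "Petersburg,1" lines emitted by the mapper.
  if [ "$line" = "Kutuzov,1" ] ; then
    let kcount=kcount+1
  elif [ "$line" = "Petersburg,1" ] ; then
    let pcount=pcount+1
  fi
done
echo "Kutuzov,$kcount"
echo "Petersburg,$pcount"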
 Formally, The mapper and reducer functions are both
defined with respect to data structured in (key, value)
pairs.
 The mapper takes one pair of data with a type in one
data domain, and returns a list of pairs in a different
domain:
Map(key1,value1) → list(key2,value2)

 The reducer function is then applied to each key and its list of values, which in turn produces a collection of values in the same domain:

Reduce(key2, list(value2)) → list(value3)
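For example, in the Kutuzov word count, the map stage emits (Kutuzov, 1) for each occurrence it finds, and the reduce stage is then handed (Kutuzov, [1, 1, 1, ...]) and emits the total count.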


 1. Input Splits: HDFS distributes and replicates data
over multiple servers
 2. Map Step: MapReduce will try to execute the mapper
on the machines where the block resides. Because the
file is replicated in HDFS, the least busy node with the
data will be chosen.
 3. Combiner step (optional): It is possible to provide an
optimization or pre-reduction as part of the map stage
where key—value pairs are combined prior to the next
stage
 4. Shuffle step: Before the parallel reduction stage can
complete, similar keys must be combined and counted
by the same reducer process
 5. Reduce Step: The final step is the actual reduction. In this stage, the data reduction is performed as per the programmer's design. The reduce step is also optional. The results are written to HDFS.
MapReduce Parallel Data Flow

FAULT TOLERANCE
 The design of MapReduce makes it possible to easily
recover from the failure of one or many map processes.
 For example, should a server fail, the map tasks that
were running on that machine could easily be restarted
on another working server because there is no
dependence on any other map task.
 In functional language terms, the map tasks "do not share state" with other mappers.
 The application will run more slowly because work needs to be redone, but it will complete.

SPECULATIVE EXECUTION
 Speculative execution addresses unexpected bottlenecks or failures.
 When one part of a MapReduce process runs slowly, it ultimately slows down everything else because the application cannot complete until all processes are finished.
 It is possible to start a copy of a running map process without
disturbing any other running mapper processes.
 For example, suppose that as most of the map tasks are
coming to a close, the ApplicationMaster notices that some
are still running and schedules redundant copies of the
remaining jobs on less busy or free servers.
 Should the secondary processes finish first, the other first
processes are then terminated (or vice versa). This process is
known as speculative execution.
Module 1 at a glance
