Big Data Analytics (15CS82) - VTU Module 1 and 2 Notes
Module-1
Hadoop Distributed File System (HDFS) Basics
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. Commodity hardware is inexpensive; since Hadoop needs the processing power of many machines and deploying high-end hardware at that scale is costly, clusters are built from commodity machines. When commodity hardware is used, failures are the norm rather than the exception, so HDFS is designed to be highly fault-tolerant.
HDFS provides high-throughput access to the data it stores, which makes it extremely useful for building applications that work with large data sets.
HDFS was originally built as the infrastructure layer for Apache Nutch. It is now a core part of the Apache Hadoop project.
DataNodes serve read and write requests from HDFS file system clients. They are also responsible for creating block replicas and for checking whether blocks are corrupted. DataNodes regularly report back to the NameNode with heartbeat messages and block reports (the mapping of blocks they store).
1. HDFS exposes a Java/C API with which users can write applications that interact with HDFS. An application using this API interacts with the client library, present on the same client machine (a minimal sketch of such an application appears after this list).
2. The client (library) connects to the NameNode using RPC. The communication between them uses the ClientProtocol. Major functionality in the ClientProtocol includes Create (creates a file in the namespace), Append (adds to the end of an already existing file), Complete (the client has finished writing to the file), Read, and so on.
3. The client (library) interacts with DataNodes directly using the DataTransferProtocol. The DataTransferProtocol defines operations to read a block, write to a block, get the checksum of a block, copy a block, and so on.
4. Interaction between the NameNode and DataNodes: it is always the DataNode that initiates the communication, and the NameNode simply responds to the requests it receives. The communication usually involves DataNode registration, DataNodes sending heartbeat messages, DataNodes sending block reports, and DataNodes notifying the receipt of a block from a client or from another DataNode during block replication.
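The following minimal sketch shows the kind of client application described in point 1, written against the standard Hadoop FileSystem Java API. The NameNode address and file path are assumed example values; the Create/Complete and block-read traffic described in points 2 and 3 happens behind these calls.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        // Configuration normally picks up core-site.xml/hdfs-site.xml from the
        // classpath; the NameNode address below is an assumed example value.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Create a file in the HDFS namespace; Create, block allocation and
        // Complete happen behind this call via the ClientProtocol.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back; block data is streamed from the DataNodes
        // using the DataTransferProtocol.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}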
Assumptions and Goals
1. Hardware Failure: Hardware failure is the norm rather than the exception. An entire HDFS file system may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them are core architectural goals of HDFS.
2. Streaming Data Access: Applications that run on HDFS need streaming access to their data
sets. They are not general purpose applications that typically run on a general purpose file system.
HDFS is designed more for batch processing rather than interactive use by users. The emphasis is
on throughput of data access rather than latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas have been traded off to further enhance data throughput rates.
3. Large Data Sets: Applications that run on HDFS have large data sets. This means that a typical
file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should
provide high aggregate data bandwidth and should scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single cluster.
4. Simple Coherency Model: Most HDFS applications need write-once-read-many access model
for files. A file once created, written and closed need not be changed. This assumption simplifies
data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending writes to a file in the future.
Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each
file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks
belonging to a file are replicated for fault tolerance. The block size and replication factor are
configurable per file. Files in HDFS are write-once and have strictly one writer at any time. An
application can specify the number of replicas of a file. The replication factor can be specified at
file creation time and can be changed later. The Namenode makes all decisions regarding
replication of blocks. It periodically receives Heartbeat and a Blockreport from each of the
Datanodes in the cluster. A receipt of a heartbeat implies that the Datanode is in good health and is
serving data as desired. A Blockreport contains a list of all blocks on that Datanode.
If an HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over remote replicas.
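As a small illustration of the per-file replication setting described above, the sketch below uses the standard FileSystem API to request a replication factor at creation time and to change it afterwards. The path, replication factor, and block size are assumed example values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/data.txt");

        // Request the replication factor (and buffer/block sizes) at creation time.
        short replication = 3;
        long blockSize = 64L * 1024 * 1024;   // 64 MB, a typical HDFS block size
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, replication, blockSize)) {
            out.writeBytes("example payload\n");
        }

        // The factor can also be changed after the file exists; the Namenode
        // schedules re-replication (or replica removal) as needed.
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}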
SafeMode
On startup, the Namenode enters a special state called Safemode. Replication of data blocks does
not occur when the Namenode is in Safemode state. The Namenode receives Heartbeat and
Blockreport from the Datanodes. A Blockreport contains the list of data blocks that a Datanode
reports to the Namenode. Each block has a specified minimum number of replicas. A block is
considered safely-replicated when the minimum number of replicas of that data block has checked
in with the Namenode. When a configurable percentage of safely-replicated data blocks checks in
with the Namenode (plus an additional 30 seconds), the Namenode exits the Safemode state. It
then determines the list of data blocks (if any) that have fewer than the specified number of
replicas. The Namenode then replicates these blocks to other Datanodes.
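The sketch below shows how an administrative client might query the Safemode state. It assumes the DistributedFileSystem.setSafeMode call and the SAFEMODE_GET action exposed by the Hadoop 2.x HDFS client API; the equivalent check can also be done from the command line with hdfs dfsadmin.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafeModeCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Safe mode operations are specific to the HDFS FileSystem implementation.
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // SAFEMODE_GET only queries the current state; it does not change it.
            boolean inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
            System.out.println("Namenode in Safemode: " + inSafeMode);
        }
        fs.close();
    }
}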
Communication Protocols
A Remote Procedure Call (RPC) abstraction wraps the ClientProtocol and the DatanodeProtocol. By design, the Namenode never initiates an RPC. It only responds to RPC requests issued by a Datanode or a client.
Robustness
The primary objective of HDFS is to store data reliably even in the presence of failures. The three
types of common failures are Namenode failures, Datanode failures and network partitions.
Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. It is possible that data may
move automatically from one Datanode to another if the free space on a Datanode falls below a
certain threshold. Also, a sudden high demand for a particular file can dynamically cause creation
of additional replicas and rebalancing of other data in the cluster. These types of rebalancing
schemes are not yet implemented.
Data Correctness
It is possible that a block of data fetched from a Datanode is corrupted. This corruption can occur
because of faults in the storage device, a bad network or buggy software. The HDFS client
implements checksum checking on the contents of an HDFS file. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data it
received from a Datanode satisfies the checksum stored in the checksum file. If not, then the client
can opt to retrieve that block from another Datanode that has a replica of that block.
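HDFS performs this checksumming internally, but the idea can be illustrated with a small, purely hypothetical sketch that computes and compares a CRC32 checksum for a block of bytes; it is not the actual HDFS implementation.

import java.util.zip.CRC32;

public class ChecksumIllustration {
    // Compute a CRC32 checksum over a block of data, analogous to the
    // per-block checksums the HDFS client stores in a hidden file.
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "block contents fetched from a Datanode".getBytes();

        long stored = checksum(block);    // checksum recorded at write time
        long observed = checksum(block);  // checksum recomputed at read time

        // If the values disagree, the client would discard this copy and
        // fetch the block from another Datanode holding a replica.
        System.out.println(observed == stored
                ? "Block verified"
                : "Corruption detected: try another replica");
    }
}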
The Namenode machine is a single point of failure for the HDFS cluster. If the Namenode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the Namenode software to another machine is not supported.
Snapshots
Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted cluster to a previously known good point in time. HDFS does not currently support snapshots, but they will be supported in a future release.
Data Blocks
HDFS is designed to support large files. Applications that are compatible with HDFS are those that
deal with large data sets. These applications write the data only once; they read the data one or
more times and require that reads are satisfied at streaming speeds. HDFS supports write-once-
read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and each chunk may reside on a different Datanode.
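The sketch below, assuming an already existing example file, uses the standard FileSystem API to list how a large file is broken into blocks and which Datanodes host each block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayoutExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // An assumed example file; a large file spans many blocks.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/large.dat"));

        // Each BlockLocation describes one chunk of the file and the
        // Datanodes that hold a replica of that chunk.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}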
Staging
A client request to create a file does not reach the Namenode immediately. In fact, the HDFS client caches the file data in a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth more than one HDFS block size, the
client contacts the Namenode. The Namenode inserts the file name into the file system hierarchy
and allocates a data block for it. The Namenode responds to the client request with the identity of
the Datanode(s) and the destination data block. The client flushes the block of data from the local
temporary file to the specified Datanode. When a file is closed, the remaining un-flushed data in
the temporary local file is transferred to the Datanode. The client then instructs the Namenode that
the file is closed. At this point, the Namenode commits the file creation operation into a persistent
store. If the Namenode dies before the file is closed, the file is lost. The above approach has been
adopted after careful consideration of target applications that run on HDFS. Applications need
streaming writes to files. If a client writes to a remote file directly without any client-side buffering, the network speed and the congestion in the network impact throughput considerably.
This approach is not without precedent either. Earlier distributed file systems, e.g., AFS, have used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.
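The staging behaviour itself lives inside the HDFS client library, but the buffering idea can be illustrated with a purely hypothetical sketch: application writes accumulate in a local buffer, and a (stubbed) Namenode interaction happens only once a block's worth of data is available.

import java.io.ByteArrayOutputStream;

// Hypothetical illustration of client-side staging; not the real HDFS client code.
public class StagingSketch {
    static final int BLOCK_SIZE = 64 * 1024 * 1024;  // 64 MB

    private final ByteArrayOutputStream localBuffer = new ByteArrayOutputStream();

    // Application writes are transparently redirected to the local buffer.
    void write(byte[] data) {
        localBuffer.write(data, 0, data.length);
        if (localBuffer.size() >= BLOCK_SIZE) {
            flushBlock();
        }
    }

    // Only at this point would the real client contact the Namenode for a
    // block allocation and stream the buffered data to the chosen Datanodes.
    private void flushBlock() {
        System.out.println("Contact Namenode, allocate a block, flush "
                + localBuffer.size() + " bytes to Datanodes");
        localBuffer.reset();
    }
}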
Pipelining
When a client is writing data to an HDFS file, its data is first written to a local file as explained
above. Suppose the HDFS file has a replication factor of three. When the local file accumulates a
block of user data, the client retrieves a list of Datanodes from the Namenode. This list represents
the Datanodes that will host a replica of that block. The client then flushes the data block to the
first Datanode. The first Datanode starts receiving the data in small portions (4 KB), writes each
portion to its local repository and transfers that portion to the second Datanode in the list. The
second Datanode, in turn, starts receiving each portion of the data block, writes that portion to its
repository and then flushes that portion to the third Datanode. The third Datanode writes the data to its local repository. Thus, the data is pipelined from one Datanode to the next.
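The replication pipeline is likewise internal to HDFS, but the forwarding pattern can be sketched with a hypothetical example in which each stage stores a small portion locally and immediately passes it to the next Datanode in the list supplied by the Namenode.

import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the write pipeline; not actual Datanode code.
public class PipelineSketch {
    static final int PORTION_SIZE = 4 * 1024;  // data moves in ~4 KB portions

    // Forward one portion through the chain of Datanodes chosen by the Namenode.
    static void forward(byte[] portion, List<String> pipeline, int stage) {
        if (stage >= pipeline.size()) {
            return;
        }
        // Each Datanode writes the portion to its local repository...
        System.out.println(pipeline.get(stage) + " stores " + portion.length + " bytes");
        // ...and immediately streams the same portion to the next Datanode.
        forward(portion, pipeline, stage + 1);
    }

    public static void main(String[] args) {
        List<String> pipeline = Arrays.asList("datanode1", "datanode2", "datanode3");
        forward(new byte[PORTION_SIZE], pipeline, 0);
    }
}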
Apache Pig's compiler internally converts Pig Latin into MapReduce. It produces a sequence of MapReduce jobs, and this conversion is an abstraction that works like a black box. Pig was initially developed by Yahoo!. It gives you a platform for building data flows for ETL (Extract, Transform and Load) and for processing and analyzing huge data sets. In Pig, the load command first loads the data. Then we perform various operations on it such as grouping, filtering, joining and sorting. Finally, you can either dump the result to the screen or store it back in HDFS.
Apache Pig has several usage modes. The first is a local mode in which all processing is done on
the local machine. The non-local (cluster) modes are MapReduce and Tez. These modes execute
the job on the cluster using either the MapReduce engine or the optimized Tez engine. There are
also interactive and batch modes available; they enable Pig applications to be developed locally in
interactive modes, using small amounts of data, and then run at scale on the cluster in a production
mode.
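The sketch below shows a Pig data flow (load, filter, store) driven from Java through the PigServer class in local mode, matching the develop-locally workflow described above. The PigServer usage and the input/output paths are assumptions for illustration; the same statements could equally be typed into the interactive Grunt shell.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLocalModeExample {
    public static void main(String[] args) throws Exception {
        // Local mode runs everything on the local machine, matching the
        // "develop locally on small data" workflow described above.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load, filter, store: the Pig compiler turns these statements into
        // one or more MapReduce (or Tez) jobs behind the scenes.
        pig.registerQuery("logs = LOAD 'input/sample.log' USING PigStorage(' ') "
                + "AS (level:chararray, msg:chararray);");
        pig.registerQuery("errors = FILTER logs BY level == 'ERROR';");
        pig.store("errors", "output/errors");

        pig.shutdown();
    }
}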
Facebook created Hive for people who are fluent in SQL. Thus, Hive makes them feel at home while working in the Hadoop ecosystem. Basically, Hive is a data warehousing component which performs reading, writing and managing of large data sets in a distributed environment using an SQL-like interface.
The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL. Hive has two basic components: the Hive command line and the JDBC/ODBC driver. The Hive command line interface is used to execute HQL commands, while Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are used to establish a connection to the data storage.
Hive is also highly scalable, as it can serve both purposes: large data set processing (batch query processing) and real-time processing (interactive query processing). It supports all primitive data types of SQL. You can use predefined functions, or also write tailored user-defined functions (UDFs) to accomplish your specific needs.
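A minimal sketch of the JDBC path mentioned above: a Java client connects to HiveServer2 with the Hive JDBC driver and runs an HQL query. The host, port, credentials, and the sales table are assumed example values.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Older JDBC setups need the driver class loaded explicitly.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 JDBC URL; host, port and database are example values.
        String url = "jdbc:hive2://hiveserver:10000/default";

        try (Connection con = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = con.createStatement();
             // HQL reads very much like SQL.
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}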
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use
Sqoop to import data from a relational database management system (RDBMS) into the Hadoop
Distributed File System (HDFS), transform the data in Hadoop, and then export the data back into
an RDBMS.
Sqoop can be used with any Java Database Connectivity (JDBC)-compliant database and has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.
When we submit a Sqoop import command, the main task is divided into subtasks, each of which is handled internally by an individual map task. Each map task imports part of the data into the Hadoop ecosystem; collectively, all map tasks import the whole data set.
When we submit an export job, it is mapped into map tasks, which bring chunks of data from HDFS. These chunks are exported to a structured data destination. Combining all of these exported chunks, we receive the whole data set at the destination, which in most cases is an RDBMS (MySQL/Oracle/SQL Server).
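The sketch below shows one way such an import might be launched programmatically, assuming the Sqoop 1.x client class org.apache.sqoop.Sqoop and its runTool helper; the connection string, credentials, table, and target directory are example values. The same options are more commonly passed to the sqoop command-line tool.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Equivalent to running "sqoop import ..." on the command line;
        // connection string, credentials and table are example values.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost/shop",
            "--username", "dbuser",
            "--password", "dbpass",
            "--table", "orders",
            "--target-dir", "/user/demo/orders",
            "--num-mappers", "4"   // number of parallel map tasks doing the import
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.out.println("Sqoop import finished with exit code " + exitCode);
    }
}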
Apache Solr and Apache Lucene are the two services used for searching and indexing in the Hadoop ecosystem.
APACHE AMBARI
Ambari is an Apache Software Foundation project which aims at making the Hadoop ecosystem more manageable. It includes software for provisioning, managing and monitoring Apache Hadoop clusters.
Apache Flume is an independent agent designed to collect, transport, and store data into HDFS.
Often data transport involves a number of Flume agents that may traverse a series of machines and
locations. Flume is often used for log files, social media-generated data, email messages, and just
about any continuous data source.
As shown in Figure 2.3, a Flume agent is composed of three components: a source, a channel, and a sink.
Fig 2.6 YARN architecture with two clients (MapReduce and MPI).
Distributed-Shell
As described earlier in this chapter, Distributed-Shell is an example application included with the
Hadoop core components that demonstrates how to write applications on top of YARN. It provides
a simple method for running shell commands and scripts in containers in parallel on a Hadoop
YARN cluster.
Hadoop MapReduce
MapReduce was the first YARN framework and drove many of YARN's requirements. It is
integrated tightly with the rest of the Hadoop ecosystem projects, such as Apache Pig, Apache
Hive, and Apache Oozie.
Apache Tez
One great example of a new YARN framework is Apache Tez. Many Hadoop jobs involve the
execution of a complex directed acyclic graph (DAG) of tasks using separate MapReduce stages.
Apache Tez generalizes this process and enables these tasks to be spread across stages so that they
can be run as a single, all-encompassing job. Tez can be used as a MapReduce replacement for
projects such as Apache Hive and Apache Pig. No changes are needed to the Hive or Pig
applications.
Apache Giraph
Apache Giraph is an iterative graph processing system built for high scalability. Facebook, Twitter,
and LinkedIn use it to create social graphs of users. Giraph was originally written to run on
standard Hadoop V1 using the MapReduce framework, but that approach proved inefficient and
totally unnatural for various reasons. In addition, using the flexibility of YARN, the Giraph
developers plan on implementing their own web interface to monitor job progress.
Apache Spark
Spark was initially developed for applications in which keeping data in memory improves
performance, such as iterative algorithms, which are common in machine learning, and interactive
data mining. Spark differs from classic MapReduce in two important ways. First, Spark holds
intermediate results in memory, rather than writing them to disk. Second, Spark supports more than
just MapReduce functions; that is, it greatly expands the set of possible analyses that can be
executed over HDFS data stores. It also provides APIs in Scala, Java, and Python.
Since 2013, Spark has been running on production YARN clusters at Yahoo!. The advantage of porting and running Spark on top of YARN is the common resource management and a single underlying file system.
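A minimal sketch of the in-memory behaviour described above, using Spark's Java API: cache() keeps an RDD in memory so that several actions can reuse it without re-reading HDFS. The input path is an assumed example; when the job is submitted with spark-submit --master yarn it runs on the YARN cluster.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheExample {
    public static void main(String[] args) {
        // The master (e.g., YARN) is normally supplied by spark-submit.
        SparkConf conf = new SparkConf().setAppName("cache-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The HDFS input path is an assumed example value.
        JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/app.log");

        // cache() keeps the RDD in memory, so both actions below reuse the
        // in-memory data instead of re-reading it from HDFS.
        lines.cache();
        long total = lines.count();
        long errors = lines.filter(line -> line.contains("ERROR")).count();

        System.out.println("total lines: " + total + ", error lines: " + errors);
        sc.stop();
    }
}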
Apache Storm
Traditional MapReduce jobs are expected to eventually finish, but Apache Storm continuously
processes messages until it is stopped. This framework is designed to process unbounded streams
of data in real time. It can be used with any programming language. The basic Storm use cases
include real-time analytics, online machine learning, continuous computation, distributed RPC
(remote procedure calls), ETL (extract, transform, and load), and more. Storm provides fast
performance, is scalable, is fault tolerant, and provides processing guarantees. It works directly
under YARN and takes advantage of the common data and resource management substrate.