Big Data Analytics
Unit II
HDFS
The Hadoop Distributed File System (HDFS) was developed using distributed file system design. It runs on
commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and
designed to run on low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data,
the files are stored across multiple machines. These files are stored in a redundant fashion to
protect the system from possible data loss in case of failure. HDFS also makes applications
available for parallel processing.
Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of the namenode and datanode help users to easily check the
status of the cluster.
It provides streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows a master-slave architecture and has the following elements.
Namenode
The namenode is the commodity hardware that contains the Linux operating system and the
namenode software. It is software that can be run on commodity hardware. The system
having the namenode acts as the master server and performs the following tasks −
Manages the file system namespace.
Regulates clients’ access to files.
Executes file system operations such as renaming, closing, and opening
files and directories.
Datanode
The datanode is commodity hardware having the Linux operating system and datanode
software. For every node (commodity hardware/system) in a cluster, there will be a
datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client
requests.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
Block
Generally, the user data is stored in the files of HDFS. A file in the file system is
divided into one or more segments, which are stored in individual datanodes. These file
segments are called blocks. In other words, the minimum amount of data that HDFS can
read or write is called a block. The default block size is 64 MB, but it can be increased as
needed by changing the HDFS configuration.
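As a minimal sketch, the block size used for newly written files can also be set through the client configuration. The property name shown, dfs.blocksize, is the Hadoop 2.x name; older releases use dfs.block.size, and the same property can be set cluster-wide in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // request 128 MB blocks for new files
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Default block size: " + fs.getDefaultBlockSize()); // block size in bytes
  }
}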
HDFS Interfaces
Hadoop is written in Java, and all Hadoop filesystem interactions are mediated through the Java
API. The filesystem shell, for example, is a Java application that uses the Java FileSystem class
to provide filesystem operations. The other filesystem interfaces are discussed briefly in this
section. These interfaces are most commonly used with HDFS, since the other filesystems in
Hadoop typically have existing tools to access the underlying filesystem (FTP clients for FTP, S3
tools for S3, etc.), but many of them will work with any Hadoop filesystem.
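As a minimal sketch of the Java API (the namenode address and file path are hypothetical), the following program opens a file in HDFS through the FileSystem class and copies its contents to standard output:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode:9000/user/demo/sample.txt"; // hypothetical location
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);    // resolve the filesystem from the URI scheme
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));                            // open the file for reading
      IOUtils.copyBytes(in, System.out, 4096, false);         // stream its contents to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

Because FileSystem.get resolves the implementation from the URI scheme, the same code works against any Hadoop filesystem, not just HDFS.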
Thrift
The Thrift API in the “thriftfs” contrib module remedies the limitation that the native
filesystem API is Java-only: it exposes Hadoop filesystems as an Apache Thrift service, making it
easy for any language that has Thrift bindings to interact with a Hadoop filesystem such as HDFS.
To use the Thrift API, run a Java server that exposes the Thrift service and acts as a proxy to the
Hadoop filesystem. Your application accesses the Thrift service, which is typically running on the
same machine as your application. For installation and usage instructions, please refer to the
documentation in the src/contrib/thriftfs directory of the Hadoop distribution.
C
Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface (it was
written as a C library for accessing HDFS, but despite its name it can be used to access any
Hadoop filesystem). It works using the Java Native Interface (JNI) to call a Java filesystem client.
FUSE
Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space to be
integrated as a Unix filesystem. Hadoop’s Fuse-DFS contrib module allows any Hadoop
filesystem (but typically HDFS) to be mounted as a standard filesystem. You can then use Unix
utilities (such as ls and cat) to interact with the filesystem, as well as POSIX libraries to access
the filesystem from any programming language.
WebDAV
WebDAV is a set of extensions to HTTP to support editing and updating files. WebDAV shares
can be mounted as filesystems on most operating systems, so by exposing HDFS (or other
Hadoop filesystems) over WebDAV, it’s possible to access HDFS as a standard filesystem.
At the time of this writing, WebDAV support in Hadoop (which is implemented by calling the
Java API to Hadoop) is still under development, and can be tracked at
https://issues.apache.org/jira/browse/HADOOP-496.
HTTP
HDFS defines a read-only interface for retrieving directory listings and data over HTTP.
Directory listings are served by the namenode’s embedded web server (which runs on port
50070) in XML format, while file data is streamed from datanodes by their web servers (running
on port 50075). This protocol is not tied to a specific HDFS version, making it possible to write
clients that can use HTTP to read data from HDFS clusters that run different versions of Hadoop.
HftpFileSystem is one such client: it is a Hadoop filesystem that talks to HDFS over HTTP
(HsftpFileSystem is the HTTPS variant).
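As a minimal sketch (the namenode host and file path are hypothetical), a client can read over this HTTP interface by using the hftp scheme with the ordinary FileSystem API:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HftpRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // hftp talks to the namenode's embedded web server (port 50070) and is read-only
    FileSystem fs = FileSystem.get(URI.create("hftp://namenode:50070/"), conf);
    InputStream in = fs.open(new Path("/user/demo/sample.txt")); // hypothetical path
    IOUtils.copyBytes(in, System.out, 4096, true);               // true closes the stream when done
  }
}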
FTP
Although not complete at the time of this writing (https://issues.apache.org/jira/browse/HADOOP-3199),
there is an FTP interface to HDFS, which permits the use of the FTP
protocol to interact with HDFS. This interface is a convenient way to transfer data into and out of
HDFS using existing FTP clients.
The FTP interface to HDFS is not to be confused with FTPFileSystem, which exposes any FTP
server as a Hadoop filesystem.
MapReduce Data Flow
The data stored in HDFS is processed through the following MapReduce phases (a minimal
word-count sketch in Java follows the list):
Input Reader:
The input reader reads the incoming data and splits it into data blocks of the
appropriate size (64 MB to 128 MB).
Once the input is read, the reader generates the corresponding key-value pairs.
The input files reside in HDFS.
Map Function:
The map function processes the incoming key-value pairs and generates the
corresponding output key-value pairs.
The map input and output types may be different from each other.
Partition Function:
The partition function assigns the output of each map function to the appropriate
reducer.
It is given the key and value, along with the number of reducers, and it returns the
index of the appropriate reducer.
Shuffling and Sorting:
The data are shuffled between nodes so that they move from the map tasks and become
ready for processing by the reduce function.
The input data for the reduce function are sorted by key.
Reduce Function:
The reduce function is called once for each unique key, with the keys processed in
sorted order.
It iterates over the values associated with each key and generates the
corresponding output.
Output Writer:
Once the data have flowed through all the above phases, the output writer executes.
The role of the output writer is to write the reduce output to stable storage.
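The following is a minimal word-count sketch (class names and paths are illustrative) using the org.apache.hadoop.mapreduce API, showing how the map and reduce phases above fit together; the default hash partitioner and the framework's shuffle and sort are used implicitly:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts for each unique, sorted key
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}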
Consolidation
A very common scenario in log collection is a large number of log-producing clients
sending data to a few consumer agents that are attached to the storage subsystem.
For example, logs collected from hundreds of web servers are sent to a dozen agents that
write to an HDFS cluster.
This can be achieved in Flume by configuring a number of first-tier agents with an
Avro sink, all pointing to the Avro source of a single agent.
The source of this second-tier agent consolidates the received events into a single channel,
which is consumed by a sink that writes to the final destination.
For example, if data is collected from Facebook, Twitter, and Instagram, then one agent
works with each web server, and all three agents are connected to one single centralised
agent. That centralised agent aggregates all the collected data and sends it to centralised
storage (HDFS). The consolidation process is depicted in the figure below, and a hedged
Flume configuration sketch for this two-tier topology follows.
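As an illustrative sketch only (agent names, hostnames, ports, and the HDFS path are hypothetical), a minimal two-tier Flume configuration in the standard properties format could look like this:

# First-tier agent (one per web server): tail an application log and forward it over Avro
agent1.sources = webLog
agent1.channels = memCh
agent1.sinks = avroOut
agent1.sources.webLog.type = exec
agent1.sources.webLog.command = tail -F /var/log/app/access.log
agent1.sources.webLog.channels = memCh
agent1.channels.memCh.type = memory
agent1.sinks.avroOut.type = avro
agent1.sinks.avroOut.hostname = collector.example.com
agent1.sinks.avroOut.port = 4545
agent1.sinks.avroOut.channel = memCh

# Second-tier (consolidating) agent: receive events from all first-tier agents and write to HDFS
collector.sources = avroIn
collector.channels = memCh
collector.sinks = hdfsOut
collector.sources.avroIn.type = avro
collector.sources.avroIn.bind = 0.0.0.0
collector.sources.avroIn.port = 4545
collector.sources.avroIn.channels = memCh
collector.channels.memCh.type = memory
collector.sinks.hdfsOut.type = hdfs
collector.sinks.hdfsOut.hdfs.path = hdfs://namenode/flume/events
collector.sinks.hdfsOut.channel = memCh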
Apache Sqoop
Definition: Sqoop is a tool designed to transfer data between Hadoop and relational
databases.
We can use Sqoop to import data from a relational database management system
(RDBMS) into the Hadoop Distributed File System (HDFS) and export the data back
into a relational database.
Sqoop – SQL to Hadoop and Hadoop to SQL.
Sqoop can be used with any Java Database Connectivity (JDBC)-compliant database
and has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.
In version 1 of Sqoop, data were accessed using connectors written for specific
databases.
Version 2 does not use database-specific connectors; instead it uses a generic JDBC
connector for data transfer.
Sqoop Tool
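As a hedged illustration of the import tool (the JDBC URL, credentials, table, and target directory are hypothetical), a typical invocation from the command line looks like this:

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser -P \
  --table customers \
  --target-dir /user/demo/customers \
  --num-mappers 4

The corresponding sqoop export command moves data from an HDFS directory back into a relational table.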
CRC File
For every file created in the Hadoop LocalFileSystem, a hidden file named .<filename>.crc is
created in the same directory. This file maintains the checksum of each chunk of data (512
bytes) in the file. This checksum metadata lets the LocalFileSystem detect read errors and
throw a ChecksumException when a corrupted chunk is read.
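As a minimal sketch (the file name is illustrative), writing through the checksummed LocalFileSystem shows the hidden .crc file being created:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class CrcDemo {
  public static void main(String[] args) throws Exception {
    LocalFileSystem fs = FileSystem.getLocal(new Configuration()); // checksummed local filesystem
    FSDataOutputStream out = fs.create(new Path("demo.txt"));
    out.writeUTF("hello hdfs");
    out.close();
    // The working directory now also contains a hidden .demo.txt.crc file holding
    // the checksum of each 512-byte chunk of demo.txt.
  }
}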
Compression
Keeping in mind the volume of data Hadoop deals with, compression is not a luxury but a
requirement. When used correctly, file compression brings obvious benefits in Hadoop: it
economizes storage requirements and speeds up data transmission over the network and to
and from disk. There are many tools, techniques, and algorithms commonly used by Hadoop.
Many of them are quite popular and have been used in file compression for ages. For
example, gzip, bzip2, LZO, zip, and so forth are often used.
Codecs
If you are reading a compressed file, you can normally infer the codec to use by looking at its
filename extension. A file ending in .gz can be read with GzipCodec, and so on. Commonly
used compression formats include DEFLATE (.deflate), gzip (.gz), bzip2 (.bz2), and LZO (.lzo).
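As a minimal sketch (the input path is hypothetical), a codec can be inferred from the filename extension with CompressionCodecFactory and then used to decompress the stream while reading:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressByExtension {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path("logs/events.gz");                  // hypothetical compressed file
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(input);         // GzipCodec, inferred from the .gz extension
    InputStream in = codec.createInputStream(fs.open(input)); // decompress while reading
    IOUtils.copyBytes(in, System.out, 4096, true);
  }
}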
Imagine a gzip-compressed file whose compressed size is 1 GB. With a block size of 64 MB,
HDFS will store the file as 16 blocks. However, creating a split for each block won’t work
since it is impossible to start reading at an arbitrary point in the gzip stream, and therefore
impossible for a map task to read its split independently of the others. The gzip format uses
DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed
blocks. The problem is that the start of each block is not distinguished in any way that would
allow a reader positioned at an arbitrary point in the stream to advance to the beginning of the
next block, thereby synchronizing itself with the stream. For this reason, gzip does not
support splitting.
In this case, MapReduce will do the right thing, and not try to split the gzipped file, since it
knows that the input is gzip-compressed (by looking at the filename extension) and that gzip
does not support splitting. This will work, but at the expense of locality: a single map will
process the 16 HDFS blocks, most of which will not be local to the map. Also, with fewer
maps, the job is less granular, and so may take longer to run.
If the file in our hypothetical example were an LZO file, we would have the same problem
since the underlying compression format does not provide a way for a reader to synchronize
itself with the stream. A bzip2 file, however, does provide a synchronization marker between
blocks (a 48-bit approximation of pi), so it does support splitting. (Of the formats discussed
here, bzip2 is the only one that supports splitting.)
For collections of files, the issues are slightly different. ZIP is an archive format, so it can
combine multiple files into a single ZIP archive. Each file is compressed separately, and the
locations of all the files in the archive are stored in a central directory at the end of the ZIP
file. This property means that ZIP files support splitting at file boundaries, with each split
containing one or more files from the ZIP archive. At the time of this writing, however,
Hadoop does not support ZIP files as an input format.
Serialization
The process that turns structured objects into a stream of bytes is called serialization. It is
required for transmitting data over the network or persisting raw data on disk.
Deserialization is the reverse process, in which a stream of bytes is turned back into a
structured object; it is required to reconstruct objects from the raw bytes.
It is therefore not surprising that distributed computing uses serialization in two distinct
areas: inter-process communication and data persistence.
Hadoop uses RPC (Remote Procedure Call) for inter-process communication between
nodes. The RPC protocol uses serialization to render a message into a stream of bytes before
sending it across the network, and deserialization to turn the stream of bytes back into a
message at the receiving end. The serialization format must be compact, to make the best use
of network bandwidth, as well as fast, interoperable, and flexible enough to accommodate
protocol updates over time.
The Writable Interface
The Writable interface defines two methods: one for writing its state to a DataOutput binary stream,
and one for reading its state from a DataInput binary stream:
package org.apache.hadoop.io;
import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;
public interface Writable {
  void write(DataOutput out) throws IOException;     // serialize this object's fields to the stream
  void readFields(DataInput in) throws IOException;  // populate this object's fields from the stream
}
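As an illustration of how a Writable's state becomes a stream of bytes, the helper below (its name is hypothetical) serializes any Writable into a byte array using a DataOutputStream:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class WritableDemo {
  // Capture a Writable's serialized form in a byte array
  public static byte[] serialize(Writable writable) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream dataOut = new DataOutputStream(out);
    writable.write(dataOut); // the Writable writes its own state to the binary stream
    dataOut.close();
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] bytes = serialize(new IntWritable(163));
    System.out.println(bytes.length); // 4 -- an IntWritable serializes as a fixed 4-byte integer
  }
}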
Writable Classes
Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io package.
Text
Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of
java.lang.String. The Text class uses an int (with a variable-length encoding) to store the
number of bytes in the string encoding, so the maximum value is 2 GB.
Furthermore, Text uses standard UTF-8, which makes it potentially easier to interoperate
with other tools that understand UTF-8.
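A small illustration of Text, whose indexing works in terms of byte positions in the UTF-8 encoding (the values in the comments hold for this ASCII example):

import org.apache.hadoop.io.Text;

public class TextDemo {
  public static void main(String[] args) {
    Text t = new Text("hadoop");
    System.out.println(t.getLength()); // 6 -- number of bytes in the UTF-8 encoding
    System.out.println(t.charAt(2));   // 100 -- the int code point for 'd'
    System.out.println(t.find("do"));  // 2 -- byte offset of the substring
  }
}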
Avro
Avro is an open source project that provides data serialization and data exchange services for
Apache Hadoop. These services can be used together or independently. Avro facilitates the
exchange of big data between programs written in any language. With the serialization
service, programs can efficiently serialize data into files or into messages. The data storage is
compact and efficient. Avro stores both the data definition and the data together in one
message or file.
Avro stores the data definition in JSON format making it easy to read and interpret; the data
itself is stored in binary format making it compact and efficient. Avro files include markers
that can be used to split large data sets into subsets suitable for Apache
MapReduce processing. Some data exchange services use a code generator to interpret the
data definition and produce code to access the data. Avro doesn't require this step, making it
ideal for scripting languages.
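As a hedged example (the record and field names are hypothetical), a data definition in Avro's JSON schema format looks like this:

{
  "type": "record",
  "name": "LogEvent",
  "namespace": "com.example.logs",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "level",     "type": "string"},
    {"name": "message",   "type": "string"}
  ]
}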
File-based Data Structures
Hadoop provides a couple of high-level containers that offer specialized data structures for
holding particular types of data. For example, to maintain a binary log,
the SequenceFile container provides a data structure to persist binary key-value pairs. We
can then use a key such as a timestamp, represented by a LongWritable, and a value that is
any Writable, representing the logged quantity.
SequenceFile and its sorted, indexed counterpart, MapFile, are interoperable and can be
converted to and from each other.
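As a minimal sketch of the binary-log use case above (the output path and messages are illustrative), a SequenceFile can be written with a LongWritable timestamp key and a Text value:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("logs.seq"); // hypothetical output file
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, LongWritable.class, Text.class);
    try {
      // Each record is a (timestamp, message) pair stored in binary form
      writer.append(new LongWritable(System.currentTimeMillis()), new Text("service started"));
      writer.append(new LongWritable(System.currentTimeMillis()), new Text("request received"));
    } finally {
      writer.close();
    }
  }
}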