Big Data Analytics

Unit II
HDFS
The Hadoop Distributed File System (HDFS) was developed using distributed file system design and runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and designed for low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data, files are stored across multiple machines. These files are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.

Features of HDFS
 It is suitable for distributed storage and processing.
 Hadoop provides a command interface to interact with HDFS.
 The built-in servers of the namenode and datanode help users easily check the status of the cluster.
 Streaming access to file system data.
 HDFS provides file permissions and authentication.

HDFS Architecture
Given below is the architecture of a Hadoop File System.

HDFS follows a master-slave architecture and has the following elements.
Namenode
The namenode is commodity hardware that runs a Linux operating system and the namenode software. The namenode software can run on low-cost commodity hardware. The system hosting the namenode acts as the master server and performs the following tasks −
 Manages the file system namespace.
 Regulates client’s access to files.
 It also executes file system operations such as renaming, closing, and opening
files and directories.
Datanode
A datanode is commodity hardware that runs a Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there is a datanode. These nodes manage the data storage of their system.
 Datanodes perform read-write operations on the file systems, as per client
request.
 They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
Block
Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored on individual datanodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be changed as needed through the HDFS configuration.
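As an illustration (a minimal sketch, not part of the original notes; the path and property value are hypothetical), the block size can be changed through the Hadoop configuration. In Hadoop 2.x the property is dfs.blocksize (older releases used dfs.block.size):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Request a 128 MB block size for files created with this configuration.
        // "dfs.blocksize" is the Hadoop 2.x name; older releases used "dfs.block.size".
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/large-file.dat"));
        out.writeUTF("example payload");
        out.close();
    }
}

The same setting can of course be made cluster-wide in hdfs-site.xml rather than per application.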

Command Line Interface:

 HDFS can be manipulated through a Java API or through a command-line interface.
 The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports.
 Below are the supported commands (a sketch of equivalent Java API calls follows the list):
 appendToFile: Appends the content of a local file to a file in HDFS.
 cat: Copies source paths to stdout.
 checksum: Returns the checksum information of a file.
 chgrp : Change group association of files. The user must be the owner of files,
or else a super-user.
 chmod : Change the permissions of files. The user must be the owner of the
file, or else a super-user.
 chown: Change the owner of files. The user must be a super-user.
 copyFromLocal: Copies files from the local file system (for example, an edge node) to HDFS.
 copyToLocal: Copies files from HDFS to the local file system.
 count: Count the number of directories, files and bytes under the paths that
match the specified file pattern.
 cp: Copy files from source to destination. This command allows multiple
sources as well in which case the destination must be a directory.
 createSnapshot: HDFS Snapshots are read-only point-in-time copies of the
file system. Snapshots can be taken on a subtree of the file system or the entire
file system. Some common use cases of snapshots are data backup, protection
against user errors and disaster recovery.
 deleteSnapshot: Delete a snapshot from a snapshot table directory. This
operation requires the owner privilege of the snapshottable directory.
 df: Displays free space.
 du: Displays the sizes of files and directories contained in the given directory, or the length of a file in case it's just a file.
 expunge: Empty the Trash.
 find: Finds all files that match the specified expression and applies selected
actions to them. If no path is specified then defaults to the current working
directory. If no expression is specified then defaults to -print.
 get: Copies files to the local file system.
 getfacl: Displays the Access Control Lists (ACLs) of files and directories. If
a directory has a default ACL, then getfacl also displays the default ACL.
 getfattr: Displays the extended attribute names and values for a file or
directory.
 getmerge : Takes a source directory and a destination file as input and
concatenates files in src into the destination local file.
 help: Return usage output.
 ls: Lists files.
 lsr: Recursive version of ls.
 mkdir: Takes path URI’s as argument and creates directories.
 moveFromLocal: Similar to put command, except that the source localsrc
is deleted after it’s copied.
 moveToLocal: Displays a “Not implemented yet” message.
 mv: Moves files from source to destination. This command allows multiple
sources as well in which case the destination needs to be a directory.
 put : Copy single src, or multiple srcs from local file system to the
destination file system. Also reads input from stdin and writes to
destination file system.
 renameSnapshot : Rename a snapshot. This operation requires the owner
privilege of the snapshottable directory.
 rm : Delete files specified as args.
 rmdir : Delete a directory.
 rmr : Recursive version of delete.
 setfacl : Sets Access Control Lists (ACLs) of files and directories.
 setfattr : Sets an extended attribute name and value for a file or directory.
 setrep: Changes the replication factor of a file. If the path is a directory
then the command recursively changes the replication factor of all files
under the directory tree rooted at the path.
 stat : Print statistics about the file/directory at <path> in the specified
format.
 tail: Displays the last kilobyte of the file to stdout.
 test: Tests file or directory properties; usage: hadoop fs -test -[defsz] URI.
 text: Takes a source file and outputs the file in text format. The allowed
formats are zip and TextRecordInputStream.
 touchz: Create a file of zero length.
 truncate: Truncate all files that match the specified file pattern to the specified
length.
 usage: Return the help for an individual command.
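As noted above, the same operations are available through the Java API. The sketch below (directory and file names are hypothetical) uses the FileSystem class to perform a few of the listed commands: mkdir, copyFromLocal (put), ls, and rm.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsShellEquivalents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // mkdir: create a directory in HDFS
        fs.mkdirs(new Path("/user/hadoop/test"));

        // copyFromLocal / put: copy a local file into HDFS
        fs.copyFromLocalFile(new Path("/tmp/data.txt"),
                             new Path("/user/hadoop/test/data.txt"));

        // ls: list the contents of a directory
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop/test"))) {
            System.out.println(status.getPath() + " " + status.getLen());
        }

        // rm: delete a file (the second argument controls recursive deletion)
        fs.delete(new Path("/user/hadoop/test/data.txt"), false);
    }
}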

HDFS Interfaces
Hadoop is written in Java, and all Hadoop filesystem interactions are mediated through the Java
API. The filesystem shell, for example, is a Java application that uses the Java FileSystem class
to provide filesystem operations. The other filesystem interfaces are discussed briefly in this
section. These interfaces are most commonly used with HDFS, since the other filesystems in
Hadoop typically have existing tools to access the underlying filesystem (FTP clients for FTP, S3
tools for S3, etc.), but many of them will work with any Hadoop filesystem.

Thrift
Because Hadoop's filesystem interface is a Java API, it is awkward for non-Java applications to access HDFS directly. The Thrift API in the “thriftfs” contrib module remedies this deficiency by exposing Hadoop filesystems as an Apache Thrift service, making it easy for any language that has Thrift bindings to interact with a Hadoop filesystem, such as HDFS.

To use the Thrift API, run a Java server that exposes the Thrift service and acts as a proxy to the
Hadoop filesystem. Your application accesses the Thrift service, which is typically running on the
same machine as your application. For installation and usage instructions, please refer to the
documentation in the src/contrib/thriftfs directory of the Hadoop distribution.

C
Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface (it was
written as a C library for accessing HDFS, but despite its name it can be used to access any
Hadoop filesystem). It works using the Java Native Interface (JNI) to call a Java filesystem client.

FUSE
Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space to be
integrated as a Unix filesystem. Hadoop’s Fuse-DFS contrib module allows any Hadoop
filesystem (but typically HDFS) to be mounted as a standard filesystem. You can then use Unix
utilities (such as ls and cat) to interact with the filesystem, as well as POSIX libraries to access
the filesystem from any programming language.

Fuse-DFS is implemented in C using libhdfs as the interface to HDFS. Documentation for compiling and running Fuse-DFS is located in the src/contrib/fuse-dfs directory of the Hadoop distribution.

WebDAV
WebDAV is a set of extensions to HTTP to support editing and updating files. WebDAV shares can be mounted as filesystems on most operating systems, so by exposing HDFS (or other Hadoop filesystems) over WebDAV, it's possible to access HDFS as a standard filesystem.
At the time of this writing, WebDAV support in Hadoop (which is implemented by calling the Java API to Hadoop) is still under development, and can be tracked at https://issues.apache.org/jira/browse/HADOOP-496.

Other HDFS Interfaces


There are two interfaces that are specific to HDFS:

HTTP
HDFS defines a read-only interface for retrieving directory listings and data over HTTP. Directory listings are served by the namenode's embedded web server (which runs on port 50070) in XML format, while file data is streamed from datanodes by their web servers (running on port 50075). This protocol is not tied to a specific HDFS version, making it possible to write clients that can use HTTP to read data from HDFS clusters that run different versions of Hadoop. HftpFileSystem is one such client: it is a Hadoop filesystem that talks to HDFS over HTTP (HsftpFileSystem is the HTTPS variant).
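A minimal sketch of a client that reads a file over this HTTP interface through HftpFileSystem is shown below (the hostname "namenode", port 50070, and the file path are placeholders; availability of the hftp scheme depends on the Hadoop version):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HftpReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The hftp:// scheme selects the HTTP-based filesystem client.
        FileSystem fs = FileSystem.get(URI.create("hftp://namenode:50070/"), conf);
        InputStream in = fs.open(new Path("/user/hadoop/data.txt"));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}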

FTP
Although not complete at the time of this writing (https://issues.apache.org/jira/browse/HADOOP-3199), there is an FTP interface to HDFS, which permits the use of the FTP protocol to interact with HDFS. This interface is a convenient way to transfer data into and out of HDFS using existing FTP clients.

The FTP interface to HDFS is not to be confused with FTPFileSystem, which exposes any FTP
server as a Hadoop filesystem.

HDFS Interfaces:
Features of HDFS interfaces are:

1. Create new file
2. Upload files/folder
3. Set Permission
4. Copy
5. Move
6. Rename
7. Delete
8. Drag and Drop
9. HDFS File viewer
Data Flow :

 MapReduce is used to compute huge amounts of data.
 To handle the incoming data in a parallel and distributed form, the data has to flow through the following phases (a minimal MapReduce sketch follows the list):

 Input Reader :
 The input reader reads the incoming data and splits it into data blocks of the appropriate size (64 MB to 128 MB).
 Once the input reader reads the data, it generates the corresponding key-value pairs.
 The input files reside in HDFS.
 Map Function :
 The map function processes the incoming key-value pairs and generates the corresponding output key-value pairs.
 The map input and output types may be different from each other.
 Partition Function :
 The partition function assigns the output of each map function to the appropriate reducer.
 Given a key and value (and the number of reducers), it returns the index of the reducer.
 Shuffling and Sorting :
 The data is shuffled between nodes so that the map output is moved to the nodes where the reduce function will process it.
 A sorting operation is performed on the input to the Reduce function.
 Reduce Function :
 The Reduce function is invoked once for each unique key.
 These keys are already arranged in sorted order.
 The Reduce function iterates over the values associated with each key and generates the corresponding output.
 Output Writer :
 Once the data has flowed through all the above phases, the output writer executes.
 The role of the output writer is to write the Reduce output to stable storage.
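To make these phases concrete, here is a minimal word-count sketch using the Hadoop MapReduce Java API (class names are illustrative, not from the original notes): the map function emits (word, 1) pairs, the framework partitions, shuffles, and sorts them by key, and the reduce function sums the values for each key before the output writer stores the result.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map function: emits (word, 1) for every word in a line of input.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce function: called once per unique key with all of its values.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}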

Data Ingest with Apache Flume

Definition: Apache Flume is a tool used for collecting, aggregating, and transporting large amounts of streaming data to a centralized data store.
Flume Properties
 Streaming data is continuously generated data, for example log files, social media data, email messages, or just about any continuous data.
 Flume is a highly reliable, distributed, and configurable tool.
 It is principally designed to copy streaming data (continuous data) from various web servers to HDFS.
HDFS put Command
 The Hadoop file system shell provides commands to insert data into Hadoop and read from it.
 We can use the 'put' command to transfer data into HDFS, as shown below:

$ hadoop fs -put <local source path> <HDFS destination path>

Drawbacks of put command


 Using the put command, we can transfer only one file at a time.
 With the put command, the data needs to be packaged and ready for upload; since a web server generates data continuously, this is a very difficult task.
Here we need a solution that can overcome the drawbacks of the put command and transfer streaming data from data generators to centralised stores (especially HDFS) with less delay.
This is where Flume comes into the picture: Flume is used to transfer streaming data from different servers into HDFS.
Flume Architecture:
The figure below shows the overall process of data transfer through Flume. On one side of the figure are data generators such as Facebook, Twitter, and cloud storage that generate data continuously; this data is gathered and aggregated by Flume and then transferred to a centralised store (HDFS, HBase).
Apache Flume Agent
Apache Flume itself has three components, shown in the figure below. The Flume agent has three components, named Source, Channel, and Sink. On either side of the agent sit the data source and HDFS, so why can we not transfer the data directly from the web servers to HDFS? Because the rate at which the sources generate data may be much higher than the rate at which HDFS can accept it, and data might be lost as a result. Flume agents are used to prevent that data loss. In the agent, the data is collected from the web servers by the source and put into the channel. The channel works as a buffer in this case. The channel transfers the data to the sink, and the sink transfers the data to HDFS.
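As a minimal illustration (the agent and component names, log path, and HDFS URL below are hypothetical, not from the original notes), a Flume agent with one source, one channel, and one sink is described in a properties file along these lines:

# Name the components of the agent
agent1.sources = weblog-source
agent1.channels = mem-channel
agent1.sinks = hdfs-sink

# Source: tail a web server log
agent1.sources.weblog-source.type = exec
agent1.sources.weblog-source.command = tail -F /var/log/httpd/access_log
agent1.sources.weblog-source.channels = mem-channel

# Channel: acts as the buffer between source and sink
agent1.channels.mem-channel.type = memory
agent1.channels.mem-channel.capacity = 10000

# Sink: write the buffered events to HDFS
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/weblogs
agent1.sinks.hdfs-sink.channel = mem-channel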

Setting Multi Agent Flow


 In order for data to flow across multiple agents or hops, the sink of the previous agent and the source of the current agent need to be of Avro type, with the sink pointing to the hostname (or IP address) and port of the source.
 Within Flume there can be multiple agents, and before reaching the final destination an event may travel through more than one agent.
 This is known as multi-agent flow.
If we want to connect two agents serially, keep in mind that the sink of the first agent and the source of the other agent should be of type Avro. The serial connection is depicted in the figure below:

Consolidation
 A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents that are attached to the storage subsystem.
 For example, logs collected from hundreds of servers are sent to a dozen agents that write to an HDFS cluster.
 This can be achieved in Flume by configuring a number of first-tier agents with an Avro sink, all pointing to the Avro source of a single agent.
 The source of this second-tier agent consolidates the received events into a single channel, which is consumed by a sink that writes to the final destination.
For example, if data is collected from Facebook, Twitter, and Instagram, then one agent will work with each web server, and all three agents will be connected to a single centralised agent; that centralised agent will aggregate all the collected data and send it to the centralised storage (HDFS). The consolidation process is depicted in the figure below:
Apache Sqoop
Definition: Sqoop is a tool designed to transfer data between Hadoop and relational databases.
 We can use Sqoop to import data from a relational database management system (RDBMS) into the Hadoop Distributed File System (HDFS) and export the data back into the relational database.
Sqoop – SQL to Hadoop and Hadoop to SQL
 Sqoop can be used with any Java Database Connectivity (JDBC)-compliant database and has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.
 In version 1 of Sqoop, data was accessed using connectors written for specific databases.
 Version 2 does not support database-specific connectors; instead it uses a generic JDBC connector for data transfer.

Sqoop Tool

Sqoop sits between an RDBMS (MySQL, Oracle, PostgreSQL, SQL Server) and the Hadoop file system (HDFS, Hive, HBase): Import moves data from the RDBMS into Hadoop, and Export moves it back.

Apache Sqoop Import method


 The import is done in two steps.
 In the first step, Sqoop examines the database to gather the necessary metadata for the data to be imported.
 In the second step, Sqoop submits a map-only job (no reduce tasks) to the cluster.
 This job does the actual data transfer using the metadata captured in the previous step.
 Note that each node doing the import must have access to the database.
 The imported data is saved in an HDFS directory.
 Sqoop will use the table name for the directory, or the user can specify an alternative directory where the data will be populated.
 By default, these files contain comma-delimited fields, with new lines separating records. The user can also specify a delimiter of his choice.
 Content in HDFS: Rohan Sharma,30,CSE\nRahul,30,ECE
 Once placed in HDFS, the data is ready for processing (a sample import command is shown after the table below).
Name Age Department
Rohan Sharma 30 CSE
Rahul 30 ECE
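As a sample (the database host, credentials, database, and table name are hypothetical), an import of such a table into HDFS could be invoked as follows:

$ sqoop import \
    --connect jdbc:mysql://dbhost/university \
    --username dbuser --password dbpass \
    --table employees \
    --target-dir /user/hadoop/employees \
    --fields-terminated-by ',' \
    --num-mappers 4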

Sqoop import Process Design

Apache Sqoop Export task


 Data export from HDFS works in a similar fashion.
 The export is done in two steps.
 The first step is to examine the database for metadata.
 The export step uses a map-only job to write the data to the database.
 Sqoop divides the input data into parts and then pushes the individual map tasks to the database.
 Again, this process assumes the map tasks have access to the database (a sample export command is shown below).
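A sample export command (again with hypothetical names) that writes an HDFS directory back into a relational table might look like this:

$ sqoop export \
    --connect jdbc:mysql://dbhost/university \
    --username dbuser --password dbpass \
    --table employees_backup \
    --export-dir /user/hadoop/employees \
    --input-fields-terminated-by ','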
Apache Sqoop Version Changes:

Hadoop Input and Output:


Like any I/O subsystem, Hadoop comes with a set of I/O primitives. These primitives, although generic in nature, apply to the Hadoop I/O system as well, with some special connotations. Hadoop deals with multi-terabyte datasets; a closer look at these primitives gives an idea of how Hadoop handles data input and output.
Data Integrity
Data integrity means that data should remain accurate and consistent across its storage, processing, and retrieval operations. To ensure that no data is lost or corrupted during persistence and processing, Hadoop maintains stringent data integrity constraints. Every read/write operation, whether on disk or over the network, is prone to errors, and the volume of data that Hadoop handles only aggravates the situation. The usual way to detect corrupt data is through checksums.
How a Checksum is Computed
A checksum is computed when data first enters the system and is sent across the channel during the retrieval process. The retrieving end computes the checksum again and matches it with the received one. If they match exactly, the data is deemed error free; otherwise it contains an error. But what if the checksum sent is itself corrupt? This is highly unlikely because it is a small amount of data, but not an impossible scenario; using the right kind of hardware, such as ECC memory, can help alleviate the situation.
Checksums only provide detection. To handle corruption efficiently, Hadoop uses a further technique, CRC (Cyclic Redundancy Check) checksums, described below, and relies on replicas to recover the data.
CRC (Cyclic Redundancy Check)
Hadoop creates a distinct checksum for every 512 bytes (the default) of data. Because a CRC-32 checksum is only 4 bytes, the storage overhead is not an issue. All data that enters the system is verified by the datanodes before being forwarded for storage or further processing. Data sent to the datanode pipeline is verified through checksums, and any corruption found is immediately reported to the client with a ChecksumException. Client reads from a datanode go through the same drill. The datanodes maintain a log of checksum verifications to keep track of verified blocks; the log is updated when a datanode receives a block verification success signal from a client. These statistics help in keeping bad disks at bay.
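The idea of checksumming fixed-size chunks can be illustrated with the CRC-32 implementation in the standard Java library (this is only a sketch of the principle, not Hadoop's internal code, which uses its own optimized CRC routines):

import java.util.zip.CRC32;

public class ChunkChecksumSketch {
    // Compute one CRC-32 checksum (4 bytes of information) per fixed-size chunk.
    public static long[] checksumChunks(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] checksums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int offset = i * bytesPerChecksum;
            int length = Math.min(bytesPerChecksum, data.length - offset);
            crc.reset();
            crc.update(data, offset, length);
            checksums[i] = crc.getValue();
        }
        return checksums;
    }

    public static void main(String[] args) {
        byte[] data = "some block data".getBytes();
        // 512 bytes per checksum mirrors Hadoop's default chunk size.
        long[] sums = checksumChunks(data, 512);
        System.out.println("chunks checksummed: " + sums.length);
    }
}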
DataBlockScanner
Apart from this, a periodic verification on the block store is made with the help
of DataBlockScanner running along with the datanode thread in the background. This
protects data from corruption in the physical storage media.
Replicas
Hadoop maintains copies, or replicas, of data. These are used specifically to recover data from corruption. When a client detects an error while reading a block, it reports the bad block, and the datanode it was reading from, to the namenode before throwing a ChecksumException. The namenode then marks the block as a bad block and directs any further reference to the block to its replicas. A new replica is then created from a good copy, and the marked bad block is removed from the system.

.crc File

For every file created in the Hadoop LocalFileSystem, a hidden file named .filename.crc is created in the same directory. This file maintains the checksum of each chunk of data (512 bytes) in the file. This metadata helps the LocalFileSystem detect a read error and throw a ChecksumException.
Compression

Keeping in mind the volume of data Hadoop deals with, compression is not a luxury but a requirement. File compression has many obvious benefits that Hadoop puts to good use: it economizes storage requirements and speeds up data transmission over the network and to and from disk. Many tools, techniques, and algorithms are commonly used by Hadoop; several of them, such as gzip, bzip2, LZO, and zip, are quite popular and have been used in file compression for ages.

Codecs

A codec is the implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface.

Compressing and decompressing streams with CompressionCodec


CompressionCodec has two methods that allow you to easily compress or decompress data. To compress data being written to an output stream, use the createOutputStream(OutputStream out) method to create a CompressionOutputStream, to which you write your uncompressed data to have it written in compressed form to the underlying stream. Conversely, to decompress data being read from an input stream, call createInputStream(InputStream in) to obtain a CompressionInputStream, which allows you to read uncompressed data from the underlying stream. The table below lists the compression codecs available for Hadoop.
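Following that pattern, the sketch below compresses data read from standard input and writes it, gzip-compressed, to standard output. GzipCodec is instantiated via ReflectionUtils so that it picks up the Hadoop configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        CompressionCodec codec =
                (CompressionCodec) ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Wrap stdout in a compressing stream and copy stdin through it.
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();   // flush the compressed stream without closing System.out
    }
}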

Compression format Hadoop CompressionCodec


DEFLATE org.apache.hadoop.io.compress.DefaultCodec
Gzip org.apache.hadoop.io.compress.GzipCodec
bzip2 org.apache.hadoop.io.compress.BZip2Codec
LZO com.hadoop.compression.lzo.LzopCodec

Inferring CompressionCodecs using CompressionCodecFactory

If you are reading a compressed file, you can normally infer the codec to use by looking at its
filename extension. A file ending in .gz can be read with GzipCodec, and so on. Following
are the compression formats available.

Compression format   Tool    Algorithm   Filename extension   Multiple files   Splittable
DEFLATE[a]           N/A     DEFLATE     .deflate             No               No
gzip                 gzip    DEFLATE     .gz                  No               No
ZIP                  zip     DEFLATE     .zip                 Yes              Yes, at file boundaries
bzip2                bzip2   bzip2       .bz2                 No               Yes
LZO                  lzop    LZO         .lzo                 No               No
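The sketch below (the input path is hypothetical) uses CompressionCodecFactory to pick the codec from the filename extension and then decompresses the file while reading it:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputPath = new Path("/user/hadoop/logs/access.log.gz"); // hypothetical file
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);   // inferred from ".gz"
        if (codec == null) {
            System.err.println("No codec found for " + inputPath);
            return;
        }

        InputStream in = codec.createInputStream(fs.open(inputPath));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}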

Compression and Input Splits

When considering how to compress data that will be processed by MapReduce, it is important to understand whether the compression format supports splitting. Consider an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 64 MB, the file will be stored as 16 blocks, and a MapReduce job using this file as input will create 16 input splits, each processed independently as input to a separate map task.

Imagine now the file is a gzip-compressed file whose compressed size is 1 GB. As before,
HDFS will store the file as 16 blocks. However, creating a split for each block won’t work
since it is impossible to start reading at an arbitrary point in the gzip stream, and therefore
impossible for a map task to read its split independently of the others. The gzip format uses
DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed
blocks. The problem is that the start of each block is not distinguished in any way that would
allow a reader positioned at an arbitrary point in the stream to advance to the beginning of the
next block, thereby synchronizing itself with the stream. For this reason, gzip does not
support splitting.

In this case, MapReduce will do the right thing, and not try to split the gzipped file, since it
knows that the input is gzip-compressed (by looking at the filename extension) and that gzip
does not support splitting. This will work, but at the expense of locality: a single map will
process the 16 HDFS blocks, most of which will not be local to the map. Also, with fewer
maps, the job is less granular, and so may take longer to run.

If the file in our hypothetical example were an LZO file, we would have the same problem
since the underlying compression format does not provide a way for a reader to synchronize
itself with the stream. A bzip2 file, however, does provide a synchronization marker between
blocks (a 48-bit approximation of pi), so it does support splitting. (The table above lists whether each compression format supports splitting.)

For collections of files, the issues are slightly different. ZIP is an archive format, so it can
combine multiple files into a single ZIP archive. Each file is compressed separately, and the
locations of all the files in the archive are stored in a central directory at the end of the ZIP
file. This property means that ZIP files support splitting at file boundaries, with each split
containing one or more files from the ZIP archive. At the time of this writing, however,
Hadoop does not support ZIP files as an input format.

Serialization

The process that turns structured objects into a stream of bytes is called serialization. It is specifically required for data transmission over the network or for persisting raw data on disk. Deserialization is just the reverse process, where a stream of bytes is transformed back into a structured object; it is required to reconstruct objects from raw bytes. Therefore, it is not surprising that distributed computing uses these processes in a couple of distinct areas: inter-process communication and data persistence.

Hadoop uses RPC (Remote Procedure Call) to enact inter-process communication between
nodes. Therefore, the RPC protocol uses the process of serialization and deserialization to
render a message to the stream of bytes and vice versa and sends it across the network.
However, the process must be compact enough to best use the network bandwidth, as well as
fast, interoperable, and flexible to accommodate protocol updates over time.
The Writable Interface
The Writable interface defines two methods: one for writing an object's state to a DataOutput binary stream, and one for reading its state from a DataInput binary stream:

package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

public interface Writable {

    void write(DataOutput out) throws IOException;

    void readFields(DataInput in) throws IOException;
}
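To see the interface in action, the helper below (a small sketch, not part of the Hadoop API) serializes any Writable into a byte array by handing it a DataOutputStream; an IntWritable, for example, serializes to exactly 4 bytes.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class WritableSerializer {
    // Capture the state of any Writable as a byte array.
    public static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = serialize(new IntWritable(163));
        System.out.println(bytes.length);   // prints 4
    }
}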

Writable Classes
Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io package.

Text
Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of java.lang.String. The Text class uses an int (with a variable-length encoding) to store the number of bytes in the string encoding, so the maximum value is 2 GB.
Furthermore, Text uses standard UTF-8, which makes it potentially easier to interoperate with other tools that understand UTF-8.
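A small illustration of the difference from java.lang.String: getLength() returns the number of bytes in the UTF-8 encoding, and charAt() returns the Unicode code point at a byte offset.

import org.apache.hadoop.io.Text;

public class TextExample {
    public static void main(String[] args) {
        Text t = new Text("hadoop");
        System.out.println(t.getLength());        // 6 bytes in the UTF-8 encoding
        System.out.println((char) t.charAt(2));   // 'd' (code point at byte offset 2)
        t.set("big data");                        // Text objects are mutable and reusable
        System.out.println(t);
    }
}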

Avro

Avro is an open source project that provides data serialization and data exchange services for
Apache Hadoop. These services can be used together or independently. Avro facilitates the
exchange of big data between programs written in any language. With the serialization
service, programs can efficiently serialize data into files or into messages. The data storage is
compact and efficient. Avro stores both the data definition and the data together in one
message or file.

Avro stores the data definition in JSON format making it easy to read and interpret; the data
itself is stored in binary format making it compact and efficient. Avro files include markers
that can be used to split large data sets into subsets suitable for Apache
MapReduce processing. Some data exchange services use a code generator to interpret the
data definition and produce code to access the data. Avro doesn't require this step, making it
ideal for scripting languages.
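For illustration, a hypothetical Avro data definition for a simple record is a small JSON document like the one below; the records themselves are then stored in compact binary form alongside this schema.

{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"},
    {"name": "department", "type": ["null", "string"], "default": null}
  ]
}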
File based data-structure
There are a couple of high-level containers in Hadoop that provide specialized data structures for holding particular types of data. For example, to maintain a binary log, the SequenceFile container provides a data structure to persist binary key-value pairs. We can then use a key such as a timestamp, represented by a LongWritable, and a value that is a Writable representing the logged quantity.

There is another container, a sorted derivation of SequenceFile, called MapFile. It provides an index for convenient lookups by key.

These two containers are interoperable and can be converted to and from each other.
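A minimal sketch of writing such a log with the SequenceFile writer is shown below (the output path and record contents are hypothetical); the key is a LongWritable timestamp and the value is a Text record. Note that this older createWriter(FileSystem, Configuration, ...) signature is deprecated in recent Hadoop releases in favour of the Writer.Option variants.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/hadoop/events.seq");   // hypothetical output file

        SequenceFile.Writer writer = null;
        try {
            // Key = timestamp (LongWritable), value = logged quantity (Text).
            writer = SequenceFile.createWriter(fs, conf, path,
                    LongWritable.class, Text.class);
            writer.append(new LongWritable(System.currentTimeMillis()),
                          new Text("sample log record"));
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}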
