
HADOOP

A Seminar Report
Submitted to
MR. RAMESH KUMAR
Submitted by
AVANTIKA BISHT
UNIVERSITY ROLL NO: 150216
in partial fulfillment for the award of the degree
of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
at
GOVIND BALLABH PANT INSTITUTE OF ENGINEERING AND
TECHNOLOGY
Department of Computer Science
Pauri Garhwal
FEBRUARY 2018
ABSTRACT

Hadoop is an Apache open source framework written in java that allows distributed
processing of large datasets across clusters of computers using simple programming models. A
Hadoop frame-worked application works in an environment that provides distributed storage and
computation across clusters of computers. Hadoop is designed to scale up from single server to
thousands of machines, each offering local computation and storage. The main reason for its
development was the rapid growth of big data. Big data is a term used to describe volumes of data
so large that loading them into a relational database for analysis would take too much time and
cost too much. In a nutshell, this is what Hadoop provides: reliable shared storage and analysis.
The storage is provided by HDFS, and the analysis by MapReduce.
The modest cost of commodity hardware makes Hadoop useful for storing and
combining data such as transactional, social media, sensor, machine, scientific, click streams,
etc. The low-cost storage lets you keep information that is not deemed currently critical but that
you might want to analyze later. Because Hadoop was designed to deal with volumes of data in a
variety of shapes and forms, it can run analytical algorithms. Big data analytics on Hadoop can
help your organization operate more efficiently, uncover new opportunities and derive next-level
competitive advantage. The sandbox approach provides an opportunity to innovate with minimal
investment.

II
Contents
1. Reason to Hadoop ................................................................................................................... 3
     What Is Big Data
     Big Data Challenges
     Google's Solution
     Hadoop: Big Data Solution
2. Introduction to Hadoop ........................................................................................................ 3-7
     History
     Introduction
     Features of Hadoop
     Hadoop Core Components
     How Does Hadoop Work
3. Hadoop Distributed File System ........................................................................................ 8-11
     Features of HDFS
     Description of HDFS
     Goals of HDFS
     Design of HDFS
4. MapReduce ....................................................................................................................... 12-14
     The Algorithm
5. Hadoop Ecosystem ............................................................................................................ 13-20
6. Advantages of Hadoop ...................................................................................................... 21-22
7. Applications of Hadoop .................................................................................................... 23-25
     When Not to Use Hadoop
     Disadvantages of Hadoop
8. Future Perspective .................................................................................................................. 26
9. Research Papers ...................................................................................................................... 27
10. Hadoop Research Topics ...................................................................................................... 28
11. Case Study ............................................................................................................................ 29
12. References ............................................................................................................................. 30

III
List of Figures

Figure 1.1: Growth of big data ……………………………………………………………1

Figure 1.2: Solving problem of big data by Hadoop………………………………………2

Figure 2.1: Hadoop Features……………………………………………………………….3

Figure 2.2: Hadoop Feature………………………………………………………………..5

Figure 2.3: Hadoop parts…………………………………………………………………….7

Figure 3.1: HDFS architecture……………………………………………………………..8

Figure 4.1: MapReduce…………………………………………………………………….13

Figure 5.1: Hadoop ecosystem…………………………………………………………….16

IV
ACKNOWLEDGEMENT

I would like to express my gratitude to Lord Almighty, the most Beneficent and the
most Merciful, for completion of this seminar report. I wish to thank my parents for their
continuing support and encouragement. I also wish to thank them for providing me with the
opportunity to reach this far in my studies. I would particularly like to thank my supervisor,
Mr. Ramesh Kumar, for his patience, support and encouragement throughout the completion of
this seminar report and for having faith in me. Last but not least, I am greatly indebted to all
other persons who directly or indirectly helped me during this work.

V
REASON TO HADOOP

Due to the advent of new technologies, devices, and communication means like social
networking sites, the amount of data produced by mankind is growing rapidly every year. The
amount of data produced by us from the beginning of time till 2003 was 5 billion gigabytes. If
you piled up this data in the form of disks, it would fill an entire football field. The same amount
was created every two days in 2011, and every ten minutes in 2013. This rate is still
growing enormously, and it is this growth that has led to big data.

1. What is Big Data?


Big Data is a collection of large datasets that cannot be processed using traditional
computing techniques. It is not a single technique or a tool; rather, it involves many areas of
business and technology. Thus, Big Data is characterized by huge volume, high velocity, and an
extensible variety of data.

Figure 1.1: Growth of big data


Source: www.edureka.co/blog/hadooptutorial

2. Big Data Challenges


The major challenges associated with big data are as follows:
1. Capturing data
2. Storage
3. Searching
4. Sharing
5. Transfer

3. Google’s Solution
Google solved this problem using an algorithm called MapReduce. This algorithm
divides the task into small parts, assigns them to many computers, and collects the results
from them, which, when integrated, form the resulting dataset.

Figure 1.2: Solving problem of big data by Hadoop


Source: www.edureka.co/blog/hadooptutorial

4. Hadoop: Big Data Solution


Using the solution provided by Google, Doug Cutting and his team developed an Open
Source Project called HADOOP. Hadoop runs applications using the MapReduce algorithm,
where the data is processed in parallel with others. In short, Hadoop is used to develop
applications that can perform complete statistical analysis on huge amounts of data.

To solve the storage and processing issues, two core components were created in
Hadoop – HDFS and YARN. HDFS solves the storage issue, as it stores the data in a distributed
fashion and is easily scalable, while YARN solves the processing issue by reducing the
processing time drastically.

-2-
INTRODUCTION TO HADOOP

1. History

According to its co-founders, Doug Cutting and Mike Cafarella, the genesis of Hadoop
was the "Google File System" paper that was published in October 2003. This paper spawned
another one from Google – "MapReduce: Simplified Data Processing on Large Clusters".
Development started on the Apache Nutch project, but was moved to the new Hadoop subproject
in January 2006. Doug Cutting, who was working at Yahoo! at the time, named it after his son's
toy elephant. The initial code that was factored out of Nutch consisted of about 5,000 lines of
code for HDFS and about 6,000 lines of code for MapReduce.

2. Introduction

Hadoop is an open-source software framework used for storing and processing Big
Data in a distributed manner on large clusters of commodity hardware. Hadoop is licensed under
the Apache v2 license. Hadoop was developed based on the paper written by Google on the
MapReduce system, and it applies concepts of functional programming. Hadoop is written in the
Java programming language and is one of the top-level Apache projects. Hadoop was
developed by Doug Cutting and Michael J. Cafarella.

Figure 2.1: Hadoop Features

Source: www.edureka.co/blog/hadooptutorial

-3-
3. Features of Hadoop

Reliability: When machines are working in tandem, if one of the machines fails, another machine
will take over the responsibility and work in a reliable and fault tolerant fashion. Hadoop
infrastructure has inbuilt fault tolerance features and hence, Hadoop is highly reliable.

Economical: Hadoop uses commodity hardware (like your PC or laptop). For example, in a small
Hadoop cluster, all your DataNodes can have normal configurations like 8-16 GB RAM with 5-
10 TB hard disks and Xeon processors, whereas using hardware-based RAID with Oracle for the
same purpose would cost at least five times more. So, the cost of ownership of a Hadoop-based
project is quite low. The Hadoop environment is also easier to maintain, and it is economical as
well. In addition, Hadoop is open source software, and hence there is no licensing cost.

Scalability: Hadoop has the inbuilt capability of integrating seamlessly with cloud-based
services. So, if you are installing Hadoop on a cloud, you don’t need to worry about the
scalability factor because you can go ahead and procure more hardware and expand your setup
within minutes whenever required.

Flexibility: Hadoop is very flexible in terms of its ability to deal with all kinds of data. As noted
earlier under “Variety”, data can be of any kind, and Hadoop can store and process it all,
whether it is structured, semi-structured or unstructured.

These 4 characteristics make Hadoop a front-runner as a solution to Big Data challenges.

4. Hadoop Core Components

One is HDFS (storage) and the other is YARN (processing). HDFS stands for Hadoop
Distributed File System, which is the scalable storage unit of Hadoop, whereas YARN is used to
process the data stored in HDFS in a distributed and parallel fashion.

-4-

Figure 2.2: Hadoop Feature

Source: www.edureka.co/blog/hadooptutorial

 Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
 Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on
commodity machines, providing very high aggregate bandwidth across the cluster;
 Hadoop YARN – a platform responsible for managing computing resources in clusters
and using them for scheduling users' applications; and
 Hadoop MapReduce – an implementation of the MapReduce programming model for
large-scale data processing.

The Hadoop framework itself is mostly written in the Java programming language,
with some native code in C and command line utilities written as shell scripts. Though
MapReduce Java code is common, any programming language can be used with "Hadoop
Streaming" to implement the "map" and "reduce" parts of the user's program

-5-
The core of Apache Hadoop consists of a storage part, known as Hadoop
Distributed File System (HDFS), and a processing part which is a MapReduce programming
model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then
transfers packaged code into nodes to process the data in parallel. This approach takes advantage
of data locality, where nodes manipulate the data they have access to. This allows the dataset to
be processed faster and more efficiently than it would be in a more conventional supercomputer
architecture that relies on a parallel file system where computation and data are distributed via
high-speed networking.

5. How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations to handle large-
scale processing, but as an alternative, you can tie together many single-CPU commodity
computers into a single functional distributed system; in practice, the clustered machines can
read the dataset in parallel and provide much higher throughput. Moreover, this is cheaper than
one high-end server. This is the first motivation for using Hadoop: it runs across clustered,
low-cost machines. Hadoop runs code across a cluster of computers. This
process includes the following core tasks that Hadoop performs:
 Data is initially divided into directories and files. Files are divided into uniform-sized
blocks of 128 MB or 64 MB (preferably 128 MB).
 These files are then distributed across various cluster nodes for further processing.
 HDFS, being on top of the local file system, supervises the processing.
 Blocks are replicated for handling hardware failure.
 Checking that the code was executed successfully.
 Performing the sort that takes place between the map and reduce stages.
 Sending the sorted data to a certain computer.
 Writing the debugging logs for each job.

-6-
Figure 2.3: Hadoop parts

Source: www.edureka.co/blog/hadooptutorial

Stage 1
A user/application can submit a job to Hadoop (via a Hadoop job client) for the required
processing by specifying the following items (a sketch of such a job driver follows this list):
1. The location of the input and output files in the distributed file system.
2. The Java classes, in the form of a JAR file, containing the implementation of the map and
reduce functions.
3. The job configuration, which sets different parameters specific to the job.
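
As an illustration of these three items, the listing below is a minimal job-driver sketch written
against the standard org.apache.hadoop.mapreduce API. It is an assumption added for illustration
only; WordCountMapper and WordCountReducer are hypothetical classes (a word-count pair of this
kind is sketched in the MapReduce chapter).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Item 3: the job configuration.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        // Item 2: the jar and the map/reduce classes it contains
        // (WordCountMapper and WordCountReducer are hypothetical here).
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Item 1: input and output locations in the distributed file system.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}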

Stage 2
The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the
JobTracker, which then assumes responsibility for distributing the software/configuration to
the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to
the job client.
Stage 3

The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation, and
the output of the reduce function is stored in output files on the file system.

-7-
Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System was developed using a distributed file system design. It runs
on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and
designed using low-cost hardware. HDFS holds very large amounts of data and provides easy
access. To store such huge data, the files are stored across multiple machines. These files are
stored in a redundant fashion to protect the system from possible data loss in case of failure.
HDFS also makes applications available for parallel processing.

1. Features of HDFS
 It is suitable for distributed storage and processing.
 Hadoop provides a command-line interface to interact with HDFS.
 The built-in servers of the NameNode and DataNode help users easily check the status of
the cluster.
 It provides streaming access to file system data.
 HDFS provides file permissions and authentication.

Figure 3.1: HDFS architecture


Source: www.edureka.co/blog/hadooptutorial

-8-
2. Description of HDFS
NameNode

 It is the master daemon that maintains and manages the DataNodes (slave nodes)
 It records the metadata of all the blocks stored in the cluster, e.g. location of blocks
stored, size of the files, permissions, hierarchy, etc.
 It records each and every change that takes place to the file system metadata
 If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
 It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster
to ensure that the DataNodes are live
 It keeps a record of all the blocks in HDFS and of the DataNodes in which they are stored

DataNode

 It is the slave daemon which runs on each slave machine


 The actual data is stored on DataNodes
 It is responsible for serving read and write requests from the clients
 It is also responsible for creating blocks, deleting blocks and replicating the same based
on the decisions taken by the NameNode
 It sends heartbeats to the NameNode periodically to report the overall health of HDFS, by
default, this frequency is set to 3 seconds

Block
Generally, user data is stored in the files of HDFS. A file in the file system is divided
into one or more segments, which are stored in individual DataNodes. These file
segments are called blocks. In other words, the minimum amount of data that HDFS
can read or write is called a block. The default block size is 64 MB (128 MB in newer
Hadoop releases), but it can be increased as needed by changing the HDFS configuration.
For example, with a 128 MB block size, a 500 MB file would be split into four blocks:
three of 128 MB and one of 116 MB.

In HDFS, the NameNode is the master node and the DataNodes are slaves. The NameNode
contains the metadata about the data stored in the DataNodes, such as which data block is stored
in which DataNode and where the replicas of each data block are kept. The actual data is stored
- 9 -
in the DataNodes. The data blocks present in the DataNodes are also replicated; by default, the
replication factor is 3. Since commodity hardware is used and its failure rate is fairly high, HDFS
will still have copies of the lost data blocks if one of the DataNodes fails. That is the reason the
data blocks are replicated. You can configure the replication factor based on your requirements.
A short sketch of a client program that writes to and reads from HDFS is given below.
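
The following is a minimal HDFS client sketch written against the public
org.apache.hadoop.fs.FileSystem API; the NameNode address and the file path are assumptions
chosen purely for illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS normally comes from core-site.xml; the host below is hypothetical.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt"); // hypothetical path

        // Write a small file: the NameNode records the metadata,
        // the DataNodes store (and replicate) the blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back through a streaming interface.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }

        // Ask the NameNode for the block size and replication factor of the file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size = " + status.getBlockSize()
                + ", replication = " + status.getReplication());

        fs.close();
    }
}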

3. Goals of HDFS
[1] Fault detection and recovery: Since HDFS includes a large amount of commodity
hardware, failure of components is frequent. Therefore, HDFS should have mechanisms
for quick and automatic fault detection and recovery.
[2] Huge datasets: HDFS should scale to hundreds of nodes per cluster to manage
applications having huge datasets.
[3] Hardware at data: A requested task can be done efficiently when the computation
takes place near the data. Especially where huge datasets are involved, this reduces the
network traffic and increases the throughput.

4. The Design of HDFS


HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware. Let’s examine this statement in more
detail:
Very large files
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes
in size. There are Hadoop clusters running today that store petabytes
of data.
Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once,
read-many-times pattern. A dataset is typically generated or copied from source, then various
analyses are performed on that dataset over time. Each analysis will involve a large proportion, if
not all, of the dataset, so the time to read the whole dataset is more important than the latency in
reading the first record.
Commodity hardware

- 10 -
Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to run on
clusters of commodity hardware (commonly available hardware available from multiple vendors)
for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is
designed to carry on working without a noticeable interruption to the user in the face of such
failure.

- 11 -
MapReduce

MapReduce is a framework with which we can write applications to process huge
amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
MapReduce is a processing technique and a programming model for distributed computing based
on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). Second is the reduce task, which takes the output from a map
as its input and combines those data tuples into a smaller set of tuples. As the name MapReduce
implies, the reduce task is always performed after the map job.

The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing primitives are
called mappers and reducers. Decomposing a data processing application into mappers and
reducers is sometimes nontrivial. But, once we write an application in the MapReduce form,
scaling the application to run over hundreds, thousands, or even tens of thousands of machines in
a cluster is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.

MapReduce is a programming framework that allows us to perform distributed and parallel
processing on large data sets in a distributed environment. MapReduce consists of two distinct
tasks – Map and Reduce.

 As the name MapReduce suggests, the reducer phase takes place after the mapper phase has
been completed.
 So, the first is the map job, where a block of data is read and processed to produce key-
value pairs as intermediate outputs.
 The output of a Mapper or map job (key-value pairs) is input to the Reducer.
 The reducer receives the key-value pairs from multiple map jobs.
 Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs)
into a smaller set of tuples or key-value pairs, which is the final output. A small worked
trace of this flow is given below.
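
As a small worked example (the word-count task is assumed here purely for illustration): given
the two input lines "deer bear river" and "car car river", the map jobs emit the intermediate pairs
(deer, 1), (bear, 1), (river, 1) and (car, 1), (car, 1), (river, 1). The shuffle groups these pairs by key,
so the reducer receives (bear, [1]), (car, [1, 1]), (deer, [1]) and (river, [1, 1]) and emits the final
output (bear, 1), (car, 1), (deer, 1), (river, 2).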

- 12 -
1. The Algorithm

Generally, the MapReduce paradigm is based on sending the computation to where the data
resides. A MapReduce program executes in three stages, namely the map stage, shuffle stage, and
reduce stage.
 Map stage: The map or mapper’s job is to process the input data. Generally the input
data is in the form of a file or directory and is stored in the Hadoop file system (HDFS).
The input file is passed to the mapper function line by line. The mapper processes the
data and creates several small chunks of data.
 Reduce stage: This stage is the combination of the shuffle stage and the reduce stage.
The reducer’s job is to process the data that comes from the mapper. After processing, it
produces a new set of output, which is stored in HDFS. During a MapReduce
job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
The framework manages all the details of data passing, such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes. Most of the
computing takes place on nodes with data on local disks, which reduces the network
traffic. After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result and sends it back to the Hadoop server.

Figure 4.1: MapReduce

Source: www.edureka.co/blog/hadooptutorial

- 13 -
MapReduce works by breaking the processing into two phases: the map phase and the reduce
phase. Each phase has key-value pairs as input and output, the types of which may be chosen by
the programmer. The programmer also specifies two functions: the map function and the reduce
function.

In a MapReduce program, Map() and Reduce() are two functions.

 The Map function performs actions like filtering, grouping and sorting.
 The Reduce function aggregates and summarizes the results produced by the Map function.
 The result generated by the Map function is a key-value pair (K, V) which acts as the
input for the Reduce function. A minimal word-count sketch of both functions is given below.
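
Below is a minimal word-count sketch of the two functions using the org.apache.hadoop.mapreduce
API. It is an illustrative assumption rather than code from any particular system; the two classes
are shown in one listing for brevity, although they would normally live in separate files and be
packaged into the job jar.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: read one line at a time and emit (word, 1) for every word in it.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // intermediate key-value pair
        }
    }
}

// Reduce: sum the counts received for each word and emit the final (word, total) pair.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}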

- 14 -
HADOOP ECOSYSTEM

Hadoop Ecosystem is neither a programming language nor a service; it is a platform or
framework which solves big data problems. You can consider it as a suite which encompasses a
number of services (ingesting, storing, analyzing and maintaining data) inside it.

 HDFS -> Hadoop Distributed File System
 YARN -> Yet Another Resource Negotiator
 MapReduce -> Data processing using programming
 Spark -> In-memory Data Processing
 PIG, HIVE-> Data Processing Services using Query (SQL-like)
 HBase -> NoSQL Database
 Mahout, Spark MLlib -> Machine Learning
 Apache Drill -> SQL on Hadoop
 Zookeeper -> Managing Cluster
 Oozie -> Job Scheduling
 Flume, Sqoop -> Data Ingesting Services
 Solr & Lucene -> Searching & Indexing
 Ambari -> Provision, Monitor and Maintain cluster

- 15 -
Figure 5.1: Hadoop ecosystem
Source: www.edureka.co/blog/hadooptutorial

YARN

Consider YARN as the brain of your Hadoop Ecosystem. It performs all your processing
activities by allocating resources and scheduling tasks.

 It has two major components, i.e. ResourceManager and NodeManager.

ResourceManager

 It is a cluster-level (one per cluster) component and runs on the master machine
 It manages resources and schedules applications running on top of YARN
 It has two components: Scheduler & ApplicationManager
 The Scheduler is responsible for allocating resources to the various running applications
 The ApplicationManager is responsible for accepting job submissions and negotiating the
first container for executing the application
 It keeps track of the heartbeats from the NodeManagers

- 16 -
NodeManager

 It is a node level component (one on each node) and runs on each slave machine
 It is responsible for managing containers and monitoring resource utilization in each
container
 It also keeps track of node health and log management
 It continuously communicates with ResourceManager to remain up-to-date

Apache Pig

1. Pig is made up of two pieces:
• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs. There are currently two
environments: local execution in a single JVM and distributed execution on a
Hadoop cluster. Pig supports an SQL-like command structure and gives a platform for
building data flows for ETL (Extract, Transform and Load), processing and
analyzing huge data sets.
• 10 lines of Pig Latin correspond to roughly 200 lines of MapReduce Java code.
• The compiler internally converts Pig Latin to MapReduce. It produces a sequential
set of MapReduce jobs, and that’s an abstraction (which works like a black box).
• Pig was initially developed by Yahoo!.
2. Pig isn’t suitable for all data processing tasks, however. Like MapReduce, it is designed
for batch processing of data. If you want to perform a query that touches only a small
amount of data in a large dataset, then Pig will not perform well, since it is set up to scan
the whole dataset, or at least large portions of it.
3. Pig has two execution types or modes: local mode and Hadoop mode.
 Local mode
In local mode, Pig runs in a single JVM and accesses the local file system. This mode is
suitable only for small datasets
 Hadoop mode

- 17 -
In Hadoop mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop
cluster. The cluster may be a pseudo- or fully distributed cluster. Hadoop mode (with a
fully distributed cluster) is what you use when you want to run Pig on large datasets.

APACHE HIVE

 Facebook created Hive, a data warehousing component which performs reading, writing
and managing large data sets in a distributed environment using an SQL-like interface.

HIVE + SQL = HQL

 The query language of Hive is called Hive Query Language (HQL), which is very similar
to SQL.
 It has 2 basic components: the Hive command line and the JDBC/ODBC driver.
 The Hive command line interface is used to execute HQL commands.
 Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers
are used to establish connections to the data store (a small JDBC sketch is given below).
 Secondly, Hive is highly scalable, as it can serve both purposes: large data set
processing (i.e. batch query processing) and real-time processing (i.e. interactive query
processing).
 It supports all primitive data types of SQL.
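
The sketch below connects to Hive through the HiveServer2 JDBC driver; the host name, port,
credentials, and the "logs" table are assumptions made only for this example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (newer driver jars self-register).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 host, port and database.
        String url = "jdbc:hive2://hiveserver-host:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // HQL reads much like SQL; "logs" is a hypothetical table.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}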

MAHOUT

Mahout is renowned for machine learning. It provides an environment for creating machine
learning applications which are scalable.

Machine learning algorithms allow us to build self-learning machines that evolve by themselves
without being explicitly programmed. Based on user behavior, data patterns and past experiences,
they make important future decisions. Machine learning is a descendant of Artificial Intelligence (AI).

What does Mahout do?

- 18 -
 It performs collaborative filtering, clustering and classification. Some people also consider
frequent itemset mining to be one of Mahout’s functions.

Apache Spark

It is a framework for real-time data analytics in a distributed computing environment.

 It is written in Scala and was originally developed at the University of California, Berkeley.
 It executes in-memory computations to increase the speed of data processing over
MapReduce.
 It can be up to 100x faster than Hadoop MapReduce for large-scale data; in return, it requires
higher processing power (and memory) than MapReduce. A small word-count sketch using
Spark’s Java API is given below.
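
As a small sketch of Spark's in-memory processing from Java (assuming the Spark 2.x+ Java API;
the HDFS paths and the local master setting are assumptions for illustration):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // "local[*]" runs Spark in-process; on a cluster this would point at YARN.
        SparkConf conf = new SparkConf().setAppName("spark word count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Load the input (a hypothetical HDFS path) as an RDD of lines.
            JavaRDD<String> lines = sc.textFile("hdfs://namenode-host:9000/user/demo/input");

            // Split lines into words, map each word to (word, 1), and sum the counts in memory.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.saveAsTextFile("hdfs://namenode-host:9000/user/demo/output");
        }
    }
}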

HBase

 HBase is an open source, non-relational, distributed database, i.e. a NoSQL database. It is
modelled after Google’s BigTable, which is a distributed storage system designed to cope
with large data sets.
 It is designed to run on top of HDFS and provides BigTable-like capabilities. It gives us a
fault-tolerant way of storing sparse data, which is common in most Big Data use cases.
HBase itself is written in Java. A brief client sketch is given below.
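
A brief sketch of the HBase Java client API follows; the table name "users", the column family
"profile", and the row key are assumptions for illustration, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) { // hypothetical table
            // Store one sparse row: only the cells that exist are written.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Avantika"));
            table.put(put);

            // Read the cell back by row key, column family and qualifier.
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] name = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}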

AMBARI

 Ambari is an Apache Software Foundation project which aims at making the Hadoop
ecosystem more manageable.

Ambari provides:

 Hadoop cluster provisioning
 Hadoop cluster management
 Hadoop cluster monitoring

Apache Zookeeper
- 19 -
Apache Zookeeper is the coordinator of any Hadoop job which includes a combination of
various services in a Hadoop Ecosystem. Apache Zookeeper coordinates with various services in
a distributed environment. Before Zookeeper, it was very difficult and time-consuming to
coordinate between the different services in the Hadoop Ecosystem. The services earlier had many
problems with interactions, such as sharing common configuration while synchronizing data. Even
when the services are configured, changes in their configurations make coordination complex and
difficult to handle. Grouping and naming were also time-consuming. Due to the
above problems, Zookeeper was introduced. It saves a lot of time by performing synchronization,
configuration maintenance, grouping and naming.

Apache Oozie

 Consider Apache Oozie as a clock and alarm service inside the Hadoop Ecosystem. For Apache
jobs, Oozie acts as a scheduler. It schedules Hadoop jobs and binds them together as
one logical unit of work. There are two kinds of Oozie jobs:
 Oozie workflow: These are sequential sets of actions to be executed. You can think of it as
a relay race, where each athlete waits for the previous one to complete his part.
 Oozie coordinator: These are Oozie jobs which are triggered when data is made
available to them. Think of this as the response-stimuli system in our body. In the same
manner as we respond to an external stimulus, an Oozie coordinator responds to the
availability of data and rests otherwise.

Apache Flume

Ingesting data is an important part of our Hadoop Ecosystem.

 Flume is a service which helps in ingesting unstructured and semi-structured data
into HDFS.
 It gives us a solution which is reliable and distributed and helps us in collecting,
aggregating and moving large amounts of data.
 It helps us to ingest online streaming data from various sources like network traffic,
social media, email messages, log files, etc. into HDFS.

- 20 -
Advantages of Hadoop

1. Scalable
Hadoop is a highly scalable storage platform because it can store and distribute very large data
sets across hundreds of inexpensive servers that operate in parallel. Unlike traditional relational
database systems (RDBMS) that cannot scale to process large amounts of data, Hadoop enables
businesses to run applications on thousands of nodes involving many thousands of terabytes of
data.
2. Cost effective
Hadoop also offers a cost effective storage solution for businesses’ exploding data sets. The
problem with traditional relational database management systems is that it is extremely cost
prohibitive to scale to such a degree in order to process such massive volumes of data. In an
effort to reduce costs, many companies in the past would have had to down-sample data and
classify it based on certain assumptions as to which data was the most valuable. The raw data
would be deleted, as it would be too cost-prohibitive to keep. While this approach may have
worked in the short term, this meant that when business priorities changed, the complete raw
data set was not available, as it was too expensive to store.
3. Flexible
Hadoop enables businesses to easily access new data sources and tap into different types of data
(both structured and unstructured) to generate value from that data. This means businesses can
use Hadoop to derive valuable business insights from data sources such as social media and email
conversations. Hadoop can be used for a wide variety of purposes, such as log processing,
recommendation systems, data warehousing, market campaign analysis and fraud detection.
4. Fast
Hadoop’s unique storage method is based on a distributed file system that basically ‘maps’ data
wherever it is located on a cluster. The tools for data processing are often on the same servers
where the data is located, resulting in much faster data processing. If you’re dealing with large
volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just
minutes, and petabytes in hours.
5. Resilient to failure

- 21 -
A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node,
that data is also replicated to other nodes in the cluster, which means that in the event of failure,
there is another copy available for use.

- 22 -
Application of Hadoop

 Hadoop in Healthcare Sector

 Hadoop for Telecom Industry

 Hadoop in Retail Sector

 Hadoop in the Financial Sector

1. How Is Hadoop Being Used?


 Low-cost storage and data archive
The modest cost of commodity hardware makes Hadoop useful for storing and combining
data such as transactional, social media, sensor, machine, scientific, click streams, etc.
The low-cost storage lets you keep information that is not deemed currently critical but
that you might want to analyze later.
 Sandbox for discovery and analysis
Because Hadoop was designed to deal with volumes of data in a variety of shapes and
forms, it can run analytical algorithms. Big data analytics on Hadoop can help your
organization operate more efficiently, uncover new opportunities and derive next-level
competitive advantage. The sandbox approach provides an opportunity to innovate with
minimal investment.
 Data lake
Data lakes support storing data in its original or exact format. The goal is to offer a raw
or unrefined view of data to data scientists and analysts for discovery and analytics. It
helps them ask new or difficult questions without constraints. Data lakes are not a
replacement for data warehouses. In fact, how to secure and govern data lakes is a huge
topic for IT. They may rely on data federation techniques to create logical data
structures.
 Complement your data warehouse
We're now seeing Hadoop beginning to sit beside data warehouse environments, as well
as certain data sets being offloaded from the data warehouse into Hadoop or new types of

- 23 -
data going directly to Hadoop. The end goal for every organization is to have a right
platform for storing and processing data of different schema, formats, etc. to support
different use cases that can be integrated at different levels.
 IoT and Hadoop
Things in the IoT need to know what to communicate and when to act. At the core of the
IoT is a streaming, always-on torrent of data. Hadoop is often used as the data store for
millions or billions of transactions. Massive storage and processing capabilities also
allow you to use Hadoop as a sandbox for discovery and definition of patterns to be
monitored for prescriptive instruction. You can then continuously improve these
instructions, because Hadoop is constantly being updated with new data that doesn’t
match previously defined patterns.

2. When not to use Hadoop?

Following are some of those scenarios :

 Low-latency data access: quick access to small parts of data.
 Multiple data modifications: Hadoop is a better fit only if we are primarily concerned
with reading data and not writing data.
 Lots of small files: Hadoop is a better fit in scenarios where we have few but large files.

3. Disadvantages of hadoop

As the backbone of so many implementations, Hadoop is almost synonymous with big data.

1. Security Concerns
Just managing a complex application such as Hadoop can be challenging. A simple example can
be seen in the Hadoop security model, which is disabled by default due to sheer complexity. If
whoever is managing the platform lacks the know-how to enable it, your data could be at huge risk.
Hadoop also lacks encryption at the storage and network levels, a feature that is a major selling
point for government agencies and others that prefer to keep their data under wraps.
2. Vulnerable By Nature

- 24 -
Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The
framework is written almost entirely in Java, one of the most widely used yet controversial
programming languages in existence. Java has been heavily exploited by cybercriminals and as a
result, implicated in numerous security breaches.
3. Not Fit for Small Data
While big data is not exclusively made for big businesses, not all big data platforms are suited
for small data needs. Unfortunately, Hadoop happens to be one of them. Due to its high capacity
design, the Hadoop Distributed File System, lacks the ability to efficiently support the random
reading of small files. As a result, it is not recommended for organizations with small quantities
of data.
4. Potential Stability Issues
Like all open source software, Hadoop has had its fair share of stability issues. To avoid these
issues, organizations are strongly recommended to make sure they are running the latest stable
version, or run it under a third-party vendor equipped to handle such problems.
5. General Limitations

 Articles comparing data-processing platforms introduce Apache Flume, MillWheel, and Google’s
own Cloud Dataflow as possible solutions. What each of these platforms has in common is the
ability to improve the efficiency and reliability of data collection, aggregation, and integration. The
main point such articles stress is that companies could be missing out on big benefits by
using Hadoop alone.

- 25 -
Future perspective

In response to this demand in the industry, Imarticus Learning has created the CBDH
(Certificate in Big Data and Hadoop) program, designed to ensure that you are job-ready to take up
assignments in Big Data Analytics using the Hadoop framework. This functional skill-building
program not only equips you with the essential concepts of Hadoop but also gives you the required
work experience in Big Data and Hadoop through the implementation of real-life industry
projects. Since the data market forecast is strong and here to stay, knowledge of Hadoop and
related technologies will act as a career boost in India with its growing analytics market.

1. Other Applications of Hadoop

The HDFS file system is not restricted to MapReduce jobs. It can be used for other
applications, many of which are under development at Apache. The list includes the HBase
database, the Apache Mahout machine learning system, and the Apache Hive Data Warehouse
system. Hadoop can, in theory, be used for any sort of work that is batch-oriented rather than
real-time, is very data-intensive, and benefits from parallel processing of data. It can also be used
to complement a real-time system, such as lambda architecture, Apache Storm, Flink and Spark
Streaming.

As of October 2009, commercial applications of Hadoop included:

 log and/or clickstream analysis of various kinds


 marketing analytics
 machine learning and/or sophisticated data mining
 image processing
 processing of XML messages
 web crawling and/or text processing
 general archiving, including of relational/tabular data, e.g. for compliance

- 26 -
Research Papers

Some papers influenced the birth and growth of Hadoop and big data processing. Some of these
are:

1. Jeffrey Dean, Sanjay Ghemawat (2004) MapReduce: Simplified Data Processing


on Large Clusters, Google. This paper inspired Doug Cutting to develop an open-
source implementation of the MapReduce framework. He named it Hadoop, after
his son's toy elephant.
2. Michael Franklin, Alon Halevy, David Maier (2005) From Databases to
Dataspaces: A New Abstraction for Information Management. The authors
highlight the need for storage systems to accept all data formats and to provide
APIs for data access that evolve based on the storage system's understanding of
the data.

- 27 -
Hadoop Research Topics

 Ability to make Hadoop scheduler resource aware, especially CPU, memory and IO
resources. The current implementation is based on statically configured slots.
 Ability to make a map-reduce job take new input splits even after a map-reduce job has
already started.
 Ability to dynamically increase replicas of data in HDFS based on access patterns. This is
needed to handle hot-spots of data.
 Ability to extend the map-reduce framework to be able to process data that resides partly
in memory. One assumption of the current implementation is that the map-reduce
framework is used to scan data that resides on disk devices. But memory on commodity
machines is becoming larger and larger. A cluster of 3000 machines with 64 GB each can
keep about 200TB of data in memory! It would be nice if the hadoop framework can
support caching the hot set of data on the RAM of the tasktracker machines. Performance
should increase dramatically because it is costly to serialize/compress data from the disk
into memory for every query.
 Heuristics to efficiently ‘speculate’ map-reduce tasks to help work around machines that
are laggards. In the cloud, the biggest challenge for fault tolerance is not handling outright
failures but rather anomalies that make parts of the cloud slow (but not fail completely);
these impact the performance of jobs.
 Make map-reduce jobs work across data centers. In many cases, a single hadoop cluster
cannot fit into a single data center and a user has to partition the dataset into two hadoop
clusters in two different data centers.
 High Availability of the JobTracker. In the current implementation, if the JobTracker
machine dies, then all currently running jobs fail.
 Ability to create snapshots in HDFS. The primary use of these snapshots is to retrieve a
dataset that was erroneously modified/deleted by a buggy application.

- 28 -
Case Studies

Hadoop Usage at Last.fm


Last.fm: The Social Music Revolution
Founded in 2002, Last.fm is an Internet radio and music community website that
offers many services to its users, such as free music streams and downloads, music and event
recommendations, personalized charts, and much more. There are about 25 million people who
use Last.fm every month, generating huge amounts of data that need to be processed. One
example of this is users transmitting information indicating which songs they are listening to
(this is known as “scrobbling”). This data is processed and stored by Last.fm, so the user can
access it directly (in the form of charts), and it is also used to make decisions about users’
musical tastes and compatibility, and artist and track similarity.
Hadoop at Last.fm
As Last.fm’s service developed and the number of users grew from thousands to
millions, storing, processing and managing all the incoming data became increasingly
challenging. Fortunately, Hadoop was quickly becoming stable enough and was enthusiastically
adopted as it became clear how many problems it solved. It was first used at Last.fm in early
2006 and was put into production a few months later. There were several reasons for adopting
Hadoop at Last.fm:
• The distributed filesystem provided redundant backups for the data stored on it (e.g., web logs,
user listening data) at no extra cost.
• Scalability was simplified through the ability to add cheap, commodity hardware when
required.
• The cost was right (free) at a time when Last.fm had limited financial resources.
• The open source code and active community meant that Last.fm could freely modify Hadoop to
add custom features and patches. Hadoop provided a flexible framework for running distributed
computing algorithms with a relatively easy learning curve. Hadoop has now become a crucial
part of Last.fm’s infrastructure, currently consisting of two Hadoop clusters spanning over 50
machines, 300 cores, and 100 TB of disk space. Hundreds of daily jobs are run on the clusters
performing operations, such as logfile analysis, evaluation of A/B tests, ad hoc processing, and
charts generation.

- 29 -
References

[1] "Hadoop Releases",Apache Software Foundation. Retrieved 2014-12-06.


[2] "What is the Hadoop Distributed File System (HDFS)?", IBM. Retrieved 2014-10-30
[3] Hadoop-the Definitive Guide, O’reilly 2009, Yahoo! Press
[4] Map reduce: Simplified Data Processing On Large Clusters , Jeffrey Dean And Sanjay
Ghemawat.
[3] http://hadoop.apache.org
[4] https://www.edureka.co/blog/hadoop-tutorial
[5] https://www.tutorialspoint.com/hadoop
[6] https://github.com/apache/hadoop
[7] https://www.slideshare.net/ApacheApex/introduction-to-hadoop
[8] http://www.cloudera.com/hadoop-training

- 30 -
