A Seminar Report
Submitted to
MR. RAMESH KUMAR
Submitted by
AVANTIKA BISHT
UNIVERSITY ROLL NO: 150216
in partial fulfillment for the award of the degree
of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
at
GOVIND BALLABH PANT INSTITUTE OF ENGINEERING AND
TECHNOLOGY
Department of Computer Science
Pauri Garhwal
FEBRUARY 2018
ABSTRACT
Hadoop is an Apache open-source framework, written in Java, that allows distributed
processing of large datasets across clusters of computers using simple programming models. An
application built on the Hadoop framework works in an environment that provides distributed
storage and computation across clusters of computers. Hadoop is designed to scale up from a
single server to thousands of machines, each offering local computation and storage. The main
reason for its development was the advent of big data. Big data is a term used to describe volumes
of data so large that loading them into a relational database for analysis would take too much time
and cost too much. In a nutshell, this is what Hadoop provides: a reliable shared storage and
analysis system. The storage is provided by HDFS, and the analysis by MapReduce.
The modest cost of commodity hardware makes Hadoop useful for storing and
combining data such as transactional, social media, sensor, machine, scientific and clickstream
data. The low-cost storage lets you keep information that is not deemed currently critical but that
you might want to analyze later. Because Hadoop was designed to deal with volumes of data in a
variety of shapes and forms, it can run analytical algorithms over all of them. Big data analytics on
Hadoop can help an organization operate more efficiently, uncover new opportunities and derive
next-level competitive advantage. The sandbox approach provides an opportunity to innovate with
minimal investment.
Contents
1. Reasons for Hadoop
    What Is Big Data
    Big Data Challenges
    Google's Solution
    Hadoop: Big Data Solution
2. Introduction to Hadoop
    History
    Introduction
    Features of Hadoop
    Hadoop Core Components
    How Does Hadoop Work
3. Hadoop Distributed File System
    Features of HDFS
    Description of Features
    Goals of HDFS
    Design of HDFS
4. MapReduce
    The Algorithm
5. Hadoop Ecosystem
6. Advantages of Hadoop
7. Applications of Hadoop
    When Not to Use Hadoop
    Disadvantages of Hadoop
8. Future Perspective
9. Research Papers
10. Hadoop Research Topics
11. Case Studies
12. References
List of Figures
Figure 2.3: Hadoop parts
Figure 4.1: MapReduce
Figure 5.1: Hadoop ecosystem
ACKNOWLEDGEMENT
I would like to express my gratitude to Lord Almighty, the most Beneficent and the
most Merciful, for completion of this seminar report. I wish to thank my parents for their
continuing support and encouragement. I also wish to thank them for providing me with the
opportunity to reach this far in my studies. I would particularly like to thank my supervisor,
Mr. Ramesh Kumar, for his patience, support and encouragement throughout the completion of
this seminar report, and for having faith in me. Last but not least, I am greatly indebted to all the
other people who directly or indirectly helped me during this work.
REASONS FOR HADOOP
Due to the advent of new technologies, devices, and communication means such as social
networking sites, the amount of data produced by mankind is growing rapidly every year. The
amount of data produced by us from the beginning of time until 2003 was about 5 billion
gigabytes. Piled up in the form of disks, it would fill an entire football field. The same amount
was created every two days in 2011, and every ten minutes in 2013. This rate is still growing
enormously, and it is this growth that led to the term big data.
3. Google’s Solution
Google solved this problem using an algorithm called MapReduce. This algorithm
divides the task into small parts, assigns them to many computers, and collects the results
from them; when integrated, these results form the final dataset.
4. Hadoop: Big Data Solution
To solve the storage and processing issues, two core components were created in
Hadoop: HDFS and YARN. HDFS solves the storage issue, as it stores data in a distributed
fashion and is easily scalable. YARN solves the processing issue by managing cluster resources
and scheduling computation across the nodes, which drastically reduces processing time.
INTRODUCTION TO HADOOP
1. History
According to its co-founders, Doug Cutting and Mike Cafarella, the genesis of Hadoop
was the "Google File System" paper that was published in October 2003. This paper spawned
another one from Google – "MapReduce: Simplified Data Processing on Large Clusters".
Development started on the Apache Nutch project, but was moved to the new Hadoop subproject
in January 2006. Doug Cutting, who was working at Yahoo! at the time, named it after his son's
toy elephant. The initial code that was factored out of Nutch consisted of about 5,000 lines of
code for HDFS and about 6,000 lines of code for MapReduce.
2. Introduction
Hadoop is an open-source software framework used for storing and processing Big
Data in a distributed manner on large clusters of commodity hardware. Hadoop is licensed under
the Apache v2 license. Hadoop was developed based on the paper written by Google on the
MapReduce system, and it applies concepts of functional programming. Hadoop is written in the
Java programming language and is one of the top-level Apache projects. Hadoop was
developed by Doug Cutting and Michael J. Cafarella.
Source: www.edureka.co/blog/hadooptutorial
3. Features of Hadoop
Reliability: When machines are working in tandem, if one of the machines fails, another machine
will take over the responsibility and work in a reliable and fault tolerant fashion. Hadoop
infrastructure has inbuilt fault tolerance features and hence, Hadoop is highly reliable.
Economical: Hadoop uses commodity hardware (like your PC or laptop). For example, in a small
Hadoop cluster, all your DataNodes can have normal configurations such as 8-16 GB of RAM,
5-10 TB of disk and Xeon processors, whereas using hardware-based RAID with Oracle for the
same purpose would cost at least five times more. So, the cost of ownership of a Hadoop-based
project is much lower. The Hadoop environment is also easier to maintain and is economical as
well. In addition, Hadoop is open-source software, so there is no licensing cost.
Scalability: Hadoop has the inbuilt capability of integrating seamlessly with cloud-based
services. So, if you are installing Hadoop on a cloud, you don’t need to worry about the
scalability factor because you can go ahead and procure more hardware and expand your setup
within minutes whenever required.
Flexibility: Hadoop is very flexible in its ability to deal with all kinds of data. Data can be of any
kind (the "Variety" aspect of big data), and Hadoop can store and process it all, whether it is
structured, semi-structured or unstructured.
4. Hadoop Core Components
Hadoop has two core components: one is HDFS (storage) and the other is YARN (processing).
HDFS stands for Hadoop Distributed File System, a scalable storage unit of Hadoop, whereas
YARN is used to process the data stored in HDFS in a distributed and parallel fashion.
Source: www.edureka.co/blog/hadooptutorial
The base Apache Hadoop framework is composed of the following modules:
Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on
commodity machines, providing very high aggregate bandwidth across the cluster;
Hadoop YARN – a platform responsible for managing computing resources in clusters
and using them for scheduling users' applications; and
Hadoop MapReduce – an implementation of the MapReduce programming model for
large-scale data processing.
The Hadoop framework itself is mostly written in the Java programming language,
with some native code in C and command line utilities written as shell scripts. Though
MapReduce Java code is common, any programming language can be used with "Hadoop
Streaming" to implement the "map" and "reduce" parts of the user's program.
The core of Apache Hadoop consists of a storage part, known as Hadoop
Distributed File System (HDFS), and a processing part which is a MapReduce programming
model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then
transfers packaged code into nodes to process the data in parallel. This approach takes advantage
of data locality, where nodes manipulate the data they have access to. This allows the dataset to
be processed faster and more efficiently than it would be in a more conventional supercomputer
architecture that relies on a parallel file system where computation and data are distributed via
high-speed networking.
It is quite expensive to build bigger servers with heavy configurations to handle large-scale
processing. As an alternative, you can tie together many single-CPU commodity computers as a
single functional distributed system; in practice, the clustered machines can read the dataset in
parallel and provide much higher throughput, and the cluster is cheaper than one high-end server.
This is the first motivation behind Hadoop: it runs across clusters of low-cost machines. Hadoop
runs code across a cluster of computers, and this process includes the following core tasks that
Hadoop performs:
Data is initially divided into directories and files. Files are divided into uniform-sized
blocks of 128 MB or 64 MB (preferably 128 MB).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
Figure 2.3: Hadoop parts
Source: www.edureka.co/blog/hadooptutorial
5. How Does Hadoop Work
Stage 1
A user or application can submit a job to Hadoop (via a Hadoop job client) for the required
process by specifying the following items:
1. The location of the input and output files in the distributed file system.
2. The Java classes, in the form of a jar file, containing the implementation of the map and reduce
functions.
3. The job configuration, given by setting different parameters specific to the job.
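In code, these three items correspond to a small driver class that is packaged into the submitted jar. The following is a minimal sketch using the org.apache.hadoop.mapreduce Java API; the class names (WordCountDriver, WordCountMapper, WordCountReducer) and the command-line arguments are illustrative assumptions rather than part of the original report, and the Mapper and Reducer classes themselves are sketched in the MapReduce chapter later on.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Item 3: the job configuration (job name, output types, etc.)
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            // Item 2: the Java classes implementing the map and reduce functions
            // (hypothetical classes packaged in the submitted jar)
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Item 1: input and output locations in the distributed file system
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Submit the job and wait for it to finish
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, a driver like this is what the job client hands to the cluster in Stage 2 below.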
Stage 2
The Hadoop job client then submits the job (jar/executable, etc.) and its configuration to the
JobTracker, which then assumes the responsibility of distributing the software and configuration
to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic
information to the job client.
Stage 3
The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation,
and the output of the reduce function is stored in output files on the file system.
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System was developed using a distributed file system design and
runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and
designed using low-cost hardware. HDFS holds very large amounts of data and provides easy
access. To store such huge data, the files are stored across multiple machines. These files are
stored in a redundant fashion to protect the system from possible data loss in case of failure.
HDFS also makes applications available for parallel processing.
1. Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of the NameNode and DataNodes help users easily check the status of the
cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
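Besides the command interface mentioned above, HDFS can also be accessed programmatically through the org.apache.hadoop.fs.FileSystem Java API. The sketch below lists a directory and prints basic block metadata; the NameNode address and the /user/data path are hypothetical placeholders, and in a real deployment the address would normally come from core-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListHdfsDirectory {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; usually supplied by core-site.xml
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

            FileSystem fs = FileSystem.get(conf);

            // List a directory and print size, replication and block size per file
            for (FileStatus status : fs.listStatus(new Path("/user/data"))) {
                System.out.printf("%s  size=%d  replication=%d  blockSize=%d%n",
                        status.getPath(), status.getLen(),
                        status.getReplication(), status.getBlockSize());
            }
            fs.close();
        }
    }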
2. Description of Features
NameNode
It is the master daemon that maintains and manages the DataNodes (slave nodes)
It records the metadata of all the blocks stored in the cluster, e.g. location of blocks
stored, size of the files, permissions, hierarchy, etc.
It records each and every change that takes place to the file system metadata
If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster
to ensure that the DataNodes are live
It keeps a record of all the blocks in HDFS and of the DataNodes on which they are stored
DataNode
These are the slave daemons that run on each slave machine. The actual data is stored on the
DataNodes. They serve read and write requests and regularly send Heartbeats and block reports
to the NameNode.
Block
Generally, user data is stored in the files of HDFS. A file in the file system is divided into
one or more segments, which are stored in individual DataNodes. These file segments are
called blocks. In other words, the minimum amount of data that HDFS can read or write is
called a block. The default block size is 64 MB, but it can be increased as needed by
changing the HDFS configuration.
In HDFS, the NameNode is the master node and the DataNodes are slaves. The NameNode
contains the metadata about the data stored in the DataNodes, such as which data block is stored
in which DataNode and where the replicas of each data block are kept. The actual data is stored
in the DataNodes. The data blocks present in the DataNodes are also replicated; by default, the
replication factor is 3. Since commodity hardware is used and its failure rate is fairly high, if one
of the DataNodes fails, HDFS will still have copies of the lost data blocks. That is why data
blocks are replicated. You can configure the replication factor based on your requirements.
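The block size and replication factor discussed above can also be supplied programmatically when a file is written. The following is a minimal sketch using the FileSystem API; the file path and the chosen values are illustrative assumptions, and in practice these settings usually come from the cluster configuration (hdfs-site.xml) rather than application code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/data/sample.txt"); // hypothetical path
            short replication = 3;                         // default factor discussed above
            long blockSize = 128L * 1024 * 1024;           // 128 MB blocks

            // create(path, overwrite, bufferSize, replication, blockSize)
            try (FSDataOutputStream out =
                     fs.create(file, true, 4096, replication, blockSize)) {
                out.writeUTF("hello HDFS");
            }

            // The replication factor can also be changed after the file is written
            fs.setReplication(file, (short) 2);
        }
    }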
3. Goals of HDFS
[1] Fault detection and recovery: Since HDFS includes a large number of commodity
hardware components, failure of components is frequent. Therefore, HDFS should have
mechanisms for quick and automatic fault detection and recovery.
[2] Huge datasets: HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
[3] Hardware at data: A requested task can be done efficiently when the computation
takes place near the data. Especially where huge datasets are involved, this reduces the
network traffic and increases the throughput.
4. Design of HDFS
Hadoop doesn't require expensive, highly reliable hardware to run on. It is designed to run on
clusters of commodity hardware (commonly available hardware from multiple vendors)
for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is
designed to carry on working without a noticeable interruption to the user in the face of such
failure.
MapReduce
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing primitives are
called mappers and reducers. Decomposing a data processing application into mappers and
reducers is sometimes nontrivial. But, once we write an application in the MapReduce form,
scaling the application to run over hundreds, thousands, or even tens of thousands of machines in
a cluster is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been
completed.
So, the first is the map job, where a block of data is read and processed to produce key-
value pairs as intermediate outputs.
The output of a mapper, or map job (key-value pairs), is input to the reducer.
The reducer receives the key-value pairs from multiple map jobs.
Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs)
into a smaller set of tuples or key-value pairs, which is the final output.
1. The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data
resides. A MapReduce program executes in three stages: the map stage, the shuffle stage, and the
reduce stage.
Map stage: The map or mapper's job is to process the input data. Generally, the input
data is in the form of a file or directory and is stored in the Hadoop Distributed File System
(HDFS). The input file is passed to the mapper function line by line. The mapper processes the
data and creates several small chunks of data.
Reduce stage: This stage is the combination of the shuffle stage and the reduce stage.
The reducer's job is to process the data that comes from the mapper. After processing, it
produces a new set of output, which is stored in HDFS. During a MapReduce job, Hadoop
sends the map and reduce tasks to the appropriate servers in the cluster. The framework
manages all the details of data passing, such as issuing tasks, verifying task completion, and
copying data around the cluster between the nodes. Most of the computing takes place on
nodes with data on local disks, which reduces the network traffic. After completion of the
given tasks, the cluster collects and reduces the data to form an appropriate result, and sends
it back to the Hadoop server.
Figure 4.1: MapReduce
Source: www.edureka.co/blog/hadooptutorial
MapReduce works by breaking the processing into two phases: the map phase and the reduce
phase. Each phase has key-value pairs as input and output, the types of which may be chosen by
the programmer. The programmer also specifies two functions: the map function and the reduce
function.
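As a concrete illustration of these two functions, the classic word-count example is sketched below with the Hadoop Java API; the class names match the hypothetical driver shown in the previous chapter. It is a minimal sketch rather than a tuned implementation, and the two classes would normally live in separate source files.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit one (word, 1) pair for every token in each input line
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts that the shuffle grouped under each word
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }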
HADOOP ECOSYSTEM
Figure 5.1: Hadoop ecosystem
Source: www.edureka.co/blog/hadooptutorial
YARN
Consider YARN as the brain of your Hadoop Ecosystem. It performs all your processing
activities by allocating resources and scheduling tasks.
ResourceManager
It is a cluster level (one for each cluster) component and runs on the master machine
It manages resources and schedules applications running on top of YARN
It has two components: Scheduler & ApplicationManager
The Scheduler is responsible for allocating resources to the various running applications
The ApplicationManager is responsible for accepting job submissions and negotiating the
first container for executing the application
It keeps track of the heartbeats from the NodeManagers
NodeManager
It is a node level component (one on each node) and runs on each slave machine
It is responsible for managing containers and monitoring resource utilization in each
container
It also keeps track of node health and log management
It continuously communicates with ResourceManager to remain up-to-date
Apache Pig
In Hadoop mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop
cluster. The cluster may be a pseudo- or fully distributed cluster. Hadoop mode (with a
fully distributed cluster) is what you use when you want to run Pig on large datasets.
APACHE HIVE
Facebook created Hive. It is a data warehousing component that performs reading, writing
and managing of large data sets in a distributed environment using an SQL-like interface.
The query language of Hive is called Hive Query Language (HQL), which is very similar
to SQL.
It has two basic components: the Hive command line and the JDBC/ODBC driver.
The Hive command line interface is used to execute HQL commands.
The Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers
are used to establish a connection from data storage.
Secondly, Hive is highly scalable, as it can serve both purposes, i.e. large data set
processing (batch query processing) and real-time processing (interactive query
processing).
It supports all primitive data types of SQL.
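A minimal sketch of the JDBC path mentioned above is shown below. The host name, credentials and the clicks table are hypothetical, and 10000 is only the usual HiveServer2 default port; the sketch assumes the Hive JDBC driver jar is on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver shipped with Hive
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Hypothetical host; 10000 is the usual HiveServer2 default port
            String url = "jdbc:hive2://hive-host:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "user", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT page, COUNT(*) FROM clicks GROUP BY page")) {
                while (rs.next()) {
                    // Print each page and its click count
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }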
MAHOUT
Mahout is renowned for machine learning. It provides an environment for creating machine
learning applications that are scalable.
Machine learning algorithms allow us to build self-learning machines that evolve by themselves
without being explicitly programmed. Based on user behaviour, data patterns and past
experiences, they make important future decisions. It is a descendant of Artificial Intelligence (AI).
Mahout performs collaborative filtering, clustering and classification. Some people also consider
frequent itemset mining to be one of Mahout's functions.
Apache Spark
Spark is written in Scala and was originally developed at the University of California, Berkeley.
It executes in-memory computations to increase the speed of data processing over
MapReduce.
It is up to 100x faster than Hadoop MapReduce for large-scale data, and it therefore requires
more processing power (particularly memory) than MapReduce.
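To illustrate the in-memory style of processing, the sketch below uses Spark's Java API to cache a filtered dataset so that two subsequent actions reuse it without re-reading the file from disk. The input path and the local master setting are illustrative assumptions; on a cluster, the master and input would typically point at YARN and HDFS respectively.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkInMemoryExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("log-error-count")
                    .setMaster("local[*]"); // local mode for illustration only

            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Hypothetical input path; could equally be an hdfs:// URI
                JavaRDD<String> lines = sc.textFile("/data/app.log");

                // cache() keeps the filtered RDD in memory, so the two
                // actions below do not re-read the file from disk
                JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR")).cache();

                System.out.println("error lines: " + errors.count());
                System.out.println("timeouts:    "
                        + errors.filter(l -> l.contains("timeout")).count());
            }
        }
    }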
HBase
AMBARI
Apache Zookeeper
Apache Zookeeper is the coordinator of any Hadoop job which includes a combination of
various services in a Hadoop Ecosystem. Apache Zookeeper coordinates with various services in
a distributed environment. Before Zookeeper, it was very difficult and time consuming to
coordinate between different services in Hadoop Ecosystem. The services earlier had many
problems with interactions like common configuration while synchronizing data. Even if the
services are configured, changes in the configurations of the services make it complex and
difficult to handle. The grouping and naming was also a time-consuming factor. Due to the
above problems, Zookeeper was introduced. It saves a lot of time by performing synchronization,
configuration maintenance, grouping and naming.
Apache Oozie
Consider Apache Oozie as a clock and alarm service inside the Hadoop Ecosystem. For
Apache Hadoop jobs, Oozie acts as a scheduler: it schedules Hadoop jobs and binds them
together as one logical unit of work. There are two kinds of Oozie jobs:
Oozie Workflow: These are sequential sets of actions to be executed. You can think of it as
a relay race, where each athlete waits for the previous one to complete their part.
Oozie Coordinator: These are Oozie jobs that are triggered when data is made available to
them. Think of this as the response-stimulus system in our body: in the same way that we
respond to an external stimulus, an Oozie coordinator responds to the availability of data,
and it rests otherwise.
Apache Flume
Flume is a service that helps in ingesting unstructured and semi-structured data
into HDFS.
It gives us a reliable and distributed solution that helps us in collecting, aggregating and
moving large amounts of data.
It helps us ingest online streaming data from various sources, such as network traffic,
social media, email messages and log files, into HDFS.
Advantages of Hadoop
1. Scalable
Hadoop is a highly scalable storage platform because it can store and distribute very large data
sets across hundreds of inexpensive servers that operate in parallel. Unlike traditional relational
database systems (RDBMS) that can't scale to process large amounts of data, Hadoop enables
businesses to run applications on thousands of nodes involving many thousands of terabytes of
data.
2. Cost effective
Hadoop also offers a cost effective storage solution for businesses’ exploding data sets. The
problem with traditional relational database management systems is that they are extremely
cost-prohibitive to scale to such a degree in order to process such massive volumes of data. In an
effort to reduce costs, many companies in the past would have had to down-sample data and
classify it based on certain assumptions as to which data was the most valuable. The raw data
would be deleted, as it would be too cost-prohibitive to keep. While this approach may have
worked in the short term, this meant that when business priorities changed, the complete raw
data set was not available, as it was too expensive to store.
3. Flexible
Hadoop enables businesses to easily access new data sources and tap into different types of data
(both structured and unstructured) to generate value from that data. This means businesses can
use Hadoop to derive valuable business insights from data sources such as social media and
email conversations. Hadoop can be used for a wide variety of purposes, such as log processing,
recommendation systems, data warehousing, market campaign analysis and fraud detection.
4. Fast
Hadoop’s unique storage method is based on a distributed file system that basically ‘maps’ data
wherever it is located on a cluster. The tools for data processing are often on the same servers
where the data is located, resulting in much faster data processing. If you’re dealing with large
volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just
minutes, and petabytes in hours.
5. Resilient to Failure
A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node,
that data is also replicated to other nodes in the cluster, which means that in the event of failure,
there is another copy available for use.
Applications of Hadoop
data going directly to Hadoop. The end goal for every organization is to have the right
platform for storing and processing data of different schemas, formats, etc. to support
different use cases that can be integrated at different levels.
IoT and Hadoop
Things in the IoT need to know what to communicate and when to act. At the core of the
IoT is a streaming, always-on torrent of data. Hadoop is often used as the data store for
millions or billions of transactions. Massive storage and processing capabilities also
allow you to use Hadoop as a sandbox for discovery and definition of patterns to be
monitored for prescriptive instruction. You can then continuously improve these
instructions, because Hadoop is constantly being updated with new data that doesn’t
match previously defined patterns.
3. Disadvantages of Hadoop
As the backbone of so many implementations, Hadoop is almost synonymous with big data.
1. Security Concerns
Just managing a complex application such as Hadoop can be challenging. A simple example can
be seen in the Hadoop security model, which is disabled by default due to sheer complexity. If
whoever is managing the platform lacks the know-how to enable it, your data could be at huge
risk. Hadoop also lacks encryption at the storage and network levels, which is a major concern
for government agencies and others that prefer to keep their data under wraps.
2. Vulnerable By Nature
Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The
framework is written almost entirely in Java, one of the most widely used yet controversial
programming languages in existence. Java has been heavily exploited by cybercriminals and as a
result, implicated in numerous security breaches.
3. Not Fit for Small Data
While big data is not exclusively made for big businesses, not all big data platforms are suited
for small data needs. Unfortunately, Hadoop happens to be one of them. Due to its high-capacity
design, the Hadoop Distributed File System lacks the ability to efficiently support the random
reading of small files. As a result, it is not recommended for organizations with small quantities
of data.
4. Potential Stability Issues
Like all open source software, Hadoop has had its fair share of stability issues. To avoid these
issues, organizations are strongly recommended to make sure they are running the latest stable
version, or run it under a third-party vendor equipped to handle such problems.
5. General Limitations
The article introduces Apache Flume, MillWheel, and Google's own Cloud Dataflow as
possible solutions. What each of these platforms has in common is the ability to
improve the efficiency and reliability of data collection, aggregation, and integration. The
main point the article stresses is that companies could be missing out on big benefits by
using Hadoop alone.
Future Perspective
Because of all this activity in the industry, Imarticus Learning has created the CBDH
(Certificate in Big Data and Hadoop) program, designed to ensure that you are job-ready to take
up assignments in Big Data Analytics using the Hadoop framework. This functional skill-building
program not only equips you with the essential concepts of Hadoop but also gives you the
required work experience in Big Data and Hadoop through the implementation of real-life
industry projects. Since the data market forecast is strong and here to stay, knowledge of Hadoop
and related technologies will act as a career boost in India, with its growing analytics market.
The HDFS file system is not restricted to MapReduce jobs. It can be used for other
applications, many of which are under development at Apache. The list includes the HBase
database, the Apache Mahout machine learning system, and the Apache Hive Data Warehouse
system. Hadoop can, in theory, be used for any sort of work that is batch-oriented rather than
real-time, is very data-intensive, and benefits from parallel processing of data. It can also be used
to complement a real-time system, such as lambda architecture, Apache Storm, Flink and Spark
Streaming.
Research Papers
Some papers influenced the birth and growth of Hadoop and big data processing. Some of these
are:
Hadoop Research Topics
Ability to make Hadoop scheduler resource aware, especially CPU, memory and IO
resources. The current implementation is based on statically configured slots.
Ability to make a map-reduce job take new input splits even after the job has
already started.
Ability to dynamically increase replicas of data in HDFS based on access patterns. This is
needed to handle hot-spots of data.
Ability to extend the map-reduce framework to be able to process data that resides partly
in memory. One assumption of the current implementation is that the map-reduce
framework is used to scan data that resides on disk devices. But memory on commodity
machines is becoming larger and larger. A cluster of 3000 machines with 64 GB each can
keep about 200 TB of data in memory! It would be nice if the Hadoop framework could
support caching the hot set of data in the RAM of the TaskTracker machines. Performance
should increase dramatically because it is costly to serialize/compress data from the disk
into memory for every query.
Heuristics to efficiently 'speculate' map-reduce tasks to help work around machines that
are laggards. In the cloud, the biggest challenge for fault tolerance is not handling outright
failures but rather anomalies that make parts of the cloud slow (but not fail completely);
these impact the performance of jobs.
Make map-reduce jobs work across data centers. In many cases, a single Hadoop cluster
cannot fit into a single data center and a user has to partition the dataset into two Hadoop
clusters in two different data centers.
High Availability of the JobTracker. In the current implementation, if the JobTracker
machine dies, then all currently running jobs fail.
Ability to create snapshots in HDFS. The primary use of these snapshots is to retrieve a
dataset that was erroneously modified/deleted by a buggy application.
Case Studies
References