Unit 5 Cloud Computing
■Hadoop: Hadoop is a software framework that allows the distributed processing of huge
data sets across clusters of computers using simple programming models. In simple terms,
Hadoop is a framework for processing ‘Big Data’. Hadoop was created by Doug Cutting. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Hadoop is open-source software. The core of Apache Hadoop
consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a
processing part, the MapReduce programming model. Hadoop splits files into large blocks
and distributes them across the nodes in a cluster. It then transfers packaged code to the
nodes so they can process the data in parallel.
■MapReduce: MapReduce is a programming model used for processing and generating
large data sets on clusters of computers. It was introduced by Google. MapReduce is a
concept, or method, for large-scale parallelization. It is inspired by functional
programming’s map() and reduce() functions.
A MapReduce program is executed in three stages (a small sketch follows this list):
Mapping: The mapper’s job is to process the input data. Each node applies the map function
to its local data.
Shuffle: Data is redistributed across the nodes based on the output keys produced by the
map function.
Reduce: Each node now processes one group of output data per key, in parallel.
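To make these stages concrete, here is a minimal, self-contained Java sketch (plain Java streams, not Hadoop) of a word count: the flatMap step plays the role of the mapping stage, and the grouping collector plays the role of shuffle plus reduce. The sample lines are taken from the file used later in this unit.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hello I am Google", "How can I help you");
        Map<String, Long> counts = lines.stream()
            // Mapping stage: split each line and emit one token per word
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            // Shuffle + Reduce: group equal words (keys) and count each group
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        System.out.println(counts);   // e.g. {Hello=1, I=2, am=1, ...}
    }
}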
So, MapReduce is one of the two core components of Hadoop. The first component,
the Hadoop Distributed File System (HDFS), is responsible for storing the file. The second
component, MapReduce, is responsible for processing the file. Suppose there is a word
file containing some text. Let us name this file sample.txt. We use Hadoop to deal with
huge files, but for the sake of easy explanation here we take a small text file as an
example. So, let’s assume that this sample.txt file contains a few lines of text. The content of
the file is as follows:
Hello I am Google
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths
Hence, the above 8 lines are the content of the file. Let’s assume that while storing this file in
Hadoop, HDFS breaks it into four parts and names the parts
first.txt, second.txt, third.txt, and fourth.txt. So, the above file will be
divided into four equal parts, each part containing 2 lines. The first two lines will be in the
file first.txt, the next two lines in second.txt, the next two in third.txt, and the last two lines
in fourth.txt. All these parts will be stored on Data Nodes, and the Name Node will
hold the metadata about them. All of this is the task of HDFS.
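As a rough illustration (the HDFS paths below are assumptions, not part of the original example, and in practice such a tiny file would occupy a single block), the file could be loaded into HDFS and its block placement inspected with the standard HDFS shell commands:

$ hdfs dfs -mkdir -p /user/demo
$ hdfs dfs -put sample.txt /user/demo/sample.txt
$ hdfs fsck /user/demo/sample.txt -files -blocks -locations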
Now, suppose a user wants to process this file. This is where MapReduce comes into the
picture. Suppose this user wants to run a query on sample.txt. So, instead of
bringing sample.txt to the local computer, we send the query to the data. To keep track
of our request, we use the Job Tracker (a master service). The Job Tracker traps our request
and keeps track of it.
Now suppose that the user wants to run the query on sample.txt and wants the output
in result.output. Let the name of the jar file containing the query be query.jar. The user
will then submit the job like this:
$ hadoop jar query.jar DriverCode sample.txt result.output
1. query.jar: the jar file containing the code (driver, mapper, and reducer) to be run on the input file.
2. sample.txt: the input file.
3. result.output: the directory in which the output of the processing will be placed.
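For illustration, a DriverCode class of the kind invoked above might look roughly like the following. This is a minimal sketch using the org.apache.hadoop.mapreduce API; the job name, argument order, and the TokenizerMapper and IntSumReducer classes it refers to (sketched later in this unit) are assumptions rather than part of the original example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverCode {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sample query");
        job.setJarByClass(DriverCode.class);
        job.setMapperClass(TokenizerMapper.class);     // mapper class, sketched later in this unit
        job.setReducerClass(IntSumReducer.class);      // reducer class, sketched later in this unit
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // sample.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // result.output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit the job and wait
    }
}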
So, the Job Tracker traps this request and asks the Name Node to run it
on sample.txt. The Name Node then provides the metadata to the Job Tracker. The Job Tracker
now knows that sample.txt is stored as first.txt, second.txt, third.txt, and fourth.txt. As each of
these four parts has three copies stored in HDFS, the Job Tracker communicates with the Task
Tracker (a slave service) for each of these parts, but it communicates with only one copy of
each part, the one residing nearest to it. Applying the desired code to the
local first.txt, second.txt, third.txt, and fourth.txt is a process. This process is called Map.
In Hadoop terminology, the main file sample.txt is called the input file and its four sub-files are
called input splits. So, in Hadoop, the number of mappers for an input file is equal to the number
of input splits of that input file. In the above case, the input file sample.txt has four input splits,
hence four mappers will be running to process it. The responsibility of handling these mappers
lies with the Job Tracker.
Note that the Task Trackers are slave services to the Job Tracker. If one of the local
machines broke down unnoticed, the processing over that part of the file would stop and halt the
complete job. To avoid this, each Task Tracker sends a heartbeat, along with its number of free
slots, to the Job Tracker every 3 seconds. This is called the status of the Task Tracker. In case a
Task Tracker goes down, the Job Tracker waits for 10 heartbeat intervals, that is, 30 seconds, and
if even after that it does not get any status, it assumes that the Task Tracker is either dead or
extremely busy. The Job Tracker then communicates with the Task Tracker holding another copy
of the same part and directs it to process the desired code over it. Similarly, the slot information
is used by the Job Tracker to keep track of how many tasks are currently being served by a Task
Tracker and how many more tasks can be assigned to it. In this way, the Job Tracker keeps track
of our request.
Now, suppose that the system has generated output for the individual first.txt, second.txt, third.txt,
and fourth.txt parts. But this is not the user’s desired output. To produce the desired output, all
these individual outputs have to be merged, or reduced, to a single output. This reduction of
multiple outputs to a single one is also a process, and it is done by the REDUCER. In Hadoop, as
many reducers as there are, that many output files are generated. By default, a job runs with a
single reducer.
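The number of reducers is set on the job object in the driver; for example, the following (hypothetical) setting would produce four output files:

job.setNumReduceTasks(4);   // output directory will then contain part-r-00000 ... part-r-00003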
Note: Map and Reduce are two different processes of the second component of Hadoop, that
is, MapReduce. These are also called the phases of MapReduce. Thus we can say that
MapReduce has two phases: Phase 1 is Map and Phase 2 is Reduce.
■Functioning of MapReduce
Now, let us move back to our sample.txt file with the same content. Again it is divided
into four input splits, namely first.txt, second.txt, third.txt, and fourth.txt. Now, suppose we
want to count the number of occurrences of each word in the file. Recall that the content of the
file is:
Hello I am Google
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths
Figure 1 illustrates this example (figure not reproduced here).
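For this word count, a minimal mapper and reducer pair might look roughly as follows (a sketch against the org.apache.hadoop.mapreduce API, matching the driver sketched earlier; the class names are assumptions, and each class would normally live in its own .java file):

// TokenizerMapper.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map phase: for every word found in an input split, emit the pair (word, 1).
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// IntSumReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce phase: for each word (key), add up all the 1s emitted by the mappers.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}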
■Google App Engine (GAE) Data Store and Services
An update of an entity occurs in a transaction that is retried a fixed number of times if other
processes are trying to update the same entity simultaneously. Your application can execute
multiple data store operations in a single transaction which either all succeed or all fail together.
The data store implements transactions across its distributed network using “entity groups.” A
transaction manipulates entities within a single group. Entities of the same group are stored
together for efficient execution of transactions. Your GAE application can assign entities to
groups when the entities are created. The performance of the data store can be enhanced by in-
memory caching using the memcache, which can also be used independently of the data store.
Recently, Google added the blobstore which is suitable for large files as its size limit is 2 GB.
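As a rough sketch of how such a transactional update looks with GAE’s low-level Java datastore API (the entity kind “Counter” and its property name are illustrative assumptions):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Transaction;

public class CounterUpdate {
    // Increment a hypothetical Counter entity inside a transaction; all operations
    // between beginTransaction() and commit() succeed or fail together.
    public static void increment(String counterName) {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        Transaction txn = datastore.beginTransaction();
        try {
            Key key = KeyFactory.createKey("Counter", counterName);
            Entity counter;
            try {
                counter = datastore.get(txn, key);
            } catch (EntityNotFoundException e) {
                counter = new Entity(key);          // first update: create the entity
                counter.setProperty("count", 0L);
            }
            long count = (Long) counter.getProperty("count");
            counter.setProperty("count", count + 1);
            datastore.put(txn, counter);
            txn.commit();                           // the update is visible only if this succeeds
        } finally {
            if (txn.isActive()) {
                txn.rollback();                     // undo everything if commit was not reached
            }
        }
    }
}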
There are several mechanisms for incorporating external resources. The Google Secure
Data Connector (SDC) can tunnel through the Internet and link your intranet to an external GAE
application. The URL Fetch operation provides the ability for applications to fetch resources and
communicate with other hosts over the Internet using HTTP and HTTPS requests. There is a
specialized mail mechanism to send e-mail from your GAE application. Applications can access
resources on the Internet, such as web services or other data, using GAE’s URL fetch service.
The URL fetch service retrieves web resources using the same high-speed Google infrastructure
that retrieves web pages for many other Google products. There are dozens of Google
“corporate” facilities including maps, sites, groups, calendar, docs, and YouTube, among others.
These support the Google Data API which can be used inside GAE.
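A minimal sketch of calling the URL Fetch service from Java (the target address passed in is a placeholder):

import java.net.URL;
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

public class FetchExample {
    // Issue an HTTP GET through GAE's URL Fetch service and return the body as text.
    public static String fetchPage(String address) throws Exception {
        URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
        HTTPResponse response = fetcher.fetch(new URL(address));
        return new String(response.getContent(), "UTF-8");
    }
}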
An application can use Google Accounts for user authentication. Google Accounts handles user
account creation and sign-in, and a user that already has a Google account (such as a Gmail
account) can use that account with your app. GAE provides the ability to manipulate image data
using a dedicated Images service which can resize, rotate, flip, crop, and enhance images. An
application can perform tasks outside of responding to web requests. Your application can
perform these tasks on a schedule that you configure, such as on a daily or hourly basis using
“cron jobs,” handled by the Cron service.
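In a Java GAE application, such cron jobs are declared in a cron.xml file deployed with the app; a small illustrative example (the handler URL and schedule are hypothetical):

<?xml version="1.0" encoding="UTF-8"?>
<cronentries>
  <cron>
    <url>/tasks/daily-cleanup</url>
    <description>hypothetical daily maintenance task</description>
    <schedule>every 24 hours</schedule>
  </cron>
</cronentries>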
Alternatively, the application can perform tasks added to a queue by the application itself, such
as a background task created while handling a request. A GAE application is configured to
consume resources up to certain limits or quotas. With quotas, GAE ensures that your application
won’t exceed your budget, and that other applications running on GAE won’t impact the
performance of your app. In particular, GAE use is free up to certain quotas.
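A sketch of adding a background task to the default queue from Java, of the kind described above (the /worker handler URL and parameter name are hypothetical):

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class EnqueueExample {
    // Add a task while handling a request; the Task Queue service later invokes
    // the handler mapped to /worker with the given parameter.
    public static void enqueue(String payload) {
        Queue queue = QueueFactory.getDefaultQueue();
        queue.add(TaskOptions.Builder.withUrl("/worker").param("payload", payload));
    }
}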