Unit 5: Cloud Computing

1. Federation in the Cloud


One challenge in creating and managing a globally decentralized cloud computing environment
is maintaining consistent connectivity between untrusted components while remaining fault-
tolerant. A key opportunity for the emerging cloud industry will be in defining a federated cloud
ecosystem by connecting multiple cloud computing providers using a common standard. The
concept allows multiple providers to interact seamlessly with one another.
1.1 Four Levels of Federation
XMPP is the Extensible Messaging and Presence Protocol, a set of open technologies for instant
messaging, presence, multi-party chat, voice and video calls, collaboration, lightweight
middleware, content syndication, and generalized routing of XML data. Technically speaking,
federation is the ability for two XMPP servers in different domains to exchange XML stanzas.
According to XEP-0238 (XMPP Protocol Flows for Inter-Domain Federation), there are at
least four basic types of federation:
■Permissive federation- Permissive federation occurs when a server accepts a connection from
a peer network server without verifying its identity using DNS lookups or certificate checking.
The lack of verification or authentication may lead to domain spoofing (the unauthorized use of a
third-party domain name in an email message in order to pretend to be someone else), which
opens the door to widespread spam and other abuses. With the release of the open source jabberd
1.2 server in October 2000, which included support for the Server Dialback protocol (fully
supported in Jabber XCP), permissive federation met its demise on the XMPP network.
■Verified federation-This type of federation occurs when a server accepts a connection from a
peer after the identity of the peer has been verified. It uses information obtained via DNS and by
means of domain-specific keys exchanged beforehand. The connection is not encrypted, and the
use of identity verification effectively prevents domain spoofing. To make this work, federation
requires proper DNS setup, and that is still subject to DNS poisoning attacks. Verified federation
has been the default service policy on the open XMPP network since the release of the open-source
jabberd 1.2 server.
■Encrypted federation- In this mode, a server accepts a connection from a peer if and only if
the peer supports Transport Layer Security (TLS) as defined for XMPP in Request for
Comments (RFC) 3920. The peer must present a digital certificate. The certificate may be self-
signed, but this prevents using mutual authentication. If this is the case, both parties proceed to
weakly verify identity using Server Dialback. XEP-0220 defines the Server Dialback protocol,
which is used between XMPP servers to provide identity verification. Server Dialback uses the
DNS as the basis for verifying identity; the basic approach is that when a receiving server
receives a server-to-server connection request from an originating server, it does not accept the
request until it has verified a key with an authoritative server for the domain asserted by the
originating server. Although Server Dialback does not provide strong authentication or trusted
federation, and although it is subject to DNS poisoning attacks, it has effectively prevented most
instances of address spoofing on the XMPP network since its release in 2000. This results in an
encrypted connection with weak identity verification.
■Trusted federation- Here, a server accepts a connection from a peer only under the stipulation
that the peer supports TLS and the peer can present a digital certificate issued by a root
certification authority (CA) that is trusted by the authenticating server. The list of trusted root
CAs may be determined by one or more factors, such as the operating system, XMPP server
software, or local service policy. In trusted federation, the use of digital certificates results not
only in channel encryption but also in strong authentication. The use of trusted domain
certificates effectively prevents DNS poisoning attacks but makes federation more difficult to
achieve, since such certificates have traditionally not been easy to obtain.

2. Federated Services and Applications


Clouds typically consist of all the users, devices, services, and applications connected to the
network. In order to fully leverage the capabilities of this cloud structure, a participant needs the
ability to find other entities of interest. Such entities might be end users, multiuser chat rooms,
real-time content feeds, user directories, data relays, messaging gateways, etc. Finding these
entities is a process called discovery. XMPP uses service discovery to find the aforementioned
entities. The discovery protocol enables any network participant to query another entity
regarding its identity, capabilities, and associated entities. When a participant connects to the
network, it queries the authoritative server for its particular domain about the entities associated
with that authoritative server. In response to a service discovery query, the authoritative server
informs the inquirer about services hosted there and may also detail services that are available
but hosted elsewhere. XMPP includes a method for maintaining personal lists of other entities,
known as roster technology, which enables end users to keep track of various types of entities.
Usually, these lists consist of other entities the users are interested in or interact with
regularly. Most XMPP deployments include custom directories so that internal users of those
services can easily find what they are looking for.
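As an illustration only, the sketch below issues such discovery queries with the open-source Smack library's ServiceDiscoveryManager (its XEP-0030 service discovery support); the exact method signatures vary between Smack versions, and the domain name used here is an assumption for the example.

import org.jivesoftware.smack.XMPPConnection;
import org.jivesoftware.smackx.disco.ServiceDiscoveryManager;
import org.jivesoftware.smackx.disco.packet.DiscoverInfo;
import org.jivesoftware.smackx.disco.packet.DiscoverItems;
import org.jxmpp.jid.impl.JidCreate;

public class DiscoveryExample {
    // Assumes an already-connected, authenticated XMPPConnection.
    public static void listServices(XMPPConnection connection) throws Exception {
        ServiceDiscoveryManager disco = ServiceDiscoveryManager.getInstanceFor(connection);

        // Ask the authoritative server for the (assumed) domain about its identity
        // and capabilities: a disco#info query.
        DiscoverInfo info = disco.discoverInfo(JidCreate.from("example.com"));
        for (DiscoverInfo.Identity identity : info.getIdentities()) {
            System.out.println("identity: " + identity.getCategory() + "/" + identity.getType());
        }

        // Ask which associated entities it hosts or knows about: a disco#items query
        // (chat rooms, gateways, publish-subscribe services, and so on).
        DiscoverItems items = disco.discoverItems(JidCreate.from("example.com"));
        for (DiscoverItems.Item item : items.getItems()) {
            System.out.println("item: " + item.getEntityID());
        }
    }
}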

3. The Future of Federation


The implementation of federated communications is a precursor to building a seamless cloud that
can interact with people, devices, information feeds, documents, application interfaces, and other
entities. The power of a federated, presence-enabled communications infrastructure is that it
enables software developers and service providers to build and deploy such applications without
asking permission from a large, centralized communications operator. The process of server-to-
server federation for the purpose of inter-domain communication has played a large role in the
success of XMPP, which relies on a small set of simple but powerful mechanisms for domain
checking and security to generate verified, encrypted, and trusted connections between any two
deployed servers. These mechanisms have provided a stable, secure foundation for growth of the
XMPP network and similar real-time technologies.

4. Hadoop and MapReduce

■Hadoop: Hadoop is a software framework that allows the distributed processing of huge
data sets across clusters of computers using simple programming models. In simple terms,
Hadoop is a framework for processing ‘Big Data’. Hadoop was created by Doug Cutting. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Hadoop is open-source software. The core of Apache Hadoop
consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a
processing part, which is the MapReduce programming model. Hadoop splits files into
large blocks and distributes them across the nodes in a cluster. It then transfers packaged code
to the nodes to process the data in parallel.

■MapReduce: MapReduce is a programming model that is used for processing and generating
large data sets on clusters of computers. It was introduced by Google. MapReduce is a concept,
or a method, for large-scale parallelization. It is inspired by functional
programming’s map() and reduce() functions.
A MapReduce program is executed in three stages (a short plain-Java sketch of these stages follows the list):
 Mapping: The mapper’s job is to process the input data. Each node applies the map function to its
local data.
 Shuffle: Data is redistributed across nodes based on the output keys (the keys produced by the
map function).
 Reduce: Each node processes its group of intermediate data, one key at a time, in parallel.
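To make these three stages concrete, here is a minimal sketch in plain Java (no Hadoop classes, just standard collections) that word-counts two lines of text: the map step emits (word, 1) pairs, the shuffle step groups the values that share a key (a TreeMap also keeps the keys sorted, mirroring Hadoop's sorting), and the reduce step sums each group. The class and variable names are only for illustration.

import java.util.*;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hello I am Google", "How can I help you");

        // Map: emit a (word, 1) pair for every word in every line.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle: group all values that share the same key, e.g. (I, [1, 1]).
        // A TreeMap keeps the keys sorted, which mimics the sorting phase.
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce: sum the values of each key to get the final count per word.
        for (Map.Entry<String, List<Integer>> group : shuffled.entrySet()) {
            int count = group.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(group.getKey() + " - " + count);
        }
    }
}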

So, one of the two components of Hadoop is MapReduce. The first component of Hadoop, the
Hadoop Distributed File System (HDFS), is responsible for storing the file. The second
component, MapReduce, is responsible for processing the file. Suppose there is a word
file containing some text. Let us name this file sample.txt. We use Hadoop to deal with
huge files, but for the sake of easy explanation here, we are taking a small text file as an
example. So, let’s assume that this sample.txt file contains a few lines of text. The content of
the file is as follows:

Hello I am Google
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths

Hence, the above 8 lines are the content of the file. Let’s assume that while storing this file in
Hadoop, HDFS breaks it into four parts and names each part
as first.txt, second.txt, third.txt, and fourth.txt. So, the file is divided into four equal parts and
each part contains 2 lines. The first two lines will be in the file first.txt, the next two lines
in second.txt, the next two in third.txt, and the last two lines will be stored in fourth.txt. All
these files will be stored on Data Nodes, and the Name Node will contain the metadata about
them. All this is the task of HDFS.
Now, suppose a user wants to process this file. This is where MapReduce comes into the
picture. Suppose this user wants to run a query on sample.txt. So, instead of
bringing sample.txt to the local computer, we send the query to the data. To keep track
of our request, we use the Job Tracker (a master service). The Job Tracker receives our request
and keeps track of it.
Now suppose that the user wants to run his query on sample.txt and wants the output
in a result.output file. Let the name of the file containing the query be query.jar (a sketch of
what such a DriverCode class might contain appears after the list below). So, the user
will write a command like:
$ hadoop jar query.jar DriverCode sample.txt result.output
1. query.jar : the query file that needs to be processed on the input file.
2. sample.txt: the input file.
3. result.output: the directory in which the output of the processing will be written.
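As an illustration of what such a DriverCode class might contain, the following sketch uses the standard org.apache.hadoop.mapreduce API (the older mapred API that goes with the Job Tracker terminology differs slightly in class names). The mapper and reducer class names, WordMapper and WordReducer, are placeholders; sketches of such classes appear later in this unit.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverCode {
    public static void main(String[] args) throws Exception {
        // args[0] = input file on HDFS (e.g. sample.txt),
        // args[1] = output directory (e.g. result.output)
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sample query");
        job.setJarByClass(DriverCode.class);

        // Hypothetical mapper/reducer classes packaged inside query.jar.
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}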
So, now the Job Tracker receives this request and asks the Name Node to run it
on sample.txt. The Name Node then provides the metadata to the Job Tracker. The Job Tracker
now knows that sample.txt is stored in first.txt, second.txt, third.txt, and fourth.txt. Since each
of these four files has three copies stored in HDFS, the Job Tracker communicates with the
Task Tracker (a slave service) for each of these files, but it communicates with only one copy
of each file, the one residing nearest to it. Applying the desired code to the
local first.txt, second.txt, third.txt, and fourth.txt is a process called Map.
In Hadoop terminology, the main file sample.txt is called the input file and its four subfiles are
called input splits. So, in Hadoop the number of mappers for an input file is equal to the number
of input splits of that input file. In the above case, the input file sample.txt has four input splits,
hence four mappers will be running to process it. The responsibility of handling these mappers
lies with the Job Tracker.
Note that the task trackers are slave services to the Job Tracker. So, if any of the local
machines breaks down, the processing of that part of the file would stop and halt the
complete process. To avoid this, each task tracker sends a heartbeat and its number of free slots
to the Job Tracker every 3 seconds. This is called the status of the Task Trackers. If any task
tracker goes down, the Job Tracker waits for 10 heartbeat intervals, that is, 30 seconds, and if
even after that it does not get any status, it assumes that the task tracker is either dead or
extremely busy. It then communicates with the task tracker of another copy of the same file and
directs it to process the desired code over it. Similarly, the slot information is used by the Job
Tracker to keep track of how many tasks are currently being served by each task tracker and
how many more tasks can be assigned to it. In this way, the Job Tracker keeps track of our
request.
Now, suppose that the system has generated output for the individual first.txt, second.txt, third.txt,
and fourth.txt. But this is not the user’s desired output. To produce the desired output, all these
individual outputs have to be merged or reduced to a single output. This reduction of multiple
outputs to a single one is also a process, and it is done by the Reducer. In Hadoop, as many
reducers as there are, that many output files are generated. By default, there is
one reducer per job.

Note: Map and Reduce are two different processes of the second component of Hadoop, that
is, MapReduce. These are also called the phases of MapReduce. Thus we can say that MapReduce
has two phases: Phase 1 is Map and Phase 2 is Reduce.
■Functioning of MapReduce

Now, let us move back to our sample.txt file with the same content. Again it is divided
into four input splits, namely first.txt, second.txt, third.txt, and fourth.txt. Now, suppose we
want to count the number of occurrences of each word in the file. The content of the file is:

Hello I am Google
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths

Then the output of the ‘word count’ code will be like:


Hello - 1
I - 3
am - 1
Google - 1
How - 2 (How is written two times in the entire file)
Similarly,
Are - 3
are - 2
…and so on
Thus in order to get this output, the user will have to send his query on the data. Suppose the
query ‘word count’ is in the file wordcount.jar. So, the query will look like:

$ hadoop jar wordcount.jar DriverCode sample.txt result.output


■Types of File Format in Hadoop
Now, as we know, there are four input splits, so four mappers will be running, one on each
input split. But mappers don’t run directly on the input splits, because the input splits
contain text and mappers understand only (key, value) pairs. Thus the text in the input splits
first needs to be converted to (key, value) pairs. This is achieved by Record Readers. Thus we
can also say that there are as many record readers as there are input splits.
In Hadoop terminology, each line of text is termed a ‘record’. How the record reader converts
this text into (key, value) pairs depends on the format of the file. In Hadoop, there are four
file formats. These formats are predefined classes in Hadoop.
The four types of formats are (a short configuration sketch follows the list):
1. TextInputFormat
2. KeyValueTextInputFormat
3. SequenceFileInputFormat
4. SequenceFileAsTextInputFormat
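As a small sketch of how a job selects one of these predefined classes (assuming the org.apache.hadoop.mapreduce.lib.input package of the newer API), the input format is set on the Job object in the driver, for example the DriverCode sketch earlier:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatConfig {
    // 'job' is the Job being configured in the driver class.
    static void chooseFormat(Job job) {
        // Default: records become (byte offset, entire line) pairs.
        job.setInputFormatClass(TextInputFormat.class);

        // Alternatives (one at a time), from the same package:
        // org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat
        //   - splits each line on a tab into (key, value) text pairs.
        // org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
        //   - reads binary sequence files of (key, value) records.
        // org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat
        //   - same, but presents the keys and values as text.
    }
}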
By default, a file is in TextInputFormat. The record reader reads one record (line) at a time and
converts each record into a (key, value) pair whose form depends on the file’s format. For the
time being, let’s assume that the first input split, first.txt, is in TextInputFormat. The record
reader working on this input split converts each record into the form (byte offset, entire line).
For example, first.txt has the content:
Hello I am Google
How can I help you
So, the output of the record reader is two pairs (since there are two records in the file). The first
pair is (0, Hello I am Google) and the second pair is (18, How can I help you).
Note that the second pair has a byte offset of 18 because there are 17 characters in the first
line and the newline character (\n) is also counted. Thus, after the record reader runs, there are
as many (key, value) pairs as there are records. Now, the mapper will run once for each of these
pairs. Similarly, other mappers run for the (key, value) pairs of the other input splits. In this
way, Hadoop breaks a big task into smaller tasks and executes them in parallel.
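A sketch of a mapper that consumes exactly these (byte offset, entire line) pairs is shown below. This is the hypothetical WordMapper class referred to in the driver sketch earlier: for the word-count query it emits a (word, 1) pair for every word in the line.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key   : byte offset of the line (LongWritable), as produced by TextInputFormat's record reader.
// Input value : the entire line (Text).
// Output      : (word, 1) pairs.
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The map function is called once per record, i.e. once per (offset, line) pair.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}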

■ Shuffling and Sorting


Now, the Mapper produces an output for each (key, value) pair provided by the
record reader. Let us take the first input split, first.txt. The two pairs generated for this file
by the record reader are (0, Hello I am Google) and (18, How can I help you). The Mapper
takes one of these pairs at a time and produces output like (Hello, 1), (I, 1), (am, 1), and
(Google, 1) for the first pair and (How, 1), (can, 1), (I, 1), (help, 1), and (you, 1) for the second
pair. Similarly, we have the outputs of all the Mappers. Note that this data contains duplicate keys
like (I, 1), (How, 1), and so on. These duplicate keys also need to be taken care of. This data
is also called Intermediate Data. Before passing this intermediate data to the reducer, it is first
passed through two more stages, called Shuffling and Sorting.
1. Shuffling Phase: This phase combines all values associated with an identical key. For example,
(Are, 1) appears three times in the input file, so after the shuffling phase the output will be
(Are, [1,1,1]).
2. Sorting Phase: Once shuffling is done, the output is sent to the sorting phase, where all
the (key, value) pairs are sorted automatically by key. In Hadoop, sorting is an automatic process
because keys implement the built-in WritableComparable interface.
After the completion of the shuffling and sorting phases, the resultant output is sent to the
reducer. Now, if there are n distinct keys after the shuffling and sorting phase, then the reduce
function runs n times, once per key, and produces the final result. In the above case, the
resultant output after the reducer processing is stored in the directory result.output, as specified
in the query code written to process the query on the data.
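A matching sketch of the hypothetical WordReducer class (the second placeholder in the driver sketch earlier) is shown below. The framework calls it once per key with the list of values assembled by the shuffling and sorting phases, for example (Are, [1,1,1]), and it sums them to produce (Are, 3).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Called once per key with all of that key's shuffled values, e.g. (Are, [1, 1, 1]) -> (Are, 3).
public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        // The reducer's output is written to the result.output directory named in the job.
        context.write(word, new IntWritable(sum));
    }
}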
5. Programming the Google App Engine (GAE)
Figure 1 below summarizes some key features of the GAE programming model for the two supported
languages: Java and Python. A client environment that includes an Eclipse plug-in for Java
allows debugging GAE applications on a local machine. The Google Web Toolkit (GWT) is also
available for Java web application developers. Developers can use Java, or any other language with
a JVM-based interpreter or compiler, such as JavaScript or Ruby. Python is often used with
frameworks such as Django and CherryPy, but Google also supplies a built-in webapp Python
environment.
There are several powerful constructs for storing and accessing data. The data store is a NoSQL
data management system for entities that can be, at most, 1 MB in size and are labeled by a set of
schema-less properties. Queries can retrieve entities of a given kind, filtered and sorted by the
values of their properties. Java offers the Java Data Objects (JDO) and Java Persistence API (JPA)
interfaces, implemented by the open-source DataNucleus Access Platform, while Python has a
SQL-like query language called GQL. The data store is strongly consistent and uses optimistic
concurrency control.
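As an illustration only, here is a small sketch that stores and queries entities using GAE's low-level Java datastore API (the com.google.appengine.api.datastore classes); the entity kind "Note" and its properties are invented for the example.

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.PreparedQuery;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;

public class NoteStore {
    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

    // Store a schema-less entity of the (hypothetical) kind "Note".
    public void saveNote(String author, String text) {
        Entity note = new Entity("Note");
        note.setProperty("author", author);
        note.setProperty("text", text);
        note.setProperty("created", new java.util.Date());
        datastore.put(note);
    }

    // Query entities of a given kind, filtered and sorted by property values.
    public Iterable<Entity> notesBy(String author) {
        Query q = new Query("Note")
                .setFilter(new FilterPredicate("author", FilterOperator.EQUAL, author))
                .addSort("created", Query.SortDirection.DESCENDING);
        PreparedQuery pq = datastore.prepare(q);
        return pq.asIterable();
    }
}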

Figure 1: Key features of the GAE programming model for Java and Python.
An update of an entity occurs in a transaction that is retried a fixed number of times if other
processes are trying to update the same entity simultaneously. Your application can execute
multiple data store operations in a single transaction which either all succeed or all fail together.
The data store implements transactions across its distributed network using “entity groups.” A
transaction manipulates entities within a single group. Entities of the same group are stored
together for efficient execution of transactions. Your GAE application can assign entities to
groups when the entities are created. The performance of the data store can be enhanced by
in-memory caching using the memcache, which can also be used independently of the data store.
Recently, Google added the blobstore which is suitable for large files as its size limit is 2 GB.
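The documented pattern for such a transaction in the low-level Java API looks roughly like the sketch below; the "Note" kind and property names are again invented, and memcache is shown alongside it since it can be used independently of the data store.

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Transaction;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class TxnAndCacheExample {
    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    private final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

    // Update an entity inside a transaction: the operations succeed or fail together.
    public void updateNoteText(Key noteKey, String newText) throws Exception {
        Transaction txn = datastore.beginTransaction();
        try {
            Entity note = datastore.get(txn, noteKey);   // read within the transaction
            note.setProperty("text", newText);
            datastore.put(txn, note);                    // write within the same entity group
            txn.commit();
        } finally {
            if (txn.isActive()) {
                txn.rollback();                          // e.g. after a concurrent-update failure
            }
        }
    }

    // Memcache can be used on its own or as a cache in front of the data store.
    public void cacheNote(String id, String text) {
        cache.put("note:" + id, text);
    }
}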
There are several mechanisms for incorporating external resources. The Google Secure Data
Connector (SDC) can tunnel through the Internet and link your intranet to an external GAE
application. The URL Fetch operation provides the ability for applications to fetch resources and
communicate with other hosts over the Internet using HTTP and HTTPS requests. There is a
specialized mail mechanism to send e-mail from your GAE application. Applications can access
resources on the Internet, such as web services or other data, using GAE’s URL fetch service.
The URL fetch service retrieves web resources using the same high-speed Google infrastructure
that retrieves web pages for many other Google products. There are dozens of Google
“corporate” facilities including maps, sites, groups, calendar, docs, and YouTube, among others.
These support the Google Data API which can be used inside GAE.
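A minimal sketch of a URL fetch using the com.google.appengine.api.urlfetch classes is shown below; the target address is only an example, and ordinary java.net.URL connections inside GAE are routed through the same service.

import java.net.URL;
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

public class FetchExample {
    // Fetch a web resource over HTTP(S) using the URL Fetch service.
    public static String fetch(String address) throws Exception {
        URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
        HTTPResponse response = fetcher.fetch(new URL(address));   // e.g. "https://example.com/feed"
        return new String(response.getContent(), "UTF-8");         // response body as bytes -> text
    }
}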

An application can use Google Accounts for user authentication. Google Accounts handles user
account creation and sign-in, and a user that already has a Google account (such as a Gmail
account) can use that account with your app. GAE provides the ability to manipulate image data
using a dedicated Images service which can resize, rotate, flip, crop, and enhance images. An
application can perform tasks outside of responding to web requests. Your application can
perform these tasks on a schedule that you configure, such as on a daily or hourly basis using
“cron jobs,” handled by the Cron service.
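A short sketch of the Google Accounts integration using the com.google.appengine.api.users classes is shown below; the servlet itself and its URL path are hypothetical.

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.google.appengine.api.users.User;
import com.google.appengine.api.users.UserService;
import com.google.appengine.api.users.UserServiceFactory;

// A hypothetical servlet that greets a signed-in Google Account or redirects to the sign-in page.
public class HelloServlet extends HttpServlet {
    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        UserService userService = UserServiceFactory.getUserService();
        User user = userService.getCurrentUser();
        if (user == null) {
            // Not signed in: send the visitor to the Google Accounts login page.
            resp.sendRedirect(userService.createLoginURL(req.getRequestURI()));
        } else {
            resp.setContentType("text/plain");
            resp.getWriter().println("Hello, " + user.getNickname());
        }
    }
}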

Alternatively, the application can perform tasks added to a queue by the application itself, such
as a background task created while handling a request. A GAE application is configured to
consume resources up to certain limits or quotas. With quotas, GAE ensures that your application
won’t exceed your budget, and that other applications running on GAE won’t impact the
performance of your app. In particular, GAE use is free up to certain quotas.
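A brief sketch of adding such a background task with the Task Queue API (com.google.appengine.api.taskqueue) is shown below; the worker URL is hypothetical and would be handled by another request handler in the same application.

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class EnqueueExample {
    // Add a background task while handling a request; GAE later POSTs to the worker URL.
    public static void scheduleResize(String imageId) {
        Queue queue = QueueFactory.getDefaultQueue();
        queue.add(TaskOptions.Builder.withUrl("/tasks/resize")   // hypothetical worker endpoint
                .param("imageId", imageId));
    }
}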
