0% found this document useful (0 votes)

591 views33 pages

HDFS Architecture and MapReduce Overview

The document discusses Hadoop, an open source framework for distributed processing of large datasets. It describes Hadoop's architecture including HDFS for storage, YARN for resource management, and MapReduce for parallel processing. HDFS uses a master-slave architecture with a NameNode master and DataNode slaves. It stores replicated blocks of data across clusters for fault tolerance. MapReduce allows distributed processing of large datasets using a map and reduce algorithm run on Hadoop clusters.

Uploaded by

steffinamorin L

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

591 views33 pages

HDFS Architecture and MapReduce Overview

Uploaded by

steffinamorin L

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Introduction to Hadoop Framework
MapReduce Techniques

CS8791-Cloud Computing Unit V Notes

UNIT V CLOUD TECHNOLOGIES AND ADVANCEMENTS

Hadoop – MapReduce – Virtual Box - Google App Engine – Programming Environment for
Google App Engine –– Open Stack – Federation in the Cloud – Four Levels of Federation –
Federated Services and Applications – Future of Federation
5.1 Introduction to Hadoop Framework
 Hadoop is an Apache open source framework written in java that allows distributed
processing of large datasets across clusters of computers using simple programming
models.
 Hadoop is designed to scale up from single server to thousands of machines, each offering
local computation and storage.
 Hadoop runs applications using the MapReduce algorithm, where the data is processed in
parallel on different CPU nodes.

Users of Hadoop:
 Hadoop is running search on some of the Internet's largest sites:
o Amazon Web Services: Elastic MapReduce
o AOL: Variety of uses, e.g., behavioral analysis & targeting
o Ebay: Search optimization (532-node cluster)
o Facebook: Reporting/analytics, machine learning (1100 m.)
o LinkedIn: People You May Know (2x50 machines)
o Twitter: Store + process tweets, log files, other data Yahoo: >36,000 nodes; biggest
cluster is 4,000 nodes
Hadoop Architecture
 Hadoop has a Master Slave Architecture for both Storage & Processing
 Hadoop framework includes following four modules:
 Hadoop Common: These are Java libraries and provide file system and OS level
abstractions and contains the necessary Java files and scripts required to start Hadoop.
VII SEMESTER 1

annauniversityedu.b
CS8791-Cloud Computing Unit V Notes

 Hadoop YARN: This is a framework for job scheduling and cluster resource management.
 Hadoop Distributed File System (HDFS): A distributed file system that provides high-
throughput access to application data.
 HadoopMapReduce: This is system for parallel processing of large data sets.

Figure 5.1 Hadoop Architecture

 The Hadoop core is divided into two fundamental layers:
 MapReduce engine
 HDFS
 The MapReduce engine is the computation engine running on top of HDFS as its data
storage manager.
 HDFS: HDFS is a distributed file system inspired by GFS that organizes files and stores
their data on a distributed computing system.
 HDFS Architecture: HDFS has a master/slave architecture containing a single Name Node
as the master and a number of Data Nodes as workers (slaves).

HDFS
 To store a file in this architecture,
 HDFS splits the file into fixed-size blocks (e.g., 64 MB) and stores them on workers (Data
Nodes).

VII SEMESTER 2

[Link]
CS8791-Cloud Computing Unit V Notes

 The mapping of blocks to Data Nodes is determined by the Name Node.

 The NameNode (master) also manages the file system’s metadata and namespace.
 Namespace is the area maintaining the metadata, and metadata refers to all the information
stored by a file system that is needed for overall management of all files.
 NameNode in the metadata stores all information regarding the location of input
splits/blocks in all DataNodes.
 Each DataNode, usually one per node in a cluster, manages the storage attached to the
node.
 Each DataNode is responsible for storing and retrieving its file blocks
HDFS- Features
Distributed file systems have special requirements
 Performance
 Scalability
 Concurrency Control
 Fault Tolerance
 Security Requirements

HDFS Fault Tolerance

Block replication:
 To reliably store data in HDFS, file blocks are replicated in this system.
 HDFS stores a file as a set of blocks and each block is replicated and distributed across the
whole cluster.
 The replication factor is set by the user and is three by default.
 Replica placement: The placement of replicas is another factor to fulfill the desired fault
tolerance in HDFS.
 Storing replicas on different nodes (DataNodes) located in different racks across the whole
cluster.
 HDFS stores one replica in the same node the original data is stored.
 One replica on a different node but in the same rack
 One replica on a different node in a different rack.
 Heartbeats and Blockreports are periodic messages sent to the NameNode by each
DataNode in a cluster.

[Link]
CS8791-Cloud Computing Unit V Notes

 Receipt of a Heartbeat implies that the DataNode is functioning properly.

 Each Blockreport contains a list of all blocks on a DataNode .
 The NameNode receives such messages because it is the sole decision maker of all replicas
in the system.
HDFS High Throughput
 Applications run on HDFS typically have large data sets.
 Individual files are broken into large blocks to allow HDFS to decrease the amount of
metadata storage required per file.
 The list of blocks per file will shrink as the size of individual blocks increases.
 By keeping large amounts of data sequentially within a block, HDFS provides fast
streaming reads of data.

HDFS- Read Operation

Reading a file :
 To read a file in HDFS, a user sends an “open” request to the NameNode to get the location
of file blocks.
 For each file block, the NameNode returns the address of a
set of DataNodes containing replica information for the requested file.
 The number of addresses depends on the number of block replicas.
 The user calls the read function to connect to the closest DataNode containing the first
block of the file.
 Then the first block is streamed from the respective DataNode to the user.
 The established connection is terminated and the same process is repeated for all blocks of
the requested file until the whole file is streamed to the user.
CS8791-Cloud Computing Unit V Notes

HDFS-Write Operation
Writing to a file:
 To write a file in HDFS, a user sends a “create” request to the NameNode to create a new
file in the file system namespace.
 If the file does not exist, the NameNode notifies the user and allows him to start writing
data to the file by calling the write function.
 The first block of the file is written to an internal queue termed the data queue.
 A data streamer monitors its writing into a DataNode.
 Each file block needs to be replicated by a predefined factor.
 The data streamer first sends a request to the NameNode to get a list of suitable DataNodes
to store replicas of the first block.
 The steamer then stores the block in the first allocated DataNode.
 Afterward, the block is forwarded to the second DataNode by the first DataNode.
 The process continues until all allocated DataNodes receive a replica of the first block from
the previous DataNode.
 Once this replication process is finalized, the same process starts for the second block.

VII SEMESTER 5

[Link]
CS8791-Cloud Computing Unit V Notes

5.1.1 Architecture of Mapreduce in Hadoop

• Distributed file system (HDFS)
• Execution engine (MapReduce)

Properties of Hadoop Engine

 HDFS has a master/slave architecture containing
 A single NameNode as the master and
 A number of DataNodes as workers (slaves).

VII SEMESTER 6

[Link]
CS8791-Cloud Computing Unit V Notes

 To store a file in this architecture, HDFS splits the file into fixed-size blocks (e.g., 64 MB)
and stores them on workers (DataNodes).
 The NameNode (master) also manages the file system’s metadata and namespace.
 Job Tracker is the master node (runs with the namenode)
o Receives the user’s job
o Decides on how many tasks will run (number of mappers)
o Decides on where to run each mapper (concept of locality)
 Task Tracker is the slave node (runs on each datanode)
o Receives the task from Job Tracker
o Runs the task until completion (either map or reduce task)
o Always in communication with the Job Tracker reporting progress (heartbeats)

Running a Job in Hadoop

 Three components contribute in running a job in this system:

o a user node,
o a JobTracker, and
o severalTaskTrackers.

VII SEMESTER 7

[Link]
CS8791-Cloud Computing Unit V Notes

 The data flow starts by calling the runJob(conf) function inside a user program running on
the user node, in which conf is an object containing some tuning parameters for the
MapReduce
 Job Submission: Each job is submitted from a user node to the JobTracker node.
 Task assignment : The JobTracker creates one map task for each computed input split
 Task execution : The control flow to execute a task (either map or reduce) starts inside the
TaskTracker by copying the job JAR file to its file system.
 Task running check : A task running check is performed by receiving periodic heartbeat
messages to the JobTracker from the TaskTrackers.
 Heartbeat: notifies the JobTracker that the sending TaskTracker is alive, and whether the
sending TaskTracker is ready to run a new task.

The Apache Hadoop project develops open-source software for reliable, scalable, distributed
computing, including:
 HadoopCore, our flagship sub-project, provides a distributed filesystem (HDFS) and
support for the MapReduce distributed computing metaphor.
 HBase builds on Hadoop Core to provide a scalable, distributed database.
 Pig is a high-level data-flow language and execution framework for parallel
[Link] is built on top of Hadoop Core.
 ZooKeeper is a highly available and reliable coordination system. Distributed
applications use ZooKeeper to store and mediate updates for critical shared state.
 Hive is a data warehouse infrastructure built on Hadoop Core that provides data
summarization, adhoc querying and analysis of datasets.

MAP REDUCE
 MapReduce is a programming model for data processing.
 MapReduce is designed to efficiently process large volumes of data by connecting many
commodity computers together to work in parallel
 Hadoop can run MapReduce programs written in various languages like Java, Ruby, and
Python
 MapReduce works by breaking the processing into two phases:
o The map phase and
VII SEMESTER 8

[Link]
CS8791-Cloud Computing Unit V Notes

o The reduce phase.

 Each phase has key-value pairs as input and output, the types of which may be chosen by
the programmer.
 The programmer also specifies two functions:
o The map function and
o The reduce function.
 In MapReduce, chunks are processed in isolation by tasks called Mappers
 The outputs from the mappers are denoted as intermediate outputs (IOs) and are brought
into a second set of tasks called Reducers
 The process of bringing together IOs into a set of Reducers is known as shuffling process
 The Reducers produce the final outputs (FOs)

 Overall, MapReduce breaks the data flow into two phases, map phase and reduce phase

Mapreduce Workflow
Application writer specifies
 A pair of functions called Mapper and Reducer and a set of input files and submits the job
 Input phase generates a number of FileSplits from input files (one per Map task)
 The Map phase executes a user function to transform input key-pairs into a new set of key-
pairs
 The framework Sorts & Shuffles the key-pairs to output nodes
 The Reduce phase combines all key-pairs with the same key into new keypairs
 The output phase writes the resulting pairs to files as
“parts” Characteristics of MapReduce is characterized by:
 Its simplified programming model which allows the user to quickly write and test
distributed systems
VII SEMESTER 9

[Link]
CS8791-Cloud Computing Unit V Notes

 Its efficient and automatic distribution of data and workload across machines
 Its flat scalability curve. Specifically, after a Mapreduce program is written and functioning
on 10 nodes, very little-if any- work is required for making that same program run on 1000
nodes
The core concept of MapReduce in Hadoop is that input may be split into logical
chunks, and each chunk may be initially processed independently, by a map task. The results of
these individual processing chunks can be physically partitioned into distinct sets, which are
then sorted. Each sorted chunk is passed to a reduce task.
A map task may run on any compute node in the cluster, and multiple map tasks may
berunning in parallel across the cluster. The map task is responsible for transforming the input
records into key/value pairs. The output of all of the maps will be partitioned, and each partition
will be sorted. There will be one partition for each reduce task. Each partition’s sorted keys and the
values associated with the keys are then processed by the reduce task. There may be multiple
reduce tasks running in parallel on the cluster.
The application developer needs to provide only four items to the Hadoop framework:
the class that will read the input records and transform them into one key/value pair per record,
a map method, a reduce method, and a class that will transform the key/value pairs that the
reduce method outputs into output records.
My first MapReduce application was a specialized web crawler. This crawler received
as input large sets of media URLs that were to have their content fetched and processed. The
media items were large, and fetching them had a significant cost in time and resources.
The job had several steps:
1. Ingest the URLs and their associated metadata.
2. Normalize the URLs.
3. Eliminate duplicate URLs.
4. Filter the URLs against a set of exclusion and inclusion filters.
5. Filter the URLs against a do not fetch list.
6. Filter the URLs against a recently seen set.
7. Fetch the URLs.
8. Fingerprint the content items.
9. Update the recently seen set.

VII SEMESTER 10

[Link]
CS8791-Cloud Computing Unit V Notes

10. Prepare the work list for the next application.

Figure 4.7: The MapReduce model

Map Reduce Example:

Input File:
Welcome to Hadoop Class
Hadoop is good

VII SEMESTER 11

[Link]
CS8791-Cloud Computing Unit V Notes

Hadoop is bad

5.2 VirtualBox
VirtualBox is a general-purpose full virtualizer for x86 hardware, targeted at server,
desktop, and embedded use. Developed initially by Innotek GmbH, it was acquired by Sun
Microsystems in 2008, which was, in turn, acquired by Oracle in 2010.
VirtualBox is an extremely feature rich, high-performance product for enterprise
customers, it is also the only professional solution that is freely available as Open Source Software
under the terms of the GNU General Public License (GPL) version 2. It supports Windows, Linux,
Macintosh, Sun Solaris, and FreeBSD. Virtual Box has supported Open Virtualization Format
(OVF) since version 2.2.0 (April 2009)

VII SEMESTER 12

[Link]
CS8791-Cloud Computing Unit V Notes

Operating system virtualization allows your computer’s hardware to run many operating system
images simultaneously. One of the most used instances of this is to test software or applications in
a different environment, rather than on a different computer. This can potentially save you quite a
bit of money by running multiple servers virtually on one computer.
Pros and Cons of Virtual Box over VMWare
 Virtual Box is a general-purpose full virtualizer for x86 hardware, targeted at server, desktop,
and embedded use.
 This product is a Type 2 hypervisor, so it’s virtualization host software that runs on an already
established operating system as an application.
 With VirtualBox, it’s also possible to share your clipboard between the virtualized and host
operating system.
 While VMWare functions on Windows and Linux, not Mac, Virtual Box works with
Windows, Mac, and Linux computers.
 Virtual Box truly has a lot of support because it's open-source and free. Being open-source
means that recent releases are sometimes a bit buggy, but also that they typically get fixed
relatively quickly.
 With VMWare player, instead, you have to wait for the company to release an update to fix
the bugs.
 Virtual Box offers you an unlimited number of snapshots.
 Virtual Box is easy to install, takes a smaller amount of resources, and is many people’s first
choice.
 VMWare often failed to detect my USB device Besides, VirtualBox can detect as well as
identify USB devices after installing Virtual Box Extension Pack.
 With VirtualBox Guest Addition, files can be dragged and copied between VirtualBox and
host.
 VMware outperforms VirtualBox in terms of CPU and memory utilization.
 VirtualBox has snapshots and VMware has rollback points to which you can revert back to in
case you break your virtual machine.
 VMware calls it Unity mode and VirtualBox calls it the seamless mode and they both enable
you to open application windows on the host machine, while the VM supporting that app is
running in the background quietly.

VII SEMESTER 13

[Link]
CS8791-Cloud Computing Unit V Notes

 In the case of VirtualBox, the UI is simple and clean. Your settings are split into Machine
Tools and Global Tools and the former is for creating, modifying, starting, stop and deleting
virtual machines. VMware, on the other hand, has a much more complicated UI, menu items
are named with technical terms which may seem like jargon to average users. This is
primarily because the VMware folks cater to cloud providers and server-side virtualizations
more.
 In case of VirtualBox, PCIe pass through can be accomplished, although you might have to
jump through some hoops. VMware on the other offers excellent customer support and would
help you out if you are in a fix.
 VirtualBox is basically a highly secure program that allows users to download and run OS as
a virtual machine. With Virtual Box, users are able to abstract their hardware via complete
virtualization thus guaranteeing a higher degree of protection from viruses running in the
guest OS.
 Virtual Box offers limited support for 3D graphics. And VMWare has a high-level 3D
graphics support with DX10 and OpenGL 3.3 support.
 The real advantage of VirtualBox over VMware server lies in its performance. VirtualBox
apparently runs faster than VMware server. A timed experiment of an installation of Windows
XP as the guest OS took 20 mins in VirtualBox and 35 mins on VMware server. A similar test
on the booting time of the guest OS also shows favor to VirtualBox with timing of 45secs
compared to 1min 39 secs on VMware server.
 In VirtualBox, the remote file sharing feature is built right in the package. Setting up remote
file sharing is easy and you only need to do it once: point the file path to the directory that you
want to share.

5.3 GOOGLE APPLICATION ENGINE (GAE)

 Google App Engine is a PaaS cloud that provides a complete Web service
environment(Platform)
 GAE provides Web application development platform for users.
 All required hardware, operating systems and software are provided to clients.
 Clients can develop their own applications, while App Engine runs the applications on
Google’s servers.

VII SEMESTER 14

[Link]
CS8791-Cloud Computing Unit V Notes

 GAE helps to easily develop an Web Application

 App Engine only supports the Java and Python programming languages.
 The Google App Engine (GAE) provides a powerful distributed data storage service.
GOOGLE CLOUD INFRASTRUCTURE

 Google has established cloud development by making use of large number of data centers.
 Eg: Google established cloud services in
 Gmail
 Google Docs
 Google Earth etc.
 These applications can support a large number of users simultaneously with High
Availability (HA).
 In 2008, Google announced the GAE web application platform.
 GAE enables users to run their applications on a large number of data centers.
 Google App Engine environment includes the following features :
 Dynamic web serving
 Persistent(constant) storage with queries, sorting, and transactions
 Automatic scaling and load balancing
 Provides Application Programming Interface(API) for authenticating users.
 Send email using Google Accounts.
 Local development environment that simulates(create) Google App Engine on your
computer.
GAE ARCHITECTURE

VII SEMESTER 15

[Link]
CS8791-Cloud Computing Unit V Notes

TECHNOLOGIES USED BY GOOGLE ARE

 Google File System(GFS) ->for storing large amounts of data.

 MapReduce->for application program development.
 Chubby-> for distributed application lock services.
 BigTable-> offers a storage service.
 Third-party application providers can use GAE to build cloud applications for providing
services.
 Inside each data center, there are thousands of servers forming different clusters.
 GAE runs the user program on Google’s infrastructure.
 Application developers now do not need to worry about the maintenance of servers.
 GAE can be thought of as the combination of several software components.
 GAE supports Python and Java programming environments.

FUNCTIONAL MODULES OF GAE

 The GAE platform comprises the following five major components.

 DataStore: offers data storage services based on BigTable techniques.
 The Google App Engine (GAE) provides a powerful distributed data storage service.
 This provides a secure data Storage.

VII SEMESTER 16

[Link]
CS8791-Cloud Computing Unit V Notes

GOOGLE SECURE DATA CONNECTOR (SDC)

FUNCTIONAL MODULES OF GAE

 When the user wants to get the data, he/she will first send an authorized data requests to
Google Apps.
 It forwards the request to the tunnel server.
 The tunnel servers validate the request identity.
 If the identity is valid, the tunnel protocol allows the SDC to set up a connection,
authenticate, and encrypt the data that flows across the Internet.
 SDC also validates whether a user is authorized to access a specified resource.
 Application runtime environment offers a platform for web programming and execution.
 It supports two development languages: Python and Java.
 Software Development Kit (SDK) is used for local application development.
 The SDK allows users to execute test runs of local applications and upload application
code.
 Administration console is used for easy management of user application development
cycles.
 GAE web service infrastructure provides special guarantee flexible use and management of
storage and network resources by GAE.
 Google offers essentially free GAE services to all Gmail account owners.

VII SEMESTER 17

[Link]
CS8791-Cloud Computing Unit V Notes

 We can register for a GAE account or use your Gmail account name to sign up for the
service.
 The service is free within a quota.
 If you exceed the quota, extra amount will be charged.
 Allows the user to deploy user-built applications on top of the cloud infrastructure.
 They are built using the programming languages and software tools supported by the
provider (e.g., Java, Python)

GAE APPLICATIONS

Well-known GAE applications

 Google Search Engine

 Google Docs
 Google Earth
 Gmail
 These applications can support large numbers of users simultaneously.
 Users can interact with Google applications via the web interface provided by each
application.
 Applications run in the Google data centers.
 Inside each data center, there might be thousands of server nodes to form different clusters.
 Each cluster can run multipurpose servers.

5. 3.1 Programming Support of Google App Engine

GAE programming model for two supported languages: Java and Python. A client
environment includes an Eclipse plug-in for Java allows you to debug your GAE on your local
machine. Google Web Toolkit is available for Java web application developers. Python is used
with frameworks such as Django and CherryPy, but Google also has webapp Python environment.

VII SEMESTER 18

[Link]
CS8791-Cloud Computing Unit V Notes

There are several powerful constructs for storing and accessing data. The data store is a
NOSQL data management system for entities. Java offers Java Data Object (JDO) and Java
Persistence API (JPA) interfaces implemented by the Data Nucleus Access platform, while Python
has a SQL-like query language called GQL. The performance of the data store can be enhanced by
in-memory caching using the memcache, which can also be used independently of the data store.
Recently, Google added the blobstore which is suitable for large files as its size limit is 2
GB. There are several mechanisms for incorporating external resources. The Google SDC Secure
Data Connection can tunnel through the Internet and link your intranet to an external GAE
application. The URL Fetch operation provides the ability for applications to fetch resources and
communicate with other hosts over the Internet using HTTP and HTTPS requests.

VII SEMESTER 19

[Link]
CS8791-Cloud Computing Unit V Notes

An application can use Google Accounts for user authentication. Google Accounts handles
user account creation and sign-in, and a user that already has a Google account (such as a Gmail
account) can use that account with your app. GAE provides the ability to manipulate image data
using a dedicated Images service which can resize, rotate, flip, crop, and enhance images. A GAE
application is configured to consume resources up to certain limits or quotas. With quotas, GAE
ensures that your application won’t exceed your budget, and that other applications running on
GAE won’t impact the performance of your app. In particular, GAE use is free up to certain
quotas.
Google File System (GFS)
GFS is a fundamental storage service for Google’s search engine. GFS was designed for
Google applications, and Google applications were built for GFS. There are several concerns in
GFS. rate). As servers are composed of inexpensive commodity components, it is the norm rather
than the exception that concurrent failures will occur all the time. Other concerns the file size in
GFS. GFS typically will hold a large number of huge files, each 100 MB or larger, with files that
are multiple GB in size quite common. Thus, Google has chosen its file data block size to be 64
MB instead of the 4 KB in typical traditional file systems. The I/O pattern in the Google
application is also special. Files are typically written once, and the write operations are often the
appending data blocks to the end of files. Multiple appending operations might be concurrent. The
customized API can simplify the problem and focus on Google applications.
Figure shows the GFS architecture. It is quite obvious that there is a single master in the
whole cluster. Other nodes act as the chunk servers for storing data, while the single master stores
the metadata. The file system namespace and locking facilities are managed by the master. The
master periodically communicates with the chunk servers to collect management information as
well as give instructions to the chunk servers to do work such as load balancing or fail recovery.

VII SEMESTER 20

[Link]
CS8791-Cloud Computing Unit V Notes

The master has enough information to keep the whole cluster in a healthy state. Google
uses a shadow master to replicate all the data on the master, and the design guarantees that all the
data operations are performed directly between the client and the chunk server. The control
messages are transferred between the master and the clients and they can be cached for future use.
With the current quality of commodity servers, the single master can handle a cluster of more than
1,000 nodes.

The mutation takes the following steps:

1. The client asks the master which chunk server holds the current lease for the chunk and the
locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses
(not shown).

VII SEMESTER 21

[Link]
CS8791-Cloud Computing Unit V Notes

2. The master replies with the identity of the primary and the locations of the other (secondary)
replicas. The client caches this data for future mutations. It needs to contact the master again only
when the primary becomes unreachable or replies that it no longer holds a lease.
3. The client pushes the data to all the replicas. Each chunk server will store the data in an internal
LRU buffer cache until the data is used or aged out. By decoupling the data flow from the control
flow, we can improve performance by scheduling the expensive data flow based on the network
topology regardless of which chunk server is the primary.
4. Once all the replicas have acknowledged receiving the data, the client sends a write request to
the primary. The request identifies the data pushed earlier to all the replicas. The primary assigns
consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which
provides the necessary serialization. It applies the mutation to its own local state in serial order.
5. The primary forwards the write request to all secondary replicas. Each secondary replica applies
mutations in the same serial number order assigned by the primary.
6. The secondaries all reply to the primary indicating that they have completed the operation.
7. The primary replies to the client. Any errors encountered at any replicas are reported to the
client. In case of errors, the write corrects at the primary and an arbitrary subset of the secondary
replicas. The client request is considered to have failed, and the modified region is left in an
inconsistent state. Our client code handles such errors by retrying the failed mutation

Big Table
BigTable was designed to provide a service for storing and retrieving structured and
semistructured data. BigTable applications include storage of web pages, per-user data, and
geographic locations. The database needs to support very high read/write rates and the scale might
be millions of operations per second. Also, the database needs to support efficient scans over all or
interesting subsets of data, as well as efficient joins of large one-to-one and one-to-many data sets.
The application may need to examine data changes over time.
The BigTable system is scalable, which means the system has thousands of servers,
terabytes of in-memory data, petabytes of disk-based data, millions of reads/writes per second, and
efficient scans. BigTable is used in many projects, including Google Search, Orkut, and Google
Maps/Google Earth, among others.

VII SEMESTER 22

[Link]
CS8791-Cloud Computing Unit V Notes

The BigTable system is built on top of an existing Google cloud infrastructure. BigTable uses the
following building blocks:
1. GFS: stores persistent state
2. Scheduler: schedules jobs involved in BigTable serving
3. Lock service: master election, location bootstrapping
4. MapReduce: often used to read/write BigTable data.

5.4 Open Stack

OpenStack is a free and open-source software platform for cloud computing.
OpenStack is a virtualization tool to manage your virtual infrastructure.
OpenStack consists of multiple components.
 Compute (Nova)
 Image Service (Glance)
 Object Storage (Swift)
 Dashboard (Horizon)
 Identity Service (Keystone)
 Networking (Neutron)
 Block Storage (Cinder)
 Telemetry (Ceilometer)
 Orchestration (Heat)
 Workflow (Mistral)
 Database (Trove)
 Elastic map reduce (Sahara)
 Bare metal (Ironic)
 Messaging (Zaqar)
 Shared file system (Manila)
 DNS (Designate)
 Search (Searchlight)
 Key manager (Barbican)
 Root Cause Analysis (Vitrage)

VII SEMESTER 23

[Link]
CS8791-Cloud Computing Unit V Notes

 Rule-based alarm actions (Aodh)

Compute (Nova)
OpenStack Compute is also known as OpenStack Nova.
Nova is the primary compute engine of OpenStack, used for deploying and managing virtual
machine.
OpenStack Compute manages pools of computer resources and work with virtualization
technologies.
Nova can be deployed using hypervisor technologies such as KVM, VMware, LXC, XenServer,
etc.
Image Service (Glance)
OpenStack image service offers storing and retrieval of virtual machine disk images.
OpenStack Compute makes use of this during VM provisioning.
Glance has client-server architecture which allows querying of virtual machine image.
While deploying new virtual machine instances, Glance uses the stored images as
templates. OpenStack Glance supports VirtualBox, VMWare and KVM virtual machine
images.

VII SEMESTER 24

[Link]
CS8791-Cloud Computing Unit V Notes

Object Storage (Swift)

OpenStack Swift creates redundant (repetition), scalable data storage to store petabytes of
accessible data.
The data can be included,retrieved and updated.
It has a distributed architecture, providing greater redundancy, scalability, and performance,
with no central point of control.
It helps organizations to store lots of data safely, cheaply and efficiently.

Dashboard (Horizon)
OpenStack Horizon is a web-based graphical interface that cloud administrators and users can
access tomanage OpenStack compute, storage and networking services.
To service providers it provides services such as monitoring,billing, and other management
tools.

Identity Service (Keystone)

Provides an authentication and authorization service for other OpenStack services. Provides a
catalog of OpenStack Services.
Keystone provides a central list of users, mapped against all the OpenStack services, which they
can access.
Keystone supports various forms of authentication like standard username & password
credentials.

Networking (Neutron)
Neutron provides networking capability like managing networks and IP addresses for
OpenStack.
OpenStack networking allows users to create their own networks and connects devices and
servers to one or more networks.
Neutron also offers an extension framework, which supports deploying and managing of
other network services such as virtual private networks (VPN), firewalls, load balancing, and
intrusion detection system (IDS)

VII SEMESTER 25

[Link]
CS8791-Cloud Computing Unit V Notes

Block Storage (Cinder)

Orchestrates multiple composite cloud applications by using templates.
It creates and manages service that provides persistent data storage to cloud computing
applications.
Provides persistent block storage to running virtual machine.
Cinder also provides a self-service application programming interface (API) to enable users to
request and consume storage resources.
A cloud user can manage their storage needs by integrating block storage volumes with
Dashboard and Nova.
It is appropriate for expandable file systems and database storage.

Telemetry (Ceilometer)
It provides customer billing, resource tracking, and alarming
capabilities across all OpenStack core components.

Orchestration (Heat)
Heat is a service to orchestrate (coordinates) multiple composite cloud applications using
templates.

Workflow (Mistral)
Mistral is a service that manages workflows.
User typically writes a workflow using workflow language and uploads the workflow definition.
 The user can start workflow manually.

Database (Trove)
Trove is Database as a Service for OpenStack.
Allows users to quickly and easily utilize the features of a database without the burden of
handling complex administrative tasks.

Elastic map reduce (Sahara)

Sahara is a component to easily and rapidly provision Hadoop clusters.

VII SEMESTER 26

[Link]
CS8791-Cloud Computing Unit V Notes

Users will specify several parameters like the Hadoop version number, the cluster topology
type, node flavor details (defining disk space, CPU and RAM settings), and others.

Bare metal (Ironic)

Ironic provisions bare metal machines instead of virtual machines.

Messaging (Zaqar)
Zaqar is a multi-tenant cloud messaging service for Web developers.

Shared file system (Manila)

Manila is the OpenStack Shared Filesystems service for providing Shared Filesystems as a
service.
Allows to create, delete, and give/deny access to a file.

DNS (Designate)
Designate is a multi-tenant API for managing DNS.

Search (Searchlight)
Searchlight provides advanced and consistent search capabilities across various OpenStack
cloud services.

Key manager (Barbican)

It provides secure storage, provisioning and management of secret data.
This includes keying material such as Symmetric Keys,Asymmetric Keys and Certificates.

Root Cause Analysis (Vitrage)

Vitrage is the OpenStack RCA (Root Cause Analysis) service for organizing, analyzing and
expanding OpenStack alarms & events, yielding insights regarding the root cause of problems and
deducing their existence before they are directly detected.

Rule-based alarm actions (Aodh)

VII SEMESTER 27

[Link]
CS8791-Cloud Computing Unit V Notes

This alarming service enables the ability to trigger actions based on defined rules against an
event data collected by Ceilometer.

5.5 Federation in the cloud

Inter cloud:
 The Inter-Cloud is an interconnected global "cloud of clouds" and an extension of
the Internet "network of networks" on which it is based.
 Inter-Cloud computing is interconnecting multiple cloud providers’ infrastructures.
 The main focus is on direct interoperability between public cloud service providers.
 To provide cloud services as utility successfully, interconnected clouds are required.
 Interoperability and portability are important factors.
 The limitations of cloud are that they have limited physical resources.
 If a cloud has exhausted all the computational and storage resources, it cannot provide
service to the clients.
 The Inter-Cloud environment provides benefits like diverse Geographical locations, better
application resilience and avoiding vendor lock-in to the cloud client.
 Benefits for the cloud provider are expand-on-demand and better service level agreements
(SLA) to the cloud client.
Types of Inter-Cloud
 Federation Clouds
 Multi-Cloud
Federation Clouds
 A Federation cloud is an Inter-Cloud where a set of cloud providers willingly
interconnect their cloud infrastructures in order to share resources among each other.

VII SEMESTER 28

[Link]
CS8791-Cloud Computing Unit V Notes

Multi-Cloud

A multi-cloud environment has no volunteer interconnection and sharing of the cloud

service providers’ infrastructures.

representatives.

cloud portfolios.

Cloud Federation

using a common standard.

services from disparate networks that can be accessed by a client.

public or external clouds like private or internal clouds

entities.

of physical data centers.

Benefits of cloud federation:
 The federation of cloud resources allows client to optimize enterprise IT service
delivery.
 The federation of cloud resources allows a client to choose best cloud service
providers
 In terms of flexibility cost and availability of services to meet a particular business or
technological need within their organization.
 Federation across different cloud resources pools allows applications to run in the

VII SEMESTER 29

[Link]
CS8791-Cloud Computing Unit V Notes

most appropriate infrastructure environments.

 The federation of cloud resources allows an enterprise to distribute workload around
the globe and move data between desperate networks and implement innovative
security models for user access to cloud resources
Cloud Federation and Implementation
 One weakness that exist in the federation of cloud resources is the difficulty in
programming connectivity.
 Connection between a client and a given external cloud provider is difficult as they each
possess their own unique network addressing scheme.
 Cloud providers must grant clients the permission to specify an addressing scheme for
each server the cloud provider has external to the internet.
 This provides customers to with the ability to access cloud services without the need for
reconfiguration when using resources from different service providers.
 Cloud federation can also be implemented behind a firewall which providing clients
with the menu of cloud services provided by one or more trusted entities
5.5.2 Four Levels of Federation:

Permissive Federation:

VII SEMESTER 30

[Link]
CS8791-Cloud Computing Unit V Notes

server without verifying its identity using DNS lookups or certificate checking.

pretend to be someone else), which opens the door to widespread spam and other abuses.
Verified Federation:

identity of the peer has been verified.

beforehand.

domain spoofing.

release of the open-source jabberd 1.2 server.

Encrypted federation:

Security (TLS) as defined for XMPP in Request for Comments (RFC) 3920.

to provide identity verification.

request from an originating server.

the domain asserted by the originating server.

VII SEMESTER 31

[Link]
CS8791-Cloud Computing Unit V Notes

instances of address spoofing on the XMPP network

Trusted federation:

supports TLS and the peer can present a digital certificate issued by a root certification
authority (CA) that is trusted by the authenticating server.

operating system, XMPP server software, or local service policy.

authentication.

easy to obtain.
5.5.3Federated Services and Applications:

the network.

ability to find other entities of interest.

directories, data relays, messaging gateways, etc.

entities.

its identity, capabilities, and associated entities.

particular domain about the entities associated with that authoritative server.

detail services that are available but hosted elsewhere.

technology, which enables end users to keep track of various types of entities.

VII SEMESTER 32

[Link]
CS8791-Cloud Computing Unit V Notes

Future of Federation:

cloud that can interact with people, devices, information feeds, documents, application
interfaces, and other entities.

applications without asking permission from a large, centralized communications

operator.

maintain several distributed data-centers or server-farms, for example to serve to multiple

geographically distributed offices, to implement HA, or to guarantee server proximity to
the end user. Resources and networks in these distributed data-centers are usually
configured as non-cooperative separate elements.

that usually do not cooperate with other institutions, except in same punctual situations
(e.g. in joint projects or initiatives). Many times, even different departments within the
same institution maintain their own non-cooperative infrastructures.

image formats, and access methods exposed by different providers that make very difficult
for an average user to move its applications from one cloud to another, so leading to a
vendor lock-in problem.

internal computing necessities and workloads. These infrastructures are often over-sized
to satisfy peak demand periods, and avoid performance slow-down. Hybrid cloud (or
cloud bursting) model is a solution to reduce the on-premise infrastructure size, so that it
can be dimensioned for an average load, and it is complemented with external resources
from a public cloud provider to satisfy peak demands.

might be cost-saving for the provider but is often undesirable for the user”. The
commission aims to develop with “stakeholders model terms for cloud computing service
level agreements for contracts”.

VII SEMESTER 33

[Link]

Common questions

Google App Engine's architecture supports high availability and scalability through features such as dynamic web serving, persistent storage with queries, automatic scaling, and load balancing. It operates on Google's infrastructure, leveraging vast data centers that house thousands of servers grouped into clusters. This setup allows GAE to distribute workloads efficiently and manage resources dynamically as demand fluctuates. The GAE environment also includes APIs for authentication and email services, aiding in rapid application deployment with robust performance under variable loads .

Google App Engine supports cloud application development primarily through its environment for Java and Python. It includes an Eclipse plug-in for Java developers and the Google Web Toolkit. For Python, frameworks like Django and webapp are available. The programming model provides constructs such as the DataStore, a NoSQL database, and other services like URL Fetch, memcache, and a dedicated Images service. GAE also handles user authentication via Google Accounts and includes quotas to manage resource usage. These tools and features enable developers to create scalable, managed applications without worrying about underlying infrastructure .

HDFS supports high-throughput requirements through its architecture, which involves dividing data into large blocks stored across multiple DataNodes. This design minimizes metadata overhead and allows parallel data access, thus increasing throughput. The NameNode manages metadata, while DataNodes ensure data is streamed sequentially for fast reads and writes. Its master-slave architecture, combined with fault tolerance via replication, ensures continuous data availability and performance even in the presence of node failures .

Google App Engine offers benefits like seamless scaling, robust infrastructure with global reach, built-in services (e.g., data storage via BigTable), and strong integration with other Google services. However, limitations include support for only two programming languages (Python and Java), potential cost implications beyond free quotas, and restrictions inherent in a PaaS model where developers relinquish some control over the infrastructure. Despite these limitations, GAE is valuable for developers needing scalability and ease-of-use without managing underlying hardware .

Hadoop consists of key components including HDFS, MapReduce, and YARN. HDFS is a distributed file system that stores data across multiple machines, providing high-throughput access to data. It has a master-slave architecture with a NameNode managing metadata and DataNodes storing data blocks. MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. YARN, or Yet Another Resource Negotiator, manages resources and job scheduling across the cluster. Together, these components enable the efficient processing of large data sets by distributing the workload across many nodes in a cluster .

In HDFS, reading a file involves the client sending an 'open' request to the NameNode, which provides the DataNode locations of the file blocks. The client then reads data sequentially from these nodes. Writing a file involves the client sending a 'create' request to the NameNode. The data is then streamed to the first DataNode and subsequently replicated to other DataNodes as specified by the replication factor. This pipeline method ensures data consistency and reliability by guaranteeing that all copies of data blocks are consistent and stored across different nodes .

The Google File System (GFS) is designed to handle large amounts of data across distributed systems, differing from traditional file systems by using larger file blocks (64 MB compared to 4 KB in traditional systems) to minimize metadata requirements and enable efficient storage of large files. GFS is built to accommodate high data throughput and recover from failures through its fault-tolerant architecture, where files are split into blocks and replicated across multiple servers. Its primary goals include reliability, scalability, performance, and fault tolerance to meet the demands of large-scale applications like Google's search engine .

MapReduce is crucial in the Hadoop ecosystem as it provides the computational engine that processes large data sets in parallel across a Hadoop cluster. It integrates with HDFS by utilizing the data stored across different nodes for map and reduce tasks. The JobTracker (master) assigns tasks to TaskTrackers (slaves), ensuring tasks are executed close to the data location to optimize performance. Integration with YARN allows for resource management and job scheduling, enhancing efficiency in task execution across the cluster .

HDFS manages fault tolerance by replicating file blocks across multiple DataNodes with a default replication factor of three. Block replication ensures that data is stored reliably even if some nodes fail. The placement strategy involves distributing replicas across different nodes and racks to prevent data loss due to node or rack failure. Heartbeats and Blockreports are also used to monitor the health of DataNodes, allowing the NameNode to make informed decisions about data replication and management .

VirtualBox offers advantages such as being more lightweight and faster in certain operations compared to VMware, especially for running guest operating systems like Windows XP. It includes built-in remote file sharing features and requires less configuration for PCIe pass-through. VMware, on the other hand, provides better support for 3D graphics and superior customer service. The choice between them depends on specific needs: VirtualBox might be preferred when speed and less resource usage are priorities, whereas VMware might be preferred for applications requiring advanced graphic support or enterprise-level technical support .

Introduction to Hadoop Framework
No ratings yet
Introduction to Hadoop Framework
5 pages
Understanding HDFS Architecture
No ratings yet
Understanding HDFS Architecture
18 pages
Global Cloud Resource Management
No ratings yet
Global Cloud Resource Management
20 pages
HDFS Design and Concepts Overview
No ratings yet
HDFS Design and Concepts Overview
16 pages
Overview of Apache Hadoop Framework
No ratings yet
Overview of Apache Hadoop Framework
76 pages
HDFS and Big Data Analytics Overview
No ratings yet
HDFS and Big Data Analytics Overview
18 pages
CCS334 Big Data Analytics Overview
No ratings yet
CCS334 Big Data Analytics Overview
18 pages
HBase: A Guide to Column-Oriented NoSQL
No ratings yet
HBase: A Guide to Column-Oriented NoSQL
23 pages
HDFS Read and Write Operations Explained
No ratings yet
HDFS Read and Write Operations Explained
7 pages
Introduction to Big Data Concepts
100% (1)
Introduction to Big Data Concepts
15 pages
Overview of Hadoop Modules and HDFS
No ratings yet
Overview of Hadoop Modules and HDFS
101 pages
Data Types and Attributes Overview
No ratings yet
Data Types and Attributes Overview
40 pages
Machine Learning in Data Science: Unit 5
No ratings yet
Machine Learning in Data Science: Unit 5
19 pages
Data Science Tools and Neo4j Overview
No ratings yet
Data Science Tools and Neo4j Overview
45 pages
Cloud Computing Architecture Overview
No ratings yet
Cloud Computing Architecture Overview
30 pages
NoSQL Database Overview and Applications
No ratings yet
NoSQL Database Overview and Applications
64 pages
Basics of Hadoop in Big Data Analytics
No ratings yet
Basics of Hadoop in Big Data Analytics
22 pages
Introduction to Hadoop: History & HDFS
100% (1)
Introduction to Hadoop: History & HDFS
43 pages
Mining Social Network Graphs Analysis
No ratings yet
Mining Social Network Graphs Analysis
30 pages
Hadoop Modules Overview and Features
No ratings yet
Hadoop Modules Overview and Features
6 pages
Schema-less Models in Big Data Analytics
No ratings yet
Schema-less Models in Big Data Analytics
125 pages
CCS367 Storage Technologies Overview
No ratings yet
CCS367 Storage Technologies Overview
53 pages
Overview of Hadoop Distributed File System
No ratings yet
Overview of Hadoop Distributed File System
5 pages
Understanding Kohonen Self-Organizing Maps
No ratings yet
Understanding Kohonen Self-Organizing Maps
15 pages
Understanding the 5 V's of Big Data
No ratings yet
Understanding the 5 V's of Big Data
2 pages
Understanding MapReduce in Hadoop
No ratings yet
Understanding MapReduce in Hadoop
11 pages
Big Data Analytics Lab Manual
No ratings yet
Big Data Analytics Lab Manual
83 pages
Digital Marketing in a Non-Line World
No ratings yet
Digital Marketing in a Non-Line World
11 pages
Data Analysis Using Hadoop Framework
No ratings yet
Data Analysis Using Hadoop Framework
4 pages
Non-Classical Information Retrieval Models
No ratings yet
Non-Classical Information Retrieval Models
8 pages
Kangaroo Transaction Model in Mobile DBs
No ratings yet
Kangaroo Transaction Model in Mobile DBs
41 pages
5 Applications
No ratings yet
5 Applications
48 pages
NoSQL Database Types and Challenges
No ratings yet
NoSQL Database Types and Challenges
20 pages
NoSQL Data Replication and Distribution Models
100% (1)
NoSQL Data Replication and Distribution Models
87 pages
Feed Forward Neural Network Overview
No ratings yet
Feed Forward Neural Network Overview
7 pages
RDBMS vs Hadoop: Key Differences
No ratings yet
RDBMS vs Hadoop: Key Differences
2 pages
Overview of Hadoop and Big Data Analytics
100% (1)
Overview of Hadoop and Big Data Analytics
25 pages
Unit Testing MapReduce with MRUnit
No ratings yet
Unit Testing MapReduce with MRUnit
31 pages
Unit 5 Ids
No ratings yet
Unit 5 Ids
19 pages
Hadoop and Python Integration Guide
No ratings yet
Hadoop and Python Integration Guide
50 pages
Understanding MapReduce Architecture
No ratings yet
Understanding MapReduce Architecture
5 pages
Cloud ML and AI Services Overview
No ratings yet
Cloud ML and AI Services Overview
20 pages
HDFS Data Blocks and High Availability
No ratings yet
HDFS Data Blocks and High Availability
10 pages
Understanding Community Cloud Models
No ratings yet
Understanding Community Cloud Models
72 pages
Full Stack Web Development Overview
0% (1)
Full Stack Web Development Overview
13 pages
HDFS Overview for BCS714D Notes
No ratings yet
HDFS Overview for BCS714D Notes
27 pages
Key Features of MapReduce in Hadoop
No ratings yet
Key Features of MapReduce in Hadoop
4 pages
Developing Pig Latin Scripts in Hadoop
No ratings yet
Developing Pig Latin Scripts in Hadoop
42 pages
Full Stack Web Development Syllabus
No ratings yet
Full Stack Web Development Syllabus
19 pages
HBase Overview: Architecture and Features
No ratings yet
HBase Overview: Architecture and Features
27 pages
Overview of Operating System Functions
No ratings yet
Overview of Operating System Functions
35 pages
Understanding Hive as a NoSQL Database
No ratings yet
Understanding Hive as a NoSQL Database
9 pages
DBMS All in One R19
No ratings yet
DBMS All in One R19
167 pages
MANETs and Localization in Mobile Networks
No ratings yet
MANETs and Localization in Mobile Networks
12 pages
Understanding Big Data Characteristics
No ratings yet
Understanding Big Data Characteristics
18 pages
Big Data and Hadoop Architecture Overview
No ratings yet
Big Data and Hadoop Architecture Overview
14 pages
HDFS Design and Hadoop Cluster Setup
No ratings yet
HDFS Design and Hadoop Cluster Setup
1 page
High Performance Computing Overview
No ratings yet
High Performance Computing Overview
61 pages
Key-Value and Graph NoSQL Databases
No ratings yet
Key-Value and Graph NoSQL Databases
12 pages
Understanding Hadoop Architecture
No ratings yet
Understanding Hadoop Architecture
33 pages
Embedded Systems & IoT Question Bank
No ratings yet
Embedded Systems & IoT Question Bank
21 pages
Shadow Paging and ARIES Recovery in DBMS
No ratings yet
Shadow Paging and ARIES Recovery in DBMS
36 pages
Install VMware and C Compiler on Linux
No ratings yet
Install VMware and C Compiler on Linux
80 pages
Algorithmic Problem Solving Techniques
100% (1)
Algorithmic Problem Solving Techniques
162 pages
Python Programming Lab Manual
No ratings yet
Python Programming Lab Manual
49 pages
Sleepy Hollow: A Timeless Legend
No ratings yet
Sleepy Hollow: A Timeless Legend
41 pages
Marketing Strategies and Brand Equity
No ratings yet
Marketing Strategies and Brand Equity
73 pages
Module5. Human Flourishing
No ratings yet
Module5. Human Flourishing
8 pages
Size Effect on Corrosion in Concrete
No ratings yet
Size Effect on Corrosion in Concrete
9 pages
Data Analysis and Interpretation Overview
No ratings yet
Data Analysis and Interpretation Overview
42 pages
Understanding Switchgear Types and Functions
No ratings yet
Understanding Switchgear Types and Functions
1 page
Application for Lab Assistant Position
No ratings yet
Application for Lab Assistant Position
3 pages
Understanding Constitutions and Their Functions
No ratings yet
Understanding Constitutions and Their Functions
14 pages
BR.Net Core Banking User Manual
No ratings yet
BR.Net Core Banking User Manual
78 pages
The Mystery of Banking: Murray N. Rothbard
No ratings yet
The Mystery of Banking: Murray N. Rothbard
77 pages
Mindfulness Meditations For Depression - 100 Practices For Solace and Self-Compassion
100% (1)
Mindfulness Meditations For Depression - 100 Practices For Solace and Self-Compassion
187 pages
Shefali Patel: Frontend Developer Profile
No ratings yet
Shefali Patel: Frontend Developer Profile
2 pages
Draft Amendments to E-Commerce Rules
No ratings yet
Draft Amendments to E-Commerce Rules
6 pages
Effective HR Management Strategies
No ratings yet
Effective HR Management Strategies
5 pages
Socio-Economic Development Overview
No ratings yet
Socio-Economic Development Overview
53 pages
UI/UX Designer Profile: Animesh Sadh
No ratings yet
UI/UX Designer Profile: Animesh Sadh
1 page
Cassius's Tragic Dilemma in Battle
No ratings yet
Cassius's Tragic Dilemma in Battle
5 pages
Overview of Intrauterine Devices (IUDs)
No ratings yet
Overview of Intrauterine Devices (IUDs)
80 pages
The Uses of Diversity: Essays by Chesterton
No ratings yet
The Uses of Diversity: Essays by Chesterton
196 pages
Identifying and Combating Fake News
100% (1)
Identifying and Combating Fake News
35 pages
PMI GTP Training Checklist Guide
No ratings yet
PMI GTP Training Checklist Guide
6 pages
Explore Topics on Scribd
No ratings yet
Explore Topics on Scribd
1 page
Academic Policies and Enrollment Guidelines
No ratings yet
Academic Policies and Enrollment Guidelines
12 pages
Green Olympiad 2019 Registration Details
No ratings yet
Green Olympiad 2019 Registration Details
3 pages
Overcoming Stage Fright: A Guide
No ratings yet
Overcoming Stage Fright: A Guide
1 page
Project Management in Tanzanian Organizations
No ratings yet
Project Management in Tanzanian Organizations
9 pages
Class Dynamics in College Experiences
No ratings yet
Class Dynamics in College Experiences
18 pages
Kas297a Altitude Selector Im 006-00512-0002 - 2
100% (3)
Kas297a Altitude Selector Im 006-00512-0002 - 2
21 pages
Spring Constant and Potential Energy Effects
No ratings yet
Spring Constant and Potential Energy Effects
8 pages
Seneca Indians' Land Rights Overview
No ratings yet
Seneca Indians' Land Rights Overview
39 pages

HDFS Architecture and MapReduce Overview

Uploaded by

HDFS Architecture and MapReduce Overview

Uploaded by

CS8791-Cloud Computing Unit V Notes

UNIT V CLOUD TECHNOLOGIES AND ADVANCEMENTS

Figure 5.1 Hadoop Architecture

 The mapping of blocks to Data Nodes is determined by the Name Node.

HDFS Fault Tolerance

 Receipt of a Heartbeat implies that the DataNode is functioning properly.

HDFS- Read Operation

5.1.1 Architecture of Mapreduce in Hadoop

Properties of Hadoop Engine

Running a Job in Hadoop

 Three components contribute in running a job in this system:

o The reduce phase.

10. Prepare the work list for the next application.

Figure 4.7: The MapReduce model

Map Reduce Example:

5.3 GOOGLE APPLICATION ENGINE (GAE)

 GAE helps to easily develop an Web Application

TECHNOLOGIES USED BY GOOGLE ARE

 Google File System(GFS) ->for storing large amounts of data.

FUNCTIONAL MODULES OF GAE

 The GAE platform comprises the following five major components.

GOOGLE SECURE DATA CONNECTOR (SDC)

FUNCTIONAL MODULES OF GAE

Well-known GAE applications

 Google Search Engine

5. 3.1 Programming Support of Google App Engine

The mutation takes the following steps:

5.4 Open Stack

 Rule-based alarm actions (Aodh)

Object Storage (Swift)

Identity Service (Keystone)

Block Storage (Cinder)

Elastic map reduce (Sahara)

Bare metal (Ironic)

Shared file system (Manila)

Key manager (Barbican)

Root Cause Analysis (Vitrage)

Rule-based alarm actions (Aodh)

5.5 Federation in the cloud

A multi-cloud environment has no volunteer interconnection and sharing of the cloud

using a common standard.

services from disparate networks that can be accessed by a client.

public or external clouds like private or internal clouds

of physical data centers.

most appropriate infrastructure environments.

identity of the peer has been verified.

release of the open-source jabberd 1.2 server.

to provide identity verification.

request from an originating server.

the domain asserted by the originating server.

instances of address spoofing on the XMPP network

operating system, XMPP server software, or local service policy.

ability to find other entities of interest.

directories, data relays, messaging gateways, etc.

its identity, capabilities, and associated entities.

detail services that are available but hosted elsewhere.

applications without asking permission from a large, centralized communications

maintain several distributed data-centers or server-farms, for example to serve to multiple

Common questions

In what ways does the Google App Engine’s architecture facilitate high availability and scalability for web applications?

In what ways does the Google App Engine’s architecture facilitate high availability and scalability for web applications?

Describe the programming environment for Google App Engine and how it supports cloud application development.

Describe the programming environment for Google App Engine and how it supports cloud application development.

How does the architecture and functionality of HDFS support the high-throughput requirements of data-intensive applications?

How does the architecture and functionality of HDFS support the high-throughput requirements of data-intensive applications?

Evaluate the benefits and limitations of the Google App Engine (GAE) platform for developing web applications.

Evaluate the benefits and limitations of the Google App Engine (GAE) platform for developing web applications.

What are the key components of the Hadoop framework and how do they support its distributed processing capabilities?

What are the key components of the Hadoop framework and how do they support its distributed processing capabilities?

Explain the process of reading and writing operations in HDFS and how they ensure data consistency and reliability.

Explain the process of reading and writing operations in HDFS and how they ensure data consistency and reliability.

How does the Google File System (GFS) differ from traditional file systems, and what are its primary design goals?

How does the Google File System (GFS) differ from traditional file systems, and what are its primary design goals?

What role does MapReduce play in the Hadoop ecosystem, and how does it integrate with other Hadoop components?

What role does MapReduce play in the Hadoop ecosystem, and how does it integrate with other Hadoop components?

How does HDFS manage fault tolerance and ensure data reliability in a distributed environment?