Unit 5-PLH
HDFS – Map Reduce – Google App Engine (GAE) – Programming Environment for GAE –
Architecture of GFS – Case Studies: Openstack, Heroku, and Docker Containers –Amazon
EC2, AWS, Microsoft Azure, Google Compute Engine
Users of Hadoop:
Hadoop is running search on some of the Internet's largest sites:
o Amazon Web Services: Elastic MapReduce
o AOL: Variety of uses, e.g., behavioral analysis & targeting
o Ebay: Search optimization (532-node cluster)
o Facebook: Reporting/analytics, machine learning (1,100 machines)
o LinkedIn: People You May Know (2x50 machines)
o Twitter: Store + process tweets, log files, other data
o Yahoo: >36,000 nodes; biggest cluster is 4,000 nodes
Hadoop Architecture
Hadoop has a Master Slave Architecture for both Storage & Processing
The Hadoop framework includes the following four modules:
Hadoop Common: Java libraries that provide file system and OS-level abstractions, along with the Java files and scripts required to start Hadoop.
Hadoop YARN: This is a framework for job scheduling and cluster resource management.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-
throughput access to application data.
Hadoop MapReduce: A system for parallel processing of large data sets.
HDFS
To store a file in this architecture,
HDFS splits the file into fixed-size blocks (e.g., 64 MB) and stores them on workers (Data
Nodes).
The mapping of blocks to Data Nodes is determined by the Name Node.
The NameNode (master) also manages the file system’s metadata and namespace.
Namespace is the area maintaining the metadata, and metadata refers to all the information
stored by a file system that is needed for overall management of all files.
The NameNode's metadata records the location of every input split/block on the DataNodes.
Each DataNode, usually one per node in a cluster, manages the storage attached to the
node.
Each DataNode is responsible for storing and retrieving its file blocks
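The split-and-assign idea can be sketched in a few lines of Python; the round-robin assignment below is a simplification of the NameNode's real placement policy, used only to illustrate the metadata the NameNode keeps:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the example block size used above

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs, one per block."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def assign_blocks(blocks, datanodes):
    """Round-robin mapping of block index to DataNode (NameNode metadata)."""
    return {i: datanodes[i % len(datanodes)] for i in range(len(blocks))}

blocks = split_into_blocks(200 * 1024 * 1024)           # a 200 MB file
mapping = assign_blocks(blocks, ["dn1", "dn2", "dn3"])
# 200 MB splits into 4 blocks: 64 + 64 + 64 + 8 MB
```

Note how the last block is smaller than 64 MB; HDFS likewise does not pad the final block of a file.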
HDFS- Features
Distributed file systems have special requirements
Performance
Scalability
Concurrency Control
Fault Tolerance
Security Requirements
HDFS Fault Tolerance
Block replication:
To reliably store data in HDFS, file blocks are replicated in this system.
HDFS stores a file as a set of blocks and each block is replicated and distributed across the
whole cluster.
The replication factor is set by the user and is three by default.
Replica placement: The placement of replicas is another factor to fulfill the desired fault
tolerance in HDFS.
Replicas are stored on different DataNodes located in different racks across the whole cluster:
One replica on the same node where the original data is stored
One replica on a different node in the same rack
One replica on a different node in a different rack
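A minimal sketch of this placement policy, assuming a simple rack map and a "first available node" choice in place of HDFS's real selection logic:

```python
def place_replicas(writer_rack, writer_node, racks):
    """Choose 3 DataNodes following the placement rules above.
    `racks` maps a rack id to the DataNodes it contains."""
    first = writer_node                                          # same node
    second = next(n for n in racks[writer_rack] if n != writer_node)
    other_rack = next(r for r in racks if r != writer_rack)      # another rack
    third = racks[other_rack][0]                                 # node in it
    return [first, second, third]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas("rack1", "dn1", racks)
# replicas == ['dn1', 'dn2', 'dn3']
```

Losing any single node or any single rack still leaves at least one surviving replica, which is exactly the fault-tolerance goal stated above.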
Heartbeats and Blockreports are periodic messages sent to the NameNode by each
DataNode in a cluster.
Receipt of a Heartbeat implies that the DataNode is functioning properly.
Each Blockreport contains a list of all blocks on a DataNode .
The NameNode receives such messages because it is the sole decision maker for all replicas
in the system.
HDFS High Throughput
Applications that run on HDFS typically have large data sets.
Individual files are broken into large blocks to allow HDFS to decrease the amount of
metadata storage required per file.
The list of blocks per file will shrink as the size of individual blocks increases.
By keeping large amounts of data sequentially within a block, HDFS provides fast streaming
reads of data.
HDFS- Read Operation
Reading a file :
To read a file in HDFS, a user sends an “open” request to the NameNode to get the location
of file blocks.
For each file block, the NameNode returns the addresses of the set of DataNodes holding a replica of that block.
The number of addresses depends on the number of block replicas.
The user calls the read function to connect to the closest DataNode containing the first
block of the file.
Then the first block is streamed from the respective DataNode to the user.
The established connection is terminated and the same process is repeated for all blocks of
the requested file until the whole file is streamed to the user.
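The read protocol above can be sketched as a toy simulation; the class and method names below are illustrative, not the real HDFS client API:

```python
class NameNode:
    """Master: holds metadata only (which DataNodes hold each block)."""
    def __init__(self, file_table, block_map):
        self.file_table = file_table    # filename -> ordered block ids
        self.block_map = block_map      # block id  -> replica DataNodes
    def open(self, filename):
        return [(b, self.block_map[b]) for b in self.file_table[filename]]

class DataNode:
    """Worker: actually stores and serves block contents."""
    def __init__(self, store):
        self.store = store
    def read_block(self, block_id):
        return self.store[block_id]

dn1 = DataNode({"blk_1": b"hello ", "blk_2": b"world"})
dn2 = DataNode({"blk_1": b"hello ", "blk_2": b"world"})
nn = NameNode({"/logs/a.txt": ["blk_1", "blk_2"]},
              {"blk_1": [dn1, dn2], "blk_2": [dn2, dn1]})

data = b""
for block_id, replicas in nn.open("/logs/a.txt"):
    # the client streams from the first ("closest") replica of each block
    data += replicas[0].read_block(block_id)
# data == b"hello world"
```

The key point the simulation preserves: file data never flows through the NameNode; the client talks to DataNodes directly, block by block.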
MAP REDUCE
MapReduce is a programming model for data processing.
MapReduce is designed to efficiently process large volumes of data by connecting many
commodity computers together to work in parallel
Hadoop can run MapReduce programs written in various languages like Java, Ruby, and
Python
MapReduce works by breaking the processing into two phases:
o The map phase and
o The reduce phase.
Each phase has key-value pairs as input and output, the types of which may be chosen by
the programmer.
The programmer also specifies two functions:
o The map function and
o The reduce function.
In MapReduce, chunks are processed in isolation by tasks called Mappers
The outputs from the mappers are denoted as intermediate outputs (IOs) and are brought
into a second set of tasks called Reducers
The process of bringing together IOs into a set of Reducers is known as the shuffling process
The Reducers produce the final outputs (FOs)
Overall, MapReduce breaks the data flow into two phases, map phase and reduce phase
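The two phases can be sketched with the classic word-count example in plain Python; the `mapreduce` driver below is a toy stand-in for the Hadoop framework, not its API:

```python
from collections import defaultdict

def map_fn(key, line):
    """Map: emit (word, 1) for every word; `key` is the line offset."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: sum the counts collected for one word."""
    return (word, sum(counts))

def mapreduce(records):
    # map phase: each record is processed in isolation (the Mappers)
    intermediate = [kv for key, val in records for kv in map_fn(key, val)]
    # shuffle: bring intermediate outputs with the same key together
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # reduce phase: one reduce call per distinct key (the Reducers)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = mapreduce(enumerate(["the quick fox", "the lazy dog"]))
# counts["the"] == 2; every other word appears once
```

The programmer supplies only `map_fn` and `reduce_fn`; everything between them (the shuffle) is the framework's job.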
Mapreduce Workflow
Application writer specifies
A pair of functions called Mapper and Reducer and a set of input files and submits the job
Input phase generates a number of FileSplits from input files (one per Map task)
The Map phase executes a user function to transform input key/value pairs into a new set of key/value pairs
The framework sorts and shuffles the key/value pairs to the output nodes
The Reduce phase combines all key/value pairs with the same key into new key/value pairs
The output phase writes the resulting pairs to files as “parts”
Characteristics of MapReduce
MapReduce is characterized by:
Its simplified programming model which allows the user to quickly write and test
distributed systems
Its efficient and automatic distribution of data and workload across machines
Its flat scalability curve. Specifically, after a MapReduce program is written and functioning
on 10 nodes, very little, if any, work is required to make that same program run on 1,000
nodes
The core concept of MapReduce in Hadoop is that input may be split into logical chunks,
and each chunk may be initially processed independently, by a map task. The results of these
individual processing chunks can be physically partitioned into distinct sets, which are then
sorted. Each sorted chunk is passed to a reduce task.
A map task may run on any compute node in the cluster, and multiple map tasks may
be running in parallel across the cluster. The map task is responsible for transforming the input
records into key/value pairs. The output of all of the maps will be partitioned, and each partition will
be sorted. There will be one partition for each reduce task. Each partition’s sorted keys and the values
associated with the keys are then processed by the reduce task. There may be multiple reduce tasks
running in parallel on the cluster.
The application developer needs to provide only four items to the Hadoop framework:
the class that will read the input records and transform them into one key/value pair per record,
a map method, a reduce method, and a class that will transform the key/value pairs that the reduce
method outputs into output records.
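These four items can be seen in miniature in the style of Hadoop Streaming, where the mapper and reducer are small scripts that read and emit tab-separated key/value lines; the sketch below is plain Python, with the framework's sort standing in for the shuffle:

```python
from itertools import groupby

def mapper(lines):
    """Record reader + map method: emit one "word\t1" line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Reduce method + output writer: input lines arrive sorted by key,
    which is exactly how Hadoop's shuffle delivers them."""
    parsed = (line.split("\t") for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"

# the framework's sort stands between the two stages:
out = list(reducer(sorted(mapper(["to be or not to be"]))))
# out == ["be\t2", "not\t1", "or\t1", "to\t2"]
```

In a real streaming job the two functions would be separate scripts reading stdin, wired together by Hadoop rather than by `sorted()`.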
An illustrative early MapReduce application was a specialized web crawler. This crawler received
as input large sets of media URLs that were to have their content fetched and processed. The
media items were large, and fetching them had a significant cost in time and resources.
The job had several steps:
1. Ingest the URLs and their associated metadata.
2. Normalize the URLs.
3. Eliminate duplicate URLs.
4. Filter the URLs against a set of exclusion and inclusion filters.
5. Filter the URLs against a do not fetch list.
6. Filter the URLs against a recently seen set.
7. Fetch the URLs.
8. Fingerprint the content items.
9. Update the recently seen set.
10. Prepare the work list for the next application.
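Steps 2 and 3 (normalization and de-duplication) are a natural fit for a map step followed by a group-by-key; a hedged sketch using Python's standard urllib.parse (the normalization rules shown are one reasonable choice, not the crawler's actual ones):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Step 2: lowercase scheme/host, drop fragments and default ports."""
    parts = urlsplit(url.strip())
    host = parts.hostname or ""
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path or "/", parts.query, ""))

def dedupe(urls):
    """Step 3: in MapReduce terms, group by normalized URL and keep one."""
    seen = set()
    for u in map(normalize, urls):
        if u not in seen:
            seen.add(u)
            yield u

unique = list(dedupe(["http://e.com/", "HTTP://E.com/"]))
# both inputs normalize to "http://e.com/", so only one survives
```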
5.2 GOOGLE APPLICATION ENGINE (GAE)
🞂 Google App Engine is a PaaS cloud that provides a complete Web service
environment (platform)
🞂 GAE provides Web application development platform for users.
🞂 All required hardware, operating systems and software are provided to clients.
🞂 Clients can develop their own applications, while App Engine runs the applications on
Google’s servers.
🞂 GAE helps to easily develop a Web application
🞂 App Engine only supports the Java and Python programming languages.
🞂 The Google App Engine (GAE) provides a powerful distributed data storage service.
GOOGLE CLOUD INFRASTRUCTURE
🞂 Google has established cloud development by making use of large number of data centers.
🞂 Eg: Google established cloud services in
Gmail
Google Docs
Google Earth etc.
🞂 These applications can support a large number of users simultaneously with High
Availability (HA).
🞂 In 2008, Google announced the GAE web application platform.
🞂 GAE enables users to run their applications on a large number of data centers.
🞂 Google App Engine environment includes the following features :
Dynamic web serving
Persistent(constant) storage with queries, sorting, and transactions
Automatic scaling and load balancing
🞂 Provides Application Programming Interface(API) for authenticating users.
🞂 Send email using Google Accounts.
🞂 A local development environment that simulates (creates) Google App Engine on your
computer.
GAE ARCHITECTURE
TECHNOLOGIES USED BY GOOGLE
🞂 When the user wants to get data, he/she first sends an authorized data request to
Google Apps.
🞂 It forwards the request to the tunnel server.
🞂 The tunnel servers validate the request identity.
🞂 If the identity is valid, the tunnel protocol allows the SDC to set up a connection,
authenticate, and encrypt the data that flows across the Internet.
🞂 SDC also validates whether a user is authorized to access a specified resource.
🞂 Application runtime environment offers a platform for web programming and execution.
🞂 It supports two development languages: Python and Java.
🞂 Software Development Kit (SDK) is used for local application development.
🞂 The SDK allows users to execute test runs of local applications and upload application
code.
🞂 Administration console is used for easy management of user application development
cycles.
🞂 The GAE web service infrastructure guarantees flexible use and management of
storage and network resources by GAE.
🞂 Google offers essentially free GAE services to all Gmail account owners.
🞂 You can register for a GAE account or use your Gmail account name to sign up for the
service.
🞂 The service is free within a quota.
🞂 If you exceed the quota, the excess usage is charged.
🞂 Allows the user to deploy user-built applications on top of the cloud infrastructure.
🞂 They are built using the programming languages and software tools supported by the
provider (e.g., Java, Python)
GAE APPLICATIONS
GAE offers a programming model for two supported languages: Java and Python. A client
environment that includes an Eclipse plug-in for Java allows you to debug your GAE application on
your local machine. The Google Web Toolkit is available for Java web application developers. Python
is used with frameworks such as Django and CherryPy, but Google also has its own webapp Python environment.
There are several powerful constructs for storing and accessing data. The data store is a
NOSQL data management system for entities. Java offers Java Data Object (JDO) and Java
Persistence API (JPA) interfaces implemented by the Data Nucleus Access platform, while Python
has a SQL-like query language called GQL. The performance of the data store can be enhanced by
in-memory caching using the memcache, which can also be used independently of the data store.
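The memcache usage just described is the cache-aside pattern; in the sketch below a plain dict stands in for memcache and `datastore_get` for a data store query (both names are invented for illustration, not the GAE API):

```python
cache = {}           # stand-in for memcache
datastore_reads = 0  # counts how often the (slow) data store is hit

def datastore_get(key):
    """Stand-in for a Datastore/GQL lookup."""
    global datastore_reads
    datastore_reads += 1
    return f"entity-for-{key}"

def cached_get(key):
    """Cache-aside: try the cache first, fall back to the data store."""
    if key in cache:
        return cache[key]
    value = datastore_get(key)
    cache[key] = value          # populate the cache for later reads
    return value

cached_get("u1")
cached_get("u1")
# the second call is served from cache: only one data store read happens
```

This is also why memcache "can be used independently of the data store": the pattern works for any slow backing source.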
Recently, Google added the blobstore which is suitable for large files as its size limit is 2
GB. There are several mechanisms for incorporating external resources. The Google Secure
Data Connector (SDC) can tunnel through the Internet and link your intranet to an external GAE
application. The URL Fetch operation provides the ability for applications to fetch resources and
communicate with other hosts over the Internet using HTTP and HTTPS requests.
An application can use Google Accounts for user authentication. Google Accounts handles
user account creation and sign-in, and a user that already has a Google account (such as a Gmail
account) can use that account with your app. GAE provides the ability to manipulate image data
using a dedicated Images service which can resize, rotate, flip, crop, and enhance images. A GAE
application is configured to consume resources up to certain limits or quotas. With quotas, GAE
ensures that your application won’t exceed your budget, and that other applications running on GAE
won’t impact the performance of your app. In particular, GAE use is free up to certain quotas.
Google File System (GFS)
GFS is a fundamental storage service for Google’s search engine. GFS was designed for
Google applications, and Google applications were built for GFS. There are several concerns in GFS.
One concern is the component failure rate: as servers are composed of inexpensive commodity components, it is the norm rather than the
exception that concurrent failures will occur all the time. Another concern is the file size in GFS. GFS
typically will hold a large number of huge files, each 100 MB or larger, with files that are multiple
GB in size quite common. Thus, Google has chosen its file data block size to be 64 MB instead of
the 4 KB in typical traditional file systems. The I/O pattern in the Google application is also special.
Files are typically written once, and write operations often append data blocks to the
end of files. Multiple appending operations might be concurrent. The customized API can simplify
the problem and focus on Google applications.
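The motivation for the 64 MB chunk size can be checked with quick arithmetic: the master's per-file block list shrinks by a factor of 16,384 relative to 4 KB blocks.

```python
KB, MB, GB = 1024, 1024**2, 1024**3

def num_blocks(file_size, block_size):
    """Blocks needed to hold a file (ceiling division)."""
    return -(-file_size // block_size)

one_gb = 1 * GB
small = num_blocks(one_gb, 4 * KB)    # 262144 metadata entries at 4 KB
large = num_blocks(one_gb, 64 * MB)   # just 16 entries at 64 MB
```

With only 16 entries per gigabyte, the single master can keep all chunk metadata in memory, which is central to the GFS design.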
Figure shows the GFS architecture. It is quite obvious that there is a single master in the
whole cluster. Other nodes act as the chunk servers for storing data, while the single master stores
the metadata. The file system namespace and locking facilities are managed by the master. The
master periodically communicates with the chunk servers to collect management information as well
as give instructions to the chunk servers to do work such as load balancing or fail recovery.
The master has enough information to keep the whole cluster in a healthy state. Google uses
a shadow master to replicate all the data on the master, and the design guarantees that all the data
operations are performed directly between the client and the chunk server. The control messages are
transferred between the master and the clients and they can be cached for future use. With the current
quality of commodity servers, the single master can handle a cluster of more than 1,000 nodes.
Big Table
BigTable was designed to provide a service for storing and retrieving structured and
semistructured data. BigTable applications include storage of web pages, per-user data, and
geographic locations. The database needs to support very high read/write rates and the scale might
be millions of operations per second. Also, the database needs to support efficient scans over all or
interesting subsets of data, as well as efficient joins of large one-to-one and one-to-many data sets.
The application may need to examine data changes over time.
The BigTable system is scalable, which means the system has thousands of servers, terabytes
of in-memory data, petabytes of disk-based data, millions of reads/writes per second, and efficient
scans. BigTable is used in many projects, including Google Search, Orkut, and Google Maps/Google
Earth, among others.
The BigTable system is built on top of an existing Google cloud infrastructure. BigTable uses the
following building blocks:
1. GFS: stores persistent state
2. Scheduler: schedules jobs involved in BigTable serving
3. Lock service: master election, location bootstrapping
4. MapReduce: often used to read/write BigTable data.
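BigTable's data model is commonly described as a sparse, sorted, multidimensional map from (row key, column, timestamp) to value; a toy version in plain Python (illustrative only, not the BigTable API):

```python
from collections import defaultdict

class ToyBigTable:
    """(row, column, timestamp) -> value; the newest timestamp wins on read."""
    def __init__(self):
        self.cells = defaultdict(dict)     # (row, col) -> {ts: value}

    def put(self, row, col, ts, value):
        self.cells[(row, col)][ts] = value

    def get(self, row, col):
        versions = self.cells[(row, col)]
        return versions[max(versions)]     # latest version

    def scan(self, row_prefix):
        """Prefix scans over sorted row keys are a core BigTable access pattern."""
        return sorted(k for k in self.cells if k[0].startswith(row_prefix))

t = ToyBigTable()
t.put("com.example/index", "contents", 1, "<html>v1")
t.put("com.example/index", "contents", 2, "<html>v2")
# get() returns the newest version of the page contents
```

Storing web pages under reversed-domain row keys, as here, keeps pages from the same site adjacent, which makes the scans mentioned above efficient.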
Compute (Nova)
OpenStack Compute is also known as OpenStack Nova.
Nova is the primary compute engine of OpenStack, used for deploying and managing virtual
machines.
OpenStack Compute manages pools of compute resources and works with virtualization
technologies.
Nova can be deployed using hypervisor technologies such as KVM, VMware, LXC, XenServer,
etc.
Image Service (Glance)
OpenStack image service offers storing and retrieval of virtual machine disk images.
OpenStack Compute makes use of this during VM provisioning.
Glance has a client-server architecture that allows querying of virtual machine images. While
deploying new virtual machine instances, Glance uses the stored images as templates.
OpenStack Glance supports VirtualBox, VMWare and KVM virtual machine images.
Object Storage (Swift)
OpenStack Swift creates redundant (repetition), scalable data storage to store petabytes of
accessible data.
Data can be added, retrieved, and updated.
It has a distributed architecture, providing greater redundancy, scalability, and performance,
with no central point of control.
It helps organizations to store lots of data safely, cheaply and efficiently.
Dashboard (Horizon)
OpenStack Horizon is a web-based graphical interface that cloud administrators and users can
access to manage OpenStack compute, storage, and networking services.
For service providers it offers services such as monitoring, billing, and other management
tools.
Networking (Neutron)
Neutron provides networking capability like managing networks and IP addresses for
OpenStack.
OpenStack networking allows users to create their own networks and connects devices and
servers to one or more networks.
Neutron also offers an extension framework, which supports deploying and managing of other
network services such as virtual private networks (VPN), firewalls, load balancing, and intrusion
detection system (IDS)
Block Storage (Cinder)
Cinder creates and manages a service that provides persistent data storage to cloud computing
applications.
It provides persistent block storage to running virtual machines.
Cinder also provides a self-service application programming interface (API) to enable users to
request and consume storage resources.
A cloud user can manage their storage needs by integrating block storage volumes with
Dashboard and Nova.
It is appropriate for expandable file systems and database storage.
Telemetry (Ceilometer)
It provides customer billing, resource tracking, and alarming
capabilities across all OpenStack core components.
Orchestration (Heat)
Heat is a service to orchestrate (coordinates) multiple composite cloud applications using
templates.
Workflow (Mistral)
Mistral is a service that manages workflows.
A user typically writes a workflow in the workflow language and uploads the workflow definition.
The user can then start the workflow manually.
Database (Trove)
Trove is Database as a Service for OpenStack.
Allows users to quickly and easily utilize the features of a database without the burden of
handling complex administrative tasks.
Messaging (Zaqar)
Zaqar is a multi-tenant cloud messaging service for Web developers.
DNS (Designate)
Designate is a multi-tenant API for managing DNS.
Search (Searchlight)
Searchlight provides advanced and consistent search capabilities across various OpenStack
cloud services.
What is Azure?
Azure is Microsoft’s cloud platform, just like Google has its Google Cloud and Amazon has its
Amazon Web Services (AWS). Generally, it is a platform through which we can use Microsoft’s
resources. For example, to set up a huge server, we would require huge investment, effort, physical space,
and so on. In such situations, Microsoft Azure comes to our rescue. It provides us with virtual
machines, fast processing of data, analytical and monitoring tools, and so on to make our work simpler.
The pricing of Azure is also simpler and cost-effective. Popularly termed “Pay As You Go”, it
means you pay only for what you use.
Azure History
Microsoft unveiled Windows Azure in early October 2008, but it went live in February 2010.
Later, in 2014, Microsoft changed its name from Windows Azure to Microsoft Azure. Azure provided
a service platform for .NET services, SQL Services, and many Live Services. Many people were still
very skeptical about “the cloud”. As an industry, we were entering a brave new world with many
possibilities. Microsoft Azure keeps getting bigger and better, with more tools and more
functionality being added. It has had two releases so far: its famous version Microsoft Azure
v1 and, later, Microsoft Azure v2. Microsoft Azure v1 was more JSON-script driven than the new
version v2, which has an interactive UI for simplification and easy learning. Microsoft Azure v2 is still
in the preview version.
How Azure can help in business?
Azure can help in our business in the following ways-
Capitaless: We don’t have to worry about the capital as Azure cuts out the high cost of hardware.
You simply pay as you go and enjoy a subscription-based model that’s kind to your cash flow.
Setting up an Azure account is also very easy. You simply register in the Azure Portal, select your
required subscription, and get going.
Less Operational Cost: Azure has low operational cost because it runs on its own servers, whose
only job is to make the cloud functional and bug-free. It is usually a whole lot more reliable than
your own on-location server.
Cost Effective: If we set up a server on our own, we need to hire a tech support team to monitor
them and make sure things are working fine. Also, there might be a situation where the tech support
team is taking too much time to solve an issue with the server. In this regard, Azure is far more
pocket-friendly.
Easy Back Up and Recovery options: Azure keeps backups of all your valuable data. In disaster
situations, you can recover all your data in a single click without your business getting affected.
Cloud-based backup and recovery solutions save time, avoid large up-front investment and roll up
third-party expertise as part of the deal.
Easy to implement: It is very easy to implement your business models in Azure. With a couple of
one-click activities, you are good to go. There are also several tutorials to help you learn and
deploy faster.
Better Security: Azure provides more security than local servers. You can be carefree about your
critical data and business applications, as they stay safe in the Azure Cloud. Even in natural disasters,
where local resources can be harmed, Azure is a rescue: the cloud is always on.
Work from anywhere: Azure gives you the freedom to work from anywhere and everywhere. It
just requires a network connection and credentials. And with most serious Azure cloud services
offering mobile apps, you’re not restricted by which device you have to hand.
Increased collaboration: With Azure, teams can access, edit and share documents anytime, from
anywhere. They can work and achieve future goals hand in hand. Another advantage of Azure
is that it preserves records of activity and data. Timestamps are one example of Azure’s record
keeping. Timestamps improve team collaboration by establishing transparency and increasing
accountability.
Microsoft Azure Services
The following are some of the services Microsoft Azure offers:
1. Compute: Includes Virtual Machines, Virtual Machine Scale Sets, Functions for serverless
computing, Batch for containerized batch workloads, Service Fabric for microservices and
container orchestration, and Cloud Services for building cloud-based apps and APIs.
2. Networking: With Azure you can use a variety of networking tools, like Virtual Network, which
can connect to on-premises data centers; Load Balancer; Application Gateway; VPN Gateway;
Azure DNS for domain hosting; Content Delivery Network; Traffic Manager; ExpressRoute
dedicated private-network fiber connections; and Network Watcher monitoring and diagnostics.
3. Storage: Includes Blob, Queue, File and Disk Storage, as well as a Data Lake Store, Backup and
Site Recovery, among others.
4. Web + Mobile: Creating Web + Mobile applications is very easy as it includes several services
for building and deploying applications.
5. Containers: Azure has a property which includes Container Service, which supports Kubernetes,
DC/OS or Docker Swarm, and Container Registry, as well as tools for microservices.
6. Databases: Azure also includes several SQL-based databases and related tools.
7. Data + Analytics: Azure has big data tools like HDInsight for Hadoop, Spark, R Server,
HBase, and Storm clusters.
8. AI + Cognitive Services: Azure supports developing applications with artificial-intelligence
capabilities, like the Computer Vision API, Face API, Bing Web Search, Video Indexer, and the
Language Understanding Intelligent Service (LUIS).
9. Internet of Things: Includes IoT Hub and IoT Edge services that can be combined with a variety
of machine learning, analytics, and communications services.
10. Security + Identity: Includes Security Center, Azure Active Directory, Key Vault and Multi-
Factor Authentication Services.
11. Developer Tools: Includes cloud development services like Visual Studio Team Services, Azure
DevTest Labs, HockeyApp mobile app deployment and monitoring, Xamarin cross-platform
mobile development and more.
Difference between AWS (Amazon Web Services), Google Cloud and Azure
How Azure works
It is essential to understand the internal workings of Azure so that we can design our
applications on Azure effectively with high availability, data residency, resilience, etc.
Microsoft Azure is completely based on the concept of virtualization. So, similar to other
virtualized data center, it also contains racks. Each rack has a separate power unit and network
switch, and also each rack is integrated with a software called Fabric-Controller. This Fabric-
controller is a distributed application, which is responsible for managing and monitoring
servers within the rack. In case of any server failure, the Fabric-controller recognizes it and
recovers it. Each of these Fabric Controllers is, in turn, connected to a piece of software
called the Orchestrator. The Orchestrator includes web services and a REST API to create, update, and
delete resources.
When a request is made by the user, either using PowerShell or the Azure portal, it first goes to
the Orchestrator.
Combinations of racks form a cluster. We have multiple clusters within a data center, and we
can have multiple Data Centers within an Availability zone, multiple Availability zones within
a Region, and multiple Regions within a Geography.
o Availability Zones: These are physically separated locations within an Azure
region. Each one is made up of one or more data centers with independent
power, cooling, and networking.
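The rack-to-geography hierarchy just described can be pictured as nested containers; the region and data center names below are invented for illustration:

```python
geography = {
    "Europe": {                              # geography
        "West Europe": {                     # region
            "AZ-1": ["DC-A", "DC-B"],        # availability zone -> data centers
            "AZ-2": ["DC-C"],
        },
        "North Europe": {
            "AZ-1": ["DC-D"],
        },
    }
}

def datacenters(geo, geography_name):
    """Flatten every data center under one geography."""
    return [dc
            for region in geo[geography_name].values()
            for zone in region.values()
            for dc in zone]

all_dcs = datacenters(geography, "Europe")
# all_dcs == ["DC-A", "DC-B", "DC-C", "DC-D"]
```

Each data center in turn contains the clusters of Fabric-Controller-managed racks described above.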
GOOGLE COMPUTE ENGINE (GCE)
Google doesn't charge any upfront fees or require a time-period commitment for
GCE. Google's cloud services compete with Microsoft's Azure and Amazon Web
Services.
GCE provides administrators with VM, DNS server and load balancing capabilities.
VMs are available in a number of CPU and RAM configurations and Linux
distributions, including Debian and CentOS. Customers may use their own system
images for custom VMs.
With GCE, administrators can select the Google Cloud region and zone where their
data will be stored and used. GCE also offers tools for administrators to create
advanced networks on the regional level.
NOTE: Google Compute Engine is part of Google Cloud Platform, which includes many serverless services
that can be used in conjunction with GCE for computing, processing and storage tasks.
Cloud storage. Persistent disks feature high-performance block storage that lets
users take snapshots and create new persistent disks from the snapshot.
Confidential VMs. These VMs enable users to encrypt data while it's being
processed without negatively affecting performance.
Custom machine types. Users can customize VMs to suit business needs and
optimize cost effectiveness.
Live migration for VMs. VMs can migrate between host machines without
rebooting. This feature enables applications to continue running during
maintenance.
Local solid-state drives. These local SSDs are always encrypted and physically
attached to the host server. They have lower latency than persistent disks.
Operating system (OS) support. Users can run a number of different OSes,
including Debian, CentOS, Red Hat Enterprise Linux, SUSE, Ubuntu and
Windows Server. GCE also includes patch management for OSes.
Payment. GCE offers per-second billing and committed use discounts with no
upfront costs or instance lock-in.
Sole-tenant nodes. These nodes are GCE servers dedicated to one tenant. They
make it easier to deploy bring-your-own-license (BYOL) applications and allow
the same machine types and VM configurations as standard compute instances.
Spot VMs. These are affordable instance options used for fault-
tolerant workloads and batch jobs. They help users cut costs, but they can be
prone to service interruptions. Spot VMs come with the same capabilities and
machine types as standard VMs.
Virtual machine manager. GCE comes with the VM manager, which helps
users manage OSes for large collections of VMs. GCE also provides right-sizing
recommendations to help customers use resources efficiently.
Why do businesses use Google Compute Engine?
There are many reasons organizations use Google Compute Engine, including these:
High-performing and scalable. Both of these features make GCE good for
businesses that need to process large quantities of data quickly.
BYOL. BYOL in Google Compute Engine lets customers run their Windows-
based applications in sole tenant nodes or with a license-included image.
Google Compute Engine is an IaaS tool, providing VMs that help organizations
build and manage servers, OSes and network devices. Customers can manage
infrastructure that Google hosts remotely.
Google App Engine is a platform as a service. PaaS tools provide developers with
a hosted environment to build applications. They help automate application design,
development, testing and deployment.
With App Engine, developers can deploy their code and the platform will
automatically adjust to handle the traffic volume. Compute Engine users must
manually adjust the infrastructure elements that host the application. This manual
control means they get more flexibility and, in some cases, reduced costs.