
Institute of Technology

School of Computing
Department of Software Engineering
Fundamentals of Cloud Computing
"Chapter 3: Parallel and Distributed Programming Paradigms"

By Abebaw S.

Outline
Parallel Computing vs Distributed Computing
MapReduce, Twister and Iterative MapReduce
Hadoop Library from Apache
Mapping Applications
Programming Support
Google App Engine
Amazon AWS
Cloud Software Environments
Eucalyptus
OpenNebula
OpenStack
Parallel vs Distributed Computing
What is Parallel Computing?
Parallel computing is a type of computing in which multiple
processors work together in one system to solve parts of a
problem simultaneously.
Often uses shared memory.
What is Distributed Computing?
A system in which multiple independent computers are
connected via a network to work together.
Each system has its own memory.
Parallel vs Distributed: A Comparison

Parallel: Uses multiple processors or cores in a single machine.
Distributed: Uses multiple independent computers connected over a network.

Parallel: Processors access common (shared) memory.
Distributed: Each computer has its own private memory.

Parallel: Communication happens via shared memory or IPC.
Distributed: Communication happens through message passing over a network.

Parallel: Fault tolerance is limited; a failure affects the entire system.
Distributed: Fault tolerance is more complex but can handle individual machine failures.

Parallel: Scales by adding more processors to a single machine.
Distributed: Scales by adding more computers (nodes) to the system.

Parallel examples: multi-core processors, GPU computing.
Distributed examples: cloud computing, Hadoop, distributed databases.
MapReduce
MapReduce is a programming model for processing large
datasets in a parallel and distributed manner.
It achieves this by abstracting the complexities of data
distribution, task scheduling, and fault tolerance from the
developer.
The model consists of two main functions:
I. Map: Processes input data and produces intermediate
key-value pairs.
II. Reduce: Groups the intermediate key-value pairs by key
and applies a function to produce the final output.
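To make the two roles concrete, below is a minimal Python sketch of the
user-defined functions for word count; the names map_fn and reduce_fn are
illustrative, not any particular framework's API.

    def map_fn(line):
        """Map: emit an intermediate (word, 1) pair for every word in a line."""
        for word in line.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        """Reduce: combine all counts observed for one word into a total."""
        return (word, sum(counts))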

MapReduce Data Flow
The data flow in a MapReduce job typically involves the
following stages:
i. Input and Splitting: input data is divided into smaller
chunks called "splits."
ii. Mapping: The Map function is applied to each split in
parallel by multiple map tasks.
iii. Shuffling and Sorting: The intermediate key-value pairs
generated by the map tasks are grouped by key and
sorted.
iv. Reducing: The Reduce function is applied to each
grouped set of values for a given key by multiple reduce
tasks.
v. Output: The output from the reduce tasks is written to the
output destination.
Cont’d …
Word Count Example
i. Input: A large text file containing words.
ii. Map: Each map task counts the occurrences of each word in
its assigned data split, emitting pairs like (word, count). For
example, if a split contains "apple apple banana," the map
output would be: (apple, 2), (banana, 1).
iii. Shuffle: The MapReduce framework groups intermediate
pairs with the same key (word) together.
iv. Reduce: Reduce tasks receive groups like (apple, [2, 1, 1...])
and sum the counts for each word, producing the final result.
v. Output: (apple, 4).

Why MapReduce?
Scalability:
MapReduce is designed to handle datasets that are too large
to be processed on a single machine.
By dividing the data and computation across multiple
machines, it achieves scalability and can process petabytes
of data using thousands of nodes.
Fault Tolerance:
In distributed systems like those found in cloud
environments, hardware failures are common.
MapReduce is naturally fault-tolerant, as it has the ability
to re-run failed tasks on other available nodes.

Cont’d…
Simplified Programming Model:
MapReduce provides a high-level abstraction that hides the
complexities of distributed computing from developers
Programmers focus only on defining the map and reduce
functions, without having to handle the complexities of data
distribution, task scheduling, or fault tolerance.
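As an illustration of how little code the programmer writes, here is a
sketch in the style of Hadoop Streaming, where the mapper and reducer are
plain scripts reading standard input. It assumes the usual streaming
convention of tab-separated key/value lines, with the reducer receiving
its input sorted by key.

    #!/usr/bin/env python3
    # mapper.py -- emit "word<TAB>1" for every word on standard input.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- input arrives sorted by key, so all counts for one
    # word are contiguous and can be summed in a single pass.
    import sys

    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{total}")
            current_word, total = word, 0
        total += int(count)
    if current_word is not None:
        print(f"{current_word}\t{total}")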

How MapReduce Works, with Examples

With examples . . .
1. Input and Splitting:
Imagine you have a large text file containing a collection of
documents, and your goal is to count the occurrences of
each word
The MapReduce framework first splits this input file into
smaller chunks called input splits
For example, if your file is 1GB and you're using a cluster
with 10 nodes, you might split it into 100MB chunks.
Each input split is then assigned to a map task for
processing.

With examples . . .
Splitting: Word Count
Input File:
"This is a sample document. It contains some words."
"We will count the occurrences of each word."
Input Splits:
Split 1: "This is a sample document. It contains some words."
Split 2: "We will count the occurrences of each word."

With examples . . .
2. Mapping:
Each Map Function receives an input split and applies the user-
defined map function to each element (typically a line) in the
split.
The map function generates key-value pairs, with the key
typically representing the item of interest and the value
containing some related information.
3. Intermediate keys:
These are the (key, value) pairs generated by the map phase.
They are not the final result but are used in the next phase
for sorting, grouping, and aggregation.

With examples . . .
Mapping: Word Count Example
Split 1: "This is a sample document. It contains some words."
Map Function Output:
("This", 1)
("is", 1)
("a", 1)
("sample", 1)
("document", 1)
("It", 1)
("contains", 1)
("some", 1)
("words", 1)

With examples . . .
Mapping: Word Count Example

Split 2: "We will count the occurrences of each word."


Map Function Output:
("We", 1)
("will", 1)
("count", 1)
("the", 1)
("occurrences", 1)
("of", 1)
("each", 1)
("word", 1)

With examples . . .
4. Shuffling and Sorting:
The MapReduce framework collects all the key-value pairs
produced by the map function and rearranges them, grouping
pairs with the same key together.
The pairs are then sorted by key. This ensures that all
values associated with a particular key are sent to the same
"reduce" task.
Shuffling and Sorting: Word Count Example
You can see the Shuffled and Sorted Output in the next
slide

With examples . . .

("This", [1])
("of", [1])
("a", [1])
("sample", [1])
("contains", [1])
("some", [1])
("count", [1])
("the", [1])
("document", [1])
("We", [1])
("each", [1])
("will", [1])
("is", [1])
("word", [2]) // "word" has
("It", [1])
two occurrences
("occurrences", [1])
("words", [1])

With examples . . .
5. Reducing:
Each reduce task receives a group of key-value pairs with
the same key.
It applies the user-defined "reduce" function to this group,
combining the values to produce a final output.
Reducing: Word Count Example

Input to Reducer: ("word", [1, 1])


Reduce Function Output: ("word", 2)

With examples . . .
6. Output:
The final outputs from all the reduce tasks are collected and
written to the output destination, which could be a file or a
database.
Output: Word Count Example
Output File:
("This", 1)
("a", 1)
("contains", 1)
("count", 1)
// ... other words and counts ...
("word", 2)
("words", 1)

Iterative MapReduce
Iterative MapReduce is a computational technique that
involves repeatedly applying the MapReduce programming
model to solve problems that require multiple iterations.
Many algorithms in fields like data clustering, machine
learning, and computer vision require iterative computations.
Iterative MapReduce extends the basic MapReduce model to
handle these iterative algorithms efficiently.

Motivation for Iterative MapReduce
Traditional MapReduce implementations like Apache Hadoop,
while successful for single-step computations, encounter
several inefficiencies.
a) High I/O Costs:
Relies heavily on disk I/O for storing intermediate data
after each iteration.
This frequent disk access significantly impacts
performance.
b) Task Initialization Overhead:
Frameworks like Hadoop recreate map and reduce tasks for
every iteration.
This leads to considerable overhead during task initialization
and data loading.
Cont’d…
c) Limited Communication Mechanisms:
Traditional MapReduce often relies on file-based
communication between map and reduce stages.
This approach is less efficient than in-memory data transfer
methods, as it involves the additional overhead of
serialization, deserialization, and disk access.
d) Lack of Distinction between Static and Variable Data:
Static data remains constant across iterations, while
variable data is updated in each iteration.
Traditional MapReduce does not distinguish between the two,
so static data is repeatedly loaded and processed even
though it remains unchanged.

Cont’d…
e) Inefficient for Algorithms with Complex Communication
Patterns:
Traditional MapReduce is highly effective in applications
with simple data flow patterns.
However, algorithms that need more complex
communication structures, like those involving all-to-all
communication, can be difficult.
Iterative MapReduce tackles these limitations by extending the
MapReduce model with features designed for iterative
processing.

Steps in Iterative MapReduce
1. Initialization:
The input data is loaded
Initial values are set for any variables that will be updated
iteratively.
2. Iteration:
A series of MapReduce jobs are executed in a loop.
Each iteration takes the output of the previous iteration as
input and produces a new set of output.

Steps in Iterative MapReduce
3. Convergence Check:
After each iteration, a convergence check is performed to
determine whether the algorithm has reached a solution.
This check could involve comparing the current output to
the previous output or evaluating some other convergence
criterion.
4. Termination:
The iterations continue until the convergence criterion is
met or a maximum number of iterations is reached.
The final output is then produced.
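The overall control flow can be sketched as a simple driver loop; the
run_mapreduce_job and converged parameters below are hypothetical
stand-ins for one full map-and-reduce pass and for the algorithm-specific
convergence check.

    def iterative_mapreduce(run_mapreduce_job, converged, initial_state,
                            max_iterations=100):
        """Driver: re-run a MapReduce job until its output stabilizes."""
        state = initial_state
        for _ in range(max_iterations):
            new_state = run_mapreduce_job(state)  # one full map + reduce pass
            if converged(state, new_state):       # e.g. outputs barely changed
                return new_state
            state = new_state
        return state                              # hit the iteration cap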

Iterative MapReduce: K-Means Clustering
K-Means aims to partition a set of data points into k clusters.
Here's how it can be implemented using Iterative MapReduce:
1. Initialization
Randomly select k data points as initial cluster centers.
Load the data points into the distributed file system.
2. Iteration
Map:
For each data point, calculate the distance to each cluster center.
Assign the data point to the cluster with the nearest
center.
Generate key-value pairs where the key represents the
cluster ID and the value corresponds to the data point.
Cont’d …
Reduce:
For each cluster ID, calculate the mean of all data points
assigned to that cluster.
Update the cluster center to be the new mean.
3. Convergence Check:
Calculate the distance between the new cluster centers and
the previous cluster centers.
If the distance is below a threshold, the algorithm has
converged.
4. Termination:
Repeat steps 2 and 3 until convergence or a maximum
number of iterations is reached.
The final cluster centers define the k clusters.
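A minimal Python sketch of the K-Means map and reduce functions under this
scheme; points and centers are tuples of coordinates, and the function
names are illustrative.

    import math

    def kmeans_map(point, centers):
        """Map: emit (nearest_cluster_id, point) for one data point."""
        distances = [math.dist(point, c) for c in centers]
        return (distances.index(min(distances)), point)

    def kmeans_reduce(cluster_id, points):
        """Reduce: the new center is the coordinate-wise mean of the points."""
        dims = len(points[0])
        center = tuple(sum(p[d] for p in points) / len(points)
                       for d in range(dims))
        return (cluster_id, center)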
Iterative MapReduce: Twister
Twister is an enhanced MapReduce runtime designed
specifically for iterative computations.
It introduces several features that improve the performance of
iterative algorithms.
A. Distinguishes between static and variable data:
Static data remains unchanged throughout iterations;
variable data is updated in each iteration.
This allows for efficient data handling.
B. Long-Running Map/Reduce Tasks:
Tasks persist across iterations, eliminating the overhead of
repeatedly re-creating tasks and reloading static data.

Iterative MapReduce: Twister
C. Combine Operation:
Includes a combine phase that aggregates the outputs of the
reduce tasks and is used to decide whether to proceed with
another iteration.
Benefits of Iterative MapReduce:
Scalability: Can handle massive datasets by distributing
computations across a cluster of machines.
Fault Tolerance: Can recover from failures of individual
machines without losing progress.
Applicability: Suitable for a wide range of iterative
algorithms in various domains.

Hadoop Library
Hadoop is an open-source software framework written in Java.
It implements the MapReduce programming model for processing
vast amounts of data in a distributed computing environment.
It consists of two fundamental layers:
1. MapReduce Engine
Built on top of HDFS
It manages the data flow and control flow of MapReduce
jobs over distributed computing systems.
Like HDFS, it has a master/slave architecture consisting of a
single JobTracker as the master and a number of
TaskTrackers as the slaves (workers).
JobTracker manages the MapReduce job over a cluster and
is responsible for monitoring jobs and assigning tasks to
TaskTrackers.
Hadoop Library
The TaskTracker manages the execution of the map and/or
reduce tasks on a single computation node in the cluster.
Each TaskTracker node has a number of simultaneous
execution slots, each executing either a map or a reduce
task.
Slots are defined as the number of simultaneous threads
supported by CPUs of the TaskTracker node.
For example, a TaskTracker node with N CPUs, each
supporting M threads, has M * N simultaneous execution
slots.

Hadoop Library
Each data block is processed by one map task running on a
single slot.
Therefore, there is a one-to-one correspondence
between map tasks in a TaskTracker and data blocks in the
respective DataNode.
2. HDFS:
It is a distributed file system inspired by the Google File
System (GFS).
It organizes files and stores their data on a distributed
computing system.
It divides files into fixed-size blocks, typically 64 MB, and
replicates them on multiple DataNodes for fault tolerance.

Mapping Applications
Mapping applications involves strategically distributing the
workload and data across multiple processing units to improve
performance and efficiency.
This process considers the application's characteristics,
resource availability, and suitable programming models.
A. Image Processing:
Pleasingly parallel: Different portions of an image can be
processed independently.
Partitioning: Divide the image into smaller blocks.
Mapping: Assign each block to a separate processing unit.
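A sketch of the partitioning step in Python; the tile size and the tuple
layout are illustrative choices, not a fixed API.

    def partition_image(width, height, tile_size=256):
        """Split an image into independent tiles (x, y, w, h); each tile
        can then be mapped to a separate processing unit."""
        tiles = []
        for y in range(0, height, tile_size):
            for x in range(0, width, tile_size):
                tiles.append((x, y,
                              min(tile_size, width - x),
                              min(tile_size, height - y)))
        return tiles

    # A 1920x1080 image with 256-pixel tiles yields 8 x 5 = 40 work units.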

Mapping Applications
B. Particle Interactions:
Loosely synchronous: Requires periodic communication
between units to exchange data on particle positions and
interactions.
Partitioning: Divide the simulation space into subdomains.
Mapping: Assign each subdomain to a processing unit,
ensuring neighboring units can exchange boundary data.
C. Large-Scale Data Analysis:
Data-intensive: processing massive datasets, often using the
MapReduce paradigm.
Partitioning: Split the dataset into smaller chunks.
Mapping: In MapReduce, the framework handles the
mapping of map and reduce tasks to processing units based
on data locality.
Cont’d…
Data Locality:
Place computations close to their data to minimize data
transfer time.
Load Balancing:
Distribute tasks evenly to avoid bottlenecks.
Communication Optimization:
Reduce communication overhead by aligning mapping with
communication patterns.

Programming Support of Google App Engine
Google App Engine (GAE) is a platform for building and
deploying web applications.
GAE provides a complete development environment with
SDKs, a runtime environment, and an administration console.
Developers can use the Python, Java, or Go programming languages.
GAE offers a data storage service (Datastore) based on
Google's BigTable technology.
It can store structured data and perform queries.
Memcache provides in-memory caching for improved
performance.

Cont’d…
Some features offered by GAE are:
User authentication: Authenticate users via Google Accounts.
Image Manipulation: Handles image manipulation tasks.
Scheduled Tasks: Developers can schedule background tasks
and periodic jobs using cron jobs
URL fetch: Allows applications to retrieve data from external
websites.
Mail: Applications can use GAE's mail service to send email
messages

Programming Support of Amazon AWS
AWS offers robust programming support to help developers
build and deploy applications.
AWS provides a comprehensive set of tools and resources,
empowering developers to build diverse applications.
1) SDKs:
AWS provides SDKs for a variety of programming languages,
including Java, Python, .NET, and more.
These SDKs offer convenient libraries and APIs that
simplify interactions with AWS services.
The SDKs abstract the complexities of making low-level
API calls, enabling developers to focus on application
logic.
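For example, with the Python SDK (boto3), listing the account's S3 buckets
takes only a few lines; this sketch assumes credentials are already
configured (e.g. via environment variables or aws configure).

    import boto3

    # One client call; the SDK handles request signing, retries,
    # and response parsing behind the scenes.
    s3 = boto3.client("s3")

    for bucket in s3.list_buckets()["Buckets"]:
        print(bucket["Name"])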
Cont’d…
2) CLI:
The AWS CLI allows developers to manage and interact
with AWS services directly from their command line.
This tool provides a text-based interface for performing a
wide range of tasks, including launching EC2 instances,
managing S3 buckets, and configuring other AWS services.
3) AWS Management Console:
The AWS Management Console offers a web-based visual
interface for managing AWS resources and services.
This console provides a user-friendly way to monitor and
control various aspects of AWS, including EC2 instances,
S3 storage, IAM users and permissions, and much more.

Cont’d…
4) Infrastructure as Code (IaC) Tools
AWS supports Infrastructure as Code (IaC) tools such as
AWS CloudFormation and Terraform.
These tools enable developers to define and manage their
infrastructure using code, promoting automation and
reproducibility.
IaC simplifies infrastructure provisioning, configuration
management, and deployment.
5) Programming Models
A. MapReduce:
AWS supports the MapReduce programming model
through its Elastic MapReduce (EMR) service.
EMR allows users to process vast amounts of data
using a managed Hadoop cluster.
Cont’d…
Several languages and tools, including Hive, Pig, Cascading,
Java, Ruby, Perl, Python, PHP, R, and C++, are supported for
EMR programming.
6) Serverless Computing:
AWS Lambda enables serverless computing, allowing
developers to run code without provisioning or managing
servers.
Lambda functions can be triggered by various events, such
as changes in S3 buckets or API requests.
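A minimal Python Lambda handler; the handler name is the conventional
default, and the event fields shown follow the standard S3 event structure
(the exact shape depends on the triggering service).

    # AWS invokes this function once per event; no server is
    # provisioned or managed by the developer.
    def lambda_handler(event, context):
        for record in event.get("Records", []):
            # For an S3 trigger, each record names the changed object.
            print(record["s3"]["object"]["key"])
        return {"statusCode": 200}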
7) Specific Service APIs
Amazon EC2: Provides APIs for managing EC2 instances,
including launching, terminating, and monitoring instances.

Cont’d…
Amazon S3: Offers APIs for storing and retrieving objects
in S3 buckets, managing access control, and performing
other S3 operations.
Amazon SimpleDB: Supports a simple query language for
interacting with SimpleDB, enabling data storage and
retrieval.
Amazon SQS: Includes APIs for sending, receiving, and
managing messages in SQS queues.
8) Documentation and Support
AWS provides comprehensive documentation, including
tutorials, guides, and API references.
AWS also offers various support channels, including
online forums, developer communities, and paid support
options.
Cloud Software Environments
Cloud software environments are specialized software that
provide the foundation for building and managing cloud
computing infrastructures.
They allow organizations to create private, public, or hybrid
clouds, enabling them to offer various cloud services, such as
IaaS, PaaS, or SaaS.
This section explains three popular open-source cloud
software environments: Eucalyptus, OpenNebula, and
OpenStack.

Eucalyptus
Eucalyptus is designed primarily to support IaaS clouds.
It focuses on virtual networking and the management of VMs.
It was developed by Eucalyptus Systems, originating from a
research project at the University of California, Santa Barbara.
It was designed to provide services compatible with Amazon's
EC2 cloud and Simple Storage Service (S3).
It offers services like Walrus, a storage system similar to
Amazon S3.
Eucalyptus is used to build private clouds that interact with
end users through Ethernet or the Internet and can also
interact with other private or public clouds.

Cont’d…
(Figure: Eucalyptus architecture.)
Controllers: CLC - Cloud Controller, CC - Cluster Controller, NC - Node Controller.
Cont’d…
Eucalyptus's image management system allows users to:
Bundle their own root file systems.
Upload and register images, linking them with specific
kernel and RAM disk images.
Store images in Walrus, retrieving them from any
availability zone.
This allows users to create and deploy specialized virtual
appliances.
Eucalyptus is available in both commercial and open-
source versions.

OpenNebula
It is a versatile open-source toolkit that transforms existing
infrastructure into an IaaS cloud with cloud-like interfaces.
(Figure: OpenNebula architecture and its main components.)
Cont’d…
The figure on the previous slide shows the OpenNebula
architecture and its main components.
It provides a flexible and modular architecture for
integrating diverse storage, network infrastructure, and
hypervisor technologies.
It manages the entire VM lifecycle, including dynamic
network setup for groups of VMs and storage management.
A key strength is its ability to support hybrid cloud models.
OpenNebula can interface with external clouds (like
Amazon EC2 and Eucalyptus) via cloud drivers.

OpenStack
OpenStack is a collaborative project introduced by Rackspace
and NASA in July 2010.
It is used for building massively scalable and secure
open-source cloud infrastructure.
It focuses on compute and storage through the OpenStack
Compute and OpenStack Storage solutions.
OpenStack Compute: manages large groups of virtual
private servers.
OpenStack Storage: provides redundant and scalable
object storage using clusters of commodity servers.

Thank you!
