
DSECL ZG 522: Big Data Systems

Session 2: Parallel and Distributed Systems

Janardhanan PS
[email protected]
Context

Big Data Systems use basic principles of


Parallel and
Distributed Systems

2
Topics for today

• Road-blocks of processor scaling


• What are parallel / distributed systems
• Motivation for parallel / distributed systems
• Limits of parallelism
• Data access strategies - Replication, Partitioning, Messaging
• Cluster computing

3
Roadblocks of Processor / Vertical Scaling

• The Frequency Wall


✓Not much headroom

• The Power Wall


✓Dynamic and static power dissipation

• The Memory Wall


✓Gap between compute bandwidth and memory bandwidth

5
Frequency wall

• More CPU power -> more CPU-hungry applications
• More disk space -> new requirements to fill it up
• Applications enjoyed free performance benefits from each new generation of processors
• 500 MHz -> 1 GHz -> 2 GHz -> 3.4 GHz -> 4 GHz ?
(No need to rewrite the S/W, or even make a new release; sometimes rebuilding is required)

Why do we not have a 10 GHz processor now?

Chip designers are under pressure to deliver faster CPUs, but further gains now risk changing the structure of your program, and possibly breaking it, in order to make it run faster.

Revolutions in computing:
• 1990 - First revolution in SW development – Object Oriented Programming
• Applications will increasingly need to be concurrent if they want to fully exploit continuing exponential multi-core CPU throughput gains
• Next revolution in SW development – Parallel (Concurrent) Programming

6
Thermal wall (CPU Power consumption)

[Figure: Power density (W/cm²) of Intel processors (4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, P6) from 1970 to 2010, rising from about 1 W/cm² towards levels comparable to a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: Intel Corp.]

7
Memory Wall

[Figure: Processor ("µProc") performance improves ~60% per year (Moore's Law) while DRAM performance improves ~7% per year, so the processor-memory performance gap grows ~50% per year.]

8
Serial Computing
• Software written for serial computation:
✓ A problem is broken into a discrete series of instructions
✓ Instructions are executed sequentially one after another
✓ Executed on a single processor
✓ Only one instruction may execute at any moment in time
✓ Single data stores - memory and disk

Extra info:

• Von Neumann architecture: common memory store and pathways between instructions and data - causes the Von Neumann bottleneck
• Harvard architecture separates them to reduce the bottleneck
• Modern architectures use separate caches for instructions and data

9
Parallel Computing

• Simultaneous use of multiple compute resources to solve a computational problem


✓ A problem is broken into discrete parts that can be solved concurrently
✓ Each part is further broken down to a series of instructions
✓ Instructions from each part execute simultaneously on different processors
✓ Different processors can work with independent memory and storage
✓ An overall control/coordination mechanism is employed

10
Spectrum of Parallelism

From left to right: coupling decreases, granularity of parallelism increases.

• Pseudo-parallel (intra-processor): super-scalar, pipelining (think of a factory assembly line) - instruction-level parallelism
• Parallel: multi-core, multi-threaded, shared memory over interconnect - task-level and data-level parallelism
• Distributed: message passing over interconnect; clusters, grids, clouds - task-level, data-level, service/request-level, and application-level parallelism

More in session on programming
11
Distributed Computing
• In distributed computing,
✓ Multiple computing resources are connected in a network and computing tasks are distributed across these resources
✓ Results in an increase in the speed and efficiency of the system
✓ Faster and more efficient than traditional methods of computing
✓ More suitable for processing huge amounts of data in a limited time

12
Multi-processor Vs Multi-computer systems

Multiprocessor
» Shared memory address space
» No common clock
» Fast interconnect
» UMA - Uniform Memory Access
» NUMA - Non-Uniform Memory Access

Multicomputer
» May have shared address spaces
» Typically message passing
» No common clock

13
Interconnection Networks

a) A crossbar switch - faster


b) An omega switching network - cheaper

14
Classification based on Instruction and Data parallelism
Instruction Stream and Data Stream

• The term ‘stream’ refers to a sequence or flow of either instructions or data operated on by the CPU.
• In the complete cycle of instruction execution, a flow of instructions from main memory to the CPU is
established. This flow of instructions is called instruction stream.
• Similarly, there is a flow of operands between processor and memory bi-directionally. This flow of
operands is called data stream.

Reduce Von Neumann bottleneck with separate caches

15
Flynn’s Taxonomy

Classification by instruction streams (single / multiple) and data streams (single / multiple):

• SISD (Single Instruction, Single Data): uniprocessors; pipelining
• SIMD (Single Instruction, Multiple Data): scientific computing, matrix manipulations
• MISD (Multiple Instruction, Single Data): uncommon; used for fault tolerance
• MIMD (Multiple Instruction, Multiple Data): multi-computers, distributed systems

(Original figure from sciencedirect.com)

16
Some basic concepts ( esp. for programming in Big Data Systems )

» Coupling
» Tight - SIMD, MISD shared memory systems
» Loose - NOW, distributed systems, no shared memory
» Speedup
» how much faster can a program run when given N processors as opposed to 1 processor — T(1) / T(N)
» We will study Amdahl’s Law, Gustafson’s Law
» Parallelism of a program
» Compare time spent in computations to time spent for communication via shared memory or message passing
» Granularity
» Average number of compute instructions before communication is needed across processors
» Note:
» If granularity is coarse, use distributed systems; otherwise use tightly coupled multi-processors / multi-computers
» Potentially high parallelism doesn't lead to high speedup if granularity is too small, since communication overheads dominate

17
Comparing Parallel and Distributed Systems

Parallel System
• A computer system with several processing units attached to it
• A common shared memory can be directly accessed by every processing unit
• Tight coupling of processing resources that are used for solving a single, complex problem
• Programs may demand fine-grain parallelism

Distributed System
• Independent, autonomous systems connected in a network, accomplishing specific tasks
• Coordination is possible between computers connected in a network, each with its own memory and CPU
• Loose coupling of computers connected in a network, providing access to data and remotely located resources
• Programs have coarse-grain parallelism

18
Topics for today

• What are parallel / distributed systems


• Motivation for parallel / distributed systems
• Limits of parallelism
• Data access strategies - Replication, Partitioning, Messaging
• Cluster computing

19
Motivation for parallel / distributed systems (1)
• Inherently distributed applications
• e.g. a financial transaction involving 2 or more parties
• Better scale by creating multiple smaller parallel tasks instead of one complex task
• e.g. evaluate an aggregate over 6 months of data
• Processors getting cheaper and networks faster
• e.g. processor speed 2x every 1.5 years, network traffic 2x every year; processors limited by energy consumption
• Better scale using replication or partitioning of storage
• e.g. replicated media servers for faster access, or shards in search engines
• Access to shared remote resources
• e.g. a remote central DB
• Increased performance/cost ratio compared to specialized parallel systems
• e.g. a search engine running on a Network-of-Workstations

20
Motivation for parallel / distributed systems (2)

• Better reliability because there is less chance of multiple simultaneous failures
• Be careful about integrity: consistent state of a resource across concurrent access
• Incremental scalability
• Add more nodes to a cluster to scale up
• e.g. clusters in Cloud services, autoscaling in AWS
• Offload computing closer to the user for scalability and better resource usage
• Edge computing, e.g. offloading model inference to the edge

Machine Learning at the edge:
https://www.datasciencecentral.com/machine-learning-at-the-edge/

21
Distributed network of content caching servers

This would be a P2P network if you were using BitTorrent to get the content for free

22
Techniques for High Volume Data Processing

Cluster computing
• Description: A collection of computers, homogeneous or heterogeneous, using commodity components, running open source or proprietary software, communicating via message passing
• Usage: Commonly used in Big Data Systems, such as Hadoop

Massively Parallel Processing (MPP)
• Description: Typically proprietary Distributed Shared Memory machines with integrated storage (e.g. Teradata)
• Usage: May be used in traditional Data Warehouses and data processing appliances, e.g. EMC Greenplum (PostgreSQL on an MPP)

High-Performance Computing (HPC)
• Description: Known to offer high performance and scalability by using in-memory computing
• Usage: Used to develop specialty and custom scientific applications for research, where results are more valuable than cost

23
Topics for today

• What are parallel / distributed systems


• Motivation for parallel / distributed systems
• Limits of parallelism
• Data access strategies - Replication, Partitioning, Messaging
• Cluster computing

24
Limits of Parallelism

• A parallel program has some sequential / serial code along with a significant amount of parallelized code

25
Amdahl’s Law – (1) - 1967

Speedup = 1 / ( (1-P) + P/N )

P = Parallel fraction of the program
N = No. of Processors (Workers)

P = 0: Completely sequential – Speedup = 1
P = 1: Completely parallel – Speedup = N
P = 0.5, N = 2: Partially parallel – Speedup = 1.333

26
Amdahl’s Law – (2)

• T(1) : Time taken for a job with 1 processor


• T(N) : Time taken for same job with N processors
• Speedup S(N) = T(1) / T(N)
• S(N) is ideally N when it is a perfectly parallelizable program, i.e. data parallel with no sequential component
• Assume fraction of a program that cannot be parallelised (serial) is f and 1-f is parallel
✓ T(N) >= f * T(1) + (1-f) * T(1) / N

Only the parallel portion is sped up by a factor of N

• S(N) = T(1) / ( f * T(1) + (1-f) * T(1) / N )
• S(N) = 1 / ( f + (1-f) / N )
• Implication :
✓ If N=>inf, S(N) => 1/f
✓ The effective speedup is limited by the sequential fraction of the code

27
Amdahl’s Law – Example calculation

10% of a program is sequential (f) and there are 100 processors.


What is the effective speedup ?
S(N) = 1 / ( f + (1-f) / N )
S(100) = 1 / ( 0.1 + (1-0.1) / 100 )
= 1 / 0.109
= 9.17 (approx)
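
The figures above can be checked with a short Python sketch (illustrative only, not part of the original slides):

```python
def amdahl_speedup(f, n):
    """Amdahl's law: S(N) = 1 / (f + (1 - f) / N), with serial fraction f and N processors."""
    return 1.0 / (f + (1.0 - f) / n)

# Example from the slide: 10% sequential code on 100 processors
print(round(amdahl_speedup(0.10, 100), 2))    # 9.17
# As N grows, the speedup approaches 1/f = 10
print(round(amdahl_speedup(0.10, 10**9), 2))  # 10.0
```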

28
Actual Limitations in speedup

Besides the sequential component of the program,


communication delays also result in reduction of speedup

A and B exchange messages in blocking mode


Say processor speed is 0.5ns / instruction
Say network delay one way is 10 us
In the time of one message delay, A and B could each have executed
10 µs / 0.5 ns = 20,000 instructions

+ context switching, scheduling, load balancing, I/O …
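
The back-of-the-envelope figure above is a simple ratio; a tiny sketch (illustrative, not from the slides) makes the arithmetic explicit:

```python
instruction_time_s = 0.5e-9   # 0.5 ns per instruction (assumed processor speed)
one_way_delay_s = 10e-6       # 10 us one-way network delay (assumed)

# Instructions each side could have executed while blocked on one message
wasted_instructions = one_way_delay_s / instruction_time_s
print(int(wasted_instructions))  # 20000
```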

29
Why Amdahl’s Law is such bad news

S(N) ~ 1/ f , for large N

Suppose 33% of a program is sequential


• Then even a billion processors won’t give a speedup over 3

• For the 256 cores to gain ≥ 100x speedup, we need

100 ≤ 1 / ( f + (1-f)/256 )
which means f ≤ 0.0061, i.e. 99.4% of the algorithm must be perfectly parallelizable !!
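
The bound on f comes from inverting Amdahl's law; a small sketch (an illustrative assumption, not from the deck) solves it for any target speedup and core count:

```python
def max_serial_fraction(target_speedup, n):
    """Largest serial fraction f for which 1 / (f + (1 - f) / n) still reaches the target.
    Obtained by solving target <= 1 / (f + (1 - f) / n) for f."""
    return (1.0 / target_speedup - 1.0 / n) / (1.0 - 1.0 / n)

print(round(max_serial_fraction(100, 256), 4))  # 0.0061 -> ~99.4% must be parallelizable
```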

31
Speedup plot
[Figure: Speedup for 1, 4, 16, 64, and 256 processors vs. percentage of code that is sequential (0% to 26%), using T1 / TN = 1 / (f + (1-f)/N).]
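
A rough matplotlib sketch of how such a plot could be regenerated (an illustrative assumption, not the original plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt

f_seq = np.linspace(0.0, 0.26, 200)              # sequential fraction: 0% .. 26%
for n in (1, 4, 16, 64, 256):
    speedup = 1.0 / (f_seq + (1.0 - f_seq) / n)  # Amdahl's law: T1 / TN
    plt.plot(f_seq * 100, speedup, label=f"{n} processors")

plt.xlabel("Percentage of code that is sequential")
plt.ylabel("Speedup")
plt.legend()
plt.show()
```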

32
But wait - may be we are missing something

» The key assumption in Amdahl's Law is that the total workload is fixed as the number of processors is increased
» This doesn't happen in practice — the workload grows, while the sequential part doesn't increase with resources
» Additional processors can be used for more complex workloads and new, larger parallel problems
» So Amdahl's law under-estimates the achievable speedup
» What if we assume a fixed workload per processor?

33
Gustafson-Barsis Law - (1988)
Let W be the execution workload of the program before adding resources
f is the sequential part of the workload
So W = f * W + (1-f) * W
Let W(N) be larger execution workload after adding N processors
So W(N) = f * W + N * (1-f) * W
Parallelizable work can increase N times
The theoretical speedup in latency of the whole task, executed in the same fixed time T:
S(N) = ( T * W(N) ) / ( T * W )
     = W(N) / W = ( f * W + N * (1-f) * W ) / W
S(N) = f + (1-f) * N
S(N) is not limited by f as N scales

So solve larger problems when you have more processors
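
A short illustrative comparison of the two laws (same serial fraction, fixed vs. scaled workload; not part of the original slides):

```python
def amdahl_speedup(f, n):
    """Fixed workload: S(N) = 1 / (f + (1 - f) / N)."""
    return 1.0 / (f + (1.0 - f) / n)

def gustafson_speedup(f, n):
    """Workload scaled with N: S(N) = f + (1 - f) * N."""
    return f + (1.0 - f) * n

# 10% serial fraction, 100 processors
print(round(amdahl_speedup(0.10, 100), 2))     # 9.17  (speedup capped near 1/f)
print(round(gustafson_speedup(0.10, 100), 1))  # 90.1  (scaled speedup keeps growing)
```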

34
Usage Scenarios

• Amdahl’s law can be used when the workload is fixed, as it calculates the potential
speedup with the assumption of a fixed workload. Moreover, it can be utilized when
the non-parallelizable portion of the task is relatively large, highlighting
the diminishing returns of parallelization.

• Gustafson’s law is applicable when the workload or problem size can be scaled
proportionally with the available resources. It also addresses problems requiring
larger problem sizes or workloads, promoting the development of systems capable of
handling such realistic computations.

35
Topics for today

• What are parallel / distributed systems


• Motivation for parallel / distributed systems
• Limits of parallelism
• Data access strategies - Replication, Partitioning, Messaging
• Cluster computing

36
Data Access Strategies: Partition
• Strategy:
✓Partition data – typically, equally – to the nodes of the (distributed) system
• Cost:
✓ Network access and merge cost when query needs to go across partitions
• Advantage(s):
✓ Works well if task/algorithm is (mostly) data parallel
✓ Works well when there is Locality of Reference within a partition
• Concerns
✓ Merge across data fetched from multiple partitions
✓ Partition balancing
✓ Row vs Columnar layouts - what improves locality of reference ?
✓ Will study shards and partition in Hadoop, MongoDB, and Cassandra
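
As a rough illustration of the idea (hypothetical helper names, not the actual Hadoop / MongoDB / Cassandra APIs), a hash-partitioned key-value sketch in Python:

```python
NUM_PARTITIONS = 4
partitions = [dict() for _ in range(NUM_PARTITIONS)]  # one dict per (imagined) node

def partition_for(key):
    # Hash partitioning: each key is owned by exactly one partition
    return hash(key) % NUM_PARTITIONS

def put(key, value):
    partitions[partition_for(key)][key] = value

def get(key):
    # A point lookup touches a single partition (locality of reference)
    return partitions[partition_for(key)].get(key)

def count_all():
    # A cross-partition query must scatter to all partitions and merge the results
    return sum(len(p) for p in partitions)

put("user:1", {"name": "Asha"})
put("user:2", {"name": "Ravi"})
print(get("user:1"), count_all())
```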

37
Data Access Strategies: Replication

• Strategy:
✓ Replicate all data across nodes of the (distributed) system
• Cost:
✓ Higher storage cost
• Advantage(s):
✓ All data accessed from local disk: no (runtime) communication on the network
✓ High performance with parallel access
✓ Fail over across replicas
• Concerns
✓ Keep replicas in sync — various consistency models between readers and writers
✓ Will study in depth for MongoDB, Cassandra
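
A toy Python sketch of the trade-off (illustrative only; real systems such as MongoDB and Cassandra use configurable consistency levels rather than this simple write-to-all / read-from-one scheme):

```python
class ReplicatedStore:
    """Every write goes to all replicas; any single replica can serve a read."""

    def __init__(self, num_replicas=3):
        self.replicas = [dict() for _ in range(num_replicas)]  # higher storage cost

    def write(self, key, value):
        # Synchronous write to every replica keeps the copies in sync
        for replica in self.replicas:
            replica[key] = value

    def read(self, key, replica_id=0):
        # Reads are served locally; a failed replica can simply be skipped (failover)
        return self.replicas[replica_id].get(key)

store = ReplicatedStore()
store.write("config", {"mode": "active"})
print(store.read("config", replica_id=2))  # {'mode': 'active'}
```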

38
Data Access Strategies: (Dynamic) Communication

• Strategy:
✓ Communicate (at runtime) only the data that is required
• Cost:
✓ High network cost for loosely coupled systems when the data set to be exchanged is large
• Advantage(s):
✓ Minimal communication cost when only a small portion of the data is actually required
by each node
• Concerns
✓ Highly available and performant network
✓ Fairly independent parallel data processing

39
Data Access Strategies – Networked Storage

• Common Storage on the Network:


✓ Storage Area Network (for raw access – i.e. disk block access)
✓ Network Attached Storage (for file access)

• Common Storage on the Cloud:


✓ Use Storage as a Service
✓ e.g. Amazon S3

More in-depth coverage when studying Amazon storage case study


Webinar on Bigdata storage on cloud

40
Topics for today

• What are parallel / distributed systems


• Motivation for parallel / distributed systems
• Limits of parallelism
• Data access strategies - Replication, Partitioning, Messaging
• Cluster computing

41
Cluster - Goals

• Continuous availability
• Data integrity
• Linear scalability
• Open access
• Parallelism in processing
• Distributed systems management

42
Computer Cluster - Definition
• A cluster is a type of distributed processing system
✓ consisting of a collection of inter-connected stand-
alone computers
✓ working together as a single, integrated computing
resource
✓Examples:
• High Availability Clusters
ServiceGuard, Lifekeeper, Failsafe, heartbeat, HACMP, failover clusters
• High Performance Clusters
Beowulf; 1000 nodes; parallel programs; MPI
• Database Clusters
Oracle Parallel Server (OPS) - RAC
• Storage Clusters
Cluster filesystems; same view of data from each node.
HDFS
43
Cluster - Objectives
• A computer cluster is typically built for one of the following two reasons:
✓ High Performance - referred to as compute-clusters
✓ High Availability - achieved via redundancy

An off-the-shelf or custom load balancer, reverse proxy can be configured to serve the use case

• Question: How is this relevant for Big Data?

Hadoop nodes are a cluster for performance (independent Map/Reduce jobs are started on
multiple nodes) and availability (data is replicated on multiple nodes for fault tolerance)

Most Big Data systems run on a cluster configuration for performance and availability

44
Clusters – Peer to Peer computation

• Distributed Computing models can be classified as:


✓ Client Server models
✓ Peer-to-Peer models
• based on the structure and interactions of the nodes in a distributed system
• Nodes within a cluster use a Peer-to-Peer model of computation.
• There may be special control nodes that allocate and manage work, thus having a master-slave relationship.

45
Client-Server vs. Peer-to-Peer

• Client-Server Computation
✓ A server node performs the core computation – business logic in case of applications
✓ Client nodes request for such computation
✓ At the programming level this is referred to as the request-response model
✓ Email, network file servers, …

• Peer-to-Peer Computation:
✓ All nodes are peers i.e. they perform core computations and may act as client or
server for each other.
✓ bit torrent, some multi-player games, clusters

46
Cloud and Clusters

• A cloud uses a datacenter as the infrastructure on top of which services are provided
• e.g. AWS would have a datacenter in many regions - Mumbai, US east, … (you
can pick where you want your services deployed)
• A cluster is the basic building block for a datacenter:
✓ i.e. a datacenter is structured as a collection of clusters
• A cluster can host
✓ a multi-tenant service across clients - cost effective
✓ individual clients and their service(s) - dedicated instances

47
Motivation for using Clusters (1)

• Rate of obsolescence of computers is high


✓ Even for mainframes and supercomputers
✓ Servers (used for high performance computing) have to be replaced every 3 to
5 years.
• Solution: Build a cluster of commodity workstations
✓ Incrementally add nodes to the cluster to meet increasing workload
✓ Add nodes instead of replacing (i.e. let older nodes operate at a lower speed)
✓ This model is referred to as a scale-out cluster

48
Motivation for using Clusters (2)

• Scale-out clusters with commodity workstations as nodes are suitable for software
environments that are resilient:
✓ i.e. individual nodes may fail, but
✓ middleware and software will enable computations to keep running (and keep services
available) for end users
✓for instance, back-ends of Google and Facebook use this model.

• On the other hand, (public) cloud infrastructure is typically built as clusters of servers
✓ due to higher reliability of individual servers – used as nodes – (compared to that of
workstations as nodes).

49
Typical cluster components
[Layered stack, top to bottom:]
• Parallel applications
• Parallel programming environment (e.g. MapReduce)
• Sequential applications alongside the cluster middleware (e.g. Hadoop)
• On each cluster node: OS and runtimes, processor and memory, local storage, network stack
• Cluster nodes connected by a high-speed switching network


50
Split Brain – Evil of clustering

• Caused by failure of Heartbeat network connection(s)


• Two “halves” of a cluster keep running
✓All Heartbeat networks fail or are “partitioned”
• Each half wants to keep running
• After a split, both splits think that they're the active cluster.
• Recovery options:
✓Allow cluster half with majority number of nodes to survive
✓Force cluster with minority number of nodes to shut down
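
A minimal sketch of the majority (quorum) rule, assuming each half of the split can count the nodes it still reaches:

```python
def has_quorum(reachable_nodes, total_nodes):
    """A partition keeps running only if it sees a strict majority of the cluster."""
    return reachable_nodes > total_nodes // 2

# 5-node cluster split 3 / 2 by a heartbeat network failure
print(has_quorum(3, 5))  # True  -> this half survives
print(has_quorum(2, 5))  # False -> this half shuts down, avoiding split brain
```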

51
STONITH (Shoot The Other Node In The Head)

• STONITH is a Linux service for maintaining the integrity of nodes in an HA cluster.


• STONITH automatically powers down a node that is not working correctly.
• If a node fails to respond or is behaving unusually, STONITH ensures that the
node cannot do any damage, especially in the event of a split-brain scenario
(when two nodes both think they're the active server).
• For example, if a primary database node automatically fails over to a standby
node because of a disruption in service, the first node might still try to write data
to the shared storage system at the same time as the new primary node, which
could corrupt the data or impact its integrity. To prevent this from happening,
STONITH will shut down the first node or restart it and set it as the standby
server.

52
Cluster Middleware - Some Functions

Single System Image (SSI) infrastructure
✓ Glues together the OSs on all nodes to offer unified access to system resources
✓ Single process space
✓ Cluster IP or single entry point
✓ Single auth
✓ Single memory and IO space
✓ Process checkpointing and migration
✓ Single IPC space
✓ Single fs root
✓ Single virtual networking
✓ Single management GUI

High Availability (HA) infrastructure
✓ Cluster services for
  ✓ Availability
  ✓ Redundancy
  ✓ Fault-tolerance
  ✓ Recovery from failures

http://www.cloudbus.org/papers/SSI-CCWhitePaper.pdf
53
Example cluster: Hadoop
• A job divided into tasks
• Considers every task either as a Map or a Reduce
• Tasks assigned to a set of nodes (cluster)
• Special control nodes manage the nodes for resource
management, setup, monitoring, data transfer, failover etc.
• Hadoop clients work with these control nodes to get the job done
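
A toy, in-process Python sketch of the Map / Reduce task model (hypothetical functions, not the actual Hadoop API):

```python
from collections import defaultdict

def map_task(split):
    # Each map task processes one input split independently (word count example)
    return [(word, 1) for line in split for word in line.split()]

def reduce_task(key, values):
    # Each reduce task aggregates all values for one key
    return key, sum(values)

splits = [["big data systems", "parallel systems"], ["distributed systems"]]

grouped = defaultdict(list)          # "shuffle": group intermediate pairs by key
for split in splits:                 # map tasks would run on different cluster nodes
    for key, value in map_task(split):
        grouped[key].append(value)

print(dict(reduce_task(k, v) for k, v in grouped.items()))
# {'big': 1, 'data': 1, 'systems': 3, 'parallel': 1, 'distributed': 1}
```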

54
Limits of Parallelism in distributed computing

[Figure: Execution Time (secs) vs. Number of Tasks running concurrently (0 to 1200) on a Hadoop cluster]

Overheads for
✓ Distributed Scheduling
✓ Local on-node scheduling
✓ Communication
✓ Synchronization, etc.

55
Summary

• Motivation and classification of parallel systems


• Computing limits of speedup
• How replication, partitioning helps in Big Data storage and access
• Cluster computing basics

56
Next Session:
Fault Tolerance, Big Data Analytics and Systems
