BDS Session 2
Janardhanan PS
[email protected]
Context
2
Topics for today
3
Topics for today
4
Roadblocks of Processor / Vertical Scaling
5
Frequency wall
Chip designers are under so much pressure to deliver faster CPUs that they will risk changing the
structure of your program, and possibly breaking it, in order to make it run faster
Revolutions in computing:
• 1990 - First Revolution in SW development – Object Oriented Programming
• Applications will increasingly need to be concurrent if they want to fully exploit continuing
exponential gains in multi-core CPU throughput
• Next Revolution in SW development – Parallel (Concurrent) Programming
6
Thermal wall (CPU Power consumption)
[Figure: power density (W/cm2) of Intel processors (4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, P6) from 1970 to 2010, rising from the level of a hot plate toward that of a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: Intel Corp.]
7
Memory Wall
[Figure: processor vs. memory performance over time ("Moore's Law"): µProc/CPU performance improves ~60%/year while DRAM performance improves ~7%/year, so the processor-memory performance gap grows ~50% per year.]
8
Serial Computing
• Software written for serial computation:
✓ A problem is broken into a discrete series of instructions
✓ Instructions are executed sequentially one after another
✓ Executed on a single processor
✓ Only one instruction may execute at any moment in time
✓ Single data stores - memory and disk
9
Parallel Computing
10
Spectrum of Parallelism
Pseudo-parallel → Parallel (intra-processor) → Distributed
12
Multi-processor Vs Multi-computer systems
Multiprocessor
» Shared memory address space
» No common clock
» Fast interconnect
» UMA - Uniform Memory Access
» NUMA - Non Uniform Memory Access
Multicomputer
[Figure: UMA and NUMA memory organizations for multiprocessor and multicomputer systems]
13
Interconnection Networks
14
Classification based on Instruction and Data parallelism
Instruction Stream and Data Stream
• The term ‘stream’ refers to a sequence or flow of either instructions or data operated on by the CPU.
• In the complete cycle of instruction execution, a flow of instructions from main memory to the CPU is
established. This flow of instructions is called the instruction stream.
• Similarly, there is a bi-directional flow of operands between processor and memory. This flow of
operands is called the data stream.
15
Flynn’s Taxonomy
Instruction streams: single or multiple; data streams: single or multiple
» SISD (single instruction, single data): uniprocessors, pipelining
» SIMD (single instruction, multiple data)
» MISD (multiple instruction, single data): uncommon; used for fault tolerance
» MIMD (multiple instruction, multiple data)
16
Some basic concepts ( esp. for programming in Big Data Systems )
» Coupling
» Tight - SIMD, MISD shared memory systems
» Loose - NOW, distributed systems, no shared memory
» Speedup
» how much faster can a program run when given N processors as opposed to 1 processor — T(1) / T(N)
» We will study Amdahl’s Law, Gustafson’s Law
» Parallelism of a program
» Compare time spent in computations to time spent for communication via shared memory or message passing
» Granularity
» Average number of compute instructions before communication is needed across processors
» Note:
» If granularity is coarse, use loosely coupled distributed systems; else use tightly coupled multi-processors/multi-computers
» Potentially high parallelism does not lead to high speedup if the granularity is too small, since communication overheads become high
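Speedup as defined above, T(1) / T(N), can also be measured directly. Below is a minimal Python sketch (illustrative only, not from the slides; the function names busy_work and measure are mine) that times an embarrassingly parallel, compute-bound task with 1 and 4 worker processes and reports the speedup.

```python
# Minimal sketch (illustrative): measure speedup T(1)/T(N) for an
# embarrassingly parallel, CPU-bound task using multiprocessing.
import time
from multiprocessing import Pool

def busy_work(n):
    # Purely compute-bound work with no communication (coarse granularity).
    total = 0
    for i in range(n):
        total += i * i
    return total

def measure(num_workers, chunks):
    start = time.perf_counter()
    with Pool(processes=num_workers) as pool:
        pool.map(busy_work, chunks)
    return time.perf_counter() - start

if __name__ == "__main__":
    chunks = [2_000_000] * 8          # 8 independent work items
    t1 = measure(1, chunks)           # T(1): single worker
    t4 = measure(4, chunks)           # T(4): four workers
    print(f"T(1)={t1:.2f}s  T(4)={t4:.2f}s  speedup={t1 / t4:.2f}")
```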
17
Comparing Parallel and Distributed Systems
Parallel systems: tight coupling of processing resources that are used for solving a single, complex problem; programs may demand fine-grain parallelism.
Distributed systems: loose coupling of computers connected in a network, providing access to data and remotely located resources; programs have coarse-grain parallelism.
18
Topics for today
19
Motivation for parallel / distributed systems (1)
• Inherently distributed applications
• e.g. a financial transaction involving 2 or more parties
• Better scaling by creating multiple smaller parallel tasks instead of one complex task
• e.g. evaluate an aggregate over 6 months data
• Processors getting cheaper and networks faster
• e.g. Processor speed 2x / 1.5 years, network traffic 2x/year, processors
limited by energy consumption
• Better scale using replication or partitioning of storage
• e.g. replicated media servers for faster access, or replicated / partitioned shards in search engines
• Access to shared remote resources
• e.g. remote central DB
• Increased performance/cost ratio compared to specialized parallel systems
• e.g. search engine runs on a Network-of-Workstations
20
Motivation for parallel / distributed systems (2)
21
Distributed network of content caching servers
This would instead be a P2P network if you were downloading the content for free using BitTorrent
22
Techniques for High Volume Data Processing
High-Performance Computing (HPC): known to offer high performance and scalability by using in-memory
computing; used to develop specialty and custom scientific applications for research, where results are
more valuable than cost
23
Topics for today
24
Limits of Parallelism
25
Amdahl’s Law – (1) - 1967
Speedup = 1 / ((1 - P) + P/N), where P is the fraction of the program that can be parallelized and N is the number of processors
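As a quick sanity check of the formula, here is a minimal Python sketch (illustrative, not part of the original slides) that evaluates the Amdahl speedup for a few values of P and N and shows the ceiling of 1 / (1 - P).

```python
# Minimal sketch (illustrative): Amdahl's Law, speedup = 1 / ((1 - P) + P / N),
# where P is the parallelizable fraction and N the number of processors.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.50, 0.90, 0.99):
    for n in (4, 16, 256):
        print(f"P={p:.2f}  N={n:4d}  speedup={amdahl_speedup(p, n):6.2f}")

# Even with P = 0.90, speedup is capped at 1 / (1 - 0.90) = 10 as N grows.
```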
Amdahl’s Law – (2)
27
Amdahl’s Law – Example calculation
28
Actual Limitations in speedup
29
Why Amdahl’s Law is such bad news
31
Speedup plot
Speedup for 1, 4, 16, 64, and 256 processors: T1 / TN = 1 / (f + (1-f)/N), where f is the sequential fraction of the code
[Figure: speedup (0 to 256) plotted against the percentage of code that is sequential (0% to 26%).]
32
But wait - maybe we are missing something
33
Gustafson-Barsis Law - (1988)
Let W be the execution workload of the program before adding resources
Let f be the sequential fraction of the workload
So W = f * W + (1-f) * W
Let W(N) be the larger execution workload after adding N processors
So W(N) = f * W + N * (1-f) * W, since the parallelizable work can increase N times
The theoretical speedup when the scaled workload runs in the same fixed time T:
S(N) = (W(N) / T) / (W / T) = W(N) / W = (f * W + N * (1-f) * W) / W
S(N) = f + (1-f) * N
S(N) is not limited by f as N scales
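To see how the two laws diverge for the same sequential fraction f, here is a minimal Python sketch (illustrative, not from the slides; gustafson_speedup and amdahl_speedup are my own helper names).

```python
# Minimal sketch (illustrative): Gustafson-Barsis scaled speedup S(N) = f + (1 - f) * N,
# compared with Amdahl's fixed-workload speedup for the same sequential fraction f.
def gustafson_speedup(f, n):
    return f + (1.0 - f) * n

def amdahl_speedup(f, n):
    # Here f is the sequential fraction, so the parallel fraction is (1 - f).
    return 1.0 / (f + (1.0 - f) / n)

f = 0.10   # 10% of the work is sequential
for n in (4, 16, 64, 256):
    print(f"N={n:4d}  Gustafson={gustafson_speedup(f, n):7.1f}  "
          f"Amdahl={amdahl_speedup(f, n):5.2f}")
```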
34
Usage Scenarios
• Amdahl’s law can be used when the workload is fixed, as it calculates the potential
speedup with the assumption of a fixed workload. Moreover, it can be utilized when
the non-parallelizable portion of the task is relatively large, highlighting
the diminishing returns of parallelization.
• Gustafson’s law is applicable when the workload or problem size can be scaled
proportionally with the available resources. It also addresses problems requiring
larger problem sizes or workloads, promoting the development of systems capable of
handling such realistic computations.
35
Topics for today
36
Data Access Strategies: Partition
• Strategy:
✓ Partition data – typically, equally – across the nodes of the (distributed) system
• Cost:
✓ Network access and merge cost when query needs to go across partitions
• Advantage(s):
✓ Works well if task/algorithm is (mostly) data parallel
✓ Works well when there is Locality of Reference within a partition
• Concerns
✓ Merge across data fetched from multiple partitions
✓ Partition balancing
✓ Row vs columnar layouts - what improves locality of reference?
✓ Will study shards and partitions in Hadoop, MongoDB, and Cassandra - a toy partitioning sketch follows below
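As a concrete illustration of the partition strategy, here is a toy Python sketch (illustrative only; it is not the partitioning scheme of Hadoop, MongoDB, or Cassandra) that hash-partitions records across a fixed number of nodes and answers a cross-partition query with a scatter and merge.

```python
# Minimal sketch (illustrative): hash-partition records across nodes and
# answer a query by scattering it to every partition and merging the results.
NUM_NODES = 4

def node_for(key):
    # Simple hash partitioning; real systems often use consistent hashing instead.
    return hash(key) % NUM_NODES

partitions = {i: {} for i in range(NUM_NODES)}

def put(key, value):
    partitions[node_for(key)][key] = value

def count_matching(predicate):
    # Cross-partition query: each node scans locally, results are merged centrally.
    return sum(
        sum(1 for v in part.values() if predicate(v))
        for part in partitions.values()
    )

for i in range(1000):
    put(f"user{i}", {"age": i % 90})

print(count_matching(lambda rec: rec["age"] > 60))
```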
37
Data Access Strategies: Replication
• Strategy:
✓ Replicate all data across nodes of the (distributed) system
• Cost:
✓ Higher storage cost
• Advantage(s):
✓ All data accessed from local disk: no (runtime) communication on the network
✓ High performance with parallel access
✓ Fail over across replicas
• Concerns
✓ Keep replicas in sync — various consistency models between readers and writers
✓ Will study in depth for MongoDB, Cassandra
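A toy Python sketch of the replication strategy (illustrative only, not the actual replication protocol of MongoDB or Cassandra): every write is applied to all replicas, which is the sync/consistency concern above, while reads are served from the local copy with no network access.

```python
# Minimal sketch (illustrative): full replication. Writes go to every replica,
# reads are served from the "local" copy, so no network access is needed at read time.
replicas = [dict() for _ in range(3)]   # one dict per node

def write(key, value):
    # Keeping replicas in sync: here a naive synchronous write to all copies.
    # Real systems offer weaker / tunable consistency models instead.
    for replica in replicas:
        replica[key] = value

def read(key, local_node=0):
    # Any node can answer from its own copy; parallel reads scale with replicas.
    return replicas[local_node].get(key)

write("config", {"version": 7})
print(read("config", local_node=2))     # served locally on node 2
```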
38
Data Access Strategies: (Dynamic) Communication
• Strategy:
✓ Communicate (at runtime) only the data that is required
• Cost:
✓ High network cost for loosely coupled systems when the data set to be exchanged is large
• Advantage(s):
✓ Minimal communication cost when only a small portion of the data is actually required
by each node
• Concerns
✓ Highly available and performant network
✓ Fairly independent parallel data processing
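For contrast, a toy Python sketch of the communicate-on-demand strategy (illustrative; the "remote" store and the fetch call stand in for a real network RPC): a node keeps no copy of the data and requests only the keys it needs at runtime.

```python
# Minimal sketch (illustrative): dynamic communication. A worker holds no copy of
# the data and requests only the keys it needs from the owning node at runtime.
remote_store = {f"item{i}": i * i for i in range(100_000)}   # lives on another node

def fetch(keys):
    # Stand-in for a network RPC: only the requested subset crosses the "network".
    return {k: remote_store[k] for k in keys}

def process(needed_keys):
    data = fetch(needed_keys)            # small transfer if needed_keys is small
    return sum(data.values())

print(process(["item3", "item42", "item999"]))
```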
39
Data Access Strategies – Networked Storage
40
Topics for today
41
Cluster - Goals
• Continuous availability
• Data integrity
• Linear scalability
• Open access
• Parallelism in processing
• Distributed systems management
42
Computer Cluster - Definition
• A cluster is a type of distributed processing system
✓ consisting of a collection of inter-connected stand-alone computers
✓ working together as a single, integrated computing resource
✓Examples:
• High Availability Clusters
ServiceGuard, Lifekeeper, Failsafe, heartbeat, HACMP, failover clusters
• High Performance Clusters
Beowulf; 1000 nodes; parallel programs; MPI
• Database Clusters
Oracle Parallel Server (OPS) - RAC
• Storage Clusters
Cluster filesystems; same view of data from each node.
HDFS
43
Cluster - Objectives
• A computer cluster is typically built for one of the following two reasons:
✓ High Performance - referred to as compute-clusters
✓ High Availability - achieved via redundancy
An off-the-shelf or custom load balancer or reverse proxy can be configured to serve the use case
Hadoop nodes form a cluster for both performance (independent Map/Reduce jobs are started on
multiple nodes) and availability (data is replicated on multiple nodes for fault tolerance)
Most Big Data systems run on a cluster configuration for performance and availability
44
Clusters – Peer to Peer computation
45
Client-Server vs. Peer-to-Peer
• Client-Server Computation
✓ A server node performs the core computation – business logic in case of applications
✓ Client nodes request such computation
✓ At the programming level this is referred to as the request-response model
✓ Email, network file servers, …
• Peer-to-Peer Computation:
✓ All nodes are peers, i.e. they perform core computations and may act as client or
server for each other.
✓ e.g. BitTorrent, some multi-player games, clusters
46
Cloud and Clusters
• A cloud uses a datacenter as the infrastructure on top of which services are provided
• e.g. AWS has datacenters in many regions - Mumbai, US East, … (you
can pick where you want your services deployed)
• A cluster is the basic building block for a datacenter:
✓ i.e. a datacenter is structured as a collection of clusters
• A cluster can host
✓ a multi-tenant service across clients - cost effective
✓ individual clients and their service(s) - dedicated instances
47
Motivation for using Clusters (1)
48
Motivation for using Clusters (2)
• Scale-out clusters with commodity workstations as nodes are suitable for software
environments that are resilient:
✓ i.e. individual nodes may fail, but
✓ middleware and software will enable computations to keep running (and keep services
available) for end users
✓ for instance, the back-ends of Google and Facebook use this model.
• On the other hand, (public) cloud infrastructure is typically built as clusters of servers
✓ due to higher reliability of individual servers – used as nodes – (compared to that of
workstations as nodes).
49
Typical cluster components
[Figure: typical cluster components - parallel applications running on inter-connected cluster nodes.]
51
STONITH (Shoot The Other Node In The Head)
52
Cluster Middleware - Some Functions
http://www.cloudbus.org/papers/SSI-CCWhitePaper.pdf
53
Example cluster: Hadoop
• A job is divided into tasks
• Every task is either a Map or a Reduce
• Tasks assigned to a set of nodes (cluster)
• Special control nodes manage the nodes for resource
management, setup, monitoring, data transfer, failover etc.
• Hadoop clients work with these control nodes to get the job done
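To make the Map/Reduce split concrete, here is a minimal word-count sketch in plain Python (illustrative only; it mimics the MapReduce programming model in a single process and does not use the actual Hadoop API or its distributed runtime).

```python
# Minimal sketch (illustrative): the MapReduce programming model that a Hadoop
# job expresses. Hadoop would run many map tasks and reduce tasks on cluster
# nodes; here everything runs in one process to show the data flow only.
from collections import defaultdict

def map_phase(line):
    # Map task: emit (word, 1) pairs for one input split.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce task: aggregate all counts for one key.
    return word, sum(counts)

lines = ["big data systems", "big clusters run big jobs"]

# Shuffle: group intermediate pairs by key before the reduce tasks run.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

print(dict(reduce_phase(w, c) for w, c in grouped.items()))
```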
54
Limits of Parallelism in distributed computing
Overheads limiting parallelism:
✓ Distributed scheduling
✓ Local on-node scheduling
✓ Communication
✓ Synchronization, etc.
[Figure: execution time in seconds (0 to 10) plotted against scale (0 to 1200), illustrating these overheads.]
55
Summary
56
Next Session:
Fault Tolerance, Big Data Analytics and Systems