PARALLEL DATABASES
CS561 - SPRING 2012
WPI, MOHAMED ELTABAKH
INTRODUCTION
In centralized databases:
Data is located in one place (on one server)
All DBMS functionality is handled by that server
Enforcing ACID properties of transactions
Concurrency control, recovery mechanisms
Answering queries
In distributed databases:
Data is stored in multiple places (each running its own DBMS)
New notion of distributed transactions
DBMS functionality is now distributed over many machines
Revisit how these functions work in a distributed environment
In parallel databases:
Machines are physically close to each other, e.g., in the same server room
Machines connect over dedicated high-speed LANs and switches
Communication cost is assumed to be small
Can use a shared-memory, shared-disk, or shared-nothing architecture
PARALLEL DATABASE & PARALLEL PROCESSING
[Figure: scanning 1 Terabyte at a bandwidth of 10 MB/s sequentially vs. scanning the same 1 Terabyte with 1,000-way parallelism, which takes about 1.5 minutes]
Divide a big problem into many smaller ones to be solved in parallel
Increase bandwidth (in our case, decrease query response time)
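To make the idea concrete, here is a minimal Python sketch (my own illustration, not part of the slides) that splits one big scan into m smaller scans run by a pool of workers; the data, the toy selection condition, and the helper name scan_chunk are all made up for the example.

```python
from multiprocessing import Pool

def scan_chunk(chunk):
    """Scan one partition of the data and return the matching rows."""
    return [row for row in chunk if row % 7 == 0]   # toy selection condition

if __name__ == "__main__":
    data = list(range(1_000_000))            # stands in for a big table
    m = 4                                    # degree of parallelism
    chunks = [data[i::m] for i in range(m)]  # split the work over m workers

    with Pool(m) as pool:
        # Each worker scans only its own chunk, so the response time is
        # roughly 1/m of a single sequential scan (ignoring overheads).
        results = pool.map(scan_chunk, chunks)

    matches = [row for part in results for row in part]
    print(len(matches))
```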
DIFFERENT ARCHITECTURES
Three possible architectures for passing information
Shared-memory
Shared-disk
Shared-nothing
1- SHARED-MEMORY ARCHITECTURE
Every processor has its own disk
Single memory address-space for all processors
Reading or writing to far memory can be slightly more expensive
2- SHARED-DISK ARCHITECTURE
Every processor has its own memory (not accessible by others)
All machines can access all disks in the system
Number of disks does not necessarily match the number of processors
3- SHARED-NOTHING ARCHITECTURE
Most common architecture nowadays
Every machine has its own memory and disk
Many cheap machines (commodity hardware)
Scales better
Easier to build
Cheaper cost
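The difference between the two extremes can be sketched in Python (a rough illustration of the idea, assuming a toy workload of summing a list of numbers; none of this code comes from the slides): shared-memory workers read one common in-memory table, while shared-nothing workers each receive only their own partition and send back a small partial result.

```python
import threading
from multiprocessing import Pool

table = list(range(100_000))   # stands in for a relation

# Shared-memory style: all threads read the same in-memory table directly.
def shared_memory_sum(num_threads=4):
    partials = [0] * num_threads
    def work(i):
        partials[i] = sum(table[i::num_threads])   # every thread sees 'table'
    threads = [threading.Thread(target=work, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)

# Shared-nothing style: each worker process gets only its own partition and
# ships back a small partial result; no memory or disk is shared.
def local_sum(partition):
    return sum(partition)

def shared_nothing_sum(num_workers=4):
    partitions = [table[i::num_workers] for i in range(num_workers)]
    with Pool(num_workers) as pool:
        return sum(pool.map(local_sum, partitions))

if __name__ == "__main__":
    print(shared_memory_sum(), shared_nothing_sum())
```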
TYPES OF PARALLELISM
Pipeline Parallelism (Inter-operator parallelism)
Ordered (or partially ordered) tasks, with different machines performing different tasks
Partitioned Parallelism (Intra-operator parallelism)
A single task is partitioned across many machines, each working on its own part of the data
[Figure: a pipeline of sequential tasks with an order between them vs. one sequential task partitioned across machines]
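The two kinds of parallelism can be mimicked in a few lines of Python (a toy sketch with made-up operators, not the slides' own example): the pipeline chains scan, filter, and project stages so that each stage could run on a different machine, while the partitioned version runs the same filter on every partition at once.

```python
from multiprocessing import Pool

# Pipeline (inter-operator): scan -> filter -> project are different tasks
# with an order between them; tuples stream from one stage to the next.
def scan(rows):
    for r in rows:
        yield r

def filter_op(stream):
    for r in stream:
        if r % 2 == 0:
            yield r

def project(stream):
    for r in stream:
        yield r * 10

# Partitioned (intra-operator): the same filter task runs on every partition.
def filter_partition(part):
    return [r for r in part if r % 2 == 0]

if __name__ == "__main__":
    rows = list(range(1_000))
    pipelined = list(project(filter_op(scan(rows))))

    parts = [rows[i::4] for i in range(4)]
    with Pool(4) as pool:
        partitioned = [r for p in pool.map(filter_partition, parts) for r in p]

    print(len(pipelined), len(partitioned))   # both process the same 500 rows
```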
SPEED-UP & SCALE-UP
Speed-Up: keep the data size fixed and add resources; throughput (Xact/sec.) increases, ideally linearly with the degree of ||-ism
Scale-Up: if resources are increased in proportion to the increase in data size, response time (sec./Xact) stays constant, ideally flat as the degree of ||-ism grows
[Figure: ideal speed-up and scale-up curves vs. degree of ||-ism]
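Written out as formulas (my own notation, chosen to match the axes above, where T(m, D) denotes the time to process data of size D on m machines):

```latex
% Speed-up: same data, more machines.  Scale-up: data grows with the machines.
\text{speed-up}(m) = \frac{T(1, D)}{T(m, D)} \quad (\text{ideal value: } m)
\qquad
\text{scale-up}(m) = \frac{T(1, D)}{T(m, m \cdot D)} \quad (\text{ideal value: } 1)
```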
PARTITIONING OF DATA
To partition a relation R over m machines
Range partitioning
Hash-based partitioning
Round-robin partitioning
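A minimal Python sketch of the three strategies (my own helper functions, not from the slides; the relation is just a list of keys, and the split points for the range case are chosen arbitrarily):

```python
import bisect

def round_robin_partition(R, m):
    # Tuple i goes to machine i mod m: perfectly even, but ignores the values.
    parts = [[] for _ in range(m)]
    for i, t in enumerate(R):
        parts[i % m].append(t)
    return parts

def hash_partition(R, m, key=lambda t: t):
    # A tuple goes to machine hash(key) mod m: equal keys land on the same machine.
    parts = [[] for _ in range(m)]
    for t in R:
        parts[hash(key(t)) % m].append(t)
    return parts

def range_partition(R, m, boundaries, key=lambda t: t):
    # boundaries holds m-1 split points; machine j gets the keys in its range.
    parts = [[] for _ in range(m)]
    for t in R:
        parts[bisect.bisect_right(boundaries, key(t))].append(t)
    return parts

# Example: partition the keys 0..9 over m = 3 machines.
R = list(range(10))
print(round_robin_partition(R, 3))
print(hash_partition(R, 3))
print(range_partition(R, 3, boundaries=[3, 6]))
```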
PARALLEL SELECTION
Each machine scans its own partition and applies the selection condition c
If the data are partitioned using round-robin or a hash function (over the entire tuple):
The resulting relation is expected to be well distributed over all nodes
All partitions will be scanned
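As a sketch of that behavior (my own code; the relation, the hash over the whole tuple, and the condition c are invented for illustration), every machine runs the same selection on its local partition, and because the partitioning says nothing about the selection attribute, all partitions must be scanned:

```python
from multiprocessing import Pool

def c(t):
    return t["age"] > 30                       # hypothetical selection condition

def select_local(partition):
    return [t for t in partition if c(t)]

if __name__ == "__main__":
    R = [{"id": i, "age": 20 + (i * 13) % 40} for i in range(1_000)]
    m = 4
    partitions = [[] for _ in range(m)]
    for t in R:                                # hash over the entire tuple
        partitions[hash(frozenset(t.items())) % m].append(t)

    with Pool(m) as pool:                      # all m partitions are scanned
        answer = [t for part in pool.map(select_local, partitions) for t in part]
    print(len(answer))
```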
PARALLEL JOIN
Each machine i receives the ith partitions of both R and S from all machines
Each machine then locally joins the partitions it holds
[Figure: each input relation is read from disk and routed by a hash function h into partitions 1, 2, ..., B-1, which are written back to disk before being shipped to the machines that will join them]
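A compact Python sketch of this partitioned join (my own code, with invented toy relations R and S whose join key sits in position 0): both relations are hash-partitioned on the join key, partition i of each is handed to worker i, and each worker joins its two partitions locally.

```python
from multiprocessing import Pool

def hash_partition(rel, m, key_pos):
    parts = [[] for _ in range(m)]
    for t in rel:
        parts[hash(t[key_pos]) % m].append(t)
    return parts

def local_join(args):
    r_part, s_part = args
    # Build a hash table on one side, then probe it with the other.
    table = {}
    for r in r_part:
        table.setdefault(r[0], []).append(r)
    return [r + s for s in s_part for r in table.get(s[0], [])]

if __name__ == "__main__":
    m = 4
    R = [(k, "r%d" % k) for k in range(100)]        # join key in position 0
    S = [(k % 50, "s%d" % k) for k in range(200)]   # join key in position 0
    r_parts = hash_partition(R, m, key_pos=0)
    s_parts = hash_partition(S, m, key_pos=0)

    with Pool(m) as pool:
        pieces = pool.map(local_join, list(zip(r_parts, s_parts)))
    joined = [t for piece in pieces for t in piece]
    print(len(joined))   # number of joined tuples (200 for this toy data)
```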
PARALLEL SORTING
Range-based
Merge-based
[Figure: range-based sorting over sites 1-8, where key range A is shipped to sites 1-4 and key range B to sites 5-8, and each site then sorts its range locally]
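A rough sketch of the range-based approach in Python (my own code; the input values and the split points are made up): each value range is sorted by a different worker, and concatenating the sorted ranges in order yields a globally sorted result.

```python
from multiprocessing import Pool

def sort_local(partition):
    return sorted(partition)

if __name__ == "__main__":
    data = [hash(str(i)) % 1000 for i in range(10_000)]   # unsorted input
    boundaries = [250, 500, 750]                           # 4 value ranges

    ranges = [[] for _ in range(len(boundaries) + 1)]
    for v in data:
        i = sum(v > b for b in boundaries)                 # pick the range
        ranges[i].append(v)

    with Pool(len(ranges)) as pool:
        sorted_ranges = pool.map(sort_local, ranges)

    result = [v for r in sorted_ranges for v in r]         # already globally sorted
    assert result == sorted(data)
```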
PERFORMANCE OF PARALLEL ALGORITHMS
In many cases, parallel algorithms reach (or come close to) their expected lower bound
If the parallelism degree is m, then the parallel cost is 1/m of the sequential cost
Cost here mostly refers to the query's response time
Example
[Figure: ideal speed-up (Xact/sec. throughput vs. degree of ||-ism) and scale-up (sec./Xact response time vs. degree of ||-ism) curves, as shown earlier]
PERFORMANCE OF PARALLEL ALGORITHMS (CONT'D)
The total disk I/O (summed over all machines) of a parallel algorithm can be larger than that of its sequential counterpart
But we get the benefit of the work being done in parallel
Example
Merge-sort join (serial case) has I/O cost = 3(B(R) + B(S))
Merge-sort join (parallel case) has total (summed) I/O cost = 5(B(R) + B(S))
Accounting for the parallelism, each machine performs 5(B(R) + B(S)) / m of that I/O
(B(R) and B(S) denote the number of pages of relations R and S)
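Plugging hypothetical page counts into these formulas (the numbers below are made up purely for illustration) shows the trade-off: more total I/O, but far less work per machine.

```python
B_R, B_S, m = 10_000, 5_000, 10       # pages of R, pages of S, degree of parallelism

serial_io       = 3 * (B_R + B_S)     # 45,000 page I/Os on a single machine
parallel_total  = 5 * (B_R + B_S)     # 75,000 page I/Os summed over all machines
parallel_per_mc = parallel_total / m  # 7,500 page I/Os per machine, done in parallel

print(serial_io, parallel_total, parallel_per_mc)
```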
[Figure: a relation partitioned by key range, with one machine running a table scan over the A..M partition while another runs an index scan over the N..Z partition]
SUMMARY
Three architectures: shared-memory, shared-disk, shared-nothing (the most common one)
Parallel algorithms
Intra-operator parallelism
Inter-operator parallelism