adbms-unit4
ADVANTAGES OF PARALLEL DATABASES
1. Increased throughput
2. Improved response time
3. Useful for querying extremely large databases
4. Substantial performance improvement
5. Increased system availability
6. Greater flexibility
7. Ability to serve a large number of users
1. Shared-memory Multiple CPU Parallel Database Architecture
A computer has several active CPUs that are attached to an interconnection network
and share a single main memory and a common array of disk storage. A single copy of a
multithreaded OS and a multithreaded DBMS can support multiple CPUs. This structure is
used for achieving moderate parallelism.
[Figure: shared-memory architecture: CPUs connected through an interconnection network to a shared memory and a common array of disks]
Benefits
Communication between CPUs is extremely efficient, since any CPU can access
data in the shared memory without software-level data movement.
Limitations
The design must take special precautions so that different CPUs have equal access
to the common memory.
This architecture is not scalable beyond 80 or 100 CPUs in parallel; the network
becomes a bottleneck as the number of CPUs increases.
The addition of more CPUs may cause CPUs to spend time waiting for their turn
to access memory.
2. Shared-disk Multiple CPU Parallel Database Architecture
[Figure: shared-disk architecture: CPUs with private memories, connected through an interconnection network to a shared array of disks]
Benefits
Easy to load balance, since data does not have to be permanently divided among
CPUs.
The memory bus is not a bottleneck.
Provides a high degree of fault tolerance: in case of a CPU failure, the other CPUs
take over its tasks, since the database resides on disks that are accessible from
all CPUs.
Accepted in a wide range of applications.
Limitations
Interference and memory contention become a bottleneck as the number of CPUs
increases.
Problem of scalability: the interconnection to the disk subsystem becomes a
bottleneck.
3. Shared-nothing Multiple CPU Parallel Database Architecture
[Figure: shared-nothing architecture: each CPU with its own memory and disks, connected only through an interconnection network]
Benefits
Minimises contention among CPUs by not sharing resources, and therefore offers a
high degree of scalability.
Only queries, accesses to non-local disks, and result relations pass through the
network.
Adding more CPUs and more disks enables the system to grow and provides a high
degree of scalability; it provides linear speed up and linear scale up.
Linear speed up and linear scale up increase the transmission capacity of the
shared-nothing architecture, so it can easily support a large number of CPUs.
Limitations
The costs of communication and of accessing non-local disks are higher than in the
other architectures, since sending data involves software interaction at both ends.
Applications:
1. Speed UP
2. Scale UP
3. Synchronisation
4. Locking
1. Speed UP
It is the property in which the time taken for performing a task decreases in
proportion to the number of CPUs and disks in parallel; that is, the property of
running a given task in less time by increasing the degree of parallelism.
Speed up due to parallelism can be defined as
Speed up = to / tp
where to = time taken by the original (smaller) system and tp = time taken by the
parallel (larger) system.
For example, if the original system takes 60 sec to perform a task and the parallel
system takes 30 sec to perform the same task, then the speed up is 60/30 = 2. A
parallel system is said to demonstrate linear speed up if the speed up is N when the
larger system has N times the resources of the smaller system. If the speed up is
less than N, the system is said to demonstrate sub-linear speed up.
[Figure: linear speed up: speed vs. resources]
2. Scale UP
Scale up is the property in which the performance of the parallel database is
sustained when the number of CPUs and disks is increased in proportion to the
amount of data.
Scale up due to parallelism can be defined as
Scale up = vp / vo
where vp = parallel (large) processing volume and vo = original (small) processing
volume.
[Figure: linear scale up: performance sustained as volume and resources grow together]
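As a quick numeric check of the two definitions above, the following sketch (function and variable names are illustrative, not from the text) computes both ratios:

```python
def speed_up(t_original, t_parallel):
    """Speed up = to / tp: time on the original system divided by time
    on the parallel system for the SAME task."""
    return t_original / t_parallel

def scale_up(volume_parallel, volume_original):
    """Scale up = vp / vo: volume the large system processes divided by
    the volume the original system processes in the same time."""
    return volume_parallel / volume_original

# The example from the text: 60 sec originally, 30 sec in parallel.
s = speed_up(60, 30)
print(s)  # 2.0

# With N = 2 times the resources, speed up == N indicates linear speed up.
N = 2
print("linear" if s == N else "sub-linear")  # linear
```

A speed up below N with the same inputs would print "sub-linear", matching the definition above.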
4. Locking
It is a method of synchronising concurrent tasks. For external locking, a
distributed lock manager (DLM), which is a part of the OS, is used. The DLM
allows applications to synchronise access to resources such as data, software, and
peripherals.
QUERY PARALLELISM
Parallelism is used to provide speed up and scale up, so that queries are executed
faster by adding more resources. The main challenge of parallel databases is query
parallelism. Some of the query parallelism architectures are:
I/O Parallelism
Intra Query Parallelism
Inter Query Parallelism
Intra-Operation Parallelism
Inter-operation parallelism
1. I/O Parallelism
I/O parallelism is the simplest form of parallelism, in which the relations are
partitioned on multiple disks to reduce the time needed to retrieve them from disk.
The input data is partitioned and each partition is processed in parallel; the
results are combined after all partitions have been processed. This is also called
data partitioning. The partitioning techniques are:
Hash Partitioning
Range Partitioning
Round-Robin Partitioning
Schema Partitioning
HASH PARTITIONING
o A hash function with range 0 to n-1 is applied to the partitioning attribute of
each tuple, and the tuple is sent to the disk given by the hash value.
RANGE PARTITIONING
o A partitioning vector assigns contiguous ranges of partitioning-attribute values
to the disks; each tuple is sent to the disk whose range contains its value.
ROUND-ROBIN PARTITIONING
o The relation is scanned in any order and the ith tuple is sent to disk number
i mod n.
o It ensures an even distribution of tuples across disks: each disk has
approximately the same number of tuples as the others.
Advantages
o Ideally suited for applications that wish to read the entire relation
sequentially for each query.
Disadvantages:
o Both point queries and range queries are complicated to process, since
all n disks must be searched.
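The three data-partitioning techniques can be sketched as follows; the sample tuples, number of disks, and partitioning vector are assumptions made for illustration:

```python
# Sketch of the three partitioning techniques; sample tuples, disk count,
# and the partitioning vector are invented for illustration.
N_DISKS = 3
tuples = [("A-305", 500), ("A-226", 336), ("A-155", 62),
          ("A-177", 205), ("A-402", 10000), ("A-639", 750)]

def round_robin(ts, n):
    """Round-robin: the ith tuple goes to disk i mod n (even spread)."""
    disks = [[] for _ in range(n)]
    for i, t in enumerate(ts):
        disks[i % n].append(t)
    return disks

def hash_partition(ts, n, key=lambda t: t[0]):
    """Hash: a tuple goes to the disk given by h(partitioning attribute) mod n."""
    disks = [[] for _ in range(n)]
    for t in ts:
        disks[hash(key(t)) % n].append(t)
    return disks

def range_partition(ts, vector, key=lambda t: t[1]):
    """Range: partitioning vector [v0, v1] sends key < v0 to disk 0,
    v0 <= key < v1 to disk 1, and key >= v1 to disk 2."""
    disks = [[] for _ in range(len(vector) + 1)]
    for t in ts:
        disks[sum(key(t) >= v for v in vector)].append(t)
    return disks

rr = round_robin(tuples, N_DISKS)
print([len(d) for d in rr])   # [2, 2, 2]: even distribution
rp = range_partition(tuples, [300, 1000])
print([t[0] for t in rp[2]])  # ['A-402']: only the balance >= 1000
```

Note how round-robin guarantees balance while range partitioning can skew if the partitioning vector is chosen poorly.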
SCHEMA PARTITIONING
o Different relations within the database schema are placed on different disks.
For example, suppose a relation has been partitioned across multiple disks by range
partitioning on some attribute, and a user wants to sort on the partitioning
attribute. The sort operation can be implemented by sorting each partition in
parallel and then concatenating the sorted partitions to get the final sorted
relation.
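The parallel sort just described can be simulated in a few lines; the partition contents below are made up, and the per-partition sorts would run on separate CPUs in a real system:

```python
# Simulation of the parallel sort above: each range partition is sorted
# independently (in parallel on real hardware), then the sorted runs are
# concatenated in range order.  Partition contents are invented.
partitions = [
    [205, 62],           # partition 0: values below 300
    [750, 336, 500],     # partition 1: values in [300, 1000)
    [10000, 1123],       # partition 2: values 1000 and above
]

sorted_runs = [sorted(p) for p in partitions]     # the parallel step
result = [v for run in sorted_runs for v in run]  # concatenate in range order

all_values = [v for p in partitions for v in p]
print(result == sorted(all_values))  # True: no global merge step is needed
```

Because the partitions are ranges on the sort attribute, simple concatenation yields the globally sorted relation, which is why range partitioning suits parallel sorting.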
2. Intra-Query Parallelism
Here, a single query is executed in parallel on multiple CPUs and disks, which
speeds up long-running queries.
[Figure: intra-query parallelism: one query split across CPUs over an interconnection network]
3. Inter-Query Parallelism
Here, different queries or transactions execute in parallel with one another, which
increases transaction throughput.
[Figure: inter-query parallelism: multiple queries executing concurrently over an interconnection network]
4.Intra-Operation Parallelism
Here, the execution of each individual operation of a task, such as sorting,
projection, or join, is parallelized. It scales better with increasing parallelism.
5.Inter-operation Parallelism
Here, the different operations in a query expression are executed in parallel. The
two types of inter-operation parallelism are
1. Pipelined Parallelism
2. Independent Parallelism
Pipelined Parallelism
Here, the output tuples of one operation are consumed by the next operation as they
are produced, so a chain of operations can run concurrently.
Disadvantages:
Does not scale up well, since pipeline chains often do not attain sufficient length.
Independent Parallelism
Here, operations in a query expression that do not depend on one another are
executed in parallel.
DISTRIBUTED DATABASES
[Figure: a distributed database: sites A, B, C, and D connected by a communication network]
Architecture of Distributed Databases
1. Client-server
2. Collaborating Server
3. Middleware Systems
Client-Server Architecture
The DBMS workload is split into two logical components: client and server.
The client is the user of the resource, whereas the server is the provider of the
resource.
Applications and tools are put on one or more client platforms and are connected to
the DBMS that resides on the server.
The clients send requests to the server; the DBMS, in turn, services these requests
and returns the results to the client.
Most modern information systems are based on client-server architecture.
[Figure: client-server architecture: users working through applications and tools on client machines, connected to a database server]
Components
Clients in the form of intelligent workstations.
DBMS Server
Communication n/w
Software applications connecting clients, server, and the network.
Benefits
Simple to implement
Better Adaptability
Use of GUI
Less Expensive
Resources are optimally utilized.
Computing platform Independence
Productivity improvement
Improved Performance.
Limitations
A single query cannot be spanned across multiple servers.
The client process is quite complex.
An increase in the number of users and processing sites often creates security
problems.
2. Collaborating Server Systems
There are several database servers, each capable of running transactions against
local data, which cooperatively execute transactions spanning multiple servers.
When a server receives a query that requires access to other servers, it generates
the appropriate sub-queries and sends them to the other servers. In this way,
collaboration among the various servers takes place.
3. Middleware Systems
Also called data access middleware, it is designed to allow a single query to span
multiple servers, without requiring all servers to be capable of managing such
execution strategies.
Middleware provides users with a consistent interface to multiple DBMSs and file
systems, and hides the heterogeneous environment from programmers.
Middleware is a layer of software that works with a special server and coordinates
the execution of queries and transactions. It is responsible for routing a local
request to one or more remote servers, translating the request from one SQL dialect
to another as needed, and converting data from one format to another.
This architecture consists of an API, a middleware engine, drivers, and native
interfaces. The API usually consists of a series of function calls as well as a
series of data access statements.
1. DATA REPLICATION
Full replication of a relation is the case where the relation is stored at all sites.
Fully redundant databases are those in which every site contains a copy of the
entire database.
Advantages of Replication
a. Availability: failure of a site containing relation r does not result in
unavailability of r if replicas exist.
b. Parallelism: queries on r may be processed by several nodes in parallel.
c. Reduced data transfer: relation r is available locally at each site containing a
replica of r.
Disadvantages of Replication
a. Increased cost of updates: each replica of relation r must be updated.
b. Increased complexity of concurrency control: concurrent updates to distinct
replicas may lead to inconsistent data unless special concurrency control
mechanisms are implemented.
i. One solution: choose one copy as primary copy and apply
concurrency control operations on primary copy
2. DATA FRAGMENTATION
Division of relation r into fragments r1, r2, …, rn which contain sufficient information to
reconstruct relation r.
Horizontal fragmentation: each fragment ri is a subset of the tuples of relation r
(a selection on r).
Vertical fragmentation: the schema for relation r is split into several smaller
schemas.
All schemas must contain a common candidate key (or superkey) to ensure the
lossless-join property.
A special attribute, the tuple-id attribute, may be added to each schema to serve
as a candidate key.
Horizontal Fragmentation of account Relation

account1 = σ branch_name = "Hillside" (account)

branch_name   account_number   balance
Hillside      A-305            500
Hillside      A-226            336
Hillside      A-155            62

account2 = σ branch_name = "Valleyview" (account)

branch_name   account_number   balance
Valleyview    A-177            205
Valleyview    A-402            10000
Valleyview    A-408            1123
Valleyview    A-639            750
Vertical Fragmentation of employee_info Relation

deposit1 = Π branch_name, customer_name, tuple_id (employee_info)

branch_name   customer_name   tuple_id
Hillside      Lowman          1
Hillside      Camp            2
Valleyview    Camp            3
Valleyview    Kahn            4
Hillside      Kahn            5
Valleyview    Kahn            6
Valleyview    Green           7

deposit2 = Π account_number, balance, tuple_id (employee_info)

account_number   balance   tuple_id
A-305            500       1
A-226            336       2
A-177            205       3
A-402            10000     4
A-155            62        5
A-408            1123      6
A-639            750       7
ADVANTAGES OF FRAGMENTATION
o Horizontal:
o allows parallel processing on fragments of a relation
o allows a relation to be split so that tuples are located where they are most
frequently accessed
o Vertical:
o allows tuples to be split so that each part of the tuple is stored where it is
most frequently accessed
o tuple-id attribute allows efficient joining of vertical fragments
o allows parallel processing on a relation
o Vertical and horizontal fragmentation can be mixed.
o Fragments may be successively fragmented to an arbitrary depth.
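A small sketch ties the two fragmentation styles together; it reuses a few tuples from the examples above, and the dictionary representation of relations is an assumption for illustration:

```python
# Sketch of horizontal and vertical fragmentation; relations are lists of
# dicts, with tuple values taken from the examples in the text.
account = [
    {"branch": "Hillside", "acct": "A-305", "bal": 500},
    {"branch": "Valleyview", "acct": "A-177", "bal": 205},
]

# Horizontal fragmentation: a selection on branch_name per fragment.
account1 = [t for t in account if t["branch"] == "Hillside"]
account2 = [t for t in account if t["branch"] == "Valleyview"]
assert account1 + account2 == account  # the union reconstructs r

# Vertical fragmentation: projections sharing a tuple_id candidate key.
employee_info = [
    {"tuple_id": 1, "branch": "Hillside", "cust": "Lowman", "acct": "A-305", "bal": 500},
    {"tuple_id": 2, "branch": "Hillside", "cust": "Camp", "acct": "A-226", "bal": 336},
]
deposit1 = [{"tuple_id": t["tuple_id"], "branch": t["branch"], "cust": t["cust"]}
            for t in employee_info]
deposit2 = [{"tuple_id": t["tuple_id"], "acct": t["acct"], "bal": t["bal"]}
            for t in employee_info]

# A natural join on tuple_id reconstructs the original relation (lossless).
rejoined = [{**a, **b} for a in deposit1 for b in deposit2
            if a["tuple_id"] == b["tuple_id"]]
print(rejoined == employee_info)  # True
```

The shared tuple_id is what makes the vertical join lossless, mirroring the candidate-key requirement stated above.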
Single Lock-Manager Approach
The system maintains a single lock manager that resides at a single chosen site,
say Si.
When a transaction needs to lock a data item, it sends a lock request to Si, and
the lock manager determines whether the lock can be granted immediately:
o If yes, the lock manager sends a message to the site which initiated the request.
o If no, the request is delayed until it can be granted, at which time a message is
sent to the initiating site.
The transaction can read the data item from any one of the sites at which a replica
of the data item resides.
Writes must be performed on all replicas of a data item.
Advantages of the scheme:
o Simple implementation
o Simple deadlock handling
Disadvantages of the scheme:
o Bottleneck: the lock manager site becomes a bottleneck
o Vulnerability: the system is vulnerable to lock manager site failure
For replicated data, several lock-protocol variants distribute the lock-manager
function among the sites:
a. Primary copy
b. Majority protocol
c. Biased protocol
d. Quorum consensus
Primary copy
Choose one replica of data item to be the primary copy.
o Site containing the replica is called the primary site for that data item
o Different data items can have different primary sites
When a transaction needs to lock a data item Q, it requests a lock at the primary site
of Q.
o Implicitly gets lock on all replicas of the data item
Benefit
o Concurrency control for replicated data handled similarly to unreplicated
data - simple implementation.
Drawback
o If the primary site of Q fails, Q is inaccessible even though other sites
containing a replica may be accessible.
Majority Protocol
Local lock manager at each site administers lock and unlock requests for data items
stored at that site.
When a transaction wishes to lock an unreplicated data item Q residing at site Si,
a message is sent to Si's lock manager.
o If Q is locked in an incompatible mode, then the request is delayed until it
can be granted.
o When the lock request can be granted, the lock manager sends a message
back to the initiator indicating that the lock request has been granted.
When a transaction wishes to lock a replicated data item Q, it must send lock
requests to, and obtain locks from, more than half of the sites at which replicas
of Q are stored.
Biased Protocol
Local lock manager at each site as in majority protocol, however, requests for
shared locks are handled differently than requests for exclusive locks.
Shared locks. When a transaction needs to lock data item Q, it simply requests a lock
on Q from the lock manager at one site containing a replica of Q.
Exclusive locks. When transaction needs to lock data item Q, it requests a lock on Q
from the lock manager at all sites containing a replica of Q.
Advantage - imposes less overhead on read operations.
Disadvantage - additional overhead on writes
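A hypothetical sketch of the biased protocol's messaging cost (the replica placement table is invented for illustration): a shared lock contacts a single replica site, while an exclusive lock must contact them all:

```python
# Hypothetical sketch of the biased protocol: shared locks go to one
# replica site, exclusive locks to every replica site.  The replica
# placement below is an invented example.
replica_sites = {"Q": ["S1", "S2", "S3"]}

def sites_to_lock(item, mode):
    """Return the sites whose lock managers must grant the request."""
    sites = replica_sites[item]
    if mode == "S":        # shared: any one replica suffices (cheap reads)
        return [sites[0]]
    if mode == "X":        # exclusive: all replicas (expensive writes)
        return list(sites)
    raise ValueError(mode)

print(sites_to_lock("Q", "S"))  # ['S1']
print(sites_to_lock("Q", "X"))  # ['S1', 'S2', 'S3']
```

The asymmetry in the two outputs is exactly the trade-off stated above: less overhead on reads, additional overhead on writes.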
3. Time stamping
Global Deadlock Detection
A deadlock-detection coordinator maintains a global wait-for graph, which is
updated when:
o a new edge is inserted in or removed from one of the local wait-for graphs.
If the coordinator finds a cycle, it selects a victim and notifies all sites. The
sites roll back the victim transaction.
[Figure: local and global wait-for graphs]
The three-phase commit (3PC) protocol is more complicated and more expensive,
but avoids some drawbacks of two-phase commit protocol. This protocol is not
used in practice.
Two-Phase Commit Protocol (2PC)
Let T be a transaction initiated at site Si, and let the transaction coordinator at
Si be Ci.
Phase 1: Ci adds the record <prepare T> to the log and forces the log to stable
storage, then sends a prepare T message to all participating sites. On receiving
the message, each site determines whether it can commit the transaction:
o if not, it adds a record <no T> to the log and sends an abort T message to Ci
o if the transaction can be committed, it adds a record <ready T> to the log,
forces all records for T to stable storage, and sends a ready T message to Ci.
Phase 2: T can be committed if Ci received a ready T message from all the
participating sites; otherwise T must be aborted.
The coordinator adds a decision record, <commit T> or <abort T>, to the log and
forces the record onto stable storage. Once the record reaches stable storage, the
decision is irrevocable (even if failures occur).
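The commit rule of 2PC can be sketched as follows; sites and votes are simulated in-process, whereas a real coordinator would exchange messages and force each log record to stable storage:

```python
# Minimal sketch of the two-phase commit decision rule.  Votes are
# simulated; a real coordinator exchanges messages with the sites and
# forces every log record to stable storage before acting on it.
def two_phase_commit(votes):
    """Phase 1: collect 'ready'/'no' votes from the participating sites.
    Phase 2: commit only if EVERY site voted ready; otherwise abort."""
    log = ["<prepare T>"]                  # coordinator logs this first
    if all(v == "ready" for v in votes.values()):
        decision = "<commit T>"
    else:
        decision = "<abort T>"
    log.append(decision)                   # once logged, irrevocable
    return decision, log

d, _ = two_phase_commit({"S1": "ready", "S2": "ready"})
print(d)  # <commit T>
d, _ = two_phase_commit({"S1": "ready", "S2": "no"})
print(d)  # <abort T>
```

A single "no" vote forces an abort, which is why 2PC blocks when a voted-ready site cannot learn the coordinator's decision.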
HANDLING OF FAILURES
Site Failure
When a participating site Sk recovers, it examines its log to determine the fate
of T:
If the log contains a <ready T> record, the site must consult Ci to determine the
fate of T.
If the log contains no control records concerning T, this implies that Sk failed
before responding to the prepare T message from Ci.
Coordinator Failure
If the coordinator fails while the commit protocol for T is executing, then the
participating sites must decide on T's fate:
o If an active site contains a <commit T> record in its log, then T must be
committed.
o If an active site contains an <abort T> record in its log, then T must be
aborted.
o If some active participating site does not contain a <ready T> record in its
log, then the failed coordinator Ci cannot have decided to commit T. Can
therefore abort T.
o If none of the above cases holds, then all active sites must have a <ready T>
record in their logs, but no additional control records (such as <abort T> or
<commit T>). In this case, the active sites must wait for Ci to recover to find
the decision.
Blocking problem: active sites may have to wait for failed coordinator to recover.
For centralized systems, the primary criterion for measuring the cost of a
particular strategy is the number of disk accesses. In a distributed system, other
issues must be taken into account:
o The cost of data transmission over the network.
o The potential gain in performance from having several sites process parts of
the query in parallel.
Query Transformation
Example:
Consider a relational algebra expression in which the three relations involved are
neither replicated nor fragmented. For a query issued at site SI, the system must
produce the result at site SI.
2. Semijoin Strategy
Let r1 be a relation with schema R1 stored at site S1, and let r2 be a relation
with schema R2 stored at site S2. To evaluate r1 ⋈ r2 and obtain the result at S1,
the semijoin strategy proceeds as follows:
1. Compute temp1 = Π R1 ∩ R2 (r1) at S1.
2. Ship temp1 from S1 to S2.
3. Compute temp2 = r2 ⋈ temp1 at S2.
4. Ship temp2 from S2 to S1.
5. Compute r1 ⋈ temp2 at S1; the result is the same as r1 ⋈ r2.
For joins of several relations, the above strategy can be extended to a series of semijoin
steps.
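The semijoin idea can be traced with tiny in-memory relations; the attribute names and tuples below are assumptions for illustration:

```python
# Sketch of the semijoin strategy: only the join-attribute projection of
# r1 is shipped to S2, r2 is reduced there, and the (usually smaller)
# reduced relation is shipped back.  Attribute "b" is the assumed join
# attribute; relations are lists of dicts.
r1 = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]      # stored at site S1
r2 = [{"b": 10, "c": "x"}, {"b": 30, "c": "y"}]  # stored at site S2

# Step 1 (at S1): temp1 = projection of r1 on the shared attribute(s).
temp1 = {t["b"] for t in r1}
# Step 2: ship temp1 to S2.  Step 3 (at S2): reduce r2 against temp1.
temp2 = [t for t in r2 if t["b"] in temp1]       # r2 semijoin r1
# Step 4: ship temp2 back to S1.  Step 5 (at S1): r1 join temp2 = r1 join r2.
result = [{**x, **y} for x in r1 for y in temp2 if x["b"] == y["b"]]
print(result)  # [{'a': 1, 'b': 10, 'c': 'x'}]
```

Shipping temp1 and temp2 instead of whole relations is the transmission-cost saving that motivates the strategy.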