Distributed Database Management Systems: Week-4
Distribution Design Issues
1. Why fragment at all?
2. How to fragment?
3. How much to fragment?
4. How to test correctness?
5. How to allocate?
6. Information requirements?
Fragmentation
Can't we just distribute relations?
• What is a reasonable unit of distribution?
➡ relation
✦ application views are subsets of relations -> locality
✦ natural to consider it
➡ fragments of relations (sub-relations)
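✦ increases concurrency: transactions that access different fragments can run in parallel
✦ a single view may depend on multiple fragments (extra join cost)
✦ integrity checking becomes more difficult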
Fragmentation Alternatives – Horizontal
Fragmentation Alternatives – Vertical
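The two alternatives can be illustrated with a small sketch. It assumes a hypothetical PROJ relation (attribute names and values are invented for illustration) and treats horizontal fragmentation as selection and vertical fragmentation as projection that keeps the key, so that the original relation can be rebuilt by union and by join respectively.

```python
# Hypothetical PROJ relation; attribute names and values are illustrative only.
PROJ = [
    {"pno": "P1", "pname": "Instrumentation", "budget": 150000, "loc": "Montreal"},
    {"pno": "P2", "pname": "Database Develop.", "budget": 135000, "loc": "New York"},
    {"pno": "P3", "pname": "CAD/CAM", "budget": 250000, "loc": "New York"},
    {"pno": "P4", "pname": "Maintenance", "budget": 310000, "loc": "Paris"},
]

def horizontal_fragment(relation, predicate):
    """Selection: keep whole tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def vertical_fragment(relation, attributes, key="pno"):
    """Projection: keep a subset of attributes, always including the key."""
    cols = [key] + [a for a in attributes if a != key]
    return [{c: row[c] for c in cols} for row in relation]

def reconstruct_horizontal(*fragments):
    """Union of horizontal fragments."""
    return [row for fragment in fragments for row in fragment]

def reconstruct_vertical(left, right, key="pno"):
    """Join of two vertical fragments on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]

# Horizontal fragments: split the tuples by budget (an assumed predicate).
proj1 = horizontal_fragment(PROJ, lambda r: r["budget"] < 200000)
proj2 = horizontal_fragment(PROJ, lambda r: r["budget"] >= 200000)

# Vertical fragments: split the attributes, repeating the key in both.
proj_a = vertical_fragment(PROJ, ["budget"])
proj_b = vertical_fragment(PROJ, ["pname", "loc"])
```

Repeating the key in every vertical fragment is what makes the join-based reconstruction possible; the horizontal predicates are chosen to be disjoint and to cover every tuple.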
Degree of Fragmentation
• Access frequency
• frequency with which user applications access data. If Q = {q1, …, qn} is a set of
user queries, acc(qi) indicates the access frequency of query qi in a given
period.
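As a minimal sketch of how acc(qi) might be obtained, the following counts query executions in a hypothetical log covering one observation period. The log format and the query identifiers are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical query log for one observation period: each entry is the
# identifier of a query that was executed.
query_log = ["q1", "q2", "q1", "q3", "q1", "q2"]

def access_frequencies(log):
    """Return acc(qi): the number of times each query ran in the period."""
    return Counter(log)

acc = access_frequencies(query_log)
print(acc["q1"], acc["q2"], acc["q3"])  # prints: 3 2 1
```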
Desirable properties of simple predicates
• The set should be complete.
• Informally, the set should include only predicates with attributes and
conditions that are used in the applications
Completeness
• A set of simple predicates Pr is said to be complete if and only if there is an
equal probability of access by every application to any tuple belonging to any
minterm fragment that is defined according to Pr.
• Case 1: The only application that accesses J wants to access the tuples
according to the location (any location).
• Pr is not complete, since some tuples in JPi have a higher access
probability than others
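The completeness test can be sketched under a simplified model: each application query is a predicate with an access frequency, a tuple's access weight is the sum of the frequencies of the queries that select it, and Pr is treated as complete when all tuples inside each fragment have the same weight. The relation contents, the three location fragments JP1 to JP3, and the second budget-based application are assumptions added for illustration; they are not taken from the slides.

```python
# Hypothetical relation J; attribute names and values are illustrative only.
J = [
    {"jno": "J1", "budget": 150000, "loc": "Montreal"},
    {"jno": "J2", "budget": 135000, "loc": "New York"},
    {"jno": "J3", "budget": 250000, "loc": "New York"},
    {"jno": "J4", "budget": 310000, "loc": "Paris"},
]

# Pr: simple predicates on location. They are mutually exclusive, so each
# one directly defines a minterm fragment JPi.
Pr = {
    "JP1": lambda t: t["loc"] == "Montreal",
    "JP2": lambda t: t["loc"] == "New York",
    "JP3": lambda t: t["loc"] == "Paris",
}

def access_weight(tuple_, queries):
    """Sum the frequencies of all queries whose predicate selects the tuple."""
    return sum(freq for pred, freq in queries if pred(tuple_))

def is_complete(relation, fragments, queries):
    """True if, within every fragment, all tuples have equal access weight."""
    for frag_pred in fragments.values():
        weights = {access_weight(t, queries) for t in relation if frag_pred(t)}
        if len(weights) > 1:
            return False
    return True

# Case 1: the only application accesses tuples by location.
by_location = [
    (lambda t, c=city: t["loc"] == c, 10)
    for city in ("Montreal", "New York", "Paris")
]

# Assumed second application: selects tuples with a small budget.
by_budget = [(lambda t: t["budget"] <= 200000, 5)]

case1 = is_complete(J, Pr, by_location)              # True
case2 = is_complete(J, Pr, by_location + by_budget)  # False
```

With the budget-based application added, tuples J2 and J3 share fragment JP2 but no longer have the same access weight, which is the situation in which Pr is not complete.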
• Replication
• System maintains multiple copies of data, stored in different sites, for
faster retrieval and fault tolerance.
Data Replication
• A relation or fragment of a relation is replicated if it is stored
redundantly in two or more sites.
• Full replication of a relation is the case where the relation is
stored at all sites.
• Fully redundant databases are those in which every site
contains a copy of the entire database.
Data Replication (Cont.)
• Advantages of Replication
• Availability: failure of a site containing relation r does not result in unavailability of r
if replicas exist.
• Parallelism: queries on r may be processed by several nodes in parallel.
• Reduced data transfer: relation r is available locally at each site containing a replica
of r.
• Disadvantages of Replication
• Increased cost of updates: each replica of relation r must be updated.
• Increased complexity of concurrency control: concurrent updates to distinct
replicas may lead to inconsistent data unless special concurrency control
mechanisms are implemented.
• One solution: choose one copy as the primary copy and apply concurrency control
operations on the primary copy
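The primary-copy idea can be sketched as an in-process simulation. The class name, the site names, and the eager propagation of each write to all replicas are assumptions made for illustration; a real system would exchange messages between sites and handle site failures.

```python
import threading

class ReplicatedItem:
    """One data item with a copy per site and a designated primary copy."""

    def __init__(self, sites, primary, value=None):
        if primary not in sites:
            raise ValueError("primary must be one of the sites")
        self.primary = primary
        self.copies = {site: value for site in sites}
        # Concurrency control is applied only at the primary copy.
        self._primary_lock = threading.Lock()

    def read(self, site):
        """Read the local copy stored at the given site."""
        return self.copies[site]

    def write(self, value):
        """Serialize writers at the primary, then update every replica."""
        with self._primary_lock:
            self.copies[self.primary] = value
            for site in self.copies:
                if site != self.primary:
                    self.copies[site] = value  # assumed eager propagation

item = ReplicatedItem(["S1", "S2", "S3"], primary="S1", value=0)
item.write(42)
```

Because every writer must take the lock held at the primary copy, two concurrent updates cannot be applied to distinct replicas in different orders, at the price of making the primary site a potential bottleneck.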
Data Transparency
• Data transparency: the degree to which a system user may remain unaware
of the details of how and where the data items are stored in a
distributed system
• Consider transparency issues in relation to:
• Fragmentation transparency
• Replication transparency
• Location transparency
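One way to picture the three kinds of transparency is a catalog that hides fragments, replicas, and sites behind a relation name. The sketch below is hypothetical: the catalog layout, the fragment and site names, and the function plan_read are assumptions, not part of any particular system.

```python
# Hypothetical catalog: relation name -> fragment name -> sites holding a replica.
CATALOG = {
    "PROJ": {
        "PROJ1": ["Montreal", "Paris"],
        "PROJ2": ["New York"],
    }
}

def plan_read(relation, local_site, catalog=CATALOG):
    """Choose a site for each fragment, preferring a local replica.

    The caller names only the relation; fragments (fragmentation
    transparency), replicas (replication transparency), and sites
    (location transparency) are resolved here.
    """
    plan = {}
    for fragment, sites in catalog[relation].items():
        plan[fragment] = local_site if local_site in sites else sites[0]
    return plan

plan = plan_read("PROJ", local_site="Paris")
```

A user at the Paris site would read PROJ1 from the local replica and PROJ2 from New York without mentioning either fragment or site in the query.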
Naming of Data Items - Criteria