Chapter 7
Distributed Databases and
Client-Server Architectures
Distributed Database Concepts
 A distributed  database (DDB) is a collection of
  multiple logically related databases distributed over a
  computer network
 A distributed database management system (DDBMS)
  is a software system that manages a distributed database
  while making the distribution transparent to the user
  ◦ A transaction can be executed by multiple networked
     computers in a unified manner
  Distributed Database System
Advantages
 ◦ Management of distributed data with different
   levels of transparency:
   This refers to the physical placement of data (files,
    relations, etc.) which is not known to the user
    (distribution transparency).
   Cont…
 Advantages   (cont…)
  ◦ Distribution and network transparency:
      Users do not have to worry about operational details of the
       network
        Location transparency refers to the freedom to issue a
          command from any location without affecting how it
          works
  ◦ Replication transparency:
      Copies of data can be stored at multiple sites for better
       availability.
      The user is unaware that copies exist
      This also minimizes access time to the required data.
  ◦ Fragmentation transparency:
      A relation can be fragmented horizontally (each fragment is a
       subset of its tuples) or vertically (each fragment is a subset
       of its columns)
      The user is unaware that fragments exist
  Distributed Database
  System(cont…)
Advantages      (cont...)
 ◦ Increased reliability and availability:
    Reliability refers to system lifetime; that is, the system is
     running efficiently most of the time
   Availability is the probability that the system is
    continuously available (usable or accessible) during a
    time interval
   A distributed database system has multiple nodes
    (computers) and if one fails then others are available to
    do the job.
  Distributed Database
  System(cont…)
Other Advantages        (cont…)
 ◦ Improved performance:
    A distributed DBMS fragments the database to keep data
     closer to where it is needed most
    This reduces data management overhead (access and
     modification time) significantly
 ◦ Easier expansion (scalability):
    Refers to expansion of the system in terms of adding
     more data, increasing database sizes or adding more
     processors
  Data Fragmentation, Replication and Allocation
 Data  Fragmentation
  ◦ Split a relation into logically related and correct parts. A
    relation can be fragmented in two ways:
  ◦ Horizontal fragmentation and vertical fragmentation
 Horizontal   fragmentation
  ◦ A horizontal fragment is a subset of the tuples of a relation:
    those tuples that satisfy a selection condition.
  ◦ Consider the Employee relation with selection condition
    (DNO = 5). All tuples that satisfy this condition will create a
    subset which will be a horizontal fragment of Employee
    relation.
  ◦ A selection condition may be composed of several conditions
    connected by AND / OR
  Data Fragmentation, Replication and
  Allocation(cont…)
Vertical fragmentation
  ◦ It is a subset of a relation which is created by a subset of
    columns. Thus a vertical fragment of a relation will
    contain values of selected columns. There is no selection
    condition used in vertical fragmentation.
  ◦ Consider the Employee relation. A vertical fragment of
    Employee can be created by keeping the values of Name,
    Bdate, Sex, and Address.
  ◦ Because there is no condition for creating a vertical
    fragment, each fragment must include the primary key
    attribute of the parent relation Employee.
     In this way all vertical fragments of a relation are
       connected.
Data Fragmentation, Replication and
Allocation(cont…)
 Representing horizontal fragmentation
   ◦ Each horizontal fragment of a relation R can be specified by
     a $\sigma_{C_i}(R)$ operation in the relational algebra
   ◦ Complete horizontal fragmentation
      A set of horizontal fragments whose conditions C1, C2,
       …, Cn include all the tuples in R; that is, every tuple in
       R satisfies
      (C1 OR C2 OR … OR Cn)
   ◦ Disjoint complete horizontal fragmentation: No tuple in R
     satisfies (Ci AND Cj) where i ≠ j
   ◦ To reconstruct R from horizontal fragments a UNION is
     applied
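
The completeness, disjointness, and reconstruction rules above can be checked mechanically. The following is a minimal sketch, assuming a toy Employee relation held as a list of Python dictionaries; the rows, the conditions, and the helper name `select` are illustrative assumptions, not part of the slides.

```python
# Toy Employee relation; the rows and Dno values are made-up sample data.
employee = [
    {"SSN": "111", "Lname": "Smith", "Dno": 5},
    {"SSN": "222", "Lname": "Wong", "Dno": 5},
    {"SSN": "333", "Lname": "Zelaya", "Dno": 4},
    {"SSN": "444", "Lname": "Borg", "Dno": 1},
]

# Selection conditions C1..Cn, one per horizontal fragment (assumed for illustration).
conditions = [
    lambda t: t["Dno"] == 5,
    lambda t: t["Dno"] == 4,
    lambda t: t["Dno"] not in (4, 5),
]

def select(relation, condition):
    """Relational selection: keep the tuples that satisfy the condition."""
    return [t for t in relation if condition(t)]

fragments = [select(employee, c) for c in conditions]

# Complete: every tuple satisfies (C1 OR C2 OR ... OR Cn).
is_complete = all(any(c(t) for c in conditions) for t in employee)

# Disjoint: no tuple satisfies (Ci AND Cj) for i != j.
is_disjoint = all(sum(c(t) for c in conditions) <= 1 for t in employee)

# Reconstruction: UNION of all fragments (duplicates collapse on the key SSN).
reconstructed = {t["SSN"]: t for frag in fragments for t in frag}

print(is_complete, is_disjoint, len(reconstructed) == len(employee))
```

With these sample conditions the fragmentation is both complete and disjoint, so the union returns exactly the original four tuples.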
   Data Fragmentation, Replication and
   Allocation(cont…)
Vertical fragmentation
  ◦ A vertical fragment of a relation R can be specified by a
    $\pi_{L_i}(R)$ operation in the relational algebra.
  ◦ Complete vertical fragmentation
     A set of vertical fragments whose projection lists L1, L2, …,
      Ln include all the attributes in R but share only the primary
      key of R. In this case the projection lists satisfy the following
      two conditions:
       $L_1 \cup L_2 \cup \dots \cup L_n = \mathrm{ATTRS}(R)$
       $L_i \cap L_j = \mathrm{PK}(R)$ for any $i \neq j$, where ATTRS(R) is the set of
        attributes of R and PK(R) is the primary key of R.
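
The two conditions on the projection lists can be tested directly on attribute sets. This is a minimal sketch under assumptions: the sample rows, the particular split into two projection lists, and the helper name `project` are hypothetical. Rebuilding the relation by matching fragments on the shared primary key is the usual approach but is not spelled out on the slide.

```python
from itertools import combinations

# Toy Employee relation; values are made-up sample data.
employee = [
    {"SSN": "111", "Name": "Smith", "Bdate": "1965-01-09", "Salary": 30000, "Dno": 5},
    {"SSN": "222", "Name": "Wong", "Bdate": "1955-12-08", "Salary": 40000, "Dno": 5},
]
PK = {"SSN"}                # primary key of R
ATTRS = set(employee[0])    # all attributes of R

# Projection lists L1, L2 (assumed split: personal data vs. work data).
projection_lists = [
    {"SSN", "Name", "Bdate"},
    {"SSN", "Salary", "Dno"},
]

def project(relation, attrs):
    """Relational projection: keep only the listed attributes of each tuple."""
    return [{a: t[a] for a in attrs} for t in relation]

fragments = [project(employee, attrs) for attrs in projection_lists]

# Condition 1: the union of the projection lists equals ATTRS(R).
covers_all = set().union(*projection_lists) == ATTRS

# Condition 2: any two projection lists share exactly PK(R).
shares_only_pk = all(a & b == PK for a, b in combinations(projection_lists, 2))

# Reconstruction: merge fragment tuples that carry the same primary key value.
rebuilt = {}
for frag in fragments:
    for t in frag:
        rebuilt.setdefault(t["SSN"], {}).update(t)
reconstructed = list(rebuilt.values())

print(covers_all, shares_only_pk, reconstructed == employee)
```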
Data Fragmentation, Replication and
Allocation(cont…)
Mixed    (Hybrid) fragmentation
  ◦ A combination of Vertical fragmentation and
    Horizontal fragmentation
  ◦ This is achieved by SELECT-PROJECT operations,
    represented by $\pi_{L_i}(\sigma_{C_i}(R))$
 Data Fragmentation, Replication and
 Allocation(cont…)
Data Replication
 Replication refers to storing copies of all or part of the
  data at a number of sites
  ◦ Useful in improving availability of data
  ◦ Improves performance of global queries, since the result of
    such a query can be obtained from any one site
  ◦ In full replication, the entire database is replicated and in
    partial replication some selected part is replicated to some
    of the sites
  ◦ The disadvantage of full replication is that it can slow
    down update operations, since a single logical update must
    be performed on every copy of the database to keep the
    copies consistent
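
The trade-off between cheap reads and expensive updates can be made concrete with a small model. This is a sketch only, assuming a simple "read one copy, write all copies" scheme; the class name `ReplicatedItem` and the site names are hypothetical.

```python
class ReplicatedItem:
    """One logical data item with a copy at each listed site (hypothetical model)."""

    def __init__(self, sites, value):
        self.copies = {site: value for site in sites}

    def read(self, site):
        # A read is served by a single site that holds a copy.
        return self.copies[site]

    def write(self, value):
        # A single logical update must be applied to every copy.
        for site in self.copies:
            self.copies[site] = value
        return len(self.copies)  # number of physical updates performed


fully_replicated = ReplicatedItem(["site1", "site2", "site3", "site4", "site5"], 10)
partially_replicated = ReplicatedItem(["site1", "site2"], 10)

print(fully_replicated.write(11))      # physical updates under full replication
print(partially_replicated.write(11))  # physical updates under partial replication
print(fully_replicated.read("site4"))  # any copy returns the current value
```

The logical update is the same in both cases, but the fully replicated item needs five physical writes where the partially replicated one needs two.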
      Types of Distributed Database Systems
 Homogeneous
  ◦ All sites of the database system have an identical setup,
    i.e., the same database system software.
     For example, all sites run Oracle, or DB2, or Sybase, or
      some other but the same database system software.
  ◦ The underlying operating systems may be different (can
    be a mixture of Linux, Windows, Unix, etc.)
  [Figure: five sites, all running Oracle (Site 1 on Unix, Sites 2 and 3 on Linux, Sites 4 and 5 on Windows), connected by a communications network]
    Types of Distributed Database Systems
   Heterogeneous
    ◦ Federated: Each site may run a different database system, but data
      access is managed through a single conceptual schema.
       This implies that the degree of local autonomy is minimal. Each site
        must adhere to a centralized access policy. There may be a global
        schema.
    ◦ Multidatabase: There is no single conceptual global schema. For data
      access, a schema is constructed dynamically as needed by the
      application software.
    [Figure: five sites on Unix, Windows, and Linux, running relational, object-oriented, hierarchical, and network DBMSs, connected by a communications network]
Types of Distributed Database Systems
Federated      Database Management Systems
  Issues
  ◦ Differences in data models:
     Relational, object-oriented, hierarchical, network,
      etc.
  ◦ Differences in constraints:
     Each site may have its own data access and
      processing constraints.
  ◦ Differences in query language:
     Sites may use different versions of the same language,
      e.g., SQL-89 at some sites and SQL-92 at others.
  Query Processing in Distributed Databases
Issues
 ◦ Cost of transferring data (files and results) over
   the network.
    This cost is usually high. So, some optimization is
     necessary.
    Example: Employee at site 1 and Department at site 2
      Employee at site 1: 10,000 rows, row size = 100 bytes.
        This means table size = 10,000 × 100 = $10^6$ bytes.
      Department at site 2: 100 rows, row size = 35 bytes.
        This means table size = 100 × 35 = 3,500 bytes.
Query Processing in Distributed Databases (cont…)
      Issues(cont…)
       ◦ Cost of transferring data (files and results) over the
         network
       ◦ Example
          Q: For each employee, retrieve the employee name and the
           name of the department where the employee works.
          Q: $\pi_{Fname, Lname, Dname}(Employee \bowtie_{Dno = Dnumber} Department)$
          Employee(Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno)
          Department(Dname, Dnumber, Mgrssn, Mgrstartdate)
Query Processing in Distributed Databases (cont…)
Result
 ◦ If every employee is related to a department, the result
   of this query will have 10,000 tuples
 ◦ Suppose that each result tuple is 40 bytes long. The
   query is submitted at site 3 and the result is sent to this
   site
 ◦ Suppose that Employee and Department relations are
   not present at site 3
 [Figure: Employee is stored at Site 1 and Department at Site 2; the query is submitted at Site 3]
Query Processing in Distributed Databases
(cont…)
  Strategies   (Available options):
    1. Transfer Employee and Department to site 3.
         Total transfer bytes = 1,000,000 + 3500 = 1,003,500 bytes.
    2. Transfer Employee to site 2, execute join at site 2 and
       send the result to site 3.
         Query result size = 40 * 10,000 = 400,000 bytes.
         Total transfer size = 1,000,000 + 400,000 = 1,400,000
          bytes.
    3. Transfer Department relation to site 1, execute the join at
       site 1, and send the result to site 3
         Total bytes transferred = 3500 + 400,000 = 403,500 bytes.
  Optimization    criteria: minimizing data transfer.
    ◦ Preferred strategy: strategy 3.
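
The arithmetic behind the three strategies can be reproduced in a few lines. This is a minimal sketch using only the sizes given on the slides; the function name `strategy_costs` and the constant names are assumptions introduced for illustration.

```python
# Sizes taken from the example: 10,000 rows of 100 bytes and 100 rows of 35 bytes.
EMPLOYEE_BYTES = 10_000 * 100      # 1,000,000 bytes at site 1
DEPARTMENT_BYTES = 100 * 35        # 3,500 bytes at site 2
RESULT_TUPLE_BYTES = 40            # assumed size of one result tuple

def strategy_costs(result_tuples):
    """Bytes shipped over the network for each strategy when the result site is site 3."""
    result_bytes = result_tuples * RESULT_TUPLE_BYTES
    return {
        1: EMPLOYEE_BYTES + DEPARTMENT_BYTES,  # ship both relations to site 3
        2: EMPLOYEE_BYTES + result_bytes,      # join at site 2, ship result to site 3
        3: DEPARTMENT_BYTES + result_bytes,    # join at site 1, ship result to site 3
    }

costs_q = strategy_costs(result_tuples=10_000)
best = min(costs_q, key=costs_q.get)
print(costs_q)
print("preferred strategy:", best)
```

Because the cost model only counts bytes transferred, the choice is driven by which relation has to move: shipping the small Department relation is far cheaper than shipping Employee.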
Query Processing in Distributed Databases (cont…)
   Consider the query
    ◦ Q’: For each department, retrieve the department
      name and the name of the department manager
   Relational Algebra expression:
    ◦ $\pi_{Fname, Lname, Dname}(Employee \bowtie_{Mgrssn = SSN} Department)$
   Assuming that every department has a
    manager, the result of this query will have
    100 tuples
Query Processing in Distributed Databases (cont…)
      Execution strategies:
       1. Transfer Employee and Department to the result site and
          perform the join at site 3.
              Total bytes transferred = 1,000,000 + 3500 = 1,003,500 bytes
       2. Transfer Employee to site 2, execute join at site 2 and send
          the result to site 3.
              Query result size = 40 * 100 = 4000 bytes.
              Total transfer size = 1,000,000 +4000 = 1,004,000 bytes.
       3. Transfer Department relation to site 1, execute join at site 1
          and send the result to site 3.
              Total transfer size = 3500 + 4000 = 7500 bytes.
      Preferred strategy: Choose strategy 3.
             Query Processing in Distributed
             Databases (cont…)
   Now suppose the result site is 2.
   Possible strategies :
    1.   Transfer Employee relation to site 2, execute the query and
         present the result to the user at site 2
            Total transfer size = 1,000,000 bytes for both queries Q and Q’.
    2.   Transfer Department relation to site 1, execute join at site 1 and
         send the result back to site 2
            Total transfer size for Q:
                3500 +400,000 = 403,500 bytes
            Total transfer size for Q’:
                3500 +4000 = 7500 bytes
    Query Processing in Distributed Databases
   Semijoin:
    ◦ Objective is to reduce the number of tuples in a relation before
       transferring it to another site.
   Example execution of Q or Q’:
    1. Project the join attributes of Department at site 2, and transfer them
       to site 1.
           Assume size of Dnumber = 4 bytes and size of Mgrssn = 9 bytes
           Assume size of Fname and Lname is 15 bytes each
           For Q, 4 * 100 = 400 bytes are transferred and for Q’, 9 * 100 =
            900 bytes are transferred
    2. Join the transferred file with the Employee relation at site 1, and
         transfer the required attributes from the resulting file to site 2.
           For Q, 34 * 10,000 = 340,000 bytes are transferred and
           For Q’, 39 * 100 = 3900 bytes are transferred
    3. Execute the query by joining the transferred file with Department and
        present the result to the user at site 2.
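
The semijoin steps above can be sketched both as a cost calculation and as an operation on data. The attribute sizes come from the slide; the function names `semijoin_cost` and `semijoin` and the sample rows are assumptions for illustration.

```python
NAME_BYTES = 15            # size of Fname and of Lname
DEPARTMENT_ROWS = 100

def semijoin_cost(join_attr_bytes, matching_employee_rows):
    """Bytes shipped in steps 1 and 2 of the semijoin strategy, and their total."""
    # Step 1: ship the projected join attribute of Department to site 1.
    step1 = join_attr_bytes * DEPARTMENT_ROWS
    # Step 2: ship the join attribute plus Fname and Lname of each matching Employee row.
    step2 = (join_attr_bytes + 2 * NAME_BYTES) * matching_employee_rows
    return step1, step2, step1 + step2

cost_q = semijoin_cost(join_attr_bytes=4, matching_employee_rows=10_000)   # Dnumber
cost_q_prime = semijoin_cost(join_attr_bytes=9, matching_employee_rows=100)  # Mgrssn
print(cost_q, cost_q_prime)

def semijoin(relation, attr, values):
    """Keep only the tuples of `relation` whose `attr` value appears in `values`."""
    return [t for t in relation if t[attr] in values]

# Made-up sample rows to show the reduction for Q'.
department = [{"Dname": "Research", "Dnumber": 5, "Mgrssn": "333"}]
employee = [
    {"Fname": "John", "Lname": "Smith", "SSN": "111", "Dno": 5},
    {"Fname": "Franklin", "Lname": "Wong", "SSN": "333", "Dno": 5},
]
mgr_ssns = {d["Mgrssn"] for d in department}     # step 1, computed at site 2
reduced = semijoin(employee, "SSN", mgr_ssns)    # step 2, computed at site 1
print(reduced)
```

Under these assumptions the totals are 340,400 bytes for Q and 4,800 bytes for Q’, compared with 403,500 and 7,500 bytes when the whole Department relation is shipped to site 1 and the result is sent back.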
Concurrency Control and Recovery
Distributed  Databases encounter a
 number of concurrency control and
 recovery problems which are not present
 in centralized databases.
Some of these problems are listed below:
 ◦   Dealing with multiple copies of data items
 ◦   Failure of individual sites
 ◦   Communication link failure
 ◦   Distributed commit
 ◦   Distributed deadlock
Concurrency Control and Recovery (cont…)
Details
  ◦ Dealing with multiple copies of data items:
     The concurrency control must maintain global
      consistency
     Likewise, the recovery mechanism must recover all
      copies and maintain consistency after recovery
  ◦ Failure of individual sites:
     Database availability must not be affected due to the
      failure of one or two sites and the recovery scheme
      must recover them before they are available for use
  Concurrency Control and Recovery (cont…)
 (Details….)
 Communication    link failure:
  ◦ This failure may create network partition which would affect
    database availability even though all database sites may be
    running.
 Distributed commit:
  ◦ A transaction may be split into parts that are executed
    by a number of sites. This requires a two-phase commit
    approach for transaction commit.
 Distributed deadlock:
  ◦ Since transactions are processed at multiple sites, two or
    more sites may get involved in deadlock. This must be
    resolved in a distributed manner.
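
The commit approach for distributed transactions mentioned above is commonly known as two-phase commit. The following is a deliberately simplified sketch of its decision rule; it assumes each participating site is modeled as a function that reports whether it is ready to commit, and it leaves out logging, time-outs, and failure handling. The name `two_phase_commit` and the site names are hypothetical.

```python
def two_phase_commit(participants):
    """participants: mapping of site name -> function returning True if ready to commit."""
    # Phase 1 (voting): the coordinator asks every site whether it can commit.
    votes = {site: can_commit() for site, can_commit in participants.items()}
    # Phase 2 (decision): commit only if every site voted yes, otherwise abort everywhere.
    decision = "commit" if all(votes.values()) else "abort"
    return decision, votes

all_ready = {"site1": lambda: True, "site2": lambda: True}
one_fails = {"site1": lambda: True, "site2": lambda: False}

print(two_phase_commit(all_ready)[0])
print(two_phase_commit(one_fails)[0])
```

The point of the two phases is that no site commits its part of the transaction until all sites have confirmed they are able to.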
    Concurrency Control and Recovery (cont…)
 Distributed Concurrency control based on distinguished
  copy of a data item
  ◦ Primary site technique: A single site is assigned as a
    primary site which serves as a coordinator for
    transaction management.
  [Figure: Sites 1 to 5 connected by a communications network, with Site 5 labeled as the primary site]
Concurrency Control and Recovery
Transaction    management:
 ◦ Concurrency control and commit are managed
   by this site
 ◦ All locks are kept at that site and all requests for
   locking or unlocking are sent there
 ◦ In two phase locking, this site manages locking
   and releasing of data items
 ◦ If all transactions follow two-phase policy at
   all sites, then serializability is guaranteed
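
A central lock table is the core of the primary site technique. This is a minimal sketch, assuming exclusive locks only and no waiting queue; the class name `PrimarySiteLockManager` and the transaction identifiers are hypothetical.

```python
class PrimarySiteLockManager:
    """Lock table kept at the primary site (exclusive locks only, for brevity)."""

    def __init__(self):
        self.locks = {}  # data item -> id of the transaction holding the lock

    def lock(self, txn, item):
        holder = self.locks.get(item)
        if holder is None or holder == txn:
            self.locks[item] = txn
            return True
        return False  # the item is held by another transaction; caller must wait

    def unlock(self, txn, item):
        # Only the current holder may release the lock.
        if self.locks.get(item) == txn:
            del self.locks[item]


manager = PrimarySiteLockManager()
print(manager.lock("T1", "X"))   # T1 obtains the lock on X
print(manager.lock("T2", "X"))   # T2 is refused while T1 holds X
manager.unlock("T1", "X")
print(manager.lock("T2", "X"))   # T2 succeeds after the release
```

Every lock and unlock request from every site goes through this one object, which is what makes the scheme simple and also what makes the primary site a potential bottleneck.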
Concurrency Control and Recovery (cont…)
 Advantages:
     It is an extension of centralized two-phase locking and
      hence simple to implement and manage
     Data items are locked only at one site but they can be
      accessed at any site at which they reside
 Disadvantages:
     All transaction management activities go to the primary site,
      which is likely to overload that site.
     If the primary site fails, the entire system is inaccessible
        To aid recovery, a backup site is designated which
         behaves as a shadow of the primary site.
        In case of primary site failure, the backup site can act as
         the primary site.
   Concurrency Control and Recovery (cont…)
 Primary   Copy Technique:
  ◦ In this approach, instead of a site, a data item partition is
    designated as primary copy
       Load of lock coordination is distributed among the
        various sites
       To lock a data item, just the primary copy of the data
        item is locked
  ◦ Advantages:
     Since primary copies are distributed at various sites, a
      single site is not overloaded with locking and unlocking
      requests
  ◦ Disadvantages:
     Identification of a primary copy is complex...
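
How the primary copy technique spreads lock coordination across sites can be shown with a lookup table. This is a sketch under assumptions: the catalog `primary_site_of`, the site names, and the function `lock` are hypothetical, and only exclusive locks are modeled.

```python
# Hypothetical catalog: which site holds the primary copy of each data item.
primary_site_of = {"X": "site1", "Y": "site2", "Z": "site3"}

# Each site keeps a lock table only for the items whose primary copy it holds.
lock_tables = {site: {} for site in set(primary_site_of.values())}

def lock(txn, item):
    """Send the lock request to the site holding the primary copy of `item`."""
    site = primary_site_of[item]
    table = lock_tables[site]
    if table.get(item, txn) != txn:
        return site, False   # another transaction holds the primary copy's lock
    table[item] = txn
    return site, True

print(lock("T1", "X"))   # handled by site1
print(lock("T2", "X"))   # refused by site1
print(lock("T2", "Y"))   # handled by site2, independent of site1
```

Requests for X and Y are coordinated by different sites, so no single site has to process all locking traffic. The catalog lookup is the extra step that makes identifying the primary copy more involved than in the primary site technique.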
Concurrency Control and Recovery
 Recovery   from a coordinator failure
  ◦ In both approaches, a coordinator site or copy may become
    unavailable. This will require the selection of a new
    coordinator.
 Primary   site approach with no backup site:
  ◦ Aborts and restarts all active transactions at all sites. Elects
    a new coordinator and initiates transaction processing.
 Primary   site approach with backup site:
  ◦ Suspends all active transactions, designates the backup site
    as the primary site and identifies a new backup site.
    The new primary site receives all transaction management
    information to resume processing.
 Primary   and backup sites fail or no backup site:
  ◦ Use election process to select a new coordinator site.
Concurrency Control and Recovery
 Concurrency      control based on voting:
  ◦   In a voting method, a lock request is sent to all the
      sites that have the copy of the data item
  ◦   Each copy maintains its own lock and can grant or
      deny request for it
  ◦   If a majority of sites grant the lock, the requesting
      transaction gets the data item and informs all copies
      that it has been granted the lock
  ◦   To avoid an unacceptably long wait, a time-out period is
      defined. If the requesting transaction does not get
      any vote information within this period, the transaction is aborted.
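
The majority rule of the voting method can be sketched as follows. This is an illustrative model only: each copy's lock state is a dictionary entry, votes are collected synchronously, and the time-out is not modeled. The function name `vote_for_lock` and the site names are hypothetical.

```python
def vote_for_lock(txn, lock_holders):
    """lock_holders: site -> id of the transaction holding that copy's lock, or None.

    Returns True if a majority of the copies grant the lock to `txn`.
    """
    # Each copy votes yes if its own lock is free (or already held by txn).
    grants = [site for site, holder in lock_holders.items() if holder in (None, txn)]
    if len(grants) > len(lock_holders) / 2:
        # Majority reached: the granting copies record txn as the lock holder.
        for site in grants:
            lock_holders[site] = txn
        return True
    return False  # no majority; in the scheme above the request would time out


copies = {"site1": None, "site2": None, "site3": "T9"}
print(vote_for_lock("T1", copies))   # two of three copies grant the lock
print(vote_for_lock("T2", copies))   # no copy is free any more
```

Because any two majorities overlap in at least one copy, two transactions cannot both obtain a majority for the same data item at the same time.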
       Client-Server Database Architecture
 It consists of clients running client software, a set of
  servers which provide all database functionalities and a
  reliable communication infrastructure.
  [Figure: Server 1, Server 2, …, Server n and Client 1, Client 2, Client 3, …, Client n]
   Client-Server Database Architecture
 Server:  is responsible for local data management at a
  site, much like centralized DBMS software
 Client: is responsible for most of the distribution
  function; it accesses data distribution information from
  the DBMS catalog and processes all requests that require
  access to more than one site
 The communication software manages communication
  among clients and servers
Client-Server Database Architecture
The processing of an SQL query proceeds as
 follows:
 ◦ The client parses a user query and decomposes it
   into a number of independent sub-queries, and each
   sub-query is sent to the appropriate server.
 ◦ Each server processes its sub-query and sends the
   result to the client.
 ◦ The client combines the results of the sub-queries
   and produces the final result.
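
The three steps above can be sketched with in-memory data standing in for the servers. This is a minimal illustration, assuming each server holds a horizontal fragment of the same relation and the query is a simple selection; the names `servers`, `run_on_server`, and `client_query` and the sample rows are hypothetical.

```python
# Hypothetical servers, each holding part of the Employee data.
servers = {
    "server1": [{"Lname": "Smith", "Dno": 5}, {"Lname": "Borg", "Dno": 1}],
    "server2": [{"Lname": "Wong", "Dno": 5}],
}

def run_on_server(rows, predicate):
    """Server side: evaluate the sub-query against local data only."""
    return [r for r in rows if predicate(r)]

def client_query(predicate):
    # Steps 1 and 2: one independent sub-query per server, each evaluated locally.
    partial_results = [run_on_server(rows, predicate) for rows in servers.values()]
    # Step 3: the client combines the partial results into the final result.
    return [row for part in partial_results for row in part]

result = client_query(lambda r: r["Dno"] == 5)
print(result)
```

The servers never communicate with each other here; all knowledge of how the data is distributed sits with the client, as described above.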