Unit 2 DDMS
A distributed database is a database that is spread across numerous sites, i.e., across various computers or a network of computers, rather than being restricted to a single system. A distributed database system therefore consists of distinct physical components at several locations. This arrangement is often necessary when people in different places need to access the same database. It must be managed so that, to users, it appears to be a single database.
Types of DDMS:
Distributed databases can be broadly classified into homogeneous and heterogeneous distributed database environments, each with further sub-divisions.
Homogeneous Distributed Database
In a homogeneous distributed database, all the sites use identical DBMS and operating systems. Its
properties are −
Autonomous − Each database is independent and functions on its own. The databases are integrated by a controlling application and use message passing to share data updates.
Non-autonomous − Data is distributed across the homogeneous nodes, and a central or master DBMS coordinates data updates across the sites.
Heterogeneous Distributed Database
In a heterogeneous distributed database, different sites have different operating systems, DBMS products and data models.
Data may be stored at several sites in two ways using distributed data storage:
1. Replication − With this strategy, the entire relation is stored redundantly at two or more sites. If the whole database is available at every site, it is a fully redundant database. Because of replication, the system maintains several copies of the data. This has advantages, since more data is available locally at many sites and query requests can be processed in parallel. There are drawbacks as well: the data must be constantly kept up to date, and every change made at one site must be applied at every site where that relation is stored, otherwise the copies become inconsistent. This creates a lot of overhead. Moreover, concurrency control becomes far more complicated, since concurrent access must now be coordinated across several sites (a small sketch of this propagation overhead follows this list).
2. Fragmentation − In this approach, the relation is divided into smaller fragments, and each fragment is stored at the sites where it is needed. The fragments must be defined so that the original relation can be reconstructed from them, ensuring no data is lost. Since fragmentation does not duplicate data, consistency is not a concern.
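As a rough illustration of the update overhead that replication introduces, the following Python sketch propagates every write to all sites that hold a copy of the relation. The site names and the in-memory dictionaries are assumptions made purely for illustration.

# Minimal sketch of full replication: every site keeps its own copy of the
# "accounts" relation, and every update must be applied at every site.
sites = {
    "site_A": {"accounts": {}},
    "site_B": {"accounts": {}},
    "site_C": {"accounts": {}},
}

def replicated_write(relation, key, value):
    """Apply one update at every site holding a copy of the relation."""
    for site in sites.values():
        site[relation][key] = value   # the same change is repeated per site

def read_any(relation, key):
    """A read can be served by whichever copy is closest or available."""
    return sites["site_A"][relation].get(key)

replicated_write("accounts", "acc-1", 300)
print(read_any("accounts", "acc-1"))   # 300, readable at any site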
The databases in a distributed collection are logically interrelated and frequently form a single logical database. Data is physically stored across several sites and managed independently at each site. The processors at the sites are connected to one another through a network, but they are not set up for multiprocessing.
A widespread misunderstanding is that a distributed database is equivalent to a loosely coupled file system. In reality it is considerably more complex: although distributed databases rely on transaction processing, they are not the same thing as transaction processing systems. A distributed database is expected to provide the following kinds of independence and transparency:
o Location independence
o Hardware independence
o Transaction transparency
o DBMS independence
All of the physical sites in a homogeneous distributed database system use the same operating system and
database software, as well as the same underlying hardware. Homogeneous distributed database systems can be significantly simpler to build and administer, since they appear to the user as a single system. The data structures at each site must either be the same or compatible for a distributed database system to be considered homogeneous. Also, the database software used at each site must be the same or compatible.
The hardware, operating systems, or database software at each site may vary in a heterogeneous distributed
database. Although separate sites may employ various technologies and schemas, a variation in schema
might make query and transaction processing challenging.
Various nodes may have dissimilar hardware, software, and data structures, or their schemas may be incompatible with one another. Users may be able to access data stored at another site but not upload or modify it.
Because heterogeneous distributed databases are sometimes challenging to use, many organizations find
them to be economically unviable.
o As distributed databases provide modular development, systems may be enlarged by putting new
computers and local data in a new location and seamlessly connecting them to the distributed system.
o With centralized databases, failures result in a total shutdown of the system. Distributed database
systems, however, continue to operate with lower performance when a component fails until the
issue is resolved.
o If the data is kept close to where it is most often used, administrators can reduce transmission costs in distributed database systems. Centralized systems cannot accommodate this.
Types of Distributed Database
o Data instances are created in various areas of the database using replicated data. Distributed
databases may access identical data locally by utilizing duplicated data, which reduces bandwidth.
Read-only and writable data are the two types of replicated data that may be distinguished.
o In read-only versions, only the initial instance of the replicated data can be changed; all subsequent replications of the data are then updated from it. Writable data can be modified, but only the initial occurrence is affected.
o Primary keys that point to a single database record are used to identify horizontally fragmented data.
Horizontal fragmentation is typically used when business locations only want access to the database
for their own branch.
o Vertically fragmented data is organized using copies of the primary keys that are accessible to each branch of the database. Vertically fragmented data is used when a company's branch and central location deal with the same accounts in different ways.
o Data that has been edited or modified for decision support databases is referred to as reorganised data. Reorganised data is generally used when two distinct systems are handling transactions and decision support, since online transaction processing systems must be reorganised when there are numerous requests and decision support systems can be difficult to manage.
o In order to accommodate various departments and circumstances, separate schema data divides the database and the software used to access it. There is often overlap between the individual databases in separate schema data.
A distributed database can be described as data that is transferred to and stored at various computers or locations connected over a network. It may alternatively be described as a database that gathers information from several databases hosted on independent computers linked by data communication connections. Compared to centralized database systems, distributed databases can offer higher availability and dependability, because the system can continue to function even if one or more sites go down. A distributed database system can also perform more effectively by distributing the workload and data across several sites.
Design Considerations for Distributed Databases
Data Partitioning
In data partitioning, the database is split up into smaller units, or fragments, and distributed across other
nodes. The two types of data partitioning are:
Horizontal Fragmentation: In horizontal fragmentation, the relation is divided into groups of tuples, with each tuple assigned to at least one fragment. As a result, requests can be processed in parallel and resources can be used effectively.
Vertical Fragmentation: This process includes breaking the relation's schema into smaller schemas. A
common candidate key must be present in each fragment so that the original relation can be reconstructed by a lossless join. Vertical fragmentation can be advantageous when different attributes of the relation have different access patterns, or when data privacy considerations require sensitive information to be separated.
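A minimal Python sketch of both fragmentation styles follows; the table, column names, and the predicate used to split rows are assumptions made purely for illustration. Horizontal fragments are rebuilt by a union of rows, vertical fragments by a join on the common candidate key.

# Hypothetical relation: each row is a dict with candidate key "emp_id".
employees = [
    {"emp_id": 1, "name": "Asha",  "branch": "north", "salary": 50000},
    {"emp_id": 2, "name": "Ravi",  "branch": "south", "salary": 60000},
    {"emp_id": 3, "name": "Meena", "branch": "north", "salary": 55000},
]

# Horizontal fragmentation: split the rows by a predicate (here, branch).
north_frag = [r for r in employees if r["branch"] == "north"]
south_frag = [r for r in employees if r["branch"] == "south"]
reconstructed_h = north_frag + south_frag            # union restores the relation

# Vertical fragmentation: split the columns, repeating the key in each fragment.
ids_names    = [{"emp_id": r["emp_id"], "name": r["name"]} for r in employees]
ids_salaries = [{"emp_id": r["emp_id"], "branch": r["branch"], "salary": r["salary"]}
                for r in employees]

# Lossless reconstruction: join the vertical fragments on the shared key.
salary_by_id    = {r["emp_id"]: r for r in ids_salaries}
reconstructed_v = [{**r, **salary_by_id[r["emp_id"]]} for r in ids_names]

print(len(reconstructed_h), len(reconstructed_v))    # 3 3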
Replication
Replication includes keeping copies of the data across many distributed database nodes. Replication aims to
increase performance, fault tolerance, and data availability. Any node that has a copy of the necessary data
may perform a query when it is run, decreasing latency and speeding up response time. The different
replication strategies are:
Full replication: It ensures high availability but comes with high storage and update costs, since complete copies of the database must be kept at all nodes.
Partial replication: It replicates only selected data fragments, chosen according to their importance or access patterns.
Multi−master replication: Better speed and fault tolerance are provided by multi−master replication, which
enables many nodes to accept read and write operations.
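The sketch below contrasts the three strategies in schematic form; the node names, fragment names, access counts, and the simple "hot data" rule are invented for the example and are not part of any particular system.

# Deciding where copies of each fragment live under the three strategies.
nodes = ["node1", "node2", "node3"]
access_count = {"orders": 900, "audit_log": 12}      # hypothetical access pattern

def placement(fragment, strategy):
    if strategy == "full":                 # every node stores every fragment
        return list(nodes)
    if strategy == "partial":              # replicate only frequently used fragments widely
        return list(nodes) if access_count[fragment] > 100 else nodes[:1]
    if strategy == "multi_master":         # every copy accepts both reads and writes
        return [(n, "read/write") for n in nodes]
    raise ValueError(strategy)

print(placement("orders", "partial"))      # ['node1', 'node2', 'node3']
print(placement("audit_log", "partial"))   # ['node1']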
Network Communication
The ability to communicate over a network is essential in distributed database systems. To get the best
performance and availability, low−latency networks and effective communication protocols are necessary.
For distributed database architecture, minimizing the quantity of data sent over the network and enhancing
data transfer rates are crucial factors.
Security
Security is a crucial consideration, given that distributed databases hold data across several nodes that may be geographically scattered. The distributed architecture of the database raises additional security issues such as data privacy, authentication, access control, and encryption. Strong security measures must be put in place to protect sensitive data and guarantee compliance with regulations and privacy policies.
Client−Server Architecture
A common method for spreading database functionality is the client−server architecture. Clients
communicate with a central server, which controls the distributed database system, in this design. The server
is in charge of maintaining data storage, controlling access, and organizing transactions. This architecture
has several clients and servers connected. A client sends a query, and whichever server is available first processes it. This architecture is simple to implement because of the centralized server system.
Peer−to−Peer Architecture
Each node in the distributed database system may function as both a client and a server in a peer−to−peer
architecture. Each node is linked to the others and works together to process and store data. Each node is in charge of managing its own data and organizing node−to−node interactions. Because the loss of a
single node does not cause the system to collapse, peer−to−peer systems provide decentralized control and
high fault tolerance. This design is ideal for distributed systems with nodes that can function independently
and with equal capabilities.
Federated Architecture
Multiple independent databases with various types are combined into a single meta−database using a
federated database design. It offers a uniform interface for navigating and exploring distributed data. In the
federated design, each site maintains a separate, independent database, while the virtual database manager
internally distributes requests. When working with several data sources or legacy systems that can't be
simply updated, federated architectures are helpful.
Shared−Nothing Architecture
Data is divided up and spread among several nodes in a shared−nothing architecture, with each node in
charge of a particular portion of the data. Resources are not shared across nodes, and each node runs
independently. Due to the system's capacity to add additional nodes as needed without affecting the current
nodes, this design offers great scalability and fault tolerance. Large−scale distributed systems, such as data
warehouses or big data analytics platforms, frequently employ shared−nothing designs.
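A common way to spread data in a shared-nothing design is to hash each key to exactly one node, so that no node shares its portion of the data with any other. The short Python sketch below illustrates this routing rule; the node list and keys are purely illustrative.

import hashlib

nodes = ["node0", "node1", "node2"]       # hypothetical shared-nothing cluster

def owner(key):
    """Each key is owned by exactly one node; nothing is shared between nodes."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

for k in ("customer:42", "customer:43", "order:7"):
    print(k, "->", owner(k))              # each key routes to a single owning node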
Advantages of DDMS
The database is easier to expand as it is already spread across multiple systems and it is not too
complicated to add a system.
The distributed database can have its data arranged according to different levels of transparency, i.e., data with different transparency levels can be stored at different locations.
The database can be stored according to departmental information in an organisation, which makes organisational, hierarchical access easier.
If there were a natural catastrophe such as a fire or an earthquake, all the data would not be destroyed, because it is stored at different locations.
It is cheaper to create a network of systems that each contain a part of the database. Such a database can also be easily scaled up or down.
Even if some of the data nodes go offline, the rest of the database can continue its normal functions.
Disadvantages of DDMS
The distributed database is quite complex and it is difficult to make sure that a user gets a uniform view
of the database because it is spread across multiple locations.
This database is more expensive as it is complex and hence, difficult to maintain.
It is difficult to provide security in a distributed database as the database needs to be secured at all the
locations it is stored. Moreover, the infrastructure connecting all the nodes in a distributed database also
needs to be secured.
It is difficult to maintain data integrity in the distributed database because of its nature. There can also be
data redundancy in the database as it is stored at multiple locations.
The distributed database is complicated and it is difficult to find people with the necessary experience
who can manage and maintain it.
Distributed Storage
A distributed storage system stores data across multiple physical servers while presenting it as a single storage resource. It can take several forms:
Files—a distributed file system allows devices to mount a virtual drive, with the actual files distributed across several machines.
Block storage—a block storage system stores data in volumes known as blocks. This is an
alternative to a file-based structure that provides higher performance. A common distributed block
storage system is a Storage Area Network (SAN).
Objects—a distributed object storage system wraps data into objects, identified by a unique ID or
hash.
Why distributed storage is important:
Scalability—the primary motivation for distributing storage is to scale horizontally, adding more storage space by adding more storage nodes to the cluster.
Redundancy—distributed storage systems can store more than one copy of the same data, for high
availability, backup, and disaster recovery purposes.
Cost—distributed storage makes it possible to use cheaper, commodity hardware to store large
volumes of data at low cost.
Performance—distributed storage can offer better performance than a single server in some
scenarios, for example, it can store data closer to its consumers, or enable massively parallel access
to large files.
Features and Limitations
Most distributed storage systems have some or all of the following features:
Partitioning—the ability to distribute data between cluster nodes and enable clients to seamlessly
retrieve the data from multiple nodes.
Replication—the ability to replicate the same data item across multiple cluster nodes and maintain
consistency of the data as clients update it.
Fault tolerance—the ability to retain access to data even when one or more nodes in the distributed storage cluster go down.
Elastic scalability—enabling data users to receive more storage space if needed, and enabling
storage system operators to scale the storage system up and down by adding or removing storage
units to the cluster.
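A tiny Python sketch of the partitioning, replication, and fault-tolerance features follows; the node names, replication factor, and failure scenario are all hypothetical. Each object is written to several storage nodes, so a read can still succeed when one node is offline.

import hashlib

nodes = {"n0": {}, "n1": {}, "n2": {}}      # hypothetical storage nodes
REPLICAS = 2                                # replication factor

def node_ids_for(key):
    """Partitioning: choose REPLICAS nodes deterministically from the key."""
    names = sorted(nodes)
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(names)
    return [names[(start + i) % len(names)] for i in range(REPLICAS)]

def put(key, value):
    for nid in node_ids_for(key):
        nodes[nid][key] = value             # replication: store several copies

def get(key, down=()):
    for nid in node_ids_for(key):
        if nid not in down:                 # fault tolerance: skip failed nodes
            return nodes[nid].get(key)
    return None

put("photo.jpg", b"...bytes...")
print(get("photo.jpg", down=("n0",)))       # still readable if n0 is offline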
What is Distributed Transaction:
A distributed transaction is a set of operations that are performed across multiple data repositories. It
is also known as a global transaction.
Distributed transactions ensure that a group of related operations are either all completed successfully
or all rolled back. This maintains data consistency and integrity.
Distributed transactions are typically coordinated across separate nodes connected by a
network. However, they may also span multiple databases on a single server.
For a distributed transaction to commit successfully, all of the individual data sources involved must commit successfully.
The distributed transaction ensures ACID (Atomicity, Consistency, Isolation, Durability) properties
and data integrity.
A distributed transaction guarantees one of two outcomes: 1) all of the operations are performed and committed at every site, or 2) none of the operations are performed at all due to a failure somewhere in the system.
In the latter case, if some work was completed prior to the failure, that work will be reversed to ensure no
net work was done. This type of operation is in compliance with the “ACID” (atomicity-consistency-
isolation-durability) principles of databases that ensure data integrity. ACID is most commonly associated
with transactions on a single database server, but distributed transactions extend that guarantee across
multiple databases.
The operation known as a “two-phase commit” (2PC) is a common protocol for committing a distributed transaction. “XA transactions” are transactions using the XA protocol, which is one implementation of two-phase commit.
Steps of Distributed Transaction
Atomic Commit
The atomic commit procedure should meet the following requirements:
All participants who make a choice reach the same conclusion.
If any participant decides to commit, then all other participants must have voted yes.
If all participants vote yes and no failure occurs, then all participants decide to commit.
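The following Python sketch mimics the two phases of 2PC with in-memory participant objects, to show how the atomic commit requirements above are met: commit only if every participant votes yes, otherwise abort everywhere. The participant names and the simple vote logic are assumptions for illustration only, not the XA API.

# Phase 1 (prepare/voting): the coordinator asks every participant to vote.
# Phase 2 (commit/abort): commit only if every vote was "yes", otherwise abort.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):                      # phase 1: vote yes or no
        self.state = "prepared" if self.can_commit else "abort-voted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]        # phase 1
    if all(votes):                                      # phase 2
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

dbs = [Participant("orders_db"), Participant("billing_db", can_commit=False)]
print(two_phase_commit(dbs))                            # aborted: one "no" vote
print([(p.name, p.state) for p in dbs])                 # all participants agree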
Atomicity
All changes to data are performed as if they are a single operation. That is, all the changes are performed, or
none of them are.
For example, in an application that transfers funds from one account to another, the atomicity property
ensures that, if a debit is made successfully from one account, the corresponding credit is made to the other
account.
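A minimal sketch of this atomicity guarantee on a single in-memory "database" follows; the account names and balances are invented. Either both the debit and the credit apply, or neither does.

accounts = {"A": 300, "B": 100}           # hypothetical starting balances

def transfer(src, dst, amount):
    """Debit src and credit dst as one all-or-nothing unit."""
    snapshot = dict(accounts)              # remember the pre-transaction state
    try:
        if accounts[src] < amount:
            raise ValueError("insufficient funds")
        accounts[src] -= amount
        accounts[dst] += amount            # a failure here would trigger rollback
    except Exception:
        accounts.clear()
        accounts.update(snapshot)          # roll back: no net work was done
        raise

transfer("A", "B", 50)
print(accounts)                            # {'A': 250, 'B': 150}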
Consistency
Data is in a consistent state when a transaction starts and when it ends.
For example, in an application that transfers funds from one account to another, the consistency property
ensures that the total value of funds in both the accounts is the same at the start and end of each transaction.
Isolation
The intermediate state of a transaction is invisible to other transactions. As a result, transactions that run
concurrently appear to be serialized.
For example, in an application that transfers funds from one account to another, the isolation property
ensures that another transaction sees the transferred funds in one account or the other, but not in both, nor in
neither.
Durability
After a transaction successfully completes, changes to data persist and are not undone, even in the event of
a system failure.
For example, in an application that transfers funds from one account to another, the durability property
ensures that the changes made to each account will not be reversed.
The Atomic Commit Protocol is essential to maintaining transaction dependability and consistency in distributed systems. Its main goal is to ensure that, even in the event of a failure, a transaction is either fully committed or fully aborted across all participating nodes.
Concurrency Control in Distributed Database
Concurrency Control is the management procedure that is required for controlling concurrent
execution of the operations that take place on a database.
Concurrency control in distributed databases is the process of coordinating concurrent access to a database
in a multiuser database management system (DBMS). It allows users to access a database in a multi-
programmed way while maintaining the illusion that each user is executing alone on a dedicated system.
Concurrency controlling techniques ensure that multiple transactions are executed simultaneously while
maintaining the ACID properties of the transactions and serializability in the schedules.
o In a multi-user system, multiple users can access and use the same database at one time, which is
known as the concurrent execution of the database. It means that the same database is executed
simultaneously on a multi-user system by different users.
o While working on the database transactions, there occurs the requirement of using the database by
multiple users for performing different operations, and in that case, concurrent execution of the
database is performed.
o The simultaneous execution should be performed in an interleaved manner, and no operation should affect the other executing operations, so that the consistency of the database is maintained. Thus, when transaction operations are executed concurrently, several problems can arise, as described below.
In a database transaction, the two main operations are READ and WRITE. These operations need to be managed during the concurrent execution of transactions, because if their interleaving is not controlled, the data may become inconsistent. The following problems occur with the concurrent execution of operations:
Problem 1: Lost Update Problems (W - W Conflict)
The problem occurs when two different database transactions perform read/write operations on the same database items in an interleaved manner (i.e., concurrent execution) in a way that makes the values of the items incorrect, thereby making the database inconsistent.
For example:
Consider the below diagram, where two transactions TX and TY are performed on the same account A, whose balance is Rs 300.
(Diagram: interleaved schedule of TX and TY on account A; the successive values shown are 300, 250, 300, 400, 250 and 400, as explained step by step below.)
o At time t1, transaction TX reads the value of account A, i.e., Rs 300 (only read).
o At time t2, transaction TX deducts Rs 50 from account A that becomes Rs 250 (only deducted and not
updated/write).
o Alternately, at time t3, transaction TY reads the value of account A, which will still be Rs 300, because TX has not updated the value yet.
o At time t4, transaction TY adds Rs 100 to account A that becomes Rs 400 (only added but not
updated/write).
o At time t6, transaction TX writes the value of account A that will be updated as Rs 250 only, as
TY didn't update the value yet.
o Similarly, at time t7, transaction TY writes the value of account A, writing the Rs 400 computed at time t4. This means the value written by TX is lost, i.e., the Rs 250 is lost.
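The same interleaving can be replayed in a few lines of Python. The schedule below follows the timeline above, using a local (unwritten) copy per transaction, and the final balance shows TX's update being lost. The variable names are illustrative only.

A = 300                      # shared database item (account A)

# t1/t2: TX reads A and deducts Rs 50 in its local copy (not yet written back)
tx_local = A
tx_local -= 50               # 250

# t3/t4: TY reads the still-unchanged A and adds Rs 100 in its own local copy
ty_local = A
ty_local += 100              # 400

# t6: TX writes its value; t7: TY overwrites it
A = tx_local                 # 250
A = ty_local                 # 400 -> the deduction made by TX is lost

print(A)                     # 400, instead of the correct 350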
Problem 2: Dirty Read Problems (W - R Conflict)
The dirty read problem occurs when one transaction updates an item of the database, that transaction then fails, and before the data is rolled back, the updated database item is accessed by another transaction. This creates a Read-Write conflict between the two transactions.
For example:
Consider two transactions TX and TY in the below diagram performing read/write operations on account A, where the available balance in account A is Rs 300:
(Diagram: TX reads A = Rs 300, increases it to Rs 350 and writes Rs 350; TY then reads Rs 350; TX subsequently fails and A is rolled back to Rs 300, so TY has read an uncommitted, dirty value.)
Problem 3: Unrepeatable Read Problem (W - R Conflict)
Also known as the Inconsistent Retrievals Problem, this occurs when, within one transaction, two different values are read for the same database item.
For example:
Consider two transactions, TX and TY, performing read/write operations on account A, having an available balance of Rs 300. The diagram is shown below:
(Diagram: interleaved schedule of TX and TY on account A; the successive values shown are 300, 300, 400, 400 and 400, as explained step by step below.)
o At time t1, transaction TX reads the value from account A, i.e., Rs 300.
o At time t2, transaction TY reads the value from account A, i.e., Rs 300.
o At time t3, transaction TY updates the value of account A by adding Rs 100 to the available balance,
and then it becomes Rs 400.
o At time t4, transaction TY writes the updated value, i.e., Rs 400.
o After that, at time t5, transaction TX reads the available value of account A, which will be read as Rs 400.
o This means that within the same transaction, TX reads two different values of account A: Rs 300 initially, and Rs 400 after the update made by transaction TY. This is an unrepeatable read and is therefore known as the Unrepeatable Read problem.
Thus, in order to maintain consistency in the database and avoid such problems in concurrent execution, management is needed, and that is where the concept of Concurrency Control comes into play.
Lock-Based Protocol
In this type of protocol, any transaction cannot read or write data until it acquires an appropriate lock on it.
There are two types of lock:
1. Shared lock:
o It is also known as a Read-only lock. With a shared lock, the data item can only be read by the transaction.
o It can be shared between transactions because, while a transaction holds a shared lock, it cannot update the data item.
2. Exclusive lock:
o With an exclusive lock, the data item can be both read and written by the transaction.
o This lock is exclusive: while it is held, other transactions cannot modify the same data item simultaneously.
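A compact lock-manager sketch in Python follows (simplified: no deadlock handling and no two-phase locking discipline), showing how shared locks coexist while an exclusive lock excludes everyone else. The transaction and item names are illustrative.

class LockManager:
    """Per-item locks: many shared (read) holders, or one exclusive (write) holder."""
    def __init__(self):
        self.shared = {}       # item -> set of transactions holding a shared lock
        self.exclusive = {}    # item -> the single transaction holding an exclusive lock

    def lock_shared(self, txn, item):
        holder = self.exclusive.get(item)
        if holder is not None and holder != txn:
            return False                        # someone else is writing
        self.shared.setdefault(item, set()).add(txn)
        return True

    def lock_exclusive(self, txn, item):
        readers = self.shared.get(item, set()) - {txn}
        holder = self.exclusive.get(item)
        if readers or (holder is not None and holder != txn):
            return False                        # other readers or another writer
        self.exclusive[item] = txn
        return True

lm = LockManager()
print(lm.lock_shared("T1", "A"))      # True: read lock granted
print(lm.lock_shared("T2", "A"))      # True: shared locks can coexist
print(lm.lock_exclusive("T3", "A"))   # False: T3 must wait for the readers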
Timestamp Ordering Protocol
o The Timestamp Ordering Protocol is used to order the transactions based on their timestamps. The order of the transactions is simply the ascending order of their creation times.
o The older transaction has the higher priority, which is why it executes first. To determine the timestamp of a transaction, this protocol uses the system time or a logical counter.
o The lock-based protocol is used to manage the order between conflicting pairs among transactions at
the execution time. But Timestamp based protocols start working as soon as a transaction is created.
o Let's assume there are two transactions, T1 and T2. Suppose transaction T1 entered the system at time 007 and transaction T2 entered the system at time 009. T1 has the higher priority, so it executes first, as it entered the system first.
o The timestamp ordering protocol also maintains the timestamps of the last 'read' and 'write' operations on each data item.
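A simplified Python sketch of the basic timestamp-ordering checks follows (no recovery and no Thomas write rule). Each item remembers the timestamps of the last read and write, and an operation that arrives "too late" forces a rollback. The timestamps and item names are illustrative.

read_ts = {}     # item -> largest timestamp that has read it
write_ts = {}    # item -> largest timestamp that has written it

def read(ts, item):
    if ts < write_ts.get(item, 0):
        return "rollback"                 # a younger transaction already wrote the item
    read_ts[item] = max(read_ts.get(item, 0), ts)
    return "ok"

def write(ts, item):
    if ts < read_ts.get(item, 0) or ts < write_ts.get(item, 0):
        return "rollback"                 # the write would arrive out of timestamp order
    write_ts[item] = ts
    return "ok"

print(read(7, "A"))    # ok: T1 (timestamp 007) reads A
print(write(9, "A"))   # ok: T2 (timestamp 009) writes A
print(write(7, "A"))   # rollback: T1 is older and now conflicts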
Validation-Based Protocol
The validation-based protocol is also known as the optimistic concurrency control technique. In this protocol, the transaction is executed in the following three phases:
1. Read phase: In this phase, the transaction T is read and executed. It is used to read the value of
various data items and stores them in temporary local variables. It can perform all the write
operations on temporary variables without an update to the actual database.
2. Validation phase: In this phase, the temporary variable values are validated against the actual data to see whether they violate serializability.
3. Write phase: If the transaction passes validation, the temporary results are written to the database or system; otherwise the transaction is rolled back.
Validation (Ti): It contains the time when Ti finishes its read phase and starts its validation phase.
o This protocol is used to determine the time stamp for the transaction for serialization using the time
stamp of the validation phase, as it is the actual phase which determines if the transaction will
commit or rollback.
o Hence TS(T) = validation(T).
o The serializability is determined during the validation process. It can't be decided in advance.
o While executing transactions, this protocol allows a greater degree of concurrency and produces fewer conflicts.
o Thus it results in fewer transaction rollbacks.
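A rough Python sketch of the three phases for a single transaction follows, assuming a simple validation rule: the transaction may write only if no item in its read set was changed by another transaction after the read phase began. The version counters and item names are illustrative assumptions, not a standard API.

database = {"A": 300}
versions = {"A": 0}             # bumped on every committed write

def run_optimistic(update_fn):
    # Read phase: copy values into local variables, record the versions seen.
    local = dict(database)
    seen = dict(versions)
    update_fn(local)            # all writes go to the temporary copy only

    # Validation phase: fail if anything we read has been changed meanwhile.
    if any(versions[k] != v for k, v in seen.items()):
        return "rollback"

    # Write phase: install the temporary results and bump the versions.
    for k, v in local.items():
        if v != database.get(k):
            database[k] = v
            versions[k] = versions.get(k, 0) + 1
    return "committed"

print(run_optimistic(lambda d: d.update(A=d["A"] + 100)))   # committed
print(database)                                             # {'A': 400}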
Heterogeneous Database
A heterogeneous distributed database system (HDDS) is a database system that uses different schemas, operating systems, DBMSs, and data models across its sites. At least one of the databases in an HDDS differs from the others.
Various operating systems and database applications may be used by various machines. They could even
employ separate database data models. Translations are therefore necessary for communication across
various sites.
Cloud Databases
Like a traditional on-premises database, cloud databases can be classified into relational databases and non-relational databases.
Relational cloud databases consist of one or more tables of columns and rows and allow you to
organize data in predefined relationships to understand how data is logically related. These databases
typically use a fixed data schema, and you can use structured query language (SQL) to query and
manipulate data. They are highly consistent, reliable, and best suited to dealing with large amounts of
structured data.
Examples of relational databases include SQL Server, Oracle, MySQL, PostgreSQL, Spanner, and Cloud
SQL.
Non-relational cloud databases store and manage unstructured data, such as email and mobile
message text, documents, surveys, rich media files, and sensor data. They don’t follow a clearly-
defined schema like relational databases and allow you to save and organize information regardless
of its format.
Examples of non-relational databases include MongoDB, Redis, Cassandra, Hbase, and Cloud Bigtable.
The amount of data generated and collected today is growing exponentially. It’s not only more varied, but
also wildly disparate. Data can now reside across on-premises databases and distributed cloud applications
and services, making it difficult to integrate using traditional approaches. In addition, real-time data
processing is becoming essential to business success—delays and lags in data delivery to mission-critical
applications could have catastrophic consequences.
As cloud adoption accelerates and the way we use data continues to evolve, legacy databases face significant
challenges.
Cloud databases provide flexibility, reliability, security, affordability, and more, providing a solid foundation for building modern business applications. In particular, they can rapidly adapt to changing workloads and demands without increasing the workload of already overburdened teams.