
UNIT -2

Distributed Database Management System


Syllabus Content
Distributed database Management System
Functions of distributed database system
Distributed databases - Homogeneous and Heterogeneous databases
Distributed Data storage
Why Distributed storage is important?
Distributed cloud storage - Features of distributed cloud storage
Distributed Transactions
How distributed transactions work
Essential properties of distributed transactions (ACID)
Commit protocols - Distributed one phase commit
Distributed two phase commit
Objectives of Concurrency control in Distributed Databases
Concurrency Control anomalies
Methods of concurrency control
Serializability and recoverability
Distributed Serializability, Enhanced lock-based and timestamp-based protocols
Heterogeneous distributed databases
Cloud-based databases
Why use a cloud database?
Types of cloud-based databases
Advantages of cloud-based databases
Replication in MongoDB
Indexing in MongoDB
Distributed Query Optimization Algorithm
Notes Unit 2

DDMS (Distributed Database Management System)

A distributed database is essentially a database that is dispersed across numerous sites, i.e., on various
computers or over a network of computers, and is not restricted to a single system. A distributed database
system is spread across several locations with distinct physical components. This can be necessary when
different people from all over the world need to access a certain database. It must be handled such that, to
users, it seems to be a single database.

Types of DDMS:

Distributed databases can be broadly classified into homogeneous and heterogeneous distributed database
environments, each with further sub-divisions, as shown in the following illustration.
Homogeneous Distributed Database

In a homogeneous distributed database, all the sites use identical DBMS and operating systems. Its
properties are −

 The sites use very similar software.


 The sites use identical DBMS or DBMS from the same vendor.
 Each site is aware of all other sites and cooperates with other sites to process user requests.
 The database is accessed through a single interface as if it is a single database.

Types of Homogeneous Distributed Database

There are two types of homogeneous distributed database −

 Autonomous − Each database is independent and functions on its own. The databases are integrated by a
controlling application and use message passing to share data updates.
 Non-autonomous − Data is distributed across the homogeneous nodes and a central or master
DBMS co-ordinates data updates across the sites.

Heterogeneous Distributed Databases

In a heterogeneous distributed database, different sites have different operating systems, DBMS products
and data models. Its properties are −

 Different sites use dissimilar schemas and software.


 The system may be composed of a variety of DBMSs like relational, network, hierarchical or object
oriented.
 Query processing is complex due to dissimilar schemas.
 Transaction processing is complex due to dissimilar software.
 A site may not be aware of other sites and so there is limited co-operation in processing user
requests.
Types of Heterogeneous Distributed Databases
1. Federated − The heterogeneous database systems are independent in nature and integrated together
so that they function as a single database system.
2. Un-federated − The database systems employ a central coordinating module through which the
databases are accessed.
Functions of Distributed Database
A distributed database system is a collection of multiple interconnected databases that are geographically
distributed across different sites. The functions of a distributed database system are:

1. Data distribution: One of the primary functions of a distributed database system is to


distribute data across multiple sites. This is done to ensure that data is stored closer to where it
is needed and to reduce the amount of data that needs to be transferred over the network.
2. Data replication: In a distributed database system, data can be replicated across multiple sites.
Replication can improve system availability and reliability by ensuring that data is available
even if one of the sites fails.
3. Data fragmentation: Data fragmentation involves breaking down a large database into smaller
fragments and distributing them across multiple sites. This can help improve system
performance by reducing the amount of data that needs to be transferred over the network.
4. Query processing: Query processing involves processing user queries and retrieving data from
the distributed database system. This is a complex task as data may be stored across multiple
sites and may need to be combined to answer user queries.
5. Transaction management: In a distributed database system, transactions may span multiple
sites. Transaction management involves coordinating these transactions and ensuring that they
are executed correctly and efficiently.
6. Security and access control: In a distributed database system, it is important to ensure that
data is secure and that access to it is controlled. This involves implementing appropriate
security measures and access control mechanisms to protect data from unauthorized access or
modification.
7. Performance optimization: A distributed database system needs to be optimized for
performance to ensure that it can handle large volumes of data and user requests. This may
involve optimizing query processing algorithms, improving network performance, or tuning
database parameters to improve performance.
8. System administration: System administration involves managing the distributed database
system and ensuring that it is running smoothly. This may involve tasks such as monitoring
system performance, backing up data, and resolving system issues.

In distributed data storage, data may be stored across several sites in two ways:

1. Replication - In this approach, the entire relation is stored redundantly at two or more sites. If the
entire database is available at every site, it is a fully redundant database. Because of replication, the
system maintains copies of the data. This is advantageous, since it makes more data available at multiple
sites, and query requests can be processed in parallel. However, there are drawbacks as well. Data must be
updated constantly: every change made at one site must be recorded at every site where that relation is
stored, or else the results will be inconsistent. This imposes a great deal of overhead. Moreover, since
concurrent access must now be monitored across several sites, concurrency control becomes far more
complicated.
2. Fragmentation - In this method, the relations are broken up into smaller fragments, and each fragment
is stored at the sites where it is needed. The fragments must be defined so that the original relation can
be reconstructed, ensuring there is no data loss. Since fragmentation does not duplicate data, consistency
is not a concern.

Relationships can be fragmented in one of two ways:


o Separating the relation into groups of tuples using rows results in horizontal fragmentation, where
each tuple is allocated to at least one fragment.
o Vertical fragmentation, also known as splitting by columns, occurs when a relation's schema is split
up into smaller schemas. A common candidate key must be present in each fragment in order to
guarantee a lossless join.

Sometimes a strategy that combines fragmentation and replication is employed.
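The two fragmentation schemes above can be sketched with plain Python data structures. This is an illustrative toy (the "accounts" relation and its columns are made up for the example), not a real DDBMS, but it shows how rows or columns are split and how the original relation is reconstructed by union or by a lossless join on the shared candidate key.

```python
# A minimal sketch of horizontal and vertical fragmentation of a relation.
# The "accounts" relation and its columns are hypothetical examples.

accounts = [
    {"acc_no": 1, "branch": "Delhi",  "balance": 5000},
    {"acc_no": 2, "branch": "Mumbai", "balance": 7000},
    {"acc_no": 3, "branch": "Delhi",  "balance": 3000},
]

# Horizontal fragmentation: split by rows (here, by branch).
def horizontal_fragment(rows, predicate):
    return [r for r in rows if predicate(r)]

delhi  = horizontal_fragment(accounts, lambda r: r["branch"] == "Delhi")
mumbai = horizontal_fragment(accounts, lambda r: r["branch"] == "Mumbai")

# Reconstruction of horizontal fragments is a union.
assert sorted(delhi + mumbai, key=lambda r: r["acc_no"]) == accounts

# Vertical fragmentation: split by columns; every fragment keeps the
# candidate key ("acc_no") so the original relation can be rebuilt.
def vertical_fragment(rows, columns):
    return [{c: r[c] for c in columns} for r in rows]

frag1 = vertical_fragment(accounts, ["acc_no", "branch"])
frag2 = vertical_fragment(accounts, ["acc_no", "balance"])

# Reconstruction is a lossless join on the shared candidate key.
rebuilt = [
    {**a, **b}
    for a in frag1 for b in frag2
    if a["acc_no"] == b["acc_no"]
]
assert sorted(rebuilt, key=lambda r: r["acc_no"]) == accounts
```

Note that the vertical rebuild only works because both fragments carry `acc_no`; drop the key from either fragment and the join becomes lossy, which is exactly why the text requires a common candidate key in each fragment.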

Uses for distributed databases


o It is used in corporate management information systems.

o It is used in multimedia applications.

o It is used in hotel chains, military command systems, etc.

o It is used in production control systems.

Characteristics of distributed databases

Distributed databases are logically connected to one another when they are part of a collection, and they
frequently form a single logical database. Data is physically stored across several sites and is separately
handled in distributed databases. Each site's processors are connected to one another through a network, but
they are not set up for multiprocessing.

A widespread misunderstanding is that a distributed database is equivalent to a loosely coupled file system.
In reality, it is considerably more complex than that. Although distributed databases rely on transaction
processing, they are not the same thing as transaction processing systems.

Generally speaking, distributed databases have the following characteristics:

o Location independence

o Distributed query processing

o Distributed transaction management

o Hardware independence

o Operating system and network independence

o Transaction transparency

o DBMS independence

Architecture for a distributed database

Both homogeneous and heterogeneous distributed databases exist.

All of the physical sites in a homogeneous distributed database system use the same operating system and
database software, as well as the same underlying hardware. It can be significantly simpler to build and
administer homogenous distributed database systems since they seem to the user as a single system. The
data structures at each site must either be the same or compatible for a distributed database system to be
considered homogeneous. Also, the database software used at each site must be the same or compatible.

The hardware, operating systems, or database software at each site may vary in a heterogeneous distributed
database. Although separate sites may employ various technologies and schemas, a variation in schema
might make query and transaction processing challenging.

Various nodes could have dissimilar hardware, software, and data structures, or they might be situated in
incompatible places. Users may be able to access data stored at a different place but not upload or modify it.
Because heterogeneous distributed databases are sometimes challenging to use, many organizations find
them to be economically unviable.

Distributed databases' benefits

Using distributed databases has a lot of benefits.

o As distributed databases provide modular development, systems may be enlarged by putting new
computers and local data in a new location and seamlessly connecting them to the distributed system.
o With centralized databases, failures result in a total shutdown of the system. Distributed database
systems, however, continue to operate with lower performance when a component fails until the
issue is resolved.
o If the data is kept close to where it is most often used, administrators can reduce transmission costs
for distributed database systems. Centralized systems cannot accommodate this.

Types of distributed data
o Data instances are created in various areas of the database using replicated data. Distributed
databases may access identical data locally by utilizing duplicated data, which reduces bandwidth.
Read-only and writable data are the two types of replicated data that may be distinguished.
o In read-only versions of replicated data, only the initial instance can be changed; all subsequent
replications of the data are then updated from it. Data that is writable can be modified, but only the
initial occurrence is affected.

o Primary keys that point to a single database record are used to identify horizontally fragmented data.
Horizontal fragmentation is typically used when business locations only want access to the database
for their own branch.
o Vertically fragmented data is organized using copies of primary keys that are accessible to each
branch of the database. Vertically fragmented data is used when a company's branch and central
location deal with the same accounts differently.
o Reorganised data is data that has been edited or modified for decision support databases. It is
generally utilised when two distinct systems are managing transactions and decision support. Online
transaction processing must be reconfigured when there are numerous requests, and decision support
systems can be challenging to manage.
o In order to accommodate various departments and circumstances, separate schema data separates the
database and the software used to access it. Often, there is overlap between many databases and
separate schema data

Distributed database examples


o Apache Ignite, Apache Cassandra, Apache HBase, Couchbase Server, Amazon SimpleDB,
Clusterpoint, and FoundationDB are just a few examples of the numerous distributed databases
available.
o Large data sets may be stored and processed with Apache Ignite across node clusters. GridGain
Systems released Ignite as open source in 2014, and it was later approved into the Apache Incubator
program. RAM serves as the database's primary processing and storage layer in Apache Ignite.
o Apache Cassandra has its own query language, the Cassandra Query Language (CQL), and it supports
clusters that span several locations. Replication strategies in Cassandra may also be customized.
o Apache HBase offers a fault-tolerant mechanism to store huge amounts of sparse data on top of the
Hadoop Distributed File System. Moreover, it offers per-column Bloom filters, in-memory
execution, and compression. Although Apache Phoenix offers a SQL layer for HBase, HBase is not
meant to replace SQL databases.
o An interactive application that serves several concurrent users by producing, storing, retrieving,
aggregating, altering, and displaying data is best served by Couchbase Server, a NoSQL software
package. Scalable key value and JSON document access is provided by Couchbase Server to satisfy
these various application demands.
o Along with Amazon S3 and Amazon Elastic Compute Cloud, Amazon SimpleDB is utilised as a web
service. Developers may request and store data with Amazon SimpleDB with a minimum of database
maintenance and administrative work.
o Relational database designs' complexity, scalability problems, and performance restrictions are all
eliminated with Clusterpoint. Open APIs are used to handle data in the XML or JSON formats.
Clusterpoint does not have the scalability or performance difficulties that other relational database
systems experience since it is a schema-free document database.

Distributed Database Architecture

The transfer of data for storage at various computers or locations connected over a network is known as a
distributed database. It may alternatively be described as a database that gathers information from several
databases using independent computers linked by data communication connections. Compared to centralized
database systems, distributed databases can offer higher availability and dependability, because the
system can continue to function even if one or more sites go down. A distributed database system can
perform more effectively by distributing the burden and using the information across several sites.
Design Considerations for Distributed Databases

Data Partitioning

In data partitioning, the database is split up into smaller units, or fragments, and distributed across other
nodes. The two types of data partitioning are:

Horizontal Fragmentation: The relation is divided into groups of tuples in horizontal fragmentation, with
each tuple given to at least one fragment. As a result, requests may be processed in parallel and resources
can be used effectively.
Vertical Fragmentation: This process includes breaking the relation's schema into smaller schemas. A
common candidate key must be present in each fragment for a lossless join to occur. When distinct aspects
of the connection have different access patterns or when data privacy considerations necessitate the
separation of sensitive information, vertical fragmentation might be advantageous.

Replication

Replication includes keeping copies of the data across many distributed database nodes. Replication aims to
increase performance, fault tolerance, and data availability. Any node that has a copy of the necessary data
may perform a query when it is run, decreasing latency and speeding up response time. The different
replication strategies are:

Full replication: It ensures high availability but comes with high storage and update expense as entire
copies of the database must be kept at all nodes.
Partial replication: It chooses which data pieces to repeat depending on data relevance or access patterns.
Multi−master replication: Better speed and fault tolerance are provided by multi−master replication, which
enables many nodes to accept read and write operations.
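The difference between full and partial replication can be sketched in a few lines. This is a toy illustration under assumed names (the nodes, keys, and hash-based placement rule are invented for the example); real systems add consistency protocols on top.

```python
# A toy sketch contrasting full and partial replication.
# Node names and the hash-based placement rule are illustrative assumptions.

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}

nodes = [Node("n1"), Node("n2"), Node("n3")]

def write_full(key, value):
    # Full replication: every node keeps a complete copy, so every write
    # must be applied everywhere (high availability, high update cost).
    for n in nodes:
        n.store[key] = value

def write_partial(key, value, replicas=2):
    # Partial replication: keep only `replicas` copies, placed here by
    # hashing the key (a stand-in for relevance/access-pattern policies).
    start = hash(key) % len(nodes)
    for i in range(replicas):
        nodes[(start + i) % len(nodes)].store[key] = value

write_full("config", "v1")
write_partial("user:42", {"name": "A"})

assert all("config" in n.store for n in nodes)        # copy on every node
assert sum("user:42" in n.store for n in nodes) == 2  # exactly two copies
```

The trade-off in the text is visible directly: the fully replicated item must be updated on all three nodes, while the partially replicated item touches only two.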
Consistency and Concurrency Control

In a distributed database, maintaining consistency across numerous nodes is a challenging problem.


Conflicts that result in inconsistent data might happen when many transactions are being carried out
simultaneously. Distributed databases use a variety of concurrency control techniques, such as locking,
timestamp ordering, or optimistic concurrency control, to assure consistency.
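Of the techniques just listed, locking is the simplest to sketch. The following toy (using a single in-process lock as a stand-in for a distributed lock manager, which is far more involved) shows how acquiring an exclusive lock before a read-modify-write serializes conflicting updates so none are lost.

```python
# A minimal sketch of lock-based concurrency control: a transaction must
# acquire an exclusive lock on an item before updating it, which serializes
# conflicting writes. A threading.Lock stands in for a real lock manager.

import threading

locks = {"x": threading.Lock()}
data = {"x": 0}

def increment(item, amount):
    with locks[item]:              # acquire exclusive lock on the item
        old = data[item]           # read
        data[item] = old + amount  # write; no other writer can interleave

threads = [threading.Thread(target=increment, args=("x", 1))
           for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With locking, all 100 concurrent increments survive (no lost updates).
assert data["x"] == 100
```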

Network Communication and Latency

The ability to communicate over a network is essential in distributed database systems. To get the best
performance and availability, low−latency networks and effective communication protocols are necessary.
For distributed database architecture, minimizing the quantity of data sent over the network and enhancing
data transfer rates are crucial factors.

Security and Privacy

Given that distributed databases contain data across several nodes that may be scattered geographically,
security is a crucial component. Additional security issues, such as data privacy, authentication, access
control, and encryption are brought on by the distributed architecture of the database. To secure sensitive
data and guarantee obedience to rules and privacy policies, strong security measures must be put in place.

Distributed Database Architecture

Client−Server Architecture

A common method for spreading database functionality is the client−server architecture. Clients
communicate with a central server, which controls the distributed database system, in this design. The server
is in charge of maintaining data storage, controlling access, and organizing transactions. This architecture
has several clients and servers connected. A client sends a query, and whichever server is available first
processes it. This architecture is simple to implement because of the centralised server system.
Peer−to−Peer Architecture

Each node in the distributed database system may function as both a client and a server in a peer−to−peer
architecture. Each node is linked to the others and works together to process and store data. Each node is in
charge of managing its data management and organizing node−to−node interactions. Because the loss of a
single node does not cause the system to collapse, peer−to−peer systems provide decentralized control and
high fault tolerance. This design is ideal for distributed systems with nodes that can function independently
and with equal capabilities.

Federated Architecture

Multiple independent databases with various types are combined into a single meta−database using a
federated database design. It offers a uniform interface for navigating and exploring distributed data. In the
federated design, each site maintains a separate, independent database, while the virtual database manager
internally distributes requests. When working with several data sources or legacy systems that can't be
simply updated, federated architectures are helpful.
Shared−Nothing Architecture

Data is divided up and spread among several nodes in a shared−nothing architecture, with each node in
charge of a particular portion of the data. Resources are not shared across nodes, and each node runs
independently. Due to the system's capacity to add additional nodes as needed without affecting the current
nodes, this design offers great scalability and fault tolerance. Large−scale distributed systems, such as data
warehouses or big data analytics platforms, frequently employ shared−nothing designs.

Advantages of DDMS
 The database is easier to expand as it is already spread across multiple systems and it is not too
complicated to add a system.
 The distributed database can have the data arranged according to different levels of transparency, i.e., data
with different transparency levels can be stored at different locations.
 The database can be stored according to the departmental information in an organisation. In that case, it is
easier for organisational hierarchical access.
 If there were a natural catastrophe such as a fire or an earthquake, all the data would not be destroyed, as it
is stored at different locations.
 It is cheaper to create a network of systems containing a part of the database. This database can also be
easily increased or decreased.
 Even if some of the data nodes go offline, the rest of the database can continue its normal functions.

Disadvantages of DDMS
 The distributed database is quite complex and it is difficult to make sure that a user gets a uniform view
of the database because it is spread across multiple locations.
 This database is more expensive as it is complex and hence, difficult to maintain.
 It is difficult to provide security in a distributed database as the database needs to be secured at all the
locations it is stored. Moreover, the infrastructure connecting all the nodes in a distributed database also
needs to be secured.
 It is difficult to maintain data integrity in the distributed database because of its nature. There can also be
data redundancy in the database as it is stored at multiple locations.
 The distributed database is complicated and it is difficult to find people with the necessary experience
who can manage and maintain it.

What is Distributed Data Storage:


“A distributed data store is a system that stores and processes data on multiple machines.”
Distributed cloud storage is related to traditional cloud storage in some ways, especially in the techniques
and hardware it uses. But there’s one important difference. Instead of data stored on a collection of storage
devices in one data center, distributed cloud storage is made up of data stored on clusters of storage nodes
that are geographically dispersed.
As a developer, you can think of a distributed data store as how you store and retrieve application data,
metrics, logs, etc. Some popular distributed data stores you might be familiar with are MongoDB, Amazon
Web Service’s S3, and Google Cloud Platform’s Spanner.
The storage system includes features that synchronize and coordinate data across the cluster nodes, greatly
simplifying storage rollouts and management. Since the data is distributed, you can deploy cloud-based data
monitoring tools to detect, prevent, recover from, and analyze cyber attacks. Shared storage is a big target
for ransomware attacks; the data governance features of distributed cloud storage greatly help to detect
signatures, block user sessions and endpoints, perform forensic analysis, and assist with recovery efforts
in case of an attack.

Distributed storage systems can store several types of data:

 Files—a distributed file system allows devices to mount a virtual drive, with the actual files
distributed across several machines.
 Block storage—a block storage system stores data in volumes known as blocks. This is an
alternative to a file-based structure that provides higher performance. A common distributed block
storage system is a Storage Area Network (SAN).
 Objects—a distributed object storage system wraps data into objects, identified by a unique ID or
hash.

Distributed storage systems have several advantages:

 Scalability—the primary motivation for distributing storage is to scale horizontally, adding more
storage space by adding more storage nodes to the cluster.
 Redundancy—distributed storage systems can store more than one copy of the same data, for high
availability, backup, and disaster recovery purposes.
 Cost—distributed storage makes it possible to use cheaper, commodity hardware to store large
volumes of data at low cost.
 Performance—distributed storage can offer better performance than a single server in some
scenarios, for example, it can store data closer to its consumers, or enable massively parallel access
to large files.
Features and Limitations (or: why is distributed storage important?)

Most distributed storage systems have some or all the following features:

 Partitioning—the ability to distribute data between cluster nodes and enable clients to seamlessly
retrieve the data from multiple nodes.
 Replication—the ability to replicate the same data item across multiple cluster nodes and maintain
consistency of the data as clients update it.
 Fault tolerance—the ability to retain availability to data even when one or more nodes in the
distributed storage cluster goes down.
 Elastic scalability—enabling data users to receive more storage space if needed, and enabling
storage system operators to scale the storage system up and down by adding or removing storage
units to the cluster.
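The partitioning feature above can be sketched as a thin client that routes every key to a node by hash, so callers never see which node holds an item. This is a toy under assumed names (the `Cluster` class, node count, and hashing scheme are invented for the example), not how any specific product shards data.

```python
# A minimal sketch of partitioning: keys are spread over cluster nodes by a
# stable hash, and get/put route transparently to the owning node.

import hashlib

class Cluster:
    def __init__(self, n_nodes):
        self.nodes = [{} for _ in range(n_nodes)]

    def _node_for(self, key):
        # Stable hash so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)

cluster = Cluster(4)
for i in range(100):
    cluster.put(f"item-{i}", i)

# Clients retrieve seamlessly, regardless of which node stores the key.
assert cluster.get("item-7") == 7
# All 100 keys are stored somewhere in the cluster.
assert sum(len(node) for node in cluster.nodes) == 100
```

A design note: real systems typically use consistent hashing rather than simple modulo placement, so that adding or removing a node (elastic scalability) relocates only a fraction of the keys instead of reshuffling almost all of them.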
What is Distributed Transaction:
 A distributed transaction is a set of operations that are performed across multiple data repositories. It
is also known as a global transaction.
 Distributed transactions ensure that a group of related operations are either all completed successfully
or all rolled back. This maintains data consistency and integrity.
 Distributed transactions are typically coordinated across separate nodes connected by a
network. However, they may also span multiple databases on a single server.
 For a distributed transaction to commit successfully, all the individual data sources must commit
successfully.
 The distributed transaction ensures ACID (Atomicity, Consistency, Isolation, Durability) properties
and data integrity.

What is the Need For Distributed Transaction?


Some properties are harder to implement and cannot be achieved with simple transactions; basic
single-system techniques are often not sufficient. Distributed transactions are required when there is a
need to atomically update related data that is spread across multiple databases or nodes connected in a
network.
There are two possible outcomes:

1) all operations successfully complete, or

2) none of the operations are performed at all due to a failure somewhere in the system.

In the latter case, if some work was completed prior to the failure, that work will be reversed to ensure no
net work was done. This type of operation complies with the ACID (atomicity, consistency, isolation,
durability) principles of databases that ensure data integrity. ACID is most commonly associated with
transactions on a single database server, but distributed transactions extend that guarantee across
multiple databases.

The operation known as a “two-phase commit” (2PC) is a form of a distributed transaction. “XA
transactions” are transactions using the XA protocol, which is one implementation of a two-phase commit
operation.
Steps of Distributed Transaction

 Step 1: Application to Resource – Issues Distributed Transaction.


 Step 2: Resource 1 to Resource 2 – Ask Resource 2 to Prepare to Commit.
 Step 3: Resource 2 to Resource 1 – Resource 2 Acknowledges Preparation.
 Step 4: Resource 1 to Resource 2 – Ask Resource 2 to Commit.

Atomic Commit
The atomic commit procedure should meet the following requirements:
 All participants who make a choice reach the same conclusion.
 If any participant decides to commit, then all other participants must have voted yes.
 If all participants vote yes and no failure occurs, then all participants decide to commit.

ACID properties of transactions


In the context of transaction processing, the acronym ACID refers to the four key properties of a transaction:
Atomicity, Consistency, Isolation, & Durability.

Atomicity
All changes to data are performed as if they are a single operation. That is, all the changes are performed, or
none of them are.
For example, in an application that transfers funds from one account to another, the atomicity property
ensures that, if a debit is made successfully from one account, the corresponding credit is made to the other
account.
Consistency
Data is in a consistent state when a transaction starts and when it ends.
For example, in an application that transfers funds from one account to another, the consistency property
ensures that the total value of funds in both the accounts is the same at the start and end of each transaction.
Isolation
The intermediate state of a transaction is invisible to other transactions. As a result, transactions that run
concurrently appear to be serialized.
For example, in an application that transfers funds from one account to another, the isolation property
ensures that another transaction sees the transferred funds in one account or the other, but not in both, nor in
neither.
Durability
After a transaction successfully completes, changes to data persist and are not undone, even in the event of
a system failure.
For example, in an application that transfers funds from one account to another, the durability property
ensures that the changes made to each account will not be reversed.
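The funds-transfer example used throughout can be condensed into a sketch of atomicity and consistency. This is an in-memory toy (account names and balances are invented; real durability needs a log and stable storage): either both the debit and the credit apply, or a failure rolls back to the snapshot so neither does, and the total is preserved.

```python
# A minimal sketch of atomicity in the funds-transfer example: either both
# the debit and the credit apply, or neither does. Accounts are illustrative.

accounts = {"A": 100, "B": 50}

def transfer(src, dst, amount):
    snapshot = dict(accounts)      # remember state for rollback
    try:
        if accounts[src] < amount:
            raise ValueError("insufficient funds")
        accounts[src] -= amount    # debit
        accounts[dst] += amount    # credit
    except Exception:
        accounts.clear()           # failure: undo all changes, so no
        accounts.update(snapshot)  # partial transfer is ever visible
        raise

transfer("A", "B", 30)
assert accounts == {"A": 70, "B": 80}   # both changes applied

try:
    transfer("A", "B", 1000)            # fails: insufficient funds
except ValueError:
    pass
assert accounts == {"A": 70, "B": 80}   # neither change applied

# Consistency: the total value of funds is unchanged by every transaction.
assert sum(accounts.values()) == 150
```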

Coordination in Distributed Transactions


At the time of coordination in distributed transactions, one of the servers becomes the coordinator, and the
rest of the servers become workers.
 In a simple transaction, the first server acts as the Coordinator.
 In the nested transaction, the top-level server acts as the Coordinator.
 Role of Coordinator: The coordinator keeps track of participating servers, gathers results from
workers, and makes a decision to ensure transaction consistency.
 Role of Workers: Workers are aware of the coordinator’s existence and in addition,
communicate their outcome to the coordinator and then follow the coordinator’s decision.

Distributed One-Phase Commit


A one-phase commit protocol involves a coordinator that communicates periodically with all participating
servers, instructing them to perform (commit) or cancel (abort) the transaction's actions.
One phase Commit

Distributed Two-Phase Commit


There are two phases for the commit procedure to work:
Phase 1: Voting
 A “prepare message” is sent to each participating worker by the coordinator.
 The coordinator must wait until a response whether ready or not ready is received from each
worker, or a timeout occurs.
 Workers must wait until the coordinator sends the “prepare” message.
 If a transaction is ready to commit then a “ready” message is sent to the coordinator.
 If a transaction is not ready to commit, then a "no" message is sent to the coordinator, and the
transaction is aborted.

Phase 2: Completion of the voting result


 In this phase, the coordinator checks the votes. Only if every worker sent a "ready" message is a
"commit" message sent to each worker; otherwise, an "abort" message is sent to each worker.
 Now, wait for acknowledgment until it is received from each worker.
 In this phase, Workers wait until the coordinator sends a “commit” or “abort” message; then act
according to the message received.
 At last, Workers send an acknowledgment to the Coordinator.

Two phase Commit

The Atomic Commit Protocol is essential to maintaining transaction dependability and consistency
in distributed systems. Its main goal is to ensure that, even in the event of a failure, a transaction is
either fully committed or fully aborted across all participating nodes.
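The two phases described above can be sketched in a few lines of Python. This is a minimal, single-process illustration only: the `Coordinator` and `Worker` classes, the message strings, and the voting flags are all hypothetical, and a real implementation would add timeouts, logging, and recovery on failure.

```python
# Minimal two-phase commit sketch (illustrative only; all names are
# hypothetical, not taken from any specific library or system).

class Worker:
    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit   # whether this worker votes "ready"
        self.state = "active"

    def prepare(self):
        # Phase 1: respond to the coordinator's "prepare" message
        return "ready" if self.can_commit else "not ready"

    def finish(self, decision):
        # Phase 2: act on the coordinator's global decision
        self.state = "committed" if decision == "commit" else "aborted"
        return "ack"                   # acknowledgment back to the coordinator


class Coordinator:
    def run(self, workers):
        # Phase 1 (voting): commit only if every worker votes "ready"
        votes = [w.prepare() for w in workers]
        decision = "commit" if all(v == "ready" for v in votes) else "abort"
        # Phase 2 (completion): broadcast the decision, collect acknowledgments
        acks = [w.finish(decision) for w in workers]
        assert all(a == "ack" for a in acks)
        return decision
```

For example, `Coordinator().run([Worker("w1", True), Worker("w2", False)])` returns `"abort"`, because a single "not ready" vote forces the whole transaction to abort on every node.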
Concurrency Control in Distributed Database

Concurrency Control is the management procedure that is required for controlling concurrent
execution of the operations that take place on a database.
Concurrency control in distributed databases is the process of coordinating concurrent access to a database
in a multiuser database management system (DBMS). It allows users to access a database in a multi-
programmed way while maintaining the illusion that each user is executing alone on a dedicated system.
Concurrency controlling techniques ensure that multiple transactions are executed simultaneously while
maintaining the ACID properties of the transactions and serializability in the schedules.

o In a multi-user system, multiple users can access and use the same database at one time, which is
known as concurrent execution of the database. It means that the same database is accessed
simultaneously by different users on a multi-user system.
o While working on the database transactions, there occurs the requirement of using the database by
multiple users for performing different operations, and in that case, concurrent execution of the
database is performed.

o The simultaneous execution should be performed in an interleaved manner, and no operation
should affect the other executing operations, thus maintaining the consistency of the database.
Concurrent execution of transaction operations therefore gives rise to several challenging
problems that need to be solved.

o Types of concurrency problems in DBMS are as follows:
 Lost Update Problem (write-write conflict)
 Temporary Update Problem (dirty read problem)
 Unrepeatable Read Problem
 Incorrect Summary Problem
 Phantom Read Problem

Problems with Concurrent Execution

In a database transaction, the two main operations are READ and WRITE. These two operations must be
managed during the concurrent execution of transactions, because if they are interleaved carelessly the data
may become inconsistent. The following problems occur with concurrent execution of the operations:
Problem 1: Lost Update Problems (W - W Conflict)

The problem occurs when two different database transactions perform the read/write operations on the
same database items in an interleaved manner (i.e., concurrent execution) that makes the values of the items
incorrect hence making the database inconsistent.

For example:

Consider the below diagram where two transactions TX and TY are performed on the same account A,
where the balance of account A is Rs 300.

Time | TX                        | TY                        | Balance of A
t1   | Read(A) = 300             |                           | 300
t2   | A = A - 50 = 250 (local)  |                           | 300
t3   |                           | Read(A) = 300             | 300
t4   |                           | A = A + 100 = 400 (local) | 300
t6   | Write(A) = 250            |                           | 250
t7   |                           | Write(A) = 400            | 400

o At time t1, transaction TX reads the value of account A, i.e., Rs 300 (read only).
o At time t2, transaction TX deducts Rs 50 from account A, which becomes Rs 250 (only deducted
locally, not yet written).
o Alternately, at time t3, transaction TY reads the value of account A, which is still Rs 300 because
TX has not written the value yet.
o At time t4, transaction TY adds Rs 100 to account A, which becomes Rs 400 (only added locally,
not yet written).
o At time t6, transaction TX writes the value of account A, which is updated to Rs 250, as
TY has not written its value yet.
o Similarly, at time t7, transaction TY writes the value of account A as computed at time t4,
i.e., Rs 400. The value written by TX is lost, i.e., Rs 250 is lost.

Hence data becomes incorrect, and database sets to inconsistent.
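The interleaving above can be replayed as a short Python sketch against a hypothetical in-memory "account" (a plain dictionary, not a real DBMS), showing how TY's write at t7 overwrites TX's update:

```python
# Replaying the lost-update timeline on a shared in-memory balance.

balance = {"A": 300}

# t1/t2: TX reads A and deducts Rs 50 in a local variable (not yet written)
tx_local = balance["A"] - 50        # 250

# t3/t4: TY reads the still-unchanged A and adds Rs 100 locally
ty_local = balance["A"] + 100       # 400, computed from the old value

# t6: TX writes its result
balance["A"] = tx_local             # A = 250

# t7: TY overwrites it, losing TX's update
balance["A"] = ty_local             # A = 400; the Rs 50 deduction is lost

print(balance["A"])                 # 400 (a serial execution would give 350)
```

Running the transactions one after the other (serially) would instead yield 300 - 50 + 100 = 350, which is why this interleaving is non-serializable.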

Dirty Read Problems (W-R Conflict)

The dirty read problem occurs when one transaction updates an item of the database, then the
transaction fails, and before the data is rolled back, the updated database item is accessed by another
transaction. This creates a write-read conflict between the two transactions.

For example:

Consider two transactions TX and TY in the below diagram performing read/write operations on
account A where the available balance in account A is Rs 300:

Time | TX                        | TY             | Balance of A
t1   | Read(A) = 300             |                | 300
t2   | A = A + 50 = 350 (local)  |                | 300
t3   | Write(A) = 350            |                | 350
t4   |                           | Read(A) = 350  | 350
t5   | Rollback                  |                | 300

o At time t1, transaction TX reads the value of account A, i.e., Rs 300.


o At time t2, transaction TX adds Rs 50 to account A that becomes Rs 350.
o At time t3, transaction TX writes the updated value in account A, i.e., Rs 350.
o Then at time t4, transaction TY reads account A that will be read as Rs 350.
o Then at time t5, transaction TX rolls back due to a server problem, and the value changes back to Rs
300 (as initially).
o But transaction TY has already read Rs 350, a value that was never committed. This is a dirty read
and is therefore known as the Dirty Read Problem.
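The same timeline can be replayed in Python with a hypothetical in-memory account, keeping a "before image" of the value so TX can roll back at t5:

```python
# Replaying the dirty-read timeline on an in-memory balance.

balance = {"A": 300}

# t1-t3: TX reads A, adds Rs 50, and writes the uncommitted value
before_image = balance["A"]         # 300, saved so TX can roll back
balance["A"] = before_image + 50    # A = 350 (uncommitted)

# t4: TY reads the uncommitted value -- a dirty read
ty_read = balance["A"]              # 350

# t5: TX fails and rolls back; A returns to its before image
balance["A"] = before_image

print(ty_read, balance["A"])        # 350 300: TY saw a value that never existed
```

TY has acted on Rs 350, a value the database never committed, which is exactly the anomaly described above.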

Unrepeatable Read Problem (R-W Conflict)

Also known as the Inconsistent Retrievals Problem, it occurs when, within one transaction, two different
values are read for the same database item.

For example:

Consider two transactions, TX and TY, performing the read/write operations on account A, having an
available balance = Rs 300. The diagram is shown below:

Time | TX             | TY                        | Balance of A
t1   | Read(A) = 300  |                           | 300
t2   |                | Read(A) = 300             | 300
t3   |                | A = A + 100 = 400 (local) | 300
t4   |                | Write(A) = 400            | 400
t5   | Read(A) = 400  |                           | 400

o At time t1, transaction TX reads the value from account A, i.e., Rs 300.
o At time t2, transaction TY reads the value from account A, i.e., Rs 300.
o At time t3, transaction TY updates the value of account A by adding Rs 100 to the available balance,
which then becomes Rs 400.
o At time t4, transaction TY writes the updated value, i.e., Rs 400.
o After that, at time t5, transaction TX reads the available value of account A, which is now read as
Rs 400.
o It means that within the same transaction TX, two different values of account A are read: Rs 300
initially, and Rs 400 after the update made by transaction TY. This is an unrepeatable read and is
therefore known as the Unrepeatable Read Problem.
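This anomaly, too, is easy to replay on a hypothetical in-memory account: TX reads the same item twice and observes two different values because TY writes in between.

```python
# Replaying the unrepeatable-read timeline on an in-memory balance.

balance = {"A": 300}

first_read = balance["A"]           # t1: TX reads 300
_ = balance["A"]                    # t2: TY also reads 300
balance["A"] = balance["A"] + 100   # t3-t4: TY updates A and writes 400
second_read = balance["A"]          # t5: TX reads again and now gets 400

print(first_read, second_read)      # 300 400: the read is not repeatable
```

Under a serializable execution, TX would see the same value of A on both reads; here the two reads disagree, which is the defining symptom of the problem.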

Thus, in order to maintain consistency in the database and avoid such problems that take place in concurrent
execution, management is needed, and that is where the concept of Concurrency Control comes into role.

Concurrency Control Protocols

The concurrency control protocols ensure the atomicity, consistency, isolation, durability, and
serializability of the concurrent execution of the database transactions. Therefore, these
protocols are categorized as:

o Lock Based Concurrency Control Protocol


o Time Stamp Concurrency Control Protocol
o Validation Based Concurrency Control Protocol

Lock-Based Protocol

In this type of protocol, a transaction cannot read or write data until it acquires an appropriate lock on it.
There are two types of lock:

1. Shared lock:

o It is also known as a read-only lock. With a shared lock, the data item can only be read by the
transaction.
o It can be shared between transactions because while a transaction holds a shared lock, it cannot
update the data item.

2. Exclusive lock:

o With an exclusive lock, the data item can be both read and written by the transaction.
o This lock is exclusive: only one transaction can hold it at a time, so multiple transactions cannot
modify the same data simultaneously.
Timestamp Ordering Protocol
o The Timestamp Ordering Protocol is used to order the transactions based on their timestamps. The
order of the transactions is simply the ascending order of their creation times.
o An older transaction has higher priority, which is why it executes first. To determine the timestamp
of a transaction, this protocol uses the system time or a logical counter.
o The lock-based protocol manages the order between conflicting pairs of transactions at
execution time, but timestamp-based protocols start working as soon as a transaction is created.
o Let's assume there are two transactions, T1 and T2. Suppose transaction T1 entered the
system at time 007 and transaction T2 entered the system at time 009. T1 has the higher
priority, so it executes first because it entered the system first.
o The timestamp ordering protocol also maintains the timestamps of the last 'read' and 'write'
operations on each data item.

Validation Based Protocol

The validation-based protocol is also known as the optimistic concurrency control technique. In the
validation-based protocol, a transaction is executed in the following three phases:

1. Read phase: In this phase, transaction T reads the values of various data items and stores them in
temporary local variables. It performs all of its write operations on these temporary variables,
without updating the actual database.
2. Validation phase: In this phase, the temporary variable values are validated against the actual
data to see whether they violate serializability.
3. Write phase: If the transaction passes validation, the temporary results are written to the
database; otherwise, the transaction is rolled back.

Here each phase has the following different timestamps:

Start (Ti): It contains the time when Ti started its execution.

Validation (Ti): It contains the time when Ti finishes its read phase and starts its validation phase.

Finish(Ti): It contains the time when Ti finishes its write phase.

o This protocol determines the timestamp used to serialize the transaction from the timestamp of the
validation phase, as this is the phase that actually decides whether the transaction will
commit or roll back.
o Hence TS(T) = Validation(T).
o Serializability is determined during the validation process; it cannot be decided in advance.
o While executing transactions, this approach ensures a greater degree of concurrency with fewer
conflicts.
o Thus it results in transactions with fewer rollbacks.
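The three phases can be sketched as below. This is a simplified backward-validation illustration, not a complete protocol: `begin`, `validate_and_write`, and the read/write-set bookkeeping are hypothetical names, and the global `clock` stands in for Validation(T).

```python
# Optimistic (validation-based) concurrency control sketch.
# A transaction buffers its writes and records what it read; at
# validation it checks for conflicts with transactions that committed
# after it started.

committed = []   # list of (finish_ts, write_set) of committed transactions
clock = 0        # logical clock assigning validation timestamps

def begin():
    # Read phase starts: remember when we began, buffer reads/writes locally.
    return {"start": clock, "reads": set(), "writes": {}}

def validate_and_write(db, txn):
    global clock
    # Validation phase: fail if any transaction that finished after we
    # started wrote an item we read (our reads may be stale).
    for finish_ts, wset in committed:
        if finish_ts > txn["start"] and wset & txn["reads"]:
            return False                      # roll back
    # Write phase: apply buffered writes and record our footprint.
    db.update(txn["writes"])
    clock += 1
    committed.append((clock, set(txn["writes"])))
    return True
```

For example, if two transactions both read item A and buffer conflicting writes, whichever validates first commits; the second then fails validation (its read of A is stale) and is rolled back, exactly as the lost-update example earlier would require.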
Heterogeneous Database

A heterogeneous distributed database system (HDDS) is a database system whose sites use different
schemas, operating systems, DDBMSs, and data models. At least one of the databases in an HDDS is
different from the others.

Various operating systems and database applications may be used by various machines. They could even
employ separate database data models. Translations are therefore necessary for communication across
various sites.

Here are some things that can happen in an HDDS:


 Different sites
Different sites can use different software and schemas, which can cause problems with transactions and
query processing. One site may not be aware of the other sites, which can lead to limited cooperation in
processing user requests.
 Different hardware
A heterogeneous distributed database can involve a variety of hardware and software across its
sites, whereas a homogeneous one uses the same DBMS and schema at every site.
 Time-consuming tasks
Managing data representation, translation, and query processing can be time-consuming in an
HDDS.

What is a Cloud Database?


A cloud database is a database that is deployed, delivered, and accessed in the cloud. Cloud databases
organize and store structured, unstructured, and semi-structured data just like traditional on-premises
databases. However, they also provide many of the same benefits of cloud computing, including speed,
scalability, agility, and reduced costs.

Cloud database definition


A cloud database is a database built to run in a public or hybrid cloud environment to help organize, store,
and manage data within an organization. Cloud databases can be offered as a managed database-as-a-service
(DBaaS) or deployed on a cloud-based virtual machine (VM) and self-managed by an in-house IT team.

Types of cloud databases

Like a traditional on-premises database, cloud databases can be classified into relational databases and non-
relational databases.

 Relational cloud databases consist of one or more tables of columns and rows and allow you to
organize data in predefined relationships to understand how data is logically related. These databases
typically use a fixed data schema, and you can use structured query language (SQL) to query and
manipulate data. They are highly consistent, reliable, and best suited to dealing with large amounts of
structured data.

Examples of relational databases include SQL Server, Oracle, MySQL, PostgreSQL, Spanner, and Cloud
SQL.

 Non-relational cloud databases store and manage unstructured data, such as email and mobile
message text, documents, surveys, rich media files, and sensor data. They don’t follow a clearly
defined schema like relational databases and allow you to save and organize information regardless
of its format.

Examples of non-relational databases include MongoDB, Redis, Cassandra, Hbase, and Cloud Bigtable.

Why use a cloud database?

The amount of data generated and collected today is growing exponentially. It’s not only more varied, but
also wildly disparate. Data can now reside across on-premises databases and distributed cloud applications
and services, making it difficult to integrate using traditional approaches. In addition, real-time data
processing is becoming essential to business success—delays and lags in data delivery to mission-critical
applications could have catastrophic consequences.

As cloud adoption accelerates and the way we use data continues to evolve, legacy databases face significant
challenges.

Cloud databases provide flexibility, reliability, security, affordability, and more, providing a solid
foundation for building modern business applications. In particular, they can rapidly adapt to changing
workloads and demands without adding to the load on already overburdened teams.
