
20it403 DBMS Digital Material Unit V

This document outlines the course content and objectives for the 20IT403 Database Management Systems course. The course is divided into 5 units that cover database concepts, design, transactions, data storage and querying, and advanced topics. The course objectives are to understand basic database concepts, SQL, database design, transaction processing, querying, and how advanced databases differ from traditional databases. The document also includes information on pre-requisites, syllabus, course outcomes, lecture plan, and activities for learning.


20IT403 – DATABASE MANAGEMENT SYSTEMS
Department: IT
Batch/Year: 2020-24/II
Created by:
Ms.R.Asha
AP/IT

Date: 25.05.2022
1. TABLE OF CONTENTS

1. Contents
2. Course Objectives
3. Pre Requisites
4. Syllabus
5. Course Outcomes
6. CO-PO/PSO Mapping
7. Lecture Plan
8. Activity Based Learning
9. Lecture Notes
10. Assignments
11. Part A Questions & Answers
12. Part B Questions & Answers
13. Supportive Online Certification Courses
14. Real Time Applications in Day to Day Life and to Industry
15. Contents Beyond the Syllabus
16. Assessment Schedule
17. Prescribed Text Books & Reference Books
18. Mini Project Suggestions


2. COURSE OBJECTIVES

 To understand the basic concepts of Data modeling and Database Systems.

 To understand SQL and effective relational database design concepts.


 To know the fundamental concepts of transaction processing, concurrency control
techniques and recovery procedure.

 To understand efficient data querying and updates, with the needed configuration.

 To learn how to efficiently design and implement various database objects and entities
3. PRE REQUISITES

• 20GE101 Problem Solving and C Programming

• 20CS201 Data Structures

4. SYLLABUS
DATABASE MANAGEMENT SYSTEMS

UNIT I DATABASE CONCEPTS 9

Concept of Database and Overview of DBMS – Characteristics of databases – Database Language – Types of DBMS architecture – Three-Schema Architecture – Introduction to data model types – ER Model – ER Diagrams – Extended ER Diagram – Reducing ER to tables – Applications: ER model of University Database Application.

SQL fundamentals – Views – Integrity – Procedures, Functions, Cursors and Triggers – Embedded SQL – Dynamic SQL.

UNIT II DATABASE DESIGN 9

Design a DB for Car Insurance Company – Draw ER diagram and convert ER model to relational schema – Evaluating data model quality – The Relational Model: Schema, Keys – Relational Algebra – Domain Relational Calculus – Tuple Relational Calculus – Fundamental operations – Relational Database Design and Querying – Undesirable Properties of Relations – Functional Dependency: Closures – Single Valued Dependency – Single Valued Normalization (1NF, 2NF, 3NF and BCNF) – Desirable Properties of Decompositions – 4NF – 5NF – De-normalization

UNIT III TRANSACTIONS 9


Transaction Concepts – ACID Properties – Schedules – Serializability – Concurrency
Control – Need for Concurrency – Locking Protocols – Two Phase Locking – Deadlock
– Transaction Recovery - Save Points – Isolation Levels – SQL Facilities for
Concurrency and Recovery

UNIT IV DATA STORAGE AND QUERYING 9


RAID – File Organization – Organization of Records in Files – Indexing and Hashing
–Ordered Indices – B+ tree Index Files – B tree Index Files – Static Hashing –
Dynamic Hashing – Overview of physical storage structure- stable storage, failure
classification -log based recovery, deferred database modification, check-pointing-
File Structures:-Index structures-Primary, Secondary and clustering indices. Single
and multilevel indexing.

Query Processing Overview – Algorithms for SELECT and JOIN operations – Query
optimization using Heuristics and Cost Estimation

UNIT V ADVANCED TOPICS 9

Distributed database – Implementation – Concurrent transactions – Concurrency control: Lock based – Time stamping – Validation based – NoSQL, NoSQL Categories – Designing an enterprise database system – Client Server database
5. COURSE OUTCOMES

CO1: Implement SQL and effective relational database design concepts.

CO2: Map ER model to Relational model to perform database design effectively.

CO3: Compare and contrast various indexing strategies in different database systems.

CO4: Implement queries using normalization criteria and optimization techniques.

CO5: Analyse how advanced databases differ from traditional databases.

CO6: Design and deploy an efficient and scalable data storage node for varied kind
of application requirements.
6. CO- PO/PSO MAPPING

      PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
CO1    2   1   1   1   1   1   1   2   2   2    2    2    2    2    2
CO2    3   2   2   1   1   1   1   2   2   2    2    2    2    2    2
CO3    2   1   1   1   1   1   1   2   2   2    2    2    2    2    2
CO4    2   1   1   1   1   1   1   2   2   2    2    2    2    2    2
CO5    2   1   1   1   1   1   1   2   2   2    2    2    2    2    2
CO6    2   1   2   1   1   1   1   3   2   2    2    2    2    2    2
7. LECTURE PLAN

UNIT V - ADVANCED TOPICS

Sl.  Topic                                          No. of   Proposed     Actual       CO   Taxonomy  Mode of
No.                                                 Periods  Date         Date              Level     Delivery
1    Distributed database Implementation            1        18.05.2022   18.05.2022   CO5  K2        PPT/Online Lecture
2    Concurrent transactions - Concurrency control  1        19.05.2022   19.05.2022   CO5  K2        PPT/Online Lecture
3    Lock based                                     1        19.05.2022   19.05.2022   CO5  K2        PPT/Online Lecture
4    Time stamping                                  1        21.05.2022   21.05.2022   CO5  K3        PPT/Online Lecture
5    Validation based                               1        24.05.2022   24.05.2022   CO5  K3        PPT/Online Lecture
6    NoSQL                                          1        24.05.2022   24.05.2022   CO6  K2        PPT/Online Lecture
7    NoSQL Categories                               1        25.05.2022   25.05.2022   CO6  K2        PPT/Online Lecture
8    Designing an enterprise database system        1                                                 PPT/Online Lecture
8. ACTIVITY BASED LEARNING

Crossword Puzzle

Across
1. One example of a DBMS
4. One of the characteristics of a distributed database
6. One of the IR models
7. Way to achieve reliability in a distributed database
9. One of the IR models
10. Instruction given for retrieving data from a database
12. Refers to the behavior of an object
13. One of the characteristics of a distributed database
14. Duration for which a variable exists in memory
15. Overall design of the database

Down
1. Type of distributed database
2. Set of instructions which forms a logical unit
3. One of the characteristics of a distributed database
5. Way to achieve availability in a distributed database
8. One of the Object oriented concepts
11. Process of accessing information from the storage medium
9. LECTURE NOTES

Unit V

DISTRIBUTED DATABASES

A distributed database is a database in which not all storage devices are attached to
a common processor. It may be stored in multiple computers, located in the same
physical location; or may be dispersed over a network of interconnected computers.

Distributed database is a system in which storage devices are not connected to a


common processing unit.

Database is controlled by Distributed Database Management System and data may


be stored at the same location or spread over the interconnected network. It is a
loosely coupled system. Shared nothing architecture is used in distributed databases.

Distributed Database System


Communication channel is used to communicate with the different locations and
every system has its own memory and database.

Reliability: In a distributed database system, if one system fails or stops working for some time, another system can complete the task.

Availability: In a distributed database system, even if a server fails, another system is available to serve the client request.

Performance: Performance can be achieved by distributing the database over different locations, so that the databases are available at every location, which makes them easy to maintain.

Homogeneous distributed databases system:


 Homogeneous distributed database system is a network of two or more
databases (With same type of DBMS software) which can be stored on one or
more machines.

 In this system, data can be accessed and modified simultaneously on several databases in the network. Homogeneous distributed systems are easy to handle. Example: Consider three departments using Oracle-9i as the DBMS. If a change is made in one department, it is reflected in the other departments as well.

Heterogeneous distributed database system:


 Heterogeneous distributed database system is a network of two or more
databases with different types of DBMS software, which can be stored on one
or more machines.

 In this system, data can be accessed across several databases in the network with the help of generic connectivity (ODBC and JDBC).
Example: In the following diagram, different DBMS software are accessible to each
other using ODBC and JDBC.

Distributed DBMS Architectures


DDBMS architectures are generally developed depending on three parameters −
Distribution − It states the physical distribution of data across the
different sites.
Autonomy − It indicates the distribution of control of the database system
and the degree to which each constituent DBMS can operate independently.
Heterogeneity − It refers to the uniformity or dissimilarity of the data
models, system components and databases.

 Architectural Models
Some of the common architectural models are –
 Client - Server Architecture for DDBMS
 Peer - to - Peer Architecture for DDBMS
 Multi - DBMS Architecture

Client - Server Architecture for DDBMS


This is a two-level architecture where the functionality is divided into servers and
clients. The server functions primarily encompass data management, query
processing, optimization and transaction management.
Client functions mainly include the user interface, though clients also perform some functions such as consistency checking and transaction management. The two different client-server architectures are:

 Single Server Multiple Client

 Multiple Server Multiple Client (shown in the following diagram)

Peer- to-Peer Architecture for DDBMS


In these systems, each peer acts both as a client and a server for imparting database services. The peers share their resources with other peers and coordinate their activities.

This architecture generally has four levels of schemas −


 Global Conceptual Schema − Depicts the global logical view of data.
 Local Conceptual Schema − Depicts logical data organization at each site.

 Local Internal Schema − Depicts physical data organization at each site.


 External Schema − Depicts user view of data.
Multi - DBMS Architectures
This is an integrated database system formed by a collection of two or more
autonomous database systems. Multi-DBMS can be expressed through six levels of
schemas −

 Multi-database View Level − Depicts multiple user views comprising subsets of the integrated distributed database.
 Multi-database Conceptual Level − Depicts the integrated multi-database that comprises global logical multi-database structure definitions.
 Multi-database Internal Level − Depicts the data distribution across different sites and multi-database to local data mapping.
 Local database View Level − Depicts the public view of local data.
 Local database Conceptual Level − Depicts local data organization at each site.
 Local database Internal Level − Depicts physical data organization at each site.

There are two design alternatives for multi-DBMS:
 Model with multi-database conceptual level.
 Model without multi-database conceptual level.
DATA STORAGE & TRANSACTION PROCESSING

DISTRIBUTED DATA STORAGE


Consider a relation r that is to be stored in the database. There are two approaches
to storing this relation in the distributed database:
 Replication: The system maintains several identical replicas of the relation, and stores each replica at a different site. The alternative to replication is to store only one copy of relation r.

 Fragmentation: The system partitions the relation into several fragments, and stores each fragment at a different site.

Data Replication: If relation r is replicated, a copy of relation r is stored in two or more sites. In the most extreme case, we have full replication, in which a copy is stored in every site in the system.

 There are a number of advantages and disadvantages to replication.


Availability: If one of the sites containing relation r fails, then the relation r
can be found in another site. Thus, the system can continue to process queries
involving r, despite the failure of one site.
 Increased parallelism: In the case where the majority of accesses to the
relation r result in only the reading of the relation, then several sites can process
queries involving r in parallel. The more replicas of r there are, the greater the
chance that the needed data will be found in the site where the transaction is
executing. Hence, data replication minimizes movement of data between sites.
 Increased overhead on update: The system must ensure that all replicas of a
relation r are consistent; otherwise, erroneous computations may result. Thus,
whenever r is updated, the update must be propagated to all sites containing
replicas. The result is increased overhead. For example, in a banking system,
where account information is replicated in various sites, it is necessary to ensure
that the balance in a particular account agrees in all sites.
Data Fragmentation
If relation r is fragmented, r is divided into a number of fragments r1, r2,...,rn.
These fragments contain sufficient information to allow reconstruction of the original
relation r. There are two different schemes for fragmenting a relation: horizontal
fragmentation and vertical fragmentation.
Horizontal fragmentation splits the relation by assigning each tuple of r to one
or more fragments.
 Vertical fragmentation splits the relation by decomposing the scheme R of relation
r.
In horizontal fragmentation, a relation r is partitioned into a number of subsets, r1,
r2,...,rn. Each tuple of relation r must belong to at least one of the fragments, so that

the original relation can be reconstructed, if needed.


account1 = σ branch_name = "Hillside" (account)
account2 = σ branch_name = "Valleyview" (account)
Horizontal fragmentation is usually used to keep tuples at the sites where they are
used the most, to minimize data transfer.
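The idea can be sketched in a few lines of Python (illustrative only: the `account` tuples and branch names follow the Hillside/Valleyview example above, but the data is made up):

```python
# A minimal sketch of horizontal fragmentation: each fragment is a
# selection on the branch_name attribute, and the original relation is
# the union of the fragments, so it can always be reconstructed.

account = [
    {"account_no": "A-101", "branch_name": "Hillside",   "balance": 500},
    {"account_no": "A-215", "branch_name": "Valleyview", "balance": 700},
    {"account_no": "A-305", "branch_name": "Hillside",   "balance": 350},
]

def horizontal_fragment(relation, attribute):
    """Partition tuples by the value of `attribute` (one fragment per site)."""
    fragments = {}
    for tup in relation:
        fragments.setdefault(tup[attribute], []).append(tup)
    return fragments

fragments = horizontal_fragment(account, "branch_name")

# Reconstruction: the union of all fragments yields the original relation.
reconstructed = [t for frag in fragments.values() for t in frag]
assert sorted(t["account_no"] for t in reconstructed) == \
       sorted(t["account_no"] for t in account)
```

Each fragment would then be stored at the site (e.g., the branch) where its tuples are accessed most often.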

Transparency
The user of a distributed database system should not be required to know where the
data are physically located nor how the data can be accessed at the specific local
site. This characteristic, called data transparency, can take several forms:
 Fragmentation transparency. Users are not required to know how a relation
has been fragmented.
 Replication transparency. Users view each data object as logically unique. The
distributed system may replicate an object to increase either system performance
or data availability. Users do not have to be concerned with what data objects have
been replicated, or where replicas have been placed.
 Location transparency. Users are not required to know the physical location of
the data. The distributed database system should be able to find any data as
long as the data identifier is supplied by the user transaction.

DISTRIBUTED TRANSACTIONS
There are two types of transaction that we need to consider.
Local transactions are those that access and update data in only one local
database;
 Global transactions are those that access and update data in several local
databases.

System Structure
Each site has its own local transaction manager, whose function is to ensure the ACID
properties of those transactions that execute at that site. The various transaction
managers cooperate to execute global transactions. To understand how
such a manager can be implemented, consider an abstract model of a transaction
system, in which each site contains two subsystems:

 The transaction manager


manages the execution of those transactions (or sub-transactions) that access data stored at the local site, maintains a log for recovery purposes, and participates in an appropriate concurrency-control scheme to coordinate the concurrent execution of the transactions executing at that site.

 The transaction coordinator coordinates the execution of the various


transactions (both local and global) initiated at that site.
The transaction coordinator subsystem is not needed in the centralized environment,
since a transaction accesses data at only a single site. A transaction coordinator, as its
name implies, is responsible for coordinating the execution of all the transactions
initiated at that site.

For each such transaction, the coordinator is responsible for:


• Starting the execution of the transaction.
• Breaking the transaction into a number of sub transactions and distributing these
sub transactions to the appropriate sites for execution.
• Coordinating the termination of the transaction, which may result in the
transaction being committed at all sites or aborted at all sites.
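These three responsibilities can be sketched as a toy coordinator in Python (a simplification of the real termination protocol; the site names and `execute_at_site` callback are hypothetical):

```python
# Toy transaction coordinator: splits a global transaction into per-site
# subtransactions and commits everywhere only if every site can execute
# its part (all-or-nothing termination, greatly simplified).

def coordinate(operations, execute_at_site):
    """operations: list of (site, op); execute_at_site(site, ops) -> bool."""
    # 1. Break the transaction into subtransactions, one per site.
    subtransactions = {}
    for site, op in operations:
        subtransactions.setdefault(site, []).append(op)
    # 2. Distribute the subtransactions and collect each site's outcome.
    results = {site: execute_at_site(site, ops)
               for site, ops in subtransactions.items()}
    # 3. Commit at all sites or abort at all sites.
    return "commit" if all(results.values()) else "abort"

ok = coordinate([("S1", "w(A)"), ("S2", "w(B)")], lambda s, ops: True)
bad = coordinate([("S1", "w(A)"), ("S2", "w(B)")], lambda s, ops: s != "S2")
```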

System Failure Modes

• Failure of a site
• Loss of messages
• Failure of a communication link
• Network partition

Concurrency Control in Distributed Databases
There are various concurrency control techniques:

• Locking Protocols
• Timestamp based Protocol
• Validation based protocol
Locking Protocols
Single Lock-Manager Approach
• In the single lock-manager approach, the system maintains a single lock manager
that resides in a single chosen site—say Si .

• All lock and unlock requests are made at site Si .


• When a transaction needs to lock a data item, it sends a lock request to Si .
• The lock manager determines whether the lock can be granted immediately.
• If the lock can be granted, the lock manager sends a message to that effect to
the site at which the lock request was initiated.

• Otherwise, the request is delayed until it can be granted, at which time

a message is sent to the site at which the lock request was initiated. The
transaction can read the data item from any one of the sites at which a replica of

the data item resides.


• In the case of a write, all the sites where a replica of the data item resides must
be involved in the writing.

Advantages:
• Simple implementation: This scheme requires two messages for handling lock requests and one message for handling unlock requests.
• Simple deadlock handling: Since all lock and unlock requests are made at one site, the deadlock-handling algorithms can be applied directly.

Disadvantages:
• Bottleneck: The site Si becomes a bottleneck, since all requests must be processed there.
• Vulnerability: If the site Si fails, the concurrency controller is lost. Either processing must stop, or a recovery scheme must be used so that a backup site can take over lock management from Si.
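The grant/delay decision made at site Si can be sketched in Python (a simplified lock table; the transaction and item names are hypothetical, and a real manager would queue delayed requests rather than just reporting them):

```python
# Sketch of the single lock-manager approach: one chosen site Si keeps
# the entire lock table. Shared (S) locks are compatible with each
# other; any other combination is delayed until the holder releases.

class SingleLockManager:
    def __init__(self):
        self.table = {}  # data item -> (mode, set of holding transactions)

    def request(self, txn, item, mode):
        held = self.table.get(item)
        if held is None:
            self.table[item] = (mode, {txn})
            return "granted"
        if mode == "S" and held[0] == "S":
            held[1].add(txn)        # shared locks are compatible
            return "granted"
        return "delayed"            # incompatible: wait until released

    def release(self, item):
        self.table.pop(item, None)

lm = SingleLockManager()            # the lock manager at site Si
r1 = lm.request("T1", "Q", "S")     # granted
r2 = lm.request("T2", "Q", "S")     # granted (S is compatible with S)
r3 = lm.request("T3", "Q", "X")     # delayed (X conflicts with S)
```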
Distributed Lock Manager
• A compromise between the advantages and disadvantages can be achieved
through the distributed-lock-manager approach, in which the lock-manager
function is distributed over several sites.

• Each site maintains a local lock manager whose function is to


administer the lock and unlock requests for those data items that are stored in
that site.

• When a transaction wishes to lock a data item Q that is not replicated and
resides at site Si , a message is sent to the lock manager at site Si requesting a
lock (in a particular lock mode).
• If data item Q is locked in an incompatible mode, then the request is
delayed until it can be granted. Once it has determined that the lock request
can be granted, the lock manager sends a message back to the initiator
indicating that it has granted the lock request.

Advantage:

• simple implementation, and reduces the degree to which the coordinator is a


bottleneck.
• It has a reasonably low overhead, requiring two message transfers for handling
lock requests, and one message transfer for handling unlock requests.

Disadvantage
• Deadlock handling is more complex, since the lock and unlock requests are no
longer made at a single site:
• There may be intersite deadlocks even when there is no deadlock within a
single site.
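The routing of a request to the site storing the item can be sketched as follows (illustrative: the item placement, site names, and the all-conflicts-wait rule are simplifying assumptions):

```python
# Sketch of the distributed-lock-manager approach: each site keeps a
# local lock table only for the data items stored at that site, and a
# request for item Q is routed to Q's site and handled there.

placement = {"Q": "S1", "R": "S2"}       # data item -> site storing it
local_tables = {"S1": {}, "S2": {}}      # one lock table per site

def request_lock(txn, item, mode):
    site = placement[item]               # route the request to Q's site
    table = local_tables[site]
    if item not in table:
        table[item] = (mode, txn)
        return (site, "granted")
    return (site, "delayed")             # simplified: any conflict waits

granted = request_lock("T1", "Q", "X")   # handled locally at S1
delayed = request_lock("T2", "Q", "S")   # must wait at S1
other = request_lock("T2", "R", "S")     # independent lock table at S2
```

Because T2 now waits at S1 while possibly holding locks at S2, intersite deadlocks can arise even though no single site's table shows a cycle.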

Several variants of this approach


• Primary copy
• Majority protocol
• Biased protocol
• Quorum consensus
Primary Copy
• When a system uses data replication, one of the replicas is chosen as the primary copy.

• For each data item Q, the primary copy of Q must reside in precisely one site, which we call the primary site of Q.


• When a transaction needs to lock a data item Q, it requests a lock at the
primary site of Q.

• As before, the response to the request is delayed until it can be granted.


• The primary copy enables concurrency control for replicated data to be handled
like that for unreplicated data.

Advantage:
• allows for a simple implementation.
Disadvantage:
• However, if the primary site of Q fails, Q is inaccessible, even though other sites
containing a replica may be accessible.
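Both the rule and its failure mode can be sketched in a few lines (illustrative: the primary-site assignment and availability flags are assumptions):

```python
# Sketch of primary-copy locking: every replicated item has exactly one
# primary site, and all lock requests for the item go only to that site.

primary_site = {"Q": "S1"}          # hypothetical primary-copy placement
up = {"S1": True, "S2": True}       # site availability

def lock_primary(item):
    site = primary_site[item]
    if not up[site]:
        # Primary failed: Q is inaccessible even if replicas at other
        # sites (e.g., S2) are still up -- the disadvantage noted above.
        return "inaccessible"
    return f"lock granted at {site}"

ok = lock_primary("Q")              # primary up: handled like unreplicated data
up["S1"] = False
failed = lock_primary("Q")          # primary down: Q cannot be locked
```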

Majority Protocol
The majority protocol works this way:
• If data item Q is replicated in n different sites, then a lock-request message must
be sent to more than one-half of the n sites in which Q is stored.

• Each lock manager determines whether the lock can be granted immediately. As
before, the response is delayed until the request can be granted.
• The transaction does not operate on Q until it has successfully obtained a lock on
a majority of the replicas of Q.
• Writes are performed on all replicas, requiring all sites containing replicas to be
available. However, the major benefit of the majority protocol is that it can be
extended to deal with site failures.
Advantage:
The protocol also deals with replicated data in a decentralized manner, thus avoiding
the drawbacks of central control.

Disadvantages:
 Implementation: The majority protocol is more complicated to implement than the previous schemes. It requires at least 2(n/2 + 1) messages for handling lock requests and at least (n/2 + 1) messages for handling unlock requests.
 Deadlock handling: In addition to the problem of global deadlocks due to the use of a distributed-lock-manager approach, it is possible for a deadlock to occur even if only one data item is being locked.
As an illustration, consider a system with four sites and full replication. Suppose that
transactions T1 and T2 wish to lock data item Q in exclusive mode. Transaction T1
may succeed in locking Q at sites S1 and S3, while transaction T2 may succeed in
locking Q at sites S2 and S4. Each then must wait to acquire the third lock; hence, a
deadlock has occurred. We can avoid such deadlocks with relative ease, by requiring
all sites to request locks on the replicas of a data item in the same predetermined
order.
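Both the majority rule and the predetermined-order fix can be sketched in Python (illustrative: the site names and the `grant` callback standing in for per-site lock managers are assumptions):

```python
# Sketch of the majority protocol: a transaction must lock Q at more
# than half of the n sites holding replicas. Requesting replicas in one
# predetermined (here: sorted) order avoids the deadlock described
# above, since no two transactions can each hold "half" of the replicas.

def majority_lock(replica_sites, grant):
    """Try to lock at a majority of replica sites, in a fixed global order."""
    needed = len(replica_sites) // 2 + 1   # more than one-half of n sites
    acquired = []
    for site in sorted(replica_sites):     # same order for every transaction
        if grant(site):
            acquired.append(site)
        if len(acquired) == needed:
            return acquired                # majority reached: lock is held
    return None                            # failed to obtain a majority

# Four replicas; S4 refuses -- a majority (3 of 4) is still reachable.
locks = majority_lock(["S1", "S2", "S3", "S4"], lambda s: s != "S4")
```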

Biased Protocol
The biased protocol is another approach to handling replication.
The difference from the majority protocol is that requests for shared locks are
given more favorable treatment than requests for exclusive locks.

 Shared locks.
When a transaction needs to lock data item Q, it simply requests a lock on Q from
the lock manager at one site that contains a replica of Q.

 Exclusive locks.
When a transaction needs to lock data item Q, it requests a lock on Q from the
lock manager at all sites that contain a replica of Q. As before, the response to the
request is delayed until it can be granted.
Advantage:
The biased protocol imposes less overhead on read operations than does the majority protocol. This saving is especially significant in the common case where the frequency of reads is much greater than the frequency of writes.

Disadvantages:
• Additional overhead on writes.
• The biased protocol shares the majority protocol's disadvantage of complexity in handling deadlock.
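The asymmetry between the two lock modes can be stated in two lines of Python (illustrative; "pick the first replica site" stands in for picking any one site holding a replica):

```python
# Sketch of the biased protocol's lock rule: a shared lock needs only
# one replica site, while an exclusive lock must be obtained at every
# site that holds a replica of the item.

def sites_to_lock(mode, replica_sites):
    if mode == "S":
        return [replica_sites[0]]   # any one site with a replica suffices
    return list(replica_sites)      # exclusive: all replica sites

read_cost = sites_to_lock("S", ["S1", "S2", "S3"])    # 1 site
write_cost = sites_to_lock("X", ["S1", "S2", "S3"])   # all 3 sites
```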

Quorum Consensus Protocol

The quorum consensus protocol is a generalization of the majority protocol. The


quorum consensus protocol assigns each site a nonnegative weight.

It assigns each read and write operation on an item x two integers, called the read quorum Qr and the write quorum Qw, that must satisfy the following condition, where S is the total weight of all sites at which x resides:

Qr + Qw > S and 2 ∗ Qw > S

To execute a read operation, enough replicas must be locked that their total weight is at least Qr.
To execute a write operation, enough replicas must be locked that their total weight is at least Qw.
A benefit of the quorum consensus approach is that it can permit the cost of either
read or write locking to be selectively reduced by appropriately defining the read and
write quorums. For instance, with a small read quorum, reads need to obtain fewer
locks, but the write quorum will be higher, hence writes need to obtain more locks.
Also, if higher weights are given to some sites, fewer sites need to be accessed for
acquiring locks.
In fact, by setting weights and quorums appropriately, the quorum consensus protocol can
simulate the majority protocol and the biased protocols. Like the majority protocol, quorum
consensus can be extended to work even in the presence of site failures.
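The quorum conditions, and how particular choices of Qr and Qw reproduce the earlier protocols, can be checked directly (illustrative: three equally weighted sites are an assumption for the example):

```python
# Sketch of the quorum-consensus conditions: Qr + Qw > S guarantees that
# every read quorum intersects every write quorum, and 2*Qw > S
# guarantees that any two write quorums intersect.

def valid_quorums(weights, qr, qw):
    s = sum(weights.values())          # S: total weight of all sites
    return qr + qw > s and 2 * qw > s

weights = {"S1": 1, "S2": 1, "S3": 1}  # three equal-weight sites, S = 3

# Qr = Qw = 2 reproduces the majority protocol on three equal sites.
assert valid_quorums(weights, 2, 2)
# Qr = 1, Qw = 3 reproduces the biased protocol (read one, write all).
assert valid_quorums(weights, 1, 3)
# Qr = 1, Qw = 1 is invalid: two writes could proceed without overlapping.
assert not valid_quorums(weights, 1, 1)
```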

Time Stamp Based Protocols:

Timestamp based Protocol in DBMS is an algorithm which uses the system time or a logical counter as a timestamp to serialize the execution of concurrent transactions. The timestamp-based protocol ensures that every pair of conflicting read and write operations is executed in timestamp order.
The older transaction is always given priority in this method. It uses system time to
determine the time stamp of the transaction. This is the most commonly used concurrency
protocol.
Lock-based protocols help you to manage the order between the conflicting transactions
when they will execute. Timestamp-based protocols manage conflicts as soon as an
operation is created.
Example:
Suppose there are three transactions T1, T2, and T3.
T1 entered the system at time 0010.
T2 entered the system at time 0020.
T3 entered the system at time 0030.
Priority will be given to transaction T1, then transaction T2, and lastly transaction T3.
The principal idea behind the timestamp-based concurrency control protocols is that each
transaction is given a unique timestamp that the system uses in deciding the serialization
order. Our first task, then, in generalizing the centralized scheme to a distributed scheme is
to develop a scheme for generating unique timestamps.

Generation of Timestamps

There are two primary methods for generating unique timestamps, one centralized and
one distributed.

In the centralized scheme, a single node distributes the timestamps.

The node can use a logical counter or its own local clock for this purpose. While this
scheme is easy to implement, failure of the node would potentially block all transaction
processing in the system.
In the distributed scheme, each node generates a unique local timestamp by
using either a logical counter or the local clock.
We obtain the unique global timestamp by concatenating the unique local timestamp with the node identifier, which also must be unique.
We may still have a problem if one node generates local timestamps at a rate faster than that of the other nodes. In such a case, the fast node's logical counter will be larger than that of the other nodes; therefore, all timestamps generated by the fast node will be larger than those generated by other nodes. What we need is a mechanism to ensure that local timestamps are generated fairly across the system.
There are two solution approaches for this problem.
1. Keep the clocks synchronized by using a network time protocol. The protocol
periodically communicates with a server to find the current time. If the local time
is ahead of the time returned by the server, the local clock is slowed down,
whereas if the local time is behind the time returned by the server it is speeded
up, to bring it back in synchronization with the time at the server. Since all nodes
are approximately synchronized with the server, they are also approximately
synchronized with each other.

2. We define within each node Ni a logical clock (LCi), which generates the unique
local timestamp. The logical clock can be implemented as a counter that is
incremented after a new local timestamp is generated. To ensure that the various
logical clocks are synchronized, we require that a node Ni advance its logical clock
whenever a transaction Ti with timestamp < x, y > visits that node and x is greater
than the current value of LCi. In this case, node Ni advances its logical clock to the
value x + 1. As long as messages are exchanged regularly, the logical clocks will be
approximately synchronized.
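The second solution can be sketched in Python (illustrative: the node names are hypothetical; timestamps are modeled as (counter, node-id) pairs compared lexicographically, matching the <x, y> notation above):

```python
# Sketch of distributed timestamp generation: the global timestamp is
# the local counter concatenated with the unique node identifier, and a
# node advances its logical clock LCi to x + 1 whenever it sees a
# timestamp <x, y> with x greater than its current LCi.

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.lc = 0                     # logical clock LCi

    def new_timestamp(self):
        self.lc += 1
        return (self.lc, self.node_id)  # <x, y>, compared lexicographically

    def observe(self, ts):
        x, _y = ts                      # a visiting transaction's timestamp
        if x > self.lc:
            self.lc = x + 1             # advance LCi to x + 1, as in the text

n1, n2 = Node("N1"), Node("N2")
t1 = n1.new_timestamp()                 # N1 issues (1, "N1")
n2.observe(t1)                          # N2's slower clock catches up
t2 = n2.new_timestamp()                 # guaranteed to order after t1
```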
Distributed Timestamp Ordering
The timestamp ordering protocol can be easily extended to a parallel or
distributed database setting.

Each transaction is assigned a globally unique timestamp at the node where it


originates.

Requests sent to other nodes include the transaction timestamp.


Each node keeps track of the read and write timestamps of the data items at that
node.

Whenever an operation is received by a node, it does the timestamp checks


locally, without any need to communicate with other nodes.

Timestamps must be reasonably synchronized across nodes; otherwise, the


following problem can occur.

Suppose one node has a time significantly lagging the others, and a transaction
T1 gets its timestamp at that node n1. Suppose the transaction T1 fails a
timestamp test on a data item di because di has been updated by a transaction T2
with a higher timestamp; T1 would be restarted with a new timestamp, but if the
time at node n1 is not synchronized, the new timestamp may still be old enough
to cause the timestamp test to fail, and T1 would be restarted repeatedly until the
time at n1 advances ahead of the timestamp of T2.

Advantages:
• Schedules are serializable, just as with 2PL protocols.
• Transactions never wait, which eliminates the possibility of deadlocks.
Disadvantages:
• Starvation is possible if the same transaction is repeatedly restarted and aborted.
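The local test each node performs (the basic timestamp-ordering check on per-item read and write timestamps) can be sketched as follows (illustrative; a real node would also trigger the restart with a new timestamp):

```python
# Sketch of the local timestamp test done at each node: a read by a
# transaction with timestamp ts fails if a younger transaction already
# wrote the item, and a write fails if a younger transaction already
# read or wrote it (basic timestamp ordering, checked purely locally).

def read(ts, item, rts, wts):
    if ts < wts.get(item, 0):
        return "abort"                 # item overwritten by a younger txn
    rts[item] = max(rts.get(item, 0), ts)
    return "ok"

def write(ts, item, rts, wts):
    if ts < rts.get(item, 0) or ts < wts.get(item, 0):
        return "abort"                 # conflicts with a younger read/write
    wts[item] = ts
    return "ok"

rts, wts = {}, {}                      # this node's read/write timestamps
assert write(20, "d", rts, wts) == "ok"    # T2 (ts = 20) writes item d
assert read(10, "d", rts, wts) == "abort"  # T1 (ts = 10) must restart
```

This is exactly the scenario described above: if T1's node lags, its new timestamp after restart may still be below 20, and T1 keeps aborting.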
Distributed Validation
The protocol is based on three timestamps:
• The start timestamp StartTS(Ti).
• The validation timestamp TS(Ti), which is used as the serialization order.
• The finish timestamp FinishTS(Ti), which identifies when the writes of a transaction have completed.
Steps in Validation based protocol
1. Validation is done locally at each node, with timestamps assigned as described below.

2. In a distributed setting, the validation timestamp TS(Ti) can be assigned at any


of the nodes, but the same timestamp TS(Ti) must be used at all nodes where
validation is to be performed. Transactions must be serializable based on their
timestamps TS(Ti).

3. The validation test for a transaction Ti looks at all transactions Tj with TS(Tj) <
TS(Ti), to check if Tj either finished before Ti started, or has no conflicts with Ti. The
assumption is that once a particular transaction enters the validation phase, no
transaction with a lower timestamp can enter the validation phase. The assumption
can be ensured in a centralized system by assigning the timestamps in a critical
section, but cannot be ensured in a distributed setting.

A key problem in the distributed setting is that a transaction Tj may enter the
validation phase after a transaction Ti, but with TS(Tj) < TS(Ti). It is too late for

Ti to be validated against Tj. However, this problem can be easily fixed by rolling back
any transaction if, when it starts validation at a node, a transaction with a later
timestamp had already started validation at that node.
4.The start and finish timestamps are used to identify transactions Tj whose writes
would definitely have been seen by a transaction Ti. These timestamps must be
assigned locally at each node, and must satisfy StartTS(Ti) ≤ TS(Ti) ≤ FinishTS(Ti).
Each node uses these timestamps to perform validation locally.

5. When used in conjunction with 2PC, a transaction must first be validated and then
enter the prepared state. Writes cannot be committed at the database until the
transaction enters the committed state in 2PC. Suppose a transaction Tj reads an
item updated by a transaction Ti that is in the prepared state and is allowed to
proceed using the old value of the data item (since the value generated by Ti has not
yet been written to the database). Then, when transaction Tj attempts to validate, it
will be serialized after Ti and will surely fail validation if Ti commits. Thus, the read
by Tj may as well be held until Ti commits and finishes its writes. The above
behavior is the same as what would happen with locking, with write locks acquired at
the time of validation.

Although full implementations of validation-based protocols are not widely used in distributed settings, optimistic concurrency control without read validation is widely used.
Validation based Protocol in DBMS, also known as the Optimistic Concurrency Control Technique, is a method to avoid concurrency problems in transactions. In this protocol, the local copies of the transaction data are updated rather than the data itself, which results in less interference during execution of the transaction.

 The Validation based Protocol is performed in the following three phases:
1. Read Phase
2. Validation Phase
3. Write Phase

Read Phase
In the Read Phase, the data values from the database can be read by a
transaction but the write operation or updates are only applied to the local data
copies, not the actual database.

Validation Phase
In Validation Phase, the data is checked to ensure that there is no violation of
serializability while applying the transaction updates to the database.

Write Phase
In the Write Phase, the updates are applied to the database if the validation is successful; otherwise, the updates are not applied and the transaction is rolled back.
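The three phases above can be illustrated with a toy in-memory implementation (an assumed sketch, not a production protocol): reads and writes go to local copies, validation checks the read set against the write sets of transactions that committed after this one started, and only a successful validation reaches the write phase.

```python
# Minimal sketch of validation-based (optimistic) concurrency control.
db = {"x": 10, "y": 20}
committed = []   # (validation_ts, write_set) of committed transactions
clock = 0

class Transaction:
    def __init__(self):
        global clock
        clock += 1
        self.start_ts = clock
        self.local = {}          # local copies (updates go here first)
        self.read_set = set()
        self.write_set = set()

    def read(self, k):           # Read Phase: read from db or local copy
        self.read_set.add(k)
        return self.local.get(k, db[k])

    def write(self, k, v):       # Read Phase: update only the local copy
        self.write_set.add(k)
        self.local[k] = v

    def commit(self):            # Validation Phase + Write Phase
        global clock
        clock += 1
        ts = clock
        for other_ts, other_writes in committed:
            # conflict if a later-committing writer overlaps our reads
            if other_ts > self.start_ts and other_writes & self.read_set:
                return False     # validation fails: roll back
        for k in self.write_set: # Write Phase: apply updates to the db
            db[k] = self.local[k]
        committed.append((ts, set(self.write_set)))
        return True

t1 = Transaction(); t2 = Transaction()
t2.write("x", t2.read("x") + 1)
assert t2.commit() is True       # t2 validates and writes x = 11
t1.read("x")                     # t1 reads x after t2 committed
assert t1.commit() is False      # t1 fails validation and rolls back
assert db["x"] == 11
```

The conflict check here is deliberately conservative: any overlap between a committed write set and this transaction's read set forces a rollback.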

Characteristics of Good Concurrency Protocol


An ideal DBMS concurrency control mechanism has the following objectives:
 It must be resilient to site and communication failures.
 It allows the parallel execution of transactions to achieve maximum concurrency.
 Its storage mechanisms and computational methods should be modest to minimize overhead.
 It must enforce some constraints on the structure of atomic actions of transactions.
Introduction to NOSQL Systems
Many companies and organizations are faced with applications that store vast
amounts of data. Consider a free e-mail application, such as Google Mail or Yahoo
Mail or other similar service—this application can have millions of users, and each
user can have thousands of e-mail messages. There is a need for a storage system
that can manage all these e-mails; a structured relational SQL system may not be
appropriate because

1) SQL systems offer too many services (powerful query language, concurrency
control, etc.), which this application may not need; and

2) A structured data model such as the traditional relational model may be too restrictive.

Some of the organizations that were faced with these data management and storage
applications decided to develop their own systems:

BigTable
Google developed a proprietary NOSQL system known as BigTable, which is used in many of Google's applications that require vast amounts of data storage, such as Gmail, Google Maps, and Web site indexing.

Apache Hbase is an open source NOSQL system based on similar concepts.


Google's innovation led to the category of NOSQL systems known as column-based or wide column stores; they are also sometimes referred to as column family stores.

DynamoDB

Amazon developed a NOSQL system called DynamoDB that is available through Amazon's cloud services.

This innovation led to the category known as key-value data stores or sometimes
key-tuple or key-object data stores.
Cassandra
Facebook developed a NOSQL system called Cassandra, which is now open source
and known as Apache Cassandra.

This NOSQL system uses concepts from both key-value stores and column-based
systems.

Other software companies started developing their own solutions and making them
available to users who need these capabilities—for example, MongoDB and
CouchDB, which are classified as document-based NOSQL systems or document
stores.

Another category of NOSQL systems is the graph-based NOSQL systems, or graph databases; these include Neo4j and GraphBase, among others.

Some NOSQL systems, such as OrientDB, combine concepts from many of the
categories discussed above.

NOSQL characteristics related to distributed databases and distributed systems
 Scalability
 Availability, replication, and eventual consistency
 Replication models (master-slave and master-master)
 Sharding of files
 High performance data access

Scalability:

There are two kinds of scalability in distributed systems: horizontal and vertical.

Horizontal scalability: the distributed system is expanded by adding more nodes for data storage and processing as the volume of data grows.

Vertical scalability: refers to expanding the storage and computing power of existing nodes.

In NOSQL systems, horizontal scalability is generally employed while the system is operational, so techniques for distributing the existing data among new nodes without interrupting system operation are necessary.

Availability, Replication and Eventual Consistency:

Many applications that use NOSQL systems require continuous system availability.
To accomplish this, data is replicated over two or more nodes in a transparent
manner, so that if one node fails, the data is still available on other nodes.

Replication improves data availability and can also improve read performance,
because read requests can often be serviced from any of the replicated data
nodes.

Replication Models:
Two major replication models are used in NOSQL systems:

master-slave and master-master replication.


Master-slave Replication
 It requires one copy to be the master copy.
 All write operations must be applied to the master copy and then propagated to
the slave copies, usually using eventual consistency (the slave copies will
eventually be the same as the master copy).
 For read, the master-slave paradigm can be configured in various ways.
 One configuration requires all reads to also be at the master copy, so this would be
similar to the primary site or primary copy methods of distributed concurrency
control.
 Another configuration would allow reads at the slave copies but would not guarantee that the values are the latest writes, since writes to the slave nodes can be done after they are applied to the master copy.
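The master-slave scheme can be sketched as follows (the function names and structure are hypothetical): writes go only to the master and reach the slaves later, so a read from a slave may briefly return a stale value.

```python
# Sketch of master-slave replication with eventual consistency.
master = {}
slaves = [{}, {}]
log = []                      # updates not yet propagated to the slaves

def write(key, value):        # all writes must go to the master copy
    master[key] = value
    log.append((key, value))

def propagate():              # runs asynchronously in a real system
    while log:
        key, value = log.pop(0)
        for s in slaves:
            s[key] = value

def read_any_slave(key):      # may return a stale (or missing) value
    return slaves[0].get(key)

write("balance", 100)
assert read_any_slave("balance") is None   # slave not yet caught up
propagate()                                # eventual consistency
assert read_any_slave("balance") == 100
```

Requiring all reads to also go to the master (the first configuration above) would remove the stale-read window at the cost of concentrating load on one node.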


Master-master Replication
 It allows reads and writes at any of the replicas but may not guarantee that reads
at nodes that store different copies see the same values.
 Different users may write the same data item concurrently at different nodes of
the system, so the values of the item will be temporarily inconsistent.

 A reconciliation method to resolve conflicting write operations of the same data item at different nodes must be implemented as part of the master-master replication scheme.
Sharding of Files:
In many NOSQL applications, files (or collections of data objects) can have many
millions of records (or documents or objects), and these records can be accessed
concurrently by thousands of users. So it is not practical to store the whole file in
one node.

Sharding (also known as horizontal partitioning) of the file records is often employed in NOSQL systems. This serves to distribute the load of accessing the file records to multiple nodes. The combination of sharding the file records and replicating the shards works in tandem to improve load balancing as well as data availability.

High-Performance Data Access:


In many NOSQL applications, it is necessary to find individual records or objects
(data items) from among the millions of data records or objects in a file.

To achieve this, most systems use one of two techniques: hashing or range
partitioning on object keys.

The majority of accesses to an object will be by providing the key value rather than
by using complex query conditions.

In hashing, a hash function h(K) is applied to the key K, and the location of the
object with key K is determined by the value of h(K).

In range partitioning, the location is determined via a range of key values; for example, location i would hold the objects whose key values K are in the range Kimin ≤ K ≤ Kimax. In applications that require range queries, where multiple objects within a range of key values are retrieved, range partitioning is preferred.
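The two techniques can be illustrated as follows (the node count and key ranges are made up for the example):

```python
# Sketch of the two data-distribution techniques: hashing and
# range partitioning on object keys.
NUM_NODES = 4

def node_by_hash(key):
    # hashing: h(K) determines the node that stores the object with key K
    return hash(key) % NUM_NODES

# range partitioning: node i holds keys with Ki_min <= K <= Ki_max
ranges = [(0, 249), (250, 499), (500, 749), (750, 999)]

def node_by_range(key):
    for i, (lo, hi) in enumerate(ranges):
        if lo <= key <= hi:
            return i
    raise KeyError(key)

assert node_by_range(600) == 2
# a range query touches only the nodes whose ranges overlap the query,
# which is why range partitioning suits range queries better than hashing
assert {node_by_range(k) for k in range(240, 260)} == {0, 1}
```

Under hashing, the keys 240..259 would scatter across all four nodes; under range partitioning they sit on just two adjacent nodes.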
NOSQL characteristics related to data models and query languages
 Schema not required
 Less powerful query languages
 Versioning

Not Requiring a Schema:

The flexibility of not requiring a schema is achieved in many NOSQL systems by allowing semi-structured, self-describing data.

The users can specify a partial schema in some systems to improve storage efficiency,
but it is not required to have a schema in most of the NOSQL systems.

As there may not be a schema to specify constraints, any constraints on the data
would have to be programmed in the application programs that access the data
items.

There are various languages for describing semistructured data, such as JSON
(JavaScript Object Notation) and XML (Extensible Markup Language).

JSON is used in several NOSQL systems, but other methods for describing semi-
structured data can also be used.
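For instance, two self-describing JSON documents in the same collection can carry different data elements (the field names below are hypothetical):

```python
# Two documents with different data elements, expressed as JSON --
# no schema declares the fields in advance.
import json

docs = [
    {"_id": 1, "Fname": "John", "Lname": "Smith", "Salary": 30000},
    {"_id": 2, "Fname": "Anna", "Projects": ["ProductX", "ProductY"]},
]
# each document describes its own structure through its field names
assert set(docs[0]) != set(docs[1])
assert json.loads(json.dumps(docs[1]))["Projects"] == ["ProductX", "ProductY"]
```

Any constraint (for example, that Salary must be positive) would have to be enforced by the application code, since there is no schema to declare it.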

Less Powerful Query Languages: Many applications that use NOSQL systems
may not require a powerful query language such as SQL, because search (read)
queries in these systems often locate single objects in a single file based on their
object keys. NOSQL systems typically provide a set of functions and operations as a
programming API (application programming interface), so reading and writing the
data objects is accomplished by calling the appropriate operations by the programmer.
In many cases, the operations are called CRUD operations, for Create, Read,
Update, and Delete.

In other cases, they are known as SCRUD because of an added Search (or Find)
operation.

Some NOSQL systems also provide a high-level query language, but it may not have
the full power of SQL; only a subset of SQL querying capabilities would be provided.
In particular, many NOSQL systems do not provide join operations as part of the
query language itself; the joins need to be implemented in the application programs.

Versioning:
Some NOSQL systems provide storage of multiple versions of the data items, with
the timestamps of when the data version was created.
Categories of NoSQL Databases:
NoSQL databases can be divided into 4 types. They are as follows:

1. Document Database
2. Key-Value Database
3. Column-Oriented Database
4. Graph Database

Document Database:

The document database stores data in the form of documents. This implies that
data is grouped into files that make it easier to be recognized when it is required
for building application software.

It is a semi-structured and hierarchical NoSQL database that allows efficient storage of data. Especially when it comes to user profiles or catalogs, this type of NoSQL database works very well. A typical NoSQL database example is MongoDB.

Key Value Database:


 Termed to be the simplest form of NoSQL database of all other types, the key-
value database is a database that stores data in a schema-less manner. This type
of database stores data in the key-value format.

 A data point is categorized as a key to which a value (another data point) is allotted. For instance, a key data point can be termed as 'age' while the value data point can be termed as '45'.

 In Key value Database data gets stored in an organized manner with the help of
associative pairing. A typical example of this type is Amazon's Dynamo database.
Column Oriented Database:
 This type of database stores data in the form of columns that segregates
information into homogenous categories.

 This allows the user to access only the desired data without having to retrieve
unnecessary information.

 When it comes to data analytics in social media networking sites, the column-
oriented database works very efficiently by showcasing data that is prevalent in
the search results.

 A typical example of a column-oriented NoSQL database is Apache HBase.

Graph Database:
 Data is stored in the form of graphical knowledge and related elements like edges,
nodes, etc.

 Data points are placed in such a manner that nodes are related to edges and thus,
a network or connection is established between several data points.

 This way, one data point leads to the other without the user having to retrieve
individual data points. In the case of software development, this type of database
works well since connected data points often lead to networked data storage.

 This, in turn, makes the functioning of software highly effective and organized. An
example of the graph NoSQL database is Amazon Neptune.

Hybrid NOSQL systems: These systems have characteristics from two or more of
the above four categories.
Document-Based NOSQL Systems and MongoDB

Document-based or document-oriented NOSQL systems typically store data as collections of similar documents.

These types of systems are also sometimes known as document stores.


The individual documents somewhat resemble complex objects or XML documents
but a major difference between document-based systems versus object and
object-relational systems and XML is that there is no requirement to specify a
schema—rather, the documents are specified as self-describing data

Although the documents in a collection should be similar, they can have different
data elements (attributes), and new documents can have new data elements that
do not exist in any of the current documents in the collection.

The system basically extracts the data element names from the self-describing
documents in the collection, and the user can request that the system create
indexes on some of the data elements.

Documents can be specified in various formats, such as XML. A popular language to specify documents in NOSQL systems is JSON (JavaScript Object Notation).

There are many document-based NOSQL systems, including MongoDB and CouchDB, among many others.

MongoDB Data Model


MongoDB documents are stored in BSON (Binary JSON) format, which is a
variation of JSON with some additional data types and is more efficient for storage
than JSON.

Individual documents are stored in a collection.


The operation createCollection is used to create each collection. For example, the
following command can be used to create a collection called project to hold
PROJECT objects from the COMPANY database
db.createCollection("project", { capped : true, size : 1310720, max : 500 } )
The first parameter "project" is the name of the collection, which is followed by an optional document that specifies collection options.

In our example, the collection is capped; this means it has upper limits on its
storage space (size) and number of documents (max). The capping parameters
help the system choose the storage options for each collection.

For our example, we will create another document collection called worker to hold information about the EMPLOYEEs who work on each project; for example:
db.createCollection("worker", { capped : true, size : 5242880, max : 2000 } )

Each document in a collection has a unique ObjectId field, called _id, which is
automatically indexed in the collection unless the user explicitly requests no index
for the _id field.

The value of ObjectId can be specified by the user, or it can be system- generated
if the user does not specify an _id field for a particular document.

System-generated ObjectIds have a specific format, which combines the timestamp when the object is created (4 bytes, in an internal MongoDB format), the node id (3 bytes), the process id (2 bytes), and a counter (3 bytes) into a 12-byte Id value.

User-generated ObjectIds can have any value specified by the user as long as it uniquely identifies the document, and so these Ids are similar to primary keys in relational systems.
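As an illustration only (the exact internal MongoDB byte layout may differ), an ObjectId-like value can be assembled from the four components described above:

```python
# Sketch: assembling a 12-byte id from a 4-byte timestamp, a 3-byte
# node id, a 2-byte process id, and a 3-byte counter.
import os, struct, time

_counter = 0

def make_object_id(node_id):
    global _counter
    _counter = (_counter + 1) % (1 << 24)              # 3-byte counter wraps
    ts = int(time.time()) & 0xFFFFFFFF                 # 4-byte timestamp
    return (struct.pack(">I", ts)
            + node_id.to_bytes(3, "big")               # 3-byte node id
            + struct.pack(">H", os.getpid() & 0xFFFF)  # 2-byte process id
            + _counter.to_bytes(3, "big"))             # 3-byte counter

oid = make_object_id(node_id=7)
assert len(oid) == 12
assert make_object_id(7) != oid        # the counter makes ids unique
```

Because the timestamp occupies the high-order bytes, ids generated later sort after ids generated earlier, which is convenient for indexing.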

A collection does not have a schema. The structure of the data fields in
documents is chosen based on how documents will be accessed and used, and the
user can choose a normalized design (similar to normalized relational tuples) or a
denormalized design (similar to XML documents or complex objects).
MongoDB CRUD Operations
MongoDb has several CRUD operations, where CRUD stands for (create, read,
update, delete).
Documents can be created and inserted into their collections using the insert
operation, whose format is: db.<collection_name>.insert(<document(s)>)
The parameters of the insert operation can include either a single document or an
array of documents.
The delete operation is called remove, and the format is:
db.<collection_name>.remove(<condition>)

The documents to be removed from the collection are specified by a Boolean condition on some of the fields in the collection documents.
There is also an update operation, which has a condition to select certain
documents, and a $set clause to specify the update.
It is also possible to use the update operation to replace an existing document
with another one but keep the same ObjectId.
For read queries, the main command is called find, and the format is:
db.<collection_name>.find(<condition>)
General Boolean conditions can be specified as <condition>, and the documents
in the collection that return true are selected for the query result.
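The call pattern of these CRUD operations can be mimicked with a small in-memory sketch (illustrative Python, not the MongoDB driver; conditions are simplified here to field-equality matches):

```python
# Toy document collection with insert, find, and remove in the style of
# db.<collection>.insert / find / remove described above.
class Collection:
    def __init__(self):
        self.docs = []

    def insert(self, docs):
        # accepts a single document or an array of documents
        self.docs.extend(docs if isinstance(docs, list) else [docs])

    def find(self, condition=None):
        # documents for which the condition evaluates to true are selected
        condition = condition or {}
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in condition.items())]

    def remove(self, condition):
        # documents matching the condition are removed
        self.docs = [d for d in self.docs
                     if not all(d.get(k) == v for k, v in condition.items())]

project = Collection()
project.insert([{"_id": 1, "Pname": "ProductX", "Dnum": 5},
                {"_id": 2, "Pname": "ProductY", "Dnum": 4}])
assert project.find({"Dnum": 5})[0]["Pname"] == "ProductX"
project.remove({"_id": 2})
assert len(project.find()) == 1
```

Real MongoDB conditions are richer (comparison and logical operators such as $gt and $or), but the select-by-condition shape is the same.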
MongoDB Distributed Systems Characteristics
Most MongoDB updates are atomic if they refer to a single document, but
MongoDB also provides a pattern for specifying transactions on multiple
documents. Since MongoDB is a distributed system, the two-phase commit
method is used to ensure atomicity and consistency of multidocument
transactions.

Replication in MongoDB. The concept of replica set is used in MongoDB to create multiple copies of the same data set on different nodes in the distributed system, and it uses a variation of the master-slave approach for replication.
Sharding in MongoDB. When a collection holds a very large number of
documents or requires a large storage space, storing all the documents in one
node can lead to performance problems, particularly if there are many user
operations accessing the documents concurrently using various CRUD operations.
Sharding of the documents in the collection—also known as horizontal
partitioning— divides the documents into disjoint partitions known as shards. This
allows the system to add more nodes as needed by a process known as
horizontal scaling of the distributed system and to store the shards of the
collection on different nodes to achieve load balancing.

NOSQL Key-Value Stores


Key-value stores focus on high performance, availability, and scalability by storing data in a distributed storage system.


The data model used in key-value stores is relatively simple, and in many of these systems, there is no query language but rather a set of operations that can be used by the application programmers.

The key is a unique identifier associated with a data item and is used to locate
this data item rapidly.

The value is the data item itself, and it can have very different formats for
different key-value storage systems.

In some cases, the value is just a string of bytes or an array of bytes, and the
application using the key-value store has to interpret the structure of the data
value.

In other cases, some standard formatted data is allowed; for example, structured
data rows (tuples) similar to relational data, or semistructured data using JSON or
some other self-describing data format.
Different key-value stores can thus store unstructured, semistructured, or
structured data items.

The main characteristic of key-value stores is the fact that every value (data item)
must be associated with a unique key, and that retrieving the value by supplying
the key must be very fast.
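The key-value contract described above can be sketched as follows (the keys and value format are hypothetical): every value is stored under a unique key, lookup is by key only, and the application interprets the value bytes.

```python
# Minimal key-value store sketch: put/get by unique key, opaque values.
import json

store = {}                       # key -> opaque bytes

def put(key, value_bytes):
    store[key] = value_bytes     # overwriting the key updates the value

def get(key):
    return store.get(key)        # fast lookup by the unique key

# the application decides the value format, e.g. JSON-encoded bytes
put("emp:101", json.dumps({"name": "John", "dept": "IT"}).encode())
item = json.loads(get("emp:101"))
assert item["dept"] == "IT"
assert get("emp:999") is None    # no scans, no conditions: key lookup only
```

Note there is no find-by-condition operation: anything beyond lookup by key must be built by the application on top of this interface.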

DynamoDB Overview
The DynamoDB system is an Amazon product and is available as part of Amazon's AWS/SDK platforms (Amazon Web Services/Software Development Kit).
It can be used as part of Amazon's cloud computing services, for the data storage component.
DynamoDB data model.
The basic data model in DynamoDB uses the concepts
of tables, items, and attributes. A table in DynamoDB does not have a schema;
it holds a collection of self-describing items.
Each item will consist of a number of (attribute, value) pairs, and attribute
values can be single-valued or multivalued.
So basically, a table will hold a collection of items, and each item is a self-
describing record (or object).
DynamoDB also allows the user to specify the items in JSON format, and the
system will convert them to the internal storage format of DynamoDB.
When a table is created, it is required to specify a table name and a primary
key; the primary key will be used to rapidly locate the items in the table.
Thus, the primary key is the key and the item is the value for the DynamoDB
key-value store.
The primary key attribute must exist in every item in the table.
Column-Based or Wide Column NOSQL Systems
The Google distributed storage system for big data, known as BigTable, is a well-
known example of this class of NOSQL systems, and it is used in many Google
applications that require large amounts of data storage, such as Gmail.

Big-Table uses the Google File System (GFS) for data storage and distribution.
An open source system known as Apache Hbase is somewhat similar to Google
Big-Table, but it typically uses HDFS (Hadoop Distributed File System) for data
storage.

Hbase can also use Amazon's Simple Storage System (known as S3) for data storage.

Hbase Data Model and Versioning

Hbase data model. The data model in Hbase organizes data using the concepts
of namespaces, tables, column families, column qualifiers, columns, rows, and data
cells.

A column is identified by a combination of (column family:column qualifier).

Data is stored in a self-describing form by associating columns with data values, where data values are strings. Hbase also stores multiple versions of a data item, with a timestamp associated with each version, so versions and timestamps are also part of the Hbase data model.

Tables and Rows. Data in Hbase is stored in tables, and each table has a table
name. Data in a table is stored as self-describing rows. Each row has a unique
row key
Column Families, Column Qualifiers, and Columns. A table is associated with
one or more column families. Each column family will have a name, and the
column families associated with a table must be specified when the table is
created and cannot be changed later.

When the data is loaded into a table, each column family can be associated with
many column qualifiers, but the column qualifiers are not specified as part of
creating a table.

So the column qualifiers make the model a self-describing data model because the
qualifiers can be dynamically specified as new rows are created and inserted into
the table.

A column is specified by a combination of ColumnFamily:ColumnQualifier.


Versions and Timestamps. Hbase can keep several versions of a data item,
along with the timestamp associated with each version. The timestamp is a long
integer number that represents the system time when the version was created, so
newer versions have larger timestamp values.

Cells. A cell holds a basic data item in Hbase. The key (address) of a cell is
specified by a combination of (table, rowid, columnfamily, columnqualifier,
timestamp).
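The cell addressing scheme above can be sketched with a dictionary keyed by the five-part address (the table and column names are made up):

```python
# Sketch of Hbase-style cells: each cell is keyed by
# (table, rowid, column family, column qualifier, timestamp),
# and several timestamped versions of a column can coexist.
cells = {}

def put(table, rowid, cf, cq, ts, value):
    cells[(table, rowid, cf, cq, ts)] = value

def get_latest(table, rowid, cf, cq):
    # newer versions have larger timestamp values
    versions = [(ts, v) for (t, r, f, q, ts), v in cells.items()
                if (t, r, f, q) == (table, rowid, cf, cq)]
    return max(versions)[1] if versions else None

put("employee", "row1", "address", "city", 100, "Chennai")
put("employee", "row1", "address", "city", 200, "Madurai")
assert get_latest("employee", "row1", "address", "city") == "Madurai"
```

Both versions remain stored; a read simply picks the cell with the largest timestamp unless an older version is requested explicitly.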

Namespaces. A namespace is a collection of tables. A namespace basically specifies a collection of one or more tables that are typically used together by user applications, and it corresponds to a database that contains a collection of tables in relational terminology.
NOSQL Graph Databases and Neo4j
Another category of NOSQL systems is known as graph databases or graph
Oriented NOSQL systems.

The data is represented as a graph, which is a collection of vertices (nodes) and edges. Both nodes and edges can be labeled to indicate the types of entities and relationships they represent, and it is generally possible to store data associated with both individual nodes and individual edges.

Neo4j Data Model

The data model in Neo4j organizes data using the concepts of nodes and
relationships.

Both nodes and relationships can have properties, which store the data items
associated with nodes and relationships. Nodes can have labels; the nodes that
have the same label are grouped into a collection that identifies a subset of the
nodes in the database graph for querying purposes.

A node can have zero, one, or several labels. Relationships are directed; each
relationship has a start node and end node as well as a relationship type, which
serves a similar role to a node label by identifying similar relationships that have
the same relationship type.

Properties can be specified via a map pattern, which is made of one or more "name : value" pairs enclosed in curly brackets; for example {Lname : 'Smith', Fname : 'John', Minit : 'B'}.
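A small sketch of this model (the node ids, labels, and property values are illustrative):

```python
# Sketch of the Neo4j data model: nodes with labels and properties,
# and directed, typed relationships between them.
nodes = {}          # node id -> {"labels": set, "props": dict}
rels = []           # (start node id, relationship type, end node id, props)

def add_node(nid, labels, props):
    nodes[nid] = {"labels": set(labels), "props": props}

def add_rel(start, rel_type, end, props=None):
    rels.append((start, rel_type, end, props or {}))

add_node(1, ["EMPLOYEE"], {"Lname": "Smith", "Fname": "John", "Minit": "B"})
add_node(2, ["DEPARTMENT"], {"Dname": "Research"})
add_rel(1, "WorksFor", 2)

# traverse relationships by type to find the departments node 1 works for
depts = [nodes[e]["props"]["Dname"]
         for (s, t, e, p) in rels if s == 1 and t == "WorksFor"]
assert depts == ["Research"]
```

In Neo4j itself this traversal would be written in the Cypher query language rather than by filtering a list, but the underlying node/relationship structure is the same.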
Enterprise Database Management Systems
An enterprise DBMS is one that 100 to 10,000 individuals can access simultaneously. Businesses and big companies use it to handle their vast data sets. Such a database allows businesses to increase their productivity. This kind of database can handle large organizations with thousands of employees and busy web servers with lakhs of people accessing them simultaneously online.

Typically, the DBMS is managed by a database administrator (DBA), who is a specialist in the particular software product. The DBA instructs the system to load, retrieve, or change data in the database, and also controls who can access the data and what commands each user can use.

These databases provide an interconnected collection of five types of database architecture: original data capture, transaction data staging area, subject area, wholesale data warehouses, and data marts. All of these databases represent one or more functional areas that are present in every enterprise in general.

• Original Data Capture:
Original data capture stores information about ongoing applications to the database. Data can be created by any individual, or it can be the result of software development.
• TDSA:
TDSA stands for transaction data staging area; it smooths out the various semantics, times, and units that can occur in original data capture packages, since they come from different users who run under different operating systems.

• Subject Area Database:
Subject area databases draw data from one or more TDSA databases and create databases for large subject areas. The overall number of subject area databases is equal to the number of independent user resources.

• Warehouse Database:
A warehouse database is one that holds data taken from several different databases in a subject field. The data in a data warehouse is taken from the subject area databases.

• Data Mart Database:
A data mart database is ad hoc and is created for a specific need. Its architecture, loading software, and reporting from it are all custom created. Data mart database volumes can also be limited to the range of data needed for individual offices or even individuals, and can be downloaded weekly or even nightly to that office or person.

Various Enterprise Database Management Systems:

There are many enterprise databases, such as:
• Oracle Database 18c
• Microsoft SQL Server
• IBM DB2
• SAP Sybase ASE
• PostgreSQL
• MariaDB Enterprise, etc.
Features of Enterprise Database Management System :

• Parallel query:
Several users can pose queries at the same time, and all the queries are answered simultaneously.

• Multi-process support:
Several processes can be handled by splitting the workload among them.

• Clustering features:
More than one server can be combined to serve a single database. Often one server is not sufficient to handle the data volume, which is when this feature comes into play.
10.ASSIGNMENT

1. Given the following relation EMP and the predicates p1: SAL > 23000, p2: SAL <
23000

a) Perform a horizontal fragmentation of the table based on the given predicates.


b) Is this a correct fragmentation?
c) If the answer to (b) is no, explain why, and give the predicates that would
correctly fragment the table.
11. Part A Question & Answer

S.No Question and Answers CO K


1. Compare homogeneous distributed database and heterogeneous distributed database. (CO6, K2)
A homogeneous distributed database has identical software and hardware running all database instances, and may appear through a single interface as if it were a single database. A heterogeneous distributed database may have different hardware, operating systems, database management systems, and even data models for different databases.

2. Define Distributed Database Systems. (CO6, K2)
A database spread over multiple machines (also referred to as sites or nodes), with a network interconnecting the machines. A database shared by users on multiple machines is called a distributed database system.
3. What is meant by fragmentation in a Distributed Database? (CO6, K2)
The system partitions the relation into several fragments and stores each fragment at a different site. Two approaches: horizontal fragmentation and vertical fragmentation.

4. What is database replication? (CO6, K2)
Database replication can be used on many database management systems, usually with a master/slave relationship between the original and the copies. The master logs the updates, which then ripple through to the slaves. The slave outputs a message stating that it has received the update successfully, thus allowing the sending of subsequent updates.
5. List out concurrency control techniques. (CO6, K2)
There are various concurrency control techniques:
• Locking protocols
• Timestamp based protocols
• Validation based protocols

6. Define Distributed Lock Manager. (CO6, K1)
• A compromise between the advantages and disadvantages can be achieved through the distributed-lock-manager approach, in which the lock-manager function is distributed over several sites.
• Each site maintains a local lock manager whose function is to administer the lock and unlock requests for those data items that are stored in that site.
7. What are the architectural models of DDBMS? (CO6, K2)
Some of the common architectural models are:
• Client-Server Architecture for DDBMS
• Peer-to-Peer Architecture for DDBMS
• Multi-DBMS Architecture

8. What are the different types of Client-Server Architecture? (CO6, K2)
• Single Server, Multiple Clients
• Multiple Servers, Multiple Clients
9. Consider a relation r that is to be stored in the database. Define the approaches for storing it. (CO6, K2)
Replication: The system maintains several identical replicas of the relation, and stores each replica at a different site. The alternative to replication is to store only one copy of relation r.
Fragmentation: The system partitions the relation into several fragments, and stores each fragment at a different site.

10. What is the Biased Protocol? (CO6, K2)
 Shared locks: When a transaction needs to lock data item Q, it simply requests a lock on Q from the lock manager at one site that contains a replica of Q.
 Exclusive locks: When a transaction needs to lock data item Q, it requests a lock on Q from the lock manager at all sites that contain a replica of Q. As before, the response to the request is delayed until it can be granted.

11. What is Quorum Consensus?
The quorum consensus protocol is a generalization of the majority protocol. It assigns each site a nonnegative weight, and assigns read and write operations on an item x two integers, called the read quorum Qr and the write quorum Qw, that must satisfy the following condition, where S is the total weight of all sites at which x resides:
Qr + Qw > S and 2 * Qw > S CO6 K2
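The quorum condition can be checked mechanically. A minimal sketch (the site weights and quorum values below are invented for the example):

```python
# Quorum consensus sanity check: with total site weight S, a read quorum Qr
# and a write quorum Qw are valid only if every read overlaps every write
# (Qr + Qw > S) and any two writes overlap (2 * Qw > S).
def valid_quorums(weights, qr, qw):
    s = sum(weights)
    return qr + qw > s and 2 * qw > s

weights = [1, 1, 1, 1]                      # four sites of equal weight, S = 4
print(valid_quorums(weights, qr=2, qw=3))   # True:  2 + 3 > 4 and 6 > 4
print(valid_quorums(weights, qr=2, qw=2))   # False: two writes need not overlap
```

The overlap guarantees are what make the protocol correct: a read quorum always intersects the most recent write quorum, so a read sees at least one up-to-date replica.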

12. Define Time Stamp Based Protocol.
Timestamp based Protocol in DBMS is an algorithm which uses the System Time or a Logical Counter as a timestamp to serialize the execution of concurrent transactions. The Timestamp-based protocol ensures that every pair of conflicting read and write operations is executed in timestamp order. CO6 K2
13. Define Validation Based Protocol.
Validation based Protocol in DBMS, also known as the Optimistic Concurrency Control Technique, is a method to handle concurrency in transactions optimistically. In this protocol, local copies of the transaction data are updated rather than the data itself, which results in less interference during execution of the transaction. CO6 K2
14. List out the phases in Validation Based Protocol.
The Validation based Protocol is performed in the following three phases:
1. Read Phase
2. Validation Phase
3. Write Phase CO6 K2
15. Illustrate an example for Time Stamp Based Protocol.
Suppose there are three transactions T1, T2, and T3:
T1 entered the system at time 0010
T2 entered the system at time 0020
T3 entered the system at time 0030
Priority will be given to transaction T1, then transaction T2, and lastly transaction T3. CO6 K2
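The ordering rule behind this example can be sketched as a basic timestamp-ordering check (a simplified illustration, not the full protocol; it omits versioning and recovery):

```python
# Basic timestamp ordering: an operation is rejected (forcing a rollback of
# its transaction) if a younger transaction (larger timestamp) has already
# performed a conflicting operation on the same item.
class TimestampScheduler:
    def __init__(self):
        self.read_ts = {}   # item -> largest timestamp that read it
        self.write_ts = {}  # item -> largest timestamp that wrote it

    def read(self, ts, item):
        # Reject if a younger transaction already wrote the item.
        if ts < self.write_ts.get(item, 0):
            return False    # transaction must roll back
        self.read_ts[item] = max(self.read_ts.get(item, 0), ts)
        return True

    def write(self, ts, item):
        # Reject if a younger transaction already read or wrote the item.
        if ts < self.read_ts.get(item, 0) or ts < self.write_ts.get(item, 0):
            return False
        self.write_ts[item] = ts
        return True

sched = TimestampScheduler()
print(sched.read(10, "Q"))   # T1 (ts 0010) reads Q  -> allowed
print(sched.write(20, "Q"))  # T2 (ts 0020) writes Q -> allowed
print(sched.write(10, "Q"))  # T1 writes after younger T2 -> rejected
```

The rejected write in the last line shows how priority falls to the older transaction's timestamp: T1's late write conflicts with T2's already-recorded write, so T1 would be rolled back and restarted with a new timestamp.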

16. Illustrate the need for NoSQL Database.
There is a need for a storage system that can manage data such as large volumes of e-mails; a structured relational SQL system may not be appropriate because:
1) SQL systems offer too many services (powerful query language, concurrency control, etc.), which such an application may not need; and
2) A structured data model such as the traditional relational model may be too restrictive. CO6 K2
17. List out the Categories in NoSQL Database.
NoSQL databases can be divided into 4 types:
1. Document Database
2. Key Value Database
3. Column Oriented Database
4. Graph Database CO6 K1
18. Illustrate the features of NoSQL.
Each NoSQL database has its own unique features. At a high level, many NoSQL databases have the following features:
• Flexible schemas
• Horizontal scaling
• Complex-free working
• Durability CO6 K2
19. Define Horizontal Scaling in NoSQL.
Horizontal scaling, also known as scale-out, refers to bringing on additional nodes to share the load. This is difficult with relational databases due to the difficulty of spreading related data across nodes. With non-relational databases this is simpler, since collections are self-contained and not coupled relationally. This allows them to be distributed across nodes more simply, as queries do not have to "join" them together across nodes. CO6 K2

20. What is a Document Database?
The document database stores data in the form of documents. This implies that data is grouped into files that make it easier to recognize when it is required for building application software. It is a semi-structured and hierarchical NoSQL database that allows efficient storage of data. Especially when it comes to user profiles or catalogs, this type of NoSQL database works very well. A typical example of a NoSQL document database is MongoDB. CO6 K2
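As a hedged illustration (plain Python dictionaries standing in for a document store; the field names are invented for the example, not taken from any particular product), a user-profile document keeps all related data in one self-contained record:

```python
# Each document groups all of a user's profile data in one self-contained
# record, so no joins are needed to retrieve or query it.
# Field names here are purely illustrative.
profiles = {
    "u1": {"name": "Asha", "email": "asha@example.com",
           "interests": ["databases", "nosql"]},
    "u2": {"name": "Ravi", "interests": ["graphs"]},
}

def find_by_interest(store, interest):
    """Return the names of users whose document lists the given interest."""
    return [doc["name"] for doc in store.values()
            if interest in doc.get("interests", [])]

print(find_by_interest(profiles, "nosql"))  # ['Asha']
```

Note that the two documents do not share one rigid schema: "u2" has no email field, which a relational table would have to represent with NULLs, but a document store accepts as-is.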

21. Define Column Oriented Database.
This type of database stores data in the form of columns, which segregates information into homogeneous categories. This allows the user to access only the desired data without having to retrieve unnecessary information. CO6 K1

22. Define Graph Database.
Data is stored in the form of a graph, with related elements like edges, nodes, etc. Data points are placed in such a manner that nodes are related by edges, and thus a network or connection is established between several data points. CO6 K2
23. List out the characteristics of Distributed DBMS Architectures.
DDBMS architectures are generally developed depending on three parameters:
Distribution − It states the physical distribution of data across the different sites.
Autonomy − It indicates the distribution of control of the database system and the degree to which each constituent DBMS can operate independently.
Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system components and databases. CO6 K1

24. What are the distributed transactions?
There are two types of transactions that we need to consider:
Local transactions are those that access and update data in only one local database.
Global transactions are those that access and update data in several local databases. CO6 K2
Part – B Questions
S.No Questions CO K
1 Explain the architectural models of distributed databases CO5 K1
2 Explain distributed database transactions and the types of DDBMS CO5 K1
3 Explain locking based protocols for distributed databases CO5 K1
4 Explain Time Stamp and Validation Based Protocols for distributed databases CO5 K1
5 Explain fragmentation and replication in DDBMS CO5 K2

6 Explain NoSQL database in detail CO6 K1


7 Explain the Client - Server Architecture for DDBMS in detail CO6 K1

8 Explain the characteristics of NoSQL database systems CO6 K1


9 Depict the need for NoSQL databases and list out their applications. CO6 K1
10 Explain NoSQL and its Categories in detail. CO6 K1
13. SUPPORTIVE ONLINE CERTIFICATION COURSES

S.No. Institute Course Website Link
1. NPTEL Distributed Systems https://nptel.ac.in/courses/106/106/106106168/
2. NPTEL Introduction to Modern Application Development https://nptel.ac.in/courses/106/106/106106156/
3. EdX NoSQL Database Basics https://www.edx.org/course/nosql-basics
4. Udemy Learn MongoDB: Leading NoSQL Database https://www.udemy.com/course/learn-mongodb-leading-nosql-database-from-scratch/
14. REAL TIME APPLICATIONS IN DAY TO DAY LIFE
AND TO INDUSTRY

In general, the traditional approach to disaster protection does not apply to online
transaction processing. An alternate approach to disaster protection is to have two or
more sites actively back one another up. During normal operation, each site stores a
replica of the data and each carries part of the telecommunications and
computational load. In an emergency, all work is shifted to the surviving sites. For
failsafe protection, the sites must be independent of each other's failures. They
should be geographically separate, part of different power grids, switching stations,
and so on. Geographic diversity gives protection from fires, natural disasters, and
sabotage.

A finance company gives a good example of this approach. The company has traders
of notes, loans, and bonds located in ten American cities. To allow delivery to the
banks for processing on a "same-day" basis, deals must be consummated by 1 PM.
Traders do not start making deals until about 11 AM because interest rates can
change several times during the morning. In a three hour period the traders move
about a billion dollars of commercial paper.

To guard against lost data and lost work, two identical data centers were installed,
each a six-processor system with 10 spindles of disc. The data centers are about 500
km apart. Each center supports half the terminals, although each trading floor has a
direct path to both centers. Each data center stores the whole database. As is
standard with Tandem, each data center duplexes its discs so the database is stored
on four sets of discs they call this the "quad" database architecture. Hence, there is
redundancy of communications, processing, and data.
15. CONTENT BEYOND SYLLABUS

XML Databases - Hierarchical Model


XML Hierarchical (Tree) Data Model:

The basic object in XML is the XML document. Two main structuring concepts are used to construct an XML document: elements and attributes. As in HTML, elements are identified in a document by their start tag and end tag. The tag names are enclosed between angled brackets, < ... >, and end tags are further identified by a slash, </ ... >.

Complex elements are constructed from other elements hierarchically, whereas simple
elements contain data values. A major difference between XML and HTML is that XML tag
names are defined to describe the meaning of the data elements in the document, rather
than to describe how the text is to be displayed. This makes it possible to process the data
elements in the XML document automatically by computer programs. Also, the XML tag
(element) names can be defined in another document, known as the schema document, to
give a semantic meaning to the tag names that can be exchanged among multiple users. In
HTML, all tag names are predefined and fixed; that is why they are not extendible.

It is possible to characterize three main types of XML documents:


• Data-centric XML documents: These documents have many small data items that follow a specific structure and hence may be extracted from a structured database. They are formatted as XML documents in order to exchange them over the Web or display them on the Web. These usually follow a predefined schema that defines the tag names.
• Document-centric XML documents: These are documents with large amounts of text, such as news articles or books. There are few or no structured data elements in these documents.
• Hybrid XML documents: These documents may have parts that contain structured data and other parts that are predominantly textual or unstructured. They may or may not have a predefined schema.
XML documents that do not follow a predefined schema of element names and corresponding tree structure are known as schemaless XML documents. It is important to note that data-centric XML documents can be considered either as semi-structured data or as structured data.
DOCUMENT TYPE DEFINITION (DTD):
The document type definition (DTD) is an optional part of an XML
document. The main purpose of a DTD is much like that of a schema:
to constrain and type the information present in the document.

However, the DTD does not in fact constrain types in the sense of basic types like integer or
string. Instead, it constrains only the appearance of sub elements and attributes within an
element. The DTD is primarily a list of rules for what pattern of sub-elements may appear
within an element.

Example of a DTD
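The DTD example itself is not reproduced in this material; the following sketch is reconstructed to match the description below (underscores are used in names such as course_id because XML names cannot contain spaces; the textbook's exact DTD may differ):

```dtd
<!DOCTYPE university [
  <!ELEMENT university ((course|department|instructor)+)>
  <!ELEMENT course     (course_id, title, dept_name, credits)>
  <!ELEMENT department (dept_name, building, budget)>
  <!ELEMENT instructor (IID, name, dept_name, salary)>
  <!ELEMENT course_id  (#PCDATA)>
  <!ELEMENT title      (#PCDATA)>
  <!ELEMENT dept_name  (#PCDATA)>
  <!ELEMENT credits    (#PCDATA)>
  <!ELEMENT building   (#PCDATA)>
  <!ELEMENT budget     (#PCDATA)>
  <!ELEMENT IID        (#PCDATA)>
  <!ELEMENT name       (#PCDATA)>
  <!ELEMENT salary     (#PCDATA)>
]>
```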
Thus, in the DTD, a university element consists of one or more course, department, or instructor elements; the | operator specifies "or" while the + operator specifies "one or more." Although not shown here, the * operator is used to specify "zero or more," while the ? operator is used to specify an optional element (that is, "zero or one"). The course element contains sub elements course id, title, dept name, and credits (in that order).
Similarly, department and instructor have the attributes of their relational schema defined as
sub elements in the DTD. Finally, the elements course id, title, dept name, credits, building,
budget, IID, name, and salary are all declared to be of type #PCDATA. The keyword #PCDATA indicates text data; it derives its name, historically, from "parsed character data."
Two other special type declarations are empty, which says that the element has no
contents, and any, which says that there is no constraint on the sub elements of the
element; that is, any elements, even those not mentioned in the DTD, can occur as
sub elements of the element. The absence of a declaration for an element is
equivalent to explicitly declaring the type as any.

XML SCHEMA
XML Schema defines a number of built-in types such as string, integer, decimal date,
and boolean. In addition, it allows user-defined types; these may be simple types
with added restrictions, or complex types constructed using constructors such as
complex Type and sequence.

Note that any namespace prefix could be used in place of xs; thus we could replace all occurrences of "xs:" in the schema definition with "xsd:" without changing the meaning of the schema definition. All types defined by XML Schema must be prefixed by this namespace prefix. The first element is the root element university, whose type is specified to be UniversityType, which is declared later. The example then defines the types of elements department, course, instructor, and teaches. Note that each of these is specified by an element with tag xs:element, whose body contains the type definition.

The type of department is defined to be a complex type, which is further specified to consist of a sequence of elements dept name, building, and budget. Any type that has either attributes or nested sub elements must be specified to be a complex type. Alternatively, the type of an element can be specified to be a predefined type by the attribute type; observe how the XML Schema types xs:string and xs:decimal are used to constrain the types of data elements such as dept name and credits. Finally, the example defines the type UniversityType as containing zero or more occurrences of each of department, course, instructor, and teaches. Note the use of ref to specify the occurrence of an element defined earlier.
XML Schema can define the minimum and maximum number of occurrences of sub elements by using minOccurs and maxOccurs. The default for both minimum and maximum occurrences is 1, so these have to be specified explicitly to allow zero or more department, course, instructor, and teaches elements.
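A fragment (a sketch following the xs: conventions described above; the exact schema in the textbook may differ) showing how zero-or-more occurrences are declared with minOccurs and maxOccurs:

```xml
<xs:complexType name="UniversityType">
  <xs:sequence>
    <xs:element ref="department" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element ref="course"     minOccurs="0" maxOccurs="unbounded"/>
    <xs:element ref="instructor" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element ref="teaches"    minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>
```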
Attributes are specified using the xs:attribute tag. For example, we could have
defined dept name as an attribute by adding:

<xs:attribute name = "dept name"/>


within the declaration of the department element. Adding the attribute use = "required" to the above attribute specification declares that the attribute must be specified, whereas the default value of use is optional. Attribute specifications would appear directly under the enclosing complexType specification, even if elements are nested within a sequence specification.
In addition to defining types, XML Schema also allows the specification of constraints. It allows the specification of keys and key references, corresponding to the primary-key and foreign-key definitions in SQL. In SQL, a
primary-key constraint or unique constraint ensures that the attribute values do not
recur within the relation. In the context of XML, we need to specify a scope within
which values are unique and form a key. The selector is a path expression that
defines the scope for the constraint, and field declarations specify the elements or
attributes that form the key. To specify that dept name forms a key for department
elements under the root university element, we add the following constraint
specification to the schema definition:
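The constraint specification itself is not reproduced here; a sketch of what it would look like, using the xs:key syntax (the key name deptKey and the underscore in dept_name are assumptions for illustration, since XPath names cannot contain spaces):

```xml
<xs:key name="deptKey">
  <xs:selector xpath="/university/department"/>
  <xs:field xpath="dept_name"/>
</xs:key>
```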
XML Schema offers several benefits over DTDs, and is widely used today. Among the
benefits that we have seen in the examples above are these:
• It allows the text that appears in elements to be constrained to specific types, such as numeric types in specific formats or complex types such as sequences of elements of other types.
• It allows user-defined types to be created.
• It allows uniqueness and foreign-key constraints.
• It is integrated with namespaces to allow different parts of a document to conform to different schemas.
In addition to the features we have seen, XML Schema supports several other features that DTDs do not, such as these:
• It allows types to be restricted to create specialized types, for instance by specifying minimum and maximum values.
• It allows complex types to be extended by using a form of inheritance.

XQUERY
XPath allows us to write expressions that select items from a tree-structured XML
document. XQuery permits the specification of more general queries on one or more
XML documents. The typical form of a query in XQuery is known as a FLWR expression, which stands for the four main clauses of XQuery and has the following form:
FOR <variable bindings to individual nodes (elements)>
LET <variable bindings to collections of nodes (elements)>
WHERE <qualifier conditions>
RETURN <query result specification>
There can be zero or more instances of the FOR clause, as well as of the LET clause, in a single XQuery. The WHERE clause is optional but can appear at most once, and the RETURN clause must appear exactly once. Let us illustrate these clauses with the following simple example of an XQuery.
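The example query itself is not reproduced in this material; the following sketch is reconstructed from the numbered explanation below (the document file name company.xml and element names such as employeeName are assumptions based on that explanation):

```xquery
LET $d := doc("company.xml")
FOR $x IN $d/company/project[projectNumber = 5]/projectWorker,
    $y IN $d/company/employee
WHERE $x/hours gt 20.0 AND $y/ssn = $x/ssn
RETURN <res> { $y/employeeName/firstName, $y/employeeName/lastName } </res>
```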
1.Variables are prefixed with the $ sign. In the above example, $d, $x, and $y are
variables.

2.The LET clause assigns a variable to a particular expression for the rest of the
query. In this example, $d is assigned to the document file name. It is possible to
have a query that refers to multiple documents by assigning multiple variables in
this way.
3.The FOR clause assigns a variable to range over each of the individual items in a
sequence. In our example, the sequences are specified by path expressions. The
$x variable ranges over elements that satisfy the path expression

$d/company/project [projectNumber = 5]/projectWorker. The $y variable ranges


over elements that satisfy the path expression $d/company/employee. Hence, $x
ranges over projectWorker elements, whereas $y ranges over employee
elements.
4.The WHERE clause specifies additional conditions on the selection of items. In
this example, the first condition selects only those projectWorker elements that
satisfy the condition (hours gt 20.0). The second condition specifies a join
condition that combines an employee with a projectWorker only if they have the
same ssn value.
5. Finally, the RETURN clause specifies which elements or attributes should be retrieved from the items that satisfy the query conditions. In this example, it will return a sequence of elements, one for each employee who works more than 20 hours per week on project number 5.
XQuery has very powerful constructs to specify complex queries. In particular, it can
specify universal and existential quantifiers in the conditions of a query, aggregate
functions, ordering of query results, selection based on position in a sequence, and
even conditional branching. Hence, in some ways, it qualifies as a full-fledged
programming language.
16. ASSESSMENT SCHEDULE

Assessment Tools Proposed Date

Assessment 1 20.9.2021

Assessment 2 22.10.2021

Model Exam 18.11.2021


17.PRESCRIBED TEXT BOOKS & REFERENCE BOOKS
TEXT BOOKS:
1. Elmasri R. and S. Navathe, "Fundamentals of Database Systems", Pearson Education, 7th Edition, 2016.
2. Abraham Silberschatz, Henry F. Korth, "Database System Concepts", Tata McGraw Hill, 7th Edition, 2021.
3. Elmasri R. and S. Navathe, "Database Systems: Models, Languages, Design and Application Programming", Pearson Education, 2013.

REFERENCES:
1. Raghu Ramakrishnan, Gehrke, "Database Management Systems", McGraw Hill, 3rd Edition, 2014.
2. Plunkett T., B. Macdonald, "Oracle Big Data Handbook", McGraw Hill, First Edition, 2013.
3. Gupta G K, "Database Management Systems", Tata McGraw Hill Education Private Limited, New Delhi, 2011.
4. C. J. Date, A. Kannan, S. Swamynathan, "An Introduction to Database Systems", Eighth Edition, Pearson Education, 2015.
5. Maqsood Alam, Aalok Muley, Chaitanya Kadaru, Ashok Joshi, "Oracle NoSQL Database: Real-Time Big Data Management for the Enterprise", McGraw Hill Professional, 2013.
6. Thomas Connolly, Carolyn Begg, "Database Systems: A Practical Approach to Design, Implementation and Management", Pearson, 6th Edition, 2015.
18. MINI PROJECT SUGGESTIONS

Design a Distributed Relational database for the following


1) Insurance Management System
The insurance management project deals with adding new insurance schemes and managing the clients for the insurance. The project has complete access to the CRUD operations, that is, create, read, update and delete database entries. First, add a branch and the staff members for the branch; then add a user to the database. Now you can add an insurance scheme and finally make the payments for the client to the added insurance.

2) Inventory Management
The project starts by adding a seller and the details of a customer. The user can then purchase new products from the desired seller and sell them to the customer; the purchasing and selling of products is reflected in the inventory section. The main aim of this Inventory Management mini DBMS project is to add new products, sell them, and keep an inventory to manage them.

3) Pharmacy management System


The project starts by adding a dealer and the details of a customer. The user can then purchase new medicines from the desired dealer and sell them to the customer; the purchasing and selling of medicines is reflected in the inventory section.

4) Library management system


There will be an admin who is responsible for managing the system. The admin will track how many books are available in the library and can update them if any new books are brought to the library. Perform operations like adding a book to the database, viewing all books which are added, searching for a specific book, issuing books, and retrieving books from users.
5) Hotel management system
Add features like providing the menu of the food items that are prepared in the hotel, along with their prices. You can add an online food-ordering facility, which will give a good impression of the project. Tables can also be booked through the project. Add a feature that shows the collection of the day, which can be viewed only by the admin. Online booking of rooms can also be provided. Improve this with other features.
Thank you

Disclaimer:

This document is confidential and intended solely for the educational purpose of RMK Group of
Educational Institutions. If you have received this document through email in error, please notify the
system manager. This document contains proprietary information and is intended only to the
respective group / learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender immediately by e-mail if you
have received this document by mistake and delete this document from your system. If you are not
the intended recipient you are notified that disclosing, copying, distributing or taking any action in
reliance on the contents of this information is strictly prohibited.
