
MODULE –V NOTES

Module 5
Concurrency Control in Databases
In this chapter we discuss a number of concurrency control techniques that are used to ensure the non-interference or isolation property of concurrently executing transactions. Most of these techniques ensure serializability of schedules.
2.1 Two-Phase Locking Techniques for Concurrency Control
Some of the main techniques used to control concurrent execution of transactions are based on the concept of
locking data items. A lock is a variable associated with a data item that describes the status of the item with
respect to possible operations that can be applied to it.
2.1.1 Types of Locks and System Lock Tables
Several types of locks are used in concurrency control. To introduce locking concepts gradually, first we
discuss binary locks, which are simple, but are also too restrictive for database concurrency control purposes,
and so are not used in practice. Then we discuss shared/exclusive locks—also known as read/write locks—
which provide more general locking capabilities and are used in practical database locking schemes. Finally, we describe an additional type of lock called a certify lock, and show how it can be used to improve performance of locking protocols.
Binary Locks. A binary lock can have two states or values: locked and unlocked (or 1 and 0, for simplicity). A distinct lock is associated with each database item X. If the value of the lock on X is 1, item X cannot be accessed by a database operation that requests the item. If the value of the lock on X is 0, the item can be accessed when requested, and the lock value is changed to 1. We refer to the current value (or state) of the lock associated with item X as LOCK(X).
Two operations, lock_item and unlock_item, are used with binary locking. A transaction requests access to an item X by first issuing a lock_item(X) operation. If LOCK(X) = 1, the transaction is forced to wait. If LOCK(X) = 0, it is set to 1 (the transaction locks the item) and the transaction is allowed to access item X. When the transaction is through using the item, it issues an unlock_item(X) operation, which sets LOCK(X) back to 0 (unlocks the item) so that X may be accessed by other transactions. Hence, a binary lock enforces mutual exclusion on the data item. A description of the lock_item(X) and unlock_item(X) operations is shown in Figure 22.1.
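To make the two operations concrete, the following Python sketch (not part of the original notes; the class name BinaryLockManager and its structure are invented for illustration) models lock_item and unlock_item with a per-item lock status and a condition variable, assuming one thread per transaction.

import threading

class BinaryLockManager:
    """Minimal sketch of binary locks: each item is either locked (1) or unlocked (0)."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._conds = {}              # item -> condition variable guarding that item
        self._state = {}              # item -> 0 (unlocked) or 1 (locked)

    def _cond(self, item):
        with self._mutex:
            if item not in self._conds:
                self._conds[item] = threading.Condition()
                self._state[item] = 0
            return self._conds[item]

    def lock_item(self, item):
        cond = self._cond(item)
        with cond:
            while self._state[item] == 1:   # LOCK(X) = 1: the transaction must wait
                cond.wait()
            self._state[item] = 1           # LOCK(X) = 0: set it to 1 and grant access

    def unlock_item(self, item):
        cond = self._cond(item)
        with cond:
            self._state[item] = 0           # reset LOCK(X) to 0
            cond.notify()                   # wake one transaction waiting on X

A transaction would call lock_item("X"), access X, and then call unlock_item("X"), which matches the mutual exclusion behaviour described above.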


If the simple binary locking scheme described here is used, every transaction must obey the following rules:
1. A transaction T must issue the operation lock_item(X) before any read_item(X) or write_item(X) operations are performed in T.
2. A transaction T must issue the operation unlock_item(X) after all read_item(X) and write_item(X) operations are completed in T.
3. A transaction T will not issue a lock_item(X) operation if it already holds the lock on item X.
4. A transaction T will not issue an unlock_item(X) operation unless it already holds the lock on item X.

Shared/Exclusive (or Read/Write) Locks: The preceding binary locking scheme is too restrictive for database items because at most one transaction can hold a lock on a given item. We should allow several
transactions to access the same item X if they all access X for reading purposes only. This is because read
operations on the same item by different transactions are not conflicting. However, if a transaction is to write
an item X, it must have exclusive access to X.
For this purpose, a different type of lock called a multiple-mode lock is used. In this scheme—called
shared/exclusive or read/write locks—there are three locking operations: read_lock(X), write_lock(X),
and unlock(X). A lock associated with an item X, LOCK(X), now has three possible states: read-locked,
write-locked, or unlocked. A read-locked item is also called share-locked because other transactions are
allowed to read the item, whereas a write-locked item is called exclusive-locked because a single transaction
exclusively holds the lock on the item.

When we use the shared/exclusive locking scheme, the system must enforce the following rules:
1. A transaction T must issue the operation read_lock(X) or write_lock(X) before any read_item(X) operation is
performed in T.
2. A transaction T must issue the operation write_lock(X) before any write_item(X) operation is performed in
T.
3. A transaction T must issue the operation unlock(X) after all read_item(X) and write_item(X) operations are
completed in T.
4. A transaction T will not issue a read_lock(X) operation if it already holds a read (shared) lock or a write (exclusive) lock on item X. This rule may be relaxed, as we discuss shortly.
5. A transaction T will not issue a write_lock(X) operation if it already holds a read (shared) lock or write (exclusive) lock on item X. This rule may also be relaxed, as we discuss shortly.
6. A transaction T will not issue an unlock(X) operation unless it already holds a read (shared) lock or a
write (exclusive) lock on item X.
Conversion of Locks. Sometimes it is desirable to relax conditions 4 and 5 in the preceding list in order to
allow lock conversion; that is, a transaction that already holds a lock on item X is allowed under certain
conditions to convert the lock from one locked state to another. For example, it is possible for a transaction T to
issue a read_lock(X) and then later to upgrade the lock by issuing a write_lock(X) operation. If T is the only
transaction holding a read lock on X at the time it issues the write_lock(X) operation, the lock can be upgraded;
otherwise, the transaction must wait. It is also possible for a transaction T to issue a write_lock(X) and then later to downgrade the lock by issuing a read_lock(X) operation.
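The shared/exclusive rules and the upgrade condition can be sketched as a simple lock table in Python (illustrative only; the class RWLockManager and its return conventions are invented). A False return stands for "the requesting transaction must wait"; the upgrade from read to write succeeds only when the requester is the sole holder of the read lock, as described above.

class RWLockManager:
    """Sketch of shared/exclusive locks. table[item] is ('R', set_of_txns) or ('W', txn)."""

    def __init__(self):
        self.table = {}                       # item -> lock entry; absent means unlocked

    def read_lock(self, txn, item):
        entry = self.table.get(item)
        if entry is None:
            self.table[item] = ('R', {txn})   # first reader share-locks the item
            return True
        mode, owner = entry
        if mode == 'R':
            owner.add(txn)                    # further readers are always compatible
            return True
        return owner == txn                   # write-locked: only the owner may also read

    def write_lock(self, txn, item):
        entry = self.table.get(item)
        if entry is None:
            self.table[item] = ('W', txn)
            return True
        mode, owner = entry
        if mode == 'W':
            return owner == txn               # already write-locked by txn itself
        if owner == {txn}:                    # upgrade: txn is the only read-lock holder
            self.table[item] = ('W', txn)
            return True
        return False                          # other readers exist, so txn must wait

    def unlock(self, txn, item):
        entry = self.table.get(item)
        if entry is None:
            return
        mode, owner = entry
        if mode == 'W' and owner == txn:
            del self.table[item]              # release the exclusive lock
        elif mode == 'R':
            owner.discard(txn)
            if not owner:
                del self.table[item]          # last reader gone: item is unlocked

A real lock manager would block a waiting transaction and keep a wait queue per item, as in the binary-lock sketch earlier, rather than returning False.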

2.1.2 Guaranteeing Serializability by Two-Phase Locking
Two-Phase Locking
A transaction is said to follow the two-phase locking protocol if all locking operations (read_lock, write_lock)
precede the first unlock operation in the transaction. Such a transaction can be divided into two phases.

Growing (first) phase: an expanding phase during which new locks on items can be acquired but none can be released.

Shrinking (second) phase: a phase during which existing locks can be released but no new locks can be acquired.

If lock conversion is allowed, then upgrading of locks (from read-locked to write-locked) must be done
during the expanding phase, and downgrading of locks (from write-locked to read-locked) must be done in
the shrinking phase. Hence, a read_lock(X) operation that downgrades an already held write lock on X can
appear only in the shrinking phase.
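As a small illustration (not from the notes), the check below scans a transaction's sequence of locking operations and reports whether it satisfies basic 2PL, that is, whether any lock is acquired after the first unlock.

def follows_2pl(ops):
    """ops is a list such as [('read_lock', 'Y'), ('unlock', 'Y'), ...].
    Returns True if every lock acquisition precedes the first unlock."""
    shrinking = False
    for op, _item in ops:
        if op == 'unlock':
            shrinking = True                  # the shrinking phase has begun
        elif op in ('read_lock', 'write_lock') and shrinking:
            return False                      # a new lock after an unlock violates 2PL
    return True

# T1 of Figure 22.3(a): write_lock(X) comes after unlock(Y), so 2PL is violated.
print(follows_2pl([('read_lock', 'Y'), ('unlock', 'Y'), ('write_lock', 'X'), ('unlock', 'X')]))   # False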

Transactions T1 and T2 in Figure 22.3(a) do not follow the two-phase locking protocol because the write_lock(X) operation follows the unlock(Y) operation in T1, and similarly the write_lock(Y) operation follows the unlock(X) operation in T2.

If we enforce two-phase locking, the transactions can be rewritten as T1′ and T2′, as shown in Figure 22.4. Now, the schedule shown in Figure 22.3(c) is not permitted for T1′ and T2′ (with their modified order of locking and unlocking operations).


Variations of Two-Phase Locking (2PL): a) Basic, b) Conservative, c) Strict, and d) Rigorous:
There are a number of variations of two-phase locking (2PL). The technique just described is known as basic 2PL.


A variation known as conservative 2PL (or static 2PL) requires a transaction to lock all the items it accesses before the transaction begins execution, by predeclaring its read-set and write-set. The read-set of a transaction is the set of all items that the transaction reads, and the write-set is the set of all items that it writes. If any of the predeclared items needed cannot be locked, the transaction does not lock any item; instead, it waits until all the items are available for locking. Conservative 2PL is a deadlock-free protocol.

In practice, the most popular variation of 2PL is strict 2PL, which guarantees strict schedules. In this
variation, a transaction T does not release any of its exclusive (write) locks until after it commits or aborts.
Hence, no other transaction can read or write an item that is written by T unless T has committed, leading to
a strict schedule for recoverability. Strict 2PL is not deadlock-free.

A more restrictive variation of strict 2PL is rigorous 2PL, which also guarantees strict schedules. In this
variation, a transaction T does not release any of its locks (exclusive or shared) until after it commits or
aborts, and so it is easier to implement than strict 2PL.

2.1.3 Dealing with Deadlock and Starvation


Deadlock occurs when each transaction T in a set of two or more transactions is waiting for some item that is locked by some other transaction T′ in the set. Hence, each transaction in the set is in a waiting queue,
waiting for one of the other transactions in the set to release the lock on an item. But because the other
transaction is also waiting, it will never release the lock.

A simple example is shown in Figure 22.5(a), where the two transactions T1 and T2 are deadlocked in a partial schedule; T1 is in the waiting queue for X, which is locked by T2, while T2 is in the waiting queue for Y, which is locked by T1. Meanwhile, neither T1 nor T2 nor any other transaction can access items X and Y.

Deadlock Prevention Protocols. One way to prevent deadlock is to use a deadlock prevention protocol.
One deadlock prevention protocol, which is used in conservative two-phase locking, requires that every
transaction lock all the items it needs in advance (which is generally not a practical assumption)—if any of
the items cannot be obtained, none of the items are locked. Rather, the transaction waits and then tries again
to lock all the items it needs. Obviously, this solution further limits concurrency. A second protocol, which
also limits concurrency, involves ordering all the items in the database and making sure that a transaction
that needs several items will lock them according to that order. This requires that the programmer (or the
system) is aware of the chosen order of the items, which is also not practical in the database context.

A number of other deadlock prevention schemes have been proposed that make a decision about what to do
with a transaction involved in a possible deadlock situation.

Some of these techniques use the concept of transaction timestamp TS(T), which is a unique identifier
assigned to each transaction. The timestamps are typically based on the order in which transactions are
started; hence, if transaction T1 starts before transaction T2,then TS(T1) < TS(T2).

Two schemes that prevent deadlock are called wait-die and wound-wait. Suppose that transaction Ti tries to
lock an item X but is not able to because X is locked by some other transaction Tj with a conflicting lock.
The rules followed by these schemes are:

 Wait-die. If TS(Ti) < TS(Tj), then (Ti older than Tj) Ti is allowed to wait; otherwise (Ti younger than Tj) abort Ti (Ti dies) and restart it later with the same timestamp.

 Wound-wait. If TS(Ti) < TS(Tj), then (Ti older than Tj) abort Tj (Ti wounds Tj) and restart it later with the same timestamp; otherwise (Ti younger than Tj) Ti is allowed to wait.
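The two rules translate almost verbatim into code; the Python sketch below (function names invented for illustration) takes the timestamps of the requesting transaction Ti and the holding transaction Tj and returns the action each scheme prescribes.

def wait_die(ts_i, ts_j):
    """Ti (timestamp ts_i) requests an item held by Tj (timestamp ts_j)."""
    if ts_i < ts_j:
        return 'Ti waits'                                   # the older transaction may wait
    return 'abort Ti (dies); restart later with the same timestamp'

def wound_wait(ts_i, ts_j):
    if ts_i < ts_j:
        return 'abort Tj (wounded); restart later with the same timestamp'
    return 'Ti waits'                                       # the younger transaction waits

# Example: Ti started first (TS = 5), Tj started later (TS = 9).
print(wait_die(5, 9))     # Ti waits
print(wound_wait(5, 9))   # abort Tj (wounded); restart later with the same timestamp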

Another group of protocols that prevent deadlock do not require timestamps. These include the no waiting
(NW) and cautious waiting (CW) algorithms.

In the no waiting algorithm, if a transaction is unable to obtain a lock, it is immediately aborted and then
restarted after a certain time delay without checking whether a deadlock will actually occur or not. In this
case, no transaction ever waits, so no deadlock will occur.
The cautious waiting algorithm was proposed to try to reduce the number of needless aborts/restarts.
Suppose that transaction Ti tries to lock an item X but is not able to do so because X is locked by some
other transaction Tj with a conflicting lock. The cautious waiting rules are as follows:
 Cautious waiting. If Tj is not blocked (not waiting for some other locked item), then Ti is blocked and allowed to wait; otherwise abort Ti.

Deadlock Detection: A second, more practical approach to dealing with deadlock is deadlock detection,
where the system checks if a state of deadlock actually exists.

A simple way to detect a state of deadlock is for the system to construct and maintain a wait-for graph. One
node is created in the wait-for graph for each transaction that is currently executing. Whenever a transaction
Ti is waiting to lock an item X that is currently locked by a transaction Tj, a directed edge (Ti→ Tj) is
created in the wait-for graph. When Tj releases the lock(s) on the items that Ti was waiting for, the directed
edge is dropped from the wait-for graph. We have a state of deadlock if and only if the wait-for graph has a
cycle.
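As an illustration (not part of the notes), the sketch below represents the wait-for graph as a dictionary of outgoing edges and detects deadlock by searching for a cycle with a depth-first traversal.

def has_deadlock(wait_for):
    """wait_for maps each transaction to the set of transactions it waits for.
    Returns True if the wait-for graph contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in wait_for}

    def visit(t):
        color[t] = GRAY
        for u in wait_for.get(t, ()):
            if color.get(u, WHITE) == GRAY:          # back edge found: a cycle, hence deadlock
                return True
            if color.get(u, WHITE) == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in list(color))

# Figure 22.5(b): T1 waits for T2 (for X) and T2 waits for T1 (for Y), so there is a deadlock.
print(has_deadlock({'T1': {'T2'}, 'T2': {'T1'}}))    # True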

Figure 22.5(b) shows the wait-for graph for the (partial) schedule shown in Figure 22.5(a).

If the system is in a state of deadlock, some of the transactions causing the deadlock must be aborted.
Choosing which transactions to abort is known as victim selection.

Timeouts. Another simple scheme to deal with deadlock is the use of timeouts. This method is practical
because of its low overhead and simplicity. In this method, if a transaction waits for a period longer than a
system-defined timeout period, the system assumes that the transaction may be deadlocked and aborts it—
regardless of whether a deadlock actually exists or not.

Starvation. Another problem that may occur when we use locking is starvation, which occurs when a
transaction cannot proceed for an indefinite period of time while other transactions in the system continue
normally. This may occur if the waiting scheme for locked items is unfair, giving priority to some
transactions over others. One solution for starvation is to have a fair waiting scheme, such as using a first-
come-first-served queue; transactions are enabled to lock an item in the order in which they originally
requested the lock. Another scheme allows some transactions to have priority over others but increases the
priority of a transaction the longer it waits, until it eventually gets the highest priority and proceeds.

Starvation can also occur because of victim selection if the algorithm selects the same transaction as victim
repeatedly, thus causing it to abort and never finish execution. The algorithm can use higher priorities for
transactions that have been aborted multiple times to avoid this problem. The wait-die and wound-wait
schemes discussed previously avoid starvation, because they restart a transaction that has been aborted with
its same original timestamp, so the possibility that the same transaction is aborted repeatedly is slim.

2.2 Concurrency Control Based on Timestamp Ordering


A different approach that guarantees serializability involves using transaction timestamps to order
transaction execution for an equivalent serial schedule. we discuss how serializability is enforced by
ordering transactions based on their timestamps.

2.2.1 Timestamps
A timestamp is a unique identifier created by the DBMS to identify a transaction. Typically, timestamp
values are assigned in the order in which the transactions are submitted to the system, so a timestamp can be
thought of as the transaction start time. We will refer to the timestamp of transaction T as TS(T).
Concurrency control techniques based on timestamp ordering do not use locks; hence, deadlocks cannot
occur.

Timestamps can be generated in several ways. One possibility is to use a counter that is incremented each
time its value is assigned to a transaction. The transaction timestamps are numbered 1, 2, 3, ... in this scheme.
A computer counter has a finite maximum value, so the system must periodically reset the counter to zero
when no transactions are executing for some short period of time.

Another way to implement timestamps is to use the current date/time value of the system clock and ensure
that no two timestamp values are generated during the same tick of the clock.
2.2.2 The Timestamp Ordering Algorithm
The idea for this scheme is to order the transactions based on their timestamps. A schedule in which the
transactions participate is then serializable, and the only equivalent serial schedule permitted has the
transactions in order of their timestamp values. This is called timestamp ordering (TO). In timestamp ordering, the schedule is equivalent to the particular serial order corresponding to the order of the
transaction timestamps. The algorithm must ensure that, for each item accessed by conflicting operations in
the schedule, the order in which the item is accessed does not violate the timestamp order. To do this, the
algorithm associates with each database item X two timestamp (TS) values:

1. read_TS(X). The read timestamp of item X is the largest timestamp among all the timestamps of
transactions that have successfully read item X—that is, read_TS(X) = TS(T), where T is the youngest
transaction that has read X successfully.

2. write_TS(X). The write timestamp of item X is the largest of all the timestamps of transactions that have
successfully written item X—that is, write_TS(X) = TS(T), where T is the youngest transaction that has
written X successfully.

Basic Timestamp Ordering (TO): Whenever some transaction T tries to issue a read_item(X) or a
write_item(X) operation, the basic TO algorithm compares the timestamp of T with read_TS(X) and
write_TS(X) to ensure that the timestamp order of transaction execution is not violated. If this order is
violated, then transaction T is aborted and resubmitted to the system as a new transaction with a new
timestamp. If T is aborted and rolled back, any transaction T1 that may have used a value written by T must
also be rolled back. Similarly, any transaction T2 that may have used a value written by T1 must also be
rolled back, and so on. This effect is known as cascading rollback.

We first describe the basic TO algorithm here. The concurrency control algorithm must check whether
conflicting operations violate the timestamp ordering in the following two cases:
1. Whenever a transaction T issues a write_item(X) operation, the following is checked:
a. If read_TS(X) > TS(T) or if write_TS(X) > TS(T), then abort and roll back T and reject the operation. This should be done because some younger transaction with a timestamp greater than TS(T)—and hence after T in the timestamp ordering—has already read or written the value of item X before T had a chance to write X, thus violating the timestamp ordering.

b. If the condition in part (a) does not occur, then execute the write_item(X) operation of T and set
write_TS(X) to TS(T).

2. Whenever a transaction T issues a read_item(X) operation, the following is checked:

a. If write_TS(X) > TS(T), then abort and roll back T and reject the operation. This should be done
because some younger transaction with timestamp greater than TS(T)—and hence after T in the timestamp
ordering—has already written the value of item X before T had a chance to read X.

b. If write_TS(X) ≤ TS(T), then execute the read_item(X) operation of T and set read_TS(X) to the
larger of TS(T) and the current read_TS(X).
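These two checks translate directly into code. The following Python sketch (the class name and data structures are invented for illustration) keeps read_TS and write_TS per item, with 0 standing for "never read or written", and reports whether an operation is executed or the issuing transaction must be aborted.

class TimestampOrdering:
    """Sketch of the basic TO checks."""

    def __init__(self):
        self.read_ts = {}    # item -> largest TS of a transaction that has read it
        self.write_ts = {}   # item -> largest TS of a transaction that has written it

    def write_item(self, ts, item):
        if self.read_ts.get(item, 0) > ts or self.write_ts.get(item, 0) > ts:
            return 'abort'                   # a younger transaction already read or wrote X
        self.write_ts[item] = ts             # execute the write and set write_TS(X) = TS(T)
        return 'ok'

    def read_item(self, ts, item):
        if self.write_ts.get(item, 0) > ts:
            return 'abort'                   # a younger transaction already wrote X
        self.read_ts[item] = max(ts, self.read_ts.get(item, 0))
        return 'ok'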


Strict Timestamp Ordering (TO): A variation of basic TO called strict TO ensures that the schedules are both strict (for easy recoverability) and (conflict) serializable. In this variation, a transaction T that issues a read_item(X) or write_item(X) such that TS(T) > write_TS(X) has its read or write operation delayed until the transaction T′ that wrote the value of X (hence TS(T′) = write_TS(X)) has committed or aborted. To implement this algorithm, it is necessary to simulate the locking of an item X that has been written by transaction T′ until T′ is either committed or aborted. This algorithm does not cause deadlock, since T waits for T′ only if TS(T) > TS(T′).

Thomas’s Write Rule: A modification of the basic TO algorithm, known as Thomas’s write rule, does not
enforce conflict serializability, but it rejects fewer write operations by modifying the checks for the
write_item(X) operation as follows:

1. If read_TS(X) > TS(T), then abort and roll back T and reject the operation.

2. If write_TS(X) > TS(T), then do not execute the write operation but continue processing. This is because
some transaction with timestamp greater than TS(T)—and hence after T in the timestamp ordering—has
already written the value of X. Thus, we must ignore the write_item(X) operation of T because it is already
outdated and obsolete. Notice that any conflict arising from this situation would be detected by case (1).

3. If neither the condition in part (1) nor the condition in part (2) occurs, then execute the write_item(X)
operation of T and set write_TS(X) to TS(T).
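Extending the hypothetical TimestampOrdering sketch given earlier, Thomas's write rule changes only the write check: an obsolete write is skipped rather than causing an abort.

class TimestampOrderingTWR(TimestampOrdering):
    """Same bookkeeping as above, but write_item follows Thomas's write rule."""

    def write_item(self, ts, item):
        if self.read_ts.get(item, 0) > ts:
            return 'abort'           # rule 1: a younger transaction already read X
        if self.write_ts.get(item, 0) > ts:
            return 'ignore'          # rule 2: the write is outdated, skip it but continue
        self.write_ts[item] = ts     # rule 3: execute the write and set write_TS(X) = TS(T)
        return 'ok'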

2.3 Multiversion Concurrency Control Techniques


Other protocols for concurrency control keep the old values of a data item when the item is updated. These
are known as multiversion concurrency control, because several versions (values) of an item are
maintained. When a transaction requires access to an item, an appropriate version is chosen to maintain the
serializability of the currently executing schedule, if possible. The idea is that some read operations that
would be rejected in other techniques can still be accepted by reading an older version of the item to
maintain serializability. When a transaction writes an item, it writes a new version and the old version(s) of
the item are retained. Some multiversion concurrency control algorithms use the concept of view
serializability rather than conflict serializability. An obvious drawback of multiversion techniques is that
more storage is needed to maintain multiple versions of the database items.

Several multiversion concurrency control schemes have been proposed. We discuss two schemes here, one
based on timestamp ordering and the other based on 2PL.

2.3.1 Multiversion Technique Based on Timestamp Ordering


In this method, several versions X1, X2, ..., Xk of each data item X are maintained. For each version, the
value of version Xi and the following two timestamps are kept:

1. read_TS(Xi). The read timestamp of Xi is the largest of all the timestamps of transactions that have successfully read version Xi.

2. write_TS(Xi). The write timestamp of Xi is the timestamp of the transaction that wrote the value of
version Xi.

Whenever a transaction T is allowed to execute a write_item(X) operation, a new version Xk+1 of item X is
created, with both the write_TS(Xk+1) and the read_TS(Xk+1) set to TS(T). Correspondingly, when a
transaction T is allowed to read the value of version Xi, the value of read_TS(Xi) is set to the larger of the
current read_TS(Xi) and TS(T).

To ensure serializability, the following rules are used:

1. If transaction T issues a write_item(X) operation, and version i of X has the highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T), and read_TS(Xi) > TS(T), then abort and roll back transaction T; otherwise, create a new version Xj of X with read_TS(Xj) = write_TS(Xj) = TS(T).

2. If transaction T issues a read_item(X) operation, find the version i of X that has the highest write_TS(Xi)
of all versions of X that is also less than or equal to TS(T); then return the value of Xi to transaction T, and
set the value of read_TS(Xi) to the larger of TS(T) and the current read_TS(Xi).
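A rough Python sketch of these two rules (the class and its representation of versions are invented for illustration); each version is kept as a [value, read_TS, write_TS] triple, and the version with the largest write_TS not exceeding TS(T) is selected.

class MultiversionTO:
    """Sketch: versions[item] is a list of [value, read_ts, write_ts] triples."""

    def __init__(self):
        self.versions = {}   # assumed to be seeded with an initial version of each item

    def _latest_before(self, item, ts):
        # Version with the highest write_TS that is less than or equal to TS(T).
        cands = [v for v in self.versions.get(item, []) if v[2] <= ts]
        return max(cands, key=lambda v: v[2]) if cands else None

    def read_item(self, ts, item):
        v = self._latest_before(item, ts)
        if v is None:
            return None                      # no suitable version exists (sketch simplification)
        v[1] = max(v[1], ts)                 # read_TS(Xi) := max(read_TS(Xi), TS(T))
        return v[0]

    def write_item(self, ts, item, value):
        v = self._latest_before(item, ts)
        if v is not None and v[1] > ts:      # a younger transaction already read that version
            return 'abort'
        self.versions.setdefault(item, []).append([value, ts, ts])   # new version Xj
        return 'ok'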

2.3.2 Multiversion Two-Phase Locking Using Certify Locks


In this multiple-mode locking scheme, there are three locking modes for an item: read, write, and certify,
instead of just the two modes (read, write) discussed previously. Hence, the state of LOCK(X) for an item
X can be one of read-locked, write-locked, certify-locked, or unlocked. In the standard locking scheme,
with only read and write locks, a write lock is an exclusive lock. We can describe the relationship between
read and write locks in the standard scheme by means of the lock compatibility table shown in Figure
22.6(a). An entry of Yes means that if a transaction T holds the type of lock specified in the column header on item X and if transaction T′ requests the type of lock specified in the row header on the same item X, then T′ can obtain the lock because the locking modes are compatible. On the other hand, an entry of No in the table indicates that the locks are not compatible, so T′ must wait until T releases the lock.

In the standard locking scheme, once a transaction obtains a write lock on an item, no other transactions can access that item. The idea behind multiversion 2PL is to allow other transactions to read the committed version of an item X while a single transaction T holds a write lock on X and writes a new (uncommitted) version X′. However, once T is ready to commit, it must obtain a certify lock on all items that it currently holds write locks on before it can commit. The certify lock is not compatible with read locks, so the transaction may have to delay its commit until all its write-locked items are released by any reading transactions in order to obtain the certify locks. Once the certify locks—which are exclusive locks—are acquired, the committed version X of the data item is set to the value of version X′, version X′ is discarded, and the certify locks are then released. The lock compatibility table for this scheme is shown in Figure 22.6(b).
In this multiversion 2PL scheme, reads can proceed concurrently with a single write operation—an
arrangement not permitted under the standard 2PL schemes.
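The compatibility relationships stated here (reads coexist with a single write, while certify locks are exclusive and incompatible with reads) can be written out directly; the sketch below is one way to encode what Figure 22.6(b) conveys, with the helper can_grant invented for illustration.

# Compatibility of a requested lock (outer key) with a lock already held (inner key).
COMPATIBLE = {
    'read':    {'read': True,  'write': True,  'certify': False},
    'write':   {'read': True,  'write': False, 'certify': False},
    'certify': {'read': False, 'write': False, 'certify': False},
}

def can_grant(requested, held_modes):
    """True if the requested mode is compatible with every lock currently held on the item."""
    return all(COMPATIBLE[requested][h] for h in held_modes)

print(can_grant('read', {'write'}))     # True: reads proceed alongside a single write
print(can_grant('certify', {'read'}))   # False: the commit must wait for the readers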

2.4 Validation (Optimistic) Concurrency Control Techniques


In optimistic concurrency control techniques, also known as validation or certification techniques, no
checking is done while the transaction is executing. Several theoretical concurrency control methods are based on the validation technique.


At the end of transaction execution, a validation phase checks whether any of the transaction’s updates violate
serializability. Certain information needed by the validation phase must be kept by the system. If
serializability is not violated, the transaction is committed and the database is updated from the local copies;
otherwise, the transaction is aborted and then restarted later.
There are three phases for this concurrency control protocol:
1. Read phase. A transaction can read values of committed data items from the database. However,
updates are applied only to local copies (versions) of the data items kept in the transaction workspace.
2. Validation phase. Checking is performed to ensure that serializability will not be violated if the
transaction updates are applied to the database.
3. Write phase. If the validation phase is successful, the transaction updates are applied to the database;
otherwise, the updates are discarded and the transaction is restarted.

The validation phase for Ti checks that, for each transaction Tj that is either committed or is in its validation phase, one of the following conditions holds:

 Transaction Tj completes its write phase before Ti starts its read phase.
 Ti starts its write phase after Tj completes its write phase, and the read_set of Ti has no items in common
with the write_set of Tj.
 Both the read_set and write_set of Ti have no items in common with the write_set of Tj, and Tj completes
its read phase before Ti completes its read phase.
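A sketch of this validation test in Python (illustrative only; the per-transaction fields such as start_read and end_write are bookkeeping the system is assumed to record, and their names are invented).

def validate(ti, others):
    """Checks Ti against each committed/validating Tj. Each transaction is a dict with
    read_set, write_set, and phase timestamps start_read, end_read, start_write, end_write."""
    for tj in others:
        if tj['end_write'] < ti['start_read']:
            continue                                           # condition 1
        if (tj['end_write'] < ti['start_write']
                and not (ti['read_set'] & tj['write_set'])):
            continue                                           # condition 2
        if (not (ti['read_set'] & tj['write_set'])
                and not (ti['write_set'] & tj['write_set'])
                and tj['end_read'] < ti['end_read']):
            continue                                           # condition 3
        return False                                           # no condition holds: abort Ti
    return True                                                # Ti may enter its write phase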

2.5 Granularity of Data Items and Multiple Granularity Locking


All concurrency control techniques assume that the database is formed of a number of named data
items. A database item could be chosen to be one of the following:
 A database record
 A field value of a database record
 A disk block
 A whole file
 The whole database
The granularity can affect the performance of concurrency control and recovery.


2.5.1 Granularity Level Considerations for Locking


The size of data items is often called the data item granularity. Fine granularity refers to small item sizes,
whereas coarse granularity refers to large item sizes. Several tradeoffs must be considered in choosing the
data item size.

First, notice that the larger the data item size is, the lower the degree of concurrency permitted.
On the other hand, the smaller the data item size is, the greater the number of items in the database. Because
every item is associated with a lock, the system will have a larger number of active locks to be handled by
the lock manager.
2.5.2 Multiple Granularity Level Locking
Since the best granularity size depends on the given transaction, it seems appropriate that a database system
should support multiple levels of granularity, where the granularity level can be different for various mixes
of transactions. Figure 22.7 shows a simple granularity hierarchy with a database containing two files, each
file containing several disk pages, and each page containing several records. This can be used to illustrate a
multiple granularity level 2PL protocol, where a lock can be requested at any level. However, additional
types of locks will be needed to support such a protocol efficiently.

Consider the following scenario, with only shared and exclusive lock types, that refers to the example in
Figure 22.7. Suppose transaction T1 wants to update all the records in file f1, and T1 requests and is granted an exclusive lock for f1. Then all of f1's pages (p11 through p1n)—and the records contained on those
pages—are locked in exclusive mode. This is beneficial for T1 because setting a single file-level lock is
more efficient than setting n page-level locks or having to lock each individual record. Now suppose
another transaction T2 only wants to read record r1nj from page p1n of file f1; then T2 would request a
shared record-level lock on r1nj. However, the database system (that is, the transaction manager or more
specifically the lock manager) must verify the compatibility of the requested lock with already held locks.
One way to verify this is to traverse the tree from the leaf r1nj to p1n to f1 to db. If at any time a conflicting
lock is held on any of those items, then the lock request for r1nj is denied and T2 is blocked and must wait.
This traversal would be fairly efficient.
To make multiple granularity level locking practical, additional types of locks, called intention locks, are
needed. The idea behind intention locks is for a transaction to indicate, along the path from the root to the
desired node, what type of lock (shared or exclusive) it will require from one of the node’s descendants.
There are three types of intention locks:
 Intention-shared (IS) indicates that one or more shared locks will be requested on some descendant node(s).

 Intention-exclusive (IX) indicates that one or more exclusive locks will be requested on some descendant node(s).

 Shared-intention-exclusive (SIX) indicates that the current node is locked in shared mode but that one or more exclusive locks will be requested on some descendant node(s).

The compatibility table of the three intention locks, and the shared and exclusive locks, is shown in Figure
22.8. Besides the introduction of the three types of intention locks, an appropriate locking protocol must be
used. The multiple granularity locking (MGL) protocol consists of the following rules:
1. The lock compatibility must be adhered to.
2. The root of the tree must be locked first, in any mode.
3. A node N can be locked by a transaction T in S or IS mode only if the parent of node N is already locked by transaction T in either IS or IX mode.
4. A node N can be locked by a transaction T in X, IX, or SIX mode only if the parent of node N is already locked by transaction T in either IX or SIX mode.
5. A transaction T can lock a node only if it has not unlocked any node (to enforce the 2PL protocol).
6. A transaction T can unlock a node, N, only if none of the children of node N are currently locked by
T.
Rule 1 simply states that conflicting locks cannot be granted. Rules 2, 3, and 4 state the conditions when a
transaction may lock a given node in any of the lock modes. Rules 5 and 6 of the MGL protocol enforce 2PL
rules to produce serializable schedules.
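Since Figure 22.8 is not reproduced in these notes, the sketch below encodes the standard compatibility matrix for the five modes (IS, IX, S, SIX, X) together with a small helper for rules 3 and 4; the dictionary layout and the helper name are invented for illustration.

# True if the row mode can be granted while another transaction holds the column mode.
MGL_COMPAT = {
    'IS':  {'IS': True,  'IX': True,  'S': True,  'SIX': True,  'X': False},
    'IX':  {'IS': True,  'IX': True,  'S': False, 'SIX': False, 'X': False},
    'S':   {'IS': True,  'IX': False, 'S': True,  'SIX': False, 'X': False},
    'SIX': {'IS': True,  'IX': False, 'S': False, 'SIX': False, 'X': False},
    'X':   {'IS': False, 'IX': False, 'S': False, 'SIX': False, 'X': False},
}

def parent_mode_allows(requested, parent_mode):
    """MGL rules 3 and 4: which mode the parent must already be locked in."""
    if requested in ('S', 'IS'):
        return parent_mode in ('IS', 'IX')          # rule 3
    if requested in ('X', 'IX', 'SIX'):
        return parent_mode in ('IX', 'SIX')         # rule 4
    return False

# T2 wanting a shared (S) lock on record r1nj first acquires IS on db, f1, and p1n.
print(parent_mode_allows('S', 'IS'))   # True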

Chapter 2
NOSQL Databases and Big Data Storage Systems

24.1 Introduction to NOSQL Systems


24.1.1 Emergence of NOSQL Systems
Many companies and organizations are faced with applications that store vast amounts of data. Consider a free
e-mail application, such as Google Mail or Yahoo Mail or other similar service—this application can have
millions of users, and each user can have thousands of e-mail messages. There is a need for a storage system
that can manage all these e-mails; a structured relational SQL system may not be appropriate because (1) SQL
systems offer too many services (powerful query language, concurrency control, etc.), which this application
may not need; and (2) a structured data model such as the traditional relational model may be too restrictive.
Although newer relational systems do have more complex object-relational modeling options (see Chapter 12),
they still require schemas, which are not required by many of the NOSQL systems. As another example,
consider an application such as Facebook, with millions of users who submit posts, many with images and
videos; then these posts must be displayed on pages of other users using the social media relationships among
the users. User profiles, user relationships, and posts must all be stored in a huge collection of data stores, and
the appropriate posts must be made available to the sets of users that have signed up to see these posts. Some of
the data for this type of application is not suitable for a traditional relational system and typically needs multiple
types of databases and data storage systems. Some of the organizations that were faced with these data
management and storage applications decided to develop their own systems:
■ Google developed a proprietary NOSQL system known as BigTable, which is used in many of Google’s
applications that require vast amounts of data storage, such as Gmail, Google Maps, and Web site indexing.
Apache Hbase is an open source NOSQL system based on similar concepts. Google’s innovation led to the
category of NOSQL systems known as column-based or wide column stores; they are also sometimes
referred to as column family stores.
■ Amazon developed a NOSQL system called DynamoDB that is available through Amazon’s cloud
services. This innovation led to the category known as key-value data stores or sometimes key-tuple or key-
object data stores.
■ Facebook developed a NOSQL system called Cassandra, which is now open source and known as
Apache Cassandra. This NOSQL system uses concepts from both key-value stores and column-based
systems.
■ Other software companies started developing their own solutions and making them available to users who
need these capabilities—for example, MongoDB and CouchDB, which are classified as document-based
NOSQL systems or document stores.
■ Another category of NOSQL systems is the graph-based NOSQL systems, or graph databases; these
include Neo4J and GraphBase, among others.
■ Some NOSQL systems, such as OrientDB, combine concepts from many of the categories discussed
above.
These are just a few examples of NOSQL systems that have been developed. There are many systems, and
listing all of them is beyond the scope of our presentation.

24.1.2 Characteristics of NOSQL Systems

1. Scalability: There are two kinds of scalability in distributed systems: horizontal and vertical. In NOSQL
systems, horizontal scalability is generally used, where the distributed system is expanded by adding more
nodes for data storage and processing as the volume of data grows. Vertical scalability, on the other hand,
refers to expanding the storage and computing power of existing nodes. In NOSQL systems, horizontal
scalability is employed while the system is operational, so techniques for distributing the existing data
among new nodes without interrupting system operation are necessary.
2. Availability, Replication and Eventual Consistency: Many applications that use NOSQL systems
require continuous system availability. To accomplish this, data is replicated over two or more nodes in a
transparent manner, so that if one node fails, the data is still available on other nodes. Replication improves
data availability and can also improve read performance, because read requests can often be serviced from
any of the replicated data nodes. However, write performance becomes more cumbersome because an
update must be applied to every copy of the replicated data items; this can slow down write performance if
serializable consistency is required. Many NOSQL applications do not require serializable consistency, so
more relaxed forms of consistency known as eventual consistency are used.
3. Replication Models: Two major replication models are used in NOSQL systems: master-slave and
master-master replication. Master-slave replication requires one copy to be the master copy; all write
operations must be applied to the master copy and then propagated to the slave copies, usually using
eventual consistency (the slave copies will eventually be the same as the master copy). For read, the master-
slave paradigm can be configured in various ways. One configuration requires all reads to also be at the
master copy, so this would be similar to the primary site or primary copy methods of distributed
concurrency control , with similar advantages and disadvantages. Another configuration would allow reads
at the slave copies but would not guarantee that the values are the latest writes, since writes to the slave
nodes can be done after they are applied to the master copy. The master-master replication allows reads and
writes at any of the replicas but may not guarantee that reads at nodes that store different copies see the
same values. Different users may write the same data item concurrently at different nodes of the system, so
the values of the item will be temporarily inconsistent. A reconciliation method to resolve conflicting write
operations of the same data item at different nodes must be implemented as part of the master-master
replication scheme.
4. Sharding of Files: In many NOSQL applications, files (or collections of data objects) can have many
millions of records (or documents or objects), and these records can be accessed concurrently by thousands
of users. So it is not practical to store the whole file in one node. Sharding of the file records is often
employed in NOSQL systems. This serves to distribute the load of accessing the file records to multiple
nodes. The combination of sharding the file records and replicating the shards works in tandem to improve
load balancing as well as data availability.
5. High-Performance Data Access: In many NOSQL applications, it is necessary to find individual
records or objects (data items) from among the millions of data records or objects in a file. To achieve this,
most systems use one of two techniques: hashing or range partitioning on object keys. The majority of
accesses to an object will be by providing the key value rather than by using complex query conditions. The
object key is similar to the concept of object id. In hashing, a hash function h(K) is applied to the key K,
and the location of the object with key K is determined by the value of h(K). In range partitioning, the
location is determined via a range of key values; for example, location i would hold the objects whose key
values K are in the range Kimin ≤ K ≤ Kimax. In applications that require range queries, where multiple
objects within a range of key values are retrieved, range partitioning is preferred. Other indexes can also be
used to locate objects based on attribute conditions different from the key K.
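As a small, system-neutral illustration (not from the notes), the sketch below assigns object keys to storage nodes either by hashing the key or by comparing it against range boundaries.

import hashlib

def hash_partition(key, num_nodes):
    """Node chosen by applying a hash function h(K) to the object key."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return h % num_nodes

def range_partition(key, boundaries):
    """boundaries[i] is the inclusive upper key bound of node i; keys are assumed comparable."""
    for node, upper in enumerate(boundaries):
        if key <= upper:
            return node
    return len(boundaries) - 1                 # the last node takes any remaining keys

print(hash_partition('W7', 4))                             # some node in 0..3
print(range_partition('K450', ['K299', 'K599', 'K999']))   # node 1 (keys above K299 up to K599)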
NOSQL characteristics related to data models and query languages.
NOSQL systems emphasize performance and flexibility over modeling power and complex querying. We
discuss some of these characteristics next.
1. Not Requiring a Schema: The flexibility of not requiring a schema is achieved in many NOSQL systems by allowing semi-structured, self-describing data. The users can specify a partial schema in some
systems to improve storage efficiency, but it is not required to have a schema in most of the NOSQL
systems. As there may not be a schema to specify constraints, any constraints on the data would have to be
programmed in the application programs that access the data items. There are various languages for
describing semistructured data, such as JSON (JavaScript Object Notation) and XML. JSON is used in
several NOSQL systems, but other methods for describing semi-structured data can also be used.
2. Less Powerful Query Languages: Many applications that use NOSQL systems may not require a
powerful query language such as SQL, because search (read) queries in these systems often locate single
objects in a single file based on their object keys. NOSQL systems typically provide a set of functions and
operations as a programming API (application programming interface), so reading and writing the data
objects is accomplished by calling the appropriate operations by the programmer. In many cases, the
operations are called CRUD operations, for Create, Read, Update, and Delete. In other cases, they are
known as SCRUD because of an added Search (or Find) operation. Some NOSQL systems also provide a
high-level query language, but it may not have the full power of SQL; only a subset of SQL querying
capabilities would be provided. In particular, many NOSQL systems do not provide join operations as part
of the query language itself; the joins need to be implemented in the application programs.
3. Versioning: Some NOSQL systems provide storage of multiple versions of the data items, with the
timestamps of when the data version was created. We will discuss this aspect in Section 24.5 when we
present column-based NOSQL systems. In the next section, we give an overview of the various categories
of NOSQL systems.

24.1.3 Categories of NOSQL Systems


NOSQL systems have been characterized into four major categories, with some additional categories that
encompass other types of systems. The most common categorization lists the following four major
categories:
1. Document-based NOSQL systems: These systems store data in the form of documents using well-
known formats, such as JSON (JavaScript Object Notation). Documents are accessible via their
document id, but can also be accessed rapidly using other indexes.
2. NOSQL key-value stores: These systems have a simple data model based on fast access by the key to
the value associated with the key; the value can be a record or an object or a document or even have a
more complex data structure.
3. Column-based or wide column NOSQL systems: These systems partition a table by column into
column families, where each column family is stored in its own files. They also allow versioning of data
values.
4. Graph-based NOSQL systems: Data is represented as graphs, and related nodes can be found by traversing the edges using path expressions.
Additional categories can be added as follows to include some systems that are not easily categorized into the above four categories, as well as some other types of systems that have been available even before the term NOSQL became widely used.
5. Hybrid NOSQL systems: These systems have characteristics from two or more of the above four
categories.
6. Object databases
7. XML databases

24.2 The CAP Theorem


When we discussed concurrency control in distributed databases in Section 23.3, we assumed that the
distributed database system (DDBS) is required to enforce the ACID properties (atomicity, consistency,
isolation, durability) of transactions that are running concurrently (see Section 20.3). In a system with
data replication, concurrency control becomes more complex because there can be multiple copies of
each data item. So if an update is applied to one copy of an item, it must be applied to all other copies in
a consistent manner. The possibility exists that one copy of an item X is updated by a transaction T1
whereas another copy is updated by a transaction T2, so two inconsistent copies of the same item exist
at two different nodes in the distributed system. If two other transactions T3 and T4 want to read X,
each may read a different copy of item X.
We saw in Section 23.3 that there are distributed concurrency control methods that do not allow this
inconsistency among copies of the same data item, thus enforcing serializability and hence the isolation
property in the presence of replication. However, these techniques often come with high overhead,
which would defeat the purpose of creating multiple copies to improve performance and availability in
distributed database systems such as NOSQL. In the field of distributed systems, there are various levels
of consistency among replicated data items, from weak consistency to strong consistency. Enforcing
serializability is considered the strongest form of consistency, but it has high overhead so it can reduce
performance of read and write operations and hence adversely affect system performance. The CAP
theorem, which was originally introduced as the CAP principle, can be used to explain some of the
competing requirements in a distributed system with replication. The three letters in CAP refer to three
desirable properties of distributed systems with replicated data: consistency (among replicated copies),
availability (of the system for read and write operations) and partition tolerance (in the face of the nodes
in the system being partitioned by a network fault). Availability means that each read or write request
for a data item will either be processed successfully or will receive a message that the operation cannot
be completed. Partition tolerance means that the system can continue operating if the network
connecting the nodes has a fault that results in two or more partitions, where the nodes in each partition
can only communicate among each other. Consistency means that the nodes will have the same copies
of a replicated data item visible for various transactions. It is important to note here that the use of the
word consistency in CAP and its use in ACID do not refer to the same identical concept. In CAP, the
term consistency refers to the consistency of the values in different copies of the same data item in a
replicated distributed system. In ACID, it refers to the fact that a transaction will not violate the
integrity constraints specified on the database schema. However, if we consider that the consistency of
replicated copies is a specified constraint, then the two uses of the term consistency would be related.
The CAP theorem states that it is not possible to guarantee all three of the desirable properties—
consistency, availability, and partition tolerance—at the same time in a distributed system with data
replication. If this is the case, then the distributed system designer would have to choose two properties
out of the three to guarantee. It is generally assumed that in many traditional (SQL) applications,
guaranteeing consistency through the ACID properties is important. On the other hand, in a NOSQL
distributed data store, a weaker consistency level is often acceptable, and guaranteeing the other two
properties (availability, partition tolerance) is important. Hence, weaker consistency levels are often
used in NOSQL systems instead of guaranteeing serializability. In particular, a form of consistency
known as eventual consistency is often adopted in NOSQL systems. In Sections 24.3 through 24.6, we
will discuss some of the consistency models used in specific NOSQL systems. The next four sections of
this chapter discuss the characteristics of the four main categories of NOSQL systems. We discuss
document-based NOSQL systems in Section 24.3, and we use MongoDB as a representative system. In Section 24.4, we discuss NOSQL key-value stores.
24.3 Document-Based NOSQL Systems and MongoDB
Document-based or document-oriented NOSQL systems typically store data as collections of similar
documents. These types of systems are also sometimes known as document stores. The individual
documents somewhat resemble complex objects or XML documents, but a major difference between
document-based systems versus object and object-relational systems and XML is that there is no
requirement to specify a schema—rather, the documents are specified as self-describing data. Although
the documents in a collection should be similar, they can have different data elements (attributes), and
new documents can have new data elements that do not exist in any of the current documents in the
collection. The system basically extracts the data element names from the self-describing documents in
the collection, and the user can request that the system create indexes on some of the data elements.
Documents can be specified in various formats, such as XML (see Chapter 13). A popular language to
specify documents in NOSQL systems is JSON (JavaScript Object Notation). There are many
document-based NOSQL systems, including MongoDB and CouchDB, among many others. We will
give an overview of MongoDB in this section. It is important to note that different systems can use
different models, languages, and implementation methods, but giving a complete survey of all
document-based NOSQL systems is beyond the scope of our presentation.
24.3.1 MongoDB Data Model
MongoDB documents are stored in BSON (Binary JSON) format, which is a variation of JSON with
some additional data types and is more efficient for storage than JSON. Individual documents are stored
in a collection. We will use a simple example based on our COMPANY database that we used
throughout this book. The operation createCollection is used to create each collection. For example, the
following command can be used to create a collection called project to hold PROJECT objects from the
COMPANY database:
db.createCollection("project", { capped : true, size : 1310720, max : 500 } )
The first parameter “project” is the name of the collection, which is followed by an optional document
that specifies collection options. In our example, the collection is capped; this means it has upper limits
on its storage space (size) and number of documents (max). The capping parameters help the system
choose the storage options for each collection. There are other collection options, but we will not
discuss them here.
For our example, we will create another document collection called worker to hold information about
the EMPLOYEEs who work on each project; for example:
db.createCollection("worker", { capped : true, size : 5242880, max : 2000 } )
Each document in a collection has a unique ObjectId field, called _id, which is automatically indexed in
the collection unless the user explicitly requests no index for the _id field. The value of ObjectId can be
specified by the user, or it can be system-generated if the user does not specify an _id field for a
particular document. System-generated ObjectIds have a specific format, which combines the
timestamp when the object is created (4 bytes, in an internal MongoDB format), the node id (3 bytes),
the process id (2 bytes), and a counter (3 bytes) into a 16-byte Id value. User-generated ObjectsIds can
have any value specified by the user as long as it uniquely identifies the document and so these Ids are
similar to primary keys in relational systems.
A collection does not have a schema. The structure of the data fields in documents is chosen based on
how documents will be accessed and used, and the user can choose a normalized design (similar to
normalized relational tuples) or a denormalized design (similar to XML documents or complex objects).
Interdocument references can be specified by storing in one document the ObjectId or ObjectIds of
other related documents. Figure 24.1(a) shows a simplified MongoDB document showing some of the
data from Figure 5.6 from the COMPANY database example that is used throughout the book. In our
example, the _id values are user-defined, and the documents whose _id starts with P (for project) will be
stored in the “project” collection, whereas those whose _id starts with W (for worker) will be stored in
the “worker” collection.
In Figure 24.1(a), the worker information is embedded in the project document, so there is no need for
the “worker” collection. This is known as the denormalized pattern, which is similar to creating a
complex object (see Chapter 12) or an XML document (see Chapter 13). A list of values that is enclosed
in square brackets [ … ] within a document represents a field whose value is an array.
Another option is to use the design in Figure 24.1(b), where worker references are embedded in the
project document, but the worker documents themselves are stored in a separate “worker” collection. A
third option in Figure 24.1(c) would use a normalized design, similar to First Normal Form relations
(see Section 14.3.4). The choice of which design option to use depends on how the data will be
accessed.
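Figure 24.1 itself is not reproduced in these notes; the hypothetical documents below, written as Python dictionaries with field names and values only loosely based on the COMPANY example, contrast the denormalized design of Figure 24.1(a) with the reference-based design of Figure 24.1(b).

# Denormalized design (Figure 24.1(a) style): worker information is embedded in the
# project document as an array, so no separate "worker" collection is needed.
project_embedded = {
    "_id": "P1",
    "Pname": "ProductX",
    "Plocation": "Bellaire",
    "Workers": [
        {"Ename": "John Smith", "Hours": 32.5},
        {"Ename": "Joyce English", "Hours": 20.0},
    ],
}

# Referenced design (Figure 24.1(b) style): the project stores only the _id values of
# worker documents, which live in the separate "worker" collection.
project_with_refs = {"_id": "P1", "Pname": "ProductX", "WorkerIds": ["W1", "W2"]}
worker_documents = [
    {"_id": "W1", "Ename": "John Smith", "ProjectId": "P1", "Hours": 32.5},
    {"_id": "W2", "Ename": "Joyce English", "ProjectId": "P1", "Hours": 20.0},
]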

It is important to note that the simple design in Figure 24.1(c) is not the general normalized design for a
many-to-many relationship, such as the one between employees and projects; rather, we would need three
collections for “project”, “employee”, and “works_on”, as we discussed in detail in Section 9.1. Many of the
design tradeoffs that were discussed in Chapters 9 and 14 (for first normal form relations and for ER-to-
relational mapping options), and Chapters 12 and 13 (for complex objects and XML) are applicable for
choosing the appropriate design for document structures and document collections, so we will not repeat the
discussions here. In the design in Figure 24.1(c), an EMPLOYEE who works on several projects would be
represented by multiple worker documents with different _id values; each document would represent the
employee as worker for a particular project. This is similar to the design decisions for XML schema design
(see Section 13.6). However, it is again important to note that the typical document-based system does not
have a schema, so the design rules would have to be followed whenever individual documents are inserted
into a collection.


24.3.2 MongoDB CRUD Operations


MongoDB has several CRUD operations, where CRUD stands for create, read, update, and delete. Documents
can be created and inserted into their collections using the insert operation, whose format is:

db.<collection_name>.insert(<document(s)>)

The parameters of the insert operation can include either a single document or an array of documents, as
shown in Figure 24.1(d). The delete operation is called remove, and the format is:

db.<collection_name>.remove(<condition>)

The documents to be removed from the collection are specified by a Boolean condition on some of the fields
in the collection documents. There is also an update operation, which has a condition to select certain
documents, and a $set clause to specify the update. It is also possible to use the update operation to replace an
existing document with another one but keep the same ObjectId.

For read queries, the main command is called find, and the format is:

db.<collection_name>.find(<condition>)

General Boolean conditions can be specified as <condition>, and the documents in the collection that satisfy the condition are selected for the query result. For a full discussion of the MongoDB CRUD operations, see the MongoDB online documentation in the chapter references.
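The shell operations above have direct counterparts in the language drivers. The following sketch uses the Python driver (pymongo) and assumes a MongoDB server running on localhost; the database name and field names are chosen only for illustration.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")        # assumes a local mongod instance
db = client["company"]

# Create a capped collection, mirroring the shell example for "project".
if "project" not in db.list_collection_names():
    db.create_collection("project", capped=True, size=1310720, max=500)

# Create: insert a single document (insert_many accepts an array of documents).
db.project.insert_one({"_id": "P1", "Pname": "ProductX", "Plocation": "Bellaire"})

# Read: find the documents that satisfy a Boolean condition.
for doc in db.project.find({"Plocation": "Bellaire"}):
    print(doc)

# Update: select by condition and apply a $set clause.
db.project.update_one({"_id": "P1"}, {"$set": {"Plocation": "Houston"}})

# Delete: remove the documents that satisfy the condition.
db.project.delete_many({"Pname": "ProductX"})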

24.3.3 MongoDB Distributed Systems Characteristics


Most MongoDB updates are atomic if they refer to a single document, but MongoDB also provides a pattern
for specifying transactions on multiple documents. Since MongoDB is a distributed system, the two-phase
commit method is used to ensure atomicity and consistency of multidocument transactions. We discussed the
atomicity and consistency properties of transactions in Section 20.3, and the two-phase commit protocol in
Section 22.6.

Replication in MongoDB. The concept of replica set is used in MongoDB to create multiple copies of the
same data set on different nodes in the distributed system, and it uses a variation of the master-slave approach
for replication. For example, suppose that we want to replicate a particular document collection C. A replica
set will have one primary copy of the collection C stored in one node N1, and at least one secondary copy
(replica) of C stored at another node N2. Additional copies can be stored in nodes N3, N4, etc., as needed, but
the cost of storage and update (write) increases with the number of replicas. The total number of participants
in a replica set must be at least three, so if only one secondary copy is needed, a participant in the replica set
known as an arbiter must run on the third node N3. The arbiter does not hold a replica of the collection but
participates in elections to choose a new primary if the node storing the current primary copy fails. If the total
number of members in a replica set is n (one primary plus i secondaries, for a total of n = i + 1), then n must
be an odd number; if it is not, an arbiter is added to ensure the election process works correctly if the primary
fails. We discussed elections in distributed systems in Section 23.3.1. In MongoDB replication, all write
operations must be applied to the primary copy and then propagated to the secondaries. For read operations,
the user can choose the particular read preference for their application. The default read preference processes
all reads at the primary copy, so all read and write operations are performed at the primary node. In this case,
secondary copies are mainly to make sure that the system continues operation if the primary fails, and
MongoDB can ensure that every read request gets the latest document value. To increase read performance, it
is possible to set the read preference so that read requests can be processed at any replica (primary or
secondary); however, a read at a secondary is not guaranteed to get the latest version of a document because
there can be a delay in propagating writes from the primary to the secondaries.
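
As a rough illustration (the host names and the collection used here are invented), a replica set with one primary and two secondaries can be initiated from the MongoDB shell, and a read preference can be attached to a query so that it may be served by a secondary copy:

rs.initiate( { _id: "rs0", members: [ { _id: 0, host: "node1:27017" },
                                      { _id: 1, host: "node2:27017" },
                                      { _id: 2, host: "node3:27017" } ] } )
db.project.find( { Plocation: "Bellaire" } ).readPref("secondaryPreferred")

With the default read preference ("primary"), the find would be processed at the primary node and would see the latest document values; with "secondaryPreferred", it may be processed at a secondary and can return slightly stale values, as explained above.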

Sharding in MongoDB. When a collection holds a very large number of documents or requires a large
storage space, storing all the documents in one node can lead to performance problems, particularly if there
are many user operations accessing the documents concurrently using various CRUD operations. Sharding of
the documents in the collection—also known as horizontal partitioning— divides the documents into disjoint
partitions known as shards. This allows the system to add more nodes as needed by a process known as
horizontal scaling of the distributed system (see Section 23.1.4), and to store the shards of the collection on
different nodes to achieve load balancing. Each node will process only those operations pertaining to the
documents in the shard stored at that node. Also, each shard will contain fewer documents than if the entire
collection were stored at one node, thus further improving performance.

There are two ways to partition a collection into shards in MongoDB—range partitioning and hash
partitioning. Both require that the user specify a particular document field to be used as the basis for
partitioning the documents into shards. The partitioning field—known as the shard key in MongoDB—must
have two characteristics: it must exist in every document in the collection, and it must have an index. The
ObjectId can be used, but any other field possessing these two characteristics can also be used as the basis for
sharding. The values of the shard key are divided into chunks either through range partitioning or hash
partitioning, and the documents are partitioned based on the chunks of shard key values.

Range partitioning creates the chunks by specifying a range of key values; for example, if the shard key
values ranged from one to ten million, it is possible to create ten ranges—1 to 1,000,000; 1,000,001 to
2,000,000; … ; 9,000,001 to 10,000,000—and each chunk would contain the key values in one range. Hash
partitioning applies a hash function h(K) to each shard key K, and the partitioning of keys into chunks is
based on the hash values (we discussed hashing and its advantages and disadvantages in Section 16.8). In
general, if range queries are commonly applied to a collection (for example, retrieving all documents whose
shard key value is between 200 and 400), then range partitioning is preferred because each range query will
typically be submitted to a single node that contains all the required documents in one shard. If most searches
retrieve one document at a time, hash partitioning may be preferable because it randomizes the distribution of
shard key values into chunks.
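
In the MongoDB shell, sharding is enabled per database and then per collection by supplying the shard key; the commands below are a hedged sketch with an invented database name, showing one collection sharded by range and another by hash:

sh.enableSharding("companydb")
sh.shardCollection("companydb.project", { Pno: 1 } )        // range partitioning on the Pno field
sh.shardCollection("companydb.worker", { _id: "hashed" } )  // hash partitioning on the _id field

A range-sharded collection keeps documents with nearby shard key values in the same chunk, which favors range queries, whereas the "hashed" option spreads consecutive key values across chunks, which favors single-document lookups, as discussed above.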

When sharding is used, MongoDB queries are submitted to a module called the query router, which keeps
track of which nodes contain which shards based on the particular partitioning method used on the shard keys.
The query (CRUD operation) will be routed to the nodes that contain the shards that hold the documents that
the query is requesting. If the system cannot determine which shards hold the required documents, the query
will be submitted to all the nodes that hold shards of the collection. Sharding and replication are used
together; sharding focuses on improving performance via load balancing and horizontal scalability, whereas
replication focuses on ensuring system availability when certain nodes fail in the distributed system.

There are many additional details about the distributed system architecture and components of MongoDB, but
a full discussion is outside the scope of our presentation. MongoDB also provides many other services in
areas such as system administration, indexing, security, and data aggregation, but we will not discuss these
features here. Full documentation of MongoDB is available online (see the bibliographic notes).

24.4 NOSQL Key-Value Stores
Key-value stores focus on high performance, availability, and scalability by storing data in a distributed
storage system. The data model used in key-value stores is relatively simple, and in many of these systems,
there is no query language but rather a set of operations that can be used by the application programmers. The
key is a unique identifier associated with a data item and is used to locate this data item rapidly. The value is
the data item itself, and it can have very different formats for different key-value storage systems. In some
cases, the value is just a string of bytes or an array of bytes, and the application using the key-value store has
to interpret the structure of the data value. In other cases, some standard formatted data is allowed; for
example, structured data rows (tuples) similar to relational data, or semistructured data using JSON or some
other self-describing data format. Different key-value stores can thus store unstructured, semistructured, or
structured data items (see Section 13.1). The main characteristic of key-value stores is the fact that every
value (data item) must be associated with a unique key, and that retrieving the value by supplying the key
must be very fast. There are many systems that fall under the key-value store label, so rather than provide a
lot of details on one particular system, we will give a brief introductory overview for some of these systems
and their characteristics.

24.4.1 DynamoDB Overview


The DynamoDB system is an Amazon product and is available as part of Amazon’s AWS/SDK platforms
(Amazon Web Services/Software Development Kit). It can be used as part of Amazon’s cloud computing
services, for the data storage component.

DynamoDB data model. The basic data model in DynamoDB uses the concepts of tables, items, and
attributes. A table in DynamoDB does not have a schema; it holds a collection of self-describing items. Each
item will consist of a number of (attribute, value) pairs, and attribute values can be single-valued or
multivalued. So basically, a table will hold a collection of items, and each item is a self-describing record (or
object). DynamoDB also allows the user to specify the items in JSON format, and the system will convert
them to the internal storage format of DynamoDB.

When a table is created, it is required to specify a table name and a primary key; the primary key will be used
to rapidly locate the items in the table. Thus, the primary key is the key and the item is the value for the
DynamoDB key-value store. The primary key attribute must exist in every item in the table. The primary key
can be one of the following two types:

■ A single attribute. The DynamoDB system will use this attribute to build a hash index on the items in the
table. This is called a hash type primary key. The items are not ordered in storage by the value of the hash attribute.

■ A pair of attributes. This is called a hash and range type primary key. The primary key will be a pair of
attributes (A, B): attribute A will be used for hashing, and because there will be multiple items with the same
value of A, the B values will be used for ordering the records with the same A value. A table with this type of
key can have additional secondary indexes defined on its attributes. For example, if we want to store multiple
versions of some type of items in a table, we could use ItemID as hash and Date or Timestamp (when the
version was created) as range in a hash and range type primary key.
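
As a hedged sketch of table creation with a hash and range type primary key, the following uses the Python boto3 interface to DynamoDB; the table name, attribute names, and throughput values are all invented for illustration:

import boto3

# Illustrative only: region and credentials are assumed to be configured.
client = boto3.client("dynamodb", region_name="us-east-1")

client.create_table(
    TableName="ItemVersions",
    KeySchema=[
        {"AttributeName": "ItemID", "KeyType": "HASH"},        # hash (partition) attribute
        {"AttributeName": "Timestamp", "KeyType": "RANGE"},     # range (sort) attribute
    ],
    AttributeDefinitions=[
        {"AttributeName": "ItemID", "AttributeType": "S"},      # S = string
        {"AttributeName": "Timestamp", "AttributeType": "N"},   # N = number
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)

Here ItemID is hashed to locate the partition, and multiple versions of the same item are ordered within the partition by their Timestamp values, matching the hash and range example described above.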

DynamoDB Distributed Characteristics. Because DynamoDB is proprietary, in the next subsection we will
discuss the mechanisms used for replication, sharding, and other distributed system concepts in an open
source key-value system called Voldemort. Voldemort is based on many of the techniques proposed for
DynamoDB.

24.4.2 Voldemort Key-Value Distributed Data Store


Voldemort is an open source system available through Apache 2.0 open source licensing rules. It is based on
Amazon’s DynamoDB. The focus is on high performance and horizontal scalability, as well as on providing
replication for high availability and sharding for improving latency (response time) of read and write requests.
All three of those features—replication, sharding, and horizontal scalability—are realized through a technique
to distribute the key-value pairs among the nodes of a distributed cluster; this distribution is known as
consistent hashing. Voldemort has been used by LinkedIn for data storage. Some of the features of Voldemort
are as follows:

■ Simple basic operations. A collection of (key, value) pairs is kept in a Voldemort store. In our discussion,
we will assume the store is called s. The basic interface for data storage and retrieval is very simple and
includes three operations: get, put, and delete. The operation s.put(k, v) inserts an item as a key-value pair
with key k and value v. The operation s.delete(k) deletes the item whose key is k from the store, and the
operation v = s.get(k) retrieves the value v associated with key k. The application can use these basic
operations to build its own requirements. At the basic storage level, both keys and values are arrays of bytes
(strings).

■ High-level formatted data values. The values v in the (k, v) items can be specified in JSON (JavaScript
Object Notation), and the system will convert between JSON and the internal storage format. Other data object formats can also be specified if the application provides the conversion (also known as serialization)
between the user format and the storage format as a Serializer class. The Serializer class must be provided by
the user and will include operations to convert the user format into a string of bytes for storage as a value, and
to convert back a string (array of bytes) retrieved via s.get(k) into the user format. Voldemort has some built-
in serializers for formats other than JSON.

■ Consistent hashing for distributing (key, value) pairs. A variation of the data distribution algorithm
known as consistent hashing is used in Voldemort for data distribution among the nodes in the distributed
cluster of nodes. A hash function h(k) is applied to the key k of each (k, v) pair, and h(k) determines where
the item will be stored. The method assumes that h(k) is an integer value, usually in the range 0 to Hmax = 2^n − 1, where n is chosen based on the desired range for the hash values. This method is best visualized by
considering the range of all possible integer hash values 0 to Hmax to be evenly distributed on a circle (or
ring). The nodes in the distributed system are then also located on the same ring; usually each node will have
several locations on the ring (see Figure 24.2). The positioning of the points on the ring that represent the
nodes is done in a pseudorandom manner.

An item (k, v) will be stored on the node whose position in the ring follows the position of h(k) on the ring in
a clockwise direction. In Figure 24.2(a), we assume there are three nodes in the distributed cluster labeled A,
B, and C, where node C has a bigger capacity than nodes A and B. In a typical system, there will be many
more nodes. On the circle, two instances each of A and B are placed, and three instances of C (because of its
higher capacity), in a pseudorandom manner to cover the circle. Figure 24.2(a) indicates which (k, v) items
are placed in which nodes based on the h(k) values.

■ The h(k) values that fall in the parts of the circle marked as range 1 in Figure 24.2(a) will have their (k, v)
items stored in node A because that is the node whose label follows h(k) on the ring in a clockwise direction;
those in range 2 are stored in node B; and those in range 3 are stored in node C. This scheme allows
horizontal scalability because when a new node is added to the distributed system, it can be added in one or
more locations on the ring depending on the node capacity. Only a limited percentage of the (k, v) items will
be reassigned to the new node from the existing nodes based on the consistent hashing placement algorithm.
Also, those items assigned to the new node may not all come from only one of the existing nodes because the
new node can have multiple locations on the ring. For example, if a node D is added and it has two
placements on the ring as shown in Figure 24.2(b), then some of the items from nodes B and C would be
moved to node D. The items whose keys hash to range 4 on the circle (see Figure 24.2(b)) would be migrated
to node D. This scheme also allows replication by placing the number of specified replicas of an item on successive nodes on the ring in a clockwise direction. The sharding is built into the method, and different
items in the store (file) are located on different nodes in the distributed cluster, which means the items are
horizontally partitioned (sharded) among the nodes in the distributed system. When a node fails, its load of
data items can be distributed to the other existing nodes whose labels follow the labels of the failed node in
the ring. And nodes with higher capacity can have more locations on the ring, as illustrated by node C in
Figure 24.2(a), and thus store more items than smaller-capacity nodes.

■ Consistency and versioning. Voldemort uses a method similar to the one developed for DynamoDB for
consistency in the presence of replicas. Basically, concurrent write operations are allowed by different
processes so there could exist two or more different values associated with the same key at different nodes
when items are replicated. Consistency is achieved when the item is read by using a technique known as
versioning and read repair. Concurrent writes are allowed, but each write is associated with a vector clock
value. When a read occurs, it is possible that different versions of the same value (associated with the same
key) are read from different nodes. If the system can reconcile to a single final value, it will pass that value to
the read; otherwise, more than one version can be passed back to the application, which reconciles the various versions into a single version based on the application semantics and writes this reconciled value back to the nodes (this write-back is the read repair step).
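
The two mechanisms just described—consistent hashing and versioning—can be illustrated with short sketches. First, a minimal Python sketch of consistent hashing with virtual nodes; the hash function, node names, and capacities are arbitrary choices for illustration, and a real system such as Voldemort adds replication and many refinements:

import hashlib
from bisect import bisect_right

HMAX = 2**32

def h(key):
    # Map an arbitrary string key to an integer position on the ring [0, HMAX).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % HMAX

class ConsistentHashRing:
    def __init__(self, capacities):
        # capacities: e.g., {"A": 2, "B": 2, "C": 3}; each node gets as many
        # positions (virtual nodes) on the ring as its capacity.
        self.ring = sorted(
            (h(node + "#" + str(i)), node)
            for node, cap in capacities.items()
            for i in range(cap)
        )
        self.positions = [pos for pos, _ in self.ring]

    def node_for(self, key):
        # An item is stored on the node whose position follows h(key) on the
        # ring in a clockwise direction, wrapping around at the end.
        idx = bisect_right(self.positions, h(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing({"A": 2, "B": 2, "C": 3})
print(ring.node_for("worker:1234"))

Adding a new node to the capacities dictionary only reassigns the keys whose hash values fall between the new node's positions and their predecessors on the ring, which is the limited-migration property described above. Second, a simplified sketch of the vector clock comparison that underlies versioning and read repair (real implementations also carry timestamps and prune old entries); two versions whose clocks do not dominate each other are concurrent and must be reconciled by the application:

def descends(vc_a, vc_b):
    # True if clock vc_a has seen every write that clock vc_b has seen.
    return all(vc_a.get(node, 0) >= count for node, count in vc_b.items())

def compare(vc_a, vc_b):
    a_over_b = descends(vc_a, vc_b)
    b_over_a = descends(vc_b, vc_a)
    if a_over_b and not b_over_a:
        return "first version supersedes second"
    if b_over_a and not a_over_b:
        return "second version supersedes first"
    if a_over_b and b_over_a:
        return "versions are identical"
    return "concurrent versions: application must reconcile"

# Writes accepted at different replicas produce concurrent versions:
print(compare({"N1": 2, "N2": 1}, {"N1": 1, "N2": 2}))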

24.4.3 Examples of Other Key-Value Stores


In this section, we briefly review three other key-value stores. It is important to note that there are many
systems that can be classified in this category, and we can only mention a few of these systems.

Oracle key-value store. Oracle has one of the well-known SQL relational database systems, and Oracle also
offers a system based on the key-value store concept; this system is called the Oracle NoSQL Database.

Redis key-value cache and store. Redis differs from the other systems discussed here because it caches its
data in main memory to further improve performance. It offers master-slave replication and high availability,
and it also offers persistence by backing up the cache to disk.

Apache Cassandra. Cassandra is a NOSQL system that is not easily categorized into one category; it is
sometimes listed in the column-based NOSQL category (see Section 24.5) or in the key-value category. It offers features from several NOSQL categories and is used by Facebook as well as many other customers.

24.5 Column-Based or Wide Column NOSQL Systems


Another category of NOSQL systems is known as column-based or wide column systems. The Google
distributed storage system for big data, known as BigTable, is a well-known example of this class of NOSQL
systems, and it is used in many Google applications that require large amounts of data storage, such as Gmail.
BigTable uses the Google File System (GFS) for data storage and distribution. An open source system known
as Apache Hbase is somewhat similar to Google BigTable, but it typically uses HDFS (Hadoop Distributed
File System) for data storage. HDFS is used in many cloud computing applications, as we shall discuss in
Chapter 25. Hbase can also use Amazon’s Simple Storage System (known as S3) for data storage. Another

Sahana M 35
MODULE –V NOTES
well-known example of column-based NOSQL systems is Cassandra, which we discussed briefly in Section
24.4.3 because it can also be characterized as a key-value store. We will focus on Hbase in this section as an
example of this category of NOSQL systems.

BigTable (and Hbase) is sometimes described as a sparse multidimensional distributed persistent sorted map,
where the word map means a collection of (key, value) pairs (the key is mapped to the value). One of the
main differences that distinguish column-based systems from key-value stores (see Section 24.4) is the nature
of the key. In column-based systems such as Hbase, the key is multidimensional and so has several
components: typically, a combination of table name, row key, column, and timestamp. As we shall see, the
column is typically composed of two components: column family and column qualifier. We discuss these
concepts in more detail next as they are realized in Apache Hbase.

24.5.1 Hbase Data Model and Versioning


Hbase data model. The data model in Hbase organizes data using the concepts of namespaces, tables, column
families, column qualifiers, columns, rows, and data cells. A column is identified by a combination of
(column family:column qualifier). Data is stored in a self-describing form by associating columns with data
values, where data values are strings. Hbase also stores multiple versions of a data item, with a timestamp
associated with each version, so versions and timestamps are part of the Hbase data model (this is similar to
the concept of attribute versioning in temporal databases, which we shall discuss in Section 26.2). As with
other NOSQL systems, unique keys are associated with stored data items for fast access, but the keys identify
cells in the storage system. Because the focus is on high performance when storing huge amounts of data, the
data model includes some storage-related concepts. We discuss the Hbase data modeling concepts and define
the terminology next. It is important to note that the use of the words table, row, and column is not identical
to their use in relational databases, but the uses are related.

■ Tables and Rows. Data in Hbase is stored in tables, and each table has a table name. Data in a table is
stored as self-describing rows. Each row has a unique row key, and row keys are strings that must have the
property that they can be lexicographically ordered, so characters that do not have a lexicographic order in the
character set cannot be used as part of a row key.

■ Column Families, Column Qualifiers, and Columns. A table is associated with one or more column
families. Each column family will have a name, and the column families associated with a table must be
specified when the table is created and cannot be changed later. Figure 24.3(a) shows how a table may be
created; the table name is followed by the names of the column families associated with the table. When the data is loaded into a table, each column family can be associated with many column qualifiers, but the column
qualifiers are not specified as part of creating a table. So the column qualifiers make the model a self-
describing data model because the qualifiers can be dynamically specified as new rows are created and
inserted into the table. A column is specified by a combination of ColumnFamily:ColumnQualifier. Basically,
column families are a way of grouping together related columns (attributes in relational terminology) for
storage purposes, except that the column qualifier names are not specified during table creation. Rather, they
are specified when the data is created and stored in rows, so the data is self-describing since any column
qualifier name can be used in a new row of data (see Figure 24.3(b)). However, it is important that the
application programmers know which column qualifiers belong to each column family, even though they
have the flexibility to create new column qualifiers on the fly when new data rows are created. The concept of
column family is somewhat similar to vertical partitioning (see Section 23.2), because columns (attributes)
that are accessed together because they belong to the same column family are stored in the same files. Each
column family of a table is stored in its own files using the HDFS file system.

■ Versions and Timestamps. Hbase can keep several versions of a data item, along with the timestamp
associated with each version. The timestamp is a long integer number that represents the system time when
the version was created, so newer versions have larger timestamp values. Hbase uses midnight ‘January 1,
1970 UTC’ as timestamp value zero, and uses a long integer that measures the number of milliseconds since
that time as the system timestamp value (this is similar to the value returned by the Java utility
java.util.Date.getTime() and is also used in MongoDB). It is also possible for the user to define the timestamp
value explicitly in a Date format rather than using the system-generated timestamp.

■ Cells. A cell holds a basic data item in Hbase. The key (address) of a cell is specified by a combination of
(table, rowid, columnfamily, columnqualifier, timestamp). If timestamp is left out, the latest version of the
item is retrieved unless a default number of versions is specified, say the latest three versions. The default
number of versions to be retrieved, as well as the default number of versions that the system needs to keep,
are parameters that can be specified during table creation.

■ Namespaces. A namespace is a collection of tables. A namespace basically specifies a collection of one or more tables that are typically used together by user applications, and it corresponds to a database that contains
a collection of tables in relational terminology.


24.5.2 Hbase CRUD Operations


Hbase has low-level CRUD (create, read, update, delete) operations, as in many of the NOSQL systems. The
formats of some of the basic CRUD operations in Hbase are shown in Figure 24.3(c). Hbase only provides
low-level CRUD operations. It is the responsibility of the application programs to implement more complex
operations, such as joins between rows in different tables. The create operation creates a new table and
specifies one or more column families associated with that table, but it does not specify the column qualifiers,
as we discussed earlier. The put operation is used for inserting new data or new versions of existing data
items. The get operation is for retrieving the data associated with a single row in a table, and the scan
operation retrieves all the rows.
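
Although Figure 24.3(c) gives the exact formats, the HBase shell equivalents look roughly as follows; the table, column family, column qualifier, and value names below are invented for illustration:

create 'employee', 'personal', 'work'
put 'employee', 'row-101', 'personal:Fname', 'John'
put 'employee', 'row-101', 'work:Dno', '5'
get 'employee', 'row-101'
get 'employee', 'row-101', {COLUMN => 'work:Dno', VERSIONS => 3}
scan 'employee'

The create command names the table and its column families; each put supplies a row key, a column (family:qualifier), and a string value; get retrieves one row (optionally restricted to certain columns and a number of versions); and scan retrieves all the rows of the table.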

24.5.3 Hbase Storage and Distributed System Concepts
Each Hbase table is divided into a number of regions, where each region will hold a range of the row keys in
the table; this is why the row keys must be lexicographically ordered. Each region will have a number of
stores, where each column family is assigned to one store within the region. Regions are assigned to region
servers (storage nodes) for storage. A master server (master node) is responsible for monitoring the region
servers and for splitting a table into regions and assigning regions to region servers.

Hbase uses the Apache Zookeeper open source system for services related to managing the naming,
distribution, and synchronization of the Hbase data on the distributed Hbase server nodes, as well as for
coordination and replication services. Hbase also uses Apache HDFS (Hadoop Distributed File System) for
distributed file services. So Hbase is built on top of both HDFS and Zookeeper. Zookeeper can itself have
several replicas on several nodes for availability, and it keeps the data it needs in main memory to speed
access to the master servers and region servers. We will not cover the many additional details about the
distributed system architecture and components of Hbase; a full discussion is outside the scope of our
presentation. Full documentation of Hbase is available online (see the bibliographic notes).

24.6 NOSQL Graph Databases and Neo4j


Another category of NOSQL systems is known as graph databases or graphoriented NOSQL systems. The
data is represented as a graph, which is a collection of vertices (nodes) and edges. Both nodes and edges can
be labeled to indicate the types of entities and relationships they represent, and it is generally possible to store
data associated with both individual nodes and individual edges. Many systems can be categorized as graph
databases. We will focus our discussion on one particular system, Neo4j, which is used in many applications.
Neo4j is an open source system, and it is implemented in Java.

24.6.1 Neo4j Data Model


The data model in Neo4j organizes data using the concepts of nodes and relationships. Both nodes and
relationships can have properties, which store the data items associated with nodes and relationships. Nodes
can have labels; the nodes that have the same label are grouped into a collection that identifies a subset of the
nodes in the database graph for querying purposes. A node can have zero, one, or several labels.
Relationships are directed; each relationship has a start node and end node as well as a relationship type,
which serves a similar role to a node label by identifying similar relationships that have the same relationship type. Properties can be specified via a map pattern, which is made of one or more “name : value” pairs
enclosed in curly brackets; for example {Lname : ‘Smith’, Fname : ‘John’, Minit : ‘B’}. In conventional
graph theory, nodes and relationships are generally called vertices and edges. The Neo4j graph data model
somewhat resembles how data is represented in the ER and EER models (see Chapters 3 and 4), but with
some notable differences. Comparing the Neo4j graph model with ER/EER concepts, nodes correspond to
entities, node labels correspond to entity types and subclasses, relationships correspond to relationship
instances, relationship types correspond to relationship types, and properties correspond to attributes. One
notable difference is that a relationship is directed in Neo4j, but is not in ER/EER. Another is that a node may
have no label in Neo4j, which is not allowed in ER/EER because every entity must belong to an entity type. A
third crucial difference is that the graph model of Neo4j is used as a basis for an actual high-performance
distributed database system whereas the ER/EER model is mainly used for database design. Figure 24.4(a)
shows how a few nodes can be created in Neo4j. There are various ways in which nodes and relationships can
be created; for example, by calling appropriate Neo4j operations from various Neo4j APIs. We will just show
the high-level syntax for creating nodes and relationships; to do so, we will use the Neo4j CREATE
command, which is part of the high-level declarative query language Cypher. Neo4j has many options and
variations for creating nodes and relationships using various scripting interfaces, but a full discussion is
outside the scope of our presentation.

■ Labels and properties. When a node is created, the node label can be specified. It is also possible to create
nodes without any labels. In Figure 24.4(a), the node labels are EMPLOYEE, DEPARTMENT, PROJECT,
and LOCATION, and the created nodes correspond to some of the data from the COMPANY database in
Figure 5.6 with a few modifications; for example, we use EmpId instead of SSN, and we only include a small
subset of the data for illustration purposes. Properties are enclosed in curly brackets { … }. It is possible that
some nodes have multiple labels; for example the same node can be labeled as PERSON and EMPLOYEE
and MANAGER by listing all the labels separated by the colon symbol as follows:
PERSON:EMPLOYEE:MANAGER. Having multiple labels is similar to an entity belonging to an entity type
(PERSON) plus some subclasses of PERSON (namely EMPLOYEE and MANAGER) in the EER model (see
Chapter 4) but can also be used for other purposes.

■ Relationships and relationship types. Figure 24.4(b) shows a few example relationships in Neo4j based
on the COMPANY database in Figure 5.6. The → specifies the direction of the relationship, but the
relationship can be traversed in either direction. The relationship types (labels) in Figure 24.4(b) are
WorksFor, Manager, LocatedIn, and WorksOn; only relationships with the relationship type WorksOn have properties (Hours) in Figure 24.4(b).

■ Paths. A path specifies a traversal of part of the graph. It is typically used as part of a query to specify a
pattern, where the query will retrieve from the graph data that matches the pattern. A path is typically
specified by a start node, followed by one or more relationships, leading to one or more end nodes that satisfy
the pattern. It is somewhat similar to the concepts of path expressions that we discussed in Chapters 12 and 13
in the context of query languages for object databases (OQL) and XML (XPath and XQuery).

■ Optional Schema. A schema is optional in Neo4j. Graphs can be created and used without a schema, but
in Neo4j version 2.0, a few schema-related functions were added. The main features related to schema
creation involve creating indexes and constraints based on the labels and properties. For example, it is
possible to create the equivalent of a key constraint on a property of a label, so all nodes in the collection of
nodes associated with the label must have unique values for that property.

■ Indexing and node identifiers. When a node is created, the Neo4j system creates an internal unique
system-defined identifier for each node. To retrieve individual nodes using other properties of the nodes
efficiently, the user can create indexes for the collection of nodes that have a particular label. Typically, one
or more of the properties of the nodes in that collection can be indexed. For example, Empid can be used to
index nodes with the EMPLOYEE label, Dno to index the nodes with the DEPARTMENT label, and Pno to
index the nodes with the PROJECT label.
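
Because Figures 24.4(a) and (b) are not reproduced in these notes, the following Cypher statements are a hedged sketch of the concepts just listed—labeled nodes with properties, a directed typed relationship, an index, and a uniqueness (key-like) constraint; the property values are invented, and the index/constraint syntax shown is the older Neo4j 2.x style:

CREATE (e:EMPLOYEE {Empid: 1, Fname: 'John', Lname: 'Smith'}),
       (d:DEPARTMENT {Dno: 5, Dname: 'Research'}),
       (e)-[:WorksFor]->(d)

CREATE INDEX ON :EMPLOYEE(Lname)

CREATE CONSTRAINT ON (emp:EMPLOYEE) ASSERT emp.Empid IS UNIQUE

The first statement creates two labeled nodes and a WorksFor relationship from the employee node to the department node in a single pattern; the second and third statements add an index on Lname and enforce that Empid values are unique among EMPLOYEE nodes, as described under Optional Schema and Indexing above.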

24.6.2 The Cypher Query Language of Neo4j


Neo4j has a high-level query language, Cypher. There are declarative commands for creating nodes and
relationships (see Figures 24.4(a) and (b)), as well as for finding nodes and relationships based on specifying
patterns. Deletion and modification of data is also possible in Cypher. We introduced the CREATE command
in the previous section, so we will now give a brief overview of some of the other features of Cypher. A
Cypher query is made up of clauses. When a query has several clauses, the result from one clause can be the
input to the next clause in the query. We will give a flavor of the language by discussing some of the clauses
using examples. Our presentation is not meant to be a detailed treatment of Cypher, just an introduction to some of the language's features. Figure 24.4(c) summarizes some of the main clauses that can be part of a Cypher query. The Cypher language can specify complex queries and updates on a graph database. We will give a few examples to illustrate simple Cypher queries in Figure 24.4(d).


The queries in Figure 24.4(d) illustrate basic retrieval patterns; Query 2, for example, returns the projects and hours per week that the employee with Empid = 2 works on. Query 3, on the other hand, returns the employees and hours per week who work on the
project with Pno = 2. Query 4 illustrates the ORDER BY clause and returns all employees and the projects
they work on, sorted by Ename. It is also possible to limit the number of returned results by using the LIMIT
clause as in query 5, which only returns the first 10 answers. Query 6 illustrates the use of WITH and aggregation, although the WITH clause can be used to separate clauses in a query even if there is no
aggregation. Query 6 also illustrates the WHERE clause to specify additional conditions, and the query
returns the employees who work on more than two projects, as well as the number of projects each employee
works on. It is also common to return the nodes and relationships themselves in the query result, rather than
the property values of the nodes as in the previous queries. Query 7 is similar to query 5 but returns the nodes
and relationships only, and so the query result can be displayed as a graph using Neo4j’s visualization tool. It
is also possible to add or remove labels and properties from nodes. Query 8 shows how to add more
properties to a node by adding a Job property to an employee node. The above gives a brief flavor for the
Cypher query language of Neo4j. The full language manual is available online (see the bibliographic notes).
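
Since Figure 24.4(d) itself is not reproduced here, the queries below are hedged approximations of some of the queries just discussed (the label, relationship, and property names follow the COMPANY example; the exact queries in the figure may differ slightly):

// employees and hours per week who work on the project with Pno = 2 (cf. Query 3)
MATCH (e:EMPLOYEE)-[w:WorksOn]->(p:PROJECT {Pno: 2})
RETURN e.Ename, w.Hours

// all employees and their projects, sorted by Ename, limited to 10 results (cf. Queries 4 and 5)
MATCH (e:EMPLOYEE)-[w:WorksOn]->(p:PROJECT)
RETURN e.Ename, p.Pname
ORDER BY e.Ename LIMIT 10

// employees who work on more than two projects, with the project count (cf. Query 6)
MATCH (e:EMPLOYEE)-[w:WorksOn]->(p:PROJECT)
WITH e, count(p) AS numProjects
WHERE numProjects > 2
RETURN e.Ename, numProjects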

24.6.3 Neo4j Interfaces and Distributed System Characteristics


Neo4j has other interfaces that can be used to create, retrieve, and update nodes and relationships in a graph
database. It also has two main versions: the enterprise edition, which comes with additional capabilities, and
the community edition. We discuss some of the additional features of Neo4j in this subsection.

■ Enterprise edition vs. community edition. Both editions support the Neo4j graph data model and storage
system, as well as the Cypher graph query language, and several other interfaces, including a high-
performance native API, language drivers for several popular programming languages, such as Java, Python,
PHP, and the REST (Representational State Transfer) API. In addition, both editions support ACID
properties. The enterprise edition supports additional features for enhancing performance, such as caching
and clustering of data and locking.

■ Graph visualization interface. Neo4j has a graph visualization interface, so that a subset of the nodes and
edges in a database graph can be displayed as a graph. This tool can be used to visualize query results in a
graph representation.

■ Master-slave replication. Neo4j can be configured on a cluster of distributed system nodes (computers),
where one node is designated the master node. The data and indexes are fully replicated on each node in the
cluster. Various ways of synchronizing the data between master and slave nodes can be configured in the
distributed cluster.

■ Caching. A main memory cache can be configured to store the graph data for improved performance.

■ Logical logs. Logs can be maintained to recover from failures.

A full discussion of all the features and interfaces of Neo4j is outside the scope of our presentation. Full documentation of Neo4j is available online (see the bibliographic notes).
