UNIT 2_SCSA3008_DISTRIBUTED DATABASE AND INFORMATION SYSTEMS
Security
Security refers to protecting computers and their related data, networks, software, and
hardware from unauthorized access, misuse, theft, information loss, and other security issues.
As technology spreads, intruders, hackers, and thieves try to compromise computer security
for monetary gain, recognition, ransom, bullying, or intrusion into other businesses and
organizations. Computer security is needed to protect our systems from all these risks.
Types of security
1. Cyber Security: Cyber security means securing our computers, electronic devices, networks,
programs, and systems from cyber attacks. Cyber attacks are attacks that can happen when our
system is connected to the Internet.
2. Information Security: Information security means protecting our system’s information
from theft, illegal use, and piracy by unauthorized users.
3. Application Security: Application security means securing our applications and their data so
that they are not hacked, and so that the applications’ databases remain safe and private to their
owner and the users’ data remains confidential.
4. Network Security: Network security means securing a network and protecting the
information of the users who are connected through that network.
Security Techniques
1. Interfaces are exposed: Distributed systems are composed of processes that offer services or
share information. Their communication interfaces are necessarily open (to allow new clients to
access them) – an attacker can send a message to any interface.
4. Algorithms and program code are available to attackers: Secret encryption algorithms are
totally inadequate for today’s large-scale network environments. Best practice is to publish the
algorithms used for encryption and authentication, relying only on the secrecy of cryptographic
keys. This helps to ensure that the algorithms are strong by throwing them open to scrutiny by
third parties.
5. Attackers may have access to large resources: The cost of computing power is rapidly
decreasing. We should assume that attackers will have access to the largest and most powerful
computers projected in the lifetime of a system, then add a few orders of magnitude to allow for
unexpected developments.
6. Minimize the trusted base: The portions of a system that are responsible for the
implementation of its security, and all the hardware and software components upon which they
rely, have to be trusted – this is often referred to as the trusted computing base. Any defect or
programming error in this trusted base can produce security weaknesses, so we should aim to
minimize its size. For example, application programs should not be trusted to protect data from
their users.
Cryptography Algorithms
• Encryption is the process of encoding a message in such a way as to hide its contents.
Modern cryptography includes several secure algorithms for encrypting and decrypting
messages. They are all based on the use of secrets called keys.
• A cryptographic key is a parameter used in an encryption algorithm in such a way that the
encryption cannot be reversed without knowledge of the key.
• There are two main classes of encryption algorithm. The first uses shared secret keys: the
sender and the recipient must share knowledge of the key, and it must not be revealed to
anyone else. The second class of encryption algorithms uses public/private key pairs. Here
the sender of a message uses a public key, one that has already been published by the
recipient, to encrypt the message. The recipient uses the corresponding private key to
decrypt the message.
Uses of cryptography
Algorithms
Locking-based concurrency control systems can use either one-phase or two-phase locking
protocols.
In one-phase locking, each transaction locks an item before use and releases the lock as soon as
it has finished using it. This locking method provides maximum concurrency but does not always
enforce serializability.
In two-phase locking, all locking operations precede the first lock-release or unlock operation.
The transaction comprises two phases. In the first phase, the transaction acquires all the locks it
needs and does not release any lock. This is called the expanding or growing phase. In the
second phase, the transaction releases the locks and cannot request any new locks. This is called
the shrinking phase. The fundamental decision in distributed locking-based concurrency control
algorithms is where and how the locks are maintained (in what is usually called a lock table).
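The two-phase rule can be illustrated with a minimal Python sketch. The class name `TwoPhaseTransaction` and its methods are hypothetical names chosen for this illustration; the sketch only checks the ordering of lock and unlock calls for one transaction and does not model lock conflicts between transactions or a lock table.

```python
class TwoPhaseTransaction:
    """Tracks one transaction's locks and enforces the two-phase rule."""

    def __init__(self, name):
        self.name = name
        self.held = set()        # data items currently locked
        self.shrinking = False   # becomes True at the first unlock

    def lock(self, item):
        # Growing phase only: no new lock once any lock has been released.
        if self.shrinking:
            raise RuntimeError(f"{self.name}: lock({item}) after unlock violates 2PL")
        self.held.add(item)

    def unlock(self, item):
        # The first unlock starts the shrinking phase.
        self.held.remove(item)
        self.shrinking = True


t = TwoPhaseTransaction("T1")
t.lock("x")
t.lock("y")      # growing phase
t.unlock("x")    # shrinking phase begins
try:
    t.lock("z")  # not allowed under two-phase locking
    violation = False
except RuntimeError:
    violation = True
```

A one-phase scheme would simply omit the `shrinking` check, which is why it allows more concurrency but cannot guarantee serializability.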
Centralized 2PL
The 2PL algorithm can easily be extended to the distributed DBMS environment by delegating
lock management responsibility to a single site. This means that only one of the sites has a lock
manager; the transaction managers at the other sites communicate with it to obtain locks. This
approach is also known as the primary site 2PL algorithm.
This communication is between the coordinating TM, the lock manager at the central site, and the
data processors (DP) at the other participating sites. The participating sites are those that store the
data items on which the operation is to be carried out. These algorithms use a 5-tuple for the
operation they perform: Op : Type = {BT, R, W, A, C} (begin transaction, read, write, abort,
commit). The transaction manager (C2PL-TM) algorithm is written as a process that runs forever
and waits until a message arrives from an application (with a transaction operation), from a lock
manager, or from a data processor, as shown in Fig 2.2. The lock manager (C2PL-LM) and data
processor (DP) algorithms are written as procedures that are called when needed.
Distributed 2PL (D2PL) requires the availability of lock managers at each site.
The communication between cooperating sites that execute a transaction according to the
distributed 2PL protocol is shown in Fig 2.3. The distributed 2PL transaction management
algorithm is similar to C2PL-TM, with two major modifications. First, the messages that are sent
to the central site lock manager in C2PL-TM are sent to the lock managers at all participating
sites in D2PL-TM. Second, the operations are not passed to the data processors by the
coordinating transaction manager, but by the participating lock managers. This means that the
coordinating transaction manager does not wait for a “lock request granted” message.
Locking-based concurrency control algorithms may cause deadlocks; in the case of distributed
DBMSs, these could be distributed (or global) deadlocks due to transactions executing at
different sites waiting for each other. Deadlock detection and resolution is the most popular
approach to managing deadlocks in the distributed setting. The wait-for graph (WFG) can be
useful for detecting deadlocks; this is a directed graph whose vertices are active transactions, with
an edge from Ti to Tj if an operation in Ti is waiting to access a data item that is currently locked
in an incompatible mode by an operation in Tj. However, the formation of the WFG is more
complicated in a distributed setting due to the distributed execution of transactions. Therefore, it
is not sufficient for each site to form a local wait-for graph (LWFG) and check it; it is also
necessary to form a global wait-for graph (GWFG), which is the union of all the LWFGs, and
check it for cycles.
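The following Python sketch shows why the local graphs alone are not enough. It assumes each LWFG is represented as a dictionary mapping a transaction to the set of transactions it waits for; the function names `global_wfg` and `has_cycle` are hypothetical, and the cycle test is an ordinary depth-first search rather than any particular published detection algorithm.

```python
def global_wfg(local_graphs):
    """Union the LWFGs (dicts: transaction -> set of transactions it waits for)."""
    merged = {}
    for lwfg in local_graphs:
        for waiter, holders in lwfg.items():
            merged.setdefault(waiter, set()).update(holders)
    return merged


def has_cycle(graph):
    """Depth-first search for a cycle in a directed graph."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    colour = {}

    def visit(node):
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:                  # back edge: a cycle exists
                return True
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


site1 = {"T1": {"T2"}}   # at site 1, T1 waits for T2
site2 = {"T2": {"T1"}}   # at site 2, T2 waits for T1
local_deadlock = [has_cycle(g) for g in (site1, site2)]
global_deadlock = has_cycle(global_wfg([site1, site2]))
```

Neither local graph contains a cycle, but their union does, so the deadlock is visible only in the GWFG.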
There are three fundamental methods of detecting distributed deadlocks, referred to as centralized,
distributed, and hierarchical deadlock detection.
In the centralized deadlock detection approach, one site is designated as the deadlock detector for
the entire system. Periodically, each lock manager transmits its LWFG to the deadlock detector,
which then forms the GWFG and looks for cycles in it. The lock managers need only send
changes in their graphs (i.e., the newly created or deleted edges) to the deadlock detector. The
length of the intervals for transmitting this information is a system design decision: the smaller the
interval, the smaller the delays due to undetected deadlocks, but the higher the deadlock detection
and communication overhead.
In hierarchical deadlock detection, the deadlock detectors are arranged in a hierarchy.
Deadlocks that are local to a single site would be detected at that site using the LWFG. Each site
also sends its LWFG to the deadlock detector at the next level. Thus, distributed deadlocks
involving two or more sites would be detected by a deadlock detector in the next lowest level that
has control over these sites. For example, a deadlock at site 1 would be detected by the
local deadlock detector (DD) at site 1 (denoted DD21, 2 for level 2, 1 for site 1). If, however, the
deadlock involves sites 1 and 2, then DD11 detects it. Finally, if the deadlock involves sites 1
and 4, DD0x detects it, where x is one of 1, 2, 3, or 4. The hierarchical deadlock detection
method reduces the dependence on the central site, thus reducing the communication cost. It is,
however, considerably more complicated to implement and would involve nontrivial
modifications to the lock and transaction manager algorithms.
Timestamp-Based Algorithms
In the basic TO algorithm, the coordinating TM assigns a timestamp ts(Ti) to each transaction Ti,
determines the sites where each data item is stored, and sends the relevant operations to these
sites.
TO Rule: Given two conflicting operations Oij and Okl belonging, respectively, to transactions
Ti and Tk, Oij is executed before Okl if and only if ts(Ti) < ts(Tk). In this case Ti is said to be the
older transaction and Tk is said to be the younger one.
A scheduler that enforces the TO rule checks each new operation against conflicting operations
that have already been scheduled. If the new operation belongs to a transaction that is younger
than all the conflicting ones that have already been scheduled, the operation is accepted;
otherwise, it is rejected, causing the entire transaction to restart with a new timestamp.
To facilitate checking of the TO rule, each data item x is assigned two timestamps: a read
timestamp [rts(x)], which is the largest of the timestamps of the transactions that have read x, and
a write timestamp [wts(x)], which is the largest of the timestamps of the transactions that have
written (updated) x. It is now sufficient to compare the timestamp of an operation with the read
and write timestamps of the data item that it wants to access to determine if any transaction with
a larger timestamp has already accessed the same data item.
When an operation is rejected by a scheduler, the corresponding transaction is restarted by
the transaction manager with a new timestamp. This ensures that the transaction has a chance to
execute in its next try.
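The rts(x)/wts(x) check can be sketched in a few lines of Python. The class name `BasicTOScheduler` is hypothetical, and the sketch assumes timestamps are positive integers, with 0 standing for "never accessed". It models only the accept/reject decision at a single site, not the restart of rejected transactions.

```python
class BasicTOScheduler:
    """Accepts or rejects operations using per-item read/write timestamps."""

    def __init__(self):
        self.rts = {}  # item -> largest timestamp that has read it
        self.wts = {}  # item -> largest timestamp that has written it

    def read(self, ts, item):
        # A read conflicts only with writes: reject if a younger transaction wrote.
        if ts < self.wts.get(item, 0):
            return False
        self.rts[item] = max(self.rts.get(item, 0), ts)
        return True

    def write(self, ts, item):
        # A write conflicts with reads and writes by younger transactions.
        if ts < self.rts.get(item, 0) or ts < self.wts.get(item, 0):
            return False
        self.wts[item] = ts
        return True


s = BasicTOScheduler()
results = [
    s.read(2, "x"),   # accepted, rts(x) becomes 2
    s.write(1, "x"),  # rejected: an older transaction writes after a younger read
    s.write(3, "x"),  # accepted, wts(x) becomes 3
    s.read(2, "x"),   # rejected: x was already written by a younger transaction
]
```

Each `False` corresponds to a rejected operation, after which the transaction manager would restart the transaction with a new, larger timestamp.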
Optimistic Concurrency Control Algorithm
In systems with low conflict rates, the task of validating every transaction for serializability may
lower performance. In these cases, the test for serializability is postponed to just before commit.
Since the conflict rate is low, the probability of aborting transactions which are not serializable is
also low. This approach is called optimistic concurrency control technique.
In this approach, a transaction’s life cycle is divided into three phases: execution, validation, and commit.
According to this rule, a transaction must be validated locally at all sites when it executes. If a
transaction is found to be invalid at any site, it is aborted. Local validation guarantees that the
transaction maintains serializability at the sites where it has been executed. After a transaction
passes local validation test, it is globally validated.
Rule 2 − Given two transactions Ti and Tj, if Ti is writing the data item that Tj is reading, then
Ti’s commit phase cannot overlap with Tj’s execution phase. Tj can start executing only after
Ti has already committed.
According to this rule, after a transaction passes local validation test, it should be globally
validated. Global validation ensures that if two conflicting transactions run together at more than
one site, they should commit in the same relative order at all the sites they run together. This may
require a transaction to wait for the other conflicting transaction, after validation before commit.
This requirement makes the algorithm less optimistic since a transaction may not be able to
commit as soon as it is validated at a site.
Rule 3 − Given two transactions Ti and Tj, if Ti is writing the data item which Tj is also writing,
then Ti’s commit phase cannot overlap with Tj’s commit phase. Tj can start to commit only after
Ti has already committed.
Types of Schedules
• Parallel Schedules − In parallel schedules, more than one transaction is active
simultaneously, i.e. the transactions contain operations that overlap in time. This is
depicted in Fig 2.5.
Conflicts in Schedules
• At least one of the operations is a write_item() operation, i.e. it tries to modify the data
item.
Serializability
A serializable schedule of ‘n’ transactions is a parallel schedule which is equivalent to a serial
schedule comprising the same ‘n’ transactions. A serializable schedule retains the
correctness of a serial schedule while achieving the better CPU utilization of a parallel schedule.
Equivalence of Schedules
• Result equivalence − Two schedules producing identical results are said to be result
equivalent.
• View equivalence − Two schedules that perform the same actions in the same manner, so
that every read sees the value written by the same write in both, are said to be view
equivalent.
• Conflict equivalence − Two schedules are said to be conflict equivalent if both contain
the same set of transactions and have the same order of conflicting pairs of operations.
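Conflict equivalence can be checked mechanically, as the Python sketch below suggests. It assumes a schedule is a list of (transaction, action, item) tuples with action "r" or "w", that no transaction repeats an identical operation, and that two operations conflict when they come from different transactions, touch the same item, and at least one is a write. The function names are hypothetical.

```python
def conflicting_pairs(schedule):
    """Return the ordered pairs of conflicting operations in a schedule."""
    pairs = set()
    for i, (t1, a1, x1) in enumerate(schedule):
        for t2, a2, x2 in schedule[i + 1:]:
            # Different transactions, same item, at least one write.
            if t1 != t2 and x1 == x2 and "w" in (a1, a2):
                pairs.add(((t1, a1, x1), (t2, a2, x2)))
    return pairs


def conflict_equivalent(s1, s2):
    # Same operations overall, and the same ordering of every conflicting pair.
    return sorted(s1) == sorted(s2) and conflicting_pairs(s1) == conflicting_pairs(s2)


serial = [("T1", "r", "x"), ("T1", "w", "x"), ("T1", "r", "y"),
          ("T2", "r", "x"), ("T2", "w", "x")]
ok_parallel = [("T1", "r", "x"), ("T1", "w", "x"), ("T2", "r", "x"),
               ("T1", "r", "y"), ("T2", "w", "x")]
bad_parallel = [("T1", "r", "x"), ("T2", "r", "x"), ("T1", "w", "x"),
                ("T1", "r", "y"), ("T2", "w", "x")]

ok_result = conflict_equivalent(serial, ok_parallel)
bad_result = conflict_equivalent(serial, bad_parallel)
```

In `ok_parallel` only the non-conflicting read of y has moved, so the schedule is conflict equivalent to the serial one. In `bad_parallel` T2 reads x before T1 writes it, which reverses one conflicting pair.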
Serial schedules have low resource utilization and low throughput. To improve this, two or more
transactions are run concurrently. But concurrency of transactions may lead to inconsistency in
the database. To avoid this, we need to check whether these concurrent schedules are serializable
or not.
Distributed deadlocks
Deadlock is a state of a database system with two or more transactions in which each transaction
is waiting for a data item that is locked by some other transaction. A deadlock can be
indicated by a cycle in the wait-for-graph. This is a directed graph in which the vertices denote
transactions and the edges denote waits for data items.
For example, in the wait-for-graph shown in Fig 2.6, transaction T1 is waiting for data item X
which is locked by T3. T3 is waiting for Y which is locked by T2 and T2 is waiting for Z which is
locked by T1. Hence, a waiting cycle is formed, and none of the transactions can proceed.
Fig 2.6 Deadlocks
Deadlock Handling in Centralized Systems
• Deadlock prevention.
• Deadlock avoidance.
• Deadlock detection and removal.
All three approaches can be incorporated in both a centralized and a distributed database
system.
Deadlock Prevention
The deadlock prevention approach does not allow any transaction to acquire locks that will lead
to deadlocks. The convention is that when more than one transaction requests a lock on the
same data item, only one of them is granted the lock.
One of the most popular deadlock prevention methods is pre-acquisition of all the locks. In this
method, a transaction acquires all the locks before starting to execute and retains the locks for the
entire duration of the transaction. If another transaction needs any of the already acquired locks,
it has to wait until all the locks it needs are available. Using this approach, the system is prevented
from being deadlocked since none of the waiting transactions holds any lock.
Deadlock Avoidance
The deadlock avoidance approach handles deadlocks before they occur. It analyzes the
transactions and the locks to determine whether or not waiting leads to a deadlock.
The method can be briefly stated as follows. Transactions start executing and request data items
that they need to lock. The lock manager checks whether the lock is available. If it is available,
the lock manager allocates the data item and the transaction acquires the lock. However, if the
item is locked by some other transaction in incompatible mode, the lock manager runs an
algorithm to test whether keeping the transaction in waiting state will cause a deadlock or not.
Accordingly, the algorithm decides whether the transaction can wait or one of the transactions
should be aborted.
There are two algorithms for this purpose, namely wait-die and wound-wait. Let us assume that
there are two transactions, T1 and T2, where T1 tries to lock a data item which is already locked
by T2. The algorithms are as follows −
Deadlock Detection and Removal
The deadlock detection and removal approach runs a deadlock detection algorithm periodically
and removes deadlock in case there is one. It does not check for deadlock when a transaction
places a request for a lock. When a transaction requests a lock, the lock manager checks whether
it is available. If it is available, the transaction is allowed to lock the data item; otherwise the
transaction is allowed to wait.
Since there are no precautions while granting lock requests, some of the transactions may be
deadlocked. To detect deadlocks, the lock manager periodically checks whether the wait-for-graph
has cycles. If the system is deadlocked, the lock manager chooses a victim transaction from each
cycle. The victim is aborted and rolled back, and then restarted later. Some of the methods used
for victim selection are −
This approach is primarily suited for systems with a low number of transactions and where fast
response to lock requests is needed.
Deadlock Handling in Distributed Systems
Transaction processing in a distributed database system is also distributed, i.e. the same
transaction may be processing at more than one site. The two main deadlock handling concerns
in a distributed database system that are not present in a centralized system are transaction
location and transaction control. Once these concerns are addressed, deadlocks are handled
through any of deadlock prevention, deadlock avoidance or deadlock detection and removal.
Transaction Location
Transactions in a distributed database system are processed in multiple sites and use data items
in multiple sites. The amount of data processing is not uniformly distributed among these sites.
The time period of processing also varies. Thus the same transaction may be active at some sites
and inactive at others. When two conflicting transactions are located at a site, it may happen that
one of them is in an inactive state. This condition does not arise in a centralized system. This
concern is called the transaction location issue.
This concern may be addressed by the Daisy Chain model. In this model, a transaction carries
certain details when it moves from one site to another. Some of the details are the list of tables
required, the list of sites required, the list of visited tables and sites, the list of tables and sites
that are yet to be visited and the list of acquired locks with types. After a transaction terminates
by either commit or abort, the information should be sent to all the concerned sites.
Transaction Control
Transaction control is concerned with designating and controlling the sites required for
processing a transaction in a distributed database system. There are many options regarding the
choice of where to process the transaction and how to designate the center of control, like −
The site where the transaction enters is designated as the controlling site. The controlling site
sends messages to the sites where the data items are located to lock the items. Then it waits for
confirmation. When all the sites have confirmed that they have locked the data items, the
transaction starts. If any site or communication link fails, the transaction has to wait until it has been
repaired.
In case of conflict, one of the transactions may be aborted or allowed to wait as per distributed
wait-die or distributed wound-wait algorithms.
Let us assume that there are two transactions, T1 and T2. T1 arrives at Site P and tries to lock a
data item which is already locked by T2 at that site. Hence, there is a conflict at Site P. The
algorithms are as follows −
• Distributed Wait-Die
o If T1 is older than T2, T1 is allowed to wait. T1 can resume
execution after Site P receives a message that T2 has either
committed or aborted successfully at all sites.
o If T1 is younger than T2, T1 is aborted. The concurrency control at
Site P sends a message to all sites that T1 has visited to abort T1.
The controlling site notifies the user when T1 has been successfully
aborted at all the sites.
• Distributed Wound-Wait
o If T1 is older than T2, T2 needs to be aborted. If T2 is active at Site
P, Site P aborts and rolls back T2 and then broadcasts this message
to other relevant sites. If T2 has left Site P but is active at Site Q,
Site P broadcasts that T2 has been aborted; Site Q then aborts and
rolls back T2 and sends this message to all sites.
o If T1 is younger than T2, T1 is allowed to wait. T1 can resume
execution after Site P receives a message that T2 has completed
processing.
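The local decision made by the wait-die and wound-wait rules can be summarised in a short Python sketch. It assumes that a smaller timestamp means an older transaction; the function names and the returned strings are hypothetical, and the inter-site abort and completion messages described above are not modelled.

```python
def wait_die(requester_ts, holder_ts):
    """Wait-die: an older requester waits; a younger requester is aborted (dies)."""
    return "requester waits" if requester_ts < holder_ts else "abort requester"


def wound_wait(requester_ts, holder_ts):
    """Wound-wait: an older requester aborts (wounds) the holder; a younger one waits."""
    return "abort holder" if requester_ts < holder_ts else "requester waits"


# T1 requests an item held by T2. First T1 is older (ts 5 < 9), then younger (ts 9 > 5).
decisions = {
    ("wait-die", "T1 older"): wait_die(5, 9),
    ("wait-die", "T1 younger"): wait_die(9, 5),
    ("wound-wait", "T1 older"): wound_wait(5, 9),
    ("wound-wait", "T1 younger"): wound_wait(9, 5),
}
```

In both schemes only one direction of waiting is allowed between an older and a younger transaction (older waits for younger in wait-die, younger waits for older in wound-wait), so a cycle of waits cannot form.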
Distributed Deadlock Detection
Just as in the centralized deadlock detection approach, deadlocks are allowed to occur and are
removed if detected. The system does not perform any checks when a transaction places a lock
request. For implementation, global wait-for-graphs are created. The existence of a cycle in the
global wait-for-graph indicates a deadlock. However, it is difficult to spot deadlocks since
transactions wait for resources across the network.
Alternatively, deadlock detection algorithms can use timers. Each transaction is associated with a
timer which is set to a time period in which a transaction is expected to finish. If a transaction
does not finish within this time period, the timer goes off, indicating a possible deadlock.
Another tool used for deadlock handling is a deadlock detector. In a centralized system, there is
one deadlock detector. In a distributed system, there can be more than one deadlock detector. A
deadlock detector can find deadlocks for the sites under its control. There are three alternatives
for deadlock detection in a distributed system, namely −
• Centralized Deadlock Detector − One site is designated as the central deadlock
detector.
• Hierarchical Deadlock Detector − A number of deadlock detectors are arranged
in hierarchy.
• Distributed Deadlock Detector − All the sites participate in detecting deadlocks
and removing them.
Recovery from Power Failure
Power failure causes loss of information in the non-persistent memory. When power is restored,
the operating system and the database management system restart. The recovery manager then
initiates recovery from the transaction logs.
In case of immediate update mode, the recovery manager takes the following actions −
• Transactions which are in active list and failed list are undone and written on the
abort list.
• Transactions which are in before-commit list are redone.
• No action is taken for transactions in commit or abort lists.
In case of deferred update mode, the recovery manager takes the following actions −
• Transactions which are in the active list and failed list are written onto the abort
list. No undo operations are required since the changes have not been written to
the disk yet.
• Transactions which are in before-commit list are redone.
• No action is taken for transactions in commit or abort lists.
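The two sets of rules above differ only in how active and failed transactions are treated, which the following Python sketch makes explicit. The function name `recovery_action` and the returned strings are hypothetical labels for this illustration, not part of any real recovery manager.

```python
def recovery_action(update_mode, txn_list):
    """Return the recovery action for a transaction found in the given log list.

    update_mode: "immediate" or "deferred"
    txn_list: "active", "failed", "before-commit", "commit" or "abort"
    """
    if txn_list in ("commit", "abort"):
        return "none"
    if txn_list == "before-commit":
        return "redo"
    if txn_list in ("active", "failed"):
        if update_mode == "immediate":
            return "undo, then move to abort list"
        # Deferred updates never reached the disk, so there is nothing to undo.
        return "move to abort list"
    raise ValueError(f"unknown list: {txn_list}")


immediate_active = recovery_action("immediate", "active")
deferred_active = recovery_action("deferred", "active")
before_commit = recovery_action("deferred", "before-commit")
```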
Recovery from Disk Failure
A disk failure or hard crash causes a total database loss. To recover from this hard crash, a new
disk is prepared, then the operating system is restored, and finally the database is recovered using
the database backup and transaction log. The recovery method is the same for both immediate and
deferred update modes.
• The transactions in the commit list and before-commit list are redone and written
onto the commit list in the transaction log.
• The transactions in the active list and failed list are undone and written onto the
abort list in the transaction log.
Checkpointing
A checkpoint is a point in time at which a record is written onto the database from the buffers. As
a consequence, in case of a system crash, the recovery manager does not have to redo the
transactions that were committed before the checkpoint. Periodic checkpointing shortens the
recovery process.
• Consistent checkpointing
• Fuzzy checkpointing
Consistent Checkpointing
If in step 4, the transaction log is archived as well, then this checkpointing aids in recovery from
disk failures and power failures, otherwise it aids recovery from only power failures.
Fuzzy Checkpointing
In fuzzy checkpointing, at the time of checkpoint, all the active transactions are written in the
log. In case of power failure, the recovery manager processes only those transactions that were
active during checkpoint and later. The transactions that have been committed before checkpoint
are written to the disk and hence need not be redone.
Example of Checkpointing
Let us consider that in a system the time of checkpointing is tcheck and the time of system crash is
tfail. Let there be four transactions Ta, Tb, Tc and Td such that −
Transaction recovery is done to eliminate the adverse effects of faulty transactions rather than to
recover from a failure. Faulty transactions include all transactions that have changed the database
into an undesired state and the transactions that have used values written by the faulty transactions.
• UNDO all faulty transactions and transactions that may be affected by the faulty
transactions.
• REDO all transactions that are not faulty but have been undone due to the faulty
transactions.
• If the faulty transaction has done INSERT, the recovery manager deletes the data
item(s) inserted.
• If the faulty transaction has done DELETE, the recovery manager inserts the
deleted data item(s) from the log.
• If the faulty transaction has done UPDATE, the recovery manager eliminates the
value by writing the before-update value from the log.
• If the transaction has done INSERT, the recovery manager generates an insert
from the log.
• If the transaction has done DELETE, the recovery manager generates a delete
from the log.
• If the transaction has done UPDATE, the recovery manager generates an update
from the log.
In a local database system, to commit a transaction, the transaction manager only has to
convey the decision to commit to the recovery manager. However, in a distributed system, the
transaction manager should convey the decision to commit to all the servers at the various sites
where the transaction is being executed and uniformly enforce the decision. When processing is
complete at a site, the transaction there reaches the partially committed state and waits for the
transaction at all other sites to reach the partially committed state. When it receives the message
that all the sites are ready to commit, it starts to commit. In a distributed system, either all sites
commit or none of them does.
• One-phase commit
• Two-phase commit
• Three-phase commit
Distributed One-phase Commit
Distributed one-phase commit is the simplest commit protocol. Let us consider that there is a
controlling site and a number of slave sites where the transaction is being executed. The steps in
distributed commit are −
• After each slave has locally completed its transaction, it sends a “DONE” message
to the controlling site.
• The slaves wait for a “Commit” or “Abort” message from the controlling site. This
waiting time is called the window of vulnerability.
• When the controlling site has received a “DONE” message from each slave, it makes a
decision to commit or abort. This is called the commit point. Then, it sends this
decision to all the slaves.
• On receiving this message, a slave either commits or aborts and then sends an
acknowledgement message to the controlling site.
Distributed Two-phase Commit
Distributed two-phase commit reduces the vulnerability of one-phase commit protocols. The
steps performed in the two phases are as follows −
• After each slave has locally completed its transaction, it sends a “DONE” message
to the controlling site. When the controlling site has received “DONE” message
from all slaves, it sends a “Prepare” message to the slaves.
• The slaves vote on whether they still want to commit or not. If a slave wants to
commit, it sends a “Ready” message.
• A slave that does not want to commit sends a “Not Ready” message. This may
happen when the slave has conflicting concurrent transactions or there is a
timeout.
• After the controlling site has received “Ready” message from all the slaves −
o The controlling site sends a “Global Commit” message to the
slaves.
o The slaves apply the transaction and send a “Commit ACK”
message to the controlling site.
o When the controlling site receives “Commit ACK” message from
all the slaves, it considers the transaction as committed.
• After the controlling site has received the first “Not Ready” message from any
slave −
o The controlling site sends a “Global Abort” message to the slaves.
o The slaves abort the transaction and send an “Abort ACK” message
to the controlling site.
o When the controlling site receives “Abort ACK” message from all
the slaves, it considers the transaction as aborted.
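The message flow of the two phases can be simulated in a few lines of Python. This is a minimal in-memory sketch: the `Slave` class, its method names and the `can_commit` flag are hypothetical, messages are plain function calls, and site failures, timeouts and logging are not modelled.

```python
class Slave:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "done"   # local work finished, "DONE" already sent

    def on_prepare(self):
        # Phase 1: vote in reply to the "Prepare" message.
        return "Ready" if self.can_commit else "Not Ready"

    def on_decision(self, decision):
        # Phase 2: apply the global decision and acknowledge it.
        if decision == "Global Commit":
            self.state = "committed"
            return "Commit ACK"
        self.state = "aborted"
        return "Abort ACK"


def two_phase_commit(slaves):
    """Run both phases from the controlling site's point of view."""
    decision = "Global Commit"
    for s in slaves:
        if s.on_prepare() == "Not Ready":
            decision = "Global Abort"   # the first "Not Ready" settles the outcome
            break
    acks = [s.on_decision(decision) for s in slaves]
    return decision, acks


all_ready = [Slave("S1"), Slave("S2")]
one_refuses = [Slave("S1"), Slave("S2", can_commit=False), Slave("S3")]
commit_result = two_phase_commit(all_ready)
abort_result = two_phase_commit(one_refuses)
```

In the second run a single “Not Ready” vote forces every slave to abort, including those that voted “Ready”. This is the all-or-nothing behaviour the protocol is meant to guarantee.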
Distributed Three-phase Commit
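Three-phase commit adds a “prepare to commit” phase between the voting phase and the final
commit/abort phase. In the final phase, the steps are the same as in two-phase commit except that
the “Commit ACK”/“Abort ACK” message is not required.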
Threats in a Database
• Availability loss − Availability loss refers to the non-availability of database objects
to legitimate users.
• Integrity loss − Integrity loss occurs when unacceptable operations are performed
upon the database either accidentally or maliciously. This may happen while
creating, inserting, updating or deleting data. It results in corrupted data leading to
incorrect decisions.
• Confidentiality loss − Confidentiality loss occurs due to unauthorized or
unintentional disclosure of confidential information. It may result in illegal
actions, security threats and loss in public confidence.
Measures of Control
The measures of control can be broadly divided into the following categories −
The coded message is called cipher text and the original message is called plain text. The
process of converting plain text to cipher text by the sender is called encoding or encryption.
The process of converting cipher text to plain text by the receiver is called decoding
or decryption.
The entire procedure of communicating using cryptography can be illustrated through the
following diagram −
Fig 2.8 Cryptography
In conventional cryptography, encryption and decryption are done using the same secret key.
Here, the sender encrypts the message with an encryption algorithm using a copy of the secret
key. The encrypted message is then sent over public communication channels. On receiving the
encrypted message, the receiver decrypts it with a corresponding decryption algorithm using the
same secret key.
The most famous conventional cryptography algorithm is the Data Encryption Standard, or DES.
The advantage of this method is its easy applicability. However, the greatest problem of
conventional cryptography is sharing the secret key between the communicating parties. The
ways to send the key are cumbersome and highly susceptible to eavesdropping.
In contrast to conventional cryptography, public key cryptography uses two different keys,
referred to as the public key and the private key. Each user generates a pair of public and
private keys. The user then puts the public key in an accessible place. When a sender wants to
send a message, the sender encrypts it using the public key of the receiver. On receiving the
encrypted message, the receiver decrypts it using the receiver’s private key. Since the private key
is not known to anyone but the receiver, no other person who receives the message can decrypt it.
The most popular public key cryptography algorithms are the RSA algorithm and the Diffie–
Hellman algorithm. This method is very secure for sending private messages. However, the
problem is that it involves a lot of computation and so proves to be inefficient for long messages.
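The public/private key idea can be demonstrated with textbook RSA on deliberately tiny numbers. The Python sketch below is for illustration only: the primes 61 and 53 and the exponent 17 are assumed toy values, and there is no padding, so it must not be used to protect real data. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
# Toy key generation with tiny primes; real keys use a modulus of 2048 bits or more.
p, q = 61, 53
n = p * q                  # modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                     # public exponent, coprime with phi
d = pow(e, -1, phi)        # private exponent: modular inverse of e (Python 3.8+)

public_key, private_key = (e, n), (d, n)


def encrypt(message, key):
    exponent, modulus = key
    return pow(message, exponent, modulus)


def decrypt(cipher, key):
    exponent, modulus = key
    return pow(cipher, exponent, modulus)


plain = 65                                 # a message encoded as an integer < n
cipher = encrypt(plain, public_key)        # anyone holding the public key can do this
recovered = decrypt(cipher, private_key)   # only the holder of the private key can do this
```

The modular exponentiations are the costly part, and with realistic key sizes they become expensive, which is the inefficiency for long messages mentioned above.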
The solution is to use a combination of conventional and public key cryptography. The secret
key is encrypted using public key cryptography before it is shared between the communicating
parties. Then, the message is sent using conventional cryptography with the aid of the shared
secret key.
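A rough sketch of this combined scheme is given below. It is a toy under loud assumptions: the receiver's RSA key is built from two known Mersenne primes purely for illustration, the symmetric part is a simple SHA-256 based XOR stream rather than a real cipher, and all names (xor_stream, hybrid_send, hybrid_receive) are hypothetical.

```python
import hashlib
import secrets

# Receiver's toy RSA key pair (illustrative only; far too small for real use).
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))   # requires Python 3.8+


def xor_stream(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR data with a key-derived stream (same call decrypts)."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ s for b, s in zip(data, stream))


def hybrid_send(message: bytes):
    """Sender: pick a fresh secret key, wrap it with the public key, encrypt the message."""
    session_key = secrets.token_bytes(16)                         # 128-bit key, smaller than N
    wrapped_key = pow(int.from_bytes(session_key, "big"), E, N)   # public key step
    return wrapped_key, xor_stream(session_key, message)          # conventional step


def hybrid_receive(wrapped_key: int, cipher_text: bytes) -> bytes:
    """Receiver: unwrap the secret key with the private key, then decrypt the message."""
    session_key = pow(wrapped_key, D, N).to_bytes(16, "big")
    return xor_stream(session_key, cipher_text)


wrapped, cipher_text = hybrid_send(b"a long message " * 100)
recovered = hybrid_receive(wrapped, cipher_text)
```

Only the short secret key goes through the expensive public key operation; the long message is handled by the cheap symmetric step.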
Digital Signatures
A Digital Signature (DS) is an authentication technique based on public key cryptography used
in e-commerce applications. It associates a unique mark to an individual within the body of his
message. This helps others to authenticate valid senders of messages.
Typically, a user’s digital signature varies from message to message in order to provide security
against counterfeiting. The method is as follows −
• The sender takes a message, calculates the message digest of the message and
signs the digest with his private key.
• The sender then appends the signed digest to the plaintext message.
• The message is sent over the communication channel.
• The receiver removes the appended signed digest and recovers the digest using the
corresponding public key.
• The receiver then takes the plaintext message and runs it through the same
message digest algorithm.
• If the results of step 4 and step 5 match, then the receiver knows that the message
has integrity and is authentic.
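The steps above can be sketched in a few lines. This is a toy illustration, not a production signature scheme: it assumes SHA-256 as the message digest algorithm and a small textbook RSA key built from two known Mersenne primes, and the names digest, sign and verify are hypothetical. Real systems use standardized schemes with proper padding and much larger keys.

```python
import hashlib

# Sender's toy RSA key pair (illustrative only).
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent (Python 3.8+)


def digest(message: bytes) -> int:
    """Message digest reduced modulo N so that it fits the toy key."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Steps 1-2: compute the digest and sign it with the private key."""
    return pow(digest(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Steps 4-6: recover the digest with the public key and compare with a fresh digest."""
    return pow(signature, E, N) == digest(message)


message = b"pay 100 to Alice"
signature = sign(message)                            # appended to the plaintext message
is_valid = verify(message, signature)                # genuine message
is_forged = verify(b"pay 900 to Alice", signature)   # modified in transit
```

Because the signature depends on the digest, it changes from message to message, and a signature copied onto a different message fails the comparison.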
A distributed system needs more security measures than a centralized system, since there are
many users, diversified data, multiple sites and distributed control. In this chapter, we will look
into the various facets of distributed database security.
In distributed communication systems, there are two types of intruders −
• Passive eavesdroppers − They monitor the messages and get hold of private
information.
• Active attackers − They not only monitor the messages but also corrupt data by
inserting new data or modifying existing data.
Security measures encompass security in communications, security in data and data auditing.
Communications Security
In a distributed database, a lot of data communication takes place owing to the diversified
location of data, users and transactions. So, it demands secure communication between users and
databases and between the different database environments.
Two popular, consistent technologies for achieving end-to-end secure communications are −
A database security system needs to detect and monitor security violations, in order to ascertain
the security measures it should adopt. It is often very difficult to detect breach of security at the
time of occurrences. One method to identify security violations is to examine audit logs. Audit
logs contain information such as −
The World Wide Web (“WWW” or “web” for short) has become a major repository of data and
documents. The web represents a very large, dynamic, and distributed data store and there are the
obvious distributed data management issues in accessing web data. The web, in its present form,
can be viewed as two distinct yet related components. The first of these components is what is
known as the publicly indexable web (PIW) that is composed of all static (and cross-linked) web
pages that exist on web servers. These can be easily searched and indexed. The other component,
which is known as the deep web (or the hidden web), is composed of a huge number of databases
that encapsulate the data, hiding it from the outside world. The data in the hidden web are usually
retrieved by means of search interfaces where the user enters a query that is passed to the
database server, and the results are returned to the user as a dynamically generated web page. A
portion of the deep web has come to be known as the “dark web,” which consists of encrypted
data and requires a particular browser such as Tor to access.
Web Crawling
A crawler scans the web on behalf of a search engine to extract information about the visited web
pages. Given the size of the web, the changing nature of web pages, and the limited computing
and storage capabilities of crawlers, it is impossible to crawl the entire web. Thus, a crawler must
be designed to visit “most important” pages before others. The issue, then, is to visit the pages in
some ranked order of importance.
There are a number of issues that need to be addressed in designing a crawler. Since the primary
goal is to access more important pages before others, there needs to be some way of determining
the importance of a page. This can be done by means of a measure that reflects the importance of
a given page. These measures can be static, such that the importance of a page is determined
independent of retrieval queries that will run against it, or dynamic in that they take the queries
into consideration. Examples of static measures are those that determine the importance of a page
Pi with respect to the number of pages that point to Pi (referred to as backlinks), or those that
additionally take into account the importance of the backlink pages, as is done in the popular
PageRank metric that is used by Google and others. A possible dynamic measure may be one that
calculates the importance of a page Pi with respect to its textual similarity to the query that is
being evaluated, using some of the well-known information retrieval similarity measures.
Recall that the PageRank of a page Pi, denoted PR(Pi), is simply the normalized sum of the
PageRank of all Pi’s backlink pages (denoted as BPi), where the normalization for each Pj ∈ BPi
is over all of Pj’s forward links FPj:
PR(P_i) = \sum_{P_j \in B_{P_i}} \frac{PR(P_j)}{|F_{P_j}|}
Recall also that this formula calculates the rank of a page based on the backlinks, but normalizes
the contribution of each backlinking page Pj using the number of forward links that Pj has. The
idea here is that it is more important to be pointed at by pages that conservatively link to other
pages than by those that link to others indiscriminately, but the “contribution” of a link from such
a page needs to be normalized over all the pages that it points to.
A second issue is how the crawler chooses the next page to visit once it has crawled a particular
page. As noted earlier, the crawler maintains a queue in which it stores the URLs for the pages
that it discovers as it analyzes each page. Thus, the issue is one of ordering the URLs in this
queue. A number of strategies are possible. One possibility is to visit the URLs in the order in
which they were discovered; this is referred to as the breadth-first approach. Another alternative
is to use random ordering, whereby the crawler chooses a URL randomly from among those that
are in its queue of unvisited pages. Other alternatives are to use metrics that combine ordering
with the importance ranking discussed above, such as backlink counts or PageRank.
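The PageRank definition and the queue ordering strategies can be made concrete with a small sketch. It rests on several assumptions: the three-page link graph is invented for illustration, every page has at least one forward link, the rank is approximated by repeating the normalized sum a fixed number of times without a damping factor, and the function names are hypothetical.

```python
from collections import deque


def pagerank(forward_links: dict, iterations: int = 100) -> dict:
    """Repeatedly set PR(Pi) to the sum of PR(Pj) / |FPj| over all backlink pages Pj."""
    pages = list(forward_links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for pj, targets in forward_links.items():
            share = rank[pj] / len(targets)    # normalize over Pj's forward links
            for pi in targets:
                new_rank[pi] += share          # Pj is a backlink page of Pi
        rank = new_rank
    return rank


def breadth_first_order(discovered: list) -> list:
    """Visit URLs in the order in which they were discovered (FIFO queue)."""
    queue = deque(discovered)
    order = []
    while queue:
        order.append(queue.popleft())
    return order


def importance_order(discovered: list, rank: dict) -> list:
    """Visit URLs with the highest importance score first."""
    return sorted(discovered, key=lambda url: rank[url], reverse=True)


# Hypothetical link graph: page -> pages it links to (its forward links).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)

frontier = ["B", "C"]                            # URLs discovered while analyzing page A
bfs_visits = breadth_first_order(frontier)       # B first, then C
ranked_visits = importance_order(frontier, ranks)  # C first, since it has more backlinks
```

In this graph C is pointed at by both A and B, so it ends up with a higher rank than B, and the importance-based ordering visits it first even though B was discovered earlier.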