Advanced Database - Allchapters

Brain Storming
1. Differentiate the terms:

• Database,

• Database Management System (DBMS), and

• Database System.
Definition of Terms

Database – an organized collection of related data held in a computer
or a data bank, designed to be accessible in various ways.

DBMS – the technology of storing and retrieving users’ data with utmost
efficiency along with appropriate security measures; a software
package/system that facilitates the creation and maintenance of a
computerized database.

Database System – the DBMS software together with the data itself.

Brain Storming
1. How has the database technology evolved?
Types of DBMS Models

• In the hierarchical database model, the data is
organized in a tree-like structure.
• Data is stored in a hierarchical (top-down or
bottom-up) format.
• Data is represented using parent-child
relationships that are one-to-one or one-to-many.
• In a hierarchical DBMS a parent may have many
children, but each child has only one parent.

• The network database model allows each child to have multiple
parents.
• It supports more complex relationships, such as many-to-many
(e.g., orders/parts).
• The entities are organized in a graph which can be accessed
through several paths.

• The simplest and most widely used DBMS model.
• Based on normalizing data into the rows and columns
of tables.
• Data is stored in fixed structures and manipulated
using SQL.

Characteristics of the Relational Model (70s) ….
 Clean and simple.
 Great for administrative and transactional data.
 Not as good for other kinds of complex data (e.g., multimedia,
networks, CAD).
 Relations are the key concept; everything else is built around
relations.
 Primitive data types, e.g., strings, integers, dates, etc.
 Great normalization, query optimization, and theory.

What is missing?
 Handling of complex objects
– Could not store complex data such as images or sound.
 Handling of complex data types
– RDBMSs have provided only limited data types.
 Code is not coupled with data
– SQL is declarative, but programming
languages are procedural.

• In the object-oriented model, data is stored in the form of objects.
• The structure, called a class, defines the data within it.
• It defines a database as a collection of objects which store both
data member values and operations.

• Properties
• Name
• Height
• Weight……..
• Behaviors
• Eat
• Pray
• Walk …..
Object-Oriented models (80’s):
▪ Complicated, but with some influential ideas from object-oriented
programming.
▪ Handles complex data types.
▪ Idea: build a DBMS based on the OO model.
▪ Programming languages have evolved from procedural to object
oriented. So why not DBMSs?
Evolution of Database Models
Introduction to OODBMS
OO Concepts (Mandatory Concepts)

 The Golden Rules/1 – OO concepts (mandatory):
 Complex objects
 Object identity
 Encapsulation
 Types and/or Classes
 Class or Type Hierarchies
 Overriding, overloading
 Computational completeness
 Extensibility – user-defined types can be used in the same way as
system-defined types

 The Golden Rules/2 – DBMS concepts (mandatory):
 Persistence
 Secondary storage management
 Concurrency
 Recovery
 Ad Hoc Query Facility
The Goodies (Optional concepts that may be implemented)
 Multiple inheritance
 Type checking and type inferencing
 Distribution
 Design transactions
 Versions
Discussion …
1. What are the main features of OOP?

2. What are the main capabilities of Database?

3. What is an Object-Oriented Database (OODBMS)?


Object-Oriented Database
• An OODBMS is a type of database
management system (DBMS) that utilizes the
principles of object-oriented programming
(OOP).

• Data is stored in objects, which encapsulate
both data (attributes) and behavior (methods).

• Objects can interact with each other through
methods, promoting a more natural
representation of complex relationships.
Why Object Oriented Databases?
• There are three reasons for the need for OODBMS:
1. Limitation of RDBMS

2. Need for Advanced Applications

3. Popularity of Object Oriented Programming Paradigm


OODB Advantages and Disadvantages
▪ Advantages of OODBMS:
▪ Natural Data Modeling: object-based modeling aligns well with real-
world entities, simplifying data representation.
▪ Reduced Development Time: inheritance and built-in functionalities
can expedite development.
▪ Improved Code Maintainability: encapsulation promotes modularity
and reduces code complexity.
▪ Complex Data Handling: OODBMSs excel at managing intricate data
structures and relationships.

OODB Advantages and Disadvantages
• Disadvantages of OODBMS:
• Performance: OODBMSs may have slower query performance
than relational databases for simple queries.
• Complexity: OOP concepts like inheritance and complex object
structures can add complexity.
• Limited Adoption: OODBMSs have a smaller market share
than relational databases.
Approaches for OODBMS
❑ There are two approaches to object-oriented
databases:
❑ Object-Oriented Model (OODBMS)
– Pure OO concepts
– Examples: Orion, Iris
❑ Object-Relational Model (ORDBMS)
– Extended relational model with OO concepts
– Examples: Oracle 8i, SQL Server 2000
Object Data Management Group (ODMG)
❑ ODMG — formed to define standards for OODBMSs.
❑ The current version of its standard, ODMG 3.0, is the most widely adopted.
– provide a standard where previously there was none
– support portability between products
– standardize model, querying and programming issues
❑ The major components of the ODMG architecture for an OODBMS are:
– Object Model (OM),
– Object Definition Language (ODL),
– Object Query Language (OQL), and C++, Java, and Smalltalk language
bindings.
ODMG Objects and Literals
 The basic building blocks of the object model are:
1. Objects
2. Literals
▪ Objects - represent real-world entities with attributes (data) and
methods (behavior).
▪ Objects are described by four characteristics
1. Identifier(OID) : unique system wide identifier.
• The OID of an object is independent of the values of its
attributes
2. Name: used to refer to the object. It is optional.
ODMG Objects
Objects are described by four characteristics
3. Lifetime
▪ Transient – exists only while the program runs; it can be updated and
deleted, and it does not survive program termination.
▪ Persistent – a permanent object stored in the database (the concern of an OODB).
▪ Persistence is achieved by naming (giving the object a name) or by
reachability (collecting similar objects under one name, via extents).
4. Structure: specifies whether the object is atomic or a collection
type
Object Identity
▪ Identity: every object has a unique identity.
▪ Object identity must be unique and non-volatile (immutable):
 In time: it cannot change and cannot be re-assigned to another
object when the original object is deleted.
 In space: it must be unique within and across database
boundaries.
Object Factory
▪ An object factory is an object that generates many other objects
through its operations.

Example:

▪ Date object – can generate many calendar dates

▪ Ethiopian calendar

▪ European calendar

▪ Arabic calendar

▪ Indian calendar
ODMG Literals
▪ ODMG Literals- are special values used in Object Database
Management Systems (ODBMS) that follow the ODMG (Object Data
Management Group) standard.
Object types
• An object type is a blueprint or template that defines the structure
and behavior of its objects.

• It acts as a category that groups similar objects.

• The object type specifies what attributes (data) objects of that type
will have and what methods (actions) they can perform.

• For example, if you have an object type named “Car”, all Car objects
would share certain attributes like model, color, and number of
doors.

• All Car objects also share methods like accelerate(), brake(), and
turn().
Object types
▪ An object is made of two things
▪ State
▪ Behaviour
State
– is defined by the values of an object carries for a set of properties,
which may be either an attribute of the object or a relationship
between the object and one or more other objects.
– Example- Attributes (name, address, birthDate of a person)

Relationships
 Relationships are defined between types.
 Only binary relationships with cardinality 1:1, 1:*, and *:* are supported.
 A relationship has a name and is not a ‘first-class’ object.
 Traversal paths are defined for each direction of traversal.
 Example: a Branch Has a set of Staff, and a member of Staff WorksAt a
Branch.
Behaviour
▪ Behaviour – is defined by a set of operations that can be performed
on or by the object.
▪ E.g., the age of a person is computed from birthDate and the current date.
▪ Operations implement the object’s behavior.
▪ Types of operations:
▪ Constructor: creates a new instance of a class
▪ Query: accesses the state of an object but does not alter its state
▪ Update: alters the state of an object
▪ Scope: operation applying to the class instead of an instance
Type constructors
▪ Type constructors are special keywords used to define the
structure of complex objects.
▪ They act like building blocks, allowing you to create
objects with various data types and relationships between them.
▪ The three most basic constructors are
▪ Atom,
▪ Struct (or tuple), and
▪ Collection.
Atomic constructors
 Includes the basic built-in data types of the object model,
which are similar to the basic types in many programming
languages: integers, strings, floating-point numbers,
enumerated types, Booleans.
 They are called single-valued or atomic types, since
each value of the type is considered an atomic (indivisible)
single value.
Struct (or tuple) constructor
 Creates standard structured types, such as the tuples (record
types) in the basic relational model.
 Referred to as a compound or composite type.
 Examples:
 struct Name<FirstName: string, MiddleInitial: char, LastName: string>
 struct CollegeDegree<Major: string, Degree: string, Year: date>
Collection (or multivalued) constructor
 Collection is used to create complex nested type structures in the
object model.
 Collection type constructors include:
1. Set(T) - unordered collections that do not allow duplicates.
2. Bag(T) - allows duplicate elements in the collection and also
inherits the collection interface.
3. List(T) - creates collections where the order of the elements is
important.
4. Array(T) - an ordered collection whose elements are referenced by index.
5. Dictionary(K,V) - allows the creation of a collection of
association pairs <K,V>, where all keys K are unique.
A minimal sketch of these collection types appears below.
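The following sketch uses plain Python stand-ins (not ODMG syntax; all
variable names are illustrative) to show the behavioral differences
between the five collection constructors:

  # Illustrative only: Python stand-ins for the ODMG collection constructors.
  from collections import Counter

  ages_set   = {21, 25, 30}                      # Set(T): unordered, no duplicates
  grades_bag = Counter(["A", "B", "A"])          # Bag(T): unordered, duplicates kept
  queue_list = ["s1", "s2", "s3"]                # List(T): element order matters
  marks_arr  = [75, 82, 90]                      # Array(T): referenced by index, marks_arr[1] == 82
  phones     = {"alice": "0911", "bob": "0912"}  # Dictionary(K,V): unique keys

  print(grades_bag["A"])                         # 2 - the bag counts duplicate membership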
Core Concepts of OODBMS
(objects, classes, interfaces, inheritance, encapsulation, polymorphism)
In the ODMG Object Model there are two ways to specify object types:
§ interfaces, and
§ classes.
Interface --- defines only the abstract behavior of an object type, using
operation signatures.
 Allows behavior to be inherited by other interfaces and classes using
the ‘:’ symbol.
 Properties (attributes and relationships) cannot be inherited.
 An interface is noninstantiable.
Classes
 A class defines both the abstract state and behavior of an object type.
 A class is instantiable (thus, interface is an abstract concept and class is
an implementation concept).
 Use the extends keyword to specify single inheritance between classes.
 Multiple inheritance is not allowed.
 Classes encapsulate data + methods + relationships.
 In OODBMSs objects are persistent (unlike in OOP languages).
 The interface part of an operation is sometimes called the signature,
and the implementation is sometimes called the method.
Classes
• Classes are blueprints or templates for creating objects.
• A class defines the common properties and behaviors that
objects of the same type possess.
• It specifies the attributes and methods.

• A class has:
– A name
– A set of attributes
– A set of methods
– A set of constraints

Extents and keys
 Extents and keys are specified during class definition:
 An extent is the set of all instances of a given type within a particular ODMS.
 Deleting an object removes the object from the extent of its type.
 A key uniquely identifies the instances of a type (similar to the concept of a
candidate key).
 A type must have an extent to have a key.
 A key is different from an object name; a key is composed of properties
specified in an object type’s interface, whereas an object name is defined
within the database.
Abstraction, Encapsulation, and Information Hiding …

Abstraction
▪ Abstraction focuses on hiding unnecessary details and exposing only
relevant information.
▪ It provides a simplified and conceptual view of objects and their
interactions.
▪ Abstraction helps in managing complexity and improving code
maintainability.
Encapsulation ….
• Encapsulation is the concept of bundling
data and methods together within a class.
• Encapsulation helps achieve data abstraction, security, and code
reusability.
• Information hiding - separates the external aspects of an object from its
internal details, which are hidden from the outside world.
• The focus is more on data security.
• To encourage encapsulation, an operation is defined in two parts:
• the signature or interface of the operation specifies the operation name and
arguments (or parameters);
• the method or body specifies the implementation of the operation.
• Operations are invoked by passing a message.
Inheritance …
 A class can be defined in terms of another one.
 Inheritance allows the definition of new types based on other
predefined types, leading to a type (or class) hierarchy.
 It enables the reuse of attributes and methods from a
base class (superclass) in derived classes (subclasses).
 Subclasses inherit the properties of the superclass and
can extend or modify them as needed.

Example: Person is the superclass and Student is the subclass;
the Student class inherits the attributes and operations of Person.

Person
  name: {firstName: string, middleName: string, lastName: string}
  address: string
  birthDate: date
  age(): Integer
  changeAddress(newAdd: string)

Student (extends Person)
  regNum: string {PK}
  major: string
  register(C: Course): boolean

A Python sketch of this hierarchy follows.
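A minimal Python sketch of the Person/Student hierarchy above (the
attribute handling and the Course parameter are illustrative assumptions):

  from datetime import date

  class Person:
      def __init__(self, name, address, birth_date):
          self.name = name              # e.g. a dict of first/middle/last names
          self.address = address
          self.birth_date = birth_date

      def age(self):                    # age(): Integer, computed from birthDate
          today = date.today()
          had_birthday = (today.month, today.day) >= (self.birth_date.month, self.birth_date.day)
          return today.year - self.birth_date.year - (0 if had_birthday else 1)

      def change_address(self, new_add):  # changeAddress(newAdd: string)
          self.address = new_add

  class Student(Person):                # subclass: inherits Person's state and behavior
      def __init__(self, name, address, birth_date, reg_num, major):
          super().__init__(name, address, birth_date)
          self.reg_num = reg_num        # regNum {PK}
          self.major = major

      def register(self, course):       # register(C: Course): boolean (stubbed)
          return True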
Types of inheritance (Multiple and Selective Inheritance)
 Multiple Inheritance - a class inherits features from more
than one superclass.
 Example: Engineer_Manager is a subtype of both Engineer and Manager.
 This leads to the creation of a type lattice rather than a type hierarchy.
 Selective Inheritance - occurs when a subtype inherits only
some of the functions of a supertype.
 The mechanism of selective inheritance is not typically
provided in ODBs.
 It is used more frequently in artificial intelligence applications.
Polymorphism
 Polymorphism - the ability to appear in many forms.
 Meaning:
o the ability to process objects differently depending on their data type
or class;
o the ability to redefine methods for derived classes.
▪ Overloading –
▪ allows the name of a method to be reused within a class definition.
▪ Overriding –
▪ allows the name of a property to be redefined in a subclass.
▪ Dynamic binding - allows the determination of an object’s type and methods
to be deferred until runtime.
A short sketch of overriding, overloading, and dynamic binding follows.
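The sketch below uses illustrative class names; note that Python emulates
overloading with default arguments rather than true signature-based
overloading:

  class Shape:
      def area(self):
          return 0.0

  class Circle(Shape):
      def __init__(self, r):
          self.r = r
      def area(self):                 # overriding: redefines Shape.area
          return 3.14159 * self.r ** 2

  class Rectangle(Shape):
      def __init__(self, w, h=None):  # overloading emulated via a default argument
          self.w, self.h = w, (h if h is not None else w)
      def area(self):
          return self.w * self.h

  for s in [Circle(2), Rectangle(3), Rectangle(3, 4)]:
      print(s.area())                 # dynamic binding: the method is chosen at runtime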
Extensibility
 Extensibility allows the creation of new data types, i.e. user-defined
types, and operations from built-in atomic data types and user-defined
data types using type constructors.
 A type constructor is a mechanism for building new domains.
 A complex object is built using type constructors such as sets,
tuples, lists and nested combinations.
 A combination of a user-defined type and its associated
methods is called an abstract data type (ADT).
Versioning
 An object version represents an identifiable state of an object.
 A version history represents the evolution of an object.
 The process of maintaining the evolution of objects is known as
version management.
Overview of ODL & OQL
 The Object Definition Language (ODL) is a language for defining
the specifications of object types for ODMG-compliant systems.
 The ODL defines the attributes, relationships and signatures of
the operations, but it does not address the implementation of the
signatures.
Overview of ODL & OQL
 Object Query Language (OQL) provides declarative access to
the object database using an SQL-like syntax.
 It does not provide explicit update operators, but leaves this to the
operations defined on object types.
 An OQL query is a function that delivers an object whose type may
be inferred from the operators contributing to the query expression.
 OQL can be used for both associative and navigational access.
Querying object-relational databases
 Most relational operators work on object-relational tables,
e.g., selection, projection, aggregation, set operations.
 Several major software companies, including IBM, Informix,
Microsoft, Oracle, and Sybase, have released object-relational
versions of their products.
 SQL-99 (SQL3) extends SQL to operate on object-relational
databases.
Brain Storming
1. Can you provide an overview of what SQL is and the basic
commands for querying and manipulating data?

2. Explore the categories of SQL?

3. List the commands of DDL, DML, DCL and TCL.


Defining Generalization
Assignment-1 (Individual Ass 1: 6%)
1. Develop a schema using ODL for the following object types:
• Project, Document, Project-Leader and Research-Paper. Use
the following relations between the object types: a project
has a set of documents and a project leader; project leaders
publish research papers. Make appropriate assumptions about
the cardinalities of the relations. Use Document as a super-
type; Research-Paper is a subtype of Document.
Lab Assignment-1(Group ass1: 7%)
Chapter Two

QUERY PROCESSING & OPTIMIZATION


Query Processing and Optimization: Outline
▪ Query processing
▪ Operator Evaluation Strategies
▪ Selection
▪ Join
▪ Query Optimization
▪ Heuristic query optimization
▪ Cost-based query optimization
▪ Measures of Query Cost
▪ Query Tuning
Overview of Query Processing
❖ Query processing -the activities involved in parsing,

validating, optimizing, and executing a query.

❖ Aims

❖ To transform a query written in a high-level language,


typically SQL, into a correct and efficient execution strategy
expressed in a low-level language (implementing the relational
algebra), and

❖ To execute the strategy to retrieve the required data.


Query Processing
❖ Example – SELECT sName FROM Student;

Scanning – scans keywords, symbols, attributes, and table names;
– line-by-line / word-by-word checking of the query.

SELECT     sName      FROM       Student      ;
Keyword    Attribute  Keyword    TableName    Symbol

Parsing – checks the validity, syntax, and order/structure of a query.

SELECT Student FROM sName ;  → indicates that a parse error is
encountered (wrong order/structure).
SELECT eName FROM Student ;  → syntactically valid, but fails validation
if eName is not an attribute of Student.
Steps of Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation

Query: SELECT sName FROM Student ;

Scanner/parser → converts the query to an intermediate (relational
algebra/calculus) form: a query tree or query graph.
Query optimizer → chooses a query execution strategy, so that the
query will be executed in a shorter time.
Code generator → converts the chosen plan to machine code.
Runtime DB processor → executes the plan.
Steps of Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
§ The DBMS has algorithms to implement relational algebra expressions.
§ SQL is a high-level language: you specify what is wanted, not how it is
obtained.
Query optimization:
❖ The activity of choosing an efficient execution strategy for
processing a query.
❖ Task: find an efficient physical query plan (aka execution plan) for an
SQL query.
❖ Goal: minimize the evaluation time for the query, i.e., compute the
query result as fast as possible.
❖ Cost factors: disk accesses, read/write operations [I/O, page
transfers] (CPU time is typically ignored).
❖ Optimization: find the most efficient evaluation plan for a query, because
there can be more than one way.
Examples:
❖ Find all Managers who work at a London branch.

SELECT * FROM Staff s, Branch b WHERE s.branchNo = b.branchNo
AND (s.position = ‘Manager’ AND b.city = ‘London’);

Three equivalent relational algebra queries corresponding to
this SQL statement are:
(1) σ(position='Manager') ∧ (city='London') ∧ (Staff.branchNo=Branch.branchNo)(Staff × Branch)
(2) σ(position='Manager') ∧ (city='London')(Staff ⋈branchNo Branch)
(3) [σposition='Manager'(Staff)] ⋈branchNo [σcity='London'(Branch)]
Cost Comparison
❖ Assume 1,000 tuples in Staff, 50 tuples in Branch, 50 Managers, 5 London
branches, and that intermediate results are written to disk.
❖ Costs (in disk accesses) are:
(1) (1000 + 50) + 2*(1000 * 50) = 101,050
(2) 2*1000 + (1000 + 50) = 3,050
(3) 1000 + 2*50 + 5 + (50 + 5) = 1,160
❖ The third option significantly reduces the size of the relations being joined.
❖ Cartesian product and join operations are much more expensive than
selection.
We will see shortly that one of the fundamental strategies in query
processing is to perform the unary operations, Selection and Projection,
as early as possible, thereby reducing the operands of any subsequent
binary operations.
A quick check of this arithmetic is sketched below.
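The short script below reproduces the disk-access arithmetic; the
statistics (1,000 Staff tuples, 50 Branch tuples, 50 managers, 5 London
branches) are the assumed figures stated above:

  # The factor of 2 accounts for writing an intermediate result and reading it back.
  staff, branch, managers, london = 1000, 50, 50, 5

  cost1 = (staff + branch) + 2 * (staff * branch)              # Cartesian product first
  cost2 = 2 * staff + (staff + branch)                         # join, then select
  cost3 = staff + 2 * managers + london + (managers + london)  # select early, then join

  print(cost1, cost2, cost3)   # 101050 3050 1160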
Phases of query processing
▪ Query Decomposition
 Transform the high-level query into an RA query.
 Check that the query is syntactically and semantically correct.
▪ Typical stages are:
▪ analysis,
▪ normalization,
▪ semantic analysis,
▪ simplification,
▪ query restructuring.
▪ Analysis
▪ Analyze the query lexically and syntactically using compiler techniques.
▪ Verify that relations and attributes exist.
▪ Verify that operations are appropriate for each object type.
Analysis
▪ Finally, the query is transformed into a query tree constructed as follows:
 a leaf node for each base relation;
 a non-leaf node for each intermediate relation produced by an RA operation;
 the root of the tree represents the query result;
 the sequence is directed from leaves to root.
Normalization
 Query normalization converts the query predicate into a standardized
form so that it can be processed and optimized uniformly.
 The predicate can be converted into one of two forms:
 Conjunctive normal form:
(position = 'Manager' ∨ salary > 20000) ∧ (branchNo = 'B003')
 Disjunctive normal form:
(position = 'Manager' ∧ branchNo = 'B003') ∨ (salary > 20000 ∧ branchNo = 'B003')
Semantic Analysis
▪ Rejects normalized queries that are incorrectly formulated or
contradictory.
▪ Query is incorrectly formulated if components do not contribute to
generation of result.
▪ Query is contradictory if its predicate cannot be satisfied by any tuple.

▪ Algorithms to determine correctness exist only for queries that do not


contain disjunction and negation.
Semantic Analysis
 Tools to detect incorrect or contradictory queries:
➠ connection graph (query graph)
➠ join graph
Relation connection graph (example)
▪ The relation connection graph is not fully
connected, so the query is not correctly
formulated.
▪ The join condition
(v.propertyNo = p.propertyNo) has been omitted.
Example 2
SELECT Ename,Resp FROM Emp, Works, Project WHERE Emp.Eno
= Works.Eno AND Works.Pno = Project.Pno AND Pname =
‘CAD/CAM’ AND Dur > 36 AND Title = ‘Programmer’

If the query graph is connected, the query is semantically correct.


Simplification - removing unnecessary predicates that do not affect the
query outcome.
• Detects redundant qualifications,
• eliminates common sub-expressions,
• transforms the query into a semantically equivalent but
more easily and efficiently computed form.
➢ Applies well-known transformation rules of Boolean algebra.

Example
 SELECT TITLE FROM Emp E
 WHERE (NOT (TITLE = 'Programmer') AND TITLE = 'Programmer')
    OR (TITLE = 'Electrical Eng.' AND NOT (TITLE = 'Electrical Eng.'))
    OR ENAME = 'J.Doe';
is equivalent to
 SELECT TITLE FROM Emp E WHERE ENAME = 'J.Doe';


Restructuring - transforming the query into a formal relational
algebra expression suitable for optimization.
 Convert SQL to relational algebra.
 Make use of query trees.
Example: SELECT Ename FROM Emp,
Works, Project WHERE Emp.Eno =
Works.Eno AND Works.Pno = Project.Pno
AND Ename <> ‘J. Doe’ AND Pname =
‘CAD/CAM’ AND (Dur = 12 OR Dur = 24)
Query tree
 Query tree is a data structure that corresponds to a relational algebra
expression

 Input relations of the query as leaf nodes

 Relational algebra operations as internal nodes

 An execution of the query tree consists of executing internal node


operations
Query graph

 Query graph is a graph data structure that corresponds to a


relational calculus expression.

 It does not indicate an order on which operations to perform first.

 There is only a single graph corresponding to each query.


Transformation Rules for RA Operations
1. Cascade of Selection:
 Conjunctive Selection operations can cascade into individual Selection
operations (and vice versa).
2. Commutativity of Selection.
3. In a sequence of Projection operations, only the last in the sequence is
required:
∏Col_list1 (∏Col_list2 (… (∏Col_listN (T)) …)) = ∏Col_list1 (T)
∏Std_id, Std_name (∏Std_id, Std_name, Age, Address (∏Std_id, Std_name,
Age, Address, Class_id, Skills (Student))) = ∏Std_id, Std_name (Student)
Cont. …
4. Commutativity of Selection and Projection.
 If predicate p involves only attributes in the projection list, Selection and
Projection operations commute.
5. Commutativity of Theta join (and Cartesian product).
 The rule also applies to Equijoin and Natural join.
6. Commutativity of Selection and Theta join (or Cartesian product).
 If the selection predicate involves only attributes of one of the join
relations, the Selection and Join (or Cartesian product) operations
commute.
 If the selection predicate is a conjunctive predicate of the form (p ∧ q),
where p involves only attributes of R, and q only attributes of S,
Selection and Theta join also commute.
7. Commutativity of Projection and Theta join (or Cartesian product).
8. Commutativity of Union and Intersection (but not Set difference):
R ∪ S = S ∪ R
R ∩ S = S ∩ R
9. Commutativity of Selection and set operations (Union,
Intersection, and Set difference):
σp(R ∪ S) = σp(R) ∪ σp(S)
σp(R ∩ S) = σp(R) ∩ σp(S)
σp(R − S) = σp(R) − σp(S)
10. Commutativity of Projection and Union:
∏L(R ∪ S) = ∏L(R) ∪ ∏L(S)
11. Associativity of Union and Intersection (but not Set difference):
(R ∪ S) ∪ T = R ∪ (S ∪ T)
(R ∩ S) ∩ T = R ∩ (S ∩ T)
12. Associativity of Theta join (and Cartesian product).
▪ Cartesian product and Natural join are always associative.

2. Query Optimization
❖ Query optimization is the process of improving the performance
of database queries by minimizing the time and resources required
to execute them.

❖ Optimization – not necessarily “optimal”, but reasonably efficient

❖ Techniques:
 Heuristic rules
▪ Query tree (relational algebra) optimization

▪ Query graph optimization

 Cost-based (physical) optimization

▪ Cost estimation(Comparing costs of different plans)


a. Heuristic based Processing Strategies
► Perform Selection operations as early as possible.
► Keep predicates on same relation together.

► Combine Cartesian product with subsequent Selection whose predicate


represents join condition into a Join operation.
► Use associativity of binary operations to rearrange leaf nodes so leaf nodes
with most restrictive Selection operations executed first.
► Perform Projection as early as possible.
► Keep projection attributes on same relation together.
► Compute common expressions once.
► If common expression appears more than once, and result not too large,
store result and reuse it when required.
Examples
 What are the names of customers living on Elm Street who have
checked out “Terminator”?
 SQL query:
SELECT Name FROM Customer CU, CheckedOut CH, Film F WHERE Title =
’Terminator’ AND F.FilmId = CH.FilmID AND CU.CustomerID = CH.CustomerID
AND CU.Street = ‘Elm’
Apply Selections Early
Apply More Restrictive Selections Early
Form Joins
Apply Projections Early
Cost- Based Optimization
 Cost-based optimization is a technique used in query optimization that
involves analyzing the cost of different execution plans and selecting the most
efficient one.

 This is typically done by estimating the cost of each possible execution plan
based on factors such as the number of rows to be processed, the complexity
of the query, and the available resources.

 Cost can be CPU time, I/O time, communication time, main memory
usage, or a combination.

 The candidate query tree with the least total cost is selected for execution.
Measures of Query Cost
▪ There are many possible ways to estimate cost, e.g., based on

disk accesses, CPU time, or communication overhead.

▪ Disk access is the cost of block transfers from/to disks.

▪ Simplifying assumption: each block transfer has the same cost

▪ Cost of algorithm (e.g. join or selection) depends on database buffer size;

▪ More memory for DB buffer reduces disk accesses.

▪ Thus DB buffer size is a parameter for estimating cost.


▪ We refer to the cost estimate of algorithm S as cost(S).
▪ We do not consider cost of writing output to disk.
Selectivity and Cost Estimates in Query
Optimization
 Database Statistics
– For each base relation R
– nTuples(R) – the number of tuples (records) in relation R (its cardinality).

– bFactor(R) – the blocking factor of R (that is, the number of tuples of R that fit
into one block).

– nBlocks(R) – the number of blocks required to store R. If the tuples of R are


stored physically together, then:
– nBlocks(R) = [nTuples(R)/bFactor(R)]

– We use [x] to indicate that the result of the calculation is rounded to the
smallest integer that is greater than or equal to x.
For each attribute A of base relation R
 nDistinctA(R) – the number of distinct values that appear for
attribute A in relation R.
 minA(R), maxA(R) – the minimum and maximum possible
values for the attribute A in relation R.
 SCA(R) – the selection cardinality of attribute A in relation R.
 This is the average number of tuples that satisfy an equality
condition on attribute A.
 Assuming a uniform distribution of values, SCA(R) is calculated as:
SCA(R) = 1 if A is a key attribute, and
SCA(R) = nTuples(R) / nDistinctA(R) otherwise.
A small numeric sketch of these statistics follows.
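The figures below are assumed, chosen to match the worked selection
example later in this chapter:

  # Assumed statistics; math.ceil gives the [x] rounding defined above.
  import math

  n_tuples, b_factor, n_distinct_A = 10000, 20, 50

  n_blocks = math.ceil(n_tuples / b_factor)   # nBlocks(R) = [nTuples(R)/bFactor(R)] = 500
  sc_A = n_tuples / n_distinct_A              # SCA(R) for a non-key attribute = 200.0
  print(n_blocks, sc_A)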
Selection Operation
Cost of Operations
▪ Cost = I/O cost + CPU cost
▪ I/O cost: # pages (reads & writes) or # operations (multiple pages)

▪ CPU cost: # comparisons or # tuples processed

▪ I/O cost dominates (for large databases)

▪ Cost depends on
▪ Types of query conditions

▪ Availability of fast access paths

▪ DBMSs keep statistics for cost estimation


Simple Selection
Simple selection: σA op a(R)
A is a single attribute, a is a constant, op is one of =, ≠, <, ≤, >, ≥.
We do not discuss ≠ further because it requires a sequential scan of the
table.
How many tuples will be selected?
Selectivity factor (SFA op a(R)): the fraction of tuples of R satisfying
"A op a", with 0 ≤ SFA op a(R) ≤ 1.
Number of tuples selected: NS = nR × SFA op a(R)
Options for Simple Selection
Sequential (linear) scan
 General condition: cost = bR
 Equality on a key: average cost = bR / 2
Binary search (records stored in sorted order)
 Equality on a key: cost = log2(bR)
 Equality on a non-key (duplicates allowed):
cost = log2(bR) + NS/bfR − 1
= (sorted search time to find the first match) + (blocks of selected tuples) − 1
Example: Cost of Selection
• Relation: R(A, B, C)
• nR = 10000 tuples
• bfR = 20 tuples/page
• dist(A) = 50, dist(B) = 500
• B+ tree clustering index on A with order 25 (p = 25)
• B+ tree secondary index on B with order 25
• Query:
• SELECT * FROM R WHERE A = a1 AND B = b1
• Relational algebra: σA=a1 ∧ B=b1(R)

Example: Cost of Selection (cont.)
• Option 1: Sequential scan
• Has to go through the entire relation.
• Cost = bR = 10000/20 = 500
• Option 2: Binary search using A = a1
• The relation is sorted on A (because of the clustering index on A).
• NS = 10000/50 = 200, assuming an equal distribution of values.
• Cost = log2(bR) + NS/bfR − 1
• = log2(500) + 200/20 − 1 = 18
A quick verification of this arithmetic is sketched below.
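The snippet below re-derives both option costs from the statistics given
in the example:

  # Verifying the selection-cost arithmetic above.
  import math

  n_r, bf_r, dist_a = 10000, 20, 50
  b_r = n_r // bf_r                  # 500 blocks in R
  ns = n_r // dist_a                 # 200 tuples expected for A = a1

  seq_scan_cost = b_r                                              # Option 1: 500
  binary_search_cost = math.ceil(math.log2(b_r)) + ns // bf_r - 1  # Option 2: 9 + 10 - 1
  print(seq_scan_cost, binary_search_cost)                         # 500 18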
Cost of Join

 Cost = # I/O reading R & S + # I/O writing result

 Additional notation:

 M: # buffer pages available to join operation

 LB: # leaf blocks in B+ tree index

 Limitation of cost estimation

 Ignoring CPU costs

 Ignoring timing

 Ignoring double buffering requirements


Estimate Size of Join Result

How many tuples are in the join result?
 Cross product (special case of join):
NJ = nR × nS
 R.A is a foreign key referencing S.B:
NJ = nR (assuming no null values)
 S.B is a foreign key referencing R.A:
NJ = nS (assuming no null values)
 Both R.A and S.B are non-key:
NJ = min( nR × nS / dist(R.A), nR × nS / dist(S.B) )
Estimate Size of Join Result (cont.)
How wide is a tuple in the join result?
 Natural join: W = W(R) + W(S) − W(S ∩ R)
 Theta join: W = W(R) + W(S)
What is the blocking factor of the join result?
 bfJoin = block size / W
How many blocks does the join result have?
 bJoin = NJ / bfJoin
These estimates are chained together in the sketch below.
Query Execution Plans
 A query execution plan is a blueprint that outlines the specific
steps the DBMS will take to retrieve data for a given SQL query.

 It essentially acts like a roadmap, detailing the most efficient way


to access and process data based on the structure of the database
and the query itself.

 Materialized evaluation - the result of an operation is stored as a


temporary relation.

 Pipelined evaluation - takes a more streamlined approach.


 The results of each operation are passed directly to the next operation in
the pipeline, without creating intermediate temporary files.
Query Tuning
 Query tuning is the process of optimizing SQL queries to improve
their performance and efficiency.

 The goal is to retrieve the desired data as quickly and with as few
resources as possible.

 Tasks include:
 Proper Table Indexing:

 Denormalization

 Avoiding Unnecessary Operations

 Optimize WHERE Clause Conditions


Wrap up
 The goal of query processing and optimization is to find the data you need
as quickly and efficiently as possible.

 The process of transforming a user's query into an actual result involves


several steps.
— Parsing and translation

— Optimization

— Evaluation

 Query optimization is the process of fine-tuning SQL queries to retrieve the


desired data as quickly as possible with minimal resource consumption.
— Heuristic rule

— Physical cost estimations


Assignment -2 (Individual Ass2: 7%)

1. Using the heuristic algorithm, optimize the following SQL query.

• SQL query: SELECT LNAME FROM EMPLOYEE, WORKS_ON,
PROJECT WHERE PNAME = ‘AQUARIUS’ AND
PNUMBER = PNO AND ESSN = SSN
AND BDATE > ‘1957-12-31’;
2. Work individually on the following cases.
Advanced Database System(CoSc2042)
Chapter – 3

TRANSACTION PROCESSING
Chapter Outline
01: Introduction to Transaction Processing

02: Transaction and System Concepts

03: Desirable Properties of Transactions


04: Characterizing Schedules based on Recoverability & Serializability

05: Transaction Support in SQL


Definition of transactions
— A transaction is a unit of program execution that accesses and
possibly modifies various data objects (tuples, relations).
— Transactions are units or sequences of work accomplished in a
logical order, whether manually by a user or automatically
by some sort of database program.
— A transaction (set of operations) may be specified in SQL, or
may be embedded within a program.

Transaction and System Concepts
 Single-User System: at most one user at a time can use the system.
 Multiuser System: many users can access the system concurrently.
 Concurrency: allowing more than one transaction to run
simultaneously on the same database.
 Interleaved processing: the concurrent execution of processes is
interleaved on a single CPU.
 Parallel processing: processes are concurrently executed on
multiple CPUs.
(Figure: interleaved processing versus parallel processing of concurrent
transactions.)
Transaction boundaries:
▪ Begin and End transaction.
▪ An application program may contain several transactions separated by
Begin and End transaction boundaries.
▪ Suppose a bank employee transfers $500 from A's account to B's account.
▪ This very simple and small transaction involves several low-level tasks.
Simple Model of a Database
▪ A database - a collection of named data items.
▪ Granularity of data - a field, a record, or a whole disk block.
▪ Basic operations are read and write:
▪ read(A, x): assign the value of database object A to variable x;
▪ write(x, A): write the value of variable x to database object A.
▪ Example: Let T1 be a transaction that transfers $500 from account A to account B.
This transaction can be defined as:

READ/WRITE OPERATIONS
T1: read(A); A := A − 500; write(A);
    read(B); B := B + 500; write(B)
PROPERTIES OF TRANSACTIONS
 The properties of a transaction are generally called the ACID
properties:
 Atomicity
 Consistency preservation
 Isolation
 Durability (permanency)
Atomic transactions
• Atomicity: a transaction is an atomic unit of processing; it is either
performed in its entirety or not performed at all.
• Example: John wants to move $200 from his savings account to his
checking account.
1) Money must be subtracted from the savings account.
2) Money must be added to the checking account.
• If both happen, John and the bank are both happy.
• If neither happens, John and the bank are both happy.
• If only one happens, either John or the bank will be unhappy.
• John’s transfer must be all or nothing.
A minimal sketch of this all-or-nothing behavior follows.
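The sketch below uses Python's sqlite3 module; the schema and account
ids are illustrative assumptions:

  import sqlite3

  con = sqlite3.connect(":memory:")
  con.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
  con.execute("INSERT INTO account VALUES ('savings', 500), ('checking', 100)")
  con.commit()

  try:
      con.execute("UPDATE account SET balance = balance - 200 WHERE id = 'savings'")
      con.execute("UPDATE account SET balance = balance + 200 WHERE id = 'checking'")
      con.commit()        # both updates become permanent together
  except Exception:
      con.rollback()      # on any failure, neither update survives

  print(con.execute("SELECT id, balance FROM account").fetchall())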
Consistency
• A correct execution of the transaction must take the database from one
consistent state to another.
• Example: Wilma tries to withdraw $1000 from account 387.
Transactions are consistent
▪ A transaction must leave the database in valid state.

▪ valid state == no constraint violations

▪ Constraint is a declared rule defining /specifying database states

▪ Constraints may be violated temporarily …

but must be corrected before the transaction completes.


Isolation
• A transaction should appear to execute in isolation from other
transactions: its intermediate (uncommitted) results must not be
visible to concurrently executing transactions.
Example:
Durability
• Once a transaction changes the database and the changes are committed,
these changes must never be lost because of subsequent failure.
Concurrency Control
Isolation (+ Consistency) => Concurrency Control

▪ Concurrency means allowing more than one transaction to run simultaneously


on the same database.

▪ When several transactions run concurrently database consistency can be


destroyed.

▪ It is meant to coordinate simultaneous transactions while preserving data


integrity.

▪ It controls multi-user access to the database.


WHY CONCURRENCY CONTROL IS NEEDED?
▪ Several problems can occur when transactions run in an uncontrolled manner.

The Lost Update Problem
▪ This occurs when two transactions that access the same database items have
their operations interleaved in a way that makes the value of some database
item incorrect.
▪ The update performed by T1 gets lost.
▪ Possible solution: T1 locks/unlocks database object A,
so that T2 cannot read A while A is
being modified by T1.
Example
▪ The Lost Update Problem

T1                  T2                  State of X
read_item(X);                           20
X := X + 10;        read_item(X);       20
                    X := X + 20;
                    write_item(X);      40
                    commit;
write_item(X);                          30   ← lost update
commit;

✓ The changes of T2 are lost.

▪ The Temporary Update (or Dirty Read) Problem
▪ This occurs when one transaction updates a database item
and then the transaction fails for some reason, while the
updated item is accessed by another transaction before it is
changed back to its original value.
▪ T1 modifies a db object, and then T1 fails for some reason.
Meanwhile the modified db object
has been accessed by another
transaction T2. Thus T2 has read data
that "never existed".
▪ Example
▪ The Temporary Update / Dirty Read Problem

T1                  T2                   State of X   sum
read_item(X);                            20           0
X := X + 10;
write_item(X);                           30   ← dirty update
                    read_item(X);        30
                    sum := sum + X;
                    write_item(sum);                  30
X := X + 10;        commit;
write_item(X);                           40
rollback;

✓ T2 sees dirty data of T1.


▪ The Incorrect Summary Problem
▪ If one transaction is calculating an aggregate summary function over a
number of records while other transactions are updating some of those
records, the aggregate function may calculate some values before they are
updated and others after they are updated.
▪ In such a schedule, the total computed by the summary transaction is wrong.
⇒ The transactions must lock/unlock several db objects.
▪ Example
▪ The Incorrect Summary Problem

Let A = 100
T1 (updates)        T2 (summary)        State of X   State of Y   sum = 0
                    read_item(A);                                 0
                    sum := sum + A;                               100
read_item(X);                           30
X := X - 10;
write_item(X);                          20
commit;             read_item(X);
                    sum := sum + X;
                    read_item(Y);                    10
                    sum := sum + Y;
read_item(Y);
Y := Y + 10;
write_item(Y);                                       20
commit;
→ Incorrect summary

✓ T2 reads X after 10 is subtracted but reads Y before 10 is added, hence the
incorrect summary.
▪ Unrepeatable Read Problem
▪ Here a transaction T1 reads the same item twice, and the item is
changed by another transaction T2 between the reads; T1 receives
different values for its two reads of the same item.
Q. Consider the schedule given below, in which, transaction T1 transfers
money from account A to account B and in the meantime, transaction T2
calculates the sum of 3 accounts namely, A, B, and C. The third column shows
the account balances and calculated values after every instruction is executed.

Discuss what problem is found in the schedule and what the correct
values of accounts A, B & C should be.
➢ WHY RECOVERY IS NEEDED: (WHAT CAUSES A
TRANSACTION TO FAIL?)
➢ A computer failure (system crash)

➢ A transaction or system error

➢ Local errors or exception conditions detected by the


transaction:

➢ Concurrency control enforcement

➢ Disk failure

➢ Physical problems and catastrophes


Operations
▪ Recovery manager keeps track of the following operations:
▪ begin_transaction
▪ read or write
▪ end_transaction
▪ commit_transaction
▪ rollback (or abort)
▪ Recovery techniques use the following operators:
▪ Undo
▪ Redo
THE SYSTEM LOG
✓ Log or Journal: the log keeps track of all transaction
operations that affect the values of database items.
✓ It is needed to permit recovery from transaction failures.
✓ The log is kept on disk, so it is not affected by any type of
failure except disk or catastrophic failure.
✓ The log is periodically backed up to archival storage (tape) to
guard against such catastrophic failures.
✓ Each transaction has a unique transaction-id, generated
automatically by the system.
Types of log record
– [start_transaction, T] - Indicates that transaction T has started execution.
– [write_item, T, X, old_value, new_value] - Indicates that transaction T
has changed the value of database item X from old_value to new_value.

– [read_item, T, X] - Indicates that transaction T has read the value of


database item X.

– [commit, T] - Indicates that transaction T has completed successfully, and


affirms that its effect can be committed (recorded permanently) to the database.

– [abort, T] - Indicates that transaction T has been aborted.


Commit Point of a Transaction
▪ Definition: It refers to the completion of the transaction.

▪ Transaction T reaches its commit point when,

▪ all its operations accessing DB are executed successfully and


changes are recorded in the log.

▪ Beyond the commit point, the transaction is said to be


committed, and its effect is assumed to be permanently
recorded in the database.

▪ The transaction then writes an entry [commit, T] into the log.


Cont..
✓ Undo (roll back) of transactions:
▪ needed for transactions that have a [start_transaction, T] entry in the
log but no commit entry [commit, T] in the log.
✓ Redoing transactions:
✓ transactions that have a commit entry in the log
✓ have their write entries redone from the log.
✓ Force-writing a log:
✓ before a transaction reaches its commit point,
✓ the log is written to disk.
✓ This process is called force-writing the log file before committing a
transaction.
A toy sketch of log-based undo/redo follows.
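A toy sketch of log-based recovery using the record types and the
undo/redo rules above; the log contents and values are illustrative:

  # Each record follows the [type, T, X, old_value, new_value] format above.
  log = [
      ["start_transaction", "T1"],
      ["write_item", "T1", "X", 20, 30],
      ["start_transaction", "T2"],
      ["write_item", "T2", "Y", 10, 15],
      ["commit", "T1"],
  ]                                        # crash happens before T2 commits

  db = {"X": 30, "Y": 15}                  # state on disk after the crash
  committed = {rec[1] for rec in log if rec[0] == "commit"}

  for rec in log:                          # REDO: replay committed writes forward
      if rec[0] == "write_item" and rec[1] in committed:
          db[rec[2]] = rec[4]              # set to new_value
  for rec in reversed(log):                # UNDO: roll back uncommitted writes
      if rec[0] == "write_item" and rec[1] not in committed:
          db[rec[2]] = rec[3]              # restore old_value
  print(db)                                # {'X': 30, 'Y': 10}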
SCHEDULES
➢ When transactions are executing concurrently in an interleaved
fashion, the order of execution of operations from various
transactions, is known as a transaction schedule (or history).

• Transaction Schedule reflects


chronological order of operations
Characterizing schedules based on
• Recoverability - how good is the system at recovering from errors?
• Serializability - how readily can the system find schedules that allow
transactions to execute concurrently without interfering with one another?
Schedules classified on recoverability
▪ Recoverable schedule - one in which no committed transaction ever needs
to be rolled back.

Strict and cascadeless schedules
• Strict schedules are stricter than cascadeless schedules:
• all strict schedules are cascadeless schedules,
• but not all cascadeless schedules are strict schedules.
Characterizing schedules based on Serializability
▪ Serial schedule
• Transactions are ordered one after the other. Otherwise, the schedule is
called nonserial schedule.

▪ Serializable schedule
• A schedule is equivalent to some serial schedule of the same n transactions.

▪ Result equivalent
• Two schedules are producing the same final state of the database.

▪ Conflict equivalent
• The order of any two conflicting operations is the same in both schedules.
Figure 3.2 Examples of serial and nonserial schedules involving transactions
T1 and T2. (a) Serial schedule A: T1 followed by T2. (b) Serial schedule B: T2
followed by T1. (c) Two nonserial schedules C and D with interleaving of
operations.
Schedule Notation
• A more compact notation for schedules:

T3: begin; read(Y); Y = Y + 1; write(Y); end; commit
→ written compactly as: b3, r3(Y), w3(Y), e3, c3
(in r3(Y): r is the operation, 3 is the transaction, and Y is the data item)

note: we ignore the computations on the local copies of the data when
considering schedules (they're not interesting)
Examples
A serial schedule is one in which the transactions do not overlap (in
time).

These are all serial schedules for the three example transactions:

b1,r1(X),w1(X),r1(Y),w1(Y),e1,c1, b2,r2(X),w2(X),e2,c2, b3,r3(Y),w3(Y),e3,c3

b2,r2(X),w2(X),e2,c2, b1,r1(X),w1(X),r1(Y),w1(Y),e1,c1, b3,r3(Y),w3(Y),e3,c3

b2,r2(X),w2(X),e2,c2, b3,r3(Y),w3(Y),e3,c3, b1,r1(X),w1(X),r1(Y),w1(Y),e1,c1

There are six possible serial schedules for three transactions; in general
there are n! possible serial schedules for n transactions.
• Types of Serializability

– Conflict Serializability

– View Serializability:

▪ Conflict serializable:

▪ A schedule S is said to be conflict serializable if it is conflict equivalent

to some serial schedule S’.


• Being serializable is not the same as being serial
• Being serializable implies that the schedule is a correct schedule.
• It will leave the database in a consistent state.
• Serializability is hard to check.

• View serializability: definition of serializability based on view equivalence.

– A schedule is view serializable if it is view equivalent to a serial schedule.


• Interleaving of operations occurs in an operating system through some
scheduler.
• It is difficult to determine beforehand how the operations in a schedule will
be interleaved.

Fig 3.3. Conflicts between operations of two transactions:


Conflict Equivalence

• Two schedules are conflict equivalent if the order of any two conflicting
operations is the same in both schedules.
• Two operations conflict if:
– they access the same data item,
– they belong to different transactions, and
– at least one of them is a write.

T1: b1,r1(X),w1(X),r1(Y),w1(Y),e1,c1
T2: b2,r2(X),w2(X),e2,c2
conflicting operations:
r1(X), w2(X)
w1(X), r2(X)
w1(X), w2(X)
• Two operations are conflicting if changing their order can result in a
different outcome.
Example: Conflict Equivalence
schedule 1:
b1,r1(X),w1(X),r1(Y),w1(Y),e1,c1, b2,r2(X),w2(X),e2,c2
→ r1(X) < w2(X), w1(X) < r2(X), w1(X) < w2(X)

schedule 2:
b2,r2(X),w2(X), b1,r1(X),w1(X),r1(Y),w1(Y),e1,c1, e2,c2
→ w2(X) < r1(X), r2(X) < w1(X), w2(X) < w1(X)

schedule 3:
b1,r1(X),w1(X), b2,r2(X),w2(X),e2,c2, r1(Y),w1(Y),e1,c1
→ r1(X) < w2(X), w1(X) < r2(X), w1(X) < w2(X)

Schedule 1 and schedule 3 are conflict equivalent; schedule 2 is not
conflict equivalent to either schedule 1 or 3.
Testing for Conflict Serializability
• Precedence graphs are a more efficient test
– graph indicates a partial order on the transactions required
by the order of the conflicting operations.
– the partial order must hold in any conflict equivalent serial
schedule
– if there is a loop in the graph, the partial order is not
possible in any serial schedule
– if the graph has no loops, the schedule is conflict serializable
Precedence Graph Examples: find the conflicting
operations between the transactions
schedule 3:
b1,r1(X),w1(X), b2,r2(X),w2(X),e2,c2, r1(Y),w1(Y),e1,c1
Conflicting operations:
r1(X) < w2(X), w1(X) < r2(X), w1(X) < w2(X)

Each conflict gives an edge T1 → T2 (the arrow indicates
that T1 precedes T2).

Schedule 3 is conflict serializable:
it is conflict equivalent to some serial schedule
in which T1 precedes T2.
Precedence Graph Examples
schedule 2:
b2,r2(X),w2(X), b1,r1(X),w1(X),r1(Y),w1(Y),e1,c1, e2,c2
Conflicting operations:
w2(X) < r1(X), r2(X) < w1(X), w2(X) < w1(X)

Each conflict gives an edge T2 → T1.

Schedule 2 is conflict serializable:
it is conflict equivalent to some serial schedule
in which T2 precedes T1.
Precedence Graph Examples
schedule 4:
b2,r2(X), b1,r1(X),w1(X),r1(Y),w1(Y),w2(X),e1,c1,e2,c2
Conflicting operations:
r1(X) < w2(X), r2(X) < w1(X), w1(X) < w2(X)

r1(X) < w2(X) and w1(X) < w2(X) give an edge T1 → T2, while
r2(X) < w1(X) gives an edge T2 → T1, so the graph contains a cycle.

Schedule 4 is not conflict serializable:
there is no serial schedule
in which T2 precedes T1 and T1 precedes T2.
A sketch of this precedence-graph test follows.
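A small sketch of the precedence-graph test: build edges from
conflicting operation pairs, then look for a cycle. The input format
(a list of (transaction, operation, item) triples) is illustrative:

  from itertools import combinations

  def precedence_edges(schedule):
      """schedule: list of (txn, op, item) triples in execution order."""
      edges = set()
      for (t1, op1, x1), (t2, op2, x2) in combinations(schedule, 2):
          if t1 != t2 and x1 == x2 and "w" in (op1, op2):
              edges.add((t1, t2))       # earlier op conflicts with a later one
      return edges

  def has_cycle(edges):
      graph = {}
      for a, b in edges:
          graph.setdefault(a, set()).add(b)
      def visit(node, stack):
          if node in stack:
              return True
          return any(visit(n, stack | {node}) for n in graph.get(node, ()))
      return any(visit(n, set()) for n in graph)

  # schedule 4: r2(X) r1(X) w1(X) w2(X) - not conflict serializable
  s4 = [("T2", "r", "X"), ("T1", "r", "X"), ("T1", "w", "X"), ("T2", "w", "X")]
  edges = precedence_edges(s4)
  print(edges, "cycle:", has_cycle(edges))   # both edges present, cycle: True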
Transaction Support in SQL
• SQL commands used to control transactions:
– COMMIT: saves the changes.
– ROLLBACK: rolls back the changes.
– SAVEPOINT: creates points within groups of
transactions to which to ROLLBACK.
– SET TRANSACTION: places a name on a
transaction.
A minimal sketch of these commands follows.
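The sketch below issues COMMIT, ROLLBACK TO, and SAVEPOINT through
Python's sqlite3 module (isolation_level=None gives explicit transaction
control; the table is illustrative):

  import sqlite3

  con = sqlite3.connect(":memory:", isolation_level=None)
  con.execute("CREATE TABLE t (v INTEGER)")

  con.execute("BEGIN")
  con.execute("INSERT INTO t VALUES (1)")
  con.execute("SAVEPOINT sp1")            # SAVEPOINT: a point to ROLLBACK to
  con.execute("INSERT INTO t VALUES (2)")
  con.execute("ROLLBACK TO sp1")          # undoes only the work after sp1
  con.execute("COMMIT")                   # makes the remaining changes permanent

  print(con.execute("SELECT v FROM t").fetchall())   # [(1,)]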
Assignment-3 (Individual Ass3: 5%)
Q1. Using the precedence graph as a method of checking Serializability
based on this find the following questions

S: r1(x) r2(z) r3(x) r1(z) r2(y) r3(y) w1(x) w2(z) w3(y) w2(y)
e1,c1,e2,c2,e3,c3

A. Find the Ordering of conflicting operations?

B. Is this schedule serializable?

C. Is the schedule correct?


Advanced Database system(CoSc2042)

Chapter -4
Concurrency Control
What is Concurrency Control?

▪ Concurrency control — ensuring that each user


appears to execute in isolation.

▪ It is the procedure in DBMS for managing simultaneous


operations without conflicting with each other.

▪ Why?
 Lost Updates
 Temporary update (dirty read)
 Non-Repeatable Read
 Incorrect Summary issue
Purpose of Concurrency Control
- To force isolation (through mutual exclusion) among
conflicting Transactions

- To preserve database consistency through consistency


preserving execution of transactions

- To resolve read-write and write-write conflicts


Concurrency Control Techniques
 Various concurrency control techniques are:
• Two-Phase Locking Protocols

• Timestamp-Based Protocols

• Validation-Based Protocols

• Multi version concurrency control


Locking Techniques
– Locking is an operation which secures: permission to read, OR
permission to write a data item.

– Two phase locking is a process used to gain ownership of shared


resources without creating the possibility of deadlock.

– The 3 activities taking place in the two phase update algorithm are:
(i). Lock Acquisition

(ii). Modification of Data

(iii). Release Lock


Rules in locking technique:
▪ LOCK(): must be issued by a transaction before any
read() or write() operation on the data item.
▪ LOCK(): cannot be issued by a transaction if it already holds a LOCK()
on the data item.
▪ UNLOCK(): must be issued after all read() and write()
operations on the data item are completed in a transaction.
▪ UNLOCK(): cannot be issued by a transaction unless it already holds
the lock on the data item.
Example for binary lock
T1: LOCK(A)
T1: READ(A)       A=100
T1: A := A + 200  A=300
T1: WRITE(A)
T1: UNLOCK(A)
T2: LOCK(A)
T2: READ(A)       A=300
T2: A := A + 300  A=600
T2: WRITE(A)      A=600
T2: UNLOCK(A)

▪ This example is a serializable schedule.
▪ With a binary locking mechanism, at most one
transaction can hold the lock on
a particular data item.
▪ Thus no two transactions can access
the same item concurrently.
Types of lock modes
1) Shared (S) mode
 If a transaction T1 has obtained a shared-mode lock on item
X, then other transactions can read but cannot write data item X.
 It is also known as a read-only lock.
2) Exclusive (X) mode
 If a transaction holds an exclusive lock on a data item, then no
other transaction can access that item, even to read, until the
lock is released by the transaction.
 It is also called a write lock.
Lock-compatibility matrix
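The standard matrix (a requested lock is granted only if it is compatible
with all locks currently held on the item by other transactions):

             S               X
   S    compatible      not compatible
   X    not compatible  not compatible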
Example of a transaction performing locking:
T1 T2
LOCKX(A) LOCKX(SUM)
READ(A) A=1000 SUM:=0
A:=A-200 A=800 LOCKS(A)
WRITE(A) A=800 READ(A) A=800
UNLOCK(A) SUM:=SUM+A SUM=800
LOCKX(B) UNLOCK(A)
READ(B) B=900 LOCKS(B)
B:=B+200 B=1100 READ(B) B=1100
WRITE(B) B=1100 SUM:=SUM+B SUM=1900
UNLOCK(B) WRITE(SUM) SUM=1900
UNLOCK(B)
UNLOCK(SUM)
If executed serially, the output will be 1900.
Consider another example
▪ T1:
  LOCKX(B)
  READ(B)
  B := B - 50
  WRITE(B)
  UNLOCK(B)
  LOCKX(A)
  READ(A)
  A := A + 50
  WRITE(A)
  UNLOCK(A)

▪ T2:
  LOCKS(A)
  READ(A)      A=150
  UNLOCK(A)
  LOCKS(B)
  READ(B)      B=150
  UNLOCK(B)
  DISPLAY(A+B)

▪ It is clear that if they run sequentially the output will be 300.
▪ Phase 1: Growing Phase
 Transaction may obtain locks
 Transaction may not release locks

▪ Phase 2: Shrinking Phase


 Transaction may release locks
 Transaction may not obtain locks
Example:
T1 (not two-phase):        T1 (two-phase):
LOCKX(B)                   LOCKX(B)
READ(B)                    READ(B)
B := B - 50                B := B - 50
WRITE(B)                   WRITE(B)
UNLOCK(B)                  LOCKX(A)
LOCKX(A)                   READ(A)
READ(A)                    A := A + 50
A := A + 50                WRITE(A)
WRITE(A)                   UNLOCK(B)
UNLOCK(A)                  UNLOCK(A)

The left version is not two-phase because UNLOCK(B) appears before
LOCKX(A); the right version is two-phase because all unlocks appear
after all lock operations.
A small checker for the two-phase rule is sketched below.
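A small sketch checking the two-phase rule over a transaction's
operation list (the operation strings are illustrative):

  def is_two_phase(ops):
      unlocked = False
      for op in ops:
          if op.startswith("UNLOCK"):
              unlocked = True                # shrinking phase has begun
          elif op.startswith("LOCK") and unlocked:
              return False                   # a lock after an unlock violates 2PL
      return True

  t_bad  = ["LOCKX(B)", "UNLOCK(B)", "LOCKX(A)", "UNLOCK(A)"]
  t_good = ["LOCKX(B)", "LOCKX(A)", "UNLOCK(B)", "UNLOCK(A)"]
  print(is_two_phase(t_bad), is_two_phase(t_good))   # False True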
▪ Two-phase locking does not ensure freedom from deadlocks

▪ To avoid this, follow a modified protocol called strict two-phase


locking.

▪ Strict Two-Phase Locking:


▪ A transaction must hold all its exclusive locks till it commits/ aborts.

▪ Ensures that any data written by uncommitted transaction are locked in


exclusive mode until the transaction commits.
Conversion of locks
▪ Conversion from shared to exclusive modes is denoted by upgrade

▪ Conversion from exclusive to shared mode by downgrade.

▪ Lock conversion is not allowed to occur arbitrarily.


▪ Upgrading takes place only in growing phase whereas,

▪ Downgrading takes place only in shrinking phase


Implementation of Locking
▪ A Lock manager can be implemented as a separate process to
which transactions send lock and unlock requests.

▪ The lock manager replies to a lock request by sending a lock


grant messages (or a message asking the transaction to roll
back, in case of a deadlock).

▪ The requesting transaction waits until its request is answered

▪ The lock manager maintains a data structure called a lock table


to record granted locks and pending requests.
Lock Table
▪ A new request is added to the end of
the queue of requests for the data
item, and granted if it is compatible
with all earlier locks.
▪ Unlock requests result in the request
being deleted.
▪ If a transaction aborts, all waiting or
granted requests of the transaction
are deleted.
▪ The lock manager may keep a list of
locks held by each transaction, to
implement this efficiently.
▪ (In the figure) black rectangles indicate granted
locks, white ones indicate waiting requests;
the lock table also records the type of
lock granted or requested.
Pitfalls of Lock-Based Protocols
1. Deadlocks
2. Starvation
 Deadlock occurs when two or more transactions are each waiting on a
condition that cannot be satisfied.
 Deadlock can arise if the following four conditions hold simultaneously in a system:
 Mutual exclusion: at least one resource is held in a non-sharable mode
(an exclusive lock request).
 Hold and wait: there is a transaction which has acquired and holds a lock
on a data item, and waits for other data items.
 No preemption: a transaction releases the locks on data
items which it holds only after the successful completion of the
transaction.
 Circular wait: a situation where a transaction T1 is waiting for another
transaction T2 to release a lock on some data item, in turn T2 is waiting for
another transaction T3 to release a lock, and so on.
Starvation
 Starvation is the situation when a transaction has to wait for an
indefinite period of time to acquire a lock.
 Reasons of Starvation –
 If waiting scheme for locked items is unfair. ( priority queue )
 Victim selection. ( same transaction is selected as a victim repeatedly)
 Resource leak.
 Via denial-of-service attack.
 What are the solutions to starvation –
 Increasing Priority
 Modification in Victim Selection algorithm
 First Come First Serve approach
 Wait die and wound wait scheme
Example
Consider the partial schedule
▪ Neither T3 nor T4 can make progress —
executing lock-S(B) causes T4 to wait for T3
to release its lock on B, while executing
lock-X(A) causes T3 to wait for T4 to release
its lock on A.

▪ Such a situation is called a deadlock.

▪ To handle a deadlock one of T3 or T4 must


be rolled back and its locks released.
Timestamp-based Protocols
 This protocol ensures that all conflicting read and write operations are
executed in timestamp order.
 The protocol uses the system time or a logical counter as the timestamp.
 The older transaction is always given priority in this method.
A sketch of the standard basic timestamp-ordering rules follows.
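A minimal sketch of the basic timestamp-ordering rules (the rule set is
the standard textbook formulation; the data structures are illustrative):

  read_ts, write_ts = {}, {}        # per-item read/write timestamps

  def read_item(ts, x):
      # Reject a read of an item already written by a younger transaction.
      if ts < write_ts.get(x, 0):
          return "abort"
      read_ts[x] = max(read_ts.get(x, 0), ts)
      return "ok"

  def write_item(ts, x):
      # Reject a write if a younger transaction already read or wrote the item.
      if ts < read_ts.get(x, 0) or ts < write_ts.get(x, 0):
          return "abort"
      write_ts[x] = ts
      return "ok"

  print(read_item(5, "X"), write_item(3, "X"))   # ok abort (older write rejected)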


Timestamp-based Protocols
 Advantages:
 Schedules are serializable, just like with 2PL protocols.
 Transactions never wait for locks, which eliminates the possibility
of deadlocks!
 Disadvantages:
 Starvation is possible if the same transaction is restarted and
continually aborted.
A sketch of the basic timestamp-ordering checks follows.
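A minimal sketch of the basic timestamp-ordering checks, assuming each item carries its largest read and write timestamps (the state layout is illustrative, not from any DBMS):

def read(item, ts, state):
    # state[item] = {"rts": largest read timestamp, "wts": largest write timestamp}
    if ts < state[item]["wts"]:
        return "abort"      # the value was already overwritten by a younger txn
    state[item]["rts"] = max(state[item]["rts"], ts)
    return "ok"

def write(item, ts, state):
    if ts < state[item]["rts"] or ts < state[item]["wts"]:
        return "abort"      # a younger transaction already read or wrote the item
    state[item]["wts"] = ts
    return "ok"

state = {"A": {"rts": 0, "wts": 0}}
assert write("A", 10, state) == "ok" and read("A", 5, state) == "abort"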
Validation Based Protocol
 Validation Based Protocol is also called the Optimistic Concurrency
Control Technique.
 A validation based protocol executes transactions without locking,
based on the optimistic assumption that, in most cases, concurrently
running transactions do not interfere with each other; interference is
checked only at commit time.
Three phases of Validation based Protocol
‒ Read Phase ‒ Values of committed data items from the database can be read
by a transaction. Updates are only applied to local data versions.
‒ Validation Phase ‒ Checking is performed to make sure that there is no
violation of serializability when the transaction updates are applied to the
database.
‒ Write Phase ‒ On success of the validation phase, the transaction updates
are applied to the database; otherwise, the updates are discarded and the
transaction is rolled back and restarted.
A simplified validation check is sketched below.
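The validation phase can be illustrated with a simplified backward-validation check, in which a committing transaction is compared against the write sets of transactions that committed while it was running; the structure below is a sketch of the idea, not the full protocol:

from dataclasses import dataclass, field

@dataclass
class Txn:
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)

def validate(txn, overlapping_committed):
    for other in overlapping_committed:
        if txn.read_set & other.write_set:
            return False    # txn read an item another txn wrote: restart txn
    return True             # validation succeeded: proceed to the write phase

t = Txn(read_set={"A"}, write_set={"B"})
assert not validate(t, [Txn(write_set={"A"})])   # conflict on item A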
Insert and Delete Operations
▪ If two-phase locking is used:
▪ A delete operation may be performed only if the transaction deleting
the tuple has an exclusive lock on the tuple to be deleted.
▪ A transaction that inserts a new tuple into the database is given an
X-mode lock on the tuple.
▪ Insertions and deletions can lead to the phantom phenomenon:
the phantom problem occurs when a new record being inserted by some
transaction T satisfies a condition that a set of records accessed by
another transaction T' must satisfy.
DATABASE RECOVERY TECHNIQUES
Chapter – 5
Introduction
 Database recovery is the process of restoring the database to the most
recent consistent state that existed just before the failure.
 Three states of database recovery:
 Pre-condition: at any given point in time the database is in a
consistent state.
 Condition: some kind of system failure occurs.
 Post-condition: the database is restored to the consistent state that
existed before the failure.
Types of failures
1. Transaction failures:
◼ Erroneous parameter values
◼ Logical programming errors
◼ System errors like integer overflow, division by zero
◼ Local errors like “data not found”
◼ User interruption
◼ Concurrency control enforcement
2. Malicious transactions.
3. System crash:
◼ A hardware, software, or network error during transaction execution.
4. Disk crash (also called media failure).
Basic Properties of Every Recovery Algorithm:
 Before looking at less ideal but more effective strategies, it is useful to
identify some key points which must be kept in mind, regardless of
approach.
 Commit point:
Every transaction has a commit point. This is the point at which the
transaction is finished, and all of the database modifications are made a
permanent part of the database.
Recovery approaches
 Steal approach - a cache page updated by a transaction can be
written to disk before the transaction commits.
 No-steal approach - a cache page updated by a transaction cannot
be written to disk before the transaction commits.
 Force approach - when a transaction commits, all pages updated
by the transaction are immediately written to disk.
 No-force approach - when a transaction commits, pages updated
by the transaction are not immediately written to disk.
Basic Update Strategies
 Update strategies may be placed into two basic categories; most
practical strategies are a combination of these two:
 Deferred Update
 Immediate Update
Cont.
1. Deferred Update (No-Undo/Redo Algorithm)
 These techniques do not physically update the DB on disk until a
transaction reaches its commit point.
 In case of failure, these techniques need only to redo the committed
transactions; no undo is needed.
Cont.
While a transaction runs:
▪ Changes made by that transaction are not recorded in the database.
On a commit:
▪ The new data is recorded in a log file and flushed to disk.
▪ The new data is then recorded in the database itself.
▪ On an abort, do nothing (the database has not been changed).
▪ On a system restart after a failure, REDO the log (see the sketch below).
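A toy sketch of the NO-UNDO/REDO restart in Python, assuming a simple list-of-tuples log format that is illustrative, not any real DBMS's format: only the writes of committed transactions are replayed.

def restart(log, db):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, txn, item, new_value = rec
            db[item] = new_value          # REDO the committed write

log = [("write", "T1", "A", 5), ("commit", "T1"),
       ("write", "T2", "B", 7)]           # T2 never committed
db = {"A": 0, "B": 0}
restart(log, db)                          # db == {"A": 5, "B": 0}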
Cont.
2. Immediate Update (Undo/Redo Algorithm)
 The DB may be updated by some operations of a transaction before
the transaction reaches its commit point.
 The updates recorded in the log must contain both the old values
and the new values.
 These techniques need to undo the operations of the uncommitted
transactions and redo the operations of the committed transactions.
Cont.
While a transaction runs:
▪ Changes made by the transaction can be written to the database at any
time. Both the original and the new data being written must be stored
in the log before being stored on the disk.
On a commit:
▪ All the updates which have not yet been recorded on the disk are first
stored in the log file and then flushed to disk.
▪ The new data is then recorded in the database itself.
▪ On an abort, undo all the changes which that transaction has made to
the database disk, using the log entries.
▪ On a system restart after a failure, redo committed changes and undo
uncommitted changes from the log (sketched below).
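By contrast with the deferred sketch above, an UNDO/REDO restart must both redo committed work and undo uncommitted work; the sketch below assumes log records carrying both old and new values (the format is illustrative):

def restart(log, db):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:                                   # forward pass: REDO committed
        if rec[0] == "write" and rec[1] in committed:
            db[rec[2]] = rec[4]                       # new value
    for rec in reversed(log):                         # backward pass: UNDO uncommitted
        if rec[0] == "write" and rec[1] not in committed:
            db[rec[2]] = rec[3]                       # old value

log = [("write", "T1", "A", 0, 5), ("commit", "T1"),
       ("write", "T2", "B", 0, 7)]                    # T2 uncommitted
db = {"A": 0, "B": 7}                                 # B was flushed before the crash
restart(log, db)                                      # db == {"A": 5, "B": 0}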
Shadow Paging
 In this technique, the database is considered to be made up of
fixed-size disk blocks or pages for recovery purposes.
 Maintains two tables during the lifetime of a transaction:
the current page table and the shadow page table.
 The shadow page table is stored in nonvolatile storage, to recover the
state of the database prior to transaction execution.
 This is a technique for providing atomicity and durability.
When a transaction begins executing, the current page table is copied into
a shadow page table; updates then go to fresh pages referenced only by the
current page table.
Cont.
To recover from a failure, discard the current page table and its modified
pages; the shadow page table still describes the database state prior to
the transaction.
Advantages
• No-redo/no-undo.
Disadvantages
• Creating the shadow directory may take a long time.
• Updated database pages change locations.
• Garbage collection is needed.
A toy sketch of the copy-on-write idea follows.
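Here is a toy Python sketch of copy-on-write shadow paging; all names and the in-memory layout are illustrative:

pages = {0: "old-A", 1: "old-B"}          # page id -> contents on disk
shadow = {"A": 0, "B": 1}                 # shadow page table (on stable storage)
current = dict(shadow)                    # current page table, copied at txn start

def write(item, value):
    new_page = max(pages) + 1             # copy-on-write to an unused page
    pages[new_page] = value
    current[item] = new_page              # only the current table is changed

write("A", "new-A")
# Commit: atomically make `current` the new shadow table.
# Abort or crash: discard `current`; `shadow` still points at the old,
# consistent pages, so neither undo nor redo is needed.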
“ARIES” Recovery Algorithm
Recovery algorithms are techniques to ensure database consistency,
transaction atomicity and durability despite failures.
 Recovery algorithms have two parts:
1. Actions taken during normal transaction processing to ensure
enough information exists to recover from failures.
2. Actions taken after a failure to recover the database contents to
a state that ensures atomicity, consistency and durability.
Cont.
 ARIES (Algorithms for Recovery and Isolation Exploiting Semantics)
 The ARIES recovery algorithm consists of three steps:
• Analysis
• Redo
• Undo
Cont.
 Analysis - identify the dirty pages (updated pages) in the buffer and
the set of active transactions at the time of failure.
 Redo - re-apply updates from the log to the database. This is done
for the committed transactions.
 Undo - scan the log backward and undo the actions of the active
transactions in the reverse order.
A highly simplified sketch of the three passes follows.
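The following is a highly simplified Python sketch of the three passes; real ARIES tracks LSNs, a dirty page table, and compensation log records, none of which are modeled here:

# log record: ("write", txn, item, old_value, new_value) or ("commit", txn)
def aries_restart(log, db):
    # 1. Analysis: find losers (transactions that wrote but never committed)
    losers = ({r[1] for r in log if r[0] == "write"} -
              {r[1] for r in log if r[0] == "commit"})
    # 2. Redo: repeat history by re-applying EVERY write, committed or not
    for r in log:
        if r[0] == "write":
            db[r[2]] = r[4]               # new value
    # 3. Undo: roll back the losers in reverse log order
    for r in reversed(log):
        if r[0] == "write" and r[1] in losers:
            db[r[2]] = r[3]               # old value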
Recovery from disk crashes
 Recovery from disk crashes is much more difficult than recovery from
transaction failures or machine crashes.
 Loss from such crashes is much less common today than it was
previously, because of the wide use of redundancy in secondary storage
(RAID, Redundant Array of Independent Disks: a method of combining
several hard disk drives into one logical unit).
Typical methods are:
 The log for the database system is usually written on a separate
physical disk from the database.
or,
 Periodically, the database is also backed up to tape or other
archival storage.
Conclusion.
✓ Types of failures.
✓ Steal/no steal, Force/no force approaches.
✓ Deferred and immediate update strategies.
✓ Shadow paging technique.
✓ ARIES recovery algorithm.
✓ Recovery from disk crashes.
DATABASE SECURITY AND AUTHORIZATION
Chapter - 6 Introduction to Database Security Issues
Contents
 Security types
 Threats to databases
 Security mechanisms
❑ Here we discuss the techniques used for protecting the database
against persons who are not authorized to access either certain
parts of a database or the whole database.
Introduction to Database Security Issues
• Authentication means confirming your own identity;
− it is the process of verifying who you are.
− There are three common factors used for authentication:
− Something you know (such as a password)
− Something you have (such as a smart card)
− Something you are (such as a fingerprint or other biometric method)
• Authorization means granting access to the system.
• In simple terms, it is the process of verifying what you have access to.
Types of Security
• Legal and ethical issues - various legal and ethical issues regarding
the right to access certain information.
• Who has the right to read what information?
• Policy issues - at the governmental, institutional, or corporate level:
what kinds of information should not be made publicly available?
• Who should enforce security (government, corporations)?
• System-related issues - whether a security function should be handled
at the physical hardware, the operating system, or the DBMS level.
Threats to databases
o Confidentiality, integrity and availability, also known as the CIA triad,
is a model designed to guide policies for information security within
an organization.
• Loss of integrity: data is modified by users who are not supposed to
modify it.
E.g., students can change their grades.
• Loss of confidentiality (secrecy): data is seen by users who are not
supposed to see it.
E.g., a student can see other students’ grades.
• Loss of availability: data or a system is not available when needed by a user.
Con…
 Data integrity in the database is the correctness, consistency and
completeness of data.
 Data integrity is enforced using the following three integrity constraints:
− Entity Integrity - every table must have a primary key.
− Referential Integrity - ensures that only the required alterations,
additions, or removals happen, via rules embedded into the database’s
structure about how foreign keys are used.
− Domain Integrity - all columns in a relational database must be
declared upon a defined domain.
Continued..
To protect databases against these types of threats, four kinds of
countermeasures can be implemented:
• Access control,
• Inference control,
• Flow control and
• Encryption.
A DBMS typically includes a database security and authorization subsystem.
Continued..
Access control - handled by creating user accounts and passwords to
control login.
Controlling access to a statistical database - such databases are used to
provide statistical information based on various criteria.
The countermeasures to the statistical database security problem are
called inference control measures.
Flow control - prevents information from flowing to unauthorized users.
− Channels that are pathways for information to flow implicitly, in ways
that violate the security policy of an organization, are called covert
channels.
Continued..
 A final countermeasure is data encryption,
 used to protect sensitive data (such as credit card numbers)
transmitted through a communication network.
The data is encoded using some coding algorithm.
Authorized users are given the decoding or decrypting algorithms
(or keys) needed to decipher the data.
Database Security and the DBA
The database administrator (DBA) is the central authority for managing
a database system, responsible for the overall security of the database
system.
The DBA has a DBA account in the DBMS, called the system or superuser
account.
The following are the major responsibilities of a DBA:
Account creation
Privilege granting
Privilege revocation
Security level assignment
Access Protection, User Accounts, and Database Audits
To use a DB, a user needs an account.
The DBA will create a new account number and password.
The user must log in to the DBMS using the account number and password.
The database system keeps track of all operations on the database that are
applied by a certain user in each login session, in the system log.
If any tampering with the database is suspected, a database audit is
performed. This consists of reviewing the log to examine all accesses and
operations applied to the database during a certain time period.
A database log that is used mainly for security purposes is sometimes
called an audit trail.
Types of database security mechanisms:
• Two types of database security mechanisms:
▪ Discretionary security mechanisms
• The typical method of enforcing discretionary access control in a
database system is based on granting and revoking privileges.
▪ Mandatory security mechanisms
• Classify data and users into various security classes.
• Implement security policy.
Discretionary Access Control Based on Granting and Revoking Privileges
Types of Discretionary Privileges
• The account level:
• At this level, the DBA specifies the particular privileges that each
account holds independently of the relations in the database.
• The privileges at the account level apply to the capabilities provided to
the account itself and can include the following:
the CREATE SCHEMA, CREATE TABLE, or CREATE VIEW privilege;
the ALTER privilege;
the DROP privilege;
the MODIFY privilege;
the SELECT privilege.
Continued..
Relation level:
The relation (or table) level: at this level, the DBA can control the
privilege to access each individual relation or view in the database.
The granting and revoking of privileges generally follow an authorization
model for discretionary privileges known as the access matrix model,
where the rows of a matrix M represent subjects (users, accounts,
programs) and the columns represent objects (relations, records, columns,
views, operations). Each position M(i, j) in the matrix represents the types
of privileges (read, write, update) that subject i holds on object j (a
minimal sketch of such a matrix is given at the end of this subsection).
To control the granting and revoking of relation privileges, each relation R
in a database is assigned an owner account, typically the account that
created the relation.
The owner of a relation is given all privileges on that relation.
The owner account holder can pass privileges on any of the owned
relations to other users by granting privileges to their accounts.
In SQL the following types of privileges can be granted on each individual
relation R:
SELECT (retrieval or read) privilege on R: gives the account retrieval
privilege. In SQL this gives the account the privilege to use the SELECT
statement to retrieve tuples from R.
MODIFY privileges on R: gives the account the capability to modify tuples of R.
▪ In SQL this privilege is further divided into UPDATE, DELETE, and INSERT
privileges to apply the corresponding SQL command to R.
▪ In addition, both the INSERT and UPDATE privileges can specify that only
certain attributes can be updated by the account.
REFERENCES privilege on R: this gives the account the capability to
reference relation R when specifying integrity constraints.
The privilege can also be restricted to specific attributes of R.
Notice that to create a view, the account must have the SELECT privilege
on all relations involved in the view definition.
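As the minimal sketch promised above, the access matrix can be viewed as nested maps from subject and object to a privilege set; the account and relation names reuse the examples that follow, and the layout is purely illustrative:

M = {
    "A1": {"EMPLOYEE": {"select", "insert", "delete", "update", "references"}},
    "A2": {"EMPLOYEE": {"insert", "delete"}},
}

def allowed(subject, obj, privilege):
    # look up M(i, j) and test membership of the requested privilege
    return privilege in M.get(subject, {}).get(obj, set())

assert allowed("A2", "EMPLOYEE", "insert")
assert not allowed("A2", "EMPLOYEE", "select")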
Specifying Privileges Using Views
The mechanism of views is an important discretionary authorization
mechanism in its own right.
Example:
If the owner A of a relation R wants another account B to be able to
retrieve only some fields of R, then A can create a view V of R that
includes only those attributes and then grant SELECT on V to B.
The same applies to limiting B to retrieving only certain tuples of R:
a view V’ can be created by defining the view by means of a query that
selects only those tuples from R that A wants to allow B to access.
Revoking Privileges
• In some cases it is desirable to grant a privilege to a user temporarily.
• For example, the owner of a relation may want to grant the SELECT
privilege to a user for a specific task and then revoke that privilege
once the task is completed.
• Hence, a mechanism for revoking privileges is needed.
• In SQL, a REVOKE command is included for the purpose of canceling
privileges.
Propagation of Privileges using the GRANT OPTION
Whenever the owner A of a relation R grants a privilege on R to another
account B, the privilege can be given to B with or without the GRANT
OPTION.
If the GRANT OPTION is given, this means that B can also grant that
privilege on R to other accounts.
Suppose that B is given the GRANT OPTION by A and that B then grants
the privilege on R to a third account C, also with GRANT OPTION.
In this way, privileges on R can propagate to other accounts without the
knowledge of the owner of R.
If the owner account A now revokes the privilege granted to B, all the
privileges that B propagated based on that privilege should automatically
be revoked by the system.
Example(1)
• Suppose that the DBA creates four accounts A1, A2, A3, and A4 and wants
only A1 to be able to create base relations; then the DBA must issue the
following GRANT command in SQL:
GRANT CREATE TABLE TO A1;
• In SQL the same effect can be accomplished by having the DBA issue a
CREATE SCHEMA command as follows:
CREATE SCHEMA EXAMPLE AUTHORIZATION A1;
User account A1 can create tables under the schema called EXAMPLE.
• Suppose that A1 creates the two base relations EMPLOYEE and
DEPARTMENT; A1 is then the owner of these two relations and hence
holds all the relation privileges on each of them.
• Suppose that A1 wants to grant A2 the privilege to insert and delete tuples
in both of these relations, but A1 does not want A2 to be able to propagate
these privileges to additional accounts:
GRANT INSERT, DELETE ON EMPLOYEE, DEPARTMENT TO A2;
Example(2)
 Suppose that A1 wants to allow A3 to retrieve information from either of
the two tables and also to be able to propagate the SELECT privilege to
other accounts.
 A1 can issue the command:
GRANT SELECT ON EMPLOYEE, DEPARTMENT
TO A3 WITH GRANT OPTION;
 A3 can grant the SELECT privilege on the EMPLOYEE relation to A4 by
issuing:
GRANT SELECT ON EMPLOYEE TO A4;
 Notice that A4 cannot propagate the SELECT privilege, because the
GRANT OPTION was not given to A4.
Example(3)
 Suppose that A1 decides to revoke the SELECT privilege on the
EMPLOYEE relation from A3; A1 can issue:
REVOKE SELECT ON EMPLOYEE FROM A3;
 The DBMS must now automatically revoke the SELECT privilege on
EMPLOYEE from A4, too, because A3 granted that privilege to A4 and
A3 does not have the privilege any more.
Example(4)
 Suppose that A1 wants to give back to A3 a limited capability to SELECT
from the EMPLOYEE relation and wants to allow A3 to be able to
propagate the privilege.
 The limitation is to retrieve only the NAME, BDATE, and ADDRESS
attributes and only for the tuples with DNO=5.
 A1 then creates the view:
CREATE VIEW A3EMPLOYEE AS SELECT NAME, BDATE, ADDRESS FROM
EMPLOYEE WHERE DNO = 5;
 After the view is created, A1 can grant SELECT on the view A3EMPLOYEE
to A3 as follows:
GRANT SELECT ON A3EMPLOYEE TO A3 WITH GRANT OPTION;
Example(5)
 Finally, suppose that A1 wants to allow A4 to update only the SALARY
attribute of EMPLOYEE;
 A1 can issue:
GRANT UPDATE (SALARY) ON EMPLOYEE TO A4;
 The UPDATE or INSERT privilege can specify particular attributes that
may be updated or inserted in a relation.
 Other privileges (SELECT, DELETE) are not attribute specific.
Mandatory Access Control
 Based on system-wide policies that cannot be changed by individual users.
 Each DB object is assigned a security class.
− Bell-LaPadula Model
• Objects (e.g., tables, views, tuples)
• Subjects (e.g., users, user programs)
− Security classes:
− Top secret (TS), secret (S), confidential (C), unclassified (U): TS > S > C > U
• Each object and subject is assigned a class.
• Subject S can read object O only if class(S) >= class(O) (Simple
Security Property).
• Subject S can write object O only if class(S) <= class(O) (*-Property).
Both checks are sketched below.
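The two Bell-LaPadula checks fit in a few lines of Python; this sketch hard-codes the four classes listed above:

LEVEL = {"U": 0, "C": 1, "S": 2, "TS": 3}

def can_read(subject_class, object_class):      # simple security property
    return LEVEL[subject_class] >= LEVEL[object_class]

def can_write(subject_class, object_class):     # *-property (no write-down)
    return LEVEL[subject_class] <= LEVEL[object_class]

assert can_read("S", "C") and not can_read("C", "S")
assert can_write("C", "S") and not can_write("S", "C")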
Chapter - 7 Distributed Database System
In this chapter you will learn:
• The need for distributed databases.
• The differences between distributed database systems, distributed
processing, and parallel database systems.
• The advantages and disadvantages of distributed DBMSs.
• The functions that should be provided by a distributed DBMS.
• An architecture for a distributed DBMS.
• The main issues associated with distributed database design, namely
fragmentation, replication, and allocation.
Distributed Database Concepts
– Distributed database –
– a logically interrelated collection of shared data (and a description
of this data), physically distributed over a computer network.
– DDBMS –
– a software system that manages a distributed database while making
the distribution transparent to the user.
Characteristics of DDBMS:
– A collection of logically related shared data;
– The data is split into a number of fragments;
– Fragments may be replicated;
– The sites are linked by a communications network;
– The data at each site is under the control of a DBMS;
– The DBMS at each site can handle local applications, autonomously;
– Each DBMS participates in at least one global application.
Advantages DDS
1. Management of distributed data with different levels of transparency:
▪ Distribution transparency
– The physical placement of data (files, relations, etc.) is not known
to the user.
▪ Network transparency
– Users do not have to worry about operational details of the network.
▪ Location transparency
– Refers to the freedom of issuing a command from any location
without affecting its working.
Advantages DDS…
▪ Naming transparency
– Allows access to any named object (files, relations, etc.) from any
location.
▪ Replication transparency
− Allows copies of data to be stored at multiple sites.
− This is done to minimize access time to the required data.
▪ Fragmentation transparency
− Allows a relation to be segmented horizontally (creating a subset
of the tuples of a relation) or vertically (creating a subset of the
columns of a relation).
Advantages of DDS
2. Increased reliability and availability:
− Reliability refers to system uptime, that is, the system is running
efficiently most of the time.
− Availability is the probability that the system is continuously available
(usable or accessible) during a time interval.
− A distributed database system has multiple nodes (computers), and if
one fails then others are available to do the job.
3. Improved performance:
− A DDBMS fragments the database to keep data closer to where it is
needed most.
− This reduces data management (access and modification) time
significantly.
4. Scalability - easier expansion:
− Allows new nodes (computers) to be added anytime without changing
the entire configuration.
Disadvantages of DDS
– Complexity
– Cost
– Security
– Integrity control more difficult
– Lack of standards
– Lack of experience
– Database design more complex
Database system architectures
▪ A Database Architecture is a representation of DBMS design.
▪ It helps to design, develop, implement, and maintain the database
management system.
▪ There are three database system architectures:
1. Centralized Database Architecture
2. Parallel Database Architectures
3. Distributed Database Architecture
Centralized database
• A centralized database is basically a type of database that is
stored, located and maintained at a single location only.
• This type of database is modified and managed from that
location itself.
Parallel database architectures
▪ Parallel DBMSs link multiple, smaller machines to achieve the same
throughput as a single, larger machine, often with greater scalability
and reliability.
▪ The three main architectures for parallel DBMSs:
▪ Shared memory (tightly coupled)
▪ Shared disk (loosely coupled)
▪ Shared nothing (massively parallel processing (MPP))
The three main architectures for parallel DBMSs:
■ Shared memory - tightly coupled architecture in which multiple
processors share secondary (disk) storage and primary memory.
■ Shared disk - loosely coupled architecture in which multiple processors
share secondary (disk) storage but each has its own primary memory.
■ Shared nothing - massively parallel processing (MPP) architecture:
multiple processors, each part of a complete system with its own
memory and disk storage.
Distributed database
• A distributed database system allows applications to access data
from local and remote databases.
Types of Distributed database system
• There are two types of distributed database system:
• Homogeneous Distributed Database.
• Heterogeneous Distributed Database.
Homogeneous
• All sites of the database system have an identical setup, i.e., the same
database system software.
• The underlying operating systems can be a mixture of Linux, Windows,
Unix, etc.
• For example, all sites run Oracle or DB2, or Sybase or some other
database system.
(Figure: five sites, all running Oracle on a mix of Windows, Unix, and
Linux operating systems, connected by a communications network.)
Advantages
✓ Easy to use
✓ Easy to manage
✓ Easy to design
Disadvantages
✓ Difficult for most organizations to force a homogeneous environment
Homogeneous Distributed Database Systems
▪ Autonomy determines the extent to which individual nodes or DBs in
a connected DDB can operate independently.
• Design autonomy refers to independence of data model usage and
transaction management techniques among nodes.
• Communication autonomy determines the extent to which each node
can decide on sharing of information with other nodes.
• Execution autonomy refers to the independence of users to act as
they please.
▪ Non-autonomous − data is distributed across the homogeneous nodes
and a central or master DBMS coordinates data updates across the sites.
Heterogeneous
✓ Different data centers may run different DBMS products, with possibly
different underlying data models.
(Figure: sites running object-oriented, hierarchical, network, and
relational DBMSs on Unix, Windows, and Linux, connected by a
communications network.)
✓ Translations are required to allow for:
▪ Different hardware.
▪ Change of codes and word lengths.
▪ Different DBMS products.
▪ Mapping of data structures in one data model to the equivalent data
structures in another data model.
▪ Translating the query language used (for example, relational model SQL
SELECT statements are mapped to the network FIND and GET statements).
▪ Different hardware and different DBMS products: if both the hardware
and the software are different, then both these types of translation are
required, which makes the processing extremely complex.
Heterogeneous
⚫ Advantages
✓ Huge data can be stored in one global center from different data centers.
✓ Remote access is done using the global schema.
✓ Different DBMSs may be used at each node.
⚫ Disadvantages
✓ Difficult to manage.
✓ Difficult to design.
Multidatabase system (MDBS)
• Multidatabase system (MDBS) - a distributed DBMS in which each site
maintains complete autonomy.
• MDBSs logically integrate a number of independent DDBMSs while
allowing the local DBMSs to maintain complete control of their
operations.
• An MDBS allows users to access and share data without requiring full
database schema integration.
• Federated database system - a collection of cooperating database systems
that are autonomous and possibly heterogeneous, with:
❑ Differences in data models
❑ Differences in constraints
❑ Differences in query language
(Figure: distributed processing versus a distributed database.)
DDBMS Components
 DDBMS protocol
 Computer workstations
 Form the network system.
 Network hardware and software
 Components that reside in each workstation.
 Communications media
 Carry the data from one workstation to another.
 Transaction processor (TP)
 Receives and processes the application’s data requests.
 Data processor (DP)
 Stores and retrieves data located at the site.
 Also known as the data manager (DM).
DDBMS protocol
• The DDBMS protocol determines how the DDBMS will:
– Interface with the network to transport data and commands between
DPs and TPs.
– Synchronize all data received from DPs (TP side) and route retrieved
data to the appropriate TPs (DP side).
– Ensure common database functions in a distributed system: security,
concurrency control, backup, and recovery.
Distributed Database Design
• The design of a distributed database introduces three new issues:
– How to partition the database into fragments?
– Which fragments to replicate?
– Where to locate those fragments and replicas?
Data Fragmentation
▪ Data fragmentation allows us to break a single object
into two or more segments or fragments.
▪ There are three Types of Fragmentation Strategies:
▪ Horizontal Fragmentation

▪ Vertical Fragmentation

▪ Mixed Fragmentation
Horizontal Fragmentation
▪ Horizontal fragmentation - a fragment consists of a subset of the
tuples of a relation.
▪ A fragment represents the equivalent of a SELECT statement, with a
WHERE clause on a single attribute.
Vertical fragmentation
▪ A vertical fragment consists of a subset of the attributes of a relation.
▪ Equivalent to the PROJECT statement.
Mixed fragmentation
▪ A mixed fragment consists of a horizontal fragment that is subsequently
vertically fragmented, or a vertical fragment that is then horizontally
fragmented.
▪ A mixed fragment is defined using the Selection and Projection
operations of the relational algebra.
All three strategies are sketched below.
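A small Python sketch of the three strategies on rows represented as dictionaries; the relation and attribute names are made up for illustration:

employees = [
    {"name": "Ann", "dno": 5, "salary": 900},
    {"name": "Bob", "dno": 3, "salary": 700},
]

# Horizontal: a subset of tuples (a SELECT with a WHERE on one attribute)
h_frag = [r for r in employees if r["dno"] == 5]

# Vertical: a subset of attributes (a PROJECT); a key would be kept to rejoin
v_frag = [{"name": r["name"], "salary": r["salary"]} for r in employees]

# Mixed: a horizontal fragment that is then vertically fragmented
m_frag = [{"name": r["name"]} for r in employees if r["dno"] == 5]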
Data Replication
⚫ Data replication refers to the storage of data copies at multiple sites
served by a computer network.
– Replication enhances data availability and response time, reducing
communication and total query costs.
Data Replication
• Mutual Consistency Rule
• All copies of a data fragment must be identical.
• The DDBMS must ensure that a database update is performed at
all sites where replicas exist.
• Replication Conditions
• A fully replicated database stores multiple copies of all database
fragments at multiple sites.
• A partially replicated database stores multiple copies of some
database fragments at multiple sites.
• Factors for the Data Replication Decision
– Database size
– Usage frequency
Data Allocation
⚫ Data allocation describes the process of deciding where to locate data.
⚫ Data Allocation Strategies
– Centralized
The entire database is stored at one site.
– Partitioned
The database is divided into several disjoint parts (fragments) and
stored at several sites.
– Replicated
Copies of one or more database fragments are stored at several sites.
Data allocation algorithms
• Data allocation algorithms take into consideration a variety of factors:
– Performance and data availability goals.
– Size, number of rows, and the number of relations that an entity
maintains with other entities.
– Types of transactions to be applied to the database, and the attributes
accessed by each of those transactions.
Transparencies in a DDBMS
▪ Transparency hides implementation details from the user:
‒ Distribution transparency
‒ Transaction transparency
‒ Failure transparency
‒ Performance transparency
Distribution Transparency
• Distribution transparency allows the user to perceive the database as a
single, logical entity.
• It allows us to manage a physically dispersed database as though it were
a centralized database.
• Three levels of distribution transparency:
– Fragmentation transparency
– Location transparency
– Local mapping transparency
Distribution Transparency
• Example:
• Employee data (EMPLOYEE) are distributed over three locations:
New York, Atlanta, and Miami.
• Depending on the level of distribution transparency support, three
different cases of queries are possible:
Distribution Transparency
• Case 1: DB Supports Fragmentation Transparency
SELECT * FROM EMPLOYEE WHERE EMP_DOB < '01-JAN-1940';
• Case 2: DB Supports Location Transparency
SELECT * FROM E1 WHERE EMP_DOB < '01-JAN-1940';
UNION
SELECT * FROM E2 WHERE EMP_DOB < '01-JAN-1940';
UNION
SELECT * FROM E3 WHERE EMP_DOB < '01-JAN-1940';
• Case 3: DB Supports Local Mapping Transparency
SELECT * FROM E1 NODE NY WHERE EMP_DOB < '01-JAN-1940';
UNION
SELECT * FROM E2 NODE ATL WHERE EMP_DOB < '01-JAN-1940';
UNION
SELECT * FROM E3 NODE MIA WHERE EMP_DOB < '01-JAN-1940';
Transaction Transparency
• Transaction transparency ensures that database transactions will
maintain the database’s integrity and consistency.
• Transaction transparency covers:
– Remote Requests
– Remote Transactions
– Distributed Transactions
– Distributed Requests
A Remote Request
▪ Allows us to access data to be processed by a single remote database
processor.
A Remote Transaction
▪ Composed of several requests, but may access data at only a single
remote site.
A Distributed Transaction
▪ Allows a transaction to reference several (local or remote) DP sites.
A Distributed Request
▪ References data from several remote DP sites.
▪ Allows a single request to reference a physically partitioned table.
Distributed Transactions and Two-Phase Commit
▪ Transaction transparency in a DDBMS environment ensures that all
distributed transactions maintain the distributed database’s integrity
and consistency.
▪ A transaction may access data at several sites.
▪ Each site has a local transaction manager responsible for:
– Maintaining a log for recovery purposes.
– Participating in coordinating the concurrent execution of the
transactions executing at that site.
▪ Each site has a transaction coordinator, which is responsible for:
– Starting the execution of transactions that originate at the site.
– Distributing subtransactions to appropriate sites for execution.
– Coordinating the termination of each transaction that originates at
the site.
Two-Phase Commit Protocol
 DO performs the operation and records the “before” and “after” values
in the transaction log.
 UNDO reverses an operation, using the log entries written by the DO
portion of the sequence.
 REDO redoes an operation, using the log entries written by the DO
portion of the sequence.
– The write-ahead protocol forces the log entry to be written to
permanent storage before the actual operation takes place.
• The two-phase commit protocol defines the operations between two
kinds of nodes:
• the coordinator and
• one or more subordinates (or cohorts).
Two-Phase Commit Protocol
• The protocol is implemented in two phases:
• Phase 1: Preparation
• The coordinator sends a PREPARE TO COMMIT message to all
subordinates.
• The subordinates receive the message, write the transaction log
using the write-ahead protocol, and send an acknowledgement
message to the coordinator.
• The coordinator makes sure that all nodes are ready to commit,
or it aborts the transaction.
Two-Phase Commit Protocol
⚫ Phase 2: The Final Commit
– The coordinator broadcasts a COMMIT message to all subordinates
and waits for the replies.
– Each subordinate receives the COMMIT message, then updates the
database using the DO protocol.
– The subordinates reply with a COMMITTED or NOT COMMITTED
message to the coordinator.
– If one or more subordinates did not commit, the coordinator sends
an ABORT message, thereby forcing them to UNDO all changes.
A coordinator-side sketch of both phases follows.
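The coordinator's side of both phases can be sketched in Python; send and the message strings are illustrative stand-ins for real messaging, not a library API:

def two_phase_commit(subordinates):
    # Phase 1: preparation - collect votes from every subordinate
    votes = [s.send("PREPARE TO COMMIT") for s in subordinates]
    if not all(v == "READY" for v in votes):
        for s in subordinates:
            s.send("ABORT")               # force every site to UNDO
        return "aborted"
    # Phase 2: final commit - broadcast COMMIT and check the replies
    replies = [s.send("COMMIT") for s in subordinates]
    if all(r == "COMMITTED" for r in replies):
        return "committed"
    for s in subordinates:
        s.send("ABORT")
    return "aborted"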
Performance Transparency and Query Optimization
• Query optimization must provide distribution transparency as well as
replica transparency.
• Replica transparency refers to the DDBMS’s ability to hide the
existence of multiple copies of data from the user.
• Query optimization algorithms are based on two principles:
• Selection of the optimum execution order.
• Selection of the sites to be accessed, to minimize communication costs.
Operation Modes of Query Optimization
⚫ Automatic query optimization
– The DDBMS finds the most cost-effective access path without user
intervention.
⚫ Manual query optimization
– Optimization is selected and scheduled by the end user or programmer.
Timing of Query Optimization
– Static query optimization takes place at compilation time.
– Dynamic query optimization takes place at execution time.
Optimization Techniques
– Statistically based query optimization uses statistical information about
the database.
– Rule-based query optimization is based on a set of user-defined rules
to determine the best query access strategy.
Date’s Twelve Rules for a DDBMS
• In this final section, we list Date’s twelve rules (or objectives) for
DDBMSs (Date, 1987b).
• Fundamental principle:
• To the user, a distributed system should look exactly like a
non-distributed system.
1) Local autonomy
2) No reliance on a central site
3) Continuous operation
4) Location independence
Date’s Twelve Rules for a DDBMS
5) Fragmentation independence
6) Replication independence
7) Distributed query processing
8) Distributed transaction processing
9) Hardware independence
10) Operating system independence
11) Network independence
12) Database independence
Questions ?
1. Explain what is meant by a DDBMS and discuss the motivation in
providing such a system.
2. Compare and contrast a DDBMS with a parallel DBMS. Under what
circumstances would you choose a DDBMS over a parallel DBMS?
3. Discuss the advantages and disadvantages of a DDBMS.
4. What is the difference between a homogeneous and a heterogeneous
DDBMS? Under what circumstances would such systems generally arise?