DBMS Notes
INTRODUCTION TO DATABASE MANAGEMENT SYSTEMS
Learning Objectives
To learn the difference between file systems and database systems
To learn about various data models
To study the architecture of a DBMS
To learn about the various types of data independence
To learn various data modeling techniques
INTRODUCTION
A database is a collection of data elements (facts) stored in a computer in such a systematic way that a computer program can consult it to answer questions. The answers to those questions become information that can be used to make decisions that could not be made from the data elements alone. The computer program used to manage and query a database is known as a database management system (DBMS).
A database is a collection of related data that we can use for
Defining (specifying types of data)
Constructing (storing & populating)
Manipulating (querying, updating, reporting)
A Database Management System (DBMS) is a software package that facilitates the creation and maintenance of a computerized database. A Database System (DB) is a DBMS together with the data.
Features of a database
It is a persistent (stored) collection of related data.
The data is input (stored) only once.
The data is organized (in some fashion).
The data is accessible and can be queried (effectively and efficiently).
Data Independence
Definition:
Data independence means that application programs and user views do not depend on how the data is physically stored or logically arranged inside the DBMS. As shown in Fig. 1.5, the database is affected only by various external factors such as new hardware, new users, new technology, new functions, and linkages to other databases.
o Logical data independence
o change the conceptual schema without having to change the
external schemas
o Physical data independence
o change the internal schema without having to change the
conceptual schema
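For example (a minimal SQL sketch; the emp table and the index name are assumed for illustration, not from the original notes), physical data independence means a new access path can be added without changing any application query:
CREATE INDEX emp_dept_idx ON emp (deptno);
SELECT ename FROM emp WHERE deptno = 10; -- unchanged; the optimizer may now use the index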
Fig 1.5 External factors acting on a database: new hardware, new functions, changes in use, new users, new data, and the user's view
DBMS ARCHITECTURE
Overall System Structure
Disk Storage
Disk storage consists of data in the form of logical tables, indices, the data dictionary, and statistical data. The data dictionary stores data about data, i.e. its structure, etc. Indices are used for easy searching in a database. Statistical data is the log that stores details about the various transactions which occur on the database.
Query processor
Users submit queries, which pass to the optimizer, where each query is optimized; the resulting physical execution plan goes to the execution engine. Data retrieved from physical storage flows back in the reverse order. The catalog is the data dictionary, which contains statistics and schemas. Every query execution which takes place in the execution engine is logged and recovered when required.
Fig 1.7 Query processing layers: API/GUI → query optimizer (consulting catalog statistics and schemas) → physical plan → execution engine (with logging and recovery) → index/file/record manager → buffer manager (logical-to-physical page mapping) → storage manager → storage
Application Architectures
ER Notations
Strong entities
o The instances of the entity class can exist on their own, without participating in any relationship.
o It is also called non-obligatory membership.
Weak entities
o The entity does not have a primary key.
o Each instance of the entity class has to participate in a relationship in order to exist.
o Keys are imported from the dependent entity.
o It is also called obligatory membership.
o It is a special type of total participation.
Enhanced E-R (EER) Models
o An entity type E1 is a specialization of
another entity type E2 if E1 has the same
properties of E2 and perhaps even more.
o E1 IS-A E2
Inheritance allows one class to incorporate the attributes and behaviours of one or more classes.
Various sub classes are specializations of one or more super classes.
Specialization:
Object-oriented applications are structured to perform work on generic classes (e.g. Vehicle) and at runtime invoke behaviours appropriate for the specific subclass being operated upon (e.g. Boeing 747).
Aggregation:
Consider the ternary relationship works-on, which we saw earlier. Suppose we want to record managers for tasks performed by an employee at a branch. Fig 1.11 (Enhanced ER model) illustrates this EER situation.
Generalization
Abstracting the common properties of two or more entities to produce a “higher-level” entity is called generalization.
Subclass/Super class
The subclass is a specialization of the super class; it adds additional data or behaviours, or overrides behaviours of the super class.
Super classes are generalization of their sub classes.
This is related to instances of entities that are involved in a specialization/generalization relationship.
If E1 specializes E2, then each instance of E1 is also an instance of E2. Therefore Class(E1) ⊆ Class(E2).
Example: specialization constraints are annotated with d (disjoint) or o (overlapping):
•Disjoint, Partial (d)
•Overlapping, Total (o)
•Overlapping, Partial (o)
Short Questions:
1. Define Entity, Attribute, Relationship, Entity Type, Entity Instance,
Entity Class.
2. Differentiate Weak and Strong Entity Set.
3. What are the various notations for ER Diagrams?
4. What is participation constraint? Mention different types of
participation constraint.
5. Define Generalization and Specialization.
Descriptive Questions:
1. Explain various types of Attributes with suitable examples.
2. State different types of Participation Constraint and explain with a diagrammatic example.
(ER diagram example: a movie database relating Person, Movie/Film, Actor, Director, Award, and Organization, with attributes such as id, name, address, birthday, phone number, salary, title, type, and year, and relationships IS-A, Acted In, Directed, and Won.)
Representation of Composite and Multivalued Attributes
(Diagram: the Actor entity with id, name, and birthday, where address is shown as a composite/multivalued attribute with a type component.)
Recursive Relationships
• An entity set can participate more than once in a relationship.
• In this case, we add a description of the role to the ER-diagram.
(Diagrams: Employee (id, name, address, phone number) participates twice in the Manages relationship, in the roles manager and worker; a ternary Produced relationship connects Actor, Director, and Film; Actor Acted In Film (id, title).)
Key Constraints
•Key constraints specify whether an entity can participate in one or more
than one relationships in a relationship set.
•When there is no key constraint an entity can participate any number of times.
•When there is a key constraint, the entity can participate one time at most.
•Key constraints are drawn using an arrow from the entity set to the relationship set.
•We express cardinality constraints by drawing either a directed line (→), signifying “one,” or an undirected line (—), signifying “many,” between the relationship set and the entity set.
One-to-Many
A film is directed by one director at most.
A director can direct any number of films.
(Diagram: Director — Directed ← Film, with an arrow from Film to the Directed relationship.)
Many-to-Many
A film is directed by any number of directors.
A director can direct any number of films.
(Diagram: Director — Directed — Film, with undirected lines on both sides.)
(Diagrams: a one-to-one variant of Directed with arrows on both sides; a recursive FatherOf relationship on Person (id, name, age) with roles father and child; a ternary Produced relationship among Actor, Director, and Film.)
Example (2)
•We can combine key and participation constraints.
•What does this diagram mean?
(Diagram: Director connected to Film through Directed, combining a key constraint and a participation constraint.)
Keywords:
Entity, Entity Set, Attribute, Relation, Relationship set, Key Constraint, One to
one, one to many, Many to many, Recursive relations.
Short Questions:
1. Define the following:
a. Entity
b. Entity Set
c. Attribute
d. Key
e. Relation
f. Relationship
2. What is n-ary Relationship?
3. Draw the diagram of Participation Constraint.
Descriptive Questions:
1. Explain various Relationship types with examples.
2. Compare the Recursive Relations with Participation Constraint with example.
Relational scheme
A relation scheme is the definition; i.e. a set of attributes
A relational database scheme is a set of relation schemes: i.e. a set of sets
of attributes
Relation instance (simply relation)
A relation is an instance of a relation scheme.
A relation r over a relation scheme R = {A1, ..., An} is a subset of the Cartesian product of the domains of all attributes, i.e.
r ⊆ D(A1) × D(A2) × ... × D(An)
Domains
A domain is a set of acceptable values for a variable.
Example: Names of Canadians, Salaries of professors
Simple/Composite domains
Address = Street Name + Street Number + City + Province +Postal Code
Domain compatibility
Two domains are compatible if binary operations (e.g. comparison to one another, addition, etc.) can be performed on values drawn from them.
Full support for domains is not provided in many current relational DBMSs.
Edited with the trial version of
Foxit Advanced PDF Editor
To remove this notice, visit:
www.foxitsoftware.com/shopping
Properties
Based on the set theory
No ordering among attributes and tuples
No duplicate tuples allowed
Value-oriented: tuples are identified by the attributes values
All attribute values are atomic
No repeating groups
Degree
The number of attributes in a relation is called the degree of the relation.
Cardinality
It is the number of tuples available in a relation.
Cardinality constraints are specified in the form l..h, where l denotes the minimum and h the maximum number of relationships an entity can participate in.
Beware: the positioning of the constraints is exactly the reverse of the positioning of constraints in E-R diagrams.
Integrity Constraints
# Key Constraints
Referring to Mark sheet relation as shown above
Key: A set of attributes that uniquely identifies tuples (e.g. Rno, Name).
A super key of an entity set is a set of one or more attributes whose values uniquely determine each entity (e.g. Rno; Name & DOB; Rno & Name; Rno & Name & DOB).
A candidate key of an entity set is a minimal super key (e.g. Rno; Name & DOB).
Although several candidate keys may exist, one of the candidate keys with the minimum number of attributes is selected to be the primary key (e.g. Rno).
# Data Constraints
~ Check constraints (e.g.: to check percentage > 80)
# Others
~ Entity Integrity Constraint: No primary key value can be null. (e.g.:
Rno cannot be null)
~ Referential Integrity Constraint: A tuple in one relation must refer to
an existing tuple.
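A hedged SQL sketch of these constraint types for the mark sheet example (table and column names assumed, not from the original notes):
CREATE TABLE MARKSHEET (
Rno Integer Primary Key, -- entity integrity: Rno cannot be null
Name Char(30) Not Null,
DOB Date,
Percentage Numeric(5,2) Check (Percentage > 80) -- check constraint
);
CREATE TABLE RESULT (
Rno Integer References MARKSHEET (Rno) -- referential integrity: must refer to an existing tuple
);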
Characteristics of Relations
Ordering of Tuples in a Relation
Ordering of values within a tuple and alternative definition of a relation
Values in tuples: Each value in a tuple is an atomic value.
Advantages over other models
Simple concepts
Solid mathematical foundation
Set theory
Powerful query languages
Efficient query optimization strategies
Design theory
Industry standard SQL language
Keywords:
Relation, Relational Scheme, Relational Instance, Domain, Degree, Cardinality,
Integrity Constraints, Candidate Key, Primary Key.
Short Questions:
1. Define the following:
a. Relation
b. Relational Scheme
c. Relational Instance
d. Domain
e. Degree
f. Cardinality
g. Integrity Constraints
h. Candidate Key
i. Primary Key
2. What is Domain Compatibility?
3. State the properties of a Relational Model.
Descriptive Questions:
1. Design a database in Relational Model and represent it in Relational
Schema.
2. Explain various constraints in Relational Model.
Summary
DBMS is used to maintain and query large datasets.
Benefits include recovery from system crashes, concurrent access, quick
application development, data integrity and security.
Levels of abstraction give data independence.
A DBMS typically has a layered architecture.
DBAs hold responsible jobs and are well-paid!
DBMS R&D is one of the broadest, most exciting areas in CS.
Conceptual design follows requirements analysis and yields a high-level description of the data to be stored.
The ER model is popular for conceptual design; its constructs are expressive and close to the way people think about their applications.
Basic constructs: entities, relationships, and attributes (of entities and
relationships).
Some additional constructs: weak entities, ISA hierarchies, and aggregation.
Note: There are many variations on ER model.
Several kinds of integrity constraints can be expressed in the ER model:
key constraints, participation constraints, and overlap/covering
constraints for ISA Hierarchies. Some foreign key constraints are also
implicit in the definition of a relationship set.
Some constraints (notably, functional dependencies) cannot be expressed in
the ER model.
Constraints play an important role in determining the best database design
for an enterprise.
ER design is subjective. There are often many ways to model a given
scenario!
Analyzing alternatives can be tricky, especially for a large enterprise.
Common choices include: entity vs. attribute, entity vs. relationship, binary or n-ary relationship, whether or not to use ISA hierarchies, and whether or not to use aggregation.
Ensuring good database design: resulting relational schema should be
analyzed and refined further. FD information and normalization techniques
are especially useful.
References:
1. Ramez Elmasri and Shamkant B. Navathe, “Fundamentals of Database Systems”, Third Edition, Pearson Education, Delhi, 2002.
2. Abraham Silberschatz, Henry F. Korth and S. Sudarshan, “Database System Concepts”, Fourth Edition, McGraw-Hill, 2002.
3. C.J. Date, “An Introduction to Database Systems”, Seventh Edition, Pearson Education, Delhi, 2002.
STORAGE STRUCTURES
Learning Objectives
To learn about secondary storage devices
To learn about RAID technology
To study file operations
To learn about hashing techniques
To learn about indexing (both single-level and multi-level)
To study B+ trees and indexes on multiple keys
INTRODUCTION
Storage Structure
A database file is partitioned into fixed-length storage units called blocks. Blocks are units of both storage allocation and data transfer.
The database system seeks to minimize the number of block transfers between the disk and memory. We can reduce the number of disk accesses by keeping as many blocks as possible in main memory.
Secondary storage devices are used to store data permanently.
Buffer – portion of main memory available to store copies of
disk blocks.
Buffer manager – subsystem responsible for allocating buffer space
in main memory.
Volatile storage:
o does not survive system crashes
o examples: main memory, cache memory
Nonvolatile storage:
o survives system crashes
o examples: disk, tape, flash memory, non-volatile (battery backed up) RAM
Stable storage:
o a mythical form of storage that survives all failures
o approximated by maintaining multiple copies on distinct nonvolatile media
2.1.1.1. Physical Storage Media
Physical storage media are the devices on which data are stored. They are classified into different categories based on the following aspects.
H Speed with which data can be accessed
H Cost per unit of data
H Reliability
4 Data loss on power failure or system crash
H Life of storage
4 Volatile storage: It loses its contents when power is switched off
4 Non-volatile storage: Its contents persist even when power is switched off; this includes secondary and tertiary storage as well as battery-backed-up main memory.
2.1.1.2 Categories of Physical Storage Media
Cache
4 The fastest and most costly form of storage
4 Volatile
4 Managed by the computer system hardware.
Main memory
4 Fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^-9 seconds)
4 Generally too small (or too expensive) to store the entire database
4 Capacities of a few Gigabytes are widely used currently.
4 Capacities have gone up and per-byte costs have decreased steadily and
rapidly.
4 Volatile: Contents of main memory are usually lost if a power failure or
system crash occurs.
Flash memory
4 Data survives power failure.
4 Data can be written at a location only once, but the location can be erased and written again.
4 It can support only a limited number of write/erase cycles.
4 Erasure of memory has to be done to an entire bank of memory.
4 Reads are roughly as fast as main memory.
4 But writes are slow (few microseconds), erasure is slower.
4 Cost per unit of storage is roughly similar to main memory.
4 It is widely used in embedded devices like digital cameras.
4 It is also known as EEPROM.
Magnetic-disk
4 Data is stored on spinning disk, and it is read/written magnetically.
4 Primary medium for the long-term storage of data; typically stores the entire database.
4 Data must be moved from disk to main memory for access, and written
back for storage.
4 Much slower access than main memory (more on this later)
Optical storage
4 Non-volatile, data is read optically from a spinning disk using a laser
4 CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms
4 Write-once, read-many (WORM) optical disks are used for archival storage (CD-R and DVD-R)
4 Multiple write versions also available (CD-RW, DVD-RW, and DVD-RAM)
4 Reads and writes are slower than with magnetic disk
4 Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism for automatic loading and unloading of disks, are available for storing large volumes of data.
Tape storage
4 Non-volatile, used primarily for backup (to recover from disk failure), and
for archival data
4 Sequential-access – much slower than disk
4 Very high capacity (40 to 300 GB tapes available)
4 Tape can be removed from the drive; storage costs are much cheaper than disk, but drives are expensive
4 Tape jukeboxes are available for storing massive amounts of data
4 Hundreds of terabytes (1 terabyte = 10^12 bytes) to even a petabyte (1 petabyte = 10^15 bytes)
Magnetic Tapes
n Hold large volumes of data and provide high transfer rates
H Few GB for DAT (Digital Audio Tape) format, 10-40 GB with DLT
(Digital Linear Tape) format, 100 GB+ with Ultrium format, and 330 GB
with Ampex helical scan format
H Transfer rates from few to 10s of MB/s
n Currently the cheapest storage medium
H Tapes are cheap, but cost of drives is very high.
n Very slow access time in comparison with magnetic disks and optical disks
H Limited to sequential access.
n Used mainly for backup, for storage of infrequently used information, and as an off-line medium for transferring information from one system to another.
n Tape jukeboxes used for very large capacity storage
H (terabyte (10^12 bytes) to petabyte (10^15 bytes))
Storage Hierarchy
Primary storage: Fastest media but volatile (cache, main memory).
Secondary storage: next level in hierarchy, non-volatile, moderately fast access
time.
Also called on-line storage
4 E.g. flash memory, magnetic disks
Tertiary storage: lowest level in hierarchy, non-volatile, slow access time
4 Also called off-line storage
4 E.g. magnetic tape, optical storage
Performance Measures of Disks
H Typical sectors per track: 200 (on inner tracks) to 400 (on outer tracks)
n Access time – the time from when a read/write request is issued to when the data transfer begins. The average seek component would be 1/3 of the worst case if all tracks had the same number of sectors, and we ignore the time to start and stop arm movement.
n Data-Transfer Rate – the rate at which data can be retrieved from or stored to
the disk.
H 4 to 8 MB per second is typical
4 Multiple disks may share a controller, so rate that controller can handle is
also important
Mean Time To Failure (MTTF) – the average time the disk is expected to run
continuously without any failure.
4 Typically 3 to 5 years
4 Probability of failure of new disks is quite low, corresponding to a
“theoretical MTTF” of 30,000 to 1,200,000 hours for a new disk.
RAID
RAID: Redundant Arrays of Independent Disks
o Disk organization techniques that manage a large number of disks, providing a view of a single disk of
high capacity and high speed by using multiple disks in parallel, and
high reliability by storing data redundantly, so that data can be recovered even if a disk fails
The chance that some disk out of a set of N disks will fail is much higher
than the chance that a specific single disk will fail.
E.g. a system with 100 disks, each with an MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of only 100,000 / 100 = 1,000 hours (approx. 41 days).
o Techniques for using redundancy to avoid data loss are critical
with large numbers of disks
Originally a cost-effective alternative to large, expensive disks
o The “I” in RAID originally stood for “inexpensive”.
o Today RAIDs are used for their higher reliability and bandwidth.
The “I” is interpreted as independent
Improvement of Reliability via Redundancy
Redundancy – stores extra information that can be used to
rebuild information lost in a disk failure
E.g. Mirroring (or shadowing)
o Duplicate every disk. Logical disk consists of two
physical disks.
o Every write is carried out on both disks
Reads can take place from either disk
o If one disk in a pair fails, data still available in the other
Data loss would occur only if a disk fails, and its
mirror disk also fails before the system is repaired
Probability of combined event is very small
o Except for dependent failure modes
such as fire or building collapse or
electrical power surges
Mean time to data loss depends on mean time to
failure, and mean time to repair
o E.g. an MTTF of 100,000 hours and a mean time to repair of 10 hours give a mean time to data loss of 500 × 10^6 hours (or 57,000 years) for a mirrored pair of disks (ignoring dependent failure modes)
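As a worked equation, using the standard independent-failure approximation (MTTR = mean time to repair):
mean time to data loss ≈ MTTF^2 / (2 × MTTR) = (100,000 hr)^2 / (2 × 10 hr) = 5 × 10^8 hours ≈ 57,000 years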
Improvement in Performance via Parallelism
o Two main goals of parallelism in a disk system:
1. Load balance multiple small accesses to increase throughput
2. Parallelize large accesses to reduce response time.
o Improve transfer rate by striping data across multiple disks.
o Bit-level striping – split the bits of each byte across multiple disks
In an array of eight disks, write bit i of each byte to disk i.
Each access can read data at eight times the rate of a single disk.
But seek/access time worse than for a single disk
Bit level striping is not used much any more
o Block-level striping – with n disks, block i of a file goes to disk (i mod n) + 1. For instance, with n = 4: block 0 → disk 1, block 1 → disk 2, block 2 → disk 3, block 3 → disk 4, block 4 → disk 1, and so on.
Requests for different blocks can run in parallel if the blocks reside on different disks.
A request for a long sequence of blocks can utilize all disks in parallel.
RAID Levels
o Schemes to provide redundancy at lower cost by using disk
striping combined with parity bits
Different RAID organizations, or RAID levels, have
differing cost, performance and reliability characteristics
RAID Level 0: Block striping; non-redundant.
o Used in high-performance applications where loss of data is
not critical.
RAID Level 1: Mirrored disks with block striping
o Offers best write performance.
o Popular for applications such as storing log files in a
database system.
RAID Level 2: Memory-Style Error-Correcting-Codes (ECC) with
bit striping.
RAID Level 3: Bit-Interleaved Parity
o a single parity bit is enough for error correction, not just
detection, since we know which disk has failed
When writing data, corresponding parity bits must also
be computed and written to a parity bit disk.
To recover data in a damaged disk, compute XOR of
bits from other disks (including parity bit disk).
o Faster data transfer than with a single disk, but fewer I/Os
per second since every disk has to participate in every I/O.
o Subsumes Level 2 (provides all its benefits at lower cost).
RAID Level 4: Block-Interleaved Parity; uses block-level striping, and
keeps a parity block on a separate disk for corresponding blocks
from N other disks.
o When writing data block, corresponding block of parity bits
must also be computed and written to parity disk.
o To find the value of a damaged block, compute XOR of bits
from corresponding blocks (including parity block) of other
disks.
o Provides higher I/O rates for independent block reads than Level 3.
A block read goes to a single disk, so blocks stored on different disks can be read in parallel.
o Provides higher transfer rates for reads of multiple blocks than no striping.
o Before writing a block, parity data must be computed.
Can be done by using old parity block, old value of
current block and new value of current block (2 block
reads + 2 block writes)
Or by recomputing the parity value using the new values
of blocks corresponding to the parity block
n More efficient for writing large amounts of
data sequentially
o Parity block becomes a bottleneck for independent block writes
since every block write also writes to parity disk.
RAID Level 5: Block-Interleaved Distributed Parity; partitions data and
parity among all N + 1 disks, rather than storing data in N disks and
parity in 1 disk.
o E.g. with 5 disks, parity block for nth set of blocks is stored on
disk (n mod 5) + 1, with the data blocks stored on the other 4
disks.
o Higher I/O rates than Level 4.
Block writes occur in parallel if the blocks and their parity
blocks are on different disks.
o Subsumes Level 4: provides same benefits, but avoids bottleneck
of parity disk.
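For instance, working the placement rule through for the 5-disk example: parity for block-set 0 goes to disk 1, set 1 to disk 2, set 2 to disk 3, set 3 to disk 4, set 4 to disk 5, set 5 back to disk 1, and so on, spreading parity writes evenly across all five disks.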
RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores
extra redundant information to guard against multiple disk failures.
o Better reliability than Level 5 at a higher cost; It is not used
widely.
Free Lists
o Store the address of the first deleted record in the file header.
o Use this first record to store the address of the second
deleted record, and so on
o Can think of these stored addresses as pointers since they “point”
to the location of a record.
o A more space-efficient representation: reuse the space for normal attributes of free records to store pointers. (No pointers are stored in in-use records.)
Fig 2.3 Free list
Variable-Length Records
o Variable-length records arise in database systems in several ways:
Storage of multiple record types in a file.
Record types that allow variable lengths for one or more fields.
Record types that allow repeating fields (used in some
older data models).
o Byte string representation
Attach an end-of-record (⊥) control character to the end of each record
Difficulty with deletion
Difficulty with growth
Variable-Length Records: Slotted Page Structure
Pointer method
A variable-length record is represented by a list of fixed-length
records, chained together via pointers.
It can be used even if the maximum record length is not known.
A disadvantage of the pointer structure is that space is wasted in all records except the first in a chain.
Solution is to allow two kinds of block in file:
Anchor block – contains the first records of chains
Overflow block – contains records other than those that are the first records of chains.
Hash Functions
o The worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.
o An ideal hash function is uniform, i.e. each bucket is assigned the
same number of search-key values from the set of all possible values.
o Ideal hash function is random, so each bucket will have the same
number of records assigned to it irrespective of the actual
distribution of search- key values in the file.
o Typical hash functions perform computation on the internal binary representation of the search key.
o For example, for a string search key, the binary representations of all the characters in the string could be added and the sum modulo the number of buckets could be returned.
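A minimal Oracle-style sketch of such a hash (the bucket count of 8 is assumed; dual is Oracle's dummy table):
select mod(ascii('P') + ascii('e') + ascii('r') + ascii('r') + ascii('y'), 8) as bucket from dual;
The query sums the character codes of the key 'Perry' and returns the remainder modulo the number of buckets, i.e. the bucket number.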
Handling of Bucket Overflows
o Bucket overflow can occur because of
Insufficient buckets
Skew in the distribution of records. This can occur due to two reasons:
multiple records have same search-key value.
chosen hash function produces non-uniform
distribution of key values.
o Although the probability of bucket overflow can be reduced, it cannot
be eliminated; it is handled by using overflow buckets.
o Overflow chaining – the overflow buckets of a given bucket are
chained together in a linked list.
o Above scheme is called closed hashing.
o An alternative, called open hashing, which does not use overflow
buckets, is not suitable for database applications.
Fig 2.10 Hash structure after insertion of Redwood and Round Hill records
Example of a B+-tree
Fig 2.14 Nonleaf node – pointers Bi are the bucket or file record pointers.
B-Tree Index File Example
Keywords:
Secondary storage, magnetic disk, cache, main memory, optical, primary, secondary, tertiary, hashing, indexing, B tree, B+ tree, node, leaf node, bucket
Short Questions:
1. Write a short note on Classification of Physical Storage Media.
2. Explain Storage Hierarchy.
3. What is Magnetic Hard Disk Mechanism?
4. What is Performance Measures of Disks?
5. Write a short note on Improvement in Performance via Parallelism.
6. Explain Levels of RAID.
7. Write a short note on Mapping of Objects to Files.
8. Explain Extendable Hash Structure.
9. Define Indexing.
10.Explain B+Tree File Organization.
Answer Vividly :
1. Explain RAID Architecture.
2. Explain Factors in choosing RAID level and Hardware Issues.
3. Briefly describe Various File Operations.
4. Write a note on Organization of Records in Files.
5. Explain Static & Dynamic Hashing.
6. Explain the basic kinds of indices.
7. Explain B+Tree Index files.
Summary
Many alternative file organizations exist, each appropriate in some situation.
If selection queries are frequent, sorting the file or building an index is important.
Hash-based indexes are only good for equality search.
Sorted files and tree-based indexes are best for range search; they are also good for equality search. (Files are rarely kept sorted in practice; a B+ tree index is better.)
Index is a collection of data entries plus a way to quickly find entries with
given key values.
Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs.
This choice is orthogonal to the indexing technique used to locate data entries with a given key value.
A file of data records can have several indexes, each with a different search key.
Indexes can be classified as clustered vs. unclustered, primary vs. secondary, and dense vs. sparse. The differences have important consequences for utility and performance.
Understanding the nature of the workload for the application, and the
performance goals, is essential to develop a good design.
What are the important queries and updates? What attributes/relations
are involved?
Indexes must be chosen to speed up important queries (and perhaps
some updates!).
Index maintenance overhead on updates to key fields.
Choose indexes that can help many queries, if possible.
Build indexes to support index-only strategies.
Clustering is an important decision; only one index on a given relation
can be clustered!
Order of fields in composite index key can be important.
References:
1. Ramez Elmasri and Shamkant B. Navathe, “Fundamentals of Database Systems”, Third Edition, Pearson Education, Delhi, 2002.
2. Abraham Silberschatz, Henry F. Korth and S. Sudarshan, “Database System Concepts”, Fourth Edition, McGraw-Hill, 2002.
3. C.J. Date, “An Introduction to Database Systems”, Seventh Edition, Pearson Education, Delhi, 2002.
RELATIONAL MODEL
Learning Objectives
To learn relational model concepts
To learn about relational algebra
To study SQL, queries, views, and constraints
To learn about relational calculus
To have an idea of commercial RDBMSs
To learn to design databases with functional dependencies
To learn about various normal forms and database tuning
INTRODUCTION
Data Models
o A data model is a collection of concepts for describing data.
o A schema is a description of a particular collection of data, using a
given data model.
o The relational model of data is the most widely used model today.
o Main concept: relation, basically a table with rows and columns.
o Every relation has a schema, which describes the columns, or
fields (that is, the data’s structure).
Relational Database: Definitions
o Relational database: a set of relations
o Relation: made up of 2 parts:
o Instance : a table, with rows and columns.
o Schema : specifies name of relation, plus name and type of
each column.
E.G. Students(sid: string, name: string, login: string,
age: integer, gpa: real).
o One can think of a relation as a set of rows or tuples that share the same structure.
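A hedged SQL rendering of this schema (type mappings assumed):
CREATE TABLE Students (sid Char(20) Primary Key, name Char(30), login Char(20), age Integer, gpa Real);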
Importance of the Relational Model
Most widely used model.
Vendors: IBM, Informix, Microsoft, Oracle, Sybase,etc.
“Legacy systems” in older models
E.G., IBM’s IMS
Recent competitor: XML
A synthesis emerging: XML & Relational
BEA; numerous start-ups
Bottom-up
o Consider relationships between attributes.
o Build up relations.
o It is also called design by synthesis.
Informal Measures for Design
Semantics of the attributes:
o Design a relation schema so that it is easy to explain its meaning.
o A relation schema should correspond to one semantic object
(entity or relationship).
o Example: The first schema is good due to clear meaning.
Faculty (name, number, office)
Department (dcode, name, phone)
or
Faculty_works (number, name, Salary, rank, phone, email)
Reduce redundant data
o Design has a significant impact on storage requirements.
o The second schema needs more storage due to redundancy.
Faculty and Department
or
Faculty_works
Avoid update anomalies
Relation schemes can suffer from update anomalies
Insertion anomaly
1) Insert new faculty into faculty_works
o We must keep the values for the department
consistent between tuples
2) Insert a new department with no faculty members into faculty_works
o We would have to insert nulls for the faculty info.
o We would have to delete this entry later.
Deletion anomaly
o Delete the last faculty member for a department from the
faculty_works relation.
o If we delete the last faculty member for a department from the
database, all the department information disappears as well.
o This is like deleting the department from the database.
Modification anomaly
o Update the phone number of a department in the
faculty_works relation.
o We would have to search out each faculty member that works in
that department and update the phone information in each of those
tuples.
Reduce null values in tuples
o Avoid attributes in relations whose values may often be null.
o Reduces the problem of “fat” relations
o Saves physical storage space
o Don’t include a “department name” field for each employee.
Avoid spurious tuples
o Design relation schemes so that they can be joined with
equality conditions on attributes that are either primary or
foreign keys.
o If you don’t, spurious or incorrect data will be generated
o Suppose we replace
Section (number, term, slot, cnum, dcode, faculty_num)
with
Section_info (number, cnum, dcode, term, slot)
Faculty_info (faculty_num, name)
then
Section != Section_info * Faculty_info
Relational Design
Simplest approach (not always best):
Convert each Entity Set to a relation and each relationship to a relation.
o Entity Set → Relation
o E.S. attributes become relational attributes.
(Diagram: the Beers entity set with attributes name and manf.)
Relation Instance
The current values (relation instance) of a relation are specified by a table.
An element t of r is a tuple, represented by a row in the table; the columns are the attributes.

customer-name  customer-street  customer-city
Jones          Main             Harrison
Smith          North            Rye
Curry          North            Rye
Lindsay        Park             Pittsfield

Fig 3.3 Relation instance (the customer relation)
Name   Address      Telephone
Bob    123 Main St  555-1234
Bob    128 Main St  555-1235
Pat    123 Main St  555-1235
Harry  456 Main St  555-2221
Sally  456 Main St  555-2221
Sally  456 Main St  555-2223
Pat    12 State St  555-1235
Unordered Relations
The order of tuples is irrelevant (tuples may be stored in an arbitrary order), e.g. the account relation with its tuples listed in any order.
Projection Operation – Example
Relation r:
A  B   C
α  10  1
α  20  1
β  30  1
β  40  2

Π A,C (r):
A  C        A  C
α  1        α  1
α  1   =    β  1
β  1        β  2
β  2
Union Operation – Example
Relations r, s:
r:  A  B        s:  A  B
    α  1            α  2
    α  2            β  3
    β  1

r ∪ s:
A  B
α  1
α  2
β  1
β  3
For r ∪ s to be valid:
r and s must have the same arity (same number of attributes).
The attribute domains must be compatible (e.g., the 2nd column of r deals with the same type of values as does the 2nd column of s).
E.g. to find all customers with either an account or a loan:
Π customer-name (depositor) ∪ Π customer-name (borrower)
Set Difference Operation – Example
Relations r, s:
r:  A  B        s:  A  B
    α  1            α  2
    α  2            β  3
    β  1

r – s:
A  B
α  1
β  1
Cartesian-Product Operation – Example
Relations r, s:
r:  A  B        s:  C  D   E
    α  1            α  10  a
    β  2            β  10  a
                    β  20  b
                    γ  10  b

r × s:
A  B  C  D   E
α  1  α  10  a
α  1  β  10  a
α  1  β  20  b
α  1  γ  10  b
β  2  α  10  a
β  2  β  10  a
β  2  β  20  b
β  2  γ  10  b
Banking Example
branch (branch-name, branch-city, assets)
customer (customer-name, customer-street, customer-city)
account (account-number, branch-name, balance)
loan (loan-number, branch-name, amount)
depositor (customer-name, account-number)
borrower (customer-name, loan-number)
Example Queries
Find all loans of over $1200:
σ amount > 1200 (loan)
Find the loan number for each loan of an amount greater than $1200:
Π loan-number (σ amount > 1200 (loan))
Find the names of all customers who have a loan at the Perryridge branch.
Query 1
Π customer-name (σ branch-name = “Perryridge” (σ borrower.loan-number = loan.loan-number (borrower × loan)))
Query 2
Π customer-name (σ loan.loan-number = borrower.loan-number ((σ branch-name = “Perryridge” (loan)) × borrower))
Find the largest account balance.
Rename the account relation as d. The query is:
Π balance (account) – Π account.balance (σ account.balance < d.balance (account × ρd (account)))
Example:
Relations r, s:
r:  A  B        s:  A  B
    α  1            α  2
    α  2            β  3
    β  1

r ∩ s:
A  B
α  2
Example:
Relations r, s:
r:  A  B  C  D        s:  B  D  E
    α  1  α  a            1  a  α
    β  2  γ  a            3  a  β
    γ  4  β  b            1  a  γ
    α  1  γ  a            2  b  δ
    δ  2  β  b            3  b  ε

r ⋈ s:
A  B  C  D  E
α  1  α  a  α
α  1  α  a  γ
α  1  γ  a  α
α  1  γ  a  γ
δ  2  β  b  δ
Division: r ÷ s = { t | t ∈ Π R–S (r) ∧ ∀ u ∈ s ( tu ∈ r ) }
Example 1:
Relations r, s:
r:  A  B        s:  B
    α  1            1
    α  2            2
    α  3
    β  1
    γ  1
    δ  1
    δ  3
    δ  4
    ε  6
    ε  1
    β  2

r ÷ s:
A
α
β
Example 2:
Relations r, s:
r:  A  B  C  D  E        s:  D  E
    α  a  α  a  1            a  1
    α  a  γ  a  1            b  1
    α  a  γ  b  1
    β  a  γ  a  1
    β  a  γ  b  3
    γ  a  γ  a  1
    γ  a  γ  b  1
    γ  a  β  b  1

r ÷ s:
A  B  C
α  a  γ
γ  a  γ
Property
o Let q = r ÷ s
o Then q is the largest relation satisfying q × s ⊆ r
Definition in terms of the basic algebra operations: let r(R) and s(S) be relations, and let S ⊆ R. Then
r ÷ s = Π R–S (r) – Π R–S ( (Π R–S (r) × s) – Π R–S,S (r) )
To see why: Π R–S (r) × s pairs every candidate tuple with every tuple of s; subtracting r leaves exactly the candidates that fail to appear with some tuple of s, and these are then removed from Π R–S (r).
Assignment Operation
The assignment operation (←) provides a convenient way to express complex queries.
o Write query as a sequential program consisting of
a series of assignments
followed by an expression whose value is displayed
as a result of the query.
o Assignment must always be made to a temporary relation
variable.
Example: Write r ∩ s (here, the customers with accounts at both the Downtown and Uptown branches) as:
temp1 ← Π CN (σ BN = “Downtown” (depositor ⋈ account))
temp2 ← Π CN (σ BN = “Uptown” (depositor ⋈ account))
result = temp1 ∩ temp2
Aggregate Operation – Example
Relation r:
A  B  C
α  α  7
α  β  7
β  β  3
β  β  10

g sum(C) (r):
sum-C
27
Fig 3.15 Aggregate Operation
branch-name balance
Perryridge 1300
Brighton 1500
Redwood 700
Fig 3.16 Aggregate Operation - group by
Outer Join
Relation loan (loan-number, branch-name, amount)
Relation borrower:
customer-name  loan-number
Jones          L-170
Smith          L-230
Hayes          L-155
Inner Join
loan ⋈ borrower
Null Values
It is possible for tuples to have a null value, denoted by null, for some of
their attributes.
null signifies an unknown value or that a value does not exist.
The result of any arithmetic expression involving null is null.
Aggregate functions simply ignore null values.
o It is an arbitrary decision. It could have returned null as
result instead.
o We follow the semantics of SQL in its handling of null values.
For duplicate elimination and grouping, null is treated like any other
value, and two nulls are assumed to be the same.
o Alternative: assume each null is different from each other
o Both are arbitrary decisions, so we simply follow SQL.
Comparisons with null values return the special truth value unknown.
o If false was used instead of unknown, then not (A < 5)
would not be equivalent to A >= 5
Three-valued logic using the truth value unknown:
o OR: (unknown or true) = true,
(unknown or false) = unknown
(unknown or unknown) =
unknown
o AND: (true and unknown) = unknown,
(false and unknown) = false,
(unknown and unknown) = unknown
o NOT: (not unknown) = unknown
o In SQL, “P is unknown” evaluates to true if predicate P evaluates to unknown.
The result of a select predicate is treated as false if it evaluates to unknown.
Examples
Insert information in the database specifying that Smith has $1200 in account A-973 at the Perryridge branch:
account ← account ∪ {(“Perryridge”, A-973, 1200)}
depositor ← depositor ∪ {(“Smith”, A-973)}
Provide as a gift for all loan customers in the Perryridge branch a $200 savings account. Let the loan number serve as the account number for the new savings account:
r1 ← (σ branch-name = “Perryridge” (borrower ⋈ loan))
account ← account ∪ Π branch-name, loan-number, 200 (r1)
depositor ← depositor ∪ Π customer-name, loan-number (r1)
SQL
SQL can be:
H Used stand-alone within a DBMS command
H Embedded in triggers and stored procedures
H Used in scripting or programming languages
History of SQL-92
SQL was developed by IBM in the late 1970s.
SQL-92 was endorsed as a national standard by ANSI in 1992.
SQL3 incorporates some object-oriented concepts but has not gained acceptance in industry.
Data Definition Language (DDL) is used to define database structures.
Data Manipulation Language (DML) is used to query and update data.
Data Control Language (DCL) is used to control transactions and access rights. Examples: COMMIT, GRANT, which are administrative-level commands.
SQL statement is terminated with a semicolon.
Create Table
CREATE TABLE statement is used for creating relations
Each column is described with three parts: column name, data type,
and optional constraints
Example:
CREATE TABLE PROJECT (ProjectID Integer Primary Key, Name
Char(25) Unique Not Null, Department VarChar (100) Null, MaxHours
Numeric(6,1) Default 100);
Data Types
Standard data types
- Character-string for fixed-length characters (Char), bit-string, date and time
- VarChar for variable-length characters – it requires additional processing compared with Char data types
- Numeric (Integer or Int, and SmallInt)
- Real numbers of various precision, such as Float, Real, and Double Precision
- BLOB, CLOB – binary/character large objects, used to store objects like images, video, etc.
Constraints
Constraints can be defined within the CREATE TABLE statement, or they can
be added to the table after it is created using the ALTER table statement.
Five types of constraints:
H PRIMARY KEY may not have null values.
H UNIQUE may have null values.
H NULL/NOT NULL
H FOREIGN KEY
H CHECK
Example:
CREATE TABLE PROJECT (ProjectID Integer Primary Key, Name
Char(25) Unique Not Null, Department VarChar (100) Null, MaxHours
Numeric(6,1) Default 100);
ALTER Statement
Example
ALTER TABLE ASSIGNMENT ADD CONSTRAINT EmployeeFK FOREIGN KEY (EmployeeNum) REFERENCES EMPLOYEE (EmployeeNumber) ON UPDATE CASCADE ON DELETE NO ACTION;
DROP Statements
The DROP TABLE statement removes tables and their data from the database.
A table cannot be dropped if it contains foreign key values needed by another table.
H Use ALTER TABLE DROP CONSTRAINT to remove the integrity constraint in the other table first.
Example:
H DROP TABLE CUSTOMER;
H ALTER TABLE ASSIGNMENT DROP CONSTRAINT ProjectFK;
SELECT Statement
Basic format:
SELECT {column list} FROM {table name} WHERE {condition};
Quotes are required around values for Char and VarChar columns, but no quotes are used for Integer and Numeric columns.
AND may be used for compound conditions.
IN indicates a match with any value in a set; NOT IN requires that a value match none of the set.
Wildcards _ and % can be used with LIKE to specify a single or multiple
unknown characters, respectively.
IS NULL can be used to test for null values.
Example: SELECT Statement
SELECT Name, Department, MaxHours FROM PROJECT WHERE Name = 'XYX';
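Minimal sketches of the predicate forms listed above (the data values are assumed for illustration):
SELECT Name FROM PROJECT WHERE Department IN ('Accounting', 'Finance');
SELECT Name FROM PROJECT WHERE Name LIKE 'B_%'; -- _ matches one character, % matches any number
SELECT Name FROM PROJECT WHERE Department IS NULL;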
Sorting the Results
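Sorting is done with the ORDER BY clause; a minimal sketch (column choice assumed):
SELECT Name, Department FROM PROJECT ORDER BY Department ASC, Name DESC;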
UPDATE Statement
UPDATE statement is used to modify values of existing data
Example:
UPDATE EMPLOYEE SET Phone = ‘287-1435’ WHERE Name =
‘James’;
UPDATE can also be used to modify more than one column value at a time
UPDATE EMPLOYEE SET Phone = ‘285-0091’, Department = ‘Production’ WHERE EmployeeNumber = 200;
Date Functions
months_between(date1, date2)
1. select empno, ename, months_between (sysdate, hiredate)/12 from
emp;
2. select empno, ename, round((months_between(sysdate, hiredate)/12),
0) from emp;
3. select empno, ename, trunc((months_between(sysdate, hiredate)/12),
0) from emp;
add_months(date, n)
1. select ename, add_months (hiredate, 48) from emp;
2. select ename, hiredate, add_months (hiredate, 48) from emp;
last_day(date)
1. select hiredate, last_day(hiredate) from emp;
next_day(date, day)
1. select hiredate, next_day(hiredate, ‘MONDAY’) from emp;
Trunc(date, [Format])
1. select hiredate, trunc(hiredate, ‘MON’) from emp;
2. select hiredate, trunc(hiredate, ‘YEAR’) from emp;
Character Based Functions
initcap(char_column)
1. select initcap(ename), ename from emp;
lower(char_column)
1. select lower(ename) from emp;
Ltrim(char_column, ‘STRING’)
1. select ltrim(ename, ‘J’) from emp;
Rtrim(char_column, ‘STRING’)
1. select rtrim(ename, ‘ER’) from emp;
Translate(char_column, ‘search char’, ‘replacement char’)
1. select ename, translate(ename, ‘J’, ‘CL’) from emp;
replace(char_column, ‘search string’,‘replacement string’)
1. select ename, replace(ename, ‘J’, ‘CL’) from emp;
Substr(char_column, start_loc, total_char)
1. select ename, substr(ename, 3, 4) from emp;
Mathematical Functions
Abs(numerical_column)
1. select abs(-123) from dual;
ceil(numerical_column)
1. select ceil(123.0452) from dual;
floor(numerical_column)
1. select floor(12.3625) from dual;
Power(m,n)
1. select power(2,4) from dual;
Mod(m,n)
1. select mod(10,2) from dual;
Round(num_col, size)
1. select round(123.26516, 3) from dual;
Trunc(num_col,size)
NOTES
1. select trunc(123.26516, 3) from dual;
sqrt(num_column)
1. select sqrt(100) from dual;
COMPLEX QUERIES USING GROUP FUNCTIONS
Group Functions
There are five built-in functions for SELECT statement:
1. COUNT counts the number of rows in the result.
2. SUM totals the values in a numeric column.
3. AVG calculates an average value.
4. MAX retrieves a maximum value.
5. MIN retrieves a minimum value.
Result is a single number (relation with a single row and a single column).
Column names cannot be mixed with built-in functions.
Built-in functions cannot be used in WHERE clauses.
Example: Built-in Functions
1. Select count (distinct department) from project;
2. Select min (maxhours), max (maxhours), sum (maxhours) from project;
3. Select Avg(sal) from emp;
Built-in Functions and Grouping
GROUP BY allows a column and a built-in function to be used together.
GROUP BY sorts the table by the named column and applies the
built-in function to groups of rows having the same value of the
named column.
WHERE condition must be applied before GROUP BY phrase.
Example
1. Select department, count (*) from employee where employee_number < 600 group by department having count (*) > 1;
VIEWS
Base relation
A named relation whose tuples are physically stored in the database is
called as Base Relation or Base Table.
Definition
It is the tailored presentation of the data contained in one or more tables.
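A minimal Oracle-style sketch of defining such a view over the emp table (view name and column choice assumed):
SQL> CREATE VIEW V1 AS SELECT empno, ename, deptno FROM emp WHERE deptno = 10;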
Update of Views
In Oracle, a view is updated with the same UPDATE syntax as a table (there is no UPDATE VIEW keyword):
SQL> UPDATE <view-name> SET <column = new value> WHERE <condition>;
Dropping a View
The following SQL Command is used to drop a View in Oracle.
SQL> DROP VIEW <view-name>;
SQL> DROP VIEW V1;
Disadvantages of Views
In some cases, it is not desirable for all users to see the entire logical
model (i.e. all the actual relations stored in the database.)
Consider a person who needs to know a customer’s loan number but has
no need to see the loan amount. This person should see a relation
described, in the relational algebra, by
Π customer-name, loan-number (borrower ⋈ loan)
Any relation that is not of the conceptual model but is made visible to a
user as a “virtual relation” is called a view.
Examples
Consider the view (named all-customer) consisting of branches and their customers:
create view all-customer as
Π branch-name, customer-name (depositor ⋈ account) ∪ Π branch-name, customer-name (borrower ⋈ loan)
Find all customers of the Perryridge branch:
Π customer-name (σ branch-name = “Perryridge” (all-customer))
Updates Through Views
Database modifications expressed as views must be translated to modifications of the actual relations in the database.
Consider the person who needs to see all loan data in the loan relation
except amount. The view given to the person, branch-loan, is defined
as:
create view branch-loan as
Π branch-name, loan-number (loan)
Since we allow a view name to appear wherever a relation name is
allowed, the person may write:
branch-loan ← branch-loan ∪ {(“Perryridge”, L-37)}
The previous insertion must be represented by an insertion into the
actual relation loan from which the view branch-loan is constructed.
An insertion into loan requires a value for amount. The insertion can
be dealt with by either.
o rejecting the insertion and returning an error message to the user.
o inserting a tuple (“L-37”, “Perryridge”, null) into the loan relation
Some updates through views are impossible to translate into
database relation updates.
o create view v as σ branch-name = “Perryridge” (account)
v ← v ∪ {(L-99, Downtown, 23)}
Others cannot be translated uniquely.
o all-customer ← all-customer ∪ {(“Perryridge”, “John”)}
Have to choose loan or account,
and create a new loan/account
number!
INTEGRITY CONSTRAINTS
Domain Constraints
Integrity constraints guard against accidental damage to the database,
by ensuring that authorized changes to the database do not result in a
loss of data consistency.
Domain constraints are the most elementary form of integrity constraint.
They test values inserted in the database, and test queries to ensure that
the comparisons make sense.
New domains can be created from existing data types
Referential Integrity
It ensures that a value that appears in one relation for a given set of
attributes also appears for a certain set of attributes in another relation.
Formal Definition
Let r1 (R1) and r2 (R2) be relations with primary keys K1 and K2 respectively.
A subset α of R2 is a foreign key referencing K1 in relation r1 if for every t2 in r2 there must be a tuple t1 in r1 such that t1[K1] = t2[α].
A referential integrity constraint is also called a subset dependency, since it can be written as
Π α (r2) ⊆ Π K1 (r1)
Insert. If a tuple t2 is inserted into r2, the system must ensure that there is a tuple t1 in r1 such that t1[K] = t2[α]. That is,
t2[α] ∈ Π K (r1)
Delete. If a tuple t1 is deleted from r1, the system must compute the set of tuples in r2 that reference t1:
σ α = t1[K] (r2)
If this set is not empty, the delete command is rejected as an error, or the
tuples that reference t1 must themselves be deleted (cascading deletions are
possible).
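A hedged SQL counterpart of these rules (table and column names assumed): the insert rule is enforced by the FOREIGN KEY declaration itself, and the cascading-deletion option of the delete rule corresponds to ON DELETE CASCADE:
ALTER TABLE borrower ADD CONSTRAINT borrower_loan_fk FOREIGN KEY (loan_number) REFERENCES loan (loan_number) ON DELETE CASCADE;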
• Set operators are binary and will only work on two relations or sets of
data.
• Can only be used on union compatible sets
R (A1, A2, …, AN) and S(B1, B2, …, BN) are union compatible
if: degree (R) = degree (S) = N
domain (Ai) = domain (Bi) for all i
Sailors × Reserves:
sid  sname   rating  age   sid  bid  day
22   Dustin  7       45.0  22   101  10/10/96
22   Dustin  7       45.0  58   103  11/12/96
31   Lubber  8       55.5  22   101  10/10/96
31   Lubber  8       55.5  58   103  11/12/96
58   Rusty   10      35.0  22   101  10/10/96
58   Rusty   10      35.0  58   103  11/12/96
Intersection (∩)
Assuming that R and S are union compatible:
Intersection: R ∩ S is the set of tuples in both R and S.
Note that R ∩ S = S ∩ R.
Example:
SELECT S.sname
FROM Sailors S, Boats B, Reserves R
WHERE S.sid = R.sid and R.bid = B.bid and B.color = ‘red’
INTERSECT
SELECT S.sname
FROM Sailors S, Boats B, Reserves R
WHERE S.sid = R.sid and R.bid = B.bid and B.color = ‘green’;
Difference (–)
• Difference: R – S is the set of tuples that appear in R but do not appear in S.
• (R – S) ≠ (S – R)
Example :
SELECT S.sname
FROM Sailors S, Boats B, Reserves R
WHERE S.sid = R.sid and R.bid = B.bid and B.color = ‘red’
MINUS
SELECT S.sname
FROM Sailors S, Boats B, Reserves R
WHERE S.sid = R.sid and R.bid = B.bid and B.color = ‘green’;
Projection (Π) Example:
Faculty (fnum, name, office, salary, rank)
• Π name, office (Faculty)
• Π fnum, salary (Faculty)
Rename (ρ)
• Used to give a name to the resulting relation
• Notation to make relational algebra easier to write and understand
• We can now use the resulting relation in another relational algebra expression.
• Notation:
<New Name> ← <Relational Expression>
Example:
Faculty (fnum, name, office, salary, rank)
Associates ← σ rank = “associate” (Faculty)
Result ← Π name (Associates)
Join Operation
• Join is a commonly used sequence of operators.
– Take the Cartesian product of two relations.
– Select only related tuples.
– (Possibly) eliminate duplicate columns.
Example:
• Result ← R ⋈ dcode = code S
Kinds of Joins
Theta join: a join with some condition specified
Equijoin: a join where the only comparison operator used is “=”
Natural join
– It is an equijoin followed by the removal of duplicate (superfluous) column(s)
– A Natural join is denoted by (*).
– Standard definition requires that the columns used to join the tables have the
same name.
Size of a Natural Join
• If R contains nR tuples and S contains nS tuples, then the size of R ⋈ S is between 0 and nR × nS.
Example Queries
Find the names of all customers who have a loan and an account at the Perryridge branch:
{ <c> | ∃ l (<c, l> ∈ borrower
∧ ∃ b, a (<l, b, a> ∈ loan ∧ b = “Perryridge”))
∧ ∃ a (<c, a> ∈ depositor
∧ ∃ b, n (<a, b, n> ∈ account ∧ b = “Perryridge”)) }
Find the names of all customers who have an account at all branches located in Brooklyn:
{ <c> | ∃ s, n (<c, s, n> ∈ customer)
∧ ∀ x, y, z (<x, y, z> ∈ branch ∧ y = “Brooklyn”
⇒ ∃ a, b (<a, x, b> ∈ account ∧ <c, a> ∈ depositor)) }
Find the names of all customers having a loan at the Perryridge branch:
{ t | ∃ s ∈ borrower (t[customer-name] = s[customer-name]
∧ ∃ u ∈ loan (u[branch-name] = “Perryridge”
∧ u[loan-number] = s[loan-number])) }
Find the names of all customers who have a loan at the Perryridge branch, but no account at any branch of the bank:
{ t | ∃ s ∈ borrower (t[customer-name] = s[customer-name]
∧ ∃ u ∈ loan (u[branch-name] = “Perryridge”
∧ u[loan-number] = s[loan-number]))
∧ not ∃ v ∈ depositor (v[customer-name] = t[customer-name]) }
Find the names of all customers having a loan from the Perryridge branch, and the cities they live in:
{ t | ∃ s ∈ loan (s[branch-name] = “Perryridge”
∧ ∃ u ∈ borrower (u[loan-number] = s[loan-number]
∧ t[customer-name] = u[customer-name])
∧ ∃ v ∈ customer (u[customer-name] = v[customer-name]
∧ t[customer-city] = v[customer-city])) }
Find the names of all customers who have an account at all branches located in Brooklyn:
{ t | ∃ c ∈ customer (t[customer-name] = c[customer-name])
∧ ∀ s ∈ branch (s[branch-city] = “Brooklyn”
⇒ ∃ u ∈ account (s[branch-name] = u[branch-name]
∧ ∃ d ∈ depositor (t[customer-name] = d[customer-name]
∧ d[account-number] = u[account-number]))) }
Find the names of all customers who have a loan from the Perryridge branch and the loan amount:
{ <c, a> | ∃ l (<c, l> ∈ borrower
∧ ∃ b (<l, b, a> ∈ loan ∧ b = “Perryridge”)) }
Safety of Expressions
{ <x1, x2, …, xn> | P(x1, x2, …, xn) } is safe if all of the following hold:
1. All values that appear in tuples of the expression are values from dom(P) (that is, the values appear either in P or in a tuple of a relation mentioned in P).
2. For every “there exists” subformula of the form ∃ x (P1(x)), the subformula is true if and only if there is a value of x in dom(P1) such that P1(x) is true.
3. For every “for all” subformula of the form ∀ x (P1(x)), the subformula is true if and only if P1(x) is true for all values x from dom(P1).
RELATIONAL DATABASE DESIGN
A functional dependency is a particular relationship between two attributes: for every valid instance of one attribute, that value uniquely determines the value of the other attribute. For a relation R, attribute B is functionally dependent on attribute A, written A → B.
Functional Dependencies Definition
PK (Primary Key) → A1, A2, …, An
Formally the FD is defined as follows
o If X and Y are two sets of attributes that are subsets of R:
for any two tuples t1 and t2 in r, if t1[X] = t2[X], we must also have t1[Y] = t2[Y].
Notation:
o If the values of Y are determined by the values of X, then it is denoted by X → Y.
o Given the value of one attribute, we can determine the value of another attribute:
X f.d. Y or X → Y
E.g. loan-number → customer-name
A functional dependency is trivial if it is satisfied by all instances of a relation.
o E.g.
o customer-name, loan-number → customer-name
o customer-name → customer-name
o In general, α → β is trivial if β ⊆ α.
Closure of a Set of Functional Dependencies
Given a set F set of functional dependencies, there are certain other
func- tional dependencies that are logically implied by F.
o E.g. If A → B and B → C, then we can infer that A → C.
The set of all functional dependencies logically implied by F is the
closure of
F.
We denote the closure of F by F+.
We can find all of F+ by applying Armstrong’s Axioms:
o if β ⊆ α, then α → β (reflexivity)
o if α → β, then γα → γβ (augmentation)
o if α → β and β → γ, then α → γ (transitivity)
These rules are
o sound (generate only functional dependencies that actually hold)
and
o complete (generate all functional dependencies that hold).
Example
R = (A, B, C, G, H, I)
F = { A → B
A → C
CG → H
CG → I
B → H }
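Some members of F+ that follow by the axioms (a sketch of the derivations):
A → H (transitivity: A → B and B → H)
AG → I (augmentation of A → C by G gives AG → CG; transitivity with CG → I)
CG → HI (from CG → H and CG → I by the union rule, itself derivable from the axioms)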
Redundancy:
Data for branch-name, branch-city, assets are repeated for each loan that a branch makes.
o Wastes space
o Complicates updating, introducing the possibility of inconsistency of the assets value
Null values:
o Cannot store information about a branch if no loans exist
o Can use null values, but they are difficult to handle.
Decomposition
Decompose the relation schema Lending-schema into:
Branch-schema = (branch-name, branch-city, assets)
Loan-info-schema = (customer-name, loan-number, branch-name, amount)
All attributes of an original schema (R) must appear in the decomposition (R1, R2):
R = R1 ∪ R2
Lossless-join decomposition: for all possible relations r on schema R,
r = ΠR1(r) ⋈ ΠR2(r)
Example of Non Lossless-Join Decomposition
Decomposition of R = (A, B) into R1 = (A) and R2 = (B):

r:              ΠA(r):     ΠB(r):
A    B          A          B
α    1          α          1
α    2          β          2
β    1

ΠA(r) ⋈ ΠB(r):
A    B
α    1
α    2
β    1
β    2
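The spurious tuples are easy to exhibit in code. In this small Python check (the values 'a' and 'b' stand in for the Greek symbols of the table above), joining the two projections yields a tuple that was never in r:

# The decomposition of r(A, B) into its A- and B-projections is lossy:
# the natural join of the projections (a plain cross product here, since
# the two schemas share no attributes) contains a spurious tuple.
r = {("a", 1), ("a", 2), ("b", 1)}
proj_A = {t[0] for t in r}
proj_B = {t[1] for t in r}
joined = {(x, y) for x in proj_A for y in proj_B}
print(joined - r)    # {('b', 2)} -- spurious
print(joined == r)   # False: not a lossless-join decomposition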
o Preferably the decomposition should be dependency preserving, that is,
  (F1 ∪ F2 ∪ … ∪ Fn)+ = F+
  Otherwise, checking updates for violation of functional dependencies may require computing joins, which is expensive.
Example
R = (A, B, C)
F = {A -> B, B -> C}
Can be decomposed in two different ways:
o R1 = (A, B), R2 = (B, C)
  Lossless-join decomposition: R1 ∩ R2 = {B} and B -> BC
  Dependency preserving
o R1 = (A, B), R2 = (A, C)
  Lossless-join decomposition: R1 ∩ R2 = {A} and A -> AB
  Not dependency preserving (cannot check B -> C without computing R1 ⋈ R2)
Testing for Dependency Preservation
To check if a functional dependency α -> β is preserved in a decomposition of R into R1, R2, …, Rn, we apply the following simplified test (with attribute closure done with respect to F):
result = α
while (changes to result) do
  for each Ri in the decomposition
    t = (result ∩ Ri)+ ∩ Ri
    result = result ∪ t
If result contains all attributes in β, then the functional dependency α -> β is preserved.
We apply the test on all dependencies in F to check if a decomposition is dependency preserving.
This procedure takes polynomial time, instead of the exponential time required to compute F+ and (F1 ∪ F2 ∪ … ∪ Fn)+.
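A hedged Python sketch of this test follows (schemas as attribute sets and FDs as frozenset pairs are our own encoding); it reproduces both outcomes of the R = (A, B, C) example above:

# Dependency-preservation test: grow `result` from alpha using, for each Ri,
# the attribute closure (result ∩ Ri)+ ∩ Ri computed w.r.t. the full set F.
def attribute_closure(attrs, fds):
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

def preserves(alpha, beta, F, decomposition):
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for Ri in decomposition:
            t = attribute_closure(result & Ri, F) & Ri
            if not t <= result:
                result |= t
                changed = True
    return beta <= result   # alpha -> beta is preserved iff beta is in result

F = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
# R1 = (A, B), R2 = (B, C): every FD in F is preserved.
print(preserves({"A"}, {"B"}, F, [set("AB"), set("BC")]))  # True
# R1 = (A, B), R2 = (A, C): B -> C is not preserved.
print(preserves({"B"}, {"C"}, F, [set("AB"), set("AC")]))  # False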
TYPES OF NORMAL FORMS
First Normal Form (1NF)
2NF
3NF
Boyce-Codd NF
4NF
5NF
The above (3NF decomposition) algorithm ensures:
o Each relation schema Ri is in 3NF
o The decomposition is dependency preserving and lossless-join.
Example
Relation schema:
Banker-info-schema = (branch-name, customer-name, banker-name, office-number)
The functional dependencies for this relation schema are:
banker-name -> branch-name office-number
customer-name branch-name -> banker-name
The key is: {customer-name, branch-name}
BUT:
Space overhead: for storing the materialized view.
Time overhead: need to keep the materialized view up to date when relations are updated.
The database system may not support key declarations on materialized views.
Multivalued Dependencies
There are database schemas in BCNF that do not seem to be sufficiently normalized.
Consider a database
o classes(course, teacher, book)
such that (c, t, b) ∈ classes means that t is qualified to teach c, and b is a required textbook for c.
The database is supposed to list for each course the set of teachers any one of which can be the course’s instructor, and the set of books, all of which are required for the course (no matter who teaches it).

course              teacher      book
database            Avi          DB Concepts
database            Avi          Ullman
database            Hank         DB Concepts
database            Hank         Ullman
database            Sudarshan    DB Concepts
database            Sudarshan    Ullman
operating systems   Avi          OS Concepts
operating systems   Avi          Shaw
operating systems   Jim          OS Concepts
operating systems   Jim          Shaw

Fig 3.22 Multi valued Dependencies

There are no non-trivial functional dependencies and therefore the relation is in BCNF.
Insertion anomalies – i.e. if Sara is a new teacher who can teach database, two tuples need to be inserted: (database, Sara, DB Concepts) and (database, Sara, Ullman).
Therefore, it is better to decompose classes into:
teaches:
course              teacher
database            Avi
database            Hank
database            Sudarshan
operating systems   Avi
operating systems   Jim

text:
course              book
database            DB Concepts
database            Ullman
operating systems   OS Concepts
operating systems   Shaw

Fig 3.23 Decomposed Tables
We shall see that these two relations are in Fourth Normal Form (4NF).
A multivalued dependency α →→ β holds on R if, in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α] = t2[α], there exist tuples t3 and t4 in r such that:
t1[α] = t2[α] = t3[α] = t4[α]
t3[β] = t1[β]
t3[R – β] = t2[R – β]
t4[β] = t2[β]
t4[R – β] = t1[R – β]
Let us see the tabular representation of α →→ β:
Fig 3.24 Tabular representation of α →→ β
Example
Let R be a relation schema with a set of attributes that are partitioned into 3 nonempty subsets Y, Z, W.
We say that Y →→ Z (Y multidetermines Z) if and only if for all possible relations r(R):
if <y1, z1, w1> ∈ r and <y2, z2, w2> ∈ r
then <y1, z1, w2> ∈ r and <y2, z2, w1> ∈ r.
Note that since the behavior of Z and W are identical, it follows that Y →→ Z if and only if Y →→ W.
In our example:
course →→ teacher
course →→ book
The above formal definition is supposed to formalize the notion that given a particular value of Y (course) it has associated with it a set of values of Z (teacher) and a set of values of W (book), and these two sets are in some sense independent of each other.
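This swap condition can be checked mechanically. The following Python sketch rebuilds the classes table of Fig 3.22 as a set of (course, teacher, book) triples and verifies that course →→ teacher holds:

# Check the MVD swap condition for Y ->-> Z on the classes relation:
# for every pair of tuples agreeing on Y (course), the tuples obtained by
# swapping their W components (book) must also be present in the relation.
classes = {(c, t, b)
           for c, ts, bs in [
               ("database", ["Avi", "Hank", "Sudarshan"], ["DB Concepts", "Ullman"]),
               ("operating systems", ["Avi", "Jim"], ["OS Concepts", "Shaw"])]
           for t in ts for b in bs}

def multidetermines(r):
    # Y = course (index 0), Z = teacher (index 1), W = book (index 2)
    return all((t1[0], t1[1], t2[2]) in r and (t2[0], t2[1], t1[2]) in r
               for t1 in r for t2 in r if t1[0] == t2[0])

print(multidetermines(classes))  # True: course ->-> teacher holds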
Note:
o If Y -> Z then Y →→ Z.
o Indeed we have (in the above notation) z1 = z2, and the claim follows.
Theory of MVDs
From the definition of multivalued dependency, we can derive the following rule:
o If α -> β, then α →→ β.
That is, every functional dependency is also a multivalued dependency.
The closure D+ of D is the set of all functional and multivalued dependencies logically implied by D.
o We can compute D+ from D, using the formal definitions of functional dependencies and multivalued dependencies.
o We can manage with such reasoning for very simple multivalued dependencies, which seem to be most common in practice.
o For complex dependencies, it is better to reason about sets of dependencies using a system of inference rules.
Fourth Normal Form
A relation schema R is in 4NF with respect to a set D of functional and multivalued dependencies if for all multivalued dependencies in D+ of the form α →→ β, where α ⊆ R and β ⊆ R, at least one of the following holds:
o α →→ β is trivial (i.e., β ⊆ α or α ∪ β = R)
o α is a superkey for schema R
If a relation is in 4NF it is in BCNF.
Restriction of Multivalued Dependencies
The restriction of D to Ri is the set Di consisting of:
o All functional dependencies in D+ that include only attributes of Ri
o All multivalued dependencies of the form α →→ (β ∩ Ri), where α ⊆ Ri and α →→ β is in D+.
4NF Decomposition Algorithm
result := {R};
done := false;
compute D+;
Let Di denote the restriction of D+ to Ri.
while (not done)
  if (there is a schema Ri in result that is not in 4NF)
  then begin
    let α →→ β be a nontrivial multivalued dependency that holds on Ri
      such that α -> Ri is not in Di, and α ∩ β = ∅;
    result := (result – Ri) ∪ (Ri – β) ∪ (α, β);
  end
  else done := true;
Note: each Ri is in 4NF, and decomposition is lossless-join
Example
R = (A, B, C, G, H, I)
F = { A →→ B
      B →→ HI
      CG →→ H }
R is not in 4NF since A →→ B and A is not a superkey for R.
Decomposition
a) R1 = (A, B) (R1 is in 4NF)
b) R2 = (A, C, G, H, I) (R2 is not in 4NF)
c) R3 = (C, G, H) (R3 is in 4NF)
d) R4 = (A, C, G, I) (R4 is not in 4NF)
Since A →→ B and B →→ HI, we have A →→ HI, and hence A →→ I on R4.
e) R5 = (A, I) (R5 is in 4NF)
f) R6 = (A, C, G) (R6 is in 4NF)
Fig 3.28 The loan Relation
Fig 3.29 The branch Relation
Key words:
Data model, schema, Relational Model, Instance, Top down, Bottom up, anomaly, null value, tuple, attribute, select, project, union, set difference, cartesian-product, composition, rename, intersection, join, division, assignment, aggregate, insertion, update, deletion, SQL, group, view, integrity constraints, normalization, dependency.
Short Answers :
1. Define Data model.
2. Define Schema.
3. Define Relational model.
4. Write a short on the following.
a. Two design approaches
b. Informal Measures for Design
c. Modifications of the Database
d. BASIC QUERIES USING SINGLE ROW FUNCTIONS
e. COMPLEX QUERIES USING GROUP FUNCTIONS
f. INTEGRITY CONSTRAINTS
g. Functional Dependencies
h. Multivalued Dependencies
Answer in Detail :
1. Explain Basic Structure of Relational Model.
2. Explain Relational Algebra Operations.
3. Explain Formal Definition of Relational Algebra.
4. Explain Aggregate Functions and Operations.
5. Explain STRUCTURED QUERY LANGUAGE.
6. Explain VIEWS & View Operations.
7. Explain RELATIONAL ALGEBRA AND CALCULUS Operations.
8. Explain Tuple Relational Calculus.
9. Explain Domain Relational Calculus.
10. Explain Relational Database Design.
11. Explain Normalization – Types of Normal forms.
Summary :
Relation is a tabular representation of data.
Simple and intuitive, currently the most widely used.
Integrity constraints can be specified by the DBA, based on
application semantics. DBMS checks for violations.
Two important ICs: primary and foreign keys
In addition, we always have domain constraints.
Powerful and natural query languages exist.
Rules exist to translate the ER model to the relational model.
o The relational model has rigorously defined query languages that
are simple and powerful.
o Relational algebra is more operational; useful as internal representation for query evaluation plans.
o Several ways of expressing a given query; a query optimizer
should choose the most efficient version.
o Relational calculus is non-operational, and users define queries
in terms of what they want, not in terms of how to compute it.
(Declarativeness.)
o Algebra and safe calculus have same expressive power, leading
to the notion of relational completeness.
o SQL was an important factor in the early acceptance of the
relational model; more natural than earlier, procedural query
languages.
o Relationally complete; in fact, significantly more expressive
power than relational algebra.
o Even queries that can be expressed in RA can often be
expressed more naturally in SQL.
o Many alternative ways to write a query; optimizer should look
for most efficient evaluation plan.
In practice, users need to be aware of how queries are optimized
and evaluated for best results.
o NULL for unknown field values brings many complications.
o SQL allows specification of rich integrity constraints.
o Triggers respond to changes in the database.
References :
1. Ramez Elmasri and Shamkant B. Navathe, “Fundamentals of Database Systems”, Third Edition, Pearson Education, Delhi, 2002.
2. Abraham Silberschatz, Henry F. Korth and S. Sudarshan, “Database System Concepts”, Fourth Edition, McGraw-Hill, 2002.
3. C. J. Date, “An Introduction to Database Systems”, Seventh Edition, Pearson Education, Delhi, 2002.
INTRODUCTION
Basic Principles of Query Execution
o Many DB operations require reading tuples, tuple vs. previous tuples,
or tuples vs. tuples in another table.
o Techniques generally used for implementing operations:
Iteration: for/while loop comparing with all tuples on disk
Index: if comparison of attribute that’s indexed, look up matches
in index & return those
Sort/merge: iteration against presorted data (interesting orders)
Hash: build hash table of the tuple list, probe the hash table
Must be able to support larger-than-memory data
Query Processing
Basic Steps in Query Processing
• Parsing and translation
  Translate the query into its internal form, which is then translated into relational algebra. The parser checks syntax and verifies relations.
• Evaluation
  The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.
• Optimizer
  The selection of an optimal execution plan.
Optimization
• A relational algebra expression may have many equivalent expressions.
  E.g., σbalance<2500(Πbalance(account)) is equivalent to Πbalance(σbalance<2500(account)).
• Each relational algebra operation can be evaluated using one of several different algorithms; correspondingly, a relational-algebra expression can be evaluated in many ways.
• An annotated expression specifying a detailed evaluation strategy is called an evaluation plan.
  E.g., one can use an index on balance to find accounts with balance < 2500, or perform a complete relation scan and discard accounts with balance ≥ 2500.
• Query Optimization: amongst all equivalent evaluation plans, choose the one with the lowest estimated cost.
Selection Operation
• File scan – search algorithms that locate and retrieve records that fulfill a selection condition.
• Algorithm A1 (linear search): scan each file block and test all records to see whether they satisfy the selection condition.
Nested-Loops Join
o Very simple to implement; supports any join predicates
o Cost: # comparisons = t(R) × t(S)
  # disk accesses = b(R) + t(R) × b(S)
Index Nested-Loops Join
For each tuple in outer relation
For each match in inner’s index
Retrieve inner tuple + output joined tuple
Cost: b(R) + t(R) * cost of matching in S
For each R tuple, costs of probing index are about:
o 1.2 I/Os for a hash index, 2-4 for a B+-tree, and:
o Clustered index: 1 I/O on average
o Unclustered index: Up to 1 I/O per S tuple
Two-Pass Algorithms
It is different from one pass algorithm since it executes in two phases.
Sort-based
o Need to do a multiway sort first (or have an index)
o Approximately linear in practice, about 2 × b(T) for table T
Hash-based
Store one relation in a hash table
(Sort-)Merge Join
o Requires data sorted by join attributes
Merge and join sorted files, reading sequentially a block at a
time
Maintain two file pointers
o While tuple at R < tuple at S, advance R (and vice versa)
o While tuples match, output all possible pairings
Preserves sorted order of “outer” relation
Very efficient for presorted data
Can be “hybridized” with NL Join for range joins
May require a sort before (adds cost + delay)
Cost: b(R) + b(S) plus sort costs, if necessary
In practice, approximately linear, about 3 × (b(R) + b(S))
Hash-Based Joins
Allows partial pipelining of operations with equality comparisons
Sort-based operations block, but allow range and inequality comparisons
Hash joins are usually done with a static number of hash buckets
Generally have fairly long chains at each bucket
What happens when memory is too small?
Hash Join
Read entire inner relation into hash table (join attributes as key)
For each tuple from outer, look up in hash table & join
Very efficient, very good for databases
Not fully pipelined
Supports equijoins only
Delay-sensitive
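A minimal in-memory sketch of the algorithm in Python (the relations and join attributes below are illustrative, loosely following the Sailors/Reserves schema used later in this chapter):

# Hash join: build a hash table on the inner relation keyed by the join
# attribute, then probe it with each outer tuple. Equijoins only.
from collections import defaultdict

def hash_join(outer, inner, outer_key, inner_key):
    table = defaultdict(list)
    for s in inner:                       # build phase: read inner once
        table[s[inner_key]].append(s)
    result = []
    for r in outer:                       # probe phase: read outer once
        for s in table.get(r[outer_key], []):
            result.append(r + s)
    return result

reserves = [(22, 101), (58, 103)]           # (sid, bid)
sailors = [(22, "Dustin"), (58, "Rusty")]   # (sid, sname)
print(hash_join(reserves, sailors, 0, 0))
# [(22, 101, 22, 'Dustin'), (58, 103, 58, 'Rusty')]

Each relation is read once, mirroring the b(R) + b(S) cost noted above.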
Other Operators
Duplicate removal very similar to grouping
All attributes must match
No aggregate
Union, difference, intersection:
o Read table R, build hash/search tree
o Read table S, add/discard tuples as required
Cost: b(R) + b(S)
OVERVIEW OF COST ESTIMATION & QUERY OPTIMIZATION
A query plan: algebraic tree of operators, with choice of algorithm for each op
Two main issues in optimization:
For a given query, which possible plans are considered?
o Algorithm to search plan space for cheapest (estimated) plan
How is the cost of a plan estimated?
o Ideally: Want to find best plan
o Practically: Avoid worst plans!
The System-R Optimizer: Establishing the
Basic Model
Most widely used model; works well for < 10 joins
Cost estimation: Approximate art at best
o Statistics, maintained in system catalogs, used to estimate cost
of operations and result sizes
o Considers combination of CPU and I/O costs
Plan Space: Too large, must be pruned
o Only the space of left-deep plans is considered.
o Left-deep plans allow the output of each operator to be pipelined into the next operator without storing it in a temporary relation.
o Cartesian products are avoided.
Schema for Examples
o Reserves:
Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
o Sailors:
Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization
An SQL query is parsed into a collection of query blocks, and
these are optimized one block at a time.
Nested blocks are usually treated as calls to a subroutine,
made once per outer tuple.
Relational Algebra Equivalences
Allow us to choose different join orders and to ‘push’ selections
and projections ahead of joins.
Selections:
o Cascade: σc1∧c2(R) ≡ σc1(σc2(R))
o Commute: σc1(σc2(R)) ≡ σc2(σc1(R))
More Equivalences
A projection commutes with a selection that only uses attributes retained
by the projection.
Selection between attributes of the two arguments of a cross-product converts the cross-product to a join.
A selection on ONLY attributes of R commutes with R ⋈ S:
σ(R ⋈ S) ≡ σ(R) ⋈ S
If a projection follows a join R ⋈ S, we can “push” it by retaining only attributes of R (and S) that are needed for the join or are kept by the projection.
Enumeration of Alternative Plans
There are two main cases:
o Single-relation plans
o Multiple-relation plans
For queries over a single relation, queries consist of a combination of
selects, projects, and aggregate ops:
o Each available access path (file scan / index) is considered, and the
one with the least estimated cost is chosen.
The different operations are essentially carried out together (e.g., if an index is used for a selection, projection is done for each retrieved tuple, and the resulting tuples are pipelined into the aggregate computation).
Cost Estimation
For each plan considered, must estimate cost:
Must estimate cost of each operation in plan tree.
o Depends on input cardinalities.
Must also estimate size of result for each operation in tree!
o Use information about the input relations.
For selections and joins, assume independence of predicates.
Assume can calculate a “reduction factor” (RF) for each selection predicate.
Estimates for Single-Relation Plans
Index I on primary key matches selection:
o Cost is Height(I)+1 for a B+ tree, about 1.2 for hash index.
Clustered index I matching one or more selects:
o (NPages(I)+NPages(R)) * product of RF’s of matching selects.
Non-clustered index I matching one or more selects:
o (NPages(I)+NTuples(R)) * product of RF’s of matching selects.
Sequential scan of file:
o NPages(R).
Example
Given an index I on rating (RF = 1/NKeys(I)):
o (1/NKeys(I)) * NTuples(R) = (1/10) * 40000 tuples retrieved
Clustered index: (1/NKeys(I)) * (NPages(I)+NPages(R)) = (1/10)
* (50+500) pages are retrieved
Unclustered index: (1/NKeys(I)) * (NPages(I)+NTuples(R)) = (1/10)
* (50+40000) pages are retrieved
Given an index on sid:
o Would have to retrieve all tuples/pages. With a clustered index,
the cost is 50+500, with unclustered index, 50+40000
A simple sequential scan:
o We retrieve all file pages (500)
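These estimates are simple arithmetic over catalog statistics. A short Python sketch reproduces the numbers above, assuming NTuples(R) = 40000, NPages(R) = 500, NPages(I) = 50 and NKeys(I) = 10 on rating:

# Access-path estimates with reduction factor RF = 1/NKeys(I).
NTUPLES, NPAGES_R, NPAGES_I, NKEYS = 40_000, 500, 50, 10

tuples_retrieved = NTUPLES / NKEYS           # 4000.0 tuples
clustered = (NPAGES_I + NPAGES_R) / NKEYS    # 55.0 pages
unclustered = (NPAGES_I + NTUPLES) / NKEYS   # 4005.0 pages (up to 1 I/O per tuple)
scan = NPAGES_R                              # 500 pages, full sequential scan

print(tuples_retrieved, clustered, unclustered, scan)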
SEMANTIC QUERY OPTIMIZATION
Remember that a relational query does not specify how the system should compute the result; the system must nevertheless compute the result efficiently and meaningfully, which is why we call it a semantic query optimizer.
Legacy systems let the user dictate the access strategy and could therefore provide better performance; relational systems instead rely on the query optimizer.
Modern optimizers are relatively sophisticated and therefore may provide better performance than if the user were to define the strategies.
Optimizer has two tasks:
o Logical Optimization : generates a sequence of relational
algebra operations which will solve the query
o Physical Optimization : determines an efficient means of carrying
out each operation.
Optimization
Automatic query optimisation should have:
o cardinality of each domain
o cardinality of each table
o number of values in each column
o number of times each different value occurs in each
column etc
The optimizer should:
o make an accurate assessment of the efficiency of any strategy and choose the most efficient
o be able to try many different strategies
Query optimization has had lots of research; in some databases a query can be issued to see the query execution plan.
System Tables - System Catalogue (e.g. DB2)
o 20-30 tables of systems information on
o base tables, views, applications, users, authorization,
application plans etc
systables: has one row for each base table in the
database, base table name, creator number of
columns etc.
syscolumns: one row for each column of each relation in the
database column name, base table, data type etc.
sysindexes: one row for every index in the database
index name, base table name, creator name etc.
can be accessed by SQL statements during development of an application.
e.g. SELECT * FROM syscolumns WHERE tablename = 'Employee';
Simple Optimization Plan
Fig 4.2 Supplier-Parts Relationship (supplier supplies parts, n:m)
Focuses on:
o Utilisation of resources
o Responsiveness to enquiries
o general productivity
Interest:
o Computer Management : summary reports
o Database Manager/Administrator : timing reports -> capacity planning
o Research / System Designers: very detailed timing reports
The Performance life cycle
NOTES
performance
DATABASE report
SYSTEM
usuesrer
perfcohrma
nagnecse performance
request measurement
result
interpretation
system
tuning
environmental
change user
changes
IBM Kit
database
response time
extraction file
Database tool
System SAS
database
buffer
file
CS 6302 160
o The OS can provide some services to the DBMS, e.g. locking, which affect performance.
o The OS can actually impair performance; some DBMS choose to ignore some sub-components of the OS.
DBMS Application Software
o Most relational systems comply with a minimal SQL standard, but the performance of each system differs greatly because of different application types.
o Remember: measurement and analysis of overall system performance is difficult because of the many interoperating components.
o So standard benchmarks were developed to test most of these components in a particular type of environment.
TRANSACTION PROCESSING & PROPERTIES
Transaction Concept
A transaction is a unit of program execution that accesses and
possibly updates various data items.
A transaction must see a consistent database.
During transaction execution the database may be inconsistent.
When the transaction is committed, the database must be consistent.
Two main issues to deal with:
o Failures of various kinds such as hardware failures and
system crashes
o Concurrent execution of multiple transactions
ACID Properties
To preserve integrity of data, the database system must ensure:
Atomicity. Either all operations of the transaction or none are properly reflected in
the database.
Consistency. Execution of a transaction in isolation preserves the consistency of
the database.
Isolation. Although multiple transactions may execute concurrently, each
transaction must be unaware of other concurrently executing transactions.
Intermediate transaction results must be hidden from other concurrently executed
transactions. That is, for every pair of transactions Ti and Tj, it appears to Ti that
either Tj finished execution before Ti
started, or Tj started execution after Ti finished.
Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.
Example of Fund Transfer
Transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A – 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
Consistency requirement — the sum of A and B is unchanged by the execution of
the transaction.
Atomicity requirement — if the transaction fails after step 3 and before step 6,
the system should ensure that its updates are not reflected in the database, or
else an inconsistency will result.
Durability requirement — once the user has been notified that the transaction
has completed (i.e. the transfer of the $50 has taken place), the updates to the
database by the transaction must persist despite failures.
Isolation requirement — if between steps 3 and 6, another transaction is allowed to access the partially updated database, it will see an inconsistent database (the sum A + B will be less than it should be).
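As a hedged sketch of the transfer as one atomic unit, here is the same sequence using Python's built-in sqlite3 module (the account table and starting balances are illustrative, not from the text); either both UPDATEs are committed, or the rollback leaves A and B untouched:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 200)])
conn.commit()

try:
    conn.execute("UPDATE account SET balance = balance - 50 WHERE name = 'A'")
    conn.execute("UPDATE account SET balance = balance + 50 WHERE name = 'B'")
    conn.commit()        # durability: from here on the changes persist
except Exception:
    conn.rollback()      # atomicity: neither update survives a failure

print(conn.execute("SELECT name, balance FROM account ORDER BY name").fetchall())
# [('A', 50), ('B', 250)] -- the sum A + B is unchanged (consistency)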
EXEC SQL UPDATE SP
    SET S# = :SY
    WHERE S# = :SX;
EXEC SQL COMMIT;
RETURN ();
Note: the single unit of work here =
• 2 updates to the DBMS
• to 2 separate databases
Therefore a transaction is a sequence of several operations which transforms data from one consistent state to another.
• Either all updates or none of them must be performed, e.g. banking.
• A.C.I.D: Atomic, Consistent, Isolated, Durable
ACID properties
•Atomicity - ‘all or nothing’, transaction is an indivisible unit that is
performed either in its entirety or not at all.
•Consistency - a transaction must transform the database from one consistent
state to another.
•Isolation - transactions execute independently of one another. Partial effects
of incomplete transactions should not be visible to other transactions.
•Durability - effects of a successfully completed (committed) transaction are
stored in DB and not lost despite failure.
Transaction State
Active, the initial state; the transaction stays in this state while it is executing
Partially committed, after the final statement has been executed.
Failed, after the discovery that normal execution can no longer proceed.
Aborted, after the transaction has been rolled back and the database
restored to its state prior to the start of the transaction.
Two options after it has been aborted
Kill the transaction
Restart the transaction, only if no internal logical error
Committed, after successful completion.
Transaction Control/concurrency control
•vital for maintaining the CONSISTENCY of database
•allowing RECOVERY after failure
•allowing multiple users to access database CONCURRENTLY
•database must not be left inconsistent because of failure mid-tx.
•other processes should have consistent view of database.
•completed tx. should be ‘logged’ by DBMS and rerun if failure.
Transaction Boundaries
• SQL identifies start of transaction as BEGIN.
• end of successful transaction as COMMIT.
• or ROLLBACK to abort unsuccessful transaction
• Oracle allows SAVEPOINTS at start of sub-transaction in
nested transactions.
Implementation
• Simplest form - single user
– keep uncommitted updates in memory
– Write to disk immediately once a COMMIT is issued.
– Logged tx. re-executed when system is recovering.
– WAL write ahead log
• before view of the data is written to stable log
• carry out operation on data (in memory first)
– Commit Precedence
• after view of update written in log
[Figure: write-ahead logging — the before-image WAL(A) is written to the log first (1st); the update A = A + 1 is applied to the in-memory copy via Read(A)/Write(A) (2nd); the after-image LOG(A) is written on commit before the data reaches disk.]
CONCURRENT EXECUTIONS
Definition: Concurrent executions are multiple transactions that are allowed to run concurrently in the system.
Advantages of concurrent executions:
o Increased processor and disk utilization, leading to better transaction throughput: one transaction can be using the CPU while another is reading from or writing to the disk
o Reduced average response time for transactions: short transactions need not wait behind long ones
Concurrency control schemes:
They are mechanisms to control the interaction among the concurrent
transactions in order to prevent them from destroying the consistency of the
database.
Schedules
Schedules are sequences that indicate the chronological order in which
instructions of concurrent transactions are executed.
A schedule for a set of transactions must consist of all instructions of those transactions, and must preserve the order in which the instructions appear in each individual transaction.
Example Schedules
Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B.
The following is a serial schedule in which T1 is followed by T2 (Figure 4.8 Schedule 1).
Let T1 and T2 be the transactions defined previously. Schedule 2 is not a serial schedule, but it is equivalent to Schedule 1.
Every view-serializable schedule that is not conflict-serializable has blind writes.
Precedence graph: a directed graph where the vertices are the transactions (names). We draw an arc from Ti to Tj if the two transactions conflict and Ti accessed the data item on which the conflict arose earlier. We may label the arc by the item that was accessed.
Example :
If the precedence graph is acyclic, the serializability order can be obtained by a topological sorting of the graph. This is a linear order consistent with the partial order of the graph.
For example, a serializability order for Schedule A would be T5 → T1 → T3 → T2 → T4.
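A small Python sketch of this topological sort (Kahn's algorithm) follows; the edge list is illustrative, chosen only to reproduce the serializability order quoted above, since Schedule A itself is not reproduced here:

# Topological sort of a precedence graph; leftover nodes signal a cycle.
from collections import deque

edges = [("T5", "T1"), ("T1", "T3"), ("T3", "T2"), ("T2", "T4")]
nodes = {n for e in edges for n in e}
indeg = {n: 0 for n in nodes}
succ = {n: [] for n in nodes}
for u, v in edges:
    succ[u].append(v)
    indeg[v] += 1

order = []
queue = deque(n for n in nodes if indeg[n] == 0)
while queue:
    u = queue.popleft()
    order.append(u)
    for v in succ[u]:
        indeg[v] -= 1
        if indeg[v] == 0:
            queue.append(v)

if len(order) < len(nodes):
    print("cycle: schedule is not conflict serializable")
else:
    print(" -> ".join(order))   # T5 -> T1 -> T3 -> T2 -> T4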
Summary :
Many DB operations require reading tuples, tuple vs. previous tuples, or tuples
vs. tuples in another table.
Basic steps in query processing are parsing, translation, evaluation and optimization.
Basic query operators are one-pass operators and multi-pass operators.
A query plan is an algebraic tree of operators, with a choice of algorithm for each operator.
With the help of database tuning, the performance of a system is determined by a number of factors such as usability, portability and timing.
ACID Properties are Atomicity,Consistency,Isolation and Durability.
Questions :
1. Explain the properties of transaction.
2. Define ACID.
3. Differentiate One pass and Two pass Operators.
4. Define Database Tuning.
5. Explain Transaction failures.
6. Explain Serializability, Query processing, Optimization , Recoverability.
References :
1. Ramez Elmasri and Shamkant B. Navathe, “Fundamentals of Database Systems”, Third Edition, Pearson Education, Delhi, 2002.
2. Abraham Silberschatz, Henry F. Korth and S. Sudarshan, “Database System Concepts”, Fourth Edition, McGraw-Hill, 2002.
3. C. J. Date, “An Introduction to Database Systems”, Seventh Edition, Pearson Education, Delhi, 2002.
LOCKING TECHNIQUES
Lock-Based Protocols
A lock is a mechanism to control concurrent access to a data item. Data items can be locked in two modes:
1. Exclusive (X) mode. Data item can be both read as well as written. X-
lock is requested using lock-X instruction.
2. Shared (S) mode. Data item can only be read. S-lock is requested using
lock-S instruction.
A lock request is granted only if the requested mode is compatible with locks already held on the item by other transactions.
Lock-compatibility matrix:
        S      X
  S   true   false
  X   false  false
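A minimal Python sketch of this compatibility test for a single data item (no queueing, lock upgrades or deadlock handling; purely illustrative):

# S/X lock compatibility for one data item: grant a request only if the
# requested mode is compatible with every lock currently held by others.
COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}

class Item:
    def __init__(self):
        self.holders = {}                  # transaction id -> "S" or "X"

    def request(self, tid, mode):
        if all(COMPATIBLE[(held, mode)]
               for holder, held in self.holders.items() if holder != tid):
            self.holders[tid] = mode
            return True
        return False                       # caller must wait (not modelled)

a = Item()
print(a.request("T1", "S"))   # True  -- shared locks coexist
print(a.request("T2", "S"))   # True
print(a.request("T3", "X"))   # False -- X is incompatible with held S locks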
Granularity
Lock escalation
Promotion/escalation of locks occurs when the percentage of individual rows locked in a table reaches a threshold; the locks become a single table lock, reducing the number of locks to manage.
Oracle does not use escalation.
maintains separate locks for tuples and tables.
SELECT *
FROM T
WHERE condition
FOR UPDATE;

Each row which matches the condition in table T receives an X lock; there is also a 'row exclusive' lock on T itself to prevent another transaction locking table T.
This statement explicitly identifies a set of rows which will be used to update another table in the database, and whose values must remain unchanged throughout the rest of the transaction.
Deadlock
Deadlock occurs when a process is blocked for lack of a resource that is under the control of another process, while itself holding a resource which is needed by that other process.
As depicted in Fig. 5.4, such a state also occurs in a DBMS when more than one transaction competes for the same resources in a multi-user environment.
[Fig 5.4: processes pA and pB each read and process R and S; pA waits on S while pB waits on R — a cycle.]
• Deadlock happens in a DBMS due to blockage of resources by a few processes.
• The DBMS maintains 'wait-for' graphs showing processes waiting and resources held.
• If a cycle is detected, the DBMS selects a victim transaction and rolls it back.
• Furthermore:
  – most recent data
  – number + how far back depends on disk space
  – number of rollback segments for uncommitted tx.
  – log of committed transactions
• Failures can leave the database in a corrupted or inconsistent state - weakening the Consistency and Durability features of transaction processing.
• Oracle levels of failure:
  – statement: causes the relevant tx. to be rolled back - the database returns to its previous state, releasing resources
  – process: abnormal disconnection from a session; rolled back, etc.
  – instance: crash in the DBMS software, operating system, or hardware - the DBA issues a SHUTDOWN ABORT command which triggers instance recovery procedures.
  – media failure: disk head crash; worst case = destroyed database and log files. A previous backup version must be restored from another disk. Data replication.
• Consequently:
  – SHUTDOWN trashes the memory buffers - recovery uses disk data and:
  – ROLLS FORWARD: re-applies committed tx. from the log, which holds before/after images of updated records.
  – ROLLS BACK uncommitted tx. already written to the database, using rollback segments.
5.3.3. Media Recovery
• disk failure - recover from backup tape using Oracle EXPORT/IMPORT
• necessary to backup log files and control files to tape
• The DBA decides whether to keep complete log archives, which are costly in terms of:
  • time
  • space
  • administrative overheads
  – OR accept the cost of manual re-entry
o Ti’s local copy of a data item X is called xi.
We assume, for simplicity, that each data item fits in, and is stored inside, a
single block.
Transaction transfers data items between system buffer blocks and its
private work-area using the following operations :
o read(X) assigns the value of data item X to the local variable xi.
o write(X) assigns the value of local variable xi to data item {X} in
the buffer block.
o Both these commands may necessitate the issue of an input(BX) instruction before the assignment, if the block BX in which X resides is not already in memory.
Transactions
o Perform read(X) while accessing X for the first time;
o All subsequent accesses are to the local copy.
o After last access, transaction executes write(X).
output(BX ) need not immediately follow write(X). System can perform the
output operation when it deems fit.
[Figure: buffer blocks A and B are moved between disk and the memory buffer with input(A) and output(B); transactions T1 and T2 use read(X) and write(Y) to move values between the buffer and their private work areas (x1, y1 for T1; x2 for T2).]
FAILURE & RECOVERY CONCEPTS
Types of Failures
Computer failure
Hardware or software error (or dumb user error)
Transaction failure
Transaction abort
Forced by local transaction error
Disk failure
Disk copy of DB lost
RECOVERY TECHNIQUES
The DBMS recovery subsystem maintains several lists including:
o Active transactions (those that have started but not committed yet)
o Committed transactions since last checkpoint
o Aborted transactions since last checkpoint
Checkpoints
A [checkpoint] is another entry in the log
o Indicates a point at which the system writes out to disk all DBMS buffers
that have been modified.
Any transactions which have committed before a checkpoint will not have to have their WRITE operations redone in the event of a crash, since we can be guaranteed that all their updates were done during the checkpoint operation.
Consists of:
Suspend all executing transactions
Force-write all modified buffers to disk
Write a [checkpoint] record to log
Force the log to disk
Resume executing transactions
Transaction Rollback
If data items have been changed by a failed transaction, the old values must be
restored.
If T is rolled back, if any other transaction has read a value modified by T,
then that transaction must also be rolled back…. cascading effect.
Two Main Techniques
Deferred Update
No physical updates to db until after a transaction commits.
During the commit, log records are made then changes are made permanent
on disk.
What if a Transaction fails?
o No UNDO required
o REDO may be necessary if the changes have not yet been made
permanent before the failure.
Immediate Update
Physical updates to db may happen before a transaction commits.
All changes are written to the permanent log (on disk) before changes are
made to the DB.
What if a Transaction fails?
o After changes are made but before commit – need to UNDO the
changes
o REDO may be necessary if the changes have not yet been made
permanent before the failure.
Recovery based on Deferred Update
Deferred update –
Changes are made in memory and after T commits, the changes are made
permanent on disk.
Changes are recorded in buffers and in log file during T’s execution.
At the commit point, the log is force-written to disk and updates are made
in database.
No need to ever UNDO operations, because changes are never made permanent before the commit.
REDO is needed if the transaction fails after the commit but before the changes
are made on disk.
Hence the name NO UNDO/REDO algorithm
2 lists maintained:
Commit list: committed transactions since last checkpoint
Active list: active transactions
REDO all write operations from commit list
in order that they were written to the log
Active transactions are cancelled & must be resubmitted.
Recovery based on Immediate Update
Immediate update:
Updates to disk can happen at any time
But, updates must still first be recorded in the system logs (on disk) before
changes are made to the database.
Need to provide facilities to UNDO operations which have affected the db
2 flavors of this algorithm:
o UNDO/NO_REDO recovery algorithm
  If the recovery technique ensures that all updates are made to the database on disk before T commits, we do not need to REDO any committed transactions.
o UNDO/REDO recovery algorithm
  The transaction is allowed to commit before all its changes are written to the database. (Note that the log files would be complete at the commit point.)
UNDO/REDO Algorithm
2 lists maintained:
Commit list: committed transactions since last checkpoint
Active list: active transactions
UNDO all write operations of active transactions, in the reverse of the order in which they were written to the log.
REDO all write operations of the committed transactions, in the order in which they were written to the log.
Recovery and Atomicity
Modifying the database without ensuring that the transaction will commit may
leave the database in an inconsistent state.
Consider transaction Ti that transfers $50 from account A to account B; goal
is either to perform all database modifications made by Ti or none at all.
Several output operations may be required for Ti (to output A and B). A failure may occur after one of these modifications has been made but before all of them are made.
To ensure atomicity despite failures, we first output information describing
the modifications to stable storage without modifying the database itself.
We study two approaches:
o log-based recovery, and
o shadow-paging
We assume (initially) that transactions run serially, that is, one after the other.
SHADOW PAGING
Shadow paging is an alternative to log-based recovery; this scheme is
useful if transactions execute serially.
Idea: maintain two page tables during the lifetime of a transaction –the
current page table, and the shadow page table
Store the shadow page table in nonvolatile storage, such that state of the
database prior to transaction execution may be recovered.
o Shadow page table is never modified during execution.
To start with, both the page tables are identical. Only current page table is
used for data item accesses during execution of the transaction.
Whenever any page is about to be written for the first time:
o A copy of this page is made onto an unused page.
o The current page table is then made to point to the copy
o The update is performed on the copy
To commit a transaction :
1. Flush all modified pages in main memory to disk
2. Output current page table to disk
3. Make the current page table the new shadow page table, as follows:
o keep a pointer to the shadow page table at a fixed (known)
location on disk.
o to make the current page table the new shadow page table,
simply update the pointer to point to current page table on disk
Once the pointer to the shadow page table has been written, the transaction is committed.
No recovery is needed after a crash — new transactions can start
right away, using the shadow page table.
Pages not pointed to from current/shadow page table should be
freed (garbage collected).
Advantages of shadow-paging over log-based schemes
o no overhead of writing log records
o recovery is trivial
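A toy Python model of the commit sequence above (the page table is just a dictionary, and "disk" a second dictionary; real systems write the root pointer to a fixed disk location):

# Shadow paging: the shadow table is never touched during the transaction;
# commit is a single atomic swap of the root pointer.
pages = {1: "old-A", 2: "old-B"}       # simulated disk pages
shadow = {"A": 1, "B": 2}              # shadow page table (stable)
current = dict(shadow)                 # current page table (working copy)

def write(name, value):
    """Copy-on-write: the first write to a page goes to a fresh slot."""
    new_page = max(pages) + 1
    pages[new_page] = value
    current[name] = new_page           # shadow table still points at the old page

def commit():
    global shadow
    shadow = dict(current)             # the one atomic step: swap the pointer

write("A", "new-A")
# A crash here would simply keep `shadow`; no recovery work is needed.
commit()
print(shadow, pages[shadow["A"]])      # {'A': 3, 'B': 2} new-A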
Disadvantages :
o Copying the entire page table is very expensive
  • Can be reduced by using a page table structured like a B+-tree
  • No need to copy the entire tree, only the paths in the tree that lead to updated leaf nodes
o Commit overhead is high even with the above extension
  • Need to flush every updated page, and the page table
o Data gets fragmented (related pages get separated on disk)
o After every transaction completion, the database pages containing old versions of modified data need to be garbage collected
o Hard to extend the algorithm to allow transactions to run concurrently
  • Easier to extend log-based schemes
Recovery With Concurrent Transactions
We modify the log-based recovery schemes to allow multiple transactions to execute concurrently.
All transactions share a single disk buffer and a single log.
We assume concurrency control using strict two-phase locking; i.e. the updates of uncommitted transactions should not be visible to other transactions.
(Otherwise, how could we perform undo if T1 updates A, then T2 updates A and commits, and finally T1 has to abort?)
We assume no updates are in progress while the checkpoint is carried out (we will relax this later).
When the system recovers from a crash, it first does the following:
Initialize undo-list and redo-list to empty.
Scan the log backwards from the end, stopping when the first <checkpoint L> record is found.
For each record found during the backward scan:
  if the record is <Ti commit>, add Ti to redo-list
  if the record is <Ti start>, then if Ti is not in redo-list, add Ti to undo-list
For every Ti in L, if Ti is not in redo-list, add Ti to undo-list.
At this point undo-list consists of incomplete transactions which must be undone, and redo-list consists of finished transactions that must be redone.
Recovery now continues as follows:
Scan log backwards from most recent record, stopping when
<Ti start> records have been encountered for every Ti in undo-list.
During the scan, perform undo for each log record that belongs to a
transaction in undo-list.
Locate the most recent <checkpoint L> record.
Scan log forwards from the <checkpoint L> record till the end of the log.
During the scan, perform redo for each log record that belongs to a
transaction on redo-list
Example of Recovery
Go over the steps of the recovery algorithm on the following log:
<T0 start>
<T0, A, 0, 10>
<T0 commit>
<T1 start>
<T1, B, 0, 10>
<T2 start> /* Scan in Step 4 stops here */
<T2, C, 0, 10>
<T2, C, 10, 20>
<checkpoint {T1, T2}>
<T3 start>
<T3, A, 10, 20>
<T3, D, 0, 10>
<T3 commit>
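Mechanizing these steps on exactly this log (the tuple encoding of log records is our own) gives undo-list = {T1, T2} and redo-list = {T3}: T1 and T2's writes are undone to their old values, and T3's writes are re-applied:

# Recovery on the example log: backward scan to the checkpoint builds the
# redo-list; the checkpoint's list L supplies the rest of the undo-list.
log = [("start", "T0"), ("write", "T0", "A", 0, 10), ("commit", "T0"),
       ("start", "T1"), ("write", "T1", "B", 0, 10),
       ("start", "T2"), ("write", "T2", "C", 0, 10), ("write", "T2", "C", 10, 20),
       ("checkpoint", ["T1", "T2"]),
       ("start", "T3"), ("write", "T3", "A", 10, 20), ("write", "T3", "D", 0, 10),
       ("commit", "T3")]

redo_list, undo_list = set(), set()
cp = max(i for i, rec in enumerate(log) if rec[0] == "checkpoint")
for rec in reversed(log[cp + 1:]):                # backward scan to checkpoint
    if rec[0] == "commit":
        redo_list.add(rec[1])
    elif rec[0] == "start" and rec[1] not in redo_list:
        undo_list.add(rec[1])
undo_list |= {t for t in log[cp][1] if t not in redo_list}

db = {}
for rec in reversed(log):                         # undo phase (backward)
    if rec[0] == "write" and rec[1] in undo_list:
        db[rec[2]] = rec[3]                       # restore the old value
for rec in log[cp:]:                              # redo phase (forward)
    if rec[0] == "write" and rec[1] in redo_list:
        db[rec[2]] = rec[4]                       # re-apply the new value

print(sorted(undo_list), sorted(redo_list))       # ['T1', 'T2'] ['T3']
print(db)                                         # {'C': 0, 'B': 0, 'A': 20, 'D': 10}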
Log Record Buffering
Log record buffering: log records are buffered in main memory, instead of
being output directly to stable storage.
Log records are output to stable storage when a block of
log records in the buffer is full, or a log force operation is
executed.
Log force is performed to commit a transaction by forcing all its log
records (including the commit record) to stable storage.
Several log records can thus be output using a single output operation,
reducing the I/O cost.
The rules below must be followed if log records are buffered:
Log records are output to stable storage in the order in
which they are created.
Transaction Ti enters the commit state only when the log record
<Ti commit> has been output to stable storage.
Before a block of data in main memory is output to the
database, all log records pertaining to data in that block must
have been
output to stable storage.
This rule is called the write-ahead logging or WAL rule
Strictly speaking WAL only requires undo information to be output
LOG-BASED RECOVERY
When Ti finishes its last statement, the log record <Ti commit> is written.
We assume that log records are written directly to stable storage (that is, they are not buffered).
Two approaches using logs:
o Deferred database modification
o Immediate database modification
Checkpoints
[Figure: transactions T1–T4 shown against a checkpoint time Tc and a failure time Tf, illustrating which transactions must be redone or undone.]
Introduction to DBA
• Database administrators:
  • use specialized tools for archiving, backup, etc.
  • start/stop the DBMS or take the DB offline (restructuring tables)
  • establish user groups/passwords/access privileges
  • backup and recovery
  • security and integrity
  • import/export (see interoperability)
  • monitoring and tuning for performance
Security
The advantage of having shared access to data is in fact a disadvantage also.
• Consequences: loss of competitiveness, legal action from individuals
• Restrictions:
  – Unauthorized users seeing data
  – Corruption due to deliberate incorrect updates
  – Corruption due to accidental incorrect updates
• Reading ability allocated to those who have a right to know
• Writing capabilities restricted for the casual user - who may accidentally corrupt data due to lack of understanding
• Authorization is restricted to the chosen few to avoid deliberate corruption
[Figure: security controls — non-computer-based controls and computer-based controls applied across data, hardware, DBMS, O.S., communication links and users.]
Authorization in SQL
SQL expresses authorization through GRANT and REVOKE statements, which give and withdraw privileges (e.g. SELECT, INSERT, UPDATE, DELETE) on relations and views to users and roles.
Summary :
A lock is a mechanism to control concurrent access to a data item
THE TWO-PHASE LOCKING PROTOCOL ensures conflict-serializable schedules, and so avoids conflicts between the various schedules.
Failure classifications are transaction failure, system crash and disk failure.
The lists maintained by the DBMS recovery subsystem are active transactions, committed transactions since the last checkpoint, and aborted transactions since the last checkpoint.
The two approaches to recovery and atomicity are log-based recovery and shadow paging.
Log record buffering is buffering log records in main memory, instead of outputting them directly to stable storage.
The database maintains an in-memory buffer of data blocks.
The deferred database modification scheme records all modifications to the log, but defers all the writes to after partial commit.
The immediate database modification scheme allows database updates of an uncommitted transaction to be made as the writes are issued.
Recovery is streamlined by periodically performing checkpointing.
Integrity constraints are implicit constraints, relational constraints, domain constraints and referential constraints.
Questions :
1. Define Lock.
2. Define Two phase locking protocol.
3. Define Check point.
4. Explain System failures.
5. Explain Integrity Constraints.
6. Explain Database Modifications.
7. Explain various recovery procedures.
Answer Vividly :
1. Explain Deadlock.
2. Explain DB Recovery System.
3. Explain DB security.
4. Explain Concurrency in DB system.
References :
1. Ramez Elmasri and Shamkant B. Navathe, “Fundamentals of Database Systems”, Third Edition, Pearson Education, Delhi, 2002.
2. Abraham Silberschatz, Henry F. Korth and S. Sudarshan, “Database System Concepts”, Fourth Edition, McGraw-Hill, 2002.
3. C. J. Date, “An Introduction to Database Systems”, Seventh Edition, Pearson Education, Delhi, 2002.