DB 2014
tgg22
Computer Laboratory
University of Cambridge, UK
Databases, Lent 2014
Timothy G. Griffin (cl.cam.ac.uk) Databases 2014 DB 2014 1 / 167
Lecture 01 : What is a DBMS?
DB vs. IR
Relational Databases
ACID properties
Two fundamental trade-offs
OLTP vs. OLAP
Beyond ACID/Relational model ...
Example Database Management Systems (DBMSs)
A few database examples
Banking : supporting customer accounts, deposits and
withdrawals
University : students, past and present, marks, academic status
Business : products, sales, suppliers
Real Estate : properties, leases, owners, renters
Aviation : flights, seat reservations, passenger info, prices,
payments
Aviation : Aircraft, maintenance history, parts suppliers, parts
orders
Some observations about these DBMSs ...
They contain highly structured data that has been engineered to
model some restricted aspect of the real world
They support the activity of an organization in an essential way
They support concurrent access, both read and write
They often outlive their designers
Users need to know very little about the DBMS technology used
Well designed database systems are nearly transparent, just part
of our infrastructure
Databases vs Information Retrieval
Always ask "What problem am I solving?"
DBMS | IR system
exact query results | fuzzy query results
optimized for concurrent updates | optimized for concurrent reads
data models a narrow domain | domain often open-ended
generates documents (reports) | search existing documents
increase control over information | reduce information overload
And of course there are many systems that combine elements of DB
and IR.
Still the dominant approach : Relational DBMSs
The problem : in 1970 you could not
write a database application without
knowing a great deal about the
low-level physical implementation of
the data.
Codd's radical idea [C1970]: give
users a model of data and a
language for manipulating that data
which is completely independent of
the details of its physical
representation/implementation.
This decouples development of
Database Management Systems
(DBMSs) from the development of
database applications (at least in an
idealized world).
This is the kind of abstraction at the heart of Computer Science!
What services do applications expect from a DBMS?
Transactions: ACID properties
Atomicity: Either all actions are carried out, or none are
(logs needed to undo operations, if needed)
Consistency: If each transaction is consistent, and the database is
initially consistent, then it is left consistent
(application designers must exploit the DBMS's capabilities)
Isolation: Transactions are isolated, or protected, from the effects of
other scheduled transactions
(serializability, 2-phase commit protocol)
Durability: If a transaction completes successfully, then its effects
persist
(logging and crash recovery)
These concepts should be familiar from Concurrent Systems and
Applications.
What constitutes a good DBMS application design?
[Diagram: a real-world change takes one Domain of Interest to another; database update(s) take one Database to another; each Database represents its Domain of Interest.]
At the very least, this diagram should commute!
Does your database design support all required changes?
Can an update corrupt the database?
Relational Database Design
Our tools
Entity-Relationship (ER) modeling: high-level, diagram-based design
Relational modeling: formal model, with normal forms based
on Functional Dependencies (FDs)
SQL implementation: where the rubber meets the road
The ER and FD approaches are complementary
ER facilitates design by allowing communication with domain
experts who may know little about database technology.
FDs allow us to formally explore general design trade-offs, such as
A Fundamental Trade-off in Database Design: the more we
reduce data redundancy, the harder it is to enforce some types of
data integrity. (An example of this is made precise when we look
at 3NF vs. BCNF.)
ER Demo Diagram (Notation follows SKS book)
[ER diagram. Entities: Employee, Mechanic and Salesman (both ISA Employee), RepairJob, Car, Client. Relationships: Does, Repairs, Buys, Sells (with roles buyer and seller). Attributes shown: Name, Number, Description, Cost, Parts, Work, License, Model, Year, Manufacturer, Price, Date, Value, Commission, ID, Phone, Address.]
By Pável Calado,
http://www.texample.net/tikz/examples/entity-relationship-diagram
A Fundamental Trade-off in Database
Implementation: Query response vs. update
throughput
Redundancy is a Bad Thing.
One of the main goals of ER and FD modeling is to reduce data
redundancy. We seek normalized designs.
A normalized database can support high update throughput and
greatly facilitates the task of ensuring semantic consistency and
data integrity.
Update throughput is increased because in a normalized
database a typical transaction need only lock a few data items,
perhaps just one field of one row in a very large table.
Redundancy is a Good Thing.
A de-normalized database can greatly improve the response time
of read-only queries.
Selective and controlled de-normalization is often required in
operational systems.
OLAP vs. OLTP
OLTP: Online Transaction Processing
OLAP: Online Analytical Processing
Commonly associated with terms like Decision
Support, Data Warehousing, etc.
 | OLAP | OLTP
Supports | analysis | day-to-day operations
Data is | historical | current
Transactions mostly | reads | updates
Optimized for | query processing | updates
Normal Forms | not important | important
Example : Data Warehouse (Decision support)
[Diagram: the Operational Database receives fast updates; an Extract step feeds the Data Warehouse, which serves business analysis queries.]
Example : Embedded databases
FIDO = Fetch Intensive Data Organization
Example : Hinxton Bio-informatics
NoSQL Movement (subject of Lectures 11, 12)
A few technologies
Key-value store
Directed Graph Databases
Main-memory stores
Distributed hash tables
Applications
Google's Map-Reduce
Facebook
Cluster-based computing
...
Always remember to ask : What problem am I solving?
Recommended Reading
Textbooks
SKS Silberschatz, A., Korth, H.F. and Sudarshan, S. (2002).
Database system concepts. McGraw-Hill (4th edition).
(Adjust accordingly for other editions)
Chapters 1 (DBMSs)
2 (Entity-Relationship Model)
3 (Relational Model)
4.1-4.7 (basic SQL)
6.1-6.4 (integrity constraints)
7 (functional dependencies and normal
forms)
22 (OLAP)
UW Ullman, J. and Widom, J. (1997). A first course in
database systems. Prentice Hall.
CJD Date, C.J. (2004). An introduction to database systems.
Addison-Wesley (8th ed.).
Reading for the fun of it ...
Research Papers (Google for them)
C1970 E.F. Codd, (1970). "A Relational Model of Data for Large
Shared Data Banks". Communications of the ACM.
F1977 Ronald Fagin (1977) Multivalued dependencies and a
new normal form for relational databases. TODS 2 (3).
L2003 L. Libkin. Expressive power of SQL. TCS, 296 (2003).
C+1996 L. Colby et al. Algorithms for deferred view maintenance.
SIGMOD 1996.
G+1997 J. Gray et al. Data cube: A relational aggregation
operator generalizing group-by, cross-tab, and sub-totals
(1997) Data Mining and Knowledge Discovery.
H2001 A. Halevy. Answering queries using views: A survey.
VLDB Journal. December 2001.
Lecture 02 : The relational data model
Mathematical relations and relational schema
Using SQL to implement a relational schema
Keys
Database query languages
The Relational Algebra
The Relational Calculi (tuple and domain)
a bit of SQL
Let's start with mathematical relations
Suppose that $S_1$ and $S_2$ are sets. The Cartesian product, $S_1 \times S_2$, is
the set
$S_1 \times S_2 = \{(s_1, s_2) \mid s_1 \in S_1,\ s_2 \in S_2\}$
A tabular presentation
name sid age
Fatima fm21 20
Eva ev77 18
James jj25 19
Key Concepts
Relational Key
Suppose $R(X)$ is a relational schema with $Z \subseteq X$. If for any records $u$
and $v$ in any instance of $R$ we have
$u.[Z] = v.[Z] \implies u.[X] = v.[X]$,
then $Z$ is a superkey for $R$. If no proper subset of $Z$ is a superkey, then
$Z$ is a key for $R$. We write $R(\underline{Z}, Y)$ to indicate that $Z$ is a key for
$R(Z \cup Y)$.
Note that this is a semantic assertion, and that a relation can have
multiple keys.
Creating Tables in SQL
create table Students
(sid varchar(10),
name varchar(50),
age int);
-- insert record with attribute names
insert into Students set
name = 'Fatima', age = 20, sid = 'fm21';
-- or insert records with values in same order
-- as in create table
insert into Students values
('jj25', 'James', 19),
('ev77', 'Eva', 18);
Listing a Table in SQL
-- list by attribute order of create table
mysql> select * from Students;
+------+--------+------+
| sid | name | age |
+------+--------+------+
| ev77 | Eva | 18 |
| fm21 | Fatima | 20 |
| jj25 | James | 19 |
+------+--------+------+
3 rows in set (0.00 sec)
Listing a Table in SQL
-- list by specified attribute order
mysql> select name, age, sid from Students;
+--------+------+------+
| name | age | sid |
+--------+------+------+
| Eva | 18 | ev77 |
| Fatima | 20 | fm21 |
| James | 19 | jj25 |
+--------+------+------+
3 rows in set (0.00 sec)
Keys in SQL
A key is a set of attributes that will uniquely identify any record (row) in
a table.
-- with this create table
create table Students
(sid varchar(10),
name varchar(50),
age int,
primary key (sid));
-- if we try to insert this (fourth) student ...
mysql> insert into Students set
name = 'Flavia', age = 23, sid = 'fm21';
ERROR 1062 (23000): Duplicate
entry 'fm21' for key 'PRIMARY'
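The key violation above can be reproduced without a MySQL server. The following is a minimal sketch that assumes Python's built-in sqlite3 module as a stand-in for MySQL; the helper name try_insert is hypothetical, and SQLite reports the violation as an IntegrityError rather than MySQL's ERROR 1062.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server on the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table Students ("
    " sid varchar(10),"
    " name varchar(50),"
    " age int,"
    " primary key (sid))"
)
conn.executemany(
    "insert into Students (sid, name, age) values (?, ?, ?)",
    [("fm21", "Fatima", 20), ("jj25", "James", 19), ("ev77", "Eva", 18)],
)


def try_insert(sid, name, age):
    """Return True if the row was inserted, False if the key rejected it."""
    try:
        conn.execute(
            "insert into Students (sid, name, age) values (?, ?, ?)",
            (sid, name, age),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# Flavia reuses Fatima's sid, so the primary key rejects her record.
duplicate_accepted = try_insert("fm21", "Flavia", 23)
row_count = conn.execute("select count(*) from Students").fetchone()[0]
print(duplicate_accepted, row_count)
```

The table still holds the original three students afterwards: the DBMS, not the application, is enforcing the key.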
What is a (relational) database query language?
Input : a collection of relation instances $R_1, R_2, \ldots, R_k$
Output : a single relation instance $Q(R_1, R_2, \ldots, R_k)$
How can we express Q?
In order to meet Codd's goals we want a query language that is
high-level and independent of physical data representation.
There are many possibilities ...
The Relational Algebra (RA)
$Q ::= R$ (base relation)
$\mid \sigma_p(Q)$ (selection)
$\mid \pi_X(Q)$ (projection)
$\mid Q \times Q$ (product)
$\mid Q - Q$ (difference)
$\mid Q \cup Q$ (union)
$\mid Q \cap Q$ (intersection)
$\mid \rho_M(Q)$ (renaming)
$p$ is a simple boolean predicate over attribute values.
$X = \{A_1, A_2, \ldots, A_k\}$ is a set of attributes.
$M = \{A_1 \mapsto B_1, A_2 \mapsto B_2, \ldots, A_k \mapsto B_k\}$ is a renaming map.
Relational Calculi
The Tuple Relational Calculus (TRC)
$Q = \{t \mid P(t)\}$
The Domain Relational Calculus (DRC)
$Q = \{(A_1 = v_1, A_2 = v_2, \ldots, A_k = v_k) \mid P(v_1, v_2, \ldots, v_k)\}$
The SQL standard
Origins at IBM in early 1970s.
SQL has grown and grown through many rounds of
standardization :
ANSI: SQL-86
Query Language
...
Selection
R
A B C D
20 10 0 55
11 10 0 7
4 99 17 2
77 25 4 0
=
Q(R)
A B C D
20 10 0 55
77 25 4 0
RA $Q = \sigma_{A > 12}(R)$
TRC $Q = \{t \mid t \in R \wedge t.A > 12\}$
DRC $Q = \{\{(A, a), (B, b), (C, c), (D, d)\} \mid \{(A, a), (B, b), (C, c), (D, d)\} \in R \wedge a > 12\}$
SQL select * from R where R.A > 12
Projection
R
A B C D
20 10 0 55
11 10 0 7
4 99 17 2
77 25 4 0
=
Q(R)
B C
10 0
99 17
25 4
RA $Q = \pi_{B,C}(R)$
TRC $Q = \{t \mid \exists u \in R \wedge t.[B, C] = u.[B, C]\}$
DRC $Q = \{\{(B, b), (C, c)\} \mid \exists \{(A, a), (B, b), (C, c), (D, d)\} \in R\}$
SQL select distinct B, C from R
Why the distinct in the SQL?
The SQL query
select B, C from R
will produce a bag (multiset)!
R
A B C D
20 10 0 55
11 10 0 7
4 99 17 2
77 25 4 0
=
Q(R)
B C
10 0
10 0
99 17
25 4
SQL is actually based on multisets, not sets. We will look into this
more in Lecture 11.
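The difference between the bag and the set result can be checked directly. This is a small sketch assuming Python's built-in sqlite3 module in place of MySQL; the variable names bag_rows and set_rows are illustrative only.

```python
import sqlite3

# Load the example relation R from the slide into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("create table R (A int, B int, C int, D int)")
conn.executemany(
    "insert into R values (?, ?, ?, ?)",
    [(20, 10, 0, 55), (11, 10, 0, 7), (4, 99, 17, 2), (77, 25, 4, 0)],
)

# Without distinct, SQL keeps one output row per input row (a bag).
bag_rows = conn.execute("select B, C from R").fetchall()

# With distinct, duplicates are removed, matching the RA projection.
set_rows = conn.execute("select distinct B, C from R").fetchall()

print(sorted(bag_rows))
print(sorted(set_rows))
```

The pair (10, 0) appears twice in the bag result because two different rows of R project onto it.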
Lecture 03 : Entity-Relationship (E/R) modelling
Outline
Entities
Relationships
Their relational implementations
n-ary relationships
Generalization
On the importance of SCOPE
Some real-world data ...
... from the Internet Movie Database (IMDb).
Title Year Actor
Austin Powers: International Man of Mystery 1997 Mike Myers
Austin Powers: The Spy Who Shagged Me 1999 Mike Myers
Dude, Where's My Car? 2000 Bill Chott
Dude, Where's My Car? 2000 Marc Lynn
Entities diagrams and Relational Schema
Movie
Title
Year
MovieID
Person
FirstName
LastName
PersonID
These diagrams represent relational schema
Movie(MovieID, Title, Year )
Person(PersonID, FirstName, LastName)
Yes, this ignores types ...
Entity sets (relational instances)
Movie
MovieID Title Year
55871 Austin Powers: International Man of Mystery 1997
55873 Austin Powers: The Spy Who Shagged Me 1999
171771 Dude, Where's My Car? 2000
(Tim used line number from IMDb raw file movies.list as MovieID.)
Person
PersonID FirstName LastName
6902836 Mike Myers
1757556 Bill Chott
5882058 Marc Lynn
(Tim used line number from IMDb raw file actors.list as PersonID.)
Relationships
Movie
Title
MovieID
Year ActsIn Person
FirstName
LastName
PersonID
Foreign Keys and Referential Integrity
Foreign Key
Suppose we have $R(\underline{Z}, Y)$. Furthermore, let $S(W)$ be a relational
schema with $Z \subseteq W$. We say that $Z$ represents a Foreign Key in $S$ for $R$
if for any instance we have $\pi_Z(S) \subseteq \pi_Z(R)$. This is a semantic
assertion.
Referential integrity
A database is said to have referential integrity when all foreign key
constraints are satisfied.
A relational representation
A relational schema
ActsIn(MovieID, PersonID)
With referential integrity constraints
$\pi_{MovieID}(ActsIn) \subseteq \pi_{MovieID}(Movie)$
$\pi_{PersonID}(ActsIn) \subseteq \pi_{PersonID}(Person)$
ActsIn
PersonID MovieID
6902836 55871
6902836 55873
1757556 171771
5882058 171771
Foreign Keys in SQL
create table ActsIn
( MovieID int not NULL,
PersonID int not NULL,
primary key (MovieID, PersonID),
constraint actsin_movie
foreign key (MovieID)
references Movie(MovieID),
constraint actsin_person
foreign key (PersonID)
references Person(PersonID))
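To see the foreign key constraints reject a dangling reference, here is a hedged sketch using Python's built-in sqlite3 module rather than MySQL. Two assumptions are specific to SQLite: foreign key checking must be switched on with a pragma, and the Movie and Person column types are guesses, since the slides leave them unspecified. The helper try_acts_in is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite-specific: foreign key enforcement is off unless enabled per connection.
conn.execute("pragma foreign_keys = on")
conn.executescript(
    """
    create table Movie (MovieID int primary key, Title varchar(100), Year int);
    create table Person (PersonID int primary key,
                         FirstName varchar(50), LastName varchar(50));
    create table ActsIn
      ( MovieID int not NULL,
        PersonID int not NULL,
        primary key (MovieID, PersonID),
        constraint actsin_movie
          foreign key (MovieID) references Movie(MovieID),
        constraint actsin_person
          foreign key (PersonID) references Person(PersonID));
    insert into Movie values
      (55871, 'Austin Powers: International Man of Mystery', 1997);
    insert into Person values (6902836, 'Mike', 'Myers');
    """
)


def try_acts_in(movie_id, person_id):
    """Return True if the ActsIn row was accepted, False if a constraint failed."""
    try:
        conn.execute("insert into ActsIn values (?, ?)", (movie_id, person_id))
        return True
    except sqlite3.IntegrityError:
        return False


valid = try_acts_in(55871, 6902836)      # both referenced rows exist
dangling = try_acts_in(171771, 6902836)  # no Movie with MovieID 171771 yet
print(valid, dangling)
```

The second insert fails because it would break $\pi_{MovieID}(ActsIn) \subseteq \pi_{MovieID}(Movie)$.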
Relational representation of relationships, in general?
That depends ...
Mapping Cardinalities for binary relations, $R \subseteq S \times T$
Relation R is | meaning
many to many | no constraints
one to many | $\forall t \in T,\ s_1, s_2 \in S.\ (R(s_1, t) \wedge R(s_2, t)) \implies s_1 = s_2$
many to one | $\forall s \in S,\ t_1, t_2 \in T.\ (R(s, t_1) \wedge R(s, t_2)) \implies t_1 = t_2$
one to one | one to many and many to one
Note that the database terminology differs slightly from standard
mathematical terminology.
Diagrams for Mapping Cardinalities
[Four ER diagrams of T, R, S, one for each case of relation R: many to many (M : N), one to many (1 : M), many to one (M : 1), one to one (1 : 1).]
Relationships to Relational Schema
[ER diagram: entity T (attributes X, Y) and entity S (attributes Z, W) connected by relationship R (attribute U).]
Relation R is | Schema
many to many (M : N) | $R(\underline{X, Z}, U)$
one to many (1 : M) | $R(\underline{X}, Z, U)$
many to one (M : 1) | $R(X, \underline{Z}, U)$
one to one (1 : 1) | $R(\underline{X}, Z, U)$ and/or $R(X, \underline{Z}, U)$ (alternate keys)
"one to one" does not mean a "1-to-1 correspondence"
[ER diagram: entity T (attributes X, Y) and entity S (attributes Z, W) connected by relationship R (attribute U).]
This database instance is OK
S
Z | W
$z_1$ | $w_1$
$z_2$ | $w_2$
$z_3$ | $w_3$
R
Z | X | U
$z_1$ | $x_2$ | $u_1$
T
X | Y
$x_1$ | $y_1$
$x_2$ | $y_2$
$x_3$ | $y_3$
$x_4$ | $y_4$
Some more real-world data ... (a slight change of
SCOPE)
Title Year Actor Role
Austin Powers: International Man of Mystery 1997 Mike Myers Austin Powers
Austin Powers: International Man of Mystery 1997 Mike Myers Dr. Evil
Austin Powers: The Spy Who Shagged Me 1999 Mike Myers Austin Powers
Austin Powers: The Spy Who Shagged Me 1999 Mike Myers Dr. Evil
Austin Powers: The Spy Who Shagged Me 1999 Mike Myers Fat Bastard
Dude, Where's My Car? 2000 Bill Chott Big Cult Guard 1
Dude, Where's My Car? 2000 Marc Lynn Cop with Whips
How will this change our model?
Will ActsIn remain a binary Relationship?
Movie
Title
Year
MovieID
ActsIn
Role
Person
FirstName
LastName
PersonID
No! An actor can have many roles in the same movie!
Could ActsIn be modeled as a Ternary Relationship?
Movie
Title
Year
MovieID
ActsIn Person
FirstName
LastName
PersonID
Role
Description
Yes, this works!
Can a ternary relationship be modeled with multiple
binary relationships?
Movie HasCasting Casting ActsIn Person
RequiresRole
Role
The Casting entity seems artificial. What attributes would it have?
Sometimes ternary to multiple binary makes more
sense ...
Branch Works-On Employee
Job
Branch Involves Project Assigned-To Employee
Requires
Job
Generalization
Comedy
ISA
Movie
Drama
Questions
Is every movie either comedy or a drama?
Can a movie be a comedy and a drama?
But perhaps this isn't a good model ...
What attributes would distinguish Drama and Comedy entities?
What about Science Fiction?
Perhaps Genre would make a nice entity, which could have a
relationship with Movie.
Would a ternary relationship be better?
Question: What is the right model?
Answer: The question doesn't make sense!
There is no right model ...
It depends on the intended use of the database.
What activity will the DBMS support?
What data is needed to support that activity?
The issue of SCOPE is missing from most textbooks
Suppose that all databases begin life with beautifully designed
schemas.
Observe that many operational databases are in a sorry state.
Conclude that the scope and goals of a database continually
change, and that schema evolution is a difficult problem to solve,
in practice.
Another change of SCOPE ...
Movies with detailed release dates
Title Country Day Month Year
Austin Powers: International Man of Mystery USA 02 05 1997
Austin Powers: International Man of Mystery Iceland 24 10 1997
Austin Powers: International Man of Mystery UK 05 09 1997
Austin Powers: International Man of Mystery Brazil 13 02 1998
Austin Powers: The Spy Who Shagged Me USA 08 06 1999
Austin Powers: The Spy Who Shagged Me Iceland 02 07 1999
Austin Powers: The Spy Who Shagged Me UK 30 07 1999
Austin Powers: The Spy Who Shagged Me Brazil 08 10 1999
Dude, Where's My Car? USA 10 12 2000
Dude, Where's My Car? Iceland 9 02 2001
Dude, Where's My Car? UK 9 02 2001
Dude, Where's My Car? Brazil 9 03 2001
Dude, Where's My Car? Russia 18 09 2001
... and an attribute becomes an entity with a
connecting relation.
Movie
Title
Year
MovieID
Movie
Title
MovieID
Year Released MovieRelease
Country
Date
Year
Month
Day
Lecture 04 : Relational algebra and relational calculus
Outline
Constructing new tuples!
Joins
Limitations of Relational Algebra
Renaming
R
A B C D
20 10 0 55
11 10 0 7
4 99 17 2
77 25 4 0
=
Q(R)
A E C F
20 10 0 55
11 10 0 7
4 99 17 2
77 25 4 0
RA $Q = \rho_{\{B \mapsto E,\ D \mapsto F\}}(R)$
TRC $Q = \{t \mid \exists u \in R \wedge t.A = u.A \wedge t.E = u.B \wedge t.C = u.C \wedge t.F = u.D\}$
DRC $Q = \{\{(A, a), (E, b), (C, c), (F, d)\} \mid \exists \{(A, a), (B, b), (C, c), (D, d)\} \in R\}$
SQL select A, B as E, C, D as F from R
Union
R
A B
20 10
11 10
4 99
S
A B
20 10
77 1000
=
Q(R, S)
A B
20 10
11 10
4 99
77 1000
RA $Q = R \cup S$
TRC $Q = \{t \mid t \in R \vee t \in S\}$
DRC $Q = \{\{(A, a), (B, b)\} \mid \{(A, a), (B, b)\} \in R \vee \{(A, a), (B, b)\} \in S\}$
SQL (select * from R) union (select * from S)
Intersection
R
A B
20 10
11 10
4 99
S
A B
20 10
77 1000
=
Q(R)
A B
20 10
RA $Q = R \cap S$
TRC $Q = \{t \mid t \in R \wedge t \in S\}$
DRC $Q = \{\{(A, a), (B, b)\} \mid \{(A, a), (B, b)\} \in R \wedge \{(A, a), (B, b)\} \in S\}$
SQL (select * from R) intersect (select * from S)
Difference
R
A B
20 10
11 10
4 99
S
A B
20 10
77 1000
=
Q(R)
A B
11 10
4 99
RA $Q = R - S$
TRC $Q = \{t \mid t \in R \wedge t \notin S\}$
DRC $Q = \{\{(A, a), (B, b)\} \mid \{(A, a), (B, b)\} \in R \wedge \{(A, a), (B, b)\} \notin S\}$
SQL (select * from R) except (select * from S)
Wait, are we missing something?
Suppose we want to add information about college membership to our
Student database. We could add an additional attribute for the college.
StudentsWithCollege :
+--------+------+------+--------+
| name | age | sid | college|
+--------+------+------+--------+
| Eva | 18 | ev77 | Kings |
| Fatima | 20 | fm21 | Clare |
| James | 19 | jj25 | Clare |
+--------+------+------+--------+
Put logically independent data in distinct tables?
Students : +--------+------+------+-----+
| name | age | sid | cid |
+--------+------+------+-----+
| Eva | 18 | ev77 | k |
| Fatima | 20 | fm21 | cl |
| James | 19 | jj25 | cl |
+--------+------+------+-----+
Colleges : +-----+---------------+
| cid | college_name |
+-----+---------------+
| k | Kings |
| cl | Clare |
| sid | Sidney Sussex |
| q | Queens |
... .....
But how do we put them back together again?
Product
R
A B
20 10
11 10
4 99
S
C D
14 99
77 100 =
Q(R, S)
A B C D
20 10 14 99
20 10 77 100
11 10 14 99
11 10 77 100
4 99 14 99
4 99 77 100
Note the automatic flattening
RA $Q = R \times S$
TRC $Q = \{t \mid \exists u \in R, v \in S,\ t.[A, B] = u.[A, B] \wedge t.[C, D] = v.[C, D]\}$
DRC $Q = \{\{(A, a), (B, b), (C, c), (D, d)\} \mid \{(A, a), (B, b)\} \in R \wedge \{(C, c), (D, d)\} \in S\}$
SQL select A, B, C, D from R, S
Product is special!
R
A B
20 10
4 99
$\implies R \times \rho_{\{A \mapsto C,\ B \mapsto D\}}(R)$
A B C D
20 10 20 10
20 10 4 99
4 99 20 10
4 99 4 99
$\times$ is the only operation in the Relational Algebra that creates new
records (ignoring renaming),
but it usually creates too many records!
Joins are the typical way of using products in a constrained
manner.
Natural Join
Given $R(X, Y)$ and $S(Y, Z)$, we define the natural join, denoted
$R \bowtie S$, as a relation over attributes $X, Y, Z$ defined as
$R \bowtie S \equiv \{t \mid \exists u \in R, v \in S,\ u.[Y] = v.[Y] \wedge t = u.[X] \cup u.[Y] \cup v.[Z]\}$
In the Relational Algebra:
$R \bowtie S = \pi_{X,Y,Z}(\sigma_{Y = Y'}(R \times \rho_{Y \mapsto Y'}(S)))$
Join example
Students
name sid age cid
Fatima fm21 20 cl
Eva ev77 18 k
James jj25 19 cl
Colleges
cid cname
k Kings
cl Clare
q Queens
... ...
$\implies \pi_{name,cname}(Students \bowtie Colleges)$
name cname
Fatima Clare
Eva Kings
James Clare
The same in SQL
select name, cname
from Students, Colleges
where Students.cid = Colleges.cid
+--------+--------+
| name | cname |
+--------+--------+
| Eva | Kings |
| Fatima | Clare |
| James | Clare |
+--------+--------+
Division
Given $R(X, Y)$ and $S(Y)$, the division of $R$ by $S$, denoted $R \div S$, is the
relation over attributes $X$ defined as (in the TRC)
$R \div S \equiv \{x \mid \forall s \in S,\ x \cup s \in R\}$.
name award
Fatima writing
Fatima music
Eva music
Eva writing
Eva dance
James dance
award
music
writing
dance
=
name
Eva
Division in the Relational Algebra?
Clearly, $R \div S \subseteq \pi_X(R)$. So $R \div S = \pi_X(R) - C$, where $C$ represents
counter examples to the division condition. That is, in the TRC,
$C = \{x \mid \exists s \in S,\ x \cup s \notin R\}$.
$U = \pi_X(R) \times S$ represents all possible $x \cup s$ for $x \in \pi_X(R)$ and
$s \in S$,
so $T = U - R$ represents all those $x \cup s$ that are not in $R$,
so $C = \pi_X(T)$ represents those records $x$ that are counter
examples.
Division in RA
$R \div S \equiv \pi_X(R) - \pi_X((\pi_X(R) \times S) - R)$
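The RA construction of division can be checked against the definition on the awards example. This is a sketch only: it assumes R is modelled as a Python set of (name, award) pairs and S as a set of awards, and the function names divide_by_definition and divide_ra are made up for illustration.

```python
# The awards example: R(name, award) and S(award).
R = {
    ("Fatima", "writing"), ("Fatima", "music"),
    ("Eva", "music"), ("Eva", "writing"), ("Eva", "dance"),
    ("James", "dance"),
}
S = {"music", "writing", "dance"}


def divide_by_definition(r, s):
    """Names x (drawn from pi_X(R)) such that (x, y) is in R for every y in S."""
    names = {x for (x, _) in r}
    return {x for x in names if all((x, y) in r for y in s)}


def divide_ra(r, s):
    """pi_X(R) - pi_X((pi_X(R) x S) - R), following the steps U, T, C."""
    pi_x = {x for (x, _) in r}
    u = {(x, y) for x in pi_x for y in s}  # U: every possible combination
    t = u - r                              # T: combinations missing from R
    c = {x for (x, _) in t}                # C: counter examples
    return pi_x - c


print(divide_by_definition(R, S), divide_ra(R, S))
```

Both functions single out Eva, the only student holding every award in S.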
Query Safety
A query like $Q = \{t \mid t \in R \wedge t \notin S\}$ raises some interesting questions.
Should we allow the following query?
$Q = \{t \mid t \notin S\}$
We want our relations to be finite!
Safety
A (TRC) query
$Q = \{t \mid P(t)\}$
is safe if it is always finite for any database instance.
Problem : query safety is not decidable!
Solution : define a restricted syntax that guarantees safety.
Safe queries can be represented in the Relational Algebra.
Limitations of simple relational query languages
The expressive power of RA, TRC, and DRC are essentially the
same.
stored procedures
recursive queries
... it might be the first step to destroying the integrity of your data
design.
Why not store the value of Q in a table?
R
be this query with X X
and
Q
C
=
W=W
(
Z=Z
(Q
R
Q
R
))
Assertions in SQL
create view C_violations as ....
create assertion check_C
check not (exists C_violations)
Lecture 06 : Schema refinement I
Outline
ER is for top-down and informal (but rigorous) design
FDs are used for bottom-up and formal design and analysis
update anomalies
Reasoning about Functional Dependencies
Heath's rule
Update anomalies
Big Table
sid name college course part term_name
yy88 Yoni New Hall Algorithms I IA Easter
uu99 Uri Kings Algorithms I IA Easter
bb44 Bin New Hall Databases IB Lent
bb44 Bin New Hall Algorithms II IB Michaelmas
zz70 Zip Trinity Databases IB Lent
zz70 Zip Trinity Algorithms II IB Michaelmas
How can we tell if an insert record is consistent with current
records?
Can we record data about a course before students enroll?
Will we wipe out information about a college when last student
associated with the college is deleted?
Redundancy implies more locking ...
... at least for correct transactions!
Big Table
sid name college course part term_name
yy88 Yoni New Hall Algorithms I IA Easter
uu99 Uri Kings Algorithms I IA Easter
bb44 Bin New Hall Databases IB Lent
bb44 Bin New Hall Algorithms II IB Michaelmas
zz70 Zip Trinity Databases IB Lent
zz70 Zip Trinity Algorithms II IB Michaelmas
Change New Hall to Murray Edwards College
output Y Z
Attribute Closure Algorithm
Input : a set of FDs $F$ and a set of attributes $X$.
Output : $Y = \mathrm{closure}(F, X)$
1. $Y := X$
2. while there is some $S \to T \in F$ with $S \subseteq Y$ and $T \not\subseteq Y$, then
$Y := Y \cup T$.
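The algorithm translates almost line for line into code. The sketch below assumes FDs are represented as (left-hand side, right-hand side) pairs of Python sets, which is a choice made here and not something the slides prescribe; the sample FDs are $A, B \to C$, $C \to D$, $D \to A$.

```python
def closure(fds, attrs):
    """Attribute closure: grow Y from X until no FD S -> T adds anything."""
    y = set(attrs)                      # step 1: Y := X
    changed = True
    while changed:                      # step 2: repeat while some FD applies
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= y and not set(rhs) <= y:
                y |= set(rhs)           # Y := Y union T
                changed = True
    return y


# Sample FDs (assumed representation): A,B -> C ; C -> D ; D -> A.
F = [({"A", "B"}, {"C"}), ({"C"}, {"D"}), ({"D"}, {"A"})]

print(sorted(closure(F, {"A", "B"})))
print(sorted(closure(F, {"C"})))
```

The loop terminates because Y only grows and is bounded by the finite set of attributes mentioned in F and X.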
An Example (UW1997, Exercise 3.6.1)
$R(A, B, C, D)$ with $F$ made up of the FDs
$A, B \to C$
$C \to D$
$D \to A$
What is $F^+$?
Brute force!
Let's just consider all possible nonempty sets $X$; there are only 15...
Example (cont.)
$F = \{A, B \to C,\ C \to D,\ D \to A\}$
For the single attributes we have
$\{A\}^+ = \{A\}$,
$\{B\}^+ = \{B\}$,
$\{C\}^+ = \{A, C, D\}$, since $\{C\} \stackrel{C \to D}{\implies} \{C, D\} \stackrel{D \to A}{\implies} \{A, C, D\}$,
$\{D\}^+ = \{A, D\}$, since $\{D\} \stackrel{D \to A}{\implies} \{A, D\}$.
The only new dependency we get with a single attribute on the left is
$C \to A$.
Example (cont.)
$F = \{A, B \to C,\ C \to D,\ D \to A\}$
Now consider pairs of attributes.
$\{A, B\}^+ = \{A, B, C, D\}$, so $A, B \to D$ is a new dependency
$\{A, C\}^+ = \{A, C, D\}$, so $A, C \to D$ is a new dependency
$\{A, D\}^+ = \{A, D\}$, so nothing new.
$\{B, C\}^+ = \{A, B, C, D\}$, so $B, C \to A, D$ is a new dependency
$\{B, D\}^+ = \{A, B, C, D\}$, so $B, D \to A, C$ is a new dependency
$\{C, D\}^+ = \{A, C, D\}$, so $C, D \to A$ is a new dependency
Example (cont.)
$F = \{A, B \to C,\ C \to D,\ D \to A\}$
For the triples of attributes:
$\{A, C, D\}^+ = \{A, C, D\}$,
$\{A, B, D\}^+ = \{A, B, C, D\}$, so $A, B, D \to C$ is a new dependency
$\{A, B, C\}^+ = \{A, B, C, D\}$, so $A, B, C \to D$ is a new dependency
$\{B, C, D\}^+ = \{A, B, C, D\}$, so $B, C, D \to A$ is a new dependency
And since $\{A, B, C, D\}^+ = \{A, B, C, D\}$, we get no new
dependencies with four attributes.
Example (cont.)
We generated 11 new FDs:
$C \to A$ | $A, B \to D$
$A, C \to D$ | $B, C \to A$
$B, C \to D$ | $B, D \to A$
$B, D \to C$ | $C, D \to A$
$A, B, C \to D$ | $A, B, D \to C$
$B, C, D \to A$
Can you see the Key?
$\{A, B\}$, $\{B, C\}$, and $\{B, D\}$ are keys.
Note: this schema is already in 3NF! Why?
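The brute-force enumeration can be replayed mechanically. The following sketch makes two assumptions that are not stated on the slides: FDs are (left, right) pairs of sets, and a "new" FD is counted as a non-trivial dependency with a single attribute on the right that is not already listed in F. Under that convention the count comes out at the same 11.

```python
from itertools import combinations


def closure(fds, attrs):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs of sets."""
    y = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= y and not set(rhs) <= y:
                y |= set(rhs)
                changed = True
    return y


ATTRS = "ABCD"
F = [({"A", "B"}, {"C"}), ({"C"}, {"D"}), ({"D"}, {"A"})]

# FDs already given, split into single-attribute right-hand sides.
given = {(frozenset(lhs), a) for lhs, rhs in F for a in rhs}

new_fds = set()
superkeys = []
for size in range(1, len(ATTRS) + 1):
    for combo in combinations(ATTRS, size):     # all 15 nonempty subsets
        x = frozenset(combo)
        x_plus = closure(F, x)
        for a in x_plus - x:                    # non-trivial consequences only
            if (x, a) not in given:
                new_fds.add((x, a))
        if x_plus == set(ATTRS):
            superkeys.append(x)

# A key is a superkey with no proper subset that is also a superkey.
keys = [k for k in superkeys if not any(s < k for s in superkeys)]

print(len(new_fds), sorted("".join(sorted(k)) for k in keys))
```

Enumerating all subsets is exponential in the number of attributes, so this is only practical for small schemas like this one.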
Consequences of Armstrong's Axioms
Union: If $F \models Y \to Z$ and $F \models Y \to W$, then $F \models Y \to W, Z$.
Pseudo-transitivity: If $F \models Y \to Z$ and $F \models U, Z \to W$, then
$F \models Y, U \to W$.
Decomposition: If $F \models Y \to Z$ and $W \subseteq Z$, then $F \models Y \to W$.
Exercise : Prove these using Armstrong's axioms!
Proof of the Union Rule
Suppose we have
$F \models Y \to Z$,
$F \models Y \to W$.
By augmentation we have
$F \models Y, Y \to Y, Z$,
that is,
$F \models Y \to Y, Z$.
Also using augmentation we obtain
$F \models Y, Z \to W, Z$.
Therefore, by transitivity we obtain
$F \models Y \to W, Z$.
Example application of functional reasoning.
Heath's Rule
Suppose $R(A, B, C)$ is a relational schema with functional
dependency $A \to B$, then
$R = \pi_{A,B}(R) \bowtie_A \pi_{A,C}(R)$.
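Heath's Rule can be spot-checked on small instances. This sketch assumes a relation over (A, B, C) is a Python set of 3-tuples; the sample instances good and bad and all function names are invented for illustration. It shows the decomposition rejoining exactly when $A \to B$ holds in the first instance, and producing spurious tuples in the second, where it does not.

```python
def project(rel, idxs):
    """Projection onto the given tuple positions, with duplicates removed."""
    return {tuple(t[i] for i in idxs) for t in rel}


def join_on_a(ab, ac):
    """Join of an (A, B) relation with an (A, C) relation on attribute A."""
    return {(a, b, c) for (a, b) in ab for (a2, c) in ac if a == a2}


def satisfies_a_to_b(rel):
    """True if no two tuples agree on A but differ on B."""
    seen = {}
    for a, b, _ in rel:
        if seen.setdefault(a, b) != b:
            return False
    return True


def rejoin(rel):
    """pi_{A,B}(R) joined on A with pi_{A,C}(R)."""
    return join_on_a(project(rel, (0, 1)), project(rel, (0, 2)))


good = {(1, "x", 10), (1, "x", 20), (2, "y", 10)}   # A -> B holds
bad = {(1, "x", 10), (1, "y", 20)}                  # A -> B violated

print(rejoin(good) == good, rejoin(bad) == bad)
```

In the bad instance the join invents (1, "x", 20) and (1, "y", 10), which is the failure the functional dependency rules out.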
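Proof of Heath's Rule
We first show that $R \subseteq \pi_{A,B}(R) \bowtie_A \pi_{A,C}(R)$.
If $u = (a, b, c) \in R$, then $u_1 = (a, b) \in \pi_{A,B}(R)$ and
$u_2 = (a, c) \in \pi_{A,C}(R)$.
Since $\{(a, b)\} \bowtie_A \{(a, c)\} = \{(a, b, c)\}$ we know
$u \in \pi_{A,B}(R) \bowtie_A \pi_{A,C}(R)$.
In the other direction we must show $R' = \pi_{A,B}(R) \bowtie_A \pi_{A,C}(R) \subseteq R$.
If $u = (a, b, c) \in R'$, then there must exist tuples
$u_1 = (a, b) \in \pi_{A,B}(R)$ and $u_2 = (a, c) \in \pi_{A,C}(R)$.
This means that there must exist a $u' = (a, b', c) \in R$ such that
$u_2 = \pi_{A,C}(\{(a, b', c)\})$.
However, the functional dependency tells us that $b = b'$, so
$u = (a, b, c) \in R$.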
Closure Example
$R(A, B, C, D, E, F)$ with
$A, B \to C$
$B, C \to D$
$D \to E$
$C, F \to B$
What is the closure of $\{A, B\}$?
$\{A, B\} \stackrel{A,B \to C}{\implies} \{A, B, C\}$
$\stackrel{B,C \to D}{\implies} \{A, B, C, D\}$
$\stackrel{D \to E}{\implies} \{A, B, C, D, E\}$
So $\{A, B\}^+ = \{A, B, C, D, E\}$ and $A, B \to C, D, E$.
Lecture 07 : Normal Forms
Outline
First Normal Form (1NF)
Second Normal Form (2NF)
3NF and BCNF
Multi-valued dependencies (MVDs)
Fourth Normal Form
The Plan
Given a relational schema R(X) with FDs F :
Reason about FDs
$R_{2.1}(A, C)$. This is in BCNF. Done.
$R_{2.2}(B, C)$. This is in BCNF. Done.
Exercise : Try starting with any of the other BCNF violations and see
where you end up.
The GDM does not always preserve dependencies!
$R(A, B, C, D, E)$
$A, B \to C$
$D, E \to C$
$B \to D$
$\{A, B\}^+ = \{A, B, C, D\}$,
so $A, B \to C, D$,
and $\{A, B, E\}$ is a key.
$\{B, E\}^+ = \{B, C, D, E\}$,
so $B, E \to C, D$,
and $\{A, B, E\}$ is a key (again)
Let's try for a BCNF decomposition ...
Decomposition 1
Decompose R(A, B, C, D, E) using $A, B \rightarrow C, D$:
$R_1(A, B, C, D)$. Decompose this using $B \rightarrow D$:
    $R_{1.1}(B, D)$. Done.
    $R_{1.2}(A, B, C)$. Done.
$R_2(A, B, E)$. Done.
But in this decomposition, how will we enforce this dependency?
$D, E \rightarrow C$
Decomposition 2
Decompose R(A, B, C, D, E) using $B, E \rightarrow C, D$:
$R_3(B, C, D, E)$. Decompose this using $D, E \rightarrow C$:
    $R_{3.1}(C, D, E)$. Done.
    $R_{3.2}(B, D, E)$. Decompose this using $B \rightarrow D$:
        $R_{3.2.1}(B, D)$. Done.
        $R_{3.2.2}(B, E)$. Done.
$R_4(A, B, E)$. Done.
But in this decomposition, how will we enforce this dependency?
$A, B \rightarrow C$
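Which dependencies each decomposition loses can be checked mechanically. The sketch below uses the standard textbook test for dependency preservation (repeatedly closing the left-hand side within each component), which is not necessarily the method used in the lectures; attributes are assumed to be single letters, and `closure`, `preserved`, `DECOMP_1` and `DECOMP_2` are hypothetical names.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def preserved(fd, parts, fds):
    """True if fd follows from dependencies checkable inside single parts."""
    lhs, rhs = fd
    z = set(lhs)
    changed = True
    while changed:
        changed = False
        for part in parts:
            part = set(part)
            # Attributes derivable from z using only what is visible in this part.
            gained = (closure(z & part, fds) & part) - z
            if gained:
                z |= gained
                changed = True
    return set(rhs) <= z

FDS = [("AB", "C"), ("DE", "C"), ("B", "D")]
DECOMP_1 = ["BD", "ABC", "ABE"]        # R_1.1, R_1.2, R_2
DECOMP_2 = ["CDE", "BD", "BE", "ABE"]  # R_3.1, R_3.2.1, R_3.2.2, R_4

lost_1 = [fd for fd in FDS if not preserved(fd, DECOMP_1, FDS)]
lost_2 = [fd for fd in FDS if not preserved(fd, DECOMP_2, FDS)]
print(lost_1)  # [('DE', 'C')]
print(lost_2)  # [('AB', 'C')]
```

Each decomposition loses exactly the dependency named on its slide, because no single component contains all the attributes needed to check it.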
Summary
It is always possible to obtain a BCNF decomposition that has the
lossless-join property (using GDM)
$I(R_2)$ $I(R_1)$
Movie Ratings example
Scope = UK
Title                                         Year  Rating
Austin Powers: International Man of Mystery   1997  15
Austin Powers: The Spy Who Shagged Me         1999  12
Dude, Where's My Car?                         2000  15

Scope = Earth
Title                                         Year  Country   Rating
Austin Powers: International Man of Mystery   1997  UK        15
Austin Powers: International Man of Mystery   1997  Malaysia  18SX
Austin Powers: International Man of Mystery   1997  Portugal  M/12
Austin Powers: International Man of Mystery   1997  USA       PG-13
Austin Powers: The Spy Who Shagged Me         1999  UK        12
Austin Powers: The Spy Who Shagged Me         1999  Portugal  M/12
Austin Powers: The Spy Who Shagged Me         1999  USA       PG-13
Dude, Where's My Car?                         2000  UK        15
Dude, Where's My Car?                         2000  USA       PG-13
Dude, Where's My Car?                         2000  Malaysia  18PL
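The effect of widening the scope can be checked directly on the two tables. This is a minimal sketch, assuming each row is a Python dict and that the UK table is simply the UK rows of the Earth table; the helper name `fd_holds` is hypothetical.

```python
def fd_holds(rows, lhs, rhs):
    """True if rows that agree on the lhs columns also agree on the rhs columns."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

COLS = ("Title", "Year", "Country", "Rating")
EARTH = [dict(zip(COLS, r)) for r in [
    ("Austin Powers: International Man of Mystery", 1997, "UK", "15"),
    ("Austin Powers: International Man of Mystery", 1997, "Malaysia", "18SX"),
    ("Austin Powers: International Man of Mystery", 1997, "Portugal", "M/12"),
    ("Austin Powers: International Man of Mystery", 1997, "USA", "PG-13"),
    ("Austin Powers: The Spy Who Shagged Me", 1999, "UK", "12"),
    ("Austin Powers: The Spy Who Shagged Me", 1999, "Portugal", "M/12"),
    ("Austin Powers: The Spy Who Shagged Me", 1999, "USA", "PG-13"),
    ("Dude, Where's My Car?", 2000, "UK", "15"),
    ("Dude, Where's My Car?", 2000, "USA", "PG-13"),
    ("Dude, Where's My Car?", 2000, "Malaysia", "18PL"),
]]
UK = [row for row in EARTH if row["Country"] == "UK"]

uk_ok = fd_holds(UK, ["Title", "Year"], ["Rating"])
earth_ok = fd_holds(EARTH, ["Title", "Year"], ["Rating"])
earth_with_country = fd_holds(EARTH, ["Title", "Year", "Country"], ["Rating"])
print(uk_ok, earth_ok, earth_with_country)  # True False True
```

A check like this can only refute a dependency on the data at hand; a result of True on a sample does not establish that the dependency holds in general.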
Example of attribute migrating to strong entity set
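From single-country scope,
[ER diagram: entity Movie with attributes MovieID, Title, Year, Rating, RatingReason]
to multi-country scope:
[ER diagram: entity Movie (MovieID, Title, Year) linked by relationship Rated (attribute Reason) to entity Rating (Country, RatingValue)]
Note that relation Rated has an attribute!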
Beware of FFDs = Faux Functional Dependencies
(US ratings)
Title                                Year  Rating  RatingReason
Stoned                               2005  R       drug use
Wasted                               2006  R       drug use
High Life                            2009  R       drug use
Poppies: Odyssey of an opium eater   2009  R       drug use
But
$\text{Title} \rightarrow \{\text{Rating}, \text{RatingReason}\}$
is not a functional dependency.
This is a mildly amusing illustration of a real and pervasive problem:
deriving a functional dependency after the examination of a limited set
of data (or after talking to only a few domain experts).
Oh, but the real world is such a bother!
from IMDb raw data file certificates.list
2 Fast 2 Furious (2003) Switzerland:14 (canton of Vaud)
2 Fast 2 Furious (2003) Switzerland:16 (canton of Zurich)
28 Days (2000) Canada:13+ (Quebec)
28 Days (2000) Canada:14 (Nova Scotia)
28 Days (2000) Canada:14A (Alberta)
28 Days (2000) Canada:AA (Ontario)
28 Days (2000) Canada:PA (Manitoba)
28 Days (2000) Canada:PG (British Columbia)
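The lines above show that even Title, Year, Country does not determine Rating. The sketch below finds such conflicts, assuming the layout shown on the slide (title, year in parentheses, Country:Rating, optional region in parentheses); the real file may use different separators, and `PATTERN` and `ratings_by_country` are hypothetical names.

```python
import re
from collections import defaultdict

# Lines copied from the slide; the real file's column separators may differ.
LINES = """\
2 Fast 2 Furious (2003) Switzerland:14 (canton of Vaud)
2 Fast 2 Furious (2003) Switzerland:16 (canton of Zurich)
28 Days (2000) Canada:13+ (Quebec)
28 Days (2000) Canada:14 (Nova Scotia)
28 Days (2000) Canada:14A (Alberta)
28 Days (2000) Canada:AA (Ontario)
28 Days (2000) Canada:PA (Manitoba)
28 Days (2000) Canada:PG (British Columbia)
""".splitlines()

PATTERN = re.compile(
    r"^(?P<title>.+?) \((?P<year>\d{4})\)\s+"
    r"(?P<country>[^:]+):(?P<rating>\S+)"
    r"(?:\s+\((?P<region>.*)\))?$"
)

def ratings_by_country(lines):
    """Map (title, year, country) to the set of ratings seen for it."""
    found = defaultdict(set)
    for line in lines:
        m = PATTERN.match(line)
        if m:  # silently skip lines that do not fit the assumed layout
            found[(m["title"], int(m["year"]), m["country"])].add(m["rating"])
    return found

# Keys with more than one rating violate Title, Year, Country -> Rating.
conflicts = {
    key: sorted(ratings)
    for key, ratings in ratings_by_country(LINES).items()
    if len(ratings) > 1
}
for key, ratings in conflicts.items():
    print(key, ratings)
```

Both films show up as conflicts, which suggests that a faithful model would need the region (canton, province) as part of the key, or a separate notion of rating authority.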
Ternary or multiple binary relationships?
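[Diagram: a ternary relationship R among entities T, S and U, compared with a new entity E connected to S, U and T by binary relationships R1, R2 and R3 respectively]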
Ternary or multiple binary relationships?
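[Diagram: a ternary relationship R among entities T, S and U, compared with two binary relationships, R2 between T and S and R1 between S and U]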
Look again at ER Demo Diagram [2]
How might this be refined using FDs or MVDs?
[ER diagram: entity Employee (Number, Name) with ISA specialisations Mechanic and Salesman; Mechanic Does RepairJob (Number, Description, Cost, Parts, Work); RepairJob Repairs Car (License, Model, Year, Manufacturer); Salesman Buys (Price, Date, Value) and Sells (Date, Value, Commission) Car; Client (ID, Name, Phone, Address) takes part in Buys and Sells, with role labels buyer and seller]
[2] By Pável Calado,
http://www.texample.net/tikz/examples/entity-relationship-diagram
Lecture 10 : On-line Analytical Processing (OLAP)
Outline
Limits of SQL aggregation
OLAP : Online Analytical Processing
Data cubes
Star schema
Limits of SQL aggregation
Flat tables are great for processing, but hard for people to read
and understand.
Pivot tables and cross tabulations (spreadsheet terminology) are
very useful for presenting data in ways that people can
understand.
SQL does not handle pivot tables and cross tabulations well.
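To make the contrast concrete, here is a small sketch of turning a flat table into a cross tabulation outside SQL. The sales rows are hypothetical data invented for illustration, as are the names `pivot` and `render`; only the standard library is used.

```python
from collections import defaultdict

# Hypothetical flat table: one (product, region, amount) row per sale.
SALES = [
    ("apples", "North", 10),
    ("apples", "South", 5),
    ("pears", "North", 7),
    ("apples", "North", 3),
]

def pivot(rows):
    """Sum the values into a nested mapping: row key -> column key -> total."""
    table = defaultdict(lambda: defaultdict(int))
    for row_key, col_key, value in rows:
        table[row_key][col_key] += value
    return table

def render(table):
    """Render the cross tabulation as tab-separated text with row totals."""
    cols = sorted({c for cells in table.values() for c in cells})
    lines = ["\t".join(["", *cols, "Total"])]
    for row_key in sorted(table):
        vals = [table[row_key].get(c, 0) for c in cols]
        lines.append("\t".join([row_key, *map(str, vals), str(sum(vals))]))
    return "\n".join(lines)

crosstab = pivot(SALES)
print(render(crosstab))
```

The column headings here come from the data (the distinct regions), whereas a SQL query fixes its output columns when it is written; this is one reason cross tabulations are awkward to express in plain SQL.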
OLAP vs. OLTP
OLTP : Online Transaction Processing (traditional databases)