Databases 2014

tgg22
Computer Laboratory
University of Cambridge, UK
Databases, Lent 2014
Timothy G. Griffin (cl.cam.ac.uk) Databases 2014
Lecture 01 : What is a DBMS?
DB vs. IR
Relational Databases
ACID properties
Two fundamental trade-offs
OLTP vs. OLAP
Beyond ACID/Relational model ...
Example Database Management Systems (DBMSs)
A few database examples
Banking : supporting customer accounts, deposits and
withdrawals
University : students, past and present, marks, academic status
Business : products, sales, suppliers
Real Estate : properties, leases, owners, renters
Aviation : flights, seat reservations, passenger info, prices,
payments
Aviation : aircraft, maintenance history, parts suppliers, parts
orders
Some observations about these DBMSs ...
They contain highly structured data that has been engineered to
model some restricted aspect of the real world
They support the activity of an organization in an essential way
They support concurrent access, both read and write
They often outlive their designers
Users need to know very little about the DBMS technology used
Well designed database systems are nearly transparent, just part
of our infrastructure
Databases vs Information Retrieval
Always ask "What problem am I solving?"
DBMS                                 IR system
exact query results                  fuzzy query results
optimized for concurrent updates     optimized for concurrent reads
data models a narrow domain          domain often open-ended
generates documents (reports)        search existing documents
increase control over information    reduce information overload
And of course there are many systems that combine elements of DB
and IR.
Still the dominant approach : Relational DBMSs
The problem : in 1970 you could not
write a database application without
knowing a great deal about the
low-level physical implementation of
the data.
Codd's radical idea [C1970]: give
users a model of data and a
language for manipulating that data
which is completely independent of
the details of its physical
representation/implementation.
This decouples development of
Database Management Systems
(DBMSs) from the development of
database applications (at least in an
idealized world).
This is the kind of abstraction at the heart of Computer Science!
What services do applications expect from a DBMS?
Transactions : ACID properties
Atomicity : Either all actions are carried out, or none are.
    Logs are needed to undo operations, if needed.
Consistency : If each transaction is consistent, and the database is
    initially consistent, then it is left consistent.
    Application designers must exploit the DBMS's capabilities.
Isolation : Transactions are isolated, or protected, from the effects of
    other scheduled transactions.
    Serializability, 2-phase commit protocol
Durability : If a transaction completes successfully, then its effects
    persist.
    Logging and crash recovery
These concepts should be familiar from Concurrent Systems and
Applications.
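A minimal sketch of atomicity, using Python's standard sqlite3 module rather than any DBMS discussed in these notes. The Accounts table, its CHECK constraint, and the helper names make_bank, transfer and balances are assumptions made up for illustration.

```python
import sqlite3


def make_bank():
    """Build a tiny in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table Accounts ("
        " id text primary key,"
        " balance integer check (balance >= 0))"
    )
    conn.execute("insert into Accounts values ('a', 100), ('b', 0)")
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates happen or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "update Accounts set balance = balance + ? where id = ?",
                (amount, dst),
            )
            conn.execute(
                "update Accounts set balance = balance - ? where id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed on the debit, so the credit is undone too.
        return False


def balances(conn):
    return dict(conn.execute("select id, balance from Accounts"))
```

If the debit would overdraw the source account, the earlier credit is rolled back as well, so the database is left exactly as it was before the transaction started.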
What constitutes a good DBMS application design?
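[Diagram: a real-world change takes one state of the Domain of Interest to another; database update(s) take one Database state to another; each Database state represents the corresponding Domain of Interest state.]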
At the very least, this diagram should commute!
Does your database design support all required changes?
Can an update corrupt the database?
Relational Database Design
Our tools
Entity-Relationship (ER) modeling : high-level, diagram-based design
Relational modeling : formal model, normal forms based on Functional Dependencies (FDs)
SQL implementation : where the rubber meets the road
The ER and FD approaches are complementary
ER facilitates design by allowing communication with domain
experts who may know little about database technology.
FD allows us to formally explore general design trade-offs, such as
A Fundamental Trade-off in Database Design: the more we
reduce data redundancy, the harder it is to enforce some types of
data integrity. (An example of this is made precise when we look
at 3NF vs. BCNF.)
ER Demo Diagram (Notation follows SKS book)
[ER diagram. Entities: Employee (Number, Name), with ISA subtypes Mechanic and Salesman; RepairJob (Number, Description, Cost, Parts, Work); Car (License, Model, Year, Manufacturer); Client (ID, Name, Phone, Address). Relationships: Does, Repairs, Buys (Price, Date, Value) and Sells (Date, Value, Commission), with Client in the buyer and seller roles.]
By Pável Calado,
http://www.texample.net/tikz/examples/entity-relationship-diagram
A Fundamental Trade-off in Database Implementation:
Query response vs. update throughput
Redundancy is a Bad Thing.
One of the main goals of ER and FD modeling is to reduce data
redundancy. We seek normalized designs.
A normalized database can support high update throughput and
greatly facilitates the task of ensuring semantic consistency and
data integrity.
Update throughput is increased because in a normalized
database a typical transaction need only lock a few data items
(perhaps just one field of one row in a very large table).
Redundancy is a Good Thing.
A de-normalized database can greatly improve the response time
of read-only queries.
Selective and controlled de-normalization is often required in
operational systems.
OLAP vs. OLTP
OLTP : Online Transaction Processing
OLAP : Online Analytical Processing
       Commonly associated with terms like Decision
       Support, Data Warehousing, etc.
                 OLAP               OLTP
Supports         analysis           day-to-day operations
Data is          historical         current
Transactions     mostly reads       updates
Optimized for    query processing   updates
Normal forms     not important      important
Example : Data Warehouse (Decision support)
[Diagram: an Operational Database (fast updates) feeds a Data Warehouse (business analysis queries) through an Extract step.]
Example : Embedded databases
FIDO = Fetch Intensive Data Organization
Example : Hinxton Bio-informatics
NoSQL Movement (subject of Lectures 11, 12)
A few technologies
Key-value store
Directed Graph Databases
Main-memory stores
Distributed hash tables
Applications
Google's Map-Reduce
Facebook
Cluster-based computing
...
Always remember to ask : What problem am I solving?
Recommended Reading
Textbooks
SKS Silberschatz, A., Korth, H.F. and Sudarshan, S. (2002).
Database system concepts. McGraw-Hill (4th edition).
(Adjust accordingly for other editions)
Chapters 1 (DBMSs)
2 (Entity-Relationship Model)
3 (Relational Model)
4.1-4.7 (basic SQL)
6.1-6.4 (integrity constraints)
7 (functional dependencies and normal
forms)
22 (OLAP)
UW Ullman, J. and Widom, J. (1997). A first course in
database systems. Prentice Hall.
CJD Date, C.J. (2004). An introduction to database systems.
Addison-Wesley (8th ed.).
Reading for the fun of it ...
Research Papers (Google for them)
C1970 E.F. Codd, (1970). "A Relational Model of Data for Large
Shared Data Banks". Communications of the ACM.
F1977 Ronald Fagin (1977) Multivalued dependencies and a
new normal form for relational databases. TODS 2 (3).
L2003 L. Libkin. Expressive power of SQL. TCS, 296 (2003).
C+1996 L. Colby et al. Algorithms for deferred view maintenance.
SIGMOD 199.
G+1997 J. Gray et al. Data cube: A relational aggregation
operator generalizing group-by, cross-tab, and sub-totals
(1997) Data Mining and Knowledge Discovery.
H2001 A. Halevy. Answering queries using views: A survey.
VLDB Journal. December 2001.
Lecture 02 : The relational data model
Mathematical relations and relational schema
Using SQL to implement a relational schema
Keys
Database query languages
The Relational Algebra
The Relational Calculi (tuple and domain)
a bit of SQL
Let's start with mathematical relations
Suppose that $S_1$ and $S_2$ are sets. The Cartesian product, $S_1 \times S_2$, is the set
$$S_1 \times S_2 = \{(s_1, s_2) \mid s_1 \in S_1,\ s_2 \in S_2\}$$
A (binary) relation over $S_1 \times S_2$ is any set $r$ with
$$r \subseteq S_1 \times S_2.$$
In a similar way, if we have $n$ sets, $S_1, S_2, \ldots, S_n$, then an $n$-ary relation $r$ is a set
$$r \subseteq S_1 \times S_2 \times \cdots \times S_n = \{(s_1, s_2, \ldots, s_n) \mid s_i \in S_i\}$$
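The definitions of Cartesian product and relation can be tried out directly with Python sets. This is only a sketch: the example sets and the helper name n_ary_product are made up for illustration.

```python
from itertools import product

# Two small example sets (arbitrary values chosen for illustration).
S1 = {1, 2}
S2 = {"a", "b"}

# The Cartesian product S1 x S2 as a set of pairs.
cartesian = set(product(S1, S2))

# A binary relation over S1 x S2 is any subset of the product.
r = {(1, "a"), (2, "b")}
is_relation = r <= cartesian


def n_ary_product(*sets):
    """Cartesian product of any number of sets, as a set of n-tuples."""
    return set(product(*sets))
```

Any subset of n_ary_product(S1, ..., Sn), including the empty set, is an n-ary relation in the sense defined above.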


Relational Schema
Let $X$ be a set of $k$ attribute names.
We will often ignore domains (types) and say that $R(X)$ denotes a
relational schema.
When we write $R(Z, Y)$ we mean $R(Z \cup Y)$ and $Z \cap Y = \emptyset$.
$u.[X] = v.[X]$ abbreviates $u.A_1 = v.A_1 \land \cdots \land u.A_k = v.A_k$.
$\vec{X}$ represents some (unspecified) ordering of the attribute names,
$A_1, A_2, \ldots, A_k$
Mathematical vs. database relations
Suppose we have an $n$-tuple $t \in S_1 \times S_2 \times \cdots \times S_n$. Extracting the $i$-th
component of $t$, say as $\pi_i(t)$, feels a bit low-level.
Solution: (1) Associate a name, $A_i$ (called an attribute name) with
each domain $S_i$. (2) Instead of tuples, use records: sets of pairs
each associating an attribute name $A_i$ with a value in domain $S_i$.
A database relation $R$ over the schema
$A_1 : S_1 \times A_2 : S_2 \times \cdots \times A_n : S_n$ is a finite set
$$R \subseteq \{\{(A_1, s_1), (A_2, s_2), \ldots, (A_n, s_n)\} \mid s_i \in S_i\}$$


Example
A relational schema
Students(name: string, sid: string, age : integer)
A relational instance of this schema
Students = {
  {(name, Fatima), (sid, fm21), (age, 20)},
  {(name, Eva), (sid, ev77), (age, 18)},
  {(name, James), (sid, jj25), (age, 19)}
}
A tabular presentation
name    sid   age
Fatima  fm21  20
Eva     ev77  18
James   jj25  19
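A record, as a set of (attribute name, value) pairs, can be modelled in Python as a frozenset of pairs, and a relation instance as a set of such records. This is a sketch of the idea only; the helper name attr is an assumption, not something from the course.

```python
# The Students instance as a set of records; each record is a set of pairs.
students = {
    frozenset({("name", "Fatima"), ("sid", "fm21"), ("age", 20)}),
    frozenset({("name", "Eva"), ("sid", "ev77"), ("age", 18)}),
    frozenset({("name", "James"), ("sid", "jj25"), ("age", 19)}),
}


def attr(record, name):
    """Look up a component by attribute name instead of by position."""
    return dict(record)[name]


# The order in which the pairs are written does not matter.
same_record = frozenset({("age", 20), ("sid", "fm21"), ("name", "Fatima")})
```

Because components are found by name, reordering the columns of the tabular presentation does not change the relation.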
Key Concepts
Relational Key
Suppose $R(X)$ is a relational schema with $Z \subseteq X$. If for any records $u$
and $v$ in any instance of $R$ we have
$$u.[Z] = v.[Z] \Longrightarrow u.[X] = v.[X],$$
then $Z$ is a superkey for $R$. If no proper subset of $Z$ is a superkey, then
$Z$ is a key for $R$. We write $R(\underline{Z}, Y)$ to indicate that $Z$ is a key for
$R(Z \cup Y)$.
Note that this is a semantic assertion, and that a relation can have
multiple keys.
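Since a key is a semantic assertion about every possible instance, a single instance can refute a proposed superkey but never confirm it. The sketch below checks one instance; records are plain dicts here, and the names project and violates_superkey are hypothetical.

```python
def project(record, attrs):
    """The values of a record on the attributes in attrs, in a fixed order."""
    return tuple(record[a] for a in sorted(attrs))


def violates_superkey(instance, z):
    """True if two distinct records in the instance agree on all attributes in z."""
    seen = {}
    for rec in instance:
        k = project(rec, z)
        if k in seen and seen[k] != rec:
            return True
        seen[k] = rec
    return False


students = [
    {"name": "Fatima", "sid": "fm21", "age": 20},
    {"name": "Eva", "sid": "ev77", "age": 18},
    {"name": "James", "sid": "jj25", "age": 19},
]
```

On this instance neither {"sid"} nor {"age"} is violated, yet only sid is intended as a key: adding one more 20-year-old student breaks {"age"}, which is why keys must be declared rather than inferred from data.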
Creating Tables in SQL
create table Students
(sid varchar(10),
name varchar(50),
age int);
-- insert record with attribute names
insert into Students set
name = 'Fatima', age = 20, sid = 'fm21';
-- or insert records with values in same order
-- as in create table
insert into Students values
('jj25', 'James', 19),
('ev77', 'Eva', 18);
Listing a Table in SQL
-- list by attribute order of create table
mysql> select * from Students;
+------+--------+------+
| sid | name | age |
+------+--------+------+
| ev77 | Eva | 18 |
| fm21 | Fatima | 20 |
| jj25 | James | 19 |
+------+--------+------+
3 rows in set (0.00 sec)
Listing a Table in SQL
-- list by specified attribute order
mysql> select name, age, sid from Students;
+--------+------+------+
| name | age | sid |
+--------+------+------+
| Eva | 18 | ev77 |
| Fatima | 20 | fm21 |
| James | 19 | jj25 |
+--------+------+------+
3 rows in set (0.00 sec)
Keys in SQL
A key is a set of attributes that will uniquely identify any record (row) in
a table.
-- with this create table
create table Students
(sid varchar(10),
name varchar(50),
age int,
primary key (sid));
-- if we try to insert this (fourth) student ...
mysql> insert into Students set
name = 'Flavia', age = 23, sid = 'fm21';
ERROR 1062 (23000): Duplicate
entry 'fm21' for key 'PRIMARY'
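The same key violation can be reproduced without a MySQL server. The sketch below uses Python's standard sqlite3 module, so the error is an sqlite3.IntegrityError rather than MySQL's ERROR 1062, and the portable "insert ... values" form replaces MySQL's "insert ... set".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table Students ("
    " sid varchar(10),"
    " name varchar(50),"
    " age int,"
    " primary key (sid))"
)
conn.executemany(
    "insert into Students (name, age, sid) values (?, ?, ?)",
    [("Fatima", 20, "fm21"), ("James", 19, "jj25"), ("Eva", 18, "ev77")],
)

# A fourth student who reuses an existing sid must be rejected.
try:
    conn.execute(
        "insert into Students (name, age, sid) values ('Flavia', 23, 'fm21')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("select count(*) from Students").fetchone()[0]
```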
What is a (relational) database query language?
Input : a collection of relation instances
Output : a single relation instance
$$R_1, R_2, \cdots, R_k \Longrightarrow Q(R_1, R_2, \cdots, R_k)$$
How can we express $Q$?
In order to meet Codd's goals we want a query language that is
high-level and independent of physical data representation.
There are many possibilities ...
The Relational Algebra (RA)
$$\begin{array}{rcll}
Q & ::= & R & \text{base relation} \\
  & \mid & \sigma_p(Q) & \text{selection} \\
  & \mid & \pi_X(Q) & \text{projection} \\
  & \mid & Q \times Q & \text{product} \\
  & \mid & Q - Q & \text{difference} \\
  & \mid & Q \cup Q & \text{union} \\
  & \mid & Q \cap Q & \text{intersection} \\
  & \mid & \rho_M(Q) & \text{renaming}
\end{array}$$
$p$ is a simple boolean predicate over attribute values.
$X = \{A_1, A_2, \ldots, A_k\}$ is a set of attributes.
$M = \{A_1 \mapsto B_1, A_2 \mapsto B_2, \ldots, A_k \mapsto B_k\}$ is a renaming map.
Relational Calculi
The Tuple Relational Calculus (TRC)
$$Q = \{t \mid P(t)\}$$
The Domain Relational Calculus (DRC)
$$Q = \{(A_1 = v_1, A_2 = v_2, \ldots, A_k = v_k) \mid P(v_1, v_2, \cdots, v_k)\}$$
The SQL standard
Origins at IBM in early 1970s.
SQL has grown and grown through many rounds of
standardization :
    ANSI : SQL-86
    ANSI and ISO : SQL-89, SQL-92, SQL:1999, SQL:2003,
    SQL:2006, SQL:2008
SQL is made up of many sub-languages :
    Query Language
    Data Definition Language
    System Administration Language
    ...
Selection
R
A   B   C   D
20  10  0   55
11  10  0   7
4   99  17  2
77  25  4   0
$\Longrightarrow$ Q(R)
A   B   C   D
20  10  0   55
77  25  4   0
RA : $Q = \sigma_{A > 12}(R)$
TRC : $Q = \{t \mid t \in R \land t.A > 12\}$
DRC : $Q = \{\{(A, a), (B, b), (C, c), (D, d)\} \mid \{(A, a), (B, b), (C, c), (D, d)\} \in R \land a > 12\}$
SQL : select * from R where R.A > 12
Projection
R
A   B   C   D
20  10  0   55
11  10  0   7
4   99  17  2
77  25  4   0
$\Longrightarrow$ Q(R)
B   C
10  0
99  17
25  4
RA : $Q = \pi_{B,C}(R)$
TRC : $Q = \{t \mid \exists u \in R \land t.[B, C] = u.[B, C]\}$
DRC : $Q = \{\{(B, b), (C, c)\} \mid \exists \{(A, a), (B, b), (C, c), (D, d)\} \in R\}$
SQL : select distinct B, C from R
Why the distinct in the SQL?
The SQL query
select B, C from R
will produce a bag (multiset)!
R
A   B   C   D
20  10  0   55
11  10  0   7
4   99  17  2
77  25  4   0
$\Longrightarrow$ Q(R)
B   C
10  0
10  0
99  17
25  4
SQL is actually based on multisets, not sets. We will look into this
more in Lecture 11.
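The difference between set and bag semantics is easy to see in a few lines of Python. In this sketch a relation is a list of dicts; the function names select, project_bag and project_set are illustrative only and are not part of any library.

```python
# The example relation R from the Selection and Projection slides.
R = [
    {"A": 20, "B": 10, "C": 0, "D": 55},
    {"A": 11, "B": 10, "C": 0, "D": 7},
    {"A": 4, "B": 99, "C": 17, "D": 2},
    {"A": 77, "B": 25, "C": 4, "D": 0},
]


def select(rel, pred):
    """Selection: keep the records that satisfy the predicate."""
    return [r for r in rel if pred(r)]


def project_bag(rel, attrs):
    """Projection with duplicates kept, like 'select B, C from R'."""
    return [tuple(r[a] for a in attrs) for r in rel]


def project_set(rel, attrs):
    """Projection with duplicates removed, like 'select distinct B, C from R'."""
    return set(project_bag(rel, attrs))
```

project_bag(R, ["B", "C"]) contains the pair (10, 0) twice, while project_set(R, ["B", "C"]) contains it once, matching the two result tables above.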
Lecture 03 : Entity-Relationship (E/R) modelling
Outline
Entities
Relationships
Their relational implementations
n-ary relationships
Generalization
On the importance of SCOPE
Some real-world data ...
... from the Internet Movie Database (IMDb).
Title                                        Year  Actor
Austin Powers: International Man of Mystery  1997  Mike Myers
Austin Powers: The Spy Who Shagged Me        1999  Mike Myers
Dude, Where's My Car?                        2000  Bill Chott
Dude, Where's My Car?                        2000  Marc Lynn
Entity diagrams and Relational Schema
[E/R diagrams: entity Movie with attributes MovieID, Title, Year; entity Person with attributes PersonID, FirstName, LastName.]
These diagrams represent relational schema
Movie(MovieID, Title, Year)
Person(PersonID, FirstName, LastName)
Yes, this ignores types ...
Entity sets (relational instances)
Movie
MovieID  Title                                        Year
55871    Austin Powers: International Man of Mystery  1997
55873    Austin Powers: The Spy Who Shagged Me        1999
171771   Dude, Where's My Car?                        2000
(Tim used line number from IMDb raw file movies.list as MovieID.)
Person
PersonID  FirstName  LastName
6902836   Mike       Myers
1757556   Bill       Chott
5882058   Marc       Lynn
(Tim used line number from IMDb raw file actors.list as PersonID.)
Relationships
[E/R diagram: Movie (MovieID, Title, Year) and Person (PersonID, FirstName, LastName) connected by the relationship ActsIn.]
Foreign Keys and Referential Integrity
Foreign Key
Suppose we have $R(\underline{Z}, Y)$. Furthermore, let $S(W)$ be a relational
schema with $Z \subseteq W$. We say that $Z$ represents a Foreign Key in $S$ for $R$
if for any instance we have $\pi_Z(S) \subseteq \pi_Z(R)$. This is a semantic
assertion.
Referential integrity
A database is said to have referential integrity when all foreign key
constraints are satisfied.
A relational representation
A relational schema
ActsIn(MovieID, PersonID)
With referential integrity constraints
    $\pi_{MovieID}(ActsIn) \subseteq \pi_{MovieID}(Movie)$
    $\pi_{PersonID}(ActsIn) \subseteq \pi_{PersonID}(Person)$
ActsIn
PersonID  MovieID
6902836   55871
6902836   55873
1757556   171771
5882058   171771
Foreign Keys in SQL
create table ActsIn
( MovieID int not NULL,
PersonID int not NULL,
primary key (MovieID, PersonID),
constraint actsin_movie
foreign key (MovieID)
references Movie(MovieID),
constraint actsin_person
foreign key (PersonID)
references Person(PersonID))
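To watch a foreign key constraint being enforced, the schema can be loaded into SQLite through Python's standard sqlite3 module. This is a sketch under two assumptions: SQLite stands in for MySQL, and foreign key checking must be switched on per connection with "pragma foreign_keys = on". The helper try_insert is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite does not enforce FKs by default
conn.executescript(
    """
    create table Movie (MovieID int primary key, Title text, Year int);
    create table Person (PersonID int primary key, FirstName text, LastName text);
    create table ActsIn
      ( MovieID int not NULL,
        PersonID int not NULL,
        primary key (MovieID, PersonID),
        constraint actsin_movie
          foreign key (MovieID) references Movie(MovieID),
        constraint actsin_person
          foreign key (PersonID) references Person(PersonID));
    insert into Movie values
      (55871, 'Austin Powers: International Man of Mystery', 1997);
    insert into Person values (6902836, 'Mike', 'Myers');
    insert into ActsIn values (55871, 6902836);
    """
)


def try_insert(movie_id, person_id):
    """Attempt to add an ActsIn row; report whether the DBMS accepted it."""
    try:
        conn.execute("insert into ActsIn values (?, ?)", (movie_id, person_id))
        return True
    except sqlite3.IntegrityError:
        return False
```

try_insert(171771, 6902836) is rejected because no Movie row has MovieID 171771, so referential integrity is preserved.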
Relational representation of relationships, in general?
That depends ...
Mapping Cardinalities for binary relations, $R \subseteq S \times T$
Relation R is    meaning
many to many     no constraints
one to many      $\forall t \in T,\ s_1, s_2 \in S.\ (R(s_1, t) \land R(s_2, t)) \Longrightarrow s_1 = s_2$
many to one      $\forall s \in S,\ t_1, t_2 \in T.\ (R(s, t_1) \land R(s, t_2)) \Longrightarrow t_1 = t_2$
one to one       one to many and many to one
Note that the database terminology differs slightly from standard
mathematical terminology.
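The mapping cardinality definitions translate directly into checks on a finite set of pairs. In this sketch a relation is a Python set of (s, t) pairs, and the function names are made up for illustration.

```python
def is_one_to_many(r):
    """Every t is related to at most one s, for pairs (s, t) in r."""
    seen = {}
    for s, t in r:
        # setdefault records the first s seen for t and returns it afterwards.
        if seen.setdefault(t, s) != s:
            return False
    return True


def is_many_to_one(r):
    """Every s is related to at most one t: the same test on the flipped pairs."""
    return is_one_to_many({(t, s) for s, t in r})


def is_one_to_one(r):
    return is_one_to_many(r) and is_many_to_one(r)
```

Note that the empty relation passes all three tests, so a "one to one" relation need not relate every element of S to an element of T.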
Diagrams for Mapping Cardinalities
[Four E/R diagrams connecting T, R and S, one for each case of "Relation R is": many to many (M : N), one to many (1 : M), many to one (M : 1), one to one (1 : 1).]
Relationships to Relational Schema
[E/R diagram: entity T with attributes X and Y, relationship R with attribute U, entity S with attributes Z and W.]
Relation R is            Schema
many to many (M : N)     $R(\underline{X, Z}, U)$
one to many (1 : M)      $R(\underline{X}, Z, U)$
many to one (M : 1)      $R(X, \underline{Z}, U)$
one to one (1 : 1)       $R(\underline{X}, Z, U)$ and/or $R(X, \underline{Z}, U)$ (alternate keys)
"one to one" does not mean a "1-to-1 correspondence"
[E/R diagram: entity T with attributes X and Y, relationship R with attribute U, entity S with attributes Z and W.]
This database instance is OK
S
Z      W
$z_1$  $w_1$
$z_2$  $w_2$
$z_3$  $w_3$
R
Z      X      U
$z_1$  $x_2$  $u_1$
T
X      Y
$x_1$  $y_1$
$x_2$  $y_2$
$x_3$  $y_3$
$x_4$  $y_4$
Some more real-world data ... (a slight change of
SCOPE)
Title                                        Year  Actor       Role
Austin Powers: International Man of Mystery  1997  Mike Myers  Austin Powers
Austin Powers: International Man of Mystery  1997  Mike Myers  Dr. Evil
Austin Powers: The Spy Who Shagged Me        1999  Mike Myers  Austin Powers
Austin Powers: The Spy Who Shagged Me        1999  Mike Myers  Dr. Evil
Austin Powers: The Spy Who Shagged Me        1999  Mike Myers  Fat Bastard
Dude, Where's My Car?                        2000  Bill Chott  Big Cult Guard 1
Dude, Where's My Car?                        2000  Marc Lynn   Cop with Whips
How will this change our model?
Will ActsIn remain a binary Relationship?
[E/R diagram: Movie (MovieID, Title, Year) and Person (PersonID, FirstName, LastName) connected by the relationship ActsIn, which carries the attribute Role.]
No! An actor can have many roles in the same movie!
Could ActsIn be modeled as a Ternary Relationship?
[E/R diagram: ternary relationship ActsIn connecting Movie (MovieID, Title, Year), Person (PersonID, FirstName, LastName) and Role (Description).]
Yes, this works!
Can a ternary relationship be modeled with multiple
binary relationships?
[E/R diagram: Movie, HasCasting, Casting, ActsIn, Person in a chain, with Casting also connected to Role through RequiresRole.]
The Casting entity seems artificial. What attributes would it have?
Sometimes ternary to multiple binary makes more
sense ...
[E/R diagrams: (first) ternary relationship Works-On connecting Branch, Employee and Job; (second) Branch, Involves, Project, Assigned-To, Employee in a chain, plus a relationship Requires leading to Job.]
Generalization
[E/R diagram: Comedy and Drama connected to Movie through ISA.]
Questions
Is every movie either a comedy or a drama?
Can a movie be a comedy and a drama?
But perhaps this isn't a good model ...
What attributes would distinguish Drama and Comedy entities?
What about Science Fiction?
Perhaps Genre would make a nice entity, which could have a
relationship with Movie.
Would a ternary relationship be better?
Question: What is the right model?
Answer: The question doesn't make sense!
There is no right model ...
It depends on the intended use of the database.
What activity will the DBMS support?
What data is needed to support that activity?
The issue of SCOPE is missing from most textbooks
Suppose that all databases begin life with beautifully designed
schemas.
Observe that many operational databases are in a sorry state.
Conclude that the scope and goals of a database continually
change, and that schema evolution is a difficult problem to solve,
in practice.
Another change of SCOPE ...
Movies with detailed release dates
Title                                        Country  Day  Month  Year
Austin Powers: International Man of Mystery  USA      02   05     1997
Austin Powers: International Man of Mystery  Iceland  24   10     1997
Austin Powers: International Man of Mystery  UK       05   09     1997
Austin Powers: International Man of Mystery  Brazil   13   02     1998
Austin Powers: The Spy Who Shagged Me        USA      08   06     1999
Austin Powers: The Spy Who Shagged Me        Iceland  02   07     1999
Austin Powers: The Spy Who Shagged Me        UK       30   07     1999
Austin Powers: The Spy Who Shagged Me        Brazil   08   10     1999
Dude, Where's My Car?                        USA      10   12     2000
Dude, Where's My Car?                        Iceland  9    02     2001
Dude, Where's My Car?                        UK       9    02     2001
Dude, Where's My Car?                        Brazil   9    03     2001
Dude, Where's My Car?                        Russia   18   09     2001
... and an attribute becomes an entity with a
connecting relation.
[E/R diagrams: (before) Movie with attributes MovieID, Title, Year; (after) Movie (MovieID, Title, Year) connected by the relationship Released to a new entity MovieRelease with attributes Country and Date (Year, Month, Day).]
Lecture 04 : Relational algebra and relational calculus
Outline
Constructing new tuples!
Joins
Limitations of Relational Algebra
Renaming
R
A   B   C   D
20  10  0   55
11  10  0   7
4   99  17  2
77  25  4   0
$\Longrightarrow$ Q(R)
A   E   C   F
20  10  0   55
11  10  0   7
4   99  17  2
77  25  4   0
RA : $Q = \rho_{\{B \mapsto E,\ D \mapsto F\}}(R)$
TRC : $Q = \{t \mid \exists u \in R \land t.A = u.A \land t.E = u.B \land t.C = u.C \land t.F = u.D\}$
DRC : $Q = \{\{(A, a), (E, b), (C, c), (F, d)\} \mid \exists \{(A, a), (B, b), (C, c), (D, d)\} \in R\}$
SQL : select A, B as E, C, D as F from R
Union
R
A   B
20  10
11  10
4   99
S
A   B
20  10
77  1000
$\Longrightarrow$ Q(R, S)
A   B
20  10
11  10
4   99
77  1000
RA : $Q = R \cup S$
TRC : $Q = \{t \mid t \in R \lor t \in S\}$
DRC : $Q = \{\{(A, a), (B, b)\} \mid \{(A, a), (B, b)\} \in R \lor \{(A, a), (B, b)\} \in S\}$
SQL : (select * from R) union (select * from S)
Intersection
R
A   B
20  10
11  10
4   99
S
A   B
20  10
77  1000
$\Longrightarrow$ Q(R, S)
A   B
20  10
RA : $Q = R \cap S$
TRC : $Q = \{t \mid t \in R \land t \in S\}$
DRC : $Q = \{\{(A, a), (B, b)\} \mid \{(A, a), (B, b)\} \in R \land \{(A, a), (B, b)\} \in S\}$
SQL : (select * from R) intersect (select * from S)
Difference
R
A   B
20  10
11  10
4   99
S
A   B
20  10
77  1000
$\Longrightarrow$ Q(R, S)
A   B
11  10
4   99
RA : $Q = R - S$
TRC : $Q = \{t \mid t \in R \land t \notin S\}$
DRC : $Q = \{\{(A, a), (B, b)\} \mid \{(A, a), (B, b)\} \in R \land \{(A, a), (B, b)\} \notin S\}$
SQL : (select * from R) except (select * from S)
Wait, are we missing something?
Suppose we want to add information about college membership to our
Student database. We could add an additional attribute for the college.
StudentsWithCollege :
+--------+------+------+--------+
| name | age | sid | college|
+--------+------+------+--------+
| Eva | 18 | ev77 | Kings |
| Fatima | 20 | fm21 | Clare |
| James | 19 | jj25 | Clare |
+--------+------+------+--------+
Put logically independent data in distinct tables?
Students : +--------+------+------+-----+
| name | age | sid | cid |
+--------+------+------+-----+
| Eva | 18 | ev77 | k |
| Fatima | 20 | fm21 | cl |
| James | 19 | jj25 | cl |
+--------+------+------+-----+
Colleges : +-----+---------------+
| cid | college_name |
+-----+---------------+
| k | Kings |
| cl | Clare |
| sid | Sidney Sussex |
| q | Queens |
... .....
But how do we put them back together again?
Product
R
A   B
20  10
11  10
4   99
S
C   D
14  99
77  100
$\Longrightarrow$ Q(R, S)
A   B   C   D
20  10  14  99
20  10  77  100
11  10  14  99
11  10  77  100
4   99  14  99
4   99  77  100
Note the automatic flattening
RA : $Q = R \times S$
TRC : $Q = \{t \mid \exists u \in R, v \in S,\ t.[A, B] = u.[A, B] \land t.[C, D] = v.[C, D]\}$
DRC : $Q = \{\{(A, a), (B, b), (C, c), (D, d)\} \mid \{(A, a), (B, b)\} \in R \land \{(C, c), (D, d)\} \in S\}$
SQL : select A, B, C, D from R, S
Product is special!
R
A   B
20  10
4   99
$\Longrightarrow$ $R \times \rho_{\{A \mapsto C,\ B \mapsto D\}}(R)$
A   B   C   D
20  10  20  10
20  10  4   99
4   99  20  10
4   99  4   99
$\times$ is the only operation in the Relational Algebra that creates new
records (ignoring renaming),
But $\times$ usually creates too many records!
Joins are the typical way of using products in a constrained
manner.
Natural Join
Given $R(X, Y)$ and $S(Y, Z)$, we define the natural join, denoted
$R \bowtie S$, as a relation over attributes $X, Y, Z$ defined as
$$R \bowtie S \equiv \{t \mid \exists u \in R, v \in S,\ u.[Y] = v.[Y] \land t = u.[X] \cup u.[Y] \cup v.[Z]\}$$
In the Relational Algebra:
$$R \bowtie S = \pi_{X,Y,Z}(\sigma_{Y = Y'}(R \times \rho_{\vec{Y} \mapsto \vec{Y'}}(S)))$$
Join example
Students
name    sid   age  cid
Fatima  fm21  20   cl
Eva     ev77  18   k
James   jj25  19   cl
Colleges
cid  cname
k    Kings
cl   Clare
q    Queens
...  ...
$\Longrightarrow$ $\pi_{name, cname}(Students \bowtie Colleges)$
name    cname
Fatima  Clare
Eva     Kings
James   Clare
The same in SQL
select name, cname
from Students, Colleges
where Students.cid = Colleges.cid
+--------+--------+
| name | cname |
+--------+--------+
| Eva | Kings |
| Fatima | Clare |
| James | Clare |
+--------+--------+
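The natural join can also be sketched outside SQL, directly from its definition: pair up records that agree on every shared attribute and merge them. Relations are lists of dicts here, and natural_join is an illustrative name, not a library function.

```python
def natural_join(r, s):
    """Merge every pair of records that agree on all attributes they share."""
    out = []
    for u in r:
        for v in s:
            shared = u.keys() & v.keys()
            if all(u[a] == v[a] for a in shared):
                out.append({**u, **v})
    return out


students = [
    {"name": "Fatima", "sid": "fm21", "age": 20, "cid": "cl"},
    {"name": "Eva", "sid": "ev77", "age": 18, "cid": "k"},
    {"name": "James", "sid": "jj25", "age": 19, "cid": "cl"},
]
colleges = [
    {"cid": "k", "cname": "Kings"},
    {"cid": "cl", "cname": "Clare"},
    {"cid": "q", "cname": "Queens"},
]

# Projection onto (name, cname), as in the join example.
name_cname = {(t["name"], t["cname"]) for t in natural_join(students, colleges)}
```

When the two relations share no attribute names the condition is vacuously true for every pair, so the join degenerates into the product.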
Division
Given $R(X, Y)$ and $S(Y)$, the division of $R$ by $S$, denoted $R \div S$, is the
relation over attributes $X$ defined as (in the TRC)
$$R \div S \equiv \{x \mid \forall s \in S,\ x \cup s \in R\}.$$
name    award
Fatima  writing
Fatima  music
Eva     music
Eva     writing
Eva     dance
James   dance
$\div$
award
music
writing
dance
=
name
Eva
Division in the Relational Algebra?
Clearly, $R \div S \subseteq \pi_X(R)$. So $R \div S = \pi_X(R) - C$, where $C$ represents
counter examples to the division condition. That is, in the TRC,
$$C = \{x \mid \exists s \in S,\ x \cup s \notin R\}.$$
$U = \pi_X(R) \times S$ represents all possible $x \cup s$ for $x \in \pi_X(R)$ and
$s \in S$,
so $T = U - R$ represents all those $x \cup s$ that are not in $R$,
so $C = \pi_X(T)$ represents those records $x$ that are counter
examples.
Division in RA
$$R \div S \equiv \pi_X(R) - \pi_X((\pi_X(R) \times S) - R)$$
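The derivation of division can be followed step by step in Python. In this sketch R is a set of (x, y) pairs and S is a set of y values, a simplification of records chosen to keep the code short; the name divide is made up.

```python
def divide(r, s):
    """R divided by S, computed as pi_X(R) - pi_X((pi_X(R) x S) - R)."""
    xs = {x for x, _ in r}                    # pi_X(R)
    u = {(x, y) for x in xs for y in s}       # U = pi_X(R) x S
    t = u - r                                 # T = U - R: pairs missing from R
    c = {x for x, _ in t}                     # C = pi_X(T): counter examples
    return xs - c


awards = {
    ("Fatima", "writing"), ("Fatima", "music"),
    ("Eva", "music"), ("Eva", "writing"), ("Eva", "dance"),
    ("James", "dance"),
}
all_three = divide(awards, {"music", "writing", "dance"})
```

With the award data from the Division slide, all_three contains only Eva, the one student holding every award in S.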
Query Safety
A query like $Q = \{t \mid t \in R \land t \notin S\}$ raises some interesting questions.
Should we allow the following query?
$$Q = \{t \mid t \notin S\}$$
We want our relations to be finite!
Safety
A (TRC) query
$$Q = \{t \mid P(t)\}$$
is safe if it is always finite for any database instance.
Problem : query safety is not decidable!
Solution : define a restricted syntax that guarantees safety.
Safe queries can be represented in the Relational Algebra.
Limitations of simple relational query languages
The expressive power of RA, TRC, and DRC is essentially the
same.
    None can express the transitive closure of a relation.
We could extend RA to more powerful languages (like Datalog).
SQL has been extended with many features beyond the Relational
Algebra.
    stored procedures
    recursive queries
    ability to embed SQL in standard procedural languages
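As an illustration of the recursive queries mentioned above, the sketch below computes a transitive closure with a recursive common table expression, run in SQLite through Python's standard sqlite3 module. The Edge table and its contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table Edge (src text, dst text);
    insert into Edge values ('a', 'b'), ('b', 'c'), ('c', 'd');
    """
)

# Reach starts as Edge and grows by one extra step per iteration.
# 'union' (rather than 'union all') discards duplicates, which also
# guarantees termination when the graph contains cycles.
closure = set(
    conn.execute(
        """
        with recursive Reach(src, dst) as (
            select src, dst from Edge
            union
            select Reach.src, Edge.dst
            from Reach join Edge on Reach.dst = Edge.src
        )
        select src, dst from Reach
        """
    ).fetchall()
)
```

No fixed Relational Algebra expression can do this for arbitrary input, because the number of joins needed depends on the length of the longest path in the data.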
Lecture 05 : SQL and integrity constraints
Outline
NULL in SQL
three-valued logic
Multisets and aggregation in SQL
Views
General integrity constraints
What is NULL in SQL?
What if you don't know Kim's age?
mysql> select * from students;
+------+--------+------+
| sid | name | age |
+------+--------+------+
| ev77 | Eva | 18 |
| fm21 | Fatima | 20 |
| jj25 | James | 19 |
| ks87 | Kim | NULL |
+------+--------+------+
What is NULL?
NULL is a place-holder, not a value!
NULL is not a member of any domain (type),
For records with NULL for age, an expression like age > 20
must be unknown!
This means we need (at least) three-valued logic.
Let $\bot$ represent "We don't know!"
$\land$   T       F   $\bot$
T         T       F   $\bot$
F         F       F   F
$\bot$    $\bot$  F   $\bot$

$\lor$    T   F       $\bot$
T         T   T       T
F         T   F       $\bot$
$\bot$    T   $\bot$  $\bot$

$v$       $\neg v$
T         F
F         T
$\bot$    $\bot$
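The three truth tables can be written as small Python functions, using None for the unknown value. This is a sketch of the logic only; the names and3, or3 and not3 are made up for illustration.

```python
def and3(a, b):
    """Three-valued AND: False dominates, then unknown (None)."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a, b):
    """Three-valued OR: True dominates, then unknown (None)."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def not3(a):
    """Three-valued NOT: unknown stays unknown."""
    return None if a is None else (not a)
```

For example, and3(None, False) is False because the result does not depend on the unknown operand, while and3(None, True) remains unknown.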
NULL can lead to unexpected results
mysql> select * from students;
+------+--------+------+
| sid | name | age |
+------+--------+------+
| ev77 | Eva | 18 |
| fm21 | Fatima | 20 |
| jj25 | James | 19 |
| ks87 | Kim | NULL |
+------+--------+------+
mysql> select * from students where age <> 19;
+------+--------+------+
| sid | name | age |
+------+--------+------+
| ev77 | Eva | 18 |
| fm21 | Fatima | 20 |
+------+--------+------+
select ... where P
The select statement only returns those records where the where
predicate evaluates to true.
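The surprising result can be reproduced with Python's standard sqlite3 module, which follows the same rule for NULL in a where clause. SQLite stands in for MySQL here, and the helper names is a hypothetical convenience that should only be given fixed predicate strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table students (sid text, name text, age int)")
conn.executemany(
    "insert into students values (?, ?, ?)",
    [
        ("ev77", "Eva", 18),
        ("fm21", "Fatima", 20),
        ("jj25", "James", 19),
        ("ks87", "Kim", None),  # None is stored as NULL
    ],
)


def names(where):
    """Names of the students for which the predicate evaluates to true."""
    query = f"select name from students where {where} order by sid"
    return [name for (name,) in conn.execute(query)]


not_19 = names("age <> 19")
is_19 = names("age = 19")
unknown = names("age is null")
```

Kim appears in neither not_19 nor is_19: both comparisons evaluate to unknown for her record, and only "age is null" selects it.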
The ambiguity of NULL
Possible interpretations of NULL
There is a value, but we don't know what it is.
No value is applicable.
The value is known, but you are not allowed to see it.
...
A great deal of semantic muddle is created by conflating all of these
interpretations into one non-value.
On the other hand, introducing distinct NULLs for each possible
interpretation leads to very complex logics ...
Not everyone approves of NULL
C. J. Date [D2004], Chapter 19
Before we go any further, we should make it very clear that in our
opinion (and in that of many other writers too, we hasten to add),
NULLs and 3VL are and always were a serious mistake and have no
place in the relational model.
age is not a good attribute ...
The age column is guaranteed to go out of date! Let's record dates of
birth instead!
create table Students
( sid varchar(10) not NULL,
name varchar(50) not NULL,
birth_date date,
cid varchar(3) not NULL,
primary key (sid),
constraint student_college foreign key (cid)
references Colleges(cid) )
age is not a good attribute ...
mysql> select * from Students;
+------+---------+------------+-----+
| sid | name | birth_date | cid |
+------+---------+------------+-----+
| ev77 | Eva | 1990-01-26 | k |
| fm21 | Fatima | 1988-07-20 | cl |
| jj25 | James | 1989-03-14 | cl |
+------+---------+------------+-----+
Use a view to recover original table
(Note : the age calculation here is not correct!)
create view StudentsWithAge as
select sid, name,
(year(current_date()) - year(birth_date)) as age,
cid
from Students;
mysql> select * from StudentsWithAge;
+------+---------+------+-----+
| sid | name | age | cid |
+------+---------+------+-----+
| ev77 | Eva | 19 | k |
| fm21 | Fatima | 21 | cl |
| jj25 | James | 20 | cl |
+------+---------+------+-----+
Views are simply identifiers that represent a query. The view's name
can be used as if it were a stored table.
But that calculation is not correct ...
Clearly the calculation of age does not take into account the day and
month of year.
From 2010 Database Contest (winner : Sebastian Probst Eide)
SELECT year(CURRENT_DATE()) - year(birth_date) -
CASE WHEN month(CURRENT_DATE()) < month(birth_date)
THEN 1
ELSE
CASE WHEN month(CURRENT_DATE()) = month(birth_date)
THEN
CASE WHEN day(CURRENT_DATE()) < day(birth_date)
THEN 1
ELSE 0
END
ELSE 0
END
END
AS age FROM Students
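The same rule can be sketched outside SQL. This is a minimal illustration, not part of the contest entry: the function name age_on is hypothetical, and it takes "today" as a parameter so the result does not depend on when it is run. It subtracts one year whenever this year's birthday has not yet happened, which is what the nested CASE expression above does.

```python
from datetime import date


def age_on(birth_date: date, today: date) -> int:
    """Whole years elapsed between birth_date and today."""
    years = today.year - birth_date.year
    # Subtract one if this year's birthday has not happened yet.
    # Tuple comparison checks the month first, then the day.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


# Eva (born 1990-01-26), the day before and on her 20th birthday.
print(age_on(date(1990, 1, 26), date(2010, 1, 25)))
print(age_on(date(1990, 1, 26), date(2010, 1, 26)))
```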
An Example ...
mysql> select * from marks;
+-------+-----------+------+
| sid   | course    | mark |
+-------+-----------+------+
| ev77  | databases |   92 |
| ev77  | spelling  |   99 |
| tgg22 | spelling  |    3 |
| tgg22 | databases |  100 |
| fm21  | databases |   92 |
| fm21  | spelling  |  100 |
| jj25  | databases |   88 |
| jj25  | spelling  |   92 |
+-------+-----------+------+
... of duplicates
mysql> select mark from marks;
+------+
| mark |
+------+
| 92 |
| 99 |
| 3 |
| 100 |
| 92 |
| 100 |
| 88 |
| 92 |
+------+
Why Multisets?
Duplicates are important for aggregate functions.
mysql> select min(mark),
max(mark),
sum(mark),
avg(mark)
from marks;
+-----------+-----------+-----------+-----------+
| min(mark) | max(mark) | sum(mark) | avg(mark) |
+-----------+-----------+-----------+-----------+
| 3 | 100 | 666 | 83.2500 |
+-----------+-----------+-----------+-----------+
The group by clause
mysql> select course,
min(mark),
max(mark),
avg(mark)
from marks
group by course;
+-----------+-----------+-----------+-----------+
| course | min(mark) | max(mark) | avg(mark) |
+-----------+-----------+-----------+-----------+
| databases | 88 | 100 | 93.0000 |
| spelling | 3 | 100 | 73.5000 |
+-----------+-----------+-----------+-----------+
Visualizing group by
sid    course     mark
ev77   databases    92
ev77   spelling     99
tgg22  spelling      3
tgg22  databases   100
fm21   databases    92
fm21   spelling    100
jj25   databases    88
jj25   spelling     92

        group by  =>

course     mark        course     mark
spelling     99        databases    92
spelling      3        databases   100
spelling    100        databases    92
spelling     92        databases    88
Visualizing group by
course     mark        course     mark
spelling     99        databases    92
spelling      3        databases   100
spelling    100        databases    92
spelling     92        databases    88

        min(mark)  =>

course     min(mark)
spelling           3
databases         88
The having clause
How can we select on the aggregated columns?
mysql> select course,
min(mark),
max(mark),
avg(mark)
from marks
group by course
having min(mark) > 60;
+-----------+-----------+-----------+-----------+
| course | min(mark) | max(mark) | avg(mark) |
+-----------+-----------+-----------+-----------+
| databases | 88 | 100 | 93.0000 |
+-----------+-----------+-----------+-----------+
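The session above can be reproduced without a MySQL server. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module rather than MySQL, and the helper name course_summary is hypothetical. It loads the marks table from the earlier slide and runs the same group by / having query.

```python
import sqlite3

MARKS = [
    ("ev77", "databases", 92), ("ev77", "spelling", 99),
    ("tgg22", "spelling", 3), ("tgg22", "databases", 100),
    ("fm21", "databases", 92), ("fm21", "spelling", 100),
    ("jj25", "databases", 88), ("jj25", "spelling", 92),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table marks (sid text, course text, mark integer)")
conn.executemany("insert into marks values (?, ?, ?)", MARKS)


def course_summary(conn, threshold):
    """Per-course min/max/avg, keeping only groups whose minimum exceeds threshold."""
    return conn.execute(
        "select course, min(mark), max(mark), avg(mark) "
        "from marks group by course "
        "having min(mark) > ? order by course",
        (threshold,),
    ).fetchall()


print(course_summary(conn, 60))  # only the databases group survives
print(course_summary(conn, 0))   # both groups survive
```

The where clause filters rows before grouping; having filters whole groups after the aggregates have been computed.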
Use renaming to make things nicer ...
mysql> select course,
min(mark) as minimum,
max(mark) as maximum,
avg(mark) as average
from marks
group by course
having minimum > 60;
+-----------+---------+---------+---------+
| course | minimum | maximum | average |
+-----------+---------+---------+---------+
| databases | 88 | 100 | 93.0000 |
+-----------+---------+---------+---------+
Materialized Views
Suppose Q is a very expensive, and very frequent query.
Why not de-normalize some data to speed up the evaluation of Q?
  This might be a reasonable thing to do, or ...
  ... it might be the first step to destroying the integrity of your data design.
Why not store the value of Q in a table?
  This is called a materialized view.
  But now there is a problem: How often should this view be refreshed?
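The refresh problem can be seen in a few lines. This is only a sketch under stated assumptions: it uses Python's sqlite3 module, which has no built-in materialized views, so the "view" is simulated by copying the result of Q into an ordinary table; the names course_avg and refresh are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table marks (sid text, course text, mark integer)")
conn.executemany(
    "insert into marks values (?, ?, ?)",
    [("ev77", "databases", 92), ("jj25", "databases", 88)],
)

# Q: the (pretend) expensive query whose value we want to store.
Q = "select course, avg(mark) as average from marks group by course"


def refresh(conn):
    """Recompute Q and store its value in the table course_avg."""
    conn.execute("drop table if exists course_avg")
    conn.execute("create table course_avg as " + Q)


refresh(conn)
conn.execute("insert into marks values ('fm21', 'databases', 96)")

# The stored copy no longer agrees with Q until it is refreshed again.
stale = conn.execute("select average from course_avg").fetchone()[0]
fresh = conn.execute(Q).fetchone()[1]
print(stale, fresh)
```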
General integrity constraints
Suppose that $C$ is some constraint we would like to enforce on our
database.
Let $Q_C$ be a query that captures all violations of $C$.
Enforce (somehow) the assertion that $Q_C$ is always empty.
Example
$C = Z \to W$, an FD that was not preserved for relation $R(X)$,
Let $Q_R$ be a join that reconstructs $R$,
Let $Q'_R$ be this query with $X \mapsto X'$ (each attribute renamed to a primed copy), and
$Q_C = \sigma_{W \neq W'}(\sigma_{Z = Z'}(Q_R \times Q'_R))$
Assertions in SQL
create view C_violations as ....
create assertion check_C
check not (exists C_violations)
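Few systems implement create assertion, so in practice the violation query is often checked by other means. The sketch below is an illustration under assumptions, not the course's method: it uses Python's sqlite3 module, a small hypothetical table R(course, part, sid), and the FD course → part as the constraint C. The view pairs up rows that agree on course but disagree on part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table R (course text, part text, sid text)")
conn.executemany(
    "insert into R values (?, ?, ?)",
    [
        ("Databases", "IB", "bb44"),
        ("Databases", "IB", "zz70"),
        ("Algorithms I", "IA", "yy88"),
    ],
)

# Q_C for C = (course -> part): a self-join selecting on Z = Z' and W <> W'.
conn.execute(
    """
    create view C_violations as
    select r1.course, r1.part as part1, r2.part as part2
    from R as r1 join R as r2 on r1.course = r2.course
    where r1.part <> r2.part
    """
)


def constraint_holds(conn):
    """C holds exactly when the violation query is empty."""
    return conn.execute("select count(*) from C_violations").fetchone()[0] == 0


print(constraint_holds(conn))
```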
Lecture 06 : Schema refinement I
Outline
ER is for top-down and informal (but rigorous) design
FDs are used for bottom-up and formal design and analysis
update anomalies
Reasoning about Functional Dependencies
Heath's rule
Update anomalies
Big Table
sid   name  college   course         part  term_name
yy88  Yoni  New Hall  Algorithms I   IA    Easter
uu99  Uri   King's    Algorithms I   IA    Easter
bb44  Bin   New Hall  Databases      IB    Lent
bb44  Bin   New Hall  Algorithms II  IB    Michaelmas
zz70  Zip   Trinity   Databases      IB    Lent
zz70  Zip   Trinity   Algorithms II  IB    Michaelmas
How can we tell if an inserted record is consistent with current
records?
Can we record data about a course before students enroll?
Will we wipe out information about a college when the last student
associated with the college is deleted?
Redundancy implies more locking ...
... at least for correct transactions!
Big Table
sid   name  college   course         part  term_name
yy88  Yoni  New Hall  Algorithms I   IA    Easter
uu99  Uri   King's    Algorithms I   IA    Easter
bb44  Bin   New Hall  Databases      IB    Lent
bb44  Bin   New Hall  Algorithms II  IB    Michaelmas
zz70  Zip   Trinity   Databases      IB    Lent
zz70  Zip   Trinity   Algorithms II  IB    Michaelmas
Change New Hall to Murray Edwards College
  Conceptually simple update
  May require locking entire table.
Redundancy is the root of (almost) all database evils
It may not be obvious, but redundancy is also the cause of update
anomalies.
By redundancy we do not mean that some values occur many
times in the database!
  A foreign key value may have millions of copies!
But then, what do we mean?
Functional Dependency
Functional Dependency (FD)
Let $R(X)$ be a relational schema and $Y \subseteq X$, $Z \subseteq X$ be two attribute
sets. We say $Y$ functionally determines $Z$, written $Y \to Z$, if for any two
tuples $u$ and $v$ in an instance of $R(X)$ we have
$u.Y = v.Y \implies u.Z = v.Z$.
We call $Y \to Z$ a functional dependency.
A functional dependency is a semantic assertion. It represents a rule
that should always hold in any instance of schema $R(X)$.
Example FDs
Big Table
sid   name  college   course         part  term_name
yy88  Yoni  New Hall  Algorithms I   IA    Easter
uu99  Uri   King's    Algorithms I   IA    Easter
bb44  Bin   New Hall  Databases      IB    Lent
bb44  Bin   New Hall  Algorithms II  IB    Michaelmas
zz70  Zip   Trinity   Databases      IB    Lent
zz70  Zip   Trinity   Algorithms II  IB    Michaelmas
sid $\to$ name
sid $\to$ college
course $\to$ part
course $\to$ term_name
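Whether one particular instance satisfies an FD can be checked mechanically. The sketch below is illustrative only (the function name satisfies_fd and the dict-per-row representation are assumptions): it remembers the first Z-value seen for each Y-value and reports a violation when a later row disagrees. Passing the check on one instance does not establish the FD, which is a rule about every instance.

```python
def satisfies_fd(rows, lhs, rhs):
    """Return True if every two rows that agree on lhs also agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value for this key and returns it later.
        if seen.setdefault(key, value) != value:
            return False
    return True


COLUMNS = ("sid", "name", "college", "course", "part", "term_name")
BIG_TABLE = [dict(zip(COLUMNS, r)) for r in [
    ("yy88", "Yoni", "New Hall", "Algorithms I", "IA", "Easter"),
    ("uu99", "Uri", "King's", "Algorithms I", "IA", "Easter"),
    ("bb44", "Bin", "New Hall", "Databases", "IB", "Lent"),
    ("bb44", "Bin", "New Hall", "Algorithms II", "IB", "Michaelmas"),
    ("zz70", "Zip", "Trinity", "Databases", "IB", "Lent"),
    ("zz70", "Zip", "Trinity", "Algorithms II", "IB", "Michaelmas"),
]]

print(satisfies_fd(BIG_TABLE, ["sid"], ["name"]))
print(satisfies_fd(BIG_TABLE, ["part"], ["term_name"]))
```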
Keys, revisited
Candidate Key
Let $R(X)$ be a relational schema and $Y \subseteq X$. $Y$ is a candidate key if
1. The FD $Y \to X$ holds, and
2. for no proper subset $Z \subset Y$ does $Z \to X$ hold.
Prime and Non-prime attributes
An attribute $A$ is prime for $R(X)$ if it is a member of some candidate key
for $R$. Otherwise, $A$ is non-prime.
Database redundancy roughly means the existence of non-key
functional dependencies!
Semantic Closure
Notation
$F \models Y \to Z$
means that any database instance that satisfies every FD of $F$
must also satisfy $Y \to Z$.
The semantic closure of $F$, denoted $F^+$, is defined to be
$F^+ = \{ Y \to Z \mid Y \cup Z \subseteq \mathrm{atts}(F) \text{ and } F \models Y \to Z \}$.
The membership problem is to determine if $Y \to Z \in F^+$.
Reasoning about Functional Dependencies
We write $F \vdash Y \to Z$ when $Y \to Z$ can be derived from $F$ via the
following rules.
Armstrong's Axioms
Reflexivity If $Z \subseteq Y$, then $F \vdash Y \to Z$.
Augmentation If $F \vdash Y \to Z$ then $F \vdash Y, W \to Z, W$.
Transitivity If $F \vdash Y \to Z$ and $F \vdash Z \to W$, then $F \vdash Y \to W$.
Logical Closure (of a set of attributes)
Notation
$\mathrm{closure}(F, X) = \{ A \mid F \vdash X \to A \}$
Claim 1
If $Y \to W \in F$ and $Y \subseteq \mathrm{closure}(F, X)$, then $W \subseteq \mathrm{closure}(F, X)$.
Claim 2
$Y \to W \in F^+$ if and only if $W \subseteq \mathrm{closure}(F, Y)$.
Soundness and Completeness
Soundness
$F \vdash f \implies f \in F^+$
Completeness
$f \in F^+ \implies F \vdash f$
Proof of Completeness (soundness left as an exercise)
Show $\neg(F \vdash f) \implies \neg(F \models f)$:
Suppose $\neg(F \vdash Y \to Z)$ for $R(X)$.
Let $Y^+ = \mathrm{closure}(F, Y)$.
$\exists B \in Z$, with $B \notin Y^+$.
Construct an instance of $R$ with just two records, $u$ and $v$, that
agree on $Y^+$ but not on $X - Y^+$.
By construction, this instance does not satisfy $Y \to Z$.
But it does satisfy $F$! Why?
  let $S \to T$ be any FD in $F$, with $u.[S] = v.[S]$.
  So $S \subseteq Y^+$, and so $T \subseteq Y^+$ by claim 1,
  and so $u.[T] = v.[T]$
Closure
By soundness and completeness
$\mathrm{closure}(F, X) = \{ A \mid F \vdash X \to A \} = \{ A \mid X \to A \in F^+ \}$
Claim 2 (from previous lecture)
  $Y \to W \in F^+$ if and only if $W \subseteq \mathrm{closure}(F, Y)$.
If we had an algorithm for closure(F, X), then we would have a (brute
force!) algorithm for enumerating $F^+$:
$F^+$
  for every subset $Y \subseteq \mathrm{atts}(F)$
    for every subset $Z \subseteq \mathrm{closure}(F, Y)$,
      output $Y \to Z$
Attribute Closure Algorithm
Input : a set of FDs $F$ and a set of attributes $X$.
Output : $Y = \mathrm{closure}(F, X)$
1. $Y := X$
2. while there is some $S \to T \in F$ with $S \subseteq Y$ and $T \not\subseteq Y$, then
   $Y := Y \cup T$.
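A direct transcription of the two steps above, as a sketch. The representation is an assumption made for brevity: attributes are single characters and each FD is a pair of strings (left side, right side). The FDs used in the demonstration are A,B → C, C → D and D → A.

```python
def closure(fds, attrs):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)                     # step 1: Y := X
    changed = True
    while changed:                          # step 2: repeat until nothing new is added
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)          # Y := Y union T
                changed = True
    return result


F = [("AB", "C"), ("C", "D"), ("D", "A")]
print(sorted(closure(F, "C")))
print(sorted(closure(F, "AB")))
```

The loop terminates because each productive pass adds at least one attribute and there are only finitely many attributes.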
An Example (UW1997, Exercise 3.6.1)
$R(A, B, C, D)$ with $F$ made up of the FDs
$A, B \to C$
$C \to D$
$D \to A$
What is $F^+$?
Brute force!
Let's just consider all possible nonempty sets $X$; there are only 15...
Example (cont.)
$F = \{ A, B \to C,\ C \to D,\ D \to A \}$
For the single attributes we have
$\{A\}^+ = \{A\}$,
$\{B\}^+ = \{B\}$,
$\{C\}^+ = \{A, C, D\}$,
  $\{C\} \overset{C \to D}{\Longrightarrow} \{C, D\} \overset{D \to A}{\Longrightarrow} \{A, C, D\}$
$\{D\}^+ = \{A, D\}$
  $\{D\} \overset{D \to A}{\Longrightarrow} \{A, D\}$
The only new dependency we get with a single attribute on the left is
$C \to A$.
Example (cont.)
$F = \{ A, B \to C,\ C \to D,\ D \to A \}$
Now consider pairs of attributes.
$\{A, B\}^+ = \{A, B, C, D\}$,
  so $A, B \to D$ is a new dependency
$\{A, C\}^+ = \{A, C, D\}$,
  so $A, C \to D$ is a new dependency
$\{A, D\}^+ = \{A, D\}$,
  so nothing new.
$\{B, C\}^+ = \{A, B, C, D\}$,
  so $B, C \to A, D$ is a new dependency
$\{B, D\}^+ = \{A, B, C, D\}$,
  so $B, D \to A, C$ is a new dependency
$\{C, D\}^+ = \{A, C, D\}$,
  so $C, D \to A$ is a new dependency
Example (cont.)
$F = \{ A, B \to C,\ C \to D,\ D \to A \}$
For the triples of attributes:
$\{A, C, D\}^+ = \{A, C, D\}$,
$\{A, B, D\}^+ = \{A, B, C, D\}$,
  so $A, B, D \to C$ is a new dependency
$\{A, B, C\}^+ = \{A, B, C, D\}$,
  so $A, B, C \to D$ is a new dependency
$\{B, C, D\}^+ = \{A, B, C, D\}$,
  so $B, C, D \to A$ is a new dependency
And since $\{A, B, C, D\}^+ = \{A, B, C, D\}$, we get no new
dependencies with four attributes.
Example (cont.)
We generated 11 new FDs:
$C \to A$          $A, B \to D$
$A, C \to D$       $B, C \to A$
$B, C \to D$       $B, D \to A$
$B, D \to C$       $C, D \to A$
$A, B, C \to D$    $A, B, D \to C$
$B, C, D \to A$
Can you see the Key?
$\{A, B\}$, $\{B, C\}$, and $\{B, D\}$ are keys.
Note: this schema is already in 3NF! Why?
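The keys can also be found by brute force. The following sketch makes the same representational assumptions as before (single-character attributes, FDs as string pairs) and uses a hypothetical helper candidate_keys: it tries attribute sets in order of increasing size and keeps each set whose closure is everything, skipping any set that contains a smaller key already found.

```python
from itertools import combinations


def closure(fds, attrs):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attrs, fds):
    """All minimal attribute sets whose closure is the whole schema."""
    all_attrs = set(attrs)
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(fds, candidate) == all_attrs:
                keys.append(candidate)
    return keys


F = [("AB", "C"), ("C", "D"), ("D", "A")]
keys = candidate_keys("ABCD", F)
print(sorted("".join(sorted(k)) for k in keys))
```

Every attribute of R appears in at least one of these keys, so there are no non-prime attributes, which bears on the 3NF question above.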
Consequences of Armstrong's Axioms
Union If $F \models Y \to Z$ and $F \models Y \to W$, then $F \models Y \to W, Z$.
Pseudo-transitivity If $F \models Y \to Z$ and $F \models U, Z \to W$, then
$F \models Y, U \to W$.
Decomposition If $F \models Y \to Z$ and $W \subseteq Z$, then $F \models Y \to W$.
Exercise : Prove these using Armstrong's axioms!
Proof of the Union Rule
Suppose we have
$F \models Y \to Z$,
$F \models Y \to W$.
By augmentation we have
$F \models Y, Y \to Y, Z$,
that is,
$F \models Y \to Y, Z$.
Also using augmentation we obtain
$F \models Y, Z \to W, Z$.
Therefore, by transitivity we obtain
$F \models Y \to W, Z$.
Example application of functional reasoning.
Heath's Rule
Suppose $R(A, B, C)$ is a relational schema with functional
dependency $A \to B$, then
$R = \pi_{A,B}(R) \bowtie_A \pi_{A,C}(R)$.
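The rule can be tried out on small instances. In this sketch a relation over (A, B, C) is modelled as a Python set of 3-tuples; the helper names project and join_on_a and the sample data are assumptions made for illustration. The first instance satisfies A → B and is recovered exactly; the second violates A → B and the join produces extra tuples.

```python
def project(rel, idxs):
    """Projection onto the given column positions (duplicates collapse)."""
    return {tuple(t[i] for i in idxs) for t in rel}


def join_on_a(ab, ac):
    """Natural join of an (A, B) relation with an (A, C) relation on A."""
    return {(a, b, c) for (a, b) in ab for (a2, c) in ac if a == a2}


def rejoin(rel):
    """Split R into its (A, B) and (A, C) projections, then join them back."""
    return join_on_a(project(rel, (0, 1)), project(rel, (0, 2)))


R_good = {(1, "x", "p"), (1, "x", "q"), (2, "y", "p")}  # A -> B holds
R_bad = {(1, "x", "p"), (1, "y", "q")}                  # A -> B fails

print(rejoin(R_good) == R_good)
print(sorted(rejoin(R_bad) - R_bad))  # spurious tuples
```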
Proof of Heath's Rule
We first show that $R \subseteq \pi_{A,B}(R) \bowtie_A \pi_{A,C}(R)$.
If $u = (a, b, c) \in R$, then $u_1 = (a, b) \in \pi_{A,B}(R)$ and
$u_2 = (a, c) \in \pi_{A,C}(R)$.
Since $\{(a, b)\} \bowtie_A \{(a, c)\} = \{(a, b, c)\}$ we know
$u \in \pi_{A,B}(R) \bowtie_A \pi_{A,C}(R)$.
In the other direction we must show $R' = \pi_{A,B}(R) \bowtie_A \pi_{A,C}(R) \subseteq R$.
If $u = (a, b, c) \in R'$, then there must exist tuples
$u_1 = (a, b) \in \pi_{A,B}(R)$ and $u_2 = (a, c) \in \pi_{A,C}(R)$.
This means that there must exist a $u' = (a, b', c) \in R$ such that
$u_2 = \pi_{A,C}(\{(a, b', c)\})$.
However, the functional dependency tells us that $b = b'$, so
$u = (a, b, c) \in R$.
Closure Example
$R(A, B, C, D, E, F)$ with
$A, B \to C$
$B, C \to D$
$D \to E$
$C, F \to B$
What is the closure of $\{A, B\}$?
$\{A, B\} \overset{A,B \to C}{\Longrightarrow} \{A, B, C\} \overset{B,C \to D}{\Longrightarrow} \{A, B, C, D\} \overset{D \to E}{\Longrightarrow} \{A, B, C, D, E\}$
So $\{A, B\}^+ = \{A, B, C, D, E\}$ and $A, B \to C, D, E$.
Lecture 07 : Normal Forms
Outline
First Normal Form (1NF)
Second Normal Form (2NF)
3NF and BCNF
Multi-valued dependencies (MVDs)
Fourth Normal Form
The Plan
Given a relational schema $R(X)$ with FDs $F$ :
Reason about FDs
  Is $F$ missing FDs that are logically implied by those in $F$?
Decompose each $R(X)$ into smaller $R_1(X_1), R_2(X_2), \ldots, R_k(X_k)$,
where each $R_i(X_i)$ is in the desired Normal Form.
Are some decompositions better than others?
Desired properties of any decomposition
Lossless-join decomposition
A decomposition of schema $R(X)$ to $S(Y \cup Z)$ and $T(Y \cup (X - Z))$ is a
lossless-join decomposition if for every database instance we have
$R = S \bowtie T$.
Dependency preserving decomposition
A decomposition of schema $R(X)$ to $S(Y \cup Z)$ and $T(Y \cup (X - Z))$ is
dependency preserving, if enforcing FDs on $S$ and $T$ individually has
the same effect as enforcing all FDs on $S \bowtie T$.
We will see that it is not always possible to achieve both of these goals.
First Normal Form (1NF)
We will assume every schema is in 1NF.
1NF
A schema $R(A_1 : S_1, A_2 : S_2, \ldots, A_n : S_n)$ is in First Normal Form
(1NF) if the domains $S_i$ are elementary: their values are atomic.
name
Timothy George Griffin
        =>
first_name  middle_name  last_name
Timothy     George       Griffin
Second Normal Form (2NF)
Second Normal Form (2NF)
A relational schema $R$ is in 2NF if for every functional dependency
$X \to A$ either
$A \in X$, or
$X$ is a superkey for $R$, or
$A$ is a member of some key, or
$X$ is not a proper subset of any key.
3NF and BCNF
Third Normal Form (3NF)
A relational schema $R$ is in 3NF if for every functional dependency
$X \to A$ either
$A \in X$, or
$X$ is a superkey for $R$, or
$A$ is a member of some key.
Boyce-Codd Normal Form (BCNF)
A relational schema $R$ is in BCNF if for every functional dependency
$X \to A$ either
$A \in X$, or
$X$ is a superkey for $R$.
Is something missing?
Another look at Heath's Rule
Given $R(Z, W, Y)$ with FDs $F$
If $Z \to W \in F^+$, then
$R = \pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R)$
What about an implication in the other direction? That is, suppose we
have
$R = \pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R)$.
Q Can we conclude anything about FDs on $R$? In particular,
is it true that $Z \to W$ holds?
A No!
We just need one counter example ...
$R = \pi_{A,B}(R) \bowtie \pi_{A,C}(R)$

R                    $\pi_{A,B}(R)$      $\pi_{A,C}(R)$
A  B      C          A  B                A  C
a  $b_1$  $c_1$      a  $b_1$            a  $c_1$
a  $b_2$  $c_2$      a  $b_2$            a  $c_2$
a  $b_1$  $c_2$
a  $b_2$  $c_1$

Clearly $A \to B$ is not an FD of $R$.
A concrete example
course_name  lecturer  text
Databases    Tim       Ullman and Widom
Databases    Fatima    Date
Databases    Tim       Date
Databases    Fatima    Ullman and Widom
Assuming that texts and lecturers are assigned to courses
independently, then a better representation would be in two tables:
course_name  lecturer
Databases    Tim
Databases    Fatima
course_name  text
Databases    Ullman and Widom
Databases    Date
Time for a definition! MVDs
Multivalued Dependencies (MVDs)
Let $R(Z, W, Y)$ be a relational schema. A multivalued dependency,
denoted $Z \twoheadrightarrow W$, holds if whenever $t$ and $u$ are two records that agree
on the attributes of $Z$, then there must be some tuple $v$ such that
1. $v$ agrees with both $t$ and $u$ on the attributes of $Z$,
2. $v$ agrees with $t$ on the attributes of $W$,
3. $v$ agrees with $u$ on the attributes of $Y$.
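The definition translates almost word for word into a check on a single instance. In this sketch the function name satisfies_mvd and the dict-per-row representation are assumptions; the data is the course / lecturer / text table from the concrete example. For every pair t, u that agree on Z it looks for the tuple v built from t's W-values and u's Y-values.

```python
def satisfies_mvd(rows, z, w, y):
    """Check Z ->> W on one instance of R(Z, W, Y), given as a list of dicts."""
    def pick(row, attrs):
        return tuple(row[a] for a in attrs)

    present = {(pick(r, z), pick(r, w), pick(r, y)) for r in rows}
    for (tz, tw, _ty) in present:
        for (uz, _uw, uy) in present:
            # v must agree with t and u on Z, with t on W, and with u on Y.
            if tz == uz and (tz, tw, uy) not in present:
                return False
    return True


COURSES = [
    {"course_name": "Databases", "lecturer": "Tim", "text": "Ullman and Widom"},
    {"course_name": "Databases", "lecturer": "Fatima", "text": "Date"},
    {"course_name": "Databases", "lecturer": "Tim", "text": "Date"},
    {"course_name": "Databases", "lecturer": "Fatima", "text": "Ullman and Widom"},
]

print(satisfies_mvd(COURSES, ["course_name"], ["lecturer"], ["text"]))
print(satisfies_mvd(COURSES[:3], ["course_name"], ["lecturer"], ["text"]))
```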
A few observations
Note 1
Every functional dependency is a multivalued dependency,
$(Z \to W) \implies (Z \twoheadrightarrow W)$.
To see this, just let $v = u$ in the above definition.
Note 2
Let $R(Z, W, Y)$ be a relational schema, then
$(Z \twoheadrightarrow W) \iff (Z \twoheadrightarrow Y)$,
by symmetry of the definition.
MVDs and lossless-join decompositions
Fun Fun Fact
Let $R(Z, W, Y)$ be a relational schema. The decomposition $R_1(Z, W)$,
$R_2(Z, Y)$ is a lossless-join decomposition of $R$ if and only if the MVD
$Z \twoheadrightarrow W$ holds.
Proof of Fun Fun Fact
Proof of $(Z \twoheadrightarrow W) \implies R = \pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R)$
Suppose $Z \twoheadrightarrow W$.
We know (from proof of Heath's rule) that $R \subseteq \pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R)$.
So we only need to show $\pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R) \subseteq R$.
Suppose $r \in \pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R)$.
So there must be a $t \in R$ and $u \in R$ with
$\{r\} = \pi_{Z,W}(\{t\}) \bowtie \pi_{Z,Y}(\{u\})$.
In other words, there must be a $t \in R$ and $u \in R$ with $t.Z = u.Z$.
So the MVD tells us that then there must be some tuple $v \in R$
such that
1. $v$ agrees with both $t$ and $u$ on the attributes of $Z$,
2. $v$ agrees with $t$ on the attributes of $W$,
3. $v$ agrees with $u$ on the attributes of $Y$.
This $v$ must be the same as $r$, so $r \in R$.
Proof of Fun Fun Fact (cont.)
Proof of $R = \pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R) \implies (Z \twoheadrightarrow W)$
Suppose $R = \pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R)$.
Let $t$ and $u$ be any records in $R$ with $t.Z = u.Z$.
Let $v$ be defined by $\{v\} = \pi_{Z,W}(\{t\}) \bowtie \pi_{Z,Y}(\{u\})$ (and we know
$v \in R$ by the assumption).
Note that by construction we have
1. $v.Z = t.Z = u.Z$,
2. $v.W = t.W$,
3. $v.Y = u.Y$.
Therefore, $Z \twoheadrightarrow W$ holds.
Fourth Normal Form
Trivial MVD
The MVD $Z \twoheadrightarrow W$ is trivial for relational schema $R(Z, W, Y)$ if
1. $Z \cap W \neq \{\}$, or
2. $Y = \{\}$.
4NF
A relational schema $R(Z, W, Y)$ is in 4NF if for every MVD $Z \twoheadrightarrow W$
either
$Z \twoheadrightarrow W$ is a trivial MVD, or
$Z$ is a superkey for $R$.
Note : 4NF $\subset$ BCNF $\subset$ 3NF $\subset$ 2NF
Summary
We always want the lossless-join property. What are our options?
                           3NF    BCNF   4NF
Preserves FDs              Yes    Maybe  Maybe
Preserves MVDs             Maybe  Maybe  Maybe
Eliminates FD-redundancy   Maybe  Yes    Yes
Eliminates MVD-redundancy  No     No     Yes
Inclusions
Clearly BCNF $\subseteq$ 3NF $\subseteq$ 2NF. These are proper inclusions:
In 2NF, but not 3NF
$R(A, B, C)$, with $F = \{A \to B,\ B \to C\}$.
In 3NF, but not BCNF
$R(A, B, C)$, with $F = \{A, B \to C,\ C \to B\}$.
This is in 3NF since $\{A, B\}$ and $\{A, C\}$ are keys, so there are no
non-prime attributes
But not in BCNF since $C$ is not a key and we have $C \to B$.
Lecture 08 : Schema refinement III and advanced design
Outline
General Decomposition Method (GDM)
The lossless-join condition is guaranteed by GDM
The GDM does not always preserve dependencies!
FDs vs ER models?
Weak entities
Using FDs and MVDs to refine ER models
Another look at ternary relationships
General Decomposition Method (GDM)
GDM
1. Understand your FDs $F$ (compute $F^+$),
2. find $R(X) = R(Z, W, Y)$ (sets $Z$, $W$ and $Y$ are disjoint) with FD
   $Z \to W \in F^+$ violating a condition of desired NF,
3. split $R$ into two tables $R_1(Z, W)$ and $R_2(Z, Y)$
4. wash, rinse, repeat
Reminder
For $Z \to W$, if we assume $Z \cap W = \{\}$, then the conditions are
1. $Z$ is a superkey for $R$ (2NF, 3NF, BCNF)
2. $W$ is a subset of some key (2NF, 3NF)
3. $Z$ is not a proper subset of any key (2NF)
The lossless-join condition is guaranteed by GDM
This method will produce a lossless-join decomposition because
of (repeated applications of) Heath's Rule!
That is, each time we replace an $S$ by $S_1$ and $S_2$, we will always
be able to recover $S$ as $S_1 \bowtie S_2$.
Note that in GDM step 3, the FD $Z \to W$ may represent a key
constraint for $R_1$.
But does the method always terminate? Please think about this ....
General Decomposition Method Revisited
GDM++
1. Understand your FDs and MVDs $F$ (compute $F^+$),
2. find $R(X) = R(Z, W, Y)$ (sets $Z$, $W$ and $Y$ are disjoint) with either
   FD $Z \to W \in F^+$ or MVD $Z \twoheadrightarrow W \in F^+$ violating a condition of
   desired NF,
3. split $R$ into two tables $R_1(Z, W)$ and $R_2(Z, Y)$
4. wash, rinse, repeat
Return to Example Decompose to BCNF
$R(A, B, C, D)$
$F = \{ A, B \to C,\ C \to D,\ D \to A \}$
Which FDs in $F^+$ violate BCNF?
$C \to A$
$C \to D$
$D \to A$
$A, C \to D$
$C, D \to A$
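This list can be recomputed mechanically. The sketch below is an illustration under the same assumptions as earlier (single-character attributes, FDs as string pairs; bcnf_violations is a hypothetical helper): for every attribute set X that is not a superkey, it reports each non-trivial X → A with A in the closure of X.

```python
from itertools import combinations


def closure(fds, attrs):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(attrs, fds):
    """Non-trivial FDs X -> A in F+ (single attribute A) where X is not a superkey."""
    all_attrs = set(attrs)
    violations = set()
    for size in range(1, len(all_attrs)):
        for combo in combinations(sorted(all_attrs), size):
            reach = closure(fds, combo)
            if reach == all_attrs:
                continue  # X is a superkey, so X -> A is allowed by BCNF
            for a in reach - set(combo):
                violations.add(("".join(combo), a))
    return violations


F = [("AB", "C"), ("C", "D"), ("D", "A")]
for lhs, rhs in sorted(bcnf_violations("ABCD", F)):
    print(lhs, "->", rhs)
```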
Return to Example Decompose to BCNF
Decompose $R(A, B, C, D)$ to BCNF
Use $C \to D$ to obtain
$R_1(C, D)$. This is in BCNF. Done.
$R_2(A, B, C)$. This is not in BCNF. Why? $\{A, B\}$ and $\{B, C\}$ are the only
keys, and $C \to A$ is an FD for $R_2$. So use $C \to A$ to obtain
  $R_{2.1}(A, C)$. This is in BCNF. Done.
  $R_{2.2}(B, C)$. This is in BCNF. Done.
Exercise : Try starting with any of the other BCNF violations and see
where you end up.
The GDM does not always preserve dependencies!
$R(A, B, C, D, E)$
$A, B \to C$
$D, E \to C$
$B \to D$
$\{A, B\}^+ = \{A, B, C, D\}$,
so $A, B \to C, D$,
and $\{A, B, E\}$ is a key.
$\{B, E\}^+ = \{B, C, D, E\}$,
so $B, E \to C, D$,
and $\{A, B, E\}$ is a key (again)
Let's try for a BCNF decomposition ...
Decomposition 1
Decompose $R(A, B, C, D, E)$ using $A, B \to C, D$ :
$R_1(A, B, C, D)$. Decompose this using $B \to D$:
  $R_{1.1}(B, D)$. Done.
  $R_{1.2}(A, B, C)$. Done.
$R_2(A, B, E)$. Done.
But in this decomposition, how will we enforce this dependency?
$D, E \to C$
Decomposition 2
Decompose $R(A, B, C, D, E)$ using $B, E \to C, D$:
$R_3(B, C, D, E)$. Decompose this using $D, E \to C$
  $R_{3.1}(C, D, E)$. Done.
  $R_{3.2}(B, D, E)$. Decompose this using $B \to D$:
    $R_{3.2.1}(B, D)$. Done.
    $R_{3.2.2}(B, E)$. Done.
$R_4(A, B, E)$. Done.
But in this decomposition, how will we enforce this dependency?
$A, B \to C$
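Whether a given FD survives a decomposition can be tested without computing all projected FDs. The sketch below uses a standard textbook procedure that is not taken from these slides: starting from the left side of the FD, repeatedly add whatever can be derived while staying inside a single table of the decomposition. The helper name preserved and the string encoding of schemas are assumptions.

```python
def closure(fds, attrs):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def preserved(fd, fds, schemas):
    """True if fd follows from the FDs of F+ that fit inside single schemas."""
    lhs, rhs = fd
    known = set(lhs)
    changed = True
    while changed:
        changed = False
        for schema in schemas:
            s = set(schema)
            # Attributes derivable from what we know, using only this table.
            gained = (closure(fds, known & s) & s) - known
            if gained:
                known |= gained
                changed = True
    return set(rhs) <= known


F = [("AB", "C"), ("DE", "C"), ("B", "D")]
DECOMPOSITION_1 = ["BD", "ABC", "ABE"]
DECOMPOSITION_2 = ["CDE", "BD", "BE", "ABE"]

print(preserved(("DE", "C"), F, DECOMPOSITION_1))
print(preserved(("AB", "C"), F, DECOMPOSITION_2))
```

Each decomposition loses exactly the dependency asked about on its slide and keeps the other two.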
Summary
It is always possible to obtain BCNF that has the lossless-join
property (using GDM)
  But the result may not preserve all dependencies.
It is always possible to obtain 3NF that preserves dependencies
and has the lossless-join property.
  Using methods based on minimal covers (for example, see EN2000).
Recall : a small change of scope ...
... changed this entity
[Figure: entity Movie with attributes MovieID, Title, Year]
into two entities and a relationship :
[Figure: entity Movie (MovieID, Title), relationship Released, entity MovieRelease (Country, Date; Date has components Year, Month, Day)]
But is there something odd about the MovieRelease entity?
MovieRelease represents a Weak entity set
[Figure: entity Movie (MovieID, Title), relationship Released, entity MovieRelease (Country, Date; Date has components Year, Month, Day)]
Definition
Weak entity sets do not have a primary key.
The existence of a weak entity depends on an identifying entity set
through an identifying relationship.
The primary key of the identifying entity together with the weak
entity's discriminators (dashed underline in diagram) identify each
weak entity element.
Can FDs help us think about implementation?
$R(I, T, D, C)$
$I \to T$
I = MovieID
T = Title
D = Date
C = Country
Turn the decomposition crank to obtain
$R_1(I, T)$    $R_2(I, D, C)$
$\pi_I(R_2) \subseteq \pi_I(R_1)$
Movie Ratings example
Scope = UK
Title                                        Year  Rating
Austin Powers: International Man of Mystery  1997  15
Austin Powers: The Spy Who Shagged Me        1999  12
Dude, Where's My Car?                        2000  15
Scope = Earth
Title                                        Year  Country   Rating
Austin Powers: International Man of Mystery  1997  UK        15
Austin Powers: International Man of Mystery  1997  Malaysia  18SX
Austin Powers: International Man of Mystery  1997  Portugal  M/12
Austin Powers: International Man of Mystery  1997  USA       PG-13
Austin Powers: The Spy Who Shagged Me        1999  UK        12
Austin Powers: The Spy Who Shagged Me        1999  Portugal  M/12
Austin Powers: The Spy Who Shagged Me        1999  USA       PG-13
Dude, Where's My Car?                        2000  UK        15
Dude, Where's My Car?                        2000  USA       PG-13
Dude, Where's My Car?                        2000  Malaysia  18PL
Example of attribute migrating to strong entity set
From single-country scope,
[Figure: entity Movie with attributes MovieID, Title, Year, Rating, RatingReason]
to multi-country scope:
[Figure: entity Movie (MovieID, Title, Year), relationship Rated with attribute Reason, entity Rating (Country, RatingValue)]
Note that relation Rated has an attribute!
Beware of FFDs = Faux Functional Dependencies
(US ratings)
Title Year Rating RatingReason
Stoned 2005 R drug use
Wasted 2006 R drug use
High Life 2009 R drug use
Poppies: Odyssey of an opium eater 2009 R drug use
But
Title $\to$ Rating, RatingReason
is not a functional dependency.
This is a mildly amusing illustration of a real and pervasive problem:
deriving a functional dependency after the examination of a limited set
of data (or after talking to only a few domain experts).
Oh, but the real world is such a bother!
from IMDb raw data file certificates.list
2 Fast 2 Furious (2003) Switzerland:14 (canton of Vaud)
2 Fast 2 Furious (2003) Switzerland:16 (canton of Zurich)
28 Days (2000) Canada:13+ (Quebec)
28 Days (2000) Canada:14 (Nova Scotia)
28 Days (2000) Canada:14A (Alberta)
28 Days (2000) Canada:AA (Ontario)
28 Days (2000) Canada:PA (Manitoba)
28 Days (2000) Canada:PG (British Columbia)
Ternary or multiple binary relationships?
[Figure: a ternary relationship R among entity sets T, S and U, compared with a new entity set E linked to S, U and T by binary relationships R1, R2 and R3]
Ternary or multiple binary relationships?
[Figure: a ternary relationship R among entity sets T, S and U, compared with two binary relationships, R2 between T and S and R1 between S and U]
Look again at ER Demo Diagram
How might this be refined using FDs or MVDs?
[Figure: ER demo diagram; labels include Employee, Name, Number, ISA, Mechanic, Salesman, Does, RepairJob, Description, Cost, Parts, Work, Repairs, Car, License, Model, Year, Manufacturer, Buys, Price, Date, Value, Sells, Commission, Client, ID, Phone, Address, buyer, seller]
Diagram by Pável Calado,
http://www.texample.net/tikz/examples/entity-relationship-diagram
Lecture 10 : On-line Analytical Processing (OLAP)
Outline
Limits of SQL aggregation
OLAP : Online Analytic Processing
Data cubes
Star schema
Limits of SQL aggregation
Flat tables are great for processing, but hard for people to read
and understand.
Pivot tables and cross tabulations (spreadsheet terminology) are
very useful for presenting data in ways that people can
understand.
SQL does not handle pivot tables and cross tabulations well.
OLAP vs. OLTP
OLTP : Online Transaction Processing (traditional databases)
  Data is normalized for the sake of updates.
OLAP : Online Analytic Processing
  These are (almost) read-only databases.
  Data is de-normalized for the sake of queries!
  Multi-dimensional data cube emerging as common data model.
  This can be seen as a generalization of SQL's group by
OLAP Databases : Data Models and Design
The big question
Is the relational model and its associated query language (SQL) well
suited for OLAP databases?
Aggregation (sums, averages, totals, ...) are very common in
OLAP queries
  Problem : SQL aggregation quickly runs out of steam.
  Solution : Data Cube and associated operations (spreadsheets on steroids)
Relational design is obsessed with normalization
  Problem : Need to organize data well since all analysis queries
  cannot be anticipated in advance.
  Solution : Multi-dimensional fact tables, with hierarchy in
  dimensions, star-schema design.
A very influential paper [G+1997]
From aggregates to data cubes
The Data Cube
Data modeled as an n-dimensional (hyper-) cube
Each dimension is associated with a hierarchy
Each point records facts
Aggregation and cross-tabulation possible along all dimensions
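Aggregation along all dimensions can be sketched as one group by per subset of the dimensions, with the placeholder value "ALL" standing in for each dimension that has been summed away. This is an illustrative assumption-based sketch, not the lecture's implementation: it ignores hierarchies, uses the marks table from earlier as the fact table, and the function name cube_sum is hypothetical.

```python
from itertools import combinations


def cube_sum(rows, dims, measure):
    """Sum of measure for every combination of kept and aggregated-away dimensions."""
    totals = {}
    for row in rows:
        for size in range(len(dims) + 1):
            for kept in combinations(dims, size):
                # Dimensions not kept are replaced by the marker "ALL".
                key = tuple(row[d] if d in kept else "ALL" for d in dims)
                totals[key] = totals.get(key, 0) + row[measure]
    return totals


MARKS = [
    {"sid": "ev77", "course": "databases", "mark": 92},
    {"sid": "ev77", "course": "spelling", "mark": 99},
    {"sid": "tgg22", "course": "spelling", "mark": 3},
    {"sid": "tgg22", "course": "databases", "mark": 100},
    {"sid": "fm21", "course": "databases", "mark": 92},
    {"sid": "fm21", "course": "spelling", "mark": 100},
    {"sid": "jj25", "course": "databases", "mark": 88},
    {"sid": "jj25", "course": "spelling", "mark": 92},
]

cube = cube_sum(MARKS, ["sid", "course"], "mark")
print(cube[("ALL", "ALL")])        # grand total
print(cube[("ALL", "databases")])  # total per course
print(cube[("ev77", "ALL")])       # total per student
```

With two dimensions this yields the flat table, both one-dimensional group by results, and the grand total in a single structure, which is the cross-tabulation a pivot table displays.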
Hierarchy for Location Dimension
Cube Operations
The Star Schema as a design tool
Lecture 09 : Guest Lecture
Grant Allen (Google)
Technology Program Manager for Google's Site Reliability Engineering
team.
Lectures 11 and 12 : Beyond ACID/Relational
framework
Slides will be available later in the term
XML or JSON as a data exchange language
Not all applications require ACID
NoSQL Movement
Rise of Web and cluster-based computing
CAP = Consistency, Availability, and Partition tolerance
The CAP theorem (pick any two!)
Eventual consistency
Relationships vs. Aggregates
Aggregate data models?
Key-value store
Can a database really be schemaless?
The End
(http://xkcd.com/327)