Overview - Explain - Measuring Performance - Disk Architectures - Indexes - Join Algorithms (CTD.)
Overview
EXPLAIN
Measuring Performance
Disk Architectures
Indexes
Mechanics
Parsing
Motivation, Definition,
Demonstration
Classification
Optimization
Join Algorithms
Nested Loop
Simple
Index
CS3/586
10/10/16
Lecture 6
Slide 1
Learning objectives
LO6.1: Use SQL to declare indexes
LO6.2: Determine the I/O cost of finding record(s) using
a B+ tree
LO6.3: Given a join query, calculate the cost using each
join algorithm: Nested loops, Index Nested Loops,
Sort-Merge
LO6.4: Parse a query
LO6.5: Use VP to answer questions about optimization
Slide 2
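DBMS architecture (top to bottom):
SQL interface
  (passes SQL to)
Parser
  (produces Relational Algebra (RA) for)
Optimizer
  (produces an Executable Plan (RA + Algorithms) for)
Plan Executor
Files, Indexes & Access Methods
Database, Indexes
Other components shown: Security, Catalog, Concurrency, Crash Recovery
Callouts on the diagram: operator algorithms, indexes, how a disk works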
Slide 3
Components of a Disk *
Spindle
Disk head
Tracks
Sector
Arm movement
Platters
Arm assembly
Slide 5
More terminology
Spindle
Disk head
Tracks
Arm movement
both surfaces.
All the tracks that you can reach from one position of the arm are called a cylinder (imaginary!).
Arm assembly
Sector
Platters
Slide 6
Storage level         Clock Ticks   Analogy       Travel time
Registers             1             My Head       1 min
On Chip Cache         2             This Room
On Board Cache        10            This Campus   10 min
Memory                100           Sacramento    1.5 hr
Disk                  10^6          Pluto         2 Years
Tape/Optical Robot    10^9          Andromeda     2,000 Years
From http://research.microsoft.com/~gray/papers/AlphaSortSigmod.doc
Slide 9
Slide 11
Slide 12
Postgres EXPLAIN
Output for
EXPLAIN SELECT * FROM indiv WHERE zip = 97223;
Seq Scan on indiv (cost=0.00..109495.94 rows=221 width=166)
  Filter: (zip = 97223::bpchar)
How to read the output:
Seq Scan: Sequential Scan
cost=0.00: I/Os to get first row
..109495.94: I/Os to get last row*
rows=221: Rows retrieved
width=166: Average Row Width
*Actually this includes CPU costs, but we will call it I/O costs to simplify.
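The four estimates can also be pulled out of a plan line programmatically. The following is a minimal sketch, not part of PostgreSQL: the helper name parse_plan_line and the regular expression are illustrative assumptions, and the plan line is the one shown above with whitespace normalized.

```python
import re

# Plan line copied from the EXPLAIN output above (whitespace normalized).
plan_line = "Seq Scan on indiv (cost=0.00..109495.94 rows=221 width=166)"

# Hypothetical pattern for the "(cost=startup..total rows=R width=W)" part.
PLAN_RE = re.compile(
    r"\(cost=(?P<startup>\d+\.\d+)\.\.(?P<total>\d+\.\d+) "
    r"rows=(?P<rows>\d+) width=(?P<width>\d+)\)"
)

def parse_plan_line(line):
    """Return the optimizer's estimates from one EXPLAIN plan line."""
    match = PLAN_RE.search(line)
    if match is None:
        raise ValueError("no cost estimate found in: " + line)
    return {
        "startup_cost": float(match["startup"]),  # cost to get the first row
        "total_cost": float(match["total"]),      # cost to get the last row
        "rows": int(match["rows"]),               # estimated rows retrieved
        "width": int(match["width"]),             # average row width
    }

estimates = parse_plan_line(plan_line)
print(estimates)
```

The same helper could be applied to each line of a larger plan, since every plan node carries the same four numbers.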
Slide 13
Slide 14
Slide 15
Data Entries*
Before we learn about how indexes are built, we must understand the
concept of data entries.
Given a search key value, the index produces a data entry, which
produces the data record in one I/O.
Other real-life indexes will help motivate this concept.
Each of the following indexes speeds up data retrieval. What is the
search key, data entry, and data record for each one?
Index             Search Key    Data Entry    Data Record
Library Catalog
Google
Mapquest
Slide 18
Slide 19
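Example B+ Tree
Root: 17
Left subtree (Entries <= 17): index node with keys 5, 13
Right subtree (Entries > 17): index node with keys 27, 30
Leaf pages (data entries, left to right): 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29*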
Slide 20
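B+ tree (same as the example above):
Root: 17
Index nodes: keys 5, 13 (left) and keys 27, 30 (right)
Leaf pages (data entries, left to right): 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29*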
How many I/Os are required to retrieve data records with search
key values x, 13 < x < 27? Assume x is a unique key.
How many I/Os are required to retrieve data records with search
key values x, 3 < x < 15? Assume x is a unique key.
Slide 21
B+ Tree Indexes
Non-leaf Pages
Leaf Pages (Sorted by search key)
Layout of a non-leaf page: P0 | K1 | P1 | K2 | P2 | ... | Km | Pm
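To make the page layout concrete, here is a minimal sketch of how a search descends through non-leaf pages. It assumes the common convention that pointer Pi covers keys with Ki <= key < K(i+1). The toy tree, the dict/list representation, and the names child_index and find_leaf are illustrative assumptions, not the lecture's example tree.

```python
from bisect import bisect_right

def child_index(keys, search_key):
    """Pick which pointer P_i to follow in a non-leaf page.

    keys is the sorted list [K_1, ..., K_m] stored in the page.
    Assumed convention: P_i covers K_i <= key < K_(i+1).
    """
    return bisect_right(keys, search_key)

def find_leaf(node, search_key):
    """Descend from the root to a leaf, counting one page read per level."""
    pages_read = 1  # reading the root page
    while isinstance(node, dict):  # dict = non-leaf page, list = leaf page
        node = node["children"][child_index(node["keys"], search_key)]
        pages_read += 1
    return node, pages_read

# Toy two-level tree (hypothetical values): one root page, three leaf pages.
toy_tree = {
    "keys": [13, 27],
    "children": [[2, 5, 7], [13, 16, 22], [27, 29, 33]],
}

leaf, pages_read = find_leaf(toy_tree, 16)
print(leaf, pages_read)
```

Each level of the tree costs one page read, so the cost of reaching a leaf equals the height of the tree when none of it is cached in memory.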
Slide 22
Slide 23
Index Classification
Primary vs. secondary: If the index's search key contains the relation's primary key, then the index is called a primary index; otherwise it is a secondary index.
The index created by the DBMS for the primary key is usually called the primary index.
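LO6.1 asks for declaring indexes in SQL. The sketch below declares a secondary index with CREATE INDEX. As an assumption for the sake of a self-contained example, it runs against an in-memory SQLite database through Python's sqlite3 module rather than PostgreSQL; the column list (id, zip, commid), the sample rows, and the index name indiv_zip_idx are illustrative, not the real indiv schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified version of the indiv table.
conn.execute(
    "CREATE TABLE indiv (id INTEGER PRIMARY KEY, zip CHAR(5), commid INTEGER)"
)

# Secondary index: its search key (zip) does not contain the primary key.
conn.execute("CREATE INDEX indiv_zip_idx ON indiv (zip)")

conn.executemany(
    "INSERT INTO indiv (zip, commid) VALUES (?, ?)",
    [("97223", 1), ("96828", 2), ("97223", 3)],
)

# Ask SQLite how it would run the query; the last column describes each step.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM indiv WHERE zip = '97223'"
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)

rows = conn.execute(
    "SELECT commid FROM indiv WHERE zip = '97223' ORDER BY commid"
).fetchall()
print(rows)
```

The CREATE INDEX statement itself uses the same basic syntax in PostgreSQL; only the plan output format differs between the two systems.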
Slide 24
(Figure: CLUSTERED vs. UNCLUSTERED indexes. In both, index entries direct the search for data entries in the index file, and the data entries point to data records in the data file.)
Slide 25
Lest you get carried away: a table can have only one
clustered index. Why?
DBMSs make their primary indexes clustered.
Slide 27
(apocryphal)
Slide 29
Slide 30
A simple join
SELECT *
FROM indiv L, comm R
WHERE L.commid=R.commid
Review how to compute this join by hand, with the cl versions
of the tables.
M = 23,224 pages in L, pL = 39 rows per page,
N = 414 pages in R, pR = 24 rows per page.
These (estimated) statistics are stored in the system catalog.
In PostgreSQL, retrieve number of pages with the function
SELECT pg_relation_size('tablename')/8192;
Retrieve rows per page using
SELECT COUNT(*)/(pages in L or R) FROM L or R;
Slide 31
For each row in the outer table L, we scan the entire inner
table R, row by row.
Cost: M + (pL * M) * N = 23,224 + (39 * 23,224) * 414 I/Os
= 374,997,928 I/Os ≈ 3,749,979 seconds (at 10 ms per I/O) ≈ 43 days
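The arithmetic can be checked with a short script. This is a sketch under one stated assumption: each I/O takes 10 ms, which is the rate implied by the seconds figure above. The function name is illustrative.

```python
# Catalog statistics from the slide.
M, p_L = 23_224, 39   # pages in L, rows per page in L
N = 414               # pages in R

SECONDS_PER_IO = 0.01  # assumption: 10 ms per I/O

def simple_nested_loop_cost(M, p_L, N):
    """Read L once (M), then scan all N pages of R for each of the p_L * M rows."""
    return M + (p_L * M) * N

ios = simple_nested_loop_cost(M, p_L, N)
seconds = ios * SECONDS_PER_IO
days = seconds / 86_400  # seconds in a day

print(f"{ios:,} I/Os, {seconds:,.0f} seconds, {days:.1f} days")
```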
Memory Buffers: (empty)
Table R on disk, three pages: (... 2, 13), (12, 27), (1, 5)
Slide 33
Memory Buffers: L page (2 ..., 12, 6 ...), R page (... 2, 13)
L row 2 vs. R row 2: Match!
Query Answer: (2, 2)
Slide 34
Memory Buffers: L page (2 ..., 12, 6 ...), R page (... 2, 13)
L row 2 vs. R row 13: No match: Discard!
Query Answer: (2, 2)
Slide 35
Memory Buffers: L page (2 ..., 12, 6 ...), R page (12, 27)
L row 2 vs. R row 12: No match: Discard!
Query Answer: (2, 2)
Slide 36
Memory Buffers: L page (2 ..., 12, 6 ...), R page (12, 27)
L row 2 vs. R row 27: No match: Discard!
Query Answer: (2, 2)
Slide 37
Memory Buffers: L page (2 ..., 12, 6 ...), R page (1, 5)
L row 2 vs. R row 1: No match: Discard!
Query Answer: (2, 2)
Slide 38
Memory Buffers: L page (2 ..., 12, 6 ...), R page (1, 5)
L row 2 vs. R row 5: No match: Discard!
Query Answer: (2, 2)
Slide 39
Memory Buffers: L page (2 ..., 12, 6 ...), R page (... 2, 13)
L row 12 vs. R row 2: No match: Discard!
Query Answer: (2, 2)
Slide 40
Memory Buffers: L page (2 ..., 12, 6 ...), R page (... 2, 13)
L row 12 vs. R row 13: No match: Discard!
Query Answer: (2, 2)
Slide 41
Memory Buffers: L page (2 ..., 12, 6 ...), R page (12, 27)
L row 12 vs. R row 12: Match!
Query Answer: (2, 2), (12, 12)
And so forth
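The walk-through above can be sketched in a few lines. This is a minimal in-memory illustration, not a DBMS implementation: pages are modeled as Python lists of join-key values, and the function name and the counter are assumptions made for the example.

```python
def simple_nested_loop_join(outer_pages, inner_pages):
    """Join on equal key values; rescan every inner page for each outer row."""
    answer = []
    inner_page_reads = 0
    for outer_page in outer_pages:
        for l_row in outer_page:
            for inner_page in inner_pages:  # scan the entire inner table
                inner_page_reads += 1
                for r_row in inner_page:
                    if l_row == r_row:
                        answer.append((l_row, r_row))  # Match!
                    # otherwise: no match, discard
    return answer, inner_page_reads

# Key values taken from the animation: one page of L, three pages of R.
L_pages = [[2, 12, 6]]
R_pages = [[2, 13], [12, 27], [1, 5]]

answer, inner_page_reads = simple_nested_loop_join(L_pages, R_pages)
print(answer, inner_page_reads)
```

With one L page of three rows and three R pages, the inner table is read 3 * 3 = 9 times, which is the (pL * M) * N term of the cost formula.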
Slide 42
Slide 43
External Sorting
(Figure: sorted runs on disk. Values shown: 36; 92 88 66 51 43; 29; 23 21 20 18 9 7.)
Slide 44
Sort-Merge Join
This join algorithm is the one many people think of when asked
how they would join two tables. It is also the simplest to
visualize. It involves three steps.
1. Sort L on lcommid
2. Sort R on rcommid
3. Merge the sorted L and R on lcommid and rcommid.
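The three steps can be sketched as follows. Two simplifying assumptions apply: the sort is done in memory with Python's sorted() rather than with an external sort on disk, and the join is a foreign key join in which the R values are unique. The function name and sample values are illustrative.

```python
def sort_merge_join(l_keys, r_keys):
    """Sort both inputs, then merge them with one pass over each.

    Assumes r_keys holds unique values (R's join attribute is a key),
    so only the L side can contain duplicates.
    """
    left = sorted(l_keys)    # step 1: sort L on its join attribute
    right = sorted(r_keys)   # step 2: sort R on its join attribute
    result = []
    i = j = 0
    while i < len(left) and j < len(right):  # step 3: merge
        if left[i] < right[j]:
            i += 1                       # L value has no partner, move on
        elif left[i] > right[j]:
            j += 1                       # R value has no more partners
        else:
            result.append((left[i], right[j]))
            i += 1                       # keep R row for duplicate L values
    return result

matches = sort_merge_join([2, 12, 6, 12], [2, 13, 12, 27, 1, 5])
print(matches)
```

Because each sorted input is scanned only once during the merge, the merge step costs M + N page reads; the rest of the cost comes from sorting.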
Slide 46
BUT, almost every real life join is a foreign key join. One of
the joining attributes is a key, so the duplicate value
problem does not occur.
Slide 47
Algorithm            I/O Cost                              O( )     Time
Nested Loop          M + pL*M*N                            M*N      43 days
Index Nested Loop    M + pL*M*(cost of index access*)               8 Hours
Sort-Merge           5(M+N)                                M+N      20 minutes
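The "8 Hours" and "20 minutes" figures can be reproduced from the formulas. This is a sketch under two assumptions that are not stated on the slide: each I/O takes 10 ms, and one index access costs about 3 I/Os. The constant names are illustrative.

```python
M, p_L, N = 23_224, 39, 414  # catalog statistics for L and R

SECONDS_PER_IO = 0.01   # assumption: 10 ms per I/O
INDEX_ACCESS_COST = 3   # assumption: about 3 I/Os per index lookup

# Index nested loop: read L once, then one index access per row of L.
index_nested_loop_ios = M + p_L * M * INDEX_ACCESS_COST

# Sort-merge, using the slide's 5(M+N) estimate.
sort_merge_ios = 5 * (M + N)

index_nested_loop_hours = index_nested_loop_ios * SECONDS_PER_IO / 3600
sort_merge_minutes = sort_merge_ios * SECONDS_PER_IO / 60

print(f"index nested loop: {index_nested_loop_ios:,} I/Os, "
      f"about {index_nested_loop_hours:.1f} hours")
print(f"sort-merge: {sort_merge_ios:,} I/Os, "
      f"about {sort_merge_minutes:.1f} minutes")
```

Changing INDEX_ACCESS_COST shows how sensitive the index nested loop estimate is to the height of the index and to whether the index is clustered.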
Slide 50
Parser
Join algorithms,
Heap, Index,
Covered in
CS587/410
Buffer Management
Disk Space Management
DB
Slide 51
Query Parser
Relational Algebra Expression (Query Tree)
Query Optimizer
Plan Generator
Plan Cost Estimator
Catalog Manager
The Parser
The Optimizer: chooses the plan with the lowest cost (of the plans considered, which is not necessarily all possible plans).
Slide 53
SQL Query:
SELECT commname
FROM comm JOIN indiv USING (commid)
WHERE indiv.zip=97223;
Query tree (root first):
project: commname
select: indiv.zip=97223
join: commid=commid
inputs: comm, indiv
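One way to picture the parser's output is as a small tree data structure. The sketch below builds the query tree for this query by hand and prints it with indentation; it does not parse SQL text. The Node class, its fields, and the render function are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One relational algebra operator (or base table) in a query tree."""
    op: str
    arg: str = ""
    children: list = field(default_factory=list)

def render(node, depth=0):
    """Return the tree as indented lines, root first."""
    label = f"{node.op} {node.arg}".strip()
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines

# Hand-built tree: project over select over join of the two base tables.
tree = Node("project", "commname", [
    Node("select", "indiv.zip=97223", [
        Node("join", "commid=commid", [Node("comm"), Node("indiv")]),
    ]),
])

print("\n".join(render(tree)))
```

An optimizer works on a structure like this one, for example by moving the select below the join or by choosing a join algorithm for the join node.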
Slide 54
Slide 55
Slide 56
*Statistics for calculating these costs are kept in the system catalog.
Slide 57
Dynamic Programming
Query tree (root first):
project: commname
select: indiv.zip=96828
join: commid=commid
inputs: comm, indiv
Slide 59
Slide 60
http://www.java.com/en/download/index.jsp
Click on Visual_Planner.jar
cs.pdx.edu/~len/386/VP1.7.zip
File/Open
Navigate to the directory where you put VP1.7
Choose noindex.pln
Slide 61
Slide 64
Slide 65
Slide 66
LO6.2 EXERCISE*
Consider the B+-tree index on slide 21. Assume none
of the tree is in memory and the index is unique.
Assume that in the data file, every data record is on a
different page. How many disk I/Os are needed to
retrieve all records with search key values x, 7 < x < 16?
Slide 67
LO6.3: EXERCISE
Consider the join query:
SELECT *
FROM comm L JOIN cand R ON (assoccand = candid);
Calculate the cost of a nested loop, an index nested loop, and a sort-merge join.
Slide 68
LO6.4: EXERCISE
Follow the instructions on slide 61 to set up the Visual
Planner. Open the file noindex.pln
What are the startup cost and the total cost of the left input?
Slide 69