
Lec6 QP Indexing

The document summarizes the basics of query processing and indexing in database systems. It discusses how (1) an SQL query is parsed and optimized into logical and physical query plans before execution, (2) data is stored on disk in files organized by rows, and (3) indexes can be created on attributes to enable faster retrieval and updating of records compared to scanning the entire data file sequentially. Indexes store key-pointer pairs to allow quick access to records given a search key value. Common index structures include hash tables and B+ trees.


CMPT 354: Database System I
Lecture 6. Basics of Query Processing and Indexing
Outline
• Query Processing
• What happens when an SQL query is issued?

• Indexing
• How to speed up query performance?

Query Processing Steps
SQL query → SQL Parser → Query optimization (Logical Optimization, then Physical Optimization) → Query Execution → Disk
Example
• Offering (oID, dept, cNum, term, instructor)
• Took (sID, oID, grade)

Q: Find the student numbers of all students who have taken CMPT 354

SELECT sID
FROM Offering O, Took T
WHERE O.oID = T.oID
AND O.dept = 'CMPT'
AND O.cNum = '354'
SQL Parser
• Offering (oID, dept, cNum, term, instructor)
• Took (sID, oID, grade)
• From the input SQL text to a logical plan

SELECT sID
FROM Offering O, Took T
WHERE O.oID = T.oID
AND O.dept = 'CMPT'
AND O.cNum = '354'

π_sID(σ_{dept = 'CMPT' ∧ cNum = 354}(Offering ⨝ Took))

• The relational algebra expression is also called the “logical query plan”
Logical Optimization
• Find the optimal logical plan

Plan 1 (join, then select): π_sID(σ_{dept = 'CMPT' ∧ cNum = 354}(Offering ⨝ Took))

Plan 2 (select, then join): π_sID(σ_{dept = 'CMPT' ∧ cNum = 354}(Offering) ⨝ Took)
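The two logical plans can be compared on a toy instance. This is only a sketch: the rows below are made-up illustration data, and the helper names `join`, `select` and `project` are hypothetical stand-ins for the relational algebra operators. It shows that both plans return the same answer while the join in the second plan produces fewer intermediate tuples.

```python
# Toy relations; the rows are invented for illustration, not taken from the lecture.
offering = [
    {"oID": 1, "dept": "CMPT", "cNum": 354},
    {"oID": 2, "dept": "CMPT", "cNum": 454},
    {"oID": 3, "dept": "MATH", "cNum": 354},
]
took = [
    {"sID": 10, "oID": 1},
    {"sID": 20, "oID": 1},
    {"sID": 20, "oID": 2},
    {"sID": 30, "oID": 3},
]


def join(left, right):
    """Join on oID, written as a plain nested loop."""
    return [{**l, **r} for l in left for r in right if l["oID"] == r["oID"]]


def select(rows):
    """Selection: dept = 'CMPT' AND cNum = 354."""
    return [r for r in rows if r["dept"] == "CMPT" and r["cNum"] == 354]


def project(rows):
    """Projection on sID (set semantics, sorted for a stable result)."""
    return sorted({r["sID"] for r in rows})


# Plan 1: join first, then select.
joined_1 = join(offering, took)
plan_1 = project(select(joined_1))

# Plan 2: push the selection below the join.
joined_2 = join(select(offering), took)
plan_2 = project(joined_2)

print(plan_1, plan_2)                # identical answers
print(len(joined_1), len(joined_2))  # sizes of the two join outputs
```

On this toy data the first join produces 4 tuples and the second only 2, which is the kind of difference a logical optimizer looks for.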
Physical Optimization
• Find the optimal physical plan

Both candidate plans evaluate π_sID(σ_{dept = 'CMPT' ∧ cNum = 354}(Offering) ⨝ Took) with:
• Offering, Took: File Scan
• σ: Scan & write to T
• ⨝: Nested loop (plan 1) vs. Hash Join (plan 2)
Query Execution
• From a physical plan to actual machine code
• Physical plan: π_sID(σ_{dept = 'CMPT' ∧ cNum = 354}(Offering) ⨝ Took), with File Scans on Offering and Took, Scan & write to T for σ, and a Hash Join for ⨝
• “Volcano Iterator Model” → Machine Code (e.g., C++)
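A rough sketch of the iterator idea, assuming Python generators as a stand-in for the open/next/close interface of the Volcano model (a real engine would be written in a language such as C++). The operator names and the toy rows are hypothetical. Each operator pulls one tuple at a time from its child.

```python
def file_scan(table):
    """Leaf operator: yield the stored tuples one by one."""
    for row in table:
        yield row


def select(child, predicate):
    """Pass through only the tuples that satisfy the predicate."""
    for row in child:
        if predicate(row):
            yield row


def hash_join(build, probe, key):
    """Build a hash table on one input, then probe it with the other."""
    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)
    for row in probe:
        for match in table.get(row[key], []):
            yield {**match, **row}


def project(child, column):
    """Keep a single column of each tuple."""
    for row in child:
        yield row[column]


# Invented rows for illustration only.
offering = [
    {"oID": 1, "dept": "CMPT", "cNum": 354},
    {"oID": 2, "dept": "CMPT", "cNum": 454},
]
took = [{"sID": 10, "oID": 1}, {"sID": 20, "oID": 1}, {"sID": 20, "oID": 2}]

plan = project(
    hash_join(
        select(file_scan(offering),
               lambda r: r["dept"] == "CMPT" and r["cNum"] == 354),
        file_scan(took),
        "oID",
    ),
    "sID",
)
result = list(plan)  # pulling from the root drives the whole plan
print(result)
```

Nothing runs until the root is asked for tuples, so the plan is a tree of small, composable operators rather than one monolithic routine.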
Summary
• Logical plans:
• Created by the parser from the input SQL text
• Expressed as a relational algebra tree
• Each SQL query has many possible logical plans
• Physical plans:
• Goal is to choose an efficient implementation for each
operator in the RA
• Each logical plan has many possible physical plans
• Query Optimization:
• Find the optimal logical plan
• Find the optimal physical plan
Outline
• Query Processing
• What happens when an SQL query is issued?

• Indexing
• How to speed up query performance?

Query Performance
• My database application is too slow… why?
• One of the queries is very slow… why?

• To address these problems, we need to understand:
• How is data organized on disk
• What is an index
• How to select indexes
Data Storage
• DBMSs store data in files
• Most common organization is row-wise storage
• On disk, a file is split into blocks
• Each block contains a set of tuples

sID | dept | cNum | term | instructor
10 | CMPT | 345 | SP 2018 | Jiannan
20 | CMPT | 454 | FA 2018 | Martin
… | … | … | … | …

Block 1: tuples 10, 20
Block 2: tuples 30, 40
Block 3: tuples 50, 60
Block 4: tuples 70, 80

In the example, we have 4 blocks with 2 tuples each
Scanning a Data File
• Data file is stored on Disk
• Consequence: Sequential IO is MUCH FASTER than
random IO
• Good: read blocks 1, 2, 3, 4, 5
• Bad: read blocks 2342, 11, 321, 9
• Rule of thumb:
• Randomly reading 1-2% of the file costs about as much as sequentially scanning the entire file
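A back-of-the-envelope sketch of where such a rule of thumb can come from. The two timing constants are assumptions chosen for illustration (a random block read taken to be 100 times slower than a sequential one), not measurements from the lecture.

```python
# Assumed per-block costs in microseconds; purely illustrative numbers.
SEQ_READ_US = 100       # sequential read of one block
RAND_READ_US = 10_000   # random read of one block (seek + rotation)


def scan_cost_us(total_blocks):
    """Cost of reading the whole file sequentially."""
    return total_blocks * SEQ_READ_US


def random_cost_us(blocks_read):
    """Cost of fetching the given number of blocks at random positions."""
    return blocks_read * RAND_READ_US


def break_even_fraction():
    """Fraction of the file at which random reads cost as much as a full scan."""
    return SEQ_READ_US / RAND_READ_US


total = 1_000_000
print(scan_cost_us(total), random_cost_us(total // 100), break_even_fraction())
```

Under these assumed costs the break-even point is 1% of the file; a different ratio between the two constants moves it accordingly.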
Data File Types
• Heap file
• Unsorted
• Sequential file
• Sorted according to some attribute(s) called key

Note: key here means something different from primary key: it just means that we order the file according to that attribute. In our example we ordered by sID. We might just as well order by instructor, if that seems a better idea for the applications running on our database.
Index Motivation (1)
Student(name, age)

• Suppose we want to search for students of a specific age

• First idea: Sort the records by age… we know how to do this fast!
• How many IO operations to search over N sorted records?
• Simple scan: O(N)
• Binary search: O(log₂ N)

Could we get even cheaper search? E.g. go from log₂ N → log₂₀₀ N?
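To get a feel for the gap between the two logarithms, the small sketch below counts how many levels a search needs when each step narrows the range by a factor of 2 versus a factor of 200. The function name `levels` and the choice N = 10⁹ are illustrative assumptions.

```python
def levels(n, fanout):
    """Smallest h with fanout**h >= n, i.e. ceil(log_fanout(n)), in integer arithmetic."""
    h, reach = 0, 1
    while reach < n:
        reach *= fanout
        h += 1
    return h


n = 10**9
print(levels(n, 2))    # steps for binary search
print(levels(n, 200))  # steps when each step splits the range 200 ways
```

For a billion records this is 30 steps against 4, and each step is potentially one IO.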
Index Motivation (2)

• What if we want to insert a new student, but keep the list sorted?

Example: inserting 2 into the pages [1,3] [4,5] [6,7] gives [1,2] [3,4] [5,6] [7, ]

• We would potentially have to shift N records, requiring up to ~2*N/P IO operations (where P = # of records per page)!

Could we get faster insertions?
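The shifting cost can be simulated on the small example from the slide. This is a sketch under simplifying assumptions: pages are Python lists, every touched page costs one read and one write, and the cost of locating the first page is ignored. The function name `insert_sorted` is hypothetical.

```python
import bisect


def insert_sorted(pages, key, capacity):
    """Insert key into a sorted file of pages, pushing overflow to the right.

    Returns the number of page reads plus page writes performed.
    """
    if not pages:
        pages.append([key])
        return 1  # one write for the new page
    io = 0
    # First page whose largest key is >= key; otherwise the last page.
    start = next((i for i, p in enumerate(pages) if p and p[-1] >= key),
                 len(pages) - 1)
    carry = key
    for i in range(start, len(pages)):
        page = pages[i]
        io += 1                      # read the page
        bisect.insort(page, carry)
        io += 1                      # write it back
        if len(page) <= capacity:
            carry = None
            break
        carry = page.pop()           # largest key spills into the next page
    if carry is not None:
        pages.append([carry])        # the file grows by one page
        io += 1
    return io


pages = [[1, 3], [4, 5], [6, 7]]
io_count = insert_sorted(pages, 2, capacity=2)
print(pages, io_count)
```

With N = 6 records and P = 2 records per page, every page is read and rewritten (2*N/P = 6 IOs), plus one more write for the new last page.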


Index Motivation (3)

• What if we want to be able to search quickly along multiple attributes (e.g. not just age)?
• We could keep multiple copies of the records, each sorted by one attribute set… this would take a lot of space

Can we get fast search over multiple attribute sets without taking too much space?

We’ll create separate data structures called indexes to address all these points
Index
• An additional file, that allows fast access to records in
the data file given a search key
• The index contains (key, value) pairs:
• The key = an attribute value (e.g., student ID or age)
• The value = a pointer to the record
• An index can store the full rows it points to (primary
index) or pointers to those rows (secondary index)
• We’ll mainly consider secondary indexes
• Could have many indexes for one table
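As a minimal sketch of the (key, pointer) idea, the code below treats the data file as a Python list and a pointer as a position in that list. The two records are the ones shown on the Data Storage slide; `build_index` and `lookup` are hypothetical helper names, and a real index would live in its own file on disk.

```python
from collections import defaultdict

# Data file: (sID, dept, cNum, term, instructor), as on the Data Storage slide.
data_file = [
    (10, "CMPT", 345, "SP 2018", "Jiannan"),
    (20, "CMPT", 454, "FA 2018", "Martin"),
]


def build_index(records, column):
    """Map each value of the chosen column to the positions of matching records."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[column]].append(position)
    return index


def lookup(index, records, key):
    """Follow the pointers stored under key; no scan of the data file is needed."""
    return [records[position] for position in index.get(key, [])]


# Several indexes can coexist on the same table.
by_cnum = build_index(data_file, 2)
by_dept = build_index(data_file, 1)

print(lookup(by_cnum, data_file, 454))
print(lookup(by_dept, data_file, "CMPT"))
```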

Different Keys
• Primary key: uniquely identifies a tuple
• Key of the sequential file: how the data file is sorted
• Index key: how the index is organized
Example 1: Index on sID
Each index entry points to the data file record with the same sID:

Index entry → Data File record
10 → 10 CMPT 345 SP 2018 Jiannan
20 → 20 CMPT 454 FA 2018 Martin
30 → 30 … … … …
40 → 40 …
50 → 50
60 → 60
70 → 70
80 → 80
Example 2: Index on cNum
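Index (sorted by cNum): 102, 110, 225, 276, 354, 383, 454, 470

Data File (sorted by sID):
sID | dept | cNum | term | instructor
10 | CMPT | 345 | SP 2018 | Jiannan
20 | CMPT | 454 | FA 2018 | Martin
30 | … | 110 | … | …
40 | … | 276 | |
50 | | 225 | |
60 | | 383 | |
70 | | 102 | |
80 | | 470 | |

Each index entry points to the record with that cNum, so the pointers no longer follow the order of the data file.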
Index Organization
• Common indexes:
• Hash tables
• B+ trees

• Specialized indexes
• R-trees
• Inverted index
• …
B+ Tree Example
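Search for K = 30:
• Root [80]: 30 < 80, so follow the left pointer
• Internal node [20, 60] (its sibling is [100, 120, 140]): 30 is in [20, 60)
• Leaf level [10 15 18] [20 30 40 50] [60 65] [80 85 90]: 30 is in [30, 40)
• To the data! Data file: 10 12 15 20 28 30 40 60 63 80 84 89

Not all nodes pictured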
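The descent can be sketched in a few lines. The tree below mirrors the pictured nodes as far as they can be read from the slide; the grouping of keys into leaves and the leaves under [100, 120, 140] are assumptions added to make the example complete. Node layout and the names `leaf`, `inner` and `search` are hypothetical.

```python
from bisect import bisect_right


def leaf(keys):
    return {"keys": keys, "children": None}


def inner(keys, children):
    # An inner node with k keys has k + 1 children.
    return {"keys": keys, "children": children}


tree = inner([80], [
    inner([20, 60], [
        leaf([10, 15, 18]),
        leaf([20, 30, 40, 50]),
        leaf([60, 65]),
    ]),
    inner([100, 120, 140], [
        leaf([80, 85, 90]),
        leaf([100, 110]),   # assumed, not pictured on the slide
        leaf([120, 130]),   # assumed, not pictured on the slide
        leaf([140, 150]),   # assumed, not pictured on the slide
    ]),
])


def search(node, key):
    """Walk from the root to a leaf; return (found, keys of the visited nodes)."""
    path = []
    while node["children"] is not None:
        path.append(node["keys"])
        # Child i covers keys in [keys[i-1], keys[i]).
        node = node["children"][bisect_right(node["keys"], key)]
    path.append(node["keys"])
    return key in node["keys"], path


found, path = search(tree, 30)
print(found, path)
```

The search visits one node per level, so the number of IOs is the height of the tree rather than the number of records.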
Clustered vs. Unclustered Index

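[Figure: the same index file in both cases, with root 30 and leaves [22 25 28 29] [32 34 37 38].
Clustered: the data file is stored in index order: 19 22 27 28 30 33 35 37.
Unclustered: the data file is stored in a different order: 19 33 27 22 37 28 35 30.]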
Clustered vs. Unclustered Index
• Recall that for a disk with block access, sequential IO is much faster than random IO
• For exact search, no difference between clustered / unclustered
• For range search over R values: difference between 1 random IO + R sequential IO, and R random IO

SELECT *
FROM R
WHERE R.K > ? AND R.K < ?

[Figure: cost vs. percentage of tuples retrieved (0 to 100), with curves for Unclustered Index, Sequential Scan, and Clustered Index]
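The difference for range searches can be counted on the two data files from the earlier figure. This is a simplified sketch: it assumes 2 tuples per block, a buffer that holds a single block, and an index that returns matching keys in sorted order. The function name `block_fetches` is hypothetical.

```python
def block_fetches(data_file, lo, hi, block_size=2):
    """Count block fetches when the matching keys are followed in index order."""
    position = {key: i for i, key in enumerate(data_file)}
    fetches, last_block = 0, None
    for key in sorted(k for k in data_file if lo <= k <= hi):
        block = position[key] // block_size
        if block != last_block:  # the needed block is not the one in the buffer
            fetches += 1
            last_block = block
    return fetches


clustered = [19, 22, 27, 28, 30, 33, 35, 37]
unclustered = [19, 33, 27, 22, 37, 28, 35, 30]

print(block_fetches(clustered, 19, 33))    # neighbouring blocks, read in order
print(block_fetches(unclustered, 19, 33))  # jumps between blocks
```

In the clustered file the 3 fetches are also adjacent on disk, so they are sequential IO; in the unclustered file the 5 fetches jump around, and one block is even read twice.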
Summary
• Index = a file that enables direct access to records
in another data file
• B+ tree / Hash table
• Clustered/unclustered

• Data resides on disk
• Organized in blocks
• Sequential IO is more efficient than random IO
• Randomly reading 1-2% of the data costs about as much as a sequential scan of the entire file
Creating Indexes in SQL

• Offering (oID, dept, cNum, term, instructor)

CREATE INDEX IDX1 ON Offering(dept)

Which query(s) could be affected by IDX1?

(A) SELECT oID FROM Offering
WHERE dept = 'CMPT'

(B) SELECT oID FROM Offering
WHERE cNum = '354'

(C) SELECT oID FROM Offering
WHERE dept = 'CMPT' AND cNum = '354'
Creating Indexes in SQL

• Offering (oID, dept, cNum, term, instructor)

CREATE INDEX IDX2 ON Offering(dept, cNum)

Which query(s) could be affected by IDX2?

(A) SELECT oID FROM Offering
WHERE dept = 'CMPT'

(B) SELECT oID FROM Offering
WHERE cNum = '354'

(C) SELECT oID FROM Offering
WHERE dept = 'CMPT' AND cNum = '354'
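One way to explore such questions is to ask a real system for its plan. The sketch below uses SQLite through Python's `sqlite3` module and `EXPLAIN QUERY PLAN`; this is an assumption for illustration, since the lecture does not name a DBMS, and other systems (or SQLite with collected statistics) may choose differently. The helper name `plans` and the index name `IDX` are hypothetical.

```python
import sqlite3

QUERIES = {
    "A": "SELECT oID FROM Offering WHERE dept = 'CMPT'",
    "B": "SELECT oID FROM Offering WHERE cNum = '354'",
    "C": "SELECT oID FROM Offering WHERE dept = 'CMPT' AND cNum = '354'",
}


def plans(index_columns):
    """Create Offering with one index and return SQLite's plan text per query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Offering (oID, dept, cNum, term, instructor)")
    conn.execute(f"CREATE INDEX IDX ON Offering({index_columns})")
    result = {}
    for label, sql in QUERIES.items():
        rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
        result[label] = rows[0][-1]  # the last column holds the plan description
    conn.close()
    return result


for columns in ("dept", "dept, cNum"):
    print(columns)
    for label, detail in plans(columns).items():
        print("  ", label, detail)
```

A plan that starts with SEARCH uses the index to locate rows, while SCAN reads through the whole table. In this setup queries (A) and (C) search via the index in both cases, and (B) falls back to a scan because cNum is not a leading column of either index.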
Which Indexes?

• How many indexes could we create?
• Which indexes should we create?
Which Indexes?
• The index selection problem
• Given a table, and a “workload” (SFU CourSys
application with lots of SQL queries), decide which
indexes to create (and which ones NOT to create!)

• Who does index selection:
• The database administrator (DBA)
• Semi-automatically, using a database administration tool
Index Selection: Which Search Key
• Make some attribute K a search key if the WHERE
clause contains:
• An exact match on K
• A range predicate on K
• A join on K

The Index Selection Problem 1
• Your workload is

100000 queries:
SELECT sID
FROM Student
WHERE name = ?

100000 queries:
SELECT sID
FROM Student
WHERE gender = ?

Which one is better?
A. Index on name
B. Index on gender
The Index Selection Problem 2
• Your workload is

100000 queries:
SELECT sID
FROM Student
WHERE name LIKE ?

100000 queries:
SELECT sID
FROM Student
WHERE age = ?

Which one is better?
A. Index on name
B. Index on age
The Index Selection Problem 3
• Your workload is

100000 queries:
SELECT sID
FROM Student
WHERE name = ?

100 queries:
SELECT sID
FROM Student
WHERE age = ?

Which one(s) are useful?
A. Index on name
B. Index on age
C. Index on name, age
D. Index on age, name
The Index Selection Problem 4
• Your workload is

100000 queries:
SELECT sID
FROM Student
WHERE fname = ?

100000 queries:
SELECT sID
FROM Student
WHERE fname = ? AND age > ?

Which one is better?
A. Index on (fname, age)
B. Index on (age, fname)
The Index Selection Problem 5
• Your workload is

100000 queries:
SELECT sID
FROM Student
WHERE name = ?

100 queries:
SELECT sID
FROM Student
WHERE age = ?

100000 queries:
INSERT INTO Student
VALUES (?, …, ?)

Which one(s) are useful?
A. Index on name
B. Index on age
C. Index on name, age
D. Index on age, name
Basic Index Selection Guidelines
• Consider queries in workload in order of importance
• Consider relations accessed by query
• No point indexing other relations
• Look at WHERE clause for possible search key
• Try to choose indexes that speed up multiple queries
Summary
• Query Processing
• SQL Parser
• Logical Optimization
• Physical Optimization
• Query Execution

• Indexing
• Data Storage
• Index motivation
• Index Selection

Acknowledgements
• Some lecture slides were copied from or inspired by the following course materials
• “W4111: Introduction to databases” by Eugene Wu at Columbia University
• “CSE344: Introduction to Data Management” by Dan Suciu at University of Washington
• “CMPT354: Database System I” by John Edgar at Simon Fraser University
• “CS186: Introduction to Database Systems” by Joe Hellerstein at UC Berkeley
• “CS145: Introduction to Databases” by Peter Bailis at Stanford
• “CS 348: Introduction to Database Management” by Grant Weddell at University of Waterloo