PostgreSQL query optimization

Nguyễn Hải Châu


Email: [email protected]
VNU University of Engineering and Technology (Trường Đại học Công nghệ)
Vietnam National University, Hanoi (Đại học Quốc gia Hà Nội)

N. H. Châu (VNU-UET) Query optimization 1 / 67


Introduction

SQL is a declarative language: SQL statements describe what users want to get, but not how to get it
Imperative language: users specify the steps to perform to get the desired results
In PostgreSQL: the database optimizer chooses the "best" way to execute an SQL statement
The "best" way is determined by many different factors, such as storage structures, indexes, and data statistics
Two types of databases: OLTP and OLAP
  OLTP (Online Transaction Processing): for business applications
  OLAP (Online Analytical Processing): for BI (Business Intelligence) and reporting


Roadmap for query optimization

EXPLAIN statement and tools
Theory:
  Query processing overview
  Algorithm cost models
  Index structures
  Execution plans


Building blocks of SQL

DDL (Data Definition Language): CREATE, ALTER, DROP, RENAME
DML (Data Manipulation Language): SELECT, INSERT, UPDATE, DELETE
DCL (Data Control Language): GRANT, REVOKE
TCL (Transaction Control Language): COMMIT, ROLLBACK, SAVEPOINT
Querying and filtering: WHERE, JOIN, UNION, INTERSECT, GROUP BY, ORDER BY, HAVING


pgAdmin EXPLAIN


EXPLAIN visualizer resources

https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-explain/
https://theartofpostgresql.com/explain-plan-visualizer/ → https://explain.dalibo.com/
https://explain.depesz.com/


Anatomy of an index



Introduction

An index is a distinct structure in the database that is built using the CREATE INDEX statement
An index is very similar to the index at the end of a book
Unlike a book index, a database index changes frequently
Using an index increases search speed (the first power of indexing) but slows down updates (INSERT, UPDATE and DELETE statements)
Important data structures of an index: a doubly linked list and a search tree


The index leaf nodes



The search tree (B-tree)
B-tree = balanced tree, not binary tree



B-tree traversal



Indexes can be slow

An index lookup requires three steps:
  Tree traversal
  Following the leaf node chain (the doubly linked list)
  Fetching the data from the table
Multiple leaf nodes may need to be read to find all matching entries
Table access may require multiple block reads
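
The three steps of an index lookup can be sketched in Python. This is a toy model, not PostgreSQL's implementation: a sorted list stands in for the leaf level, bisect stands in for the tree traversal, and the names table, leaf_entries and index_lookup are made up for illustration.

```python
from bisect import bisect_left

# Toy table: row_id -> (first_name, last_name). Hypothetical data.
table = {
    0: ("Anna", "Smith"),
    1: ("Bob", "Jones"),
    2: ("Cleo", "Smith"),
    3: ("Dan", "Young"),
}

# Leaf level of an index on last_name: sorted (key, row_id) entries.
leaf_entries = sorted((last, row_id) for row_id, (_, last) in table.items())


def index_lookup(key):
    """Return (matching rows, number of leaf entries followed)."""
    # Step 1: "tree traversal", find the first leaf entry with entry key >= key.
    pos = bisect_left(leaf_entries, (key,))
    rows, leaf_reads = [], 0
    # Step 2: follow the leaf entries as long as they still match.
    while pos < len(leaf_entries) and leaf_entries[pos][0] == key:
        leaf_reads += 1
        # Step 3: fetch the row from the table, one access per match.
        rows.append(table[leaf_entries[pos][1]])
        pos += 1
    return rows, leaf_reads


print(index_lookup("Smith"))
```

A key with many matches makes steps 2 and 3 repeat once per match, which is why an index lookup is not automatically fast.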


The WHERE clause
... defines search conditions



The equality operator

The equality operator is the most trivial and most frequently used SQL operator
Indexing mistakes can affect WHERE clauses that combine multiple conditions


Primary key search

create table employees (
    employee_id   integer     not null,
    first_name    varchar(64) not null,
    last_name     varchar(64) not null,
    date_of_birth date        not null,
    phone_number  varchar(16) not null,
    primary key (employee_id)
);

\d employees
                       Table "public.employees"
    Column     |         Type          | Collation | Nullable | Default
---------------+-----------------------+-----------+----------+---------
 employee_id   | integer               |           | not null |
 first_name    | character varying(64) |           | not null |
 last_name     | character varying(64) |           | not null |
 date_of_birth | date                  |           | not null |
 phone_number  | character varying(16) |           | not null |
Indexes:
    "employees_pkey" PRIMARY KEY, btree (employee_id)


Primary key search

select first_name, last_name from employees where employee_id = 6569;
  first_name  |    last_name
--------------+-----------------
 56M8GMZ4xv0w | K4ATJi4SVLAURe6
(1 row)

explain (analyze, verbose, costs, buffers)
select first_name, last_name from employees where employee_id = 6569;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Index Scan using employees_pkey on public.employees (cost=0.29..8.30 rows=1 width=29) (actual time=0.025..0.028 rows=1 loops=1)
   Output: first_name, last_name
   Index Cond: (employees.employee_id = 6569)
   Buffers: shared hit=3
 Planning Time: 0.091 ms
 Execution Time: 0.052 ms
(6 rows)


Concatenated index

alter table employees add column subsidiary_id integer;
update employees set subsidiary_id = 10;
update employees set subsidiary_id = 20 where employee_id % 2 = 0;

create unique index update_idx on employees(employee_id, subsidiary_id);

\d employees
                         Table "public.employees"
      Column      |         Type          | Collation | Nullable | Default
------------------+-----------------------+-----------+----------+---------
 employee_id      | integer               |           | not null |
 first_name       | character varying(64) |           | not null |
 last_name        | character varying(64) |           | not null |
 date_of_birth    | character varying     |           | not null |
 telephone_number | character varying(16) |           | not null |
 subsidiary_id    | integer               |           |          |
Indexes:
    "employees_pkey" PRIMARY KEY, btree (employee_id)
    "update_idx" UNIQUE, btree (employee_id, subsidiary_id)


Concatenated index

select first_name, last_name from employees where employee_id=214164 and subsidiary_id = 20;
  first_name  |    last_name
--------------+-----------------
 vtJiHCvQxHtP | syiYv8EhsZ9aMpK
(1 row)

explain (analyze, verbose, costs, buffers)
select first_name, last_name from employees where employee_id=214164 and subsidiary_id = 20;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Index Scan using update_idx on public.employees (cost=0.42..8.44 rows=1 width=29) (actual time=0.032..0.034 rows=1 loops=1)
   Output: first_name, last_name
   Index Cond: ((employees.employee_id = 214164) AND (employees.subsidiary_id = 20))
   Buffers: shared hit=4
 Planning Time: 0.123 ms
 Execution Time: 0.060 ms
(6 rows)


Concatenated index

select first_name, last_name from employees where subsidiary_id = 20 limit 1;
  first_name  |    last_name
--------------+-----------------
 vtJiHCvQxHtP | syiYv8EhsZ9aMpK
(1 row)

explain (analyze, verbose, costs, buffers)
select first_name, last_name from employees where subsidiary_id = 20 limit 1;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Limit (cost=0.00..0.08 rows=1 width=29) (actual time=0.011..0.011 rows=1 loops=1)
   Output: first_name, last_name
   Buffers: shared hit=1
   -> Seq Scan on public.employees (cost=0.00..6929.00 rows=84573 width=29) (actual time=0.009..0.009 rows=1 loops=1)
        Output: first_name, last_name
        Filter: (employees.subsidiary_id = 20)
        Buffers: shared hit=1
 Planning Time: 0.078 ms
 Execution Time: 0.025 ms


Concatenated index



Concatenated index

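A concatenated index (multi-column, composite or combined index) is one index across multiple columns
The order of columns in a concatenated index is important: the index supports searches on its leading column(s), but not searches on a trailing column alone
Using multiple separate indexes is slower than using one well-chosen concatenated index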
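
Why column order matters can be illustrated with a small Python sketch. It assumes a toy index on (employee_id, subsidiary_id) stored as a sorted list of tuples; the data and the function names by_employee and by_subsidiary are hypothetical and only mimic how a B-tree orders its entries.

```python
from bisect import bisect_left

# Toy index on (employee_id, subsidiary_id): sorted by employee_id first.
# Even employee ids belong to subsidiary 20, odd ones to subsidiary 10.
entries = sorted((emp_id, 20 if emp_id % 2 == 0 else 10) for emp_id in range(1, 9))


def by_employee(emp_id):
    """Leading column known: binary search jumps straight to the entries."""
    pos = bisect_left(entries, (emp_id,))
    hits = []
    while pos < len(entries) and entries[pos][0] == emp_id:
        hits.append(entries[pos])
        pos += 1
    return hits


def by_subsidiary(sub_id):
    """Leading column unknown: matches are scattered, so every entry is checked."""
    return [entry for entry in entries if entry[1] == sub_id]


print(by_employee(4))
print(by_subsidiary(10))
```

The second function has no sorted order to exploit, which corresponds to the sequential scan chosen for the query that filters on subsidiary_id alone.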


Exercise

What happens if we use

create unique index update_idx on employees(subsidiary_id, employee_id);
-- and
create index additional_idx on employees(subsidiary_id);

Conduct experiments to find the answer


The query optimizer

The query optimizer, or query planner, is the database component that transforms an SQL statement into an execution plan; this process is also called compiling or parsing
Cost-based optimizers (CBO) generate many execution plan variations and calculate a cost value for each plan based on the operations and the estimated row numbers, then choose the "best" execution plan
Rule-based optimizers (RBO) generate the execution plan using a hard-coded rule set: they are less flexible and seldom used
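
The idea of a cost-based choice can be sketched in Python. The formulas below are a deliberately simplified assumption and are not PostgreSQL's actual cost model; only the three constants are PostgreSQL's documented default settings (seq_page_cost, random_page_cost, cpu_tuple_cost). The function names are hypothetical.

```python
def seq_scan_cost(pages, rows, seq_page_cost=1.0, cpu_tuple_cost=0.01):
    """Read every page sequentially and look at every row."""
    return pages * seq_page_cost + rows * cpu_tuple_cost


def index_scan_cost(matching_rows, random_page_cost=4.0, cpu_tuple_cost=0.01):
    """Pessimistic toy assumption: one random page read per matching row."""
    return matching_rows * (random_page_cost + cpu_tuple_cost)


def choose_plan(pages, rows, selectivity):
    """Estimate both plans and return the name of the cheaper one."""
    matching_rows = rows * selectivity
    costs = {
        "seq scan": seq_scan_cost(pages, rows),
        "index scan": index_scan_cost(matching_rows),
    }
    return min(costs, key=costs.get)


# A table of 202 pages and 7376 rows, with two different selectivities.
print(choose_plan(202, 7376, 0.0001))
print(choose_plan(202, 7376, 0.5))
```

The estimated number of matching rows drives the decision, so wrong statistics can lead to a poor plan.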


Changing index

-- No additional index
select first_name, last_name, subsidiary_id, phone_number
from employees where last_name = 'xyz' and subsidiary_id = 10;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Seq Scan on public.employees (cost=0.00..259.57 rows=1 width=346) (actual time=1.647..1.647 rows=0 loops=1)
   Output: first_name, last_name, subsidiary_id, phone_number
   Filter: (((employees.last_name)::text = 'xyz'::text) AND (employees.subsidiary_id = 10))
   Rows Removed by Filter: 7376
   Buffers: shared hit=202
 Planning Time: 0.069 ms
 Execution Time: 1.673 ms
(7 rows)


Changing index

-- create unique index update_idx on employees(subsidiary_id, employee_id)
select first_name, last_name, subsidiary_id, phone_number
from employees where last_name = 'xyz' and subsidiary_id = 10;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.employees (cost=4.56..99.27 rows=1 width=346) (actual time=1.908..1.909 rows=0 loops=1)
   Output: first_name, last_name, subsidiary_id, phone_number
   Recheck Cond: (employees.subsidiary_id = 10)
   Filter: ((employees.last_name)::text = 'xyz'::text)
   Rows Removed by Filter: 3711
   Heap Blocks: exact=84
   Buffers: shared hit=84 read=22
   -> Bitmap Index Scan on update_idx (cost=0.00..4.56 rows=37 width=0) (actual time=0.687..0.688 rows=7376 loops=1)
        Index Cond: (employees.subsidiary_id = 10)
        Buffers: shared read=22
 Planning:
   Buffers: shared hit=18 read=1
 Planning Time: 0.282 ms
 Execution Time: 1.943 ms
(14 rows)


Changing index

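The query with the updated index is slower (execution time 1.943 ms vs. 1.673 ms)
Reason: fewer rows are removed by the filter (3711 vs. 7376), but the index on subsidiary_id is not selective: the optimizer estimated 37 index entries while the Bitmap Index Scan returned 7376, so most of the table is still read
Choosing the best execution plan depends on
  The table's data distribution
  How the optimizer uses statistics about the contents of the database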


Statistics

A cost-based optimizer uses statistics about tables, columns, and indexes
Column: the number of distinct values, the smallest and largest values, the number of NULL occurrences and the column's data distribution
Table: size in rows and blocks
Index: the tree depth, the number of leaf nodes, the number of distinct keys and the clustering factor


Indexes with functions



A case-insensitive search

explain (analyze, costs, verbose, buffers)
select first_name, last_name, phone_number
from employees where upper(last_name)=upper('xyz');
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Seq Scan on public.employees (cost=0.00..312.64 rows=37 width=40) (actual time=2.766..2.766 rows=0 loops=1)
   Output: first_name, last_name, phone_number
   Filter: (upper((employees.last_name)::text) = 'XYZ'::text)
   Rows Removed by Filter: 7376
   Buffers: shared hit=202
 Planning Time: 0.071 ms
 Execution Time: 2.780 ms
(7 rows)


The case-insensitive search with index

create index emp_up_name on employees (upper(last_name));
explain (analyze, costs, verbose, buffers)
select first_name, last_name, phone_number
from employees where upper(last_name)=upper('xyz');
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.employees (cost=4.57..99.28 rows=37 width=40) (actual time=0.027..0.028 rows=0 loops=1)
   Output: first_name, last_name, phone_number
   Recheck Cond: (upper((employees.last_name)::text) = 'XYZ'::text)
   Buffers: shared read=2
   -> Bitmap Index Scan on emp_up_name (cost=0.00..4.56 rows=37 width=0) (actual time=0.025..0.025 rows=0 loops=1)
        Index Cond: (upper((employees.last_name)::text) = 'XYZ'::text)
        Buffers: shared read=2
 Planning:
   Buffers: shared hit=17 read=1
 Planning Time: 0.318 ms
 Execution Time: 0.057 ms
(11 rows)


Function-based index

An index whose definition contains functions or expressions is a so-called function-based index (FBI)
A function-based index applies the function first and puts the result into the index
The database can use a function-based index if the exact expression of the index definition appears in an SQL statement
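
A function-based index can be pictured with a short Python sketch: the function is applied when the index is built, and a query can use the index only if it applies the same function to its search term. The dictionary upper_index and the sample rows are assumptions made for illustration, not a description of PostgreSQL internals.

```python
from collections import defaultdict

# Hypothetical (employee_id, last_name) rows.
employees = [(1, "Winand"), (2, "WINAND"), (3, "Smith")]

# Build step: apply the function first and store the result as the index key.
upper_index = defaultdict(list)
for emp_id, last_name in employees:
    upper_index[last_name.upper()].append(emp_id)


def find_case_insensitive(name):
    """Search with the same expression that was used to build the index."""
    return upper_index.get(name.upper(), [])


print(find_case_insensitive("winand"))
```

Looking up the raw, non-uppercased term in upper_index would miss the entries, which mirrors the rule that the exact indexed expression must appear in the statement.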


User-defined functions

create or replace function get_age(date_of_birth date)
returns int as
$$
begin
    return round((current_date - date_of_birth) / 365.0);
end;
$$ language plpgsql;

explain (analyze, costs, verbose, buffers)
select first_name, last_name, get_age(date_of_birth)
from employees where get_age(date_of_birth) = 42;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Seq Scan on public.employees (cost=0.00..2142.49 rows=37 width=33) (actual time=0.461..14.245 rows=110 loops=1)
   Output: first_name, last_name, get_age(date_of_birth)
   Filter: (get_age(employees.date_of_birth) = 42)
   Rows Removed by Filter: 7289
   Buffers: shared hit=191
 Planning:
   Buffers: shared hit=21
 Planning Time: 0.250 ms
 Execution Time: 14.277 ms
(9 rows)


Immutable functions

-- Without IMMUTABLE the function cannot be used in an index
create or replace function get_age(date_of_birth date)
returns int as
$$
begin
    return round((current_date - date_of_birth) / 365.0);
end;
$$ language plpgsql;
create index emp_up_name on employees (get_age(date_of_birth));
ERROR: functions in index expression must be marked IMMUTABLE

-- Declared IMMUTABLE: PostgreSQL accepts the index
create or replace function get_age(date_of_birth date)
returns int as
$$
begin
    return round((current_date - date_of_birth) / 365.0);
end;
$$ immutable language plpgsql;
create index emp_up_name on employees (get_age(date_of_birth));
CREATE INDEX

The user must ensure that the function really is immutable; PostgreSQL does not verify the declaration
Although PostgreSQL accepts get_age in a function-based index once it is declared IMMUTABLE, the index will not work correctly: get_age depends on current_date and is therefore not deterministic

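
Why a non-deterministic function breaks a function-based index can be shown with a small Python sketch. The function below follows the get_age formula from the slide but takes the current date as an explicit parameter (an assumption made so the effect is reproducible); the dates are hypothetical.

```python
from datetime import date


def get_age(date_of_birth, today):
    """Age in years, rounded as in the slide: round(days / 365.0)."""
    return round((today - date_of_birth).days / 365.0)


dob = date(1980, 6, 1)

# Value stored in the index when the index entry was written.
indexed_value = get_age(dob, today=date(2022, 1, 1))

# Value the same expression yields when a query runs one year later.
current_value = get_age(dob, today=date(2023, 1, 1))

print(indexed_value, current_value)
```

The index still holds the old value, so a search for the current age misses the row unless the index is rebuilt.
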
SQL parameterized queries and bind parameters

Bind parameters, also called dynamic parameters or bind variables, are used to pass data to the database using placeholders like ?, :name or @name
Advantages of bind parameters:
  Security: bind variables are the best way to prevent SQL injection
  Performance: databases with an execution plan cache can reuse the execution plan of identical statements
Note: bind parameter placeholders are not LIKE wildcards (%, _)
Bind parameters cannot change the structure of an SQL statement. To change the structure of an SQL statement at runtime, use dynamic SQL
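
The security advantage can be demonstrated with a minimal Python sketch. It uses the sqlite3 module from the standard library with an in-memory SQLite database as a stand-in (an assumption made to keep the example self-contained; it is not PostgreSQL), and the table contents and the malicious input are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employees (employee_id integer, subsidiary_id integer)")
conn.executemany("insert into employees values (?, ?)", [(1, 10), (2, 20), (3, 10)])

# Input controlled by an attacker.
malicious = "10 or 1=1"

# String concatenation: the input becomes part of the SQL text
# and changes the structure of the statement.
unsafe_rows = conn.execute(
    "select employee_id from employees where subsidiary_id = " + malicious
).fetchall()

# Bind parameter: the input is passed as a plain value
# and cannot change the structure of the statement.
safe_rows = conn.execute(
    "select employee_id from employees where subsidiary_id = ?", (malicious,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
conn.close()
```

With concatenation the injected "or 1=1" returns every row; with the bind parameter the whole input is compared as one value and matches nothing.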


Example: MySQL bind parameters and PHP

// No bind parameters
$mysqli->query("select first_name, last_name"
             . " from employees"
             . " where subsidiary_id = " . $subsidiary_id);

// Using a bind parameter
if ($stmt = $mysqli->prepare("select first_name, last_name"
                           . " from employees"
                           . " where subsidiary_id = ?"))
{
    $stmt->bind_param("i", $subsidiary_id);
    $stmt->execute();
} else {
    /* handle SQL error */
}


PostgreSQL PREPARE and EXECUTE

A prepared statement is a server-side object that can be used to optimize performance
When the PREPARE statement is executed, the specified statement is parsed, analyzed, and rewritten
When an EXECUTE command is subsequently issued, the prepared statement is planned and executed: this avoids repetitive parse analysis work while allowing the execution plan to depend on the specific parameter values supplied


PostgreSQL PREPARE and EXECUTE

-- Prepare statements
prepare add_employee(int, varchar(64), varchar(64), date, varchar(16), int) as
    insert into employees values ($1, $2, $3, $4, $5, $6);

prepare get_employee(int) as
    select * from employees where employee_id = $1;

-- Execute statements
start transaction;
execute add_employee(10, 'FN', 'LN', '2005-01-02', 123, 12);
execute get_employee(10);
rollback;

-- Deallocate prepared statements, one by one or all at once
deallocate prepare add_employee;
deallocate prepare get_employee;
deallocate prepare all;


Searching for ranges

The access predicates are the start and stop conditions for an index lookup. They define the scanned index range
Inequality operators (<, >, BETWEEN) can use indexes just like the equality operator
Performance risk of a range search: leaf node traversal → the golden rule is to keep the scanned index range as small as possible
Range conditions (<, >, BETWEEN) may not be able to exploit the column order of a multi-column index
Rule of thumb: index for equality first, then for ranges
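
The rule of thumb can be checked with a small Python sketch that counts how many index entries lie between the start and stop conditions. Sorted lists of tuples stand in for two alternative two-column indexes; the data, the column named day and the helper scanned are hypothetical.

```python
from bisect import bisect_left, bisect_right

# 5 subsidiaries x 100 days = 500 rows of (subsidiary_id, day).
rows = [(sub, day) for sub in range(1, 6) for day in range(1, 101)]


def scanned(index, low, high, match):
    """Return (entries in the scanned index range, entries that match)."""
    start = bisect_left(index, low)
    stop = bisect_right(index, high)
    window = index[start:stop]
    return len(window), sum(1 for entry in window if match(entry))


# Query: subsidiary_id = 3 AND day BETWEEN 10 AND 19
eq_first = sorted(rows)                                # index (subsidiary_id, day)
range_first = sorted((day, sub) for sub, day in rows)  # index (day, subsidiary_id)

a = scanned(eq_first, (3, 10), (3, 19), lambda e: True)
b = scanned(range_first, (10, 3), (19, 3), lambda e: e[1] == 3)

print(a, b)
```

Both indexes find the same 10 rows, but the index with the equality column first scans only those 10 entries, while the other one scans 46 entries and discards most of them.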


Indexing LIKE filters

LIKE filters can only use the characters before the first wildcard during tree traversal
Only the part before the first wildcard serves as an access predicate
The remaining characters are just filter predicates that do not narrow the scanned index range
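
How the prefix before the first wildcard becomes a scanned range can be sketched in Python. A sorted list stands in for the index; the helper like_search is hypothetical and, as a simplifying assumption, handles only the % wildcard (not _) on lowercase ASCII names.

```python
from bisect import bisect_left
from fnmatch import fnmatchcase

# Hypothetical index on last_name: a sorted list of keys.
names = sorted(["xavier", "xyz", "xyzzy", "xy", "yates", "axyz"])


def like_search(pattern):
    """Return (matching names, number of index entries scanned)."""
    prefix = pattern.split("%", 1)[0]
    if prefix:
        # Access predicate: prefix <= name < prefix with its last character bumped.
        upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
        window = names[bisect_left(names, prefix):bisect_left(names, upper)]
    else:
        # Leading wildcard: nothing narrows the range, the whole index is scanned.
        window = names
    # Filter predicate: the full pattern is checked on every scanned entry.
    glob = pattern.replace("%", "*")
    return [name for name in window if fnmatchcase(name, glob)], len(window)


print(like_search("xyz%"))
print(like_search("x%z"))
print(like_search("%xyz"))
```

For 'xyz%' the bounds are 'xyz' and 'xy{', the same kind of range condition PostgreSQL can derive for a prefix search; the shorter the prefix, the more entries are scanned.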


Indexing LIKE filters

create index ln_idx on employees (last_name);
explain (analyze, verbose, costs, settings, buffers, wal, timing, summary)
select * from employees where last_name like 'xyz%';
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Seq Scan on public.employees (cost=0.00..245.49 rows=1 width=52) (actual time=0.766..0.767 rows=0 loops=1)
   Output: employee_id, first_name, last_name, date_of_birth, phone_number, subsidiary_id
   Filter: ((employees.last_name)::text ~~ 'xyz%'::text)
   Rows Removed by Filter: 7399
   Buffers: shared hit=153
 Planning:
   Buffers: shared hit=16 read=1
 Planning Time: 0.167 ms
 Execution Time: 0.779 ms
(9 rows)


Indexing LIKE filters

-- varchar_pattern_ops lets a B-tree index serve LIKE prefix searches
-- when the database does not use the C locale
create index ln_idx on employees (last_name varchar_pattern_ops);
explain (analyze, verbose, costs, settings, buffers, wal, timing, summary)
select * from employees where last_name like 'xyz%';
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Index Scan using ln_idx on public.employees (cost=0.28..8.30 rows=1 width=52) (actual time=0.023..0.023 rows=0 loops=1)
   Output: employee_id, first_name, last_name, date_of_birth, phone_number, subsidiary_id
   Index Cond: (((employees.last_name)::text ~>=~ 'xyz'::text) AND ((employees.last_name)::text ~<~ 'xy{'::text))
   Filter: ((employees.last_name)::text ~~ 'xyz%'::text)
   Buffers: shared read=2
 Planning:
   Buffers: shared hit=16 read=1
 Planning Time: 0.314 ms
 Execution Time: 0.048 ms
(9 rows)


Index merge

One index on multiple columns is better than several separate single-column indexes
One index scan is faster than two or more
Bitmap indexes are almost unusable for OLTP


Partial index
A partial index is useful for commonly used WHERE conditions that use constant values
For example, a common query in queueing systems (find unprocessed messages):

SELECT message FROM messages
 WHERE processed = 'N' AND receiver = ?

A normal index:

CREATE INDEX messages_todo
    ON messages (receiver, processed)

A better solution is the partial index:

CREATE INDEX messages_todo ON messages (receiver)
 WHERE processed = 'N'
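
The size advantage of the partial index can be pictured with a short Python sketch. Sorted lists stand in for the two index variants; the message rows and the helper todo are made up for illustration.

```python
# Hypothetical (message_id, receiver, processed) rows.
messages = [
    (1, "alice", "Y"), (2, "alice", "N"), (3, "bob", "Y"),
    (4, "bob", "Y"), (5, "alice", "Y"), (6, "bob", "N"),
]

# Normal index on (receiver, processed): one entry per row.
full_index = sorted((receiver, processed, msg_id)
                    for msg_id, receiver, processed in messages)

# Partial index on (receiver) WHERE processed = 'N': unprocessed rows only.
partial_index = sorted((receiver, msg_id)
                       for msg_id, receiver, processed in messages
                       if processed == "N")


def todo(receiver):
    """Unprocessed message ids for one receiver, read from the partial index."""
    return [msg_id for r, msg_id in partial_index if r == receiver]


print(len(full_index), len(partial_index), todo("alice"))
```

The partial index holds only the rows the queue query can ever ask for, so it stays small even when the table of processed messages keeps growing.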


Partial index

create index partial_idx on employees (phone_number) where subsidiary_id=10;
\d employees
                       Table "public.employees"
    Column     |         Type          | Collation | Nullable | Default
---------------+-----------------------+-----------+----------+---------
 employee_id   | integer               |           | not null |
 first_name    | character varying(64) |           | not null |
 last_name     | character varying(64) |           | not null |
 date_of_birth | date                  |           | not null |
 phone_number  | character varying(16) |           | not null |
 subsidiary_id | integer               |           |          |
Indexes:
    "employees_pkey" PRIMARY KEY, btree (employee_id)
    "partial_idx" btree (phone_number) WHERE subsidiary_id = 10
    "update_idx" UNIQUE, btree (subsidiary_id, employee_id)


Obfuscated conditions


Obfuscated conditions
Obfuscated conditions are WHERE clauses that are phrased in a way that prevents proper index usage
Most obfuscations involve DATE types
A solution: a function-based index, for example:

CREATE INDEX index_name ON table_name (TRUNC(sale_date))

but then every query must use TRUNC(sale_date) in its WHERE clause

Alternative solution: use an explicit range condition

-- Range query
SELECT ... FROM sales
 WHERE sale_date BETWEEN quarter_begin(?)
                     AND quarter_end(?)
-- Index: a straight index on SALE_DATE is enough to optimize this query
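
The explicit range condition can be sketched in Python. The functions quarter_begin and quarter_end below are day-granularity counterparts of the helpers named in the query (an assumption; the SQL versions may work on timestamps), and a sorted list of dates stands in for a plain index on sale_date.

```python
from bisect import bisect_left, bisect_right
from datetime import date, timedelta


def quarter_begin(d):
    """First day of the quarter that contains d."""
    first_month = 3 * ((d.month - 1) // 3) + 1
    return date(d.year, first_month, 1)


def quarter_end(d):
    """Last day of the quarter that contains d."""
    begin = quarter_begin(d)
    if begin.month == 10:
        next_begin = date(begin.year + 1, 1, 1)
    else:
        next_begin = date(begin.year, begin.month + 3, 1)
    return next_begin - timedelta(days=1)


# Hypothetical plain index on sale_date.
sale_dates = sorted([date(2023, 3, 31), date(2023, 4, 1), date(2023, 5, 15),
                     date(2023, 6, 30), date(2023, 7, 1)])


def sales_in_quarter(d):
    """Range lookup: sale_date BETWEEN quarter_begin(d) AND quarter_end(d)."""
    low = bisect_left(sale_dates, quarter_begin(d))
    high = bisect_right(sale_dates, quarter_end(d))
    return sale_dates[low:high]


print(sales_in_quarter(date(2023, 5, 15)))
```

The indexed column itself is never wrapped in a function; only the bounds are computed, so the ordinary sort order of the index can be used.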


Obfuscated conditions: PostgreSQL

CREATE FUNCTION quarter_begin(dt timestamp with time zone)
RETURNS timestamp with time zone AS $$
BEGIN
    RETURN date_trunc('quarter', dt);
END;
$$ LANGUAGE plpgsql;

CREATE FUNCTION quarter_end(dt timestamp with time zone)
RETURNS timestamp with time zone AS $$
BEGIN
    RETURN date_trunc('quarter', dt)
         + interval '3 month'
         - interval '1 microsecond';
END;
$$ LANGUAGE plpgsql;


Numeric strings

Numeric strings are numbers that are stored in text columns
This is normally bad practice, even though an index can be created on a numeric string
Rule: use numeric types to store numbers


The JOIN operation



The JOIN operation

Building block: a join of two tables
Join order affects performance
Using bind parameters is very important for complex join statements, to avoid recompiling them


Nested loop using PHP

$qb = $em->createQueryBuilder();
$qb->select('e')
   ->from('Employees', 'e')
   ->where("upper(e.last_name) like :last_name")
   ->setParameter('last_name', 'WIN%');
$r = $qb->getQuery()->getResult();
foreach ($r as $row) {
    // process Employee
    foreach ($row->getSales() as $sale) {
        // process Sale for Employee
    }
}

Indexing for nested loops is like indexing for the individual SELECT statements
SQL joins are more efficient than nested loops written in application code


Hash join

The hash join addresses the weak spot of the nested loops join: the many B-tree traversals
A hash join requires an entirely different indexing approach than the nested loops join
Indexing strategy:
  No need to index the join columns
  Only indexes for independent WHERE predicates improve hash join performance
Important: indexing join predicates does not improve hash join performance
Indexing a hash join is independent of the join order
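
A minimal Python sketch of the hash join algorithm shows why indexes on the join columns do not help: one input is loaded into an in-memory hash table and the other input probes it, so no B-tree is traversed. The function hash_join and the sample rows are hypothetical.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Join two row lists on equal keys using an in-memory hash table."""
    # Build phase: load one side into a hash table keyed by the join columns.
    buckets = defaultdict(list)
    for row in build_rows:
        buckets[build_key(row)].append(row)
    # Probe phase: one hash lookup per row of the other side.
    joined = []
    for row in probe_rows:
        for match in buckets.get(probe_key(row), []):
            joined.append((match, row))
    return joined


# (employee_id, subsidiary_id, name) and (sale_id, employee_id, subsidiary_id)
employees = [(1, 10, "Anna"), (2, 20, "Bob")]
sales = [(100, 1, 10), (101, 2, 20), (102, 1, 10), (103, 3, 10)]

result = hash_join(
    employees, sales,
    build_key=lambda e: (e[1], e[0]),   # (subsidiary_id, employee_id)
    probe_key=lambda s: (s[2], s[1]),   # (subsidiary_id, employee_id)
)
print([sale[0] for _, sale in result])
```

Both inputs are read in full, so the only way an index helps is by shrinking an input through an independent WHERE predicate before the join starts.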


Hash join example

-- Join (trunc(sysdate) is Oracle syntax)
SELECT *
  FROM sales s
  JOIN employees e ON (s.subsidiary_id = e.subsidiary_id
                   AND s.employee_id = e.employee_id)
 WHERE s.sale_date > trunc(sysdate) - INTERVAL '6' MONTH

-- Index for WHERE predicate
CREATE INDEX sales_date ON sales (sale_date);

MySQL Community Edition supports hash joins since version 8.0.18


Sort merge

The sort-merge join combines two sorted lists like a zipper
Both sides of the join must be sorted by the join predicates
A sort-merge join needs the same indexes as the hash join, that is, an index for the independent conditions to read all candidate records in one shot
Indexing the join predicates is useless
Up to this point, the sort-merge join is like the hash join
The sort-merge join is absolutely symmetric (the join order makes no difference) and very useful for outer joins
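
The zipper behaviour of the sort-merge join can be sketched in Python. The function merge_join and its inputs are hypothetical; rows are (key, payload) tuples and both sides are sorted inside the function for simplicity.

```python
def merge_join(left, right, key):
    """Inner join of two row lists on equal keys using sort and merge."""
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        left_key, right_key = key(left[i]), key(right[j])
        if left_key < right_key:
            i += 1                      # advance the side with the smaller key
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == left_key:
                j_end += 1
            # Pair every left row of this key with the whole right run.
            while i < len(left) and key(left[i]) == left_key:
                joined.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return joined


left_rows = [(2, "b"), (1, "a"), (2, "c")]
right_rows = [(2, "x"), (3, "y"), (2, "z")]
print(merge_join(left_rows, right_rows, key=lambda row: row[0]))
```

Swapping the two inputs only swaps the order inside each result pair, which illustrates the symmetry of the algorithm; if an input already arrives sorted, its sort step can be skipped.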


Clustering data



Clustering data

Clustering data: storing consecutively accessed data closely together so that accessing it requires fewer IO operations
Example: column-oriented databases are common in OLAP processing, which accesses many rows but only a few columns
Indexes allow one to cluster data: the second power of indexing


Index filter predicates used intentionally

Index filter predicates can be used to group consecutively accessed data together
WHERE clause predicates that cannot serve as access predicates are good candidates for this technique
The query performance depends on the physical distribution of the accessed rows
Reordering the table data on disk is impractical because the rows can be stored in only one sequence
The index clustering factor: the probability that two succeeding index entries refer to the same table block
One can add many columns to an index so that they are automatically stored in a well-defined order: the second power of indexing
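
The clustering factor as defined above can be computed in a few lines of Python. The input is the list of table block numbers in index order; the function names and the two sample layouts are hypothetical, and block_reads assumes (as a simplification) that only the most recently read block stays cached.

```python
def clustering_probability(index_blocks):
    """Share of succeeding index entries that refer to the same table block."""
    pairs = list(zip(index_blocks, index_blocks[1:]))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


def block_reads(index_blocks):
    """Table block reads when only the last block read is still cached."""
    return sum(1 for i, block in enumerate(index_blocks)
               if i == 0 or block != index_blocks[i - 1])


clustered = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # index order follows table order
scattered = [0, 2, 1, 0, 2, 1, 0, 2, 1]   # index order jumps between blocks

print(clustering_probability(clustered), block_reads(clustered))
print(clustering_probability(scattered), block_reads(scattered))
```

Both layouts touch the same three blocks, but the scattered one revisits them on every entry, which is the effect the physical row distribution has on query performance.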


Index-only scan

The index-only scan is one of the most powerful tuning methods of all
It not only avoids accessing the table to evaluate the WHERE clause, but avoids accessing the table completely if the database can find the selected columns in the index itself
To cover an entire query, an index must contain all columns used in the SQL statement: a covering index
The performance advantage of an index-only scan depends on the number of accessed rows and the index clustering factor


Sorting and grouping



Indexing ORDER BY

If the index order corresponds to the ORDER BY clause, the database can omit the explicit sort operation: the same index that is used for the WHERE clause must also cover the ORDER BY clause
Tip: use the full index definition in the ORDER BY clause to find the reason for an explicit sort operation


Indexing ASC, DESC and NULLS FIRST/LAST

A DBMS can read indexes in both directions
When using mixed ASC and DESC modifiers in the ORDER BY clause, one must define the index likewise in order to use it for a pipelined ORDER BY


Indexing GROUP BY

SQL databases use two GROUP BY algorithms:
  Hash algorithm: aggregates the input records in a temporary hash table; once all input records are processed, the hash table is returned as the result
  Sort/group algorithm: first sorts the input data by the grouping key; afterwards, the DBMS just needs to aggregate them
The sort/group algorithm can use an index to avoid the sort operation
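
The two GROUP BY algorithms can be sketched in Python for a SUM aggregate. The function names and the sample rows are hypothetical; the presorted flag models input that already arrives in grouping-key order, for example from an index.

```python
from itertools import groupby


def hash_group_sum(rows):
    """Hash algorithm: aggregate into a hash table, return it at the end."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


def sort_group_sum(rows, presorted=False):
    """Sort/group algorithm: sort by key, then aggregate adjacent rows.

    With presorted=True the sort is skipped and each group can be
    emitted as soon as its last row has been read (pipelined).
    """
    ordered = rows if presorted else sorted(rows, key=lambda row: row[0])
    for key, group in groupby(ordered, key=lambda row: row[0]):
        yield key, sum(value for _, value in group)


rows = [(20, 5), (10, 1), (20, 7), (10, 2)]   # (subsidiary_id, amount)
print(hash_group_sum(rows))
print(list(sort_group_sum(rows)))
```

The hash variant cannot return anything before all input is consumed, whereas the sort/group variant on presorted input needs no temporary structure at all.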


Modifying data



INSERT

The number of indexes on a table is the most dominant factor for INSERT performance
The more indexes a table has, the slower the execution becomes
The INSERT statement is the only operation that cannot directly benefit from indexing because it has no WHERE clause
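
The link between the number of indexes and the insert cost can be made concrete with a toy Python model. The class Table below is hypothetical: each index is a sorted list that must receive one new entry per inserted row, and index_writes simply counts those maintenance operations.

```python
from bisect import insort


class Table:
    """Toy table whose indexes are sorted lists of (value, row_id) entries."""

    def __init__(self, indexed_columns):
        self.rows = []
        self.indexes = {column: [] for column in indexed_columns}
        self.index_writes = 0

    def insert(self, row):
        row_id = len(self.rows)
        self.rows.append(row)
        # Every index needs its own new entry, placed in sorted position.
        for column, entries in self.indexes.items():
            insort(entries, (row[column], row_id))
            self.index_writes += 1


no_index = Table([])
three_indexes = Table(["a", "b", "c"])
for n in range(5):
    row = {"a": n, "b": -n, "c": n % 2}
    no_index.insert(row)
    three_indexes.insert(row)

print(no_index.index_writes, three_indexes.index_writes)
```

The table write itself is the same in both cases; the extra work grows linearly with the number of indexes.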


Improve INSERT performance

Use indexes deliberately and sparingly, and avoid redundant indexes
This is also beneficial for DELETE and UPDATE statements

DELETE

Unlike the INSERT statement, the DELETE statement has a WHERE clause, so it can benefit from all the indexing methods described for the WHERE clause
In fact, the DELETE statement works like a SELECT that is followed by an extra step to delete the identified rows

UPDATE

An UPDATE statement must relocate the changed index entries to maintain the index order
The response time is basically the same as for the respective DELETE and INSERT statements together
The UPDATE performance, just like that of INSERT and DELETE, also depends on the number of indexes on the table
