IoTDBS 2024 Presentation


Sample-Based Cardinality Estimation in Full Outer Join Queries
April 28-30, 2024
Onsite Presentation

Uriy Grigorev, Olga Pluzhnikova, Evgeny Detkov
Bauman Moscow State Technical University, BMSTU
Moscow, Russia
e-mail: [email protected], [email protected], [email protected]

Andrey Ploutenko, Aleksey Burdakov
Amur State University, AmSU
Blagoveschensk, Russia
e-mail: [email protected], [email protected]

Problem Statement
A general database query:

SELECT attributes, aggregates
FROM T_1 JOIN T_2 ON ... JOIN T_m ON ...
WHERE (condition on T_1) AND (condition on T_2) ... AND (condition on T_m)
AND IN (select ... from ... where ...)
AND EXISTS (select ... from ... where ...)
AND NOT EXISTS (select ... from ... where ...)
GROUP BY ... ORDER BY ...;

$U = Q_1 \bowtie Q_2 \bowtie \ldots \bowtie Q_m$,

where $Q_i$ = (select attr_i from T_i where condition on T_i) are the subqueries that read from the original tables.

Main task (future work): develop an algorithm for selecting a query execution plan, together with a cost model, for queries that join a large number of tables (m on the order of 100).

Current task (topic of this presentation): develop a method for estimating the cardinality of intermediate tables (IT), namely the join $U$ of the subqueries $Q = (Q_1, Q_2, \ldots, Q_m)$ and of its subplans $(Q_{i_1}, Q_{i_2}, \ldots, Q_{i_k}) \subseteq Q$.

Estimating cardinality (number of records) plays a key role in creating efficient query plans in large RDBs
IMDB dataset, 113 queries (3 to 16 joins) [1].
Table 1. Impact of the accuracy of cardinality estimation of intermediate tables (IT) [1].

For 35.6%, 22%, 47.1%, 25.4%, and 33.2% of queries (one value per DBMS), the execution time was more than 2 times longer. For a considerable percentage of queries, the execution time was longer by more than two orders of magnitude (>100 times). The problem is still relevant.

[1] Leis V. et al. How good are query optimizers, really? // Proceedings of the VLDB Endowment. 2015. Vol. 9, no. 3. P. 204-215.
Existing Methods for Estimating the Cardinality of an Intermediate Table
1. Histograms and samples
• widely used in DBMSs
• usually based on simplifying assumptions and expert-developed heuristics
2. Query-based machine learning (ML)
• attempt to train a model to estimate Card(T, Q) from a query
• some advanced ML methods improve performance by using more complex models, e.g.:
  • deep neural networks (DNNs)
  • gradient boosted trees
3. Data-driven ML methods
• query-agnostic
• treat each tuple in T as a point drawn from the joint distribution $P_T(A) = P_T(A_1, A_2, \ldots, A_k)$. Let $P_T(Q) = P_T(A_1 \in R_1 \wedge A_2 \in R_2 \wedge \ldots \wedge A_k \in R_k)$ be the probability that corresponds to query Q; then $\mathrm{Card}(T, Q) = P_T(Q) \cdot |T|$
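As a small illustration of the relation Card(T, Q) = P_T(Q) · |T|, the sketch below evaluates P_T(Q) from the empirical joint distribution of a toy table (not from a learned model) and contrasts it with the product of per-attribute marginals, the kind of simplifying independence assumption that classical estimators rely on. The table, the ranges R_i and all function names are invented for this example.

```python
from fractions import Fraction

# Toy table T with two perfectly correlated attributes (A1, A2); assumed data.
T = [(1, 10), (1, 10), (2, 20), (2, 20), (3, 30), (3, 30), (4, 40), (4, 40)]

# Query Q: A1 in R1 and A2 in R2, each range given as inclusive (low, high).
RANGES = [(1, 2), (10, 20)]


def in_range(value, bounds):
    low, high = bounds
    return low <= value <= high


def joint_probability(table, ranges):
    """P_T(Q): fraction of tuples whose attributes all fall in their ranges."""
    hits = sum(all(in_range(v, r) for v, r in zip(row, ranges)) for row in table)
    return Fraction(hits, len(table))


def independent_probability(table, ranges):
    """Product of per-attribute marginals, i.e. P_T(Q) under assumed independence."""
    p = Fraction(1)
    for k, r in enumerate(ranges):
        p *= Fraction(sum(in_range(row[k], r) for row in table), len(table))
    return p


def cardinality(probability, table):
    """Card(T, Q) = P_T(Q) * |T|."""
    return probability * len(table)


if __name__ == "__main__":
    print("joint:      ", cardinality(joint_probability(T, RANGES), T))
    print("independent:", cardinality(independent_probability(T, RANGES), T))
```

On this correlated toy table the joint distribution gives the exact count, while the independence assumption underestimates it by half.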

Current open source and commercial DBMSs primarily use two traditional CardEst methods:
• histograms in PostgreSQL and MS SQL Server
• sampling in MySQL and MariaDB

Disadvantages of Existing Cardinality Estimation Methods

Disadvantage | Method
The cardinality is estimated separately for each subplan, so the estimation time is large and proportional to the number of subplans of the original query | all
Correlations between filter (selection) attributes and join attributes are not taken into account | histogram-based
Indexes on the foreign keys of the joins are required | sample-based
Only table joins based on attribute equality are considered | all
The tables to be joined must form an acyclic graph | ML-based
Simplifying assumptions or "magic" numbers are used when analyzing complex table filtering conditions ('!=', LIKE, etc.) | all
The cardinality estimate degrades as the number of joined tables increases | all
The methods are justified only at the level of heuristics | all

Example: STATS test, query Q57 (see below), subquery {1,5,3,4,2}. EXPLAIN in PostgreSQL estimates the subquery cardinality at 125,416 records, while the actual cardinality is 1,375,709,726,310 records. After 7 hours of wall-clock time (an overnight run), the subquery had still not finished executing in PostgreSQL 15.
Full Outer Join (FOJ) of Tables ($Q_1, Q_2, \ldots, Q_m$)
(1)
$C(Q)$ is the cardinality of the join of tables $(Q_1, Q_2, \ldots, Q_m)$; $F$ is the number of rows in the FOJ; the indicator $1_{Q_j,i}$ is 0 if, in the $i$-th row of the FOJ = (Q1 ⊲⊳ Q2 ... ⊲⊳ Qm), the attributes of $Q_j$ are equal to the empty symbol $\varnothing$ (there is no connection with a record from $Q_j$), and 1 otherwise.
An example of a theta join: SELECT * FROM Q1, Q2, Q3 WHERE Q1.A1 = Q2.A1 AND Q1.A2 >= Q2.A2 AND Q2.A3 != Q3.A3;
[Figure: example tables Q1, Q2, Q3 and their FOJ; legend entries: "$1_{Q_j,i} = 1$", "navigation"]

C(Q) = 6:
(2,12)-(2,10,33)-(23)
(2,12)-(2,10,33)-(23)
(4,10)-(4,9,33)-(23)
(4,10)-(4,9,33)-(23)
(9,33)-(9,13,23)-(33)
(9,33)-(9,13,23)-(33)
Disadvantage: computing the FOJ takes a long time (here each table $Q_j$ acts as a single block).
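The body of formula (1) is not present in the text. Read together, the definitions of $C(Q)$, $F$ and the indicator $1_{Q_j,i}$ suggest the following form; this is a reconstruction under that assumption, not necessarily the authors' exact notation. It counts the FOJ rows in which no table is padded with the empty symbol:

```latex
C(Q) \;=\; \sum_{i=1}^{F} \; \prod_{j=1}^{m} 1_{Q_j,\,i}
```

Under this reading, each of the six rows listed in the example has all three indicators equal to 1 and contributes 1 to the sum, while FOJ rows that contain an empty symbol contribute 0, which gives C(Q) = 6.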

Full Outer Join (FOJ) of Blocks

(2)

$Q_{j,i}$ is the $i$-th block of table $Q_j$. The sum is taken over all combinations of blocks.

Example (for the query and tables, see above): Q1.A1 = Q2.A1 AND Q1.A2 >= Q2.A2 AND Q2.A3 != Q3.A3;

[Figure legend: the highlighted regions of table $Q_j$ are its blocks $Q_{j,i}$; the joins shown are the FOJs of these blocks.]
C(Q1,1, Q2,1, Q3,1) = 2, C(Q1,1, Q2,1, Q3,2) = 0,
C(Q1,1, Q2,2, Q3,1) = 0, C(Q1,1, Q2,2, Q3,2) = 0,
C(Q1,2, Q2,1, Q3,1) = 2, C(Q1,2, Q2,1, Q3,2) = 0,
C(Q1,2, Q2,2, Q3,1) = 0, C(Q1,2, Q2,2, Q3,2) = 0,
C(Q1,3, Q2,1, Q3,1) = 0, C(Q1,3, Q2,1, Q3,2) = 0,
C(Q1,3, Q2,2, Q3,1) = 0, C(Q1,3, Q2,2, Q3,2) = 2.
C(Q) = 6
Disadvantage: the cardinality of the join of a single combination of blocks (Q1,i1, ..., Qm,im) is computed quickly (the blocks are small), but the number of combinations (Q1,i1, ..., Qm,im) may be very large.
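The block decomposition can be checked on this example with a few lines of Python. This is an illustrative sketch, not the authors' C implementation. It assumes that formula (2) sums the block-level cardinalities over all block combinations, as the text describes. The table contents and the split into blocks are also assumptions, inferred from the six result rows and the block-level counts listed above (the original figure may contain further, non-joining rows). `join_count` is a hypothetical helper that counts matching record triples directly instead of materializing the FOJ.

```python
from itertools import product

# Assumed example data (inferred from the result rows and block counts above).
# Each table is a list of blocks; each block is a list of records.
Q1 = [[(2, 12)], [(4, 10)], [(9, 33)]]            # records (A1, A2)
Q2 = [[(2, 10, 33), (4, 9, 33)], [(9, 13, 23)]]   # records (A1, A2, A3)
Q3 = [[(23,), (23,)], [(33,), (33,)]]             # records (A3,)


def theta(r1, r2, r3):
    """Join condition: Q1.A1 = Q2.A1 and Q1.A2 >= Q2.A2 and Q2.A3 != Q3.A3."""
    return r1[0] == r2[0] and r1[1] >= r2[1] and r2[2] != r3[0]


def join_count(b1, b2, b3):
    """Hypothetical helper: number of record triples in the blocks satisfying theta."""
    return sum(theta(r1, r2, r3) for r1, r2, r3 in product(b1, b2, b3))


def block_counts():
    """Cardinality of every block combination, keyed by 1-based block indices."""
    return {
        (i + 1, j + 1, k + 1): join_count(b1, b2, b3)
        for (i, b1), (j, b2), (k, b3) in product(
            enumerate(Q1), enumerate(Q2), enumerate(Q3)
        )
    }


def total_cardinality():
    """Sum the block-level cardinalities over all block combinations."""
    return sum(block_counts().values())


if __name__ == "__main__":
    for combo, count in sorted(block_counts().items()):
        print(combo, count)
    print("C(Q) =", total_cardinality())
```

With 3, 2 and 2 blocks there are only 12 combinations here, so exhaustive enumeration is cheap; the number of combinations is the product of the block counts, which is what makes enumeration impractical for many tables.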

Proposed Method for Estimating the Cardinality of Joining Tables ($Q_1, Q_2, \ldots, Q_m$) (EVACAR)

To reduce the amount of computation, we use the theory of approximate calculation of aggregates.
1. With probability $\pi_g$, select a combination of blocks $g = (i_1, \ldots, i_m)$:
$$\pi_g = \prod_{j=1}^{m} \frac{1}{N_j}, \qquad N_j \text{ is the number of blocks in table } Q_j. \quad (3)$$
2. Compute the full outer join for it: $\mathrm{FOJ}_g = \mathrm{FOJ}(Q_{1,i_1}, \ldots, Q_{m,i_m})$.
3. For this $\mathrm{FOJ}_g$, calculate the cardinality $c_g = c(Q_{1,i_1}, \ldots, Q_{m,i_m})$ according to formula (1).
4. The sampling of $g$ is repeated $n$ times. The cardinality is then estimated using the formula
$$c(Q, n) = \frac{1}{n} \sum_{g} \frac{c_g}{\pi_g} = \frac{N}{n} \sum_{g} c_g, \qquad N = \prod_{j=1}^{m} N_j. \quad (4)$$
Properties of estimate (4):
1. $c(Q, n) \to c(Q)$ as $n \to \infty$, and $\forall n \, (\mathrm{E}\, c(Q, n) = c(Q))$, i.e., estimate (4) is unbiased for any $n$.
2. Property 1 holds for any probability distribution $\{\pi_g\}$ such that $c_g \neq 0 \Rightarrow \pi_g \neq 0$.
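The four steps and the unbiasedness property can be sketched as follows. This is a minimal illustration, not the authors' C implementation. The block-level cardinalities c_g are taken from the FOJ-of-blocks example (three tables with 3, 2 and 2 blocks) instead of being computed from a sampled FOJ, the probabilities are the uniform π_g = 1/N of formula (3), and all function names are hypothetical.

```python
import random
from fractions import Fraction
from itertools import product

# Number of blocks per table (N_1, N_2, N_3) and the non-zero block-level
# cardinalities c_g from the FOJ-of-blocks example; c_g is 0 elsewhere.
BLOCKS_PER_TABLE = (3, 2, 2)
NONZERO = {(1, 1, 1): 2, (2, 1, 1): 2, (3, 2, 2): 2}


def c_g(g):
    """Cardinality of the join of one block combination g = (i_1, i_2, i_3)."""
    return NONZERO.get(g, 0)


def all_combinations():
    """All N = N_1 * N_2 * N_3 block combinations (1-based indices)."""
    return list(product(*(range(1, n_j + 1) for n_j in BLOCKS_PER_TABLE)))


def estimate(n, rng):
    """Estimate (4): draw n combinations uniformly, return (N / n) * sum of c_g."""
    big_n = len(all_combinations())
    sampled = [
        tuple(rng.randint(1, n_j) for n_j in BLOCKS_PER_TABLE) for _ in range(n)
    ]
    return big_n * sum(c_g(g) for g in sampled) / n


def exact_expectation():
    """E c(Q, 1) = sum over g of pi_g * (c_g / pi_g), with pi_g = 1 / N."""
    combos = all_combinations()
    pi = Fraction(1, len(combos))
    return sum(pi * (Fraction(c_g(g)) / pi) for g in combos)


if __name__ == "__main__":
    rng = random.Random(0)
    runs = [estimate(10, rng) for _ in range(2000)]
    print("exact expectation:", exact_expectation())
    print("mean of 2000 estimates with n = 10:", sum(runs) / len(runs))
```

A single estimate with n = 10 can be far from 6, because most sampled combinations have c_g = 0; only the average over many estimates settles near the true cardinality, which is what unbiasedness promises.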

Theoretical Assessment of the Accuracy of Cardinality Calculations

Confidence interval of the estimate $c(Q, n)$ (relative value):

$$\frac{|c(Q) - c(Q, n)|}{c(Q)} = \delta \le t_{n-1,\alpha} \sqrt{\left( \frac{N}{c^2(Q)} \sum_{g} c_g^2 - 1 \right) \frac{1}{n-1}} \quad (5)$$

$c(Q)$ is the true value of the cardinality; for $n > 121$ the coefficient $t_{n-1,\alpha}$ practically does not depend on $n$, and for $\alpha$ = 0.9, 0.95, 0.99 it is equal to 1.645, 1.960, 2.576.

To simplify the analysis of formula (5), we assume that the cardinality value $c(Q)$ is uniformly distributed over $K$ combinations (chains) $g = (i_1, \ldots, i_m)$, that is, $c_g = c(Q)/K$ and $|\{g\}| = K$. Then we get

$$\delta \le t_{n-1,\alpha} \sqrt{\left( \frac{N}{K} - 1 \right) \frac{1}{n-1}} \quad (6)$$

Conclusion. The larger $K$, that is, the number of combinations $g$ with non-empty block joins, the smaller the relative error $\delta$.
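To get a feel for bound (6), the sketch below evaluates its right-hand side, t · sqrt((N/K − 1)/(n − 1)), and inverts it to find the number of samples n needed for a target relative error. It is a hedged illustration with hypothetical function names. The coefficient t is passed in explicitly; the constants are the large-n values quoted above, so they are appropriate only when n > 121.

```python
import math

# Limiting values of the coefficient t quoted on the slide (valid for n > 121).
T_LARGE_N = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}


def relative_error_bound(big_n, k, n, t):
    """Right-hand side of (6): t * sqrt((N / K - 1) / (n - 1))."""
    if n < 2 or not 0 < k <= big_n:
        raise ValueError("need n >= 2 and 0 < K <= N")
    return t * math.sqrt((big_n / k - 1) / (n - 1))


def samples_needed(big_n, k, target, t):
    """Smallest n >= 2 for which bound (6) does not exceed the target error."""
    if target <= 0 or not 0 < k <= big_n:
        raise ValueError("need target > 0 and 0 < K <= N")
    return max(2, math.ceil(1 + (t / target) ** 2 * (big_n / k - 1)))


if __name__ == "__main__":
    t95 = T_LARGE_N[0.95]
    # Toy setting: N = 12 block combinations, K = 3 of them non-empty.
    print(relative_error_bound(12, 3, 200, t95))
    print(samples_needed(12, 3, 0.1, t95))
```

Because the bound scales as sqrt(N/K), halving the share of non-empty combinations roughly doubles the number of samples needed for the same relative error.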
Implementation of the EVACAR Method (Prototype)
Reading records from source tables:
a) at the query execution stage
b) at the plan building stage

[Diagram: a Windows host runs a virtual machine that contains a client (the EVACAR program with test query Q57) and a DB server (PostgreSQL 15 with the STATS test DB); the blocks of Qj are read through a cursor (libpq).]

Virtual machine (VM) with Ubuntu 18.04.5 OS, 1 Intel Core i5 CPU, 4 GB RAM, 20 GB disk. The EVACAR program is implemented in C (gcc compiler); the program size is 40 KB. Test dataset: STATS.

Test Description
Testing on the STATS dataset, used to analyze CardEst methods.
Complex properties:
• large number of attributes
• strongly skewed distributions
• high attribute correlation
• complex table join scheme

Specifically, query Q57 (the most representative):
• 6 tables are joined and search conditions are applied to them (we will call them subqueries)

Execution
• run on a VM in the PostgreSQL 15 environment: about 17 min
• ~7 min of pure virtual machine time
• query result: 17,849,233,970 records

SELECT COUNT(*)
FROM users as u, badges as b, postHistory as ph,
     votes as v, posts as p, postLinks as pl
WHERE p.Id = pl.RelatedPostId AND u.Id = p.OwnerUserId
  AND u.Id = b.UserId AND u.Id = ph.UserId AND u.Id = v.UserId
  AND p.CommentCount >= 0 AND p.CommentCount <= 13
  AND ph.PostHistoryTypeId = 5
  AND ph.CreationDate <= '2014-08-13 09:20:10'::timestamp
  AND v.CreationDate >= '2010-07-19 00:00:00'::timestamp
  AND b.Date <= '2014-09-09 10:24:35'::timestamp
  AND u.Views >= 0 AND u.DownVotes >= 0
  AND u.CreationDate >= '2010-08-04 16:59:53'::timestamp
  AND u.CreationDate <= '2014-07-22 15:15:22'::timestamp;

Comparison with BayesCard, DeepDB and FLAT Methods in Terms of Accuracy

EVACAR: the product of the numbers of blocks $N_j$ was $N = 10^5$; the number of samples $g$ was $n = 10$.

Sample size: 1 for BayesCard, DeepDB and FLAT; 50 for EVACAR (for each estimate $c(Q, n)$).

EVACAR is better (yellow) than:
• BayesCard for 13% of subplans,
• DeepDB for 38% of subplans,
• FLAT for 42% of subplans.

EVACAR is worse (green) in 1 case.

Comparison with BayesCard, DeepDB and FLAT Methods for Performance and Memory

BayesCard, DeepDB and FLAT:
Two different Linux servers. One, with 32 Intel(R) Xeon(R) Platinum 8163 processors clocked at 2.50 GHz, a Tesla V100 SXM2 GPU, and 64 GB RAM, was used to train the models. Another, with 64 Intel Xeon E5-2682 processors clocked at 2.50 GHz, was used for cardinality estimation on PostgreSQL.
EVACAR:
One virtual machine (VM) with Ubuntu 18.04.5, 1 Intel Core i5 CPU with a frequency of 1.6 GHz, and 4 GB RAM.

Space-time characteristics of the compared cardinality estimation methods

Characteristic | BayesCard | DeepDB | FLAT | EVACAR
Average cardinality estimation time per subplan request, ms | 5.8 | 87 | 175 | 2200/24 = 92; 90/24 = 3.8
Model size, MB | 5.9 | 162 | 310 | 13.1 (7.1) for n = 10
Training time, min | 1.8 | 108 | 262 | not required
Model update time when inserting 10^6 records into the database, s | 12 | 248 | 360 | not required

EVACAR: time was measured with gprof, memory with valgrind.
Advantages of the EVACAR Method
• Mathematical basis: the properties of the full outer join and the theory of sample-based estimation of sums.
• No need to train and retrain a model, unlike in methods based on machine learning.
• The join condition can be arbitrary (theta join); that is, it is not necessarily an equality of attributes.
• No problems with estimating the selectivity of the source tables, since records are joined after the subqueries are executed (slide 10).
• Database indexes are not accessed, so their presence is not required.
• No assumptions about the independence of attributes or the uniform distribution of records across domain values (unlike the classical approach).
• The accuracy of the cardinality estimate and the running time of the algorithm are controlled by the number of samples $n$ and the product of the numbers of blocks $N_j$.
• The blocks $Q_{j,i}$ are small, so their full outer join is computed quickly.

Advantages of the EVACAR Method (continued)
• The cardinality of each subplan is estimated from the sample drawn for the original query Q; that is, no additional costs are required.
• One tree of blocks and structures is built for the entire query, and its subtrees are then used to estimate the cardinality of any subplan. Therefore, subplans are evaluated quickly.

Advantages of the EVACAR Method (continued)
• The join graph of the query tables may contain cycles.
Example:
select * from A, B where A.a1 = B.b1 and A.a2 > (select avg(C.c3) from C where B.b2 = C.c1 and C.c2 = A.a3);
The join conditions (A.a1 = B.b1) - (B.b2 = C.c1) - (C.c2 = A.a3) form a cyclic graph A - B - C - A. In addition, an extra condition (>) is applied to the attribute A.a2.

[Diagram: structures with record numbers of child blocks (used to obtain chains iA, iB, iC); additional blocks with table attributes (a2, a3 for A; c2, c31 for C); additional filter conditions a3 = c2 and a2 > c31, where c31 = avg_{c1,c2}(c3).]

$$c(i) = (1_{A,i} = 1) \wedge (1_{B,i} = 1) \wedge (1_{C,i} = 1) \wedge 1_{a3 = c2 \text{ and } a2 > c31,\, i}$$

Disadvantages of the Method and Directions for Further Research

$$\delta \le t_{n-1,\alpha} \sqrt{\left( \frac{N}{K} - 1 \right) \frac{1}{n-1}} \quad \text{(see (6))}$$

$N = \prod_{j=1}^{m} N_j$ is the product of the numbers of blocks of tables $Q_j$,
$K$ is the number of combinations of blocks $g = (i_1, \ldots, i_m)$ with non-empty joins,
$n$ is the sample size of combinations $g$.
Drawback. The larger $N$ and the smaller $K$, the greater the error in cardinality estimation for a fixed $n$.
Solutions
1. Increasing the sample size $n$ through parallelization.
$$c(Q, n) = \frac{1}{n} \sum_{g} \frac{c_g}{\pi_g} = \frac{N}{n} \sum_{g} c_g \quad \text{(see (4))}$$
The calculation of the sum can be parallelized across several cores. If $n$ is increased from 10 to 100 (i.e., 10 cores are used), the 95% confidence interval of the q-error for query Q57 narrows from (−9.8, 4.4) (see subplot 24 on slide 12) to (−2.1, 1.6), i.e., almost 4 times.
2. The accuracy of the estimate $c(Q, n)$ depends significantly on the probability distribution $\{\pi_g\}$. If $\pi_g \approx c_g / c(Q)$, the error in the cardinality calculation will be minimal.
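Both remedies can be illustrated on toy numbers. The sketch below is not the authors' code, and its function names are hypothetical. It reuses the block-level cardinalities of the earlier three-table example (c_g = 2 for three of the 12 block combinations) to show that sampling with π_g = c_g / c(Q) makes every term c_g / π_g equal to c(Q), so estimate (4) has zero variance. It also recomputes the narrowing factor of the two confidence intervals quoted in item 1.

```python
import random
from fractions import Fraction

# Toy block-level cardinalities c_g (three non-empty combinations out of 12).
C_G = {(1, 1, 1): 2, (2, 1, 1): 2, (3, 2, 2): 2}
TRUE_CARDINALITY = sum(C_G.values())  # c(Q) = 6


def ideal_probabilities():
    """pi_g = c_g / c(Q); combinations with c_g = 0 are never sampled."""
    return {g: Fraction(c, TRUE_CARDINALITY) for g, c in C_G.items()}


def importance_estimate(n, rng):
    """Estimate (4) with non-uniform pi_g: (1 / n) * sum of c_g / pi_g."""
    pi = ideal_probabilities()
    combos = list(pi)
    weights = [float(pi[g]) for g in combos]
    sampled = rng.choices(combos, weights=weights, k=n)
    return sum(Fraction(C_G[g]) / pi[g] for g in sampled) / n


def narrowing_factor(before, after):
    """Ratio of the widths of two confidence intervals given as (low, high)."""
    return (before[1] - before[0]) / (after[1] - after[0])


if __name__ == "__main__":
    rng = random.Random(1)
    print("estimate with n = 1: ", importance_estimate(1, rng))
    print("estimate with n = 10:", importance_estimate(10, rng))
    print("narrowing factor:", narrowing_factor((-9.8, 4.4), (-2.1, 1.6)))
```

The ideal distribution requires c(Q), which is the unknown quantity, so in practice π_g can only approximate it.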
Thank you!
