Calculating Selectivity: Whoami?
Jonathan Lewis
jonathanlewis.wordpress.com
www.jlcomp.demon.co.uk
Who am I ?
Independent Consultant
31+ years in IT
26+ using Oracle
Strategy, Design, Review,
Briefings, Educational,
Trouble-shooting
Oracle author of the year 2006
Select Editor’s choice 2007
UKOUG Inspiring Presenter 2011
ODTUG 2012 Best Presenter (d/b)
UKOUG Inspiring Presenter 2012
UKOUG Lifetime Award (IPA) 2013
Member of the Oak Table Network
Oracle ACE Director
O1 visa for USA
Warning (b)
create index t1_i1 on t1(mod_200,mod_10000);
select * from t1 where mod_200 = 100 and mod_10000 = 100;
Solutions
If the optimizer has used a guess (typically 1%, 5%, or 0.25%) that is badly
inappropriate, then setting optimizer_dynamic_sampling to level 3 may help;
in some cases it may be better to use level 4. Generally it is better to use a
cursor- or table-level hint to force sampling.
http://jonathanlewis.wordpress.com/?s=optimizer_dynamic_sampling
11g lets you declare (and collect stats on) “virtual columns” – which may solve
many of the selectivity problems caused by predicates on function(col). In harder
cases you may find that "extended stats" (in particular, "column groups") will
help, but you are limited to 20 sets of extended stats per table. There may be
side effects with (e.g.) IOTs, replication, dbms_redefinition, et al.
1 - filter("RAND_300"<>150)
column != {constant}, selectivity = (1 - user_tab_cols.density)
1 - filter(SIGN("MOD_10000")=1)
function(column) = {constant}, selectivity = 1%
1 - filter(TRUNC(INTERNAL_FUNCTION("DATE_1000"))<>TO_DATE('
2015-12-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
function(column) != {constant}, selectivity = 5%
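The three rules above can be checked with a few lines of arithmetic. This is an illustrative sketch only: the table size of 1,000,000 rows is an assumption (chosen because it matches the 950K estimate in the execution plan shown later), and rand_300 is taken to have 300 distinct values so that its density is 1/300.

```python
# Illustrative arithmetic only -- num_rows = 1,000,000 is an assumed
# table size, and density = 1/300 is assumed for rand_300.
num_rows = 1_000_000

ne_const = round(num_rows * (1 - 1/300))   # column != const: 1 - density
fn_eq    = round(num_rows * 0.01)          # function(col) =  const: 1% guess
fn_ne    = round(num_rows * (1 - 0.05))    # function(col) != const: 1 - 5% guess

print(ne_const, fn_eq, fn_ne)   # 996667 10000 950000
```

Note how the 5% guess for "function(col) != const" produces the 950K row estimate that appears in one of the plans below.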
Virtual Columns
alter table t1
add (trunc_date generated always as (trunc(date_1000)) virtual)
;
begin
dbms_stats.gather_table_stats(
user,'t1',method_opt=>'for columns trunc_date size 1'
);
end;
select
statement_id, cardinality
from plan_table
where operation = 'TABLE ACCESS'
order by
plan_id
;
STATEMENT_ID   CARDINALITY   Predicates   Formula
GTLT                 60006   >, <         (1800-1200)/(9999-0)
GELT                 60106   >=, <        + 1/num_distinct
GELE                 60206   >=, <=       + 1/num_distinct
Our intuition is nearly correct - but it's not quite how Oracle reaches the answer.
The optimizer assumes the data is continuous, and evenly spread between the
low_value and the high_value. As a rough approximation you don't often have to
worry much about whether the range is open or closed.
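The three cardinalities can be reproduced with a short calculation. The assumptions here (not stated explicitly on the slide, but consistent with the output): the table has 1,000,000 rows and mod_10000 holds the values 0 to 9,999, so low_value = 0, high_value = 9999, and num_distinct = 10,000; the range predicates run from 1,200 to 1,800, and each closed end of the range adds 1/num_distinct to the selectivity.

```python
num_rows     = 1_000_000   # assumed table size
low, high    = 0, 9999     # assumed low/high values of mod_10000
num_distinct = 10_000

base = (1800 - 1200) / (high - low)    # open range:  >, <

gtlt = round(num_rows * base)                       # no closed ends
gelt = round(num_rows * (base + 1/num_distinct))    # one closed end  (>=)
gele = round(num_rows * (base + 2/num_distinct))    # both ends closed (>=, <=)

print(gtlt, gelt, gele)   # 60006 60106 60206
```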
select * from t1
where mod_10000 > (select 9100 from dual);

select * from t1
where mod_10000 between (select 9100 from dual)
                    and (select 9101 from dual);

These examples get special treatment in 12c (& 11.2.0.4).
But with indexes and unknown values the selectivities are 0.9% and 0.45% !!
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 157 | 8792 | 1138 (8)| 00:00:06 |
|* 1 | TABLE ACCESS FULL| T1 | 157 | 8792 | 1138 (8)| 00:00:06 |
The selectivity of "like 'X%'" seems to be derived from ">= 'X' and < {first value that is too large}".
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 950K| 50M| 1151 (9)| 00:00:06 |
|* 1 | TABLE ACCESS FULL| T1 | 950K| 50M| 1151 (9)| 00:00:06 |
begin
dbms_stats.gather_table_stats(
user,'t1',
method_opt=>'for columns alpha_20 size 10'
);
end;
select
endpoint_number E_no,
endpoint_value E_val,
to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') hex_value
from
user_tab_histograms
where
table_name = 'T1'
and column_name = 'ALPHA_20'
/
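For character columns the endpoint_value is a number built from the leading bytes of the string, left-justified in 15 bytes (only the first six bytes or so are precise, because the value is held as a floating-point number). A hypothetical Python helper, sketching how the hex value shown by the query above maps back to leading characters:

```python
def decode_endpoint(e_val: int) -> str:
    """Recover the leading characters encoded in a character-column
    endpoint_value (a hypothetical helper for reading the hex output)."""
    raw = bytes.fromhex(f"{e_val:030x}")     # 15 bytes, left-justified
    return raw.rstrip(b"\x00").decode("latin-1")

# 'A' (0x41) followed by fourteen zero bytes:
print(decode_endpoint(int("41" + "00" * 14, 16)))   # A
```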
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 5000 | 268K| 1113 (6)| 00:00:06 |
|* 1 | TABLE ACCESS FULL| T1 | 5000 | 268K| 1113 (6)| 00:00:06 |
1 - filter("MOD_200"=250)
Value outside known range - estimate uses "linear decay"
1 - filter("MOD_200"=350)
Value further outside known range - increases "linear decay" effect
1 - filter("MOD_200">=350)
Outside known range, range-based predicate treated the same as equality
[Graph: estimated cardinality vs. requested value for mod_200 (low value 0, high value 199). Outside the known range the estimate decays linearly, reaching zero one range-width beyond the limits (i.e. at -199 and 398): roughly 3,719 rows at value 250 and 1,206 rows at value 350.]
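The linear-decay arithmetic can be reproduced in a few lines. Assumptions, consistent with the estimates on the slide: 1,000,000 rows, mod_200 holding the values 0 to 199 (so an in-range equality is estimated at 5,000 rows), and an estimate that falls linearly to zero one range-width beyond the known high value.

```python
num_rows     = 1_000_000   # assumed table size
num_distinct = 200         # mod_200: assumed values 0 .. 199
low, high    = 0, 199

base = num_rows / num_distinct    # 5,000 rows for an in-range equality

def decayed(value):
    # linear decay: the estimate reaches zero one range-width past the limit
    return round(base * (1 - (value - high) / (high - low)))

print(decayed(250), decayed(350))   # 3719 1206
```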
Multiple predicates - or
select
*
from t1
where mod_200 = 100 -- 1/200, 5000
or rand_300 = 150 -- 1/300, 3333
; -- sum = 8,333
If we simply take the sum we have double counted some rows (the overlap)
Selectivity(p1 OR p2) = selectivity(p1) + selectivity(p2) - selectivity(p1 AND p2)
                      = selectivity(p1) + selectivity(p2) - selectivity(p1) * selectivity(p2)
1/200 + 1/300 - 1/60000 = 0.00831666…
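The OR arithmetic, checked numerically (again assuming a table of 1,000,000 rows, which is what gives the 5,000 and 3,333 row counts in the comments above):

```python
num_rows = 1_000_000   # assumed table size

sel_p1 = 1 / 200       # mod_200  = 100
sel_p2 = 1 / 300       # rand_300 = 150

# subtract the overlap once, assuming the predicates are independent
sel_or = sel_p1 + sel_p2 - sel_p1 * sel_p2

print(round(num_rows * sel_or))   # 8317 (vs 8,333 for the naive sum)
```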
Query: "List parcels collected in the last 24 hours but not yet delivered"