Database Modeling - notes-VI
Assume we have a file of 10,000,000 records of mail order customers for a large
commercial business. Customer records have attributes for customer name,
customer number, street address, city, state, zip code, phone number, employer, job
title, credit rating, date of last purchase, and total amount of purchases. Assume that
the record size is 250 bytes; block size is 5000 bytes (bf=20); and pointer size,
including record offset, is 5 bytes (bfac=1000). The query to be analyzed is “Find
all customers whose job title is ‘engineer’, city is ‘chicago’, and total amount of
purchases is greater than $1,000.” For each AND condition we have the following
hit rates, that is, records that satisfy each condition:
job title is 'engineer': 84,000 records
city is 'chicago': 210,000 records
total amount of purchases > $1,000: 350,000 records
total number of target records that satisfy all three conditions = 750
query cost (inverted file)
= merge of 3 accession lists + access 750 target records
= (84,000/1000 + 210,000/1000 + 350,000/1000) sba + 750 rba
= 644 sba + 750 rba
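As a quick check, a minimal Python sketch of this cost model, assuming (as above) bfac = 1000 pointer entries per accession-list block and one random block access per target record:

from math import ceil

# Inverted-file query cost for the mail-order example above.
bfac = 1000                        # pointer/offset pairs per accession-list block (given)
hits = [84_000, 210_000, 350_000]  # records satisfying each AND condition

merge_sba = sum(ceil(n / bfac) for n in hits)  # 84 + 210 + 350 = 644 sequential accesses
target_rba = 750                   # one random access per qualifying record

print(merge_sba, "sba +", target_rba, "rba")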
Secondary Indexes using B+-trees
* used by Oracle and many others for non-unique indexes
* index nodes contain key/pointer pairs in the same way as a primary key
index using a B+-tree
* key at each level, leaf and non-leaf, is the concatenation of attributes used
in the query, e.g. jobtitle, city, total_purchases (as attributes of customer)
* leaf node pointers are to the blocks containing records with the given
combination of attribute values indicated in the concatenated keys
* analysis of queries and updates for this type of index proceeds in the same way
as a primary key (unique) index, keeping in mind that the key formats are
different in the two cases
query iotime (B+tree secondary index) = rba*Trba
= [h + ceil(t/bfac) – 1 + t] * Trba
where h is the height of the B+tree index, bfac is the blocking factor for the
accession list (i.e. the number of pointer/key pairs in the leaf nodes in the B+tree),
and t is the number of target records in the table that satisfy all the conditions in
the query.
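A small sketch of this formula follows; the values of h, t, and Trba here are illustrative assumptions, not values fixed by the notes (Trba = 40 ms matches the random-access time used in the join examples later):

from math import ceil

def secondary_index_rba(h, t, bfac):
    # h index levels + extra accession-list blocks + one access per target record
    return h + ceil(t / bfac) - 1 + t

h, t, bfac = 3, 750, 1000   # assumed values for illustration
Trba = 0.040                # 40 ms per random block access
print(secondary_index_rba(h, t, bfac) * Trba, "seconds")  # 753 rba = 30.12 s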
Denormalization
* motivation – poor performance of normalized databases
The table review is associated with the tables employee and manager. An
extension of the review table, review-ext, reduces the number of joins
required by queries that retrieve employee data along with each review (see
the sketch below). This extension results in a real denormalization, that is,
review_no -> emp_id -> emp_name, emp_address
with the side effects of add and update anomalies. However, the delete anomaly
cannot occur, because the original tables retain the data that is duplicated in
the extended schema.
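A minimal sqlite3 sketch of the normalized and extended schemas; the column lists are assumptions (only review_no, emp_id, emp_name, and emp_address appear in the notes), the manager table is omitted, and review-ext is rendered review_ext for SQL:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employee   (emp_id INTEGER PRIMARY KEY, emp_name TEXT, emp_address TEXT);
    CREATE TABLE review     (review_no INTEGER PRIMARY KEY,
                             emp_id INTEGER REFERENCES employee);
    -- The extension copies employee data into each review row, re-introducing
    -- the transitive dependency review_no -> emp_id -> emp_name, emp_address.
    CREATE TABLE review_ext (review_no INTEGER PRIMARY KEY, emp_id INTEGER,
                             emp_name TEXT, emp_address TEXT);
""")

# The normalized form needs a join; the extension answers the same query with
# a single-table scan, at the price of add/update anomalies on the copies.
normalized   = ("SELECT r.review_no, e.emp_name, e.emp_address "
                "FROM review r JOIN employee e ON r.emp_id = e.emp_id")
denormalized = "SELECT review_no, emp_name, emp_address FROM review_ext"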
Table Denormalization Algorithm
1. Select the dominant processes based on such criteria as high frequency of
execution, high volume of data accessed, response time constraints, or explicit
high priority.
4. Consider also the possibility of denormalization due to a join table and its side
effects. If a join table schema appears to have lower storage and processing cost
and insignificant side effects, then consider using that schema for physical design
in addition to the original candidate table schema. Otherwise use only the original
schema.
Join Strategies
1. nested loop: complexity O(mn)
2. merge-join: complexity O(n log2 n)
3. indexed join: complexity O(n)
4. hash-join: complexity O(n)
where m and n are the rows of the two tables to be joined
Assume
* the assigned_to table has 50,000 rows
* the project table has 250 rows
* the blocking factors for the assigned_to and project tables are 100 and 50,
respectively, and the block size is equal for the two tables
* the common join column is project_name
High Selectivity Joins
select p.project_name, p.project_leader, a.emp_id
from project as p, assigned_to as a
where p.project_name = a.project_name;
Nested Loop Case 1: assigned_to is the outer loop table.
join cost = scan assigned_to once + scan project once per assigned_to row
= 50,000/100 + 50,000 * 250/50
= 250,500 sequential block accesses
If a sequential block access requires an average of 10 ms, the total time
required is 2505 seconds.
Nested Loop Case 2: project is the outer loop table.
join cost = 250/50 + 250 * 50,000/100
= 125,005 sequential block accesses (or 1250.05 seconds)
Note that the nested loop strategy does not take advantage of row order for
these tables; both cases are worked in the sketch below.
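A sketch of the nested loop arithmetic for both cases; Case 2's figure is not given in the original notes and is derived here from the same cost model, at 10 ms per sequential block access:

Tsba = 0.010               # 10 ms per sequential block access
a_blocks = 50_000 // 100   # assigned_to: 500 blocks
p_blocks = 250 // 50       # project: 5 blocks

# Case 1: assigned_to outer; scan it once, re-scan project per outer row.
case1 = a_blocks + 50_000 * p_blocks   # 250,500 sba
# Case 2: project outer; scan it once, re-scan assigned_to per outer row.
case2 = p_blocks + 250 * a_blocks      # 125,005 sba

print(case1 * Tsba, "s")   # 2505.0
print(case2 * Tsba, "s")   # 1250.05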
Merge-Join Case 1: Both project and assigned_to are already ordered by
project_name.
join cost = merge time (to scan both tables)
= 50,000/100 + 250/50
= 505 sequential block accesses (or 5.05 seconds)
Merge-Join Case 2: Only project is ordered by project_name.
join cost = sort time for assigned_to + merge time (to scan both sorted tables)
= (50,000*log2 50,000)/100 + 50,000/100 + 250/50
= (50,000*16)/100 + 500 + 5
= 8505 sequential block accesses (or 85.05 seconds)
Merge-Join Case 3: Neither project nor assigned_to is ordered by
project_name.
join cost = sort time for both tables + merge time for both tables
= (50,000*log2 50,000)/100 +(250*log2 250)/50 + 50,000/100
+ 250/50
= 8000 + 40 + 500 + 5
= 8545 sequential block accesses (or 85.45 seconds)
We see that the sort phase of the merge-join strategy is the costliest component, but it still
significantly improves performance compared to the nested loop strategy.
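The three merge-join cases consolidated into one sketch; the sort cost is modeled as rows * log2(rows) / bf, with log2 rounded up to a whole number as in the arithmetic above:

from math import ceil, log2

Tsba = 0.010
rows = {"assigned_to": 50_000, "project": 250}
bf   = {"assigned_to": 100,    "project": 50}

scan = lambda t: rows[t] // bf[t]                        # blocks to scan a table
sort = lambda t: rows[t] * ceil(log2(rows[t])) // bf[t]  # block accesses to sort it

merge = scan("assigned_to") + scan("project")            # 505 sba
cases = [merge,                                          # Case 1: both pre-sorted
         sort("assigned_to") + merge,                    # Case 2: 8505 sba
         sort("assigned_to") + sort("project") + merge]  # Case 3: 8545 sba
for c in cases:
    print(c, "sba =", c * Tsba, "s")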
Low Selectivity Joins
Let ntr = 100 qualifying rows for the foreign key table (assigned_to) and let
ntr = 1 row for the primary key table (project) in the example below. Assume
h = 2 for the unique index to project, Tsba = 10 ms, and Trba = 40 ms.
select p.project_name, p.project_leader, a.emp_id
from project as p, assigned_to as a
where p.project_name = a.project_name
and p.project_name = 'financial analysis';
Indexed join Case 1: Scan the foreign key table once and index to the primary key
table.
join cost = scan the entire foreign key table (assigned_to)
+ index to the primary key table (project) qualifying row
= 50,000/100 sba + (h+1) rba
= 500 sba + 3 rba (or 5.12 seconds)
For the next case, assume the nonunique index height, hn = 3, index blocking factor
bfac = 500, with ntr = 100 target foreign key rows as given above.
Indexed join Case 2: Index to both the primary key table and the foreign key
table.
Join cost = index to the primary key table + index to the foreign key table
= (h+1) rba + [hn + ceil(ntr/bfac) – 1 + ntr] rba
= 3 rba + [3 + 0 + 100] rba
= 106 rba (or 4.24 seconds)
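Both indexed join cases in one sketch, with all parameter values taken from the assumptions above:

from math import ceil

Tsba, Trba = 0.010, 0.040
h, hn, bfac, ntr = 2, 3, 500, 100   # index heights, leaf blocking factor, target rows

# Case 1: scan the foreign key table, then index into project's one row.
case1 = (50_000 // 100) * Tsba + (h + 1) * Trba
# Case 2: index into both tables.
case2 = ((h + 1) + (hn + ceil(ntr / bfac) - 1 + ntr)) * Trba

print(round(case1, 2), "s")   # 5.12
print(round(case2, 2), "s")   # 4.24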
Indexed join Case 3: Nonunique indexes required for both tables due to join on
two nonkeys.
Join cost = index to the first table + index to the second table
= [h1 + ceil(ntr1/bfac1) – 1 + ntr1] rba
+ [h2 + ceil(ntr2/bfac2) – 1 + ntr2] rba
Hash join Case 1:
join cost = scan first table (assigned_to) + scan second table (project)
+ access qualifying rows in the two tables
= 50,000/100 sba + 250/50 sba + 100 rba + 1 rba
= 505 sba + 101 rba (or 9.09 seconds)
In the hash join strategy, the table scans may need to be done only
infrequently, as long as the hash file in RAM remains intact for a series of
queries. In Case 1 above, the incremental cost for the given query is then only
101 rba, or 4.04 seconds.
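The hash join arithmetic as a sketch, including the incremental cost when the in-RAM hash table is reused across queries:

Tsba, Trba = 0.010, 0.040
scan_sba    = 50_000 // 100 + 250 // 50   # build + probe scans: 505 sba
qualify_rba = 100 + 1                     # qualifying rows in the two tables

print(round(scan_sba * Tsba + qualify_rba * Trba, 2), "s")  # 9.09 total
print(round(qualify_rba * Trba, 2), "s")                    # 4.04 if scans are amortized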
VI. Database Distribution Strategies
Overview of Distributed Databases
Distributed database - a collection of multiple, logically interrelated databases distributed over a
computer network [OzVa91].
Distributed Database Management System (DDBMS) - a software system that permits the
management of a distributed database and makes the distribution transparent to the users. If
heterogeneous, it may allow transparent simultaneous access to data on multiple dissimilar systems.
Advantages
1. Improves performance, e.g. it saves communication costs and reduces query delays by providing
data at the sites where it is most frequently accessed.
2. Improves the reliability and availability of a system by providing alternate sites from which
the information can be accessed.
3. Increases the capacity of a system by increasing the number of sites where the data can be located.
4. Allows users to exercise control over their own data while allowing others to share some of the
data from other sites.
Disadvantages
3. Makes performance evaluation difficult because a process running at one node may impact the
entire network.