Data Storage and Access Methods: Min Song IS698
[Figure: database design models. Labels: Application 1 to Application 4, each with its conceptual requirements and an External Model; Conceptual Model; Logical Model; Internal Model.]
Physical Design
Storage Format
Choosing the storage format of each field (attribute).
The DBMS provides a set of data types that can be used for the physical storage of fields in the database.
The data type (format) is chosen to minimize storage space and maximize data integrity.
Integer: stores numbers from -32,768 to 32,767 (no fractions). Storage size: 2 bytes.
Long Integer (default): stores numbers from -2,147,483,648 to 2,147,483,647 (no fractions). Storage size: 4 bytes.
Single: stores numbers from -3.402823E38 to -1.401298E-45 for negative values and from 1.401298E-45 to 3.402823E38 for positive values. Storage size: 4 bytes.
Double: stores numbers from -1.79769313486231E308 to -4.94065645841247E-324 for negative values and from 4.94065645841247E-324 to 1.79769313486231E308 for positive values. Decimal precision: 15. Storage size: 8 bytes.
Replication ID: globally unique identifier (GUID). Decimal precision: N/A. Storage size: 16 bytes.
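The storage sizes and integer ranges listed above can be checked with a short sketch. The mapping from field-size names to Python `struct` format codes is an assumption made for illustration; it is not taken from any DBMS.

```python
import struct
import uuid

# Standard-size struct codes ("<" = little-endian, no padding).
# Mapping these codes to the field sizes above is an illustrative assumption.
FIELD_FORMATS = {
    "Integer": "<h",       # 2-byte signed integer
    "Long Integer": "<i",  # 4-byte signed integer
    "Single": "<f",        # 4-byte IEEE 754 float
    "Double": "<d",        # 8-byte IEEE 754 float
}

def storage_size(field_type):
    """Return the number of bytes used by one value of the given type."""
    return struct.calcsize(FIELD_FORMATS[field_type])

def integer_range(num_bytes):
    """Return (min, max) for a signed two's-complement integer."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

sizes = {name: storage_size(name) for name in FIELD_FORMATS}
guid_size = len(uuid.uuid4().bytes)  # a GUID occupies 16 bytes
```

Choosing Integer instead of Long Integer for a field whose values fit in 2 bytes halves the space that field takes in every record.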
Data Storage
Storing data: disks
Buffer manager
Representing relational data on disk
Main Memory
Fastest, most expensive (excluding cache).
Today: 512 MB is common even on PCs.
Many databases could fit in memory.
New industry trend: main memory databases, e.g., TimesTen.
Secondary Storage
Disks.
Slower and cheaper than main memory.
Persistent!
The unit of disk I/O is the block.
Typically 1 block = 4 KB.
A disk block is also called a disk page, or simply a page.
Block
The blocking factor (bfr) for a file is the average number of records stored in a disk block. Suppose the block size of a database system is 2000 bytes. The Customer table has an average record length of 190 bytes. Assume each block carries 100 bytes of overhead.
What is the blocking factor?
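One way to work the question out, assuming records are not split across blocks and the 100-byte overhead applies to every block. The function names and the 1,000-record file are hypothetical, chosen only for illustration.

```python
import math

def blocking_factor(block_size, record_length, overhead=0):
    """Whole records that fit in the usable part of one block."""
    usable = block_size - overhead
    return usable // record_length

def blocks_needed(num_records, bfr):
    """Blocks required to hold num_records at the given blocking factor."""
    return math.ceil(num_records / bfr)

# Figures from the slide: 2000-byte block, 190-byte records, 100-byte overhead.
bfr = blocking_factor(block_size=2000, record_length=190, overhead=100)

# Hypothetical file of 1,000 Customer records.
customer_blocks = blocks_needed(1000, bfr)
```

With these numbers, (2000 - 100) / 190 = 10, so the blocking factor is 10 records per block.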
[Figure: components of a disk. Labels: spindle, tracks, sector, platters, arm assembly, arm movement.]
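[Figure: fixed-length record whose fields have lengths L1, L2, L3, L4, starting at base address B. Address of the third field = B + L1 + L2.]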
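Information about field types is the same for all records in a file; it is stored in the system catalogs.
Finding the i-th field does not require a scan of the record: with fixed-length fields, its address is the base address plus the lengths of the preceding fields.
Note the importance of schema information!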
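The formula Address = B + L1 + L2 generalizes to any field of a fixed-length record. A minimal sketch, assuming the field lengths come from the catalog; the lengths and base address used here are hypothetical.

```python
from itertools import accumulate

def field_offsets(field_lengths):
    """Offset of each field = sum of the lengths of the fields before it."""
    return [0] + list(accumulate(field_lengths))[:-1]

def field_address(base, field_lengths, i):
    """Address of the i-th field (0-based) of a record starting at base."""
    return base + field_offsets(field_lengths)[i]

lengths = [4, 20, 8, 2]  # hypothetical L1..L4 in bytes
base = 1000              # hypothetical base address B

# Third field (index 2): B + L1 + L2
addr_f3 = field_address(base, lengths, 2)
```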
Record Header
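[Figure: record with a header followed by fields F1 to F4 of lengths L1 to L4. The header holds a pointer to the schema, the record length, and a timestamp.]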
Need the header because:
The schema may change; for a while, new and old records may coexist.
Records from different relations may coexist.
[Figure: variable-length record with a header (including the record length) followed by fields F1 to F4 of lengths L1 to L4.]
Place the fixed-length fields first: F1, F2.
Then the variable-length fields: F3, F4.
Null values take 2 bytes only.
Sometimes they take 0 bytes (when at the end).
[Figure: the same record with a header (including the record length) and fields F1 to F3 only.]
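The layout rules above (fixed-length fields first, variable-length fields after, 2 bytes for a null, nothing for a trailing null) can be sketched as follows. This is only one possible encoding: the header is reduced to a record length, and the `NULL_MARK` value, the 4-byte integer fields, and the 2-byte length prefixes are all assumptions.

```python
import struct

NULL_MARK = 0xFFFF  # hypothetical 2-byte marker meaning "this field is NULL"

def encode_record(f1, f2, var_fields):
    """Fixed fields F1, F2 first, then variable-length string fields."""
    fields = list(var_fields)
    while fields and fields[-1] is None:
        fields.pop()                       # trailing NULLs take 0 bytes
    body = struct.pack("<ii", f1, f2)      # two 4-byte fixed-length fields
    for value in fields:
        if value is None:
            body += struct.pack("<H", NULL_MARK)         # NULL: 2 bytes only
        else:
            data = value.encode("utf-8")
            body += struct.pack("<H", len(data)) + data  # length, then bytes
    header = struct.pack("<H", 2 + len(body))            # total record length
    return header + body

with_trailing_null = encode_record(1, 2, ["ab", None])
with_inner_null = encode_record(1, 2, [None, "ab"])
```

In `with_trailing_null` the null costs nothing because it is the last field; in `with_inner_null` it costs the 2-byte marker.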
[Figure: records R1, R2, R3 laid out in blocks; R2 appears in two blocks.]
Splitting a record across blocks is used when records are very large. Even for medium-size records it saves space in blocks.
BLOB
Binary large objects.
Supported by modern database systems.
E.g., images, sounds, etc.
Storage: attempt to cluster blocks together.
Modifications: Insertion
File is unsorted: add it to the end.
File is sorted: is there space in the right block?
Yes: we are lucky, store it there.
Overflow Blocks
[Figure: Block n-1, Block n, Block n+1 in sequence, with an overflow block.]
After a while the file starts being dominated by overflow blocks: time to reorganize
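A rough sketch of insertion into a sorted file, assuming a full block spills into its own overflow chain. The block capacity, class, and function names are hypothetical, and every block is assumed to start non-empty.

```python
import bisect

BLOCK_CAPACITY = 3  # hypothetical number of records per block

class Block:
    def __init__(self, keys=()):
        self.keys = sorted(keys)  # records in the primary block, kept sorted
        self.overflow = []        # records pushed to this block's overflow chain

def insert_sorted(blocks, key):
    """Insert key into a sorted file made of non-empty blocks."""
    # The "right" block is the first whose largest key is >= key,
    # or the last block if key is larger than everything stored.
    highs = [b.keys[-1] for b in blocks]
    idx = min(bisect.bisect_left(highs, key), len(blocks) - 1)
    block = blocks[idx]
    if len(block.keys) < BLOCK_CAPACITY:
        bisect.insort(block.keys, key)  # there is space: store it there
    else:
        block.overflow.append(key)      # block is full: use an overflow block

def overflow_ratio(blocks):
    """Share of records living in overflow blocks; a cue to reorganize."""
    total = sum(len(b.keys) + len(b.overflow) for b in blocks)
    spilled = sum(len(b.overflow) for b in blocks)
    return spilled / total if total else 0.0

file_blocks = [Block([10, 20]), Block([30, 40, 50])]
for k in (15, 35, 60):
    insert_sorted(file_blocks, k)
```

A reorganization policy could trigger once `overflow_ratio` passes some threshold; the threshold itself would be a tuning choice.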
Modifications: Deletions
Free the space in the block and shift records.
It may be possible to eliminate an overflow block.
Modifications: Updates
If the new record is shorter than the previous one, the update is easy.
If it is longer, records need to be shifted and overflow blocks may have to be created.
Physical Addresses
Each block and each record has a physical address that consists of:
The host
The disk
The cylinder number
The track number
The block within the track
For records: an offset in the block (sometimes this is stored in the block's header)
Logical Addresses
Logical address: a string of bytes (10-16).
More flexible: blocks and records can be moved around.
But a translation table is needed:
Logical address    Physical address
L1                 P1
L2                 P2
L3                 P3
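A translation table can be pictured as a map from logical to physical addresses. In this sketch the `PhysicalAddress` fields follow the components listed under Physical Addresses, but every concrete value and function name is a hypothetical placeholder.

```python
from collections import namedtuple

# Field names follow the components of a physical address listed earlier.
PhysicalAddress = namedtuple(
    "PhysicalAddress", "host disk cylinder track block offset"
)

# Hypothetical contents: logical addresses L1..L3 map to P1..P3.
translation_table = {
    "L1": PhysicalAddress("hostA", 0, 12, 3, 7, 0),
    "L2": PhysicalAddress("hostA", 0, 12, 3, 7, 190),
    "L3": PhysicalAddress("hostA", 1, 40, 1, 2, 0),
}

def resolve(logical):
    """Translate a logical address into the current physical address."""
    return translation_table[logical]

def move_record(logical, new_physical):
    """Relocate a record: only the table entry changes, not the references."""
    translation_table[logical] = new_physical

before = resolve("L2")
move_record("L2", before._replace(block=9, offset=0))
after = resolve("L2")
```

Anything that stores "L2" keeps working after the move, which is the flexibility the slide refers to; the cost is one extra lookup per access.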
Physical Design
[Figure: path of a user request through the DBMS. Labels: User request, Interface 1, External Model, Internal Model/Physical Model, Interface 3, Data Base.]
Physical Design
Interface 1: User request to the DBMS. The user presents a query, and the DBMS determines which physical databases are needed to resolve the query.
Interface 2: The DBMS uses an internal model access method to access the data stored in a logical database.
Interface 3: The internal model access methods and OS access methods access the physical records of the database.
Access methods differ in:
Access efficiency
Storage efficiency
Physical Sequential
Key values of the physical records are in logical sequence.
Main use is for dump and restore.
Access method may be used for storage as well as retrieval.
Storage efficiency is near 100%.
Access efficiency is poor (unless fixed-size physical records).
Indexed Sequential
Key values of the physical records are in logical sequence.
Access method may be used for storage and retrieval.
An index of key values is maintained, with entries for the highest key value per block (or group of blocks).
Access efficiency depends on the levels of index, the storage allocated for the index, the number of database records, and the amount of overflow.
Storage efficiency depends on the size of the index and the volatility of the database.
Index Sequential
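[Figure: indexed sequential file.
Data blocks hold keys in ascending order: 001, 003, ..., 150; 251, ..., 385; 455, 480, ..., 536; 605, 610, ..., 678; 705, 710, ..., 785; 791, ..., 805.
Index blocks pair a key value with a block address: key 536 at address 3, key 678 at address 4, key 785 at address 5, key 805 at address 6. Addresses 1 and 2, and 7, 8, 9, also appear in the figure.
Example records: Adams, Becker, Dumpling (one block); Getta, Harty (Block 2); Block 3.]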
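A lookup in an indexed sequential file can be sketched with a one-level index of highest keys per block. The key values are the ones visible in the figure; pairing keys 150 and 385 with addresses 1 and 2 is an assumption, and the elided keys in each block are left out.

```python
import bisect

# Highest key value per block and the block's address.
index_keys = [150, 385, 536, 678, 785, 805]
index_addrs = [1, 2, 3, 4, 5, 6]  # first two pairings are assumed

# Only the keys legible in the figure; elided keys are omitted.
data_blocks = {
    1: [1, 3, 150],
    2: [251, 385],
    3: [455, 480, 536],
    4: [605, 610, 678],
    5: [705, 710, 785],
    6: [791, 805],
}

def find_block(key):
    """Address of the only block that could hold key, or None."""
    pos = bisect.bisect_left(index_keys, key)  # first highest-key >= key
    if pos == len(index_keys):
        return None                            # larger than every key on file
    return index_addrs[pos]

def lookup(key):
    """True if key is stored; reads the index, then a single data block."""
    addr = find_block(key)
    return addr is not None and key in data_blocks[addr]
```

Each extra index level adds one more search of this kind before the data block is read, which is why access efficiency depends on the number of index levels.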
Indexed Random
Key values of the physical records are not necessarily in logical sequence.
The index may be stored and accessed with the Indexed Sequential Access Method.
The index has an entry for every database record. The entries are in ascending order: the index keys are in logical sequence.
Database records are not necessarily in ascending sequence.
Access method may be used for storage and retrieval.
Indexed Random
[Figure: indexed random file.
Index (actual value, block number): Adams 2; Becker 1; Dumpling 3; Getta 2; Harty.
Data blocks: Becker, Harty; Adams, Getta; Dumpling.]
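The indexed random organization amounts to a sorted index with one entry per record, pointing at blocks whose records are in no particular order. The sketch uses the names and block numbers from the figure; the entry for Harty (block 1) is an assumption based on the block contents.

```python
import bisect

# Index entries in ascending key order: (actual value, block number).
index = [
    ("Adams", 2),
    ("Becker", 1),
    ("Dumpling", 3),
    ("Getta", 2),
    ("Harty", 1),  # assumed: the figure shows Harty alongside Becker
]
index_keys = [name for name, _ in index]

# Data blocks: records are not stored in key order.
blocks = {1: ["Becker", "Harty"], 2: ["Adams", "Getta"], 3: ["Dumpling"]}

def block_of(name):
    """Binary-search the index; return the block number or None."""
    pos = bisect.bisect_left(index_keys, name)
    if pos < len(index_keys) and index_keys[pos] == name:
        return index[pos][1]
    return None

def fetch(name):
    """Return the record if present, reading exactly one data block."""
    addr = block_of(name)
    if addr is None:
        return None
    return next(r for r in blocks[addr] if r == name)
```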
B-tree
Root node: F | P | Z
Leaf nodes: B | D | F; H | L | P; R | S | Z
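A search through the two-level tree in the figure might look like the sketch below. It assumes each root key is the highest key of the corresponding leaf, which is how the figure reads; a general B-tree also needs node splitting and more levels, which are omitted here.

```python
import bisect

root_keys = ["F", "P", "Z"]
leaves = [
    ["B", "D", "F"],  # keys up to F
    ["H", "L", "P"],  # keys after F, up to P
    ["R", "S", "Z"],  # keys after P, up to Z
]

def search(key):
    """Return (leaf index, position) if key is stored, else None."""
    child = bisect.bisect_left(root_keys, key)  # first root key >= key
    if child == len(root_keys):
        return None                             # beyond the largest key
    leaf = leaves[child]
    pos = bisect.bisect_left(leaf, key)
    if pos < len(leaf) and leaf[pos] == key:
        return child, pos
    return None
```

For example, `search("L")` compares against the root, descends to the middle leaf, and finds L at position 1.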
Inverted
Key values of the physical records are not necessarily in logical sequence.
The access method is better used for retrieval.
An index may be built for every field to be inverted.
Access efficiency depends on the number of database records, the levels of index, and the storage allocated for the index.
Inverted
[Figure: inverted file on Course Number.
Index (actual value, record addresses): CH 145: 101, 103, 104; CS 201: 102; CS 623; PH 345.
Data records (student name, course number): Adams, CH 145; Becker, CS 201; Dumpling, CH 145; Getta, CH 145; Harty, CS 623; Mobile, CS 623.]
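Building an inverted index on Course Number can be sketched as below. Record addresses 101 to 104 follow the figure; addresses 105 and 106 for Harty and Mobile are hypothetical, since the figure does not show them.

```python
from collections import defaultdict

# record address -> (student name, course number)
records = {
    101: ("Adams", "CH 145"),
    102: ("Becker", "CS 201"),
    103: ("Dumpling", "CH 145"),
    104: ("Getta", "CH 145"),
    105: ("Harty", "CS 623"),   # hypothetical address
    106: ("Mobile", "CS 623"),  # hypothetical address
}

def build_inverted_index(recs, field):
    """Map each value of the chosen field to the addresses holding it."""
    inverted = defaultdict(list)
    for address in sorted(recs):
        inverted[recs[address][field]].append(address)
    return dict(inverted)

course_index = build_inverted_index(records, field=1)

def students_in(course):
    """Retrieve student names through the index, without a full scan."""
    return [records[a][0] for a in course_index.get(course, [])]
```

Retrieval by course touches only the listed addresses, while every insert or delete must also update the index; that is why the method suits retrieval better than storage.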
Direct
Key values of the physical records are not necessarily in logical sequence.
There is a one-to-one correspondence between a record key and the physical address of the record.
May be used for storage and retrieval.
Access efficiency is always 1.
Storage efficiency depends on the density of keys.
No duplicate keys are permitted.
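A small sketch of direct addressing, assuming integer keys and one fixed-size slot reserved per possible key. The record size, base address, and function names are hypothetical.

```python
RECORD_SIZE = 190  # hypothetical record length in bytes
BASE = 0           # hypothetical address of the slot for the smallest key

def direct_address(key, min_key=1):
    """Each key maps to exactly one slot, so one access finds the record."""
    return BASE + (key - min_key) * RECORD_SIZE

def storage_efficiency(keys):
    """Fraction of reserved slots that actually hold a record."""
    slots = max(keys) - min(keys) + 1
    return len(set(keys)) / slots

dense = storage_efficiency([1, 2, 3, 4])  # every slot is used
sparse = storage_efficiency([1, 10])      # 2 records, 10 reserved slots
```

The sparse case shows why storage efficiency depends on key density: slots are reserved for keys 2 through 9 even though no records exist for them.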
Hashing
Key values of the physical records are not necessarily in logical sequence.
Many key values may share the same physical address (block).
May be used for storage and retrieval.
Access efficiency depends on the distribution of keys, the algorithm for key transformation, and the space allocated.
Storage efficiency depends on the distribution of keys and the algorithm used for key transformation.
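The effect of key distribution can be seen with a toy key transformation. The division-remainder method and the number of blocks are assumptions; the slides do not name a particular algorithm.

```python
NUM_BLOCKS = 4  # hypothetical number of blocks allocated to the file

def block_for(key):
    """Key transformation: division-remainder (an assumed choice)."""
    return key % NUM_BLOCKS

def load(keys):
    """Place each key in the block chosen by the key transformation."""
    blocks = {b: [] for b in range(NUM_BLOCKS)}
    for key in keys:
        blocks[block_for(key)].append(key)
    return blocks

even_spread = load([101, 102, 103, 104, 105])
skewed = load([4, 8, 12])  # all multiples of 4 collide in block 0
```

In `even_spread` the keys 101 and 105 share block 1 and the rest land in separate blocks; in `skewed` every key lands in block 0 while three blocks stay empty, which hurts both access and storage efficiency.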
[Table: comparison of file organizations, including hashed files. Recoverable entries, in the order they appear:]
Hashed: more space needed for addition and deletion of records after initial load.
Moderately fast. Moderately fast. Very fast with multiple indexes.
Impractical. Possible but needs a full scan.
Can create wasted space.
Adding records requires rewriting the file.
Updating records usually requires rewriting the file.
OK if dynamic. OK if dynamic.
Easy but requires maintenance of indexes.
Very easy. Very easy.