02 Storage (1)
Outline
• Where and how are data stored?
– physical level
– logical level
Building a Database: High-Level
• Design conceptual schema using a data model, e.g. ER, UML,
etc.
Building a Database: Logical-Level
• Design logical schema, e.g. relational, network, hierarchical,
object-relational, XML, etc.
• Data Definition Language (DDL)
Populating a Database
• Data Manipulation Language (DML)
Transaction operations
• Transaction: a collection of operations performing a single
logical function
BEGIN TRANSACTION transfer;
UPDATE bank_account SET balance = balance - 100 WHERE account = 1;
UPDATE bank_account SET balance = balance + 100 WHERE account = 2;
COMMIT TRANSACTION transfer;
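The transfer above can be tried out with a minimal sketch using Python's built-in sqlite3 module. The table layout, the starting balances and the helper name `transfer` are assumptions made for illustration; SQLite uses plain BEGIN/COMMIT rather than named transactions.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so the BEGIN/COMMIT/ROLLBACK statements below are issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE bank_account (account INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO bank_account VALUES (?, ?)", [(1, 500), (2, 200)])  # assumed data


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst: both updates take effect, or neither does."""
    conn.execute("BEGIN")
    try:
        conn.execute("UPDATE bank_account SET balance = balance - ? WHERE account = ?", (amount, src))
        conn.execute("UPDATE bank_account SET balance = balance + ? WHERE account = ?", (amount, dst))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


transfer(conn, 1, 2, 100)
balances = dict(conn.execute("SELECT account, balance FROM bank_account"))
print(balances)
```

The total amount of money is the same before and after the call, which is the "single logical function" the transaction is meant to protect.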
Where and how is all this information stored?
• Metadata: tables, attributes, data types, constraints, etc.
• Data: records
• Transaction logs, indices, etc.
Where: In Main Memory?
• Fast!
• But:
– Too small
– Too expensive
– Volatile
Physical Storage Media
• Primary Storage
– Cache
– Main memory
• Secondary Storage
– Flash memory
– Magnetic disk
• Offline Storage
– Optical disk
– Magnetic tape
Magnetic Disks
• Random Access
• Inexpensive
• Non-volatile
How do disks work?
• Platter: covered with magnetic recording material
• Track: logical division of platter surface
• Sector: hardware division of tracks
• Block: OS division of tracks
– Typical block sizes:
512 B, 2KB, 4KB
• Read/write head
Disk I/O
• Disk I/O := block I/O
– The hardware address of a block consists of cylinder, surface and sector number
– Modern disks: logical sector addresses 0…n
• Access time: time from the read/write request to when data transfer begins
– Seek time: time for the head to reach the correct track
• Average seek time 5-10 msec
– Rotational latency: time for the correct block to rotate under the head
• One full rotation takes 4-11 msec (15K RPM down to 5400 RPM)
• On average half a rotation, i.e. about 2-5.6 msec
• Block transfer time
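As a worked example of the access-time arithmetic: one rotation lasts 60,000/RPM msec, and on average the wanted block is half a rotation away. The function names below are assumptions for illustration, and the 5 msec seek time is just one value from the range quoted on the slide.

```python
def rotation_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000 / rpm


def avg_rotational_latency_ms(rpm):
    """On average the wanted block is half a rotation away from the head."""
    return rotation_ms(rpm) / 2


def avg_access_ms(avg_seek_ms, rpm):
    """Access time = seek time + rotational latency (transfer time not included)."""
    return avg_seek_ms + avg_rotational_latency_ms(rpm)


for rpm in (5400, 15000):
    print(rpm, round(rotation_ms(rpm), 2), round(avg_rotational_latency_ms(rpm), 2))

print(avg_access_ms(5, 15000))
```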
Optimize I/O
• Database system performance is I/O bound
• Improve the speed of access to disk:
– Scheduling algorithms (elevator algorithm)
– File Organization (heap, index, hash)
• Introduce disk redundancy
– Redundant Array of Independent Disks (RAID)
• Reduce number of I/Os
– Query optimization, indexes
Where and how is all this information stored?
• Metadata: tables, attributes, data types, constraints, etc.
• Data: records
• Transaction logs, indices, etc.
Storage Access
• A collection of files
– Physically partitioned into pages
– Typical database page sizes: 2KB, 4KB, 8KB
– Reduce number of block I/Os := reduce number of page I/Os
– How?
• Buffer Manager
Buffer Management (1/2)
• Buffer: a main-memory frame holding a copy of a disk page
• Buffer manager: manages a pool of buffers
– Requested page in pool: hit!
– Requested page on disk (miss):
• Allocate a page frame
• Read the page and pin it
• Problems?
(Figure: pages moving between the disk and the buffer pool)
Buffer Management (2/2)
• What if no empty page frame exists?
– Select a victim page
– Each page is associated with a dirty flag
– If the selected page is dirty, write it back to disk
• Which page to select?
– Replacement policies (LRU, MRU)
(Figure: a page request served from the buffer pool or from disk)
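The hit/miss, pin, dirty-flag and victim-selection steps can be put together in a small sketch. It assumes an LRU policy, a Python dict standing in for the disk, and hypothetical names (`BufferPool`, `pin`, `unpin`); a real buffer manager works on fixed-size frames and handles concurrency.

```python
from collections import OrderedDict


class BufferPool:
    """Toy buffer manager with LRU replacement (illustrative sketch only)."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk              # dict page_id -> bytes, stands in for the disk
        self.frames = OrderedDict()   # page_id -> frame; order = least to most recently used
        self.hits = 0
        self.misses = 0

    def pin(self, page_id):
        if page_id in self.frames:            # requested page in pool: hit
            self.hits += 1
            self.frames.move_to_end(page_id)
        else:                                 # requested page on disk: miss
            self.misses += 1
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[page_id] = {"data": self.disk[page_id], "dirty": False, "pins": 0}
        frame = self.frames[page_id]
        frame["pins"] += 1                    # pinned pages cannot be chosen as victims
        return frame

    def unpin(self, page_id, dirty=False):
        frame = self.frames[page_id]
        frame["pins"] -= 1
        frame["dirty"] = frame["dirty"] or dirty

    def _evict(self):
        # Victim = least recently used page that nobody has pinned.
        victim = next((pid for pid, f in self.frames.items() if f["pins"] == 0), None)
        if victim is None:
            raise RuntimeError("all frames are pinned")
        frame = self.frames.pop(victim)
        if frame["dirty"]:                    # dirty victim: write it back first
            self.disk[victim] = frame["data"]


disk = {1: b"A", 2: b"B", 3: b"C"}
pool = BufferPool(capacity=2, disk=disk)
frame = pool.pin(1)
frame["data"] = b"A2"                         # modify page 1 in memory
pool.unpin(1, dirty=True)
pool.pin(2)
pool.unpin(2)
pool.pin(3)                                   # pool is full: page 1 is evicted and written back
pool.unpin(3)
```

Swapping the victim choice to the most recently used unpinned page would give MRU, which can behave better for repeated sequential scans.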
Disk Arrays
• Single disk becomes bottleneck
• Disk arrays
– instead of single large disk
– many small parallel disks
• read N blocks in a single access time
• concurrent queries
• tables spanning multiple disks
• Redundant Arrays of Independent Disks (RAID)
– 7 levels (0-6)
– reliability
– redundancy
– parallelism
RAID Technology
• A natural solution is a large array of small independent
disks acting as a single higher-performance logical disk.
• A concept called data striping is used, which utilizes
parallelism to improve disk performance.
• Data striping distributes data transparently over multiple
disks to make them appear as a single large, fast disk.
RAID Technology (cont.)
• Different RAID organizations were defined based on different
combinations of two factors: the granularity of data
interleaving (striping) and the pattern used to compute redundant
information.
– RAID level 0 (striping) has no redundant data and hence has
the best write performance at the risk of data loss.
– RAID level 1 uses mirrored disks.
– RAID level 2 uses memory-style redundancy by using
Hamming codes, which contain parity bits for distinct
overlapping subsets of components. Level 2 includes both
error detection and correction.
– RAID level 3 uses a single parity disk, relying on the disk
controller to figure out which disk has failed.
– RAID levels 4 and 5 use block-level data striping, with level
5 distributing data and parity information across all disks.
– RAID level 6 applies the so-called P + Q (two parity)
redundancy scheme using Reed-Solomon codes to protect
against up to two disk failures by using just two redundant
disks.
RAID level 0
• Block level striping
• No redundancy
• maximum bandwidth
• automatic load balancing
• best write performance
• but, no reliability
(Figure: four disks; blocks 0 1 2 3 form the first stripe, blocks 4 5 start the second)
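Block-level striping amounts to a simple address calculation. The sketch below assumes round-robin placement over four disks, matching the figure; the function name `locate` is made up for illustration.

```python
def locate(block, n_disks):
    """Round-robin striping: return (disk index, stripe index) of a logical block."""
    return block % n_disks, block // n_disks


N_DISKS = 4
layout = {b: locate(b, N_DISKS) for b in range(6)}
for block, (disk, stripe) in layout.items():
    print(f"block {block} -> disk {disk}, stripe {stripe}")
```

Consecutive blocks land on different disks, which is why a large sequential read can use all disks in parallel.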
RAID level 1
• Mirroring
– Two identical copies stored in two different disks
• Parallel reads
• Sequential writes
• transfer rate comparable to single disk rate
• most expensive solution
(Figure: mirrored disk pairs; blocks 0 and 1 on one pair, block 2 on the other, each stored twice)
RAID level 4
• block level striping
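• one parity block for each stripe of blocks on the data disks
– P1 = B0 XOR B1 XOR B2
– B2 = B0 XOR B1 XOR P1 (a lost block is rebuilt from the surviving ones)
• an update:
– P1’ = B0’ XOR B0 XOR P1 (every update must also write the parity disk)
(Figure: data blocks B0 B1 B2 and parity block P1, one per disk)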
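The three parity identities can be checked directly on small byte strings. This is a sketch with made-up two-byte blocks; `xor_blocks` is a hypothetical helper, not part of any RAID implementation.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Three data blocks (assumed contents) and their parity block.
b0, b1, b2 = b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"
p1 = xor_blocks(b0, b1, b2)                 # P1 = B0 XOR B1 XOR B2

# The disk holding B2 fails: rebuild it from the survivors and the parity.
rebuilt_b2 = xor_blocks(b0, b1, p1)         # B2 = B0 XOR B1 XOR P1

# B0 is overwritten: the new parity needs only the old B0, the new B0 and the old P1.
new_b0 = b"\xff\x00"
new_p1 = xor_blocks(new_b0, b0, p1)         # P1' = B0' XOR B0 XOR P1
```

The update rule reads two blocks and writes two blocks regardless of how many data disks there are, but one of the writes always hits the parity disk.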
RAID level 5 and 6
• subsumes RAID 4
• parity disk not a bottleneck
– parity blocks distributed on all disks
• RAID 6
– tolerates two disk failures
– P+Q redundancy scheme
• 2 bits of redundant data for each 4 bits of data
– more expensive writes
(Figure: data blocks and parity blocks interleaved across all disks)
student
cid name
00112233 Paul
The physical disk blocks that are allocated to hold the records
of a file can be contiguous, linked, or indexed.
What if a record is deleted?
• Depending on the type of records:
– Fixed-length records
– Variable-length records
Fixed-length record files
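• Upon record deletion:
– Packed page scheme: records occupy slots 1..N contiguously, followed by free space; the page header stores N, the number of records
– Bitmap scheme: the page has M slots; the page header stores M and one bit per slot (1 = occupied, 0 = free)
(Figure: packed page layout next to bitmap page layout)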
Variable-length record files
• When do we have a file with variable-length records?
– Column datatype: variable length
– create table t (field1 int, field2 varchar2(n))
• Problems:
– Holes created upon deletion have variable size
– Find large enough free space for new record
• Could use previous approaches: maximum record size
– a lot of space wasted
• Use slotted page structure
– Slot directory
– Each slot storing offset, size of record
– Record IDs: page number, slot number
(Figure: slotted page whose slot directory holds one entry per slot N ... 2 1, together with the number of slots N)
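A minimal sketch of a slotted page follows. It simplifies in ways that are assumptions, not the slide's design: the slot directory is kept as a Python list rather than inside the page bytes, records grow from the front of the page, and the class and method names are made up.

```python
class SlottedPage:
    """Toy slotted page for variable-length records (illustrative sketch only)."""

    def __init__(self, size=64):
        self.data = bytearray(size)
        self.slots = []        # slot directory: (offset, length), or None for a free slot
        self.free_start = 0    # first unused byte

    def insert(self, record):
        if self.free_start + len(record) > len(self.data):
            self.compact()     # squeeze out holes left by deletions
        if self.free_start + len(record) > len(self.data):
            raise ValueError("page full")
        offset = self.free_start
        self.data[offset:offset + len(record)] = record
        self.free_start += len(record)
        entry = (offset, len(record))
        for slot, old in enumerate(self.slots):
            if old is None:    # reuse a free slot number
                self.slots[slot] = entry
                return slot
        self.slots.append(entry)
        return len(self.slots) - 1

    def read(self, slot):
        entry = self.slots[slot]
        if entry is None:
            raise KeyError(f"slot {slot} is empty")
        offset, length = entry
        return bytes(self.data[offset:offset + length])

    def delete(self, slot):
        self.slots[slot] = None    # the hole is reclaimed lazily by compact()

    def compact(self):
        """Move live records together; slot numbers (and so record IDs) stay valid."""
        new_data = bytearray(len(self.data))
        pos = 0
        for slot, entry in enumerate(self.slots):
            if entry is None:
                continue
            offset, length = entry
            new_data[pos:pos + length] = self.data[offset:offset + length]
            self.slots[slot] = (pos, length)
            pos += length
        self.data = new_data
        self.free_start = pos


page = SlottedPage(size=16)
s0 = page.insert(b"alpha")
s1 = page.insert(b"bravo!")
page.delete(s0)
s2 = page.insert(b"charlie")   # does not fit until the hole left by "alpha" is compacted away
```

Because other structures refer to a record by (page number, slot number), compaction can move the bytes of "bravo!" without invalidating its record ID.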
Record Organization
• Fixed-length record formats
– Fields stored consecutively
– With base address B and field lengths L1 L2 L3 L4: f3 Address = B+L1+L2
• Variable-length record formats
– Array of offsets
– NULL values when start offset = end offset
(Figure: fields f1 f2 f3 f4 laid out from base address B, once with fixed lengths and once behind an array of offsets)
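Both record formats can be sketched in a few lines. The helper names, the 2-byte big-endian offsets and the sample field values are assumptions for illustration; as on the slide, a field whose start and end offsets coincide is read back as NULL.

```python
import struct


def field_address(base, lengths, i):
    """Fixed-length format: address of the i-th field (0-based) = B + L1 + ... + L(i)."""
    return base + sum(lengths[:i])


def encode_record(fields):
    """Variable-length format: an array of len(fields)+1 offsets, then the field bytes.
    A None field is stored as NULL by giving it equal start and end offsets."""
    header_size = 2 * (len(fields) + 1)
    offsets, body, pos = [], b"", header_size
    for field in fields:
        offsets.append(pos)
        if field is not None:
            body += field
            pos += len(field)
    offsets.append(pos)                      # end offset of the last field
    return struct.pack(f">{len(offsets)}H", *offsets) + body


def decode_field(record, n_fields, i):
    offsets = struct.unpack_from(f">{n_fields + 1}H", record)
    start, end = offsets[i], offsets[i + 1]
    return None if start == end else record[start:end]


f3_address = field_address(1000, [4, 10, 2, 8], 2)   # B + L1 + L2
rec = encode_record([b"Paul", None, b"CS"])
```

Any field can be located with two offset lookups, without scanning the fields before it.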
Operation on Files
• Typical file operations include:
– OPEN: Prepares the file for access and associates a pointer that will refer
to a current file record at each point in time.
– FIND: Searches for the first file record that satisfies a certain condition
and makes it the current file record.
– FINDNEXT: Searches for the next file record (from the current record) that
satisfies a certain condition and makes it the current file record.
– READ: Reads the current file record into a program variable.
– INSERT: Inserts a new record into the file & makes it the current file
record.
– DELETE: Removes the current file record from the file, usually by marking
the record to indicate that it is no longer valid.
– MODIFY: Changes the values of some fields of the current file record.
– CLOSE: Terminates access to the file.
– REORGANIZE: Reorganizes the file records.
• For example, the records marked deleted are physically removed from
the file or a new organization of the file records is created.
– READ_ORDERED: Read the file blocks in order of a specific field of the
file.
File Organization
(later we study it in a more detailed way)
Heap Files
• Simplest file structure
• Efficient insert
• Slow search and delete
– Equality search: half of the pages fetched on average (when the record exists)
– Range search: all pages must be fetched
(Figure: heap file with a file header)
Sorted (Ordered) files
• Records sorted on an ordering field (e.g. Ename)
– If the ordering field is also a key field, it is called the ordering key (e.g. Empno)
• Slow inserts and deletes
• Fast logarithmic search
(Figure: a sorted file spanning Page 1 and Page 2, before and after an insert)
Sorted (Ordered) Files
• Also called a sequential file.
• File records are kept sorted by the values of an ordering field.
• Insertion is expensive: records must be inserted in the correct
order.
– It is common to keep a separate unordered overflow (or
transaction) file for new records to improve insertion
efficiency; this is periodically merged with the main ordered
file.
• A binary search can be used to search for a record on its
ordering field value.
– This requires reading and searching about log2(b) of the b file
blocks on average, an improvement over linear search.
• Reading the records in order of the ordering field is quite
efficient.
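The block-level binary search can be sketched as follows, counting block reads as it goes. The file is modelled as a Python list of sorted blocks, which is an assumption for illustration, as is the function name `find`.

```python
def find(blocks, key):
    """Binary search over the blocks of a sorted file.
    Returns (found, number of blocks read)."""
    lo, hi, reads = 0, len(blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]          # one block I/O
        reads += 1
        if key < block[0]:
            hi = mid - 1
        elif key > block[-1]:
            lo = mid + 1
        else:                        # key, if present, must be in this block
            return key in block, reads
    return False, reads


# 1024 blocks of 4 records each, holding the keys 0..4095 in order.
blocks = [[4 * i + j for j in range(4)] for i in range(1024)]
found, reads = find(blocks, 2049)
missing, reads_missing = find(blocks, 5000)
```

With b = 1024 blocks no search reads more than 11 blocks, against 512 on average for a linear scan of a heap file.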
Hashed Files
• Hash function h on the hash field distributes records into buckets
• Efficient equality searches, inserts and deletes
• No support for range searches
(Figure: h maps the hash field to a bucket; full buckets chain to overflow pages, and chains end in null)
Hashed Files
• Hashing for disk files is called External Hashing
• The file blocks are divided into M equal-sized buckets,
numbered bucket0, bucket1, ..., bucketM-1.
– Typically, a bucket corresponds to one disk block (or a fixed
number of disk blocks).
• One of the file fields is designated to be the hash key of the file.
• The record with hash key value K is stored in bucket i, where
i=h(K), and h is the hashing function.
• Search is very efficient on the hash key.
• Collisions occur when a new record hashes to a bucket that is
already full.
– An overflow file is kept for storing such records.
– Overflow records that hash to each bucket can be linked
together.
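External hashing with overflow chains can be sketched like this. Integer keys, h(K) = K mod M, and the class name `HashFile` are assumptions for illustration; each bucket is a Python list with a fixed capacity standing in for a disk block.

```python
class HashFile:
    """Toy external hash file with per-bucket overflow chains (illustrative sketch only)."""

    def __init__(self, n_buckets, bucket_capacity):
        self.m = n_buckets
        self.capacity = bucket_capacity
        self.buckets = [[] for _ in range(n_buckets)]    # primary buckets
        self.overflow = [[] for _ in range(n_buckets)]   # overflow records linked per bucket

    def h(self, key):
        return key % self.m          # assumed hash function

    def insert(self, key, value):
        i = self.h(key)
        if len(self.buckets[i]) < self.capacity:
            self.buckets[i].append((key, value))
        else:                        # collision on a full bucket: use the overflow chain
            self.overflow[i].append((key, value))

    def search(self, key):
        i = self.h(key)
        for k, v in self.buckets[i] + self.overflow[i]:
            if k == key:
                return v
        return None


hf = HashFile(n_buckets=4, bucket_capacity=2)
for key, name in [(1, "a"), (5, "b"), (9, "c"), (2, "d")]:
    hf.insert(key, name)
```

An equality search touches one bucket plus its overflow chain; a range search gets no help, because neighbouring keys hash to unrelated buckets.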
Summary (1/2)
• Why Physical Storage Organization?
– understanding low-level details which affect data access
– make data access more efficient
• Primary Storage (memory), Secondary Storage (disk)
– memory fast
– disk slow but non-volatile
• Data stored in files
– partitioned into pages physically
– partitioned into records logically
• Optimize I/Os
– scheduling algorithms
– RAID
– page replacement strategies
Summary (2/2)
• File Organization
– how each file type performs
• Page Organization
– strategies for record deletion
• Record Organization
Topics for today
• How to lay out data on disk
• How to move it to memory
What are the data items we want to store?
• a salary
• a name
• a date
• a picture
(Figure: a byte of 8 bits)
To represent:
• Integer (short): 2 bytes
e.g., 35 is
00000000 00100011
To represent:
• Characters
Example:
A: 1000001
a: 1100001
5: 0110101
LF: 0001010
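The integer and character encodings shown so far can be reproduced with Python's standard library. Big-endian byte order for the 2-byte integer is an assumption, chosen because it matches the bit pattern on the slide.

```python
import struct

# 2-byte (short) integer, big-endian.
short_35 = struct.pack(">h", 35)
bits_35 = " ".join(format(byte, "08b") for byte in short_35)

# 7-bit ASCII codes of a few characters.
ascii_bits = {ch: format(ord(ch), "07b") for ch in ("A", "a", "5", "\n")}

print(bits_35)
print(ascii_bits)
```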
To represent:
• Boolean
e.g., TRUE 1111 1111
FALSE 0000 0000
• Application specific
e.g., RED 1 GREEN 3
BLUE 2 YELLOW 4 …
Can we use less than 1 byte/code?
Yes, but only if desperate...
To represent:
• Dates
e.g.: - Integer, # days since Jan 1, 1900
- 8 characters, YYYYMMDD
- 7 characters, YYYYDDD
• Time
e.g. - Integer, seconds since midnight
- characters, HHMMSSFF
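The date and time representations above map onto the standard datetime module as follows. The sample date and time are made up for illustration.

```python
from datetime import date, time, timedelta

EPOCH = date(1900, 1, 1)
d = date(2000, 3, 1)                          # assumed sample date

days_since_1900 = (d - EPOCH).days            # integer representation
yyyymmdd = d.strftime("%Y%m%d")               # 8 characters
yyyyddd = d.strftime("%Y%j")                  # 7 characters: year + day of year

t = time(13, 5, 9)                            # assumed sample time
seconds_since_midnight = t.hour * 3600 + t.minute * 60 + t.second

print(days_since_1900, yyyymmdd, yyyyddd, seconds_since_midnight)
```

The integer forms are compact and make date arithmetic trivial; the character forms are readable without decoding.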
To represent:
• String of characters
– Null terminated
e.g., c a t \0
– Length given
e.g., 3 c a t
– Fixed length
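The three string layouts can be compared side by side. The function names, the 1-byte length prefix and the blank padding are assumptions for illustration.

```python
def null_terminated(s):
    """Characters followed by a terminating zero byte."""
    return s.encode("ascii") + b"\x00"


def length_prefixed(s):
    """One length byte (so at most 255 characters), then the characters."""
    data = s.encode("ascii")
    if len(data) > 255:
        raise ValueError("string too long for a 1-byte length")
    return bytes([len(data)]) + data


def fixed_length(s, width):
    """Characters padded with blanks up to a fixed width."""
    data = s.encode("ascii")
    if len(data) > width:
        raise ValueError("string longer than field width")
    return data.ljust(width, b" ")


print(null_terminated("cat"), length_prefixed("cat"), fixed_length("cat", 5))
```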
To represent:
• Bag of bits
(Figure: a length field followed by the bits)
Key Point
Also
• Type of an item: tells us how to interpret it
(plus its size, if fixed)
Overview Data Items
Records
Blocks
Files
Memory
Record - Collection of related data
items (called FIELDS)
Types of records:
• Main choices:
– FIXED vs VARIABLE FORMAT
– FIXED vs VARIABLE LENGTH
Fixed format
A SCHEMA (of a table record) contains the
following information
- # fields
- type of each field
- order in record
- meaning of each field
Example: fixed format and length
Employee record
Schema:
(1) E#, 2 byte integer
(2) E.name, 10 char.
(3) Dept, 2 byte code
Records:
55 s m i t h 02
83 j o n e s 01
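With a fixed format, the schema lives outside the record and each record is just the packed field values. The sketch below assumes big-endian integers, zero-byte padding of the name (what `struct` does for `s` fields) and a 2-character department code.

```python
import struct

# Schema: E# (2-byte integer), E.name (10 chars), Dept (2-byte code).
EMPLOYEE = struct.Struct(">h10s2s")


def pack_employee(eno, name, dept):
    return EMPLOYEE.pack(eno, name.encode("ascii"), dept.encode("ascii"))


def unpack_employee(record):
    eno, name, dept = EMPLOYEE.unpack(record)
    return eno, name.rstrip(b"\x00").decode("ascii"), dept.decode("ascii")


r1 = pack_employee(55, "smith", "02")
r2 = pack_employee(83, "jones", "01")
print(EMPLOYEE.size, unpack_employee(r1), unpack_employee(r2))
```

Every record is exactly 14 bytes, so the i-th record of a block starts at offset 14 * i with no separators needed.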
Variable format
• Record itself contains format
“Self Describing”
Example: variable format and length
2 5 I 46 4 S 4 F O R D
- 2: # fields
- 5: code identifying field as E#
- I: integer type (value 46)
- 4: code identifying the second field
- S: string type
- 4: length of str. (F O R D)
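A self-describing record of this kind can be encoded and decoded as sketched below. The byte-level layout (1-byte field count, 1-byte field code, a type tag of `I` or `S`, 2-byte integers, 1-byte string lengths) is an assumption for illustration; the slide only fixes the order of the components.

```python
import struct


def encode(fields):
    """fields: list of (field code, value), value being an int or a str."""
    out = bytearray([len(fields)])                 # number of fields
    for code, value in fields:
        out.append(code)                           # code identifying the field
        if isinstance(value, int):
            out += b"I" + struct.pack(">h", value)
        else:
            data = value.encode("ascii")
            out += b"S" + bytes([len(data)]) + data
    return bytes(out)


def decode(buf):
    n_fields, pos, fields = buf[0], 1, []
    for _ in range(n_fields):
        code, tag = buf[pos], buf[pos + 1:pos + 2]
        pos += 2
        if tag == b"I":
            (value,) = struct.unpack_from(">h", buf, pos)
            pos += 2
        else:
            length = buf[pos]
            value = buf[pos + 1:pos + 1 + length].decode("ascii")
            pos += 1 + length
        fields.append((code, value))
    return fields


record = encode([(5, 46), (4, "FORD")])
```

The record can be decoded without consulting a schema, at the price of repeating codes and type tags in every record.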
Variable format useful for:
• “sparse” records
• repeating fields
• evolving formats
• EXAMPLE: variable format record with repeating fields:
an Employee record with one or more children
Note: Repeating fields do not imply
- variable format, nor
- variable size
Record header - data at beginning
that describes record
May contain:
- record type
- record length
- time stamp
- other stuff ...
Next: placing records into blocks
(Figure: a file as a sequence of blocks; assume fixed-length blocks)
Options for storing records in blocks:
(1) separating records
(2) spanned vs. unspanned
(3) sequencing
(4) indirection
(1) Separating records
(Figure: a block holding records R1 R2 R3)
(a) no need to separate - fixed size recs.
(b) special marker
(c) give record lengths (or offsets)
- within each record
- in block header
(2) Spanned vs. Unspanned
• Unspanned: each record lies within one block
block 1: R1 R2
block 2: R3 R4 R5
• Spanned: a record may cross a block boundary
block 1: R1 R2 R3(a)
block 2: R3(b) R4 R5 R6 R7(a)
With spanned records:
block 1: R1 R2 R3(a)
block 2: R3(b) R4 R5 R6 R7(a)
– R3(a), R7(a): need indication of partial record, with a “pointer” to the rest
– R3(b): need indication of continuation (+ from where?)
Spanned vs. unspanned:
• Unspanned is much simpler, but may waste space…
• Spanned essential if
record size > block size
(3) Sequencing
• Ordering records in file (and block) by some key value
Why sequencing?
Typically to make it possible to efficiently read records in order
(e.g., to do a merge-join — discussed later)
Sequencing Options
(a) Next record physically contiguous: R1, Next (R1), ...
(b) Linked: R1 holds a pointer to Next (R1)
Sequencing Options
(c) Overflow area
(Figure: records R1 R2 R3 R4 R5 stored in sequence, each with a header; overflow records such as R2.1, R1.3 and R4.7 are kept in a separate area)
(4) Indirection
(Figure: a reference to record Rx)
Many options, from Physical to Indirect
Purely Physical
E.g., Record Address or ID =
- Block ID: Device ID, Cylinder #, Track #, Block #
- Offset in block
Fully Indirect
(Figure: a map with columns Rec ID and Physical addr.; rec ID r maps to address a)
Tradeoff
Flexibility to move records (for deletions, insertions)
vs.
Cost of indirection (manage the map)
Between Physical and Indirect:
many options in between …
Example: Indirection in block
(Figure: a block with a header, free space, and records R1 R2 R3 R4 reached through the header)
Block header - data at beginning that
describes block
May contain:
- File ID (or RELATION or DB ID)
- This block ID
- Record directory
- Pointer to free space
- Type of block (e.g. contains recs type 4;
is overflow, …)
- Pointer to other blocks “like it”
- Timestamp ...
Other Topics
(1) Insertion/Deletion
(2) Buffer Management
(3) Comparison of Schemes
Deletion
(Figure: record Rx being deleted from a block)
Options:
(a) Immediately reclaim space
(b) Mark deleted
As usual, many tradeoffs...
• How expensive is it to move a valid record to free space for
immediate reclaim?
• How much space is wasted?
– e.g., deleted records, delete fields, free space chains,...
Concern with deletions
Dangling pointers
(Figure: a pointer to deleted record R1 now leads nowhere)
Solution #1: Do not worry
Solution #2: Tombstones
E.g., Leave “MARK” in map or old location
• Physical IDs
(Figure: a block with a tombstone left at the old record location)
Solution #2: Tombstones
E.g., Leave “MARK” in map or old
location
• Logical IDs
(Figure: map with columns ID and LOC; the entry for ID 7788 holds a tombstone)
Never reuse ID 7788 nor space in map...
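The logical-ID variant of tombstones can be sketched with a small map class. The class and method names, and the (block, offset) form of the physical addresses, are assumptions for illustration.

```python
TOMBSTONE = object()     # marker left in the map for a deleted record


class RecordMap:
    """Toy indirection map from logical record IDs to physical addresses."""

    def __init__(self):
        self.map = {}
        self.next_id = 0             # IDs only ever grow, so none is reused

    def insert(self, address):
        rid = self.next_id
        self.next_id += 1
        self.map[rid] = address
        return rid

    def move(self, rid, new_address):
        self.map[rid] = new_address  # the record moved; its ID stays the same

    def delete(self, rid):
        self.map[rid] = TOMBSTONE    # keep the entry so stale references are detected

    def lookup(self, rid):
        address = self.map[rid]
        return None if address is TOMBSTONE else address


rmap = RecordMap()
r_a = rmap.insert(("block 7", 0))
r_b = rmap.insert(("block 7", 40))
rmap.move(r_b, ("block 9", 0))       # reorganization does not invalidate r_b
rmap.delete(r_a)
r_c = rmap.insert(("block 7", 0))    # the freed space is reused, the ID is not
```

A stale reference to `r_a` now resolves to "deleted" instead of silently reaching the new record stored at the same address, which is the dangling-pointer problem tombstones are meant to prevent.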
Insert
Easy case: records not in sequence
Insert new record at end of file or in
deleted slot
If records are variable size, not
as easy...
Insert
Hard case: records in sequence
If free space “close by”, not too bad...
Or use overflow idea...
Interesting problems:
(Figure: free space)
Buffer Management
• DB features needed
• Why LRU may be bad
• Pinned blocks
• Forced output
• Double buffering (prefetch)
Row vs Column Store
• So far, we assumed that fields of a record are stored
contiguously (row store)...
• Another option is to store like fields together (column store)
Row Store
• Example: Order consists of
– id, cust, prod, store, price, date, qty
Column Store
• Example: Order consists of
– id, cust, prod, store, price, date, qty
Row vs Column Store
• Advantages of Column Store
– more compact storage (fields need not start at byte boundaries)
– efficient reads on data mining operations
• Advantages of Row Store
– writes (multiple fields of one record) more efficient
– efficient reads for record access (OLTP)
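The two layouts can be contrasted on a few Order records. The sample values are made up for illustration; Python tuples and lists stand in for contiguous storage on disk.

```python
FIELDS = ("id", "cust", "prod", "store", "price", "date", "qty")

# Row store: all fields of one order are stored together.
rows = [
    (1, "ann", "pen", "s1", 2.5, "2024-01-05", 4),
    (2, "bob", "ink", "s2", 7.0, "2024-01-06", 1),
    (3, "ann", "pad", "s1", 3.0, "2024-01-07", 2),
]

# Column store: all values of one field are stored together.
columns = {name: [row[i] for row in rows] for i, name in enumerate(FIELDS)}

# Analytical query: only the qty column has to be read.
total_qty = sum(columns["qty"])

# Record access: one contiguous read in the row store ...
order_2 = rows[1]
# ... but one access per column in the column store.
order_2_from_columns = tuple(columns[name][1] for name in FIELDS)
```

This is the tradeoff listed above: scans over a few fields favour the column layout, while reading or writing whole records favours the row layout.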
Comparison
• There are 10,000,000 ways to organize my data on disk…
Issues:
• Flexibility
• Complexity
• Space Utilization
• Performance
To evaluate a given strategy, compute the following parameters:
-> space used for expected data
-> expected time to
- fetch record given key
- fetch record with next key
- insert record
- append record
- delete record
- update record
- read the whole file
- reorganize file
Summary
• How to lay out data on disk
Data Items
Records
Blocks
Files
Memory
DBMS
Next
How to find a record quickly,
given a key