File Organization: File organization in a DBMS is how records are
arranged and stored on a storage medium to optimize performance
for operations like search, insert, and delete. Different methods
suit different workloads and access patterns.
Key concepts
Files and records: A file is a collection of related records, and a
record is a group of fields (data elements).
Blocks: On a storage device, data is stored in blocks. File
organization maps records to these physical blocks.
Logical vs. Physical: File organization defines the logical
relationship between records, while physical storage is their
actual arrangement on the disk.
Indexed Sequential Access Method (ISAM): ISAM is a file
organization technique in Database Management Systems (DBMS)
that facilitates both sequential and random access to records.
Components of ISAM:
Data File: This file stores the actual records, which are
organized in sequential order based on a designated key field
(often the primary key).
Index File: This file contains index entries, which are essentially
pointers to blocks or records within the data file. These index
entries are also sorted according to the key.
Overflow Area: This is a separate area used to store new
records that cannot be accommodated in their sorted position
within the primary data file due to space constraints.
Advantages of ISAM:
Efficient for both sequential and random access: Combines the
benefits of both access methods.
Fast retrieval: Indexes enable quick location of records.
Supports range queries: Efficiently retrieves records within a
specified range of key values.
Disadvantages of ISAM:
Static structure: The static nature of the index can lead to
performance issues with frequent updates (insertions,
deletions).
Overflow chains: Excessive updates can create scattered
overflow chains, hindering performance.
Requires more disk space: Additional space is needed to store
the index file and overflow area.
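The components above can be sketched in Python. This is an illustrative in-memory model, not an on-disk format; the class name, block size, and method names are hypothetical choices:

```python
import bisect

# Minimal ISAM sketch: a sorted primary area split into fixed-size
# blocks, a sparse index holding the first key of each block, and an
# overflow list for records inserted after the file was built.

class ISAMFile:
    def __init__(self, records, block_size=4):
        # records: (key, value) pairs, already sorted by key
        self.blocks = [records[i:i + block_size]
                       for i in range(0, len(records), block_size)]
        # sparse index: first key of each data block
        self.index = [blk[0][0] for blk in self.blocks]
        self.overflow = []  # records that no longer fit in sorted order

    def insert(self, key, value):
        # the static index is not rebuilt; new records go to overflow
        self.overflow.append((key, value))

    def search(self, key):
        # use the index to pick the block, then scan that block
        i = bisect.bisect_right(self.index, key) - 1
        if i >= 0:
            for k, v in self.blocks[i]:
                if k == key:
                    return v
        # fall back to scanning the overflow area
        for k, v in self.overflow:
            if k == key:
                return v
        return None

f = ISAMFile([(1, 'a'), (3, 'b'), (5, 'c'), (7, 'd'), (9, 'e')],
             block_size=2)
f.insert(4, 'x')
print(f.search(5))  # 'c', found via the index
print(f.search(4))  # 'x', found via the overflow area
```

Note how lookups that miss the primary area degrade to a scan of the overflow list, which is exactly the overflow-chain problem listed under the disadvantages.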
Implementation using B-tree: B-trees are fundamental data
structures for implementing indexes in Database Management
Systems (DBMS). Their design optimizes for disk I/O operations,
which are significantly slower than in-memory operations.
Operations:
Searching: To find a record, the DBMS starts at the root node
and traverses down the tree. At each internal node, it compares
the search key with the node's keys to determine which child
node to follow.
Insertion: A new key-value pair is inserted into the appropriate
leaf node. If a leaf node becomes full, it is split into two, and
the median key is promoted to the parent node.
Deletion: Deleting a key-value pair involves removing it from
the leaf node. If a node underflows (has fewer keys than the
minimum allowed), keys are borrowed from a sibling node, or the
node is merged with a sibling; the change may propagate up the
tree.
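The search traversal described above can be sketched as follows. The node class is a hypothetical in-memory stand-in; a real DBMS would read each node from a disk block:

```python
# Each internal node holds sorted keys and one more child pointer
# than keys; child i covers keys between keys[i-1] and keys[i].

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                 # sorted list of keys
        self.children = children or []   # empty list for a leaf

def btree_search(node, key):
    i = 0
    # find the first key >= the search key
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    if i < len(node.keys) and node.keys[i] == key:
        return True                      # found in this node
    if not node.children:
        return False                     # reached a leaf with no match
    return btree_search(node.children[i], key)  # descend into child i

root = BTreeNode([10, 20],
                 [BTreeNode([3, 7]),
                  BTreeNode([13, 17]),
                  BTreeNode([25])])
print(btree_search(root, 17))  # True
print(btree_search(root, 8))   # False
```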
Implementation using B+ tree: A B+ tree is a self-balancing tree
data structure widely used in Database Management Systems
(DBMS) for indexing large datasets. It is an optimized version of
the B-tree, designed for efficient disk-based storage and
retrieval.
Key Characteristics of B+ Trees in DBMS:
All data in leaf nodes: Unlike B-trees where data can be in
internal nodes, in a B+ tree, all actual data records (or pointers
to them) are stored exclusively in the leaf nodes.
Internal nodes as index guides: Internal nodes (non-leaf nodes)
only store keys to guide the search to the correct leaf
node. They do not contain data records.
Linked leaf nodes: All leaf nodes are linked together in a
sequential manner, forming a sorted linked list. This allows for
efficient sequential access and range queries.
Balanced structure: All leaf nodes are at the same level (height)
from the root, ensuring consistent search performance.
High fanout: B+ trees typically have a high branching factor
(order), meaning each internal node can have many
children. This results in a shallower tree, reducing the number
of disk I/O operations required for data access.
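The linked-leaf property is what makes range queries cheap: locate the first qualifying leaf, then follow next-pointers instead of re-descending the tree. A minimal sketch (the Leaf class and record names are hypothetical):

```python
# B+ tree leaves form a sorted linked list; a range query scans
# forward along the chain and stops once keys exceed the upper bound.

class Leaf:
    def __init__(self, entries):
        self.entries = entries  # sorted (key, record) pairs
        self.next = None        # pointer to the next leaf in key order

def range_query(first_leaf, lo, hi):
    results, leaf = [], first_leaf
    while leaf is not None:
        for k, v in leaf.entries:
            if lo <= k <= hi:
                results.append(v)
            elif k > hi:
                return results  # keys are sorted, so stop early
        leaf = leaf.next
    return results

a = Leaf([(1, 'r1'), (4, 'r4')])
b = Leaf([(6, 'r6'), (9, 'r9')])
c = Leaf([(12, 'r12')])
a.next, b.next = b, c

print(range_query(a, 4, 9))  # ['r4', 'r6', 'r9']
```

In a full B+ tree the internal nodes would be used to find the starting leaf; here the scan simply begins at the first leaf for brevity.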
Hashing: Hashing in a Database Management System (DBMS) is
a technique used to directly map search-key values to disk block
addresses, allowing for efficient retrieval, insertion, and
deletion of records without the need for extensive searching or
indexing.
How Hashing Works:
Hash Function: A mathematical function takes the search-key
value as input and calculates a hash address, which corresponds
to the physical address of a data block (bucket) on disk.
Buckets: These are storage units (usually disk blocks) that can
hold one or more data records.
Operations:
Insertion: To insert a new record, the hash function is
applied to its search key to determine the target
bucket. The record is then stored in that bucket.
Search: To search for a record, the hash function is applied
to its search key to find the bucket where it should
reside. The system then directly accesses that bucket to
retrieve the record.
Deletion: To delete a record, it is first located using the
hash function, and then removed from its respective
bucket.
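The three operations can be sketched with a toy static hash file. The function and variable names are illustrative, and h(k) = k % 4 stands in for a real hash function on integer keys:

```python
# A fixed number of buckets, each a list of (key, record) pairs;
# the hash function maps a key directly to its bucket.

NUM_BUCKETS = 4
buckets = [[] for _ in range(NUM_BUCKETS)]

def h(key):
    return key % NUM_BUCKETS  # toy hash function for integer keys

def insert(key, record):
    buckets[h(key)].append((key, record))

def search(key):
    for k, rec in buckets[h(key)]:  # only one bucket is examined
        if k == key:
            return rec
    return None

def delete(key):
    b = buckets[h(key)]
    b[:] = [(k, rec) for k, rec in b if k != key]

insert(10, 'Alice')
insert(14, 'Bob')
print(search(14))  # 'Bob'
delete(14)
print(search(14))  # None
```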
Types of Hashing:
Static Hashing: The number of buckets remains fixed, and the
hash function always maps a key to the same bucket
address. Collision handling strategies (like chaining or open
addressing) are crucial here.
Dynamic Hashing: The hash table can grow or shrink
dynamically based on the number of records. This is more
suitable for databases with fluctuating data volumes, as it helps
manage overflow and underflow more efficiently.
Collision Resolution: Collision resolution in a DBMS, particularly
within the context of hashing, refers to the techniques used to
handle situations where two or more different keys map to the
same location (or index) in a hash table. This is known as a
collision.
Common Collision Resolution Techniques:
Separate Chaining (Open Hashing):
Each slot in the hash table points to a linked list.
When a collision occurs, the new key is simply added to
the linked list at that particular slot.
Example: Consider a hash table of size 5 and hash
function h(k) = k % 5. We want to insert keys 12, 15, 22,
25, 37.
h(12) = 12 % 5 = 2. Key 12 is placed in slot 2.
h(15) = 15 % 5 = 0. Key 15 is placed in slot 0.
h(22) = 22 % 5 = 2. Collision with 12. Key 22 is added
to the linked list at slot 2, so slot 2 now contains [12
-> 22].
h(25) = 25 % 5 = 0. Collision with 15. Key 25 is added
to the linked list at slot 0, so slot 0 now contains [15
-> 25].
h(37) = 37 % 5 = 2. Collision with 12 and 22. Key 37 is
added to the linked list at slot 2, so slot 2 now
contains [12 -> 22 -> 37].
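The chaining example above can be reproduced directly, using a Python list per slot in place of a linked list:

```python
# Separate chaining: table of size 5, h(k) = k % 5; each slot holds
# a chain of all keys that hashed to it.

SIZE = 5
table = [[] for _ in range(SIZE)]

def insert(k):
    table[k % SIZE].append(k)  # a collision just extends the chain

for k in (12, 15, 22, 25, 37):
    insert(k)

print(table[2])  # [12, 22, 37]
print(table[0])  # [15, 25]
```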
Open Addressing:
When a collision occurs, the system probes for an alternative
empty slot in the hash table itself.
Types of Open Addressing:
Linear Probing: If slot h(k) is occupied, it tries h(k)+1,
h(k)+2, and so on (wrapping around the table), until an
empty slot is found.
Example: Using the same hash function and keys as
above, but with linear probing:
h(12) = 2. Key 12 is in slot 2.
h(15) = 0. Key 15 is in slot 0.
h(22) = 2. Slot 2 is occupied. Try (2+1)%5 = 3.
Slot 3 is empty. Key 22 is in slot 3.
h(25) = 0. Slot 0 is occupied. Try (0+1)%5 = 1.
Slot 1 is empty. Key 25 is in slot 1.
h(37) = 2. Slots 2 and 3 are occupied. Try (2+2)%5 = 4.
Slot 4 is empty. Key 37 is in slot 4.
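The same insertions with linear probing can be run as a short sketch:

```python
# Linear probing: on a collision, step forward one slot at a time
# (wrapping around) until an empty slot is found.

SIZE = 5
table = [None] * SIZE

def insert(k):
    i = k % SIZE
    while table[i] is not None:  # probe h(k), h(k)+1, h(k)+2, ...
        i = (i + 1) % SIZE
    table[i] = k

for k in (12, 15, 22, 25, 37):
    insert(k)

print(table)  # [15, 25, 12, 22, 37]
```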
Quadratic Probing: If h(k) is occupied, it tries h(k)+1^2,
h(k)+2^2, h(k)+3^2, and so on.
Double Hashing: Uses a second hash function h2(k) to
determine the step size for probing if the initial slot h(k) is
occupied. The probe sequence is h(k), h(k) + h2(k), h(k) +
2*h2(k), and so on, all taken modulo the table size.
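A short double-hashing sketch with two toy hash functions (the table size and both functions are illustrative; h2 is built so it is never zero, which guarantees the probe sequence actually advances):

```python
# Double hashing: keys 3, 14, 25 all collide at slot 3 under h1,
# but each key probes with its own step size h2(k).

SIZE = 11  # a prime table size helps probe sequences cover the table

def h1(k):
    return k % SIZE

def h2(k):
    return 1 + (k % (SIZE - 1))  # in 1..10, never zero

def insert(table, k):
    i = h1(k)
    while table[i] is not None:
        i = (i + h2(k)) % SIZE  # probe h1, h1+h2, h1+2*h2, ...
    table[i] = k

table = [None] * SIZE
for k in (3, 14, 25):  # all hash to slot 3 under h1
    insert(table, k)
print(table)
```

Here 3 lands in slot 3; 14 probes with step h2(14) = 5 and lands in slot 8; 25 probes with step h2(25) = 6 and lands in slot 9.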
Extendible Hashing: Extendible Hashing is a dynamic hashing
technique used in Database Management Systems (DBMS) that
handles growing and shrinking datasets by splitting or merging
buckets as needed, doubling or halving a directory of bucket
pointers instead of rehashing the entire file.
Key Components:
Directory: An array of pointers to buckets. Each entry in the
directory corresponds to a possible hash value prefix. The size
of the directory can double or halve dynamically.
Buckets: Storage units that hold the actual data records. Each
bucket has a fixed capacity.
Global Depth (GD): The number of bits used from the hash
value to index into the directory.
Local Depth (LD): The number of bits used from the hash value
to distinguish records within a specific bucket. If a bucket's local
depth is less than the global depth, multiple directory entries
point to it.
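A compact sketch of these components. The bucket capacity, class names, and the choice of low-order hash bits for directory indexing are illustrative assumptions; real implementations differ in detail:

```python
# Extendible hashing: the directory is indexed by the low-order
# `global_depth` bits of the key's hash. A full bucket splits; the
# directory doubles only when the splitting bucket's local depth
# already equals the global depth.

BUCKET_CAPACITY = 2

class Bucket:
    def __init__(self, depth):
        self.local_depth = depth
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _dir_index(self, key):
        return key & ((1 << self.global_depth) - 1)  # low-order bits

    def insert(self, key):
        b = self.directory[self._dir_index(key)]
        if len(b.keys) < BUCKET_CAPACITY:
            b.keys.append(key)
            return
        if b.local_depth == self.global_depth:
            self.directory = self.directory * 2  # double the directory
            self.global_depth += 1
        # split the full bucket and redistribute its keys by the new bit
        b.local_depth += 1
        new_b = Bucket(b.local_depth)
        bit = 1 << (b.local_depth - 1)
        for i, slot in enumerate(self.directory):
            if slot is b and i & bit:
                self.directory[i] = new_b
        for k in b.keys[:]:
            if k & bit:
                b.keys.remove(k)
                new_b.keys.append(k)
        self.insert(key)  # retry; may trigger another split

    def search(self, key):
        return key in self.directory[self._dir_index(key)].keys

ht = ExtendibleHash()
for k in (4, 6, 1, 3, 9, 13):
    ht.insert(k)
print(ht.global_depth, all(ht.search(k) for k in (4, 6, 1, 3, 9, 13)))
```

When a bucket with local depth less than the global depth splits, only the directory pointers are updated; the directory itself doubles only in the local depth == global depth case, which is the property that keeps growth cheap.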