UNIT 5 dbms
1. Registers
• Located within the CPU.
• Smallest and fastest type of storage.
• Used to hold data currently being processed.
7. Magnetic Tapes
• Sequential access storage, unlike disks which are random access.
• Often used for backups and archiving due to their high capacity and low cost.
• Much slower access times compared to magnetic disks.
Types of Storage:
1. Primary Storage: Includes registers, cache memory, and main memory (RAM). It's
the main memory where the operating system, application programs, and data in current
use are kept for quick access by the computer's processor.
2. Secondary Storage: Encompasses data storage devices like HDDs, SSDs, CDs, and
USB drives. It is non-volatile and retains data even when the computer is turned off.
3. Tertiary Storage or Off-line Storage: Often involves magnetic tape systems or
optical disk archives. This is slower than secondary storage and is used for data
archiving and backup.
4. Quaternary Storage: Refers to methods like cloud storage or other remote storage
techniques where data is stored in remote servers and is fetched over the internet or
other networks.
Note: The storage hierarchy in a DBMS serves as a framework for balancing the trade-offs
between cost, speed, and capacity. By strategically using different levels of this hierarchy, a
DBMS can ensure fast access times for frequently used data while also providing large-capacity
storage for data that's accessed less frequently.
File organization and indexing are fundamental concepts in database management systems (DBMS)
that deal with efficient storage, retrieval, and management of data.
File Organization
File organization refers to the arrangement of data on storage devices. The method chosen can
have a profound effect on the efficiency of various database operations.
Common methods of file organization include sequential, hash, and indexed organization.
Sequential File Organization
• Ordered Records: Records in a sequential file are stored based on a key field.
• Continuous Memory Allocation: The records are stored in contiguous memory locations.
• No Direct Access: To access a record, you have to traverse from the first record until you find
the desired one.
Advantages of Sequential File Organization
• Simplicity: The design and logic behind sequential file organization are straightforward.
• Efficient for Batch Processing: Since records are stored in sequence, sequential processing
(like batch updates) can be done efficiently.
• Less Overhead: There's no need for complex algorithms or mechanisms to store records.
Disadvantages of Sequential File Organization
• Inefficient for Random Access: To find a specific record, you may have to read through
many records before reaching the desired one. This makes random access slow.
• Insertion and Deletion: Inserting or deleting a record (other than at the end) can be time-
consuming, since records may need to be shifted to maintain the order.
• Redundancy Issues: There is a risk of redundancy if checks are not made before inserting
records; for example, a record with the same key might be added twice.
Practical Application: Suppose you have a file of students ordered by their roll number:
Roll No | Name
--------|--------
1 | Madhu
2 | Naveen
4 | Shivaji
5 | Durga
In a sequential file, if you wanted to add a student with roll number 6, you would append them
at the end. However, if you wanted to insert a student with a roll number 3 which is between 2
and 4, you would need to shift all subsequent records to maintain the sequence, which can be
time-consuming.
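The shifting cost described above can be sketched in a few lines of Python, modeling the file as an in-memory list of (roll number, name) tuples. The inserted name "Ravi" is illustrative, not from the notes:

```python
# Sketch: inserting into a sequential (sorted) file forces a shift of
# every record that follows the insertion point.
records = [(1, "Madhu"), (2, "Naveen"), (4, "Shivaji"), (5, "Durga")]

def sequential_insert(records, key, name):
    """Insert (key, name) while keeping records ordered by roll number."""
    pos = 0
    while pos < len(records) and records[pos][0] < key:
        pos += 1
    records.insert(pos, (key, name))   # list.insert shifts the tail right
    shifted = len(records) - pos - 1   # how many records had to move
    return pos, shifted

pos, shifted = sequential_insert(records, 3, "Ravi")
```

Appending roll number 6 would shift nothing, but inserting 3 in the middle moves every later record, which is exactly why mid-file insertion is expensive.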
Hash File Organization
• Hash Function: A hash function converts a record's key value into an address.
• Buckets: A bucket typically stores one or more records. A hash function might map multiple
keys to the same bucket.
• No Ordering of Records: Records are not stored in any specific logical order.
Advantages of Hash File Organization
• Rapid Access: If the hash function is efficient and collisions are minimal, retrieval of a
record is very quick.
• Uniform Distribution: A good hash function will distribute records uniformly across all
buckets.
• Efficient Search: Searching becomes efficient, as only a specific bucket needs to be searched
rather than the entire file.
Disadvantages of Hash File Organization
• Collisions: A collision occurs when two different keys hash to the same bucket. Handling
collisions can be tricky and may affect access time.
• Dependency on the Hash Function: The efficiency of hash file organization depends heavily
on the hash function used. A poor hash function can lead to clustering and inefficient use
of space.
• Dynamic Growth and Shrinking: If the number of records grows or shrinks significantly,
rehashing may be needed, which is an expensive operation.
Practical Application: Imagine a database that holds information about books.
Each book has a unique ISBN number. A hash function takes an ISBN and returns an address.
When you want to find a particular book's details, you hash the ISBN, which directs you to a
particular bucket. If two books' ISBNs hash to the same value, you handle that collision, maybe
by placing the new record in a linked list associated with that bucket.
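The book example above can be sketched with chained buckets, where each bucket is simply a list of records. The bucket count and the ISBN values are made up for illustration:

```python
# Sketch: hash file organization with chained overflow.
# Each bucket is a list; colliding records are chained in the same list.
NUM_BUCKETS = 4  # illustrative; real systems size buckets to disk pages

def bucket_for(isbn):
    return isbn % NUM_BUCKETS  # the hash function: key -> bucket number

buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(isbn, title):
    buckets[bucket_for(isbn)].append((isbn, title))

def lookup(isbn):
    # Only one bucket's chain is scanned, never the whole file.
    for key, title in buckets[bucket_for(isbn)]:
        if key == isbn:
            return title
    return None

insert(101, "Book A")
insert(105, "Book B")  # 105 % 4 == 101 % 4: a collision, handled by chaining
```

Because 101 and 105 hash to the same bucket, the second record is chained behind the first, mirroring the linked-list collision handling described above.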
Indexed File Organization
• Primary Data File: The actual database file where records are stored.
• Index: An auxiliary file that contains key values and pointers to the corresponding records in
the data file.
• Multi-level Index: Sometimes, if the index becomes large, a secondary (or even tertiary) index
can be created on the primary index to expedite searching further.
Advantages of Indexed File Organization
• Quick Random Access: Direct access to records is possible using the index.
• Flexible Searches: Because an index allows jumping directly to records, different types of
search operations (like range queries) can be supported efficiently.
• Ordered Access: If the primary file is ordered, indexed file organization can also support
efficient sequential access.
Disadvantages of Indexed File Organization
• Overhead of Maintaining the Index: Every time a record is added, deleted, or updated, the
index must be updated as well, which introduces overhead.
• Space Overhead: Indexes consume additional storage space.
• Complexity: Maintaining multiple levels of indexes adds complexity to design and
implementation.
Practical Application: Consider a database that holds information about students, where each
student has a unique student ID. The main file contains the detailed record for each student. A
separate index file contains student IDs and pointers to the locations of the detailed records in
the main file. To find a specific student's details, you first search the index to obtain the pointer
and then use that pointer to fetch the record directly from the main file.
Note: Indexed file organization strikes a balance between the direct and sequential access of
records. While it offers faster search capabilities, there's a trade-off in terms of space and
maintenance overhead. It is ideal for databases where search operations are predominant, and
where the overhead of maintaining the index can be justified by the improved access times.
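The student lookup above can be sketched as follows, with a Python dict standing in for the index file and list positions standing in for disk pointers (the IDs and names are illustrative):

```python
# Sketch: an index maps student IDs to positions ("pointers") in the
# main data file, so a lookup avoids scanning every detailed record.
data_file = [
    {"id": 501, "name": "Madhu",  "dept": "CSE"},
    {"id": 502, "name": "Naveen", "dept": "ECE"},
    {"id": 503, "name": "Durga",  "dept": "EEE"},
]

# Build the index: search key -> record position in the data file.
index = {rec["id"]: pos for pos, rec in enumerate(data_file)}

def fetch(student_id):
    pos = index.get(student_id)   # step 1: search the index
    if pos is None:
        return None
    return data_file[pos]         # step 2: follow the pointer to the record
```

Updating a record's position would require updating the index entry too, which is the maintenance overhead the note describes.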
Indexing
Indexing involves creating an auxiliary structure (an index) to improve data retrieval times.
Just like the index in the back of a book, a database index provides pointers to the locations of
records.
Structure of Index
We can create indices using some columns of the database.
|-------------|----------------|
| Search Key | Data Reference |
|-------------|----------------|
• The search key is the first column of the index; it holds a copy of the table's candidate key
or primary key. These key values are stored in sorted order so that the associated data can
be accessed quickly.
• The data reference is the second column of the index. It holds a set of pointers to the disk
blocks where the records with those key values can be found.
Types of Indexes:
1. Single-level Index: A single index table that contains pointers to the actual data
records.
2. Multi-level Index: An index of indexes. This hierarchical approach reduces the
number of accesses (disk I/O operations) required to find an entry.
3. Dense and Sparse Indexes:
o In a dense index, there's an index entry for every search key value in the
database.
o In a sparse index, there are fewer index entries. One entry might point to several
records.
4. Primary and Secondary Indexes:
o A primary index is an ordered file whose records are of fixed length with two
fields. The first field is the same as the primary key, and the second field is a
pointer to the data block. There's a one-to-one relationship between the number
of entries in the index and the number of records in the main file.
o A secondary index provides a secondary means of accessing data. For each
secondary key value, the index points to all the records with that key value.
5. Clustered vs. Non-clustered Index:
o In a clustered index, the rows of data in the table are stored on disk in the same
order as the index. There can only be one clustered index per table.
o In a non-clustered index, the order of rows does not match the index's order.
You can have multiple non-clustered indexes.
6. Bitmap Index: Used mainly for data warehousing setups, a bitmap index uses bit
arrays (bitmaps) and usually involves columns that have a limited number of distinct
values.
7. B-trees and B+ trees: Balanced tree structures that ensure logarithmic access time. B+
trees are particularly popular in DBMS for their efficiency in disk I/O operations.
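A bitmap index (type 6 above) can be sketched with plain bit lists: one bitmap per distinct column value, with a 1 wherever a row holds that value. The column values here are illustrative:

```python
# Sketch: a bitmap index on a low-cardinality column (e.g. a "grade" column).
rows = ["A", "B", "A", "C", "B", "A"]  # illustrative column values, one per row

# One bitmap per distinct value; bit i is 1 iff row i holds that value.
bitmaps = {}
for i, value in enumerate(rows):
    bitmaps.setdefault(value, [0] * len(rows))[i] = 1

# A query like grade = 'A' just reads off the set bits; multi-condition
# queries combine bitmaps with bitwise AND/OR.
matches_a = [i for i, bit in enumerate(bitmaps["A"]) if bit]
```

With few distinct values the bitmaps stay small, which is why this structure suits data-warehouse columns rather than unique keys.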
Benefits of Indexing:
• Faster search and retrieval times for database operations.
Drawbacks of Indexing:
• Overhead for insert, update, and delete operations, as indexes need to be maintained.
• Additional storage requirements for the index structures.
If we create a clustered index on `BookID`, the physical order of the records on disk is
rearranged into ascending order of `BookID`. When you then look up a book by its ID, the
DBMS can locate the data quickly because the rows are stored in `BookID` order.
A primary index is an ordered file whose records are of fixed length with two fields. The first
field is the same as the primary key of the data file, and the second field is a pointer to the data
block where that specific key can be found.
The primary index can be classified into two types
1. Dense Index: In this, an index entry appears for every search key value in the data file.
2. Sparse (or Non-Dense) Index: Here, index records are created only for some of the
search key values. A sparse index reduces the size of the index file.
Assume that each block of our storage can hold two records.
In the dense primary index, there's an entry for every key value, whereas in the sparse primary
index, only the first key value of each block (the block's anchor record) gets an entry in the
index. In this way, sparse indexes are more compact but might require more sequential block
reads if a search key isn't an anchor for a block.
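The dense/sparse trade-off can be sketched directly, using the assumption above of two records per block and illustrative key values:

```python
# Sketch: dense vs sparse primary index over blocks of two records each.
keys = [123, 234, 345, 456]            # sorted search keys in the data file
BLOCK_SIZE = 2
blocks = [keys[i:i + BLOCK_SIZE] for i in range(0, len(keys), BLOCK_SIZE)]

# Dense index: one entry per search key value.
dense = {k: i for i, block in enumerate(blocks) for k in block}

# Sparse index: one entry per block, keyed by the block's anchor record.
sparse = {block[0]: i for i, block in enumerate(blocks)}

def sparse_lookup(key):
    """Find the block whose anchor could cover `key`, then scan that block."""
    anchor = max((a for a in sparse if a <= key), default=None)
    if anchor is None:
        return None
    block_no = sparse[anchor]
    return block_no if key in blocks[block_no] else None
```

The dense index has four entries, the sparse index only two, but a sparse lookup for a non-anchor key (like 234) still has to scan inside its block.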
Characteristics of a Secondary Index:
1. Provides an alternative path to access the data.
2. Can be either dense or sparse.
3. Allows for non-unique values.
4. Does not guarantee the order of records in the data file.
5. Unlike a primary (often clustered) index, a secondary index is typically a non-clustered
index. This means the physical order of rows in a table is not the same as the index
order.
For example, with a secondary index on students' ages, each age value holds a direct pointer
to the corresponding record; if another student with an age of 22 is added, the index simply
gains one more pointer for that age value.
Hash-Based Indexing
In hash-based indexing, a hash function is used to convert a key into a hash code. This hash
code serves as an index where the value associated with that key is stored. The goal is to
distribute the keys uniformly across an array, so that access time is, on average, constant.
Let's break down some of these elements to further understand how hash-based indexing works
in practice:
Buckets
In hash-based indexing, the data space is divided into a fixed number of slots known as
"buckets." A bucket usually contains a single page (also known as a block), but it may have
additional pages linked in a chain if the primary page becomes full. This is known as overflow.
Hash Function
The hash function is a mapping function that takes the search key as an input and returns the
bucket number where the record should be located. Hash functions aim to distribute records
uniformly across buckets to minimize the number of collisions (two different keys hashing to
the same bucket).
Insert Operations
When a new record is inserted into the dataset, its search key is hashed to find the appropriate
bucket. If the primary page of the bucket is full, an additional overflow page is allocated and
linked to the primary page. The new record is then stored on this overflow page.
Search Operations
To find a record with a specific search key, the hash function is applied to the search key to
identify the bucket. All pages (primary and overflow) in that bucket are then examined to find
the desired record.
Limitations
Hash-based indexing is not suitable for range queries or when the search key is not known. In
such cases, a full scan of all pages is required, which is resource-intensive.
Example: Hash Function: H(x) = (ASCII value of the first letter of the name) mod 3
• Alice: 65 mod 3 = 2
• Bob: 66 mod 3 = 0
• Carol: 67 mod 3 = 1
Buckets:
Bucket 0: Bob
Bucket 1: Carol
Bucket 2: Alice
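The bucket assignments above can be verified by computing the same hash function directly:

```python
# Sketch: the worked example above, computed with the same hash function.
def h(name):
    return ord(name[0]) % 3  # ASCII value of the first letter, mod 3

buckets = {0: [], 1: [], 2: []}
for name in ["Alice", "Bob", "Carol"]:
    buckets[h(name)].append(name)
# 'A' = 65 -> bucket 2, 'B' = 66 -> bucket 0, 'C' = 67 -> bucket 1
```

Note that any other name starting with "A" would collide with Alice in bucket 2, which is where the overflow handling described earlier comes in.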
Tree-based Indexing
The most commonly used tree-based index structure is the B-Tree, and its variations like B+
Trees and B* Trees. In tree-based indexing, data is organized into a tree-like structure. Each
node represents a range of key values, and leaf nodes contain the actual data or pointers to the
data.
• Sorted Data: They maintain data in sorted order, making it easier to perform range
queries.
• Balanced Tree: B-Trees and their variants are balanced, meaning the path from the
root node to any leaf node is of the same length. This balancing ensures that data
retrieval times are consistently fast, even as the dataset grows.
• Multi-level Index: Tree-based indexes can be multi-level, which helps to minimize the
number of disk I/Os required to find an item.
• Dynamic Nature: B-Trees are dynamic, meaning they're good at inserting and deleting
records without requiring full reorganization.
• Versatility: They are useful for both exact-match and range queries.
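Why sorted order helps range queries can be sketched with two binary searches over a sorted key sequence, which is essentially what a tree descent to the two boundary leaves accomplishes. The key values are illustrative:

```python
import bisect

# Sketch: a range query over sorted keys, as a leaf-level key sequence
# of a tree index would store them.
keys = [5, 6, 7, 10, 12, 17, 20, 30]

def range_query(lo, hi):
    """Return all keys in [lo, hi] using two binary searches (O(log n)
    to find the boundaries, then a contiguous scan)."""
    left = bisect.bisect_left(keys, lo)
    right = bisect.bisect_right(keys, hi)
    return keys[left:right]
```

A hash index cannot do this, because hashing destroys the key ordering; that is the versatility advantage the last bullet refers to.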
| ID | Name    |
|----|---------|
| 1  | Abhi    |
| 2  | Bharath |
| 3  | Chinni  |
| 4  | Devid   |
[1, 3]
/ \
[1] [3, 4]
/ \ / \
1 2 3 4
In the tree, navigating from the root to the leaf nodes will lead us to the desired data record.
Pros of Tree-based Indexing:
• Efficient for range queries.
• Good for both exact and partial matches.
• Keeps data sorted.
File Organizations
1. Objective: To physically store records on storage media in an organized manner.
2. Methodologies: Includes sequential, random (or direct), and hashed file organizations,
among others.
3. Implications:
• Sequential organization is suitable for batch processing but inefficient for random
access.
• Direct or random organization allows fast access but can be inefficient in terms of
storage space.
• Hashed file organization is excellent for equality searches but not for range-based
queries.
4. Real-world Examples: Ledger systems, log files, archival systems.
Indexing
1. Objective: To create a data structure that improves the speed of data retrieval operations.
2. Methodologies: Includes clustered, non-clustered, primary, secondary, composite, bitmap,
and hash indexes, among others.
3. Implications:
• Clustered indexes are excellent for range-based queries but slow down insert/update
operations.
• Non-clustered indexes improve data retrieval speed but can take up additional storage.
• Bitmap indexes are useful for low-cardinality columns.
4. Real-world Examples: Search engines, e-commerce websites, any application that requires
fast data retrieval.
Performance Tuning
1. Objective: To optimize the resources used by the database for efficient transaction
processing.
2. Methodologies: Query optimization, index tuning, denormalization, database sharding,
caching, partitioning, etc.
3. Implications:
• Query optimization can dramatically reduce the resources needed for query processing.
• Proper indexing can mitigate the need for full-table scans.
• Denormalization and caching can improve read operations but may compromise data
integrity or consistency.
4. Real-world Examples: Financial trading systems, real-time analytics, high-performance
computing.
Key Differences
1. Granularity:
• File organization is about how data is stored at the file level.
• Indexing is about improving data access at the table or even column level.
• Performance tuning is a broad set of activities that can encompass both file organization
and indexing among many other techniques.
2. Resource Usage:
• File organization techniques aim to use disk space efficiently.
• Indexing aims to use both disk space and memory for fast data retrieval.
• Performance tuning aims to optimize all system resources including CPU, memory,
disk, and network bandwidth.
3. Query Efficiency:
• File organization generally impacts how efficiently data can be read or written to disk.
• Indexing significantly impacts how efficiently queries can retrieve data.
• Performance tuning seeks to optimize both read and write operations through a variety
of methods.
4. Complexity:
• File organization is relatively straightforward.
• Indexing can become complex depending on the types of indexes and the nature of the
queries.
• Performance tuning is usually the most complex as it involves a holistic understanding
of hardware, software, data, and queries.
ISAM (Indexed Sequential Access Method)
ISAM stores records sequentially, sorted on a key field, while an index provides direct access
into that sequence.
Components of ISAM
1. Data File: The records in the data file are stored in sorted order based on a key field.
2. Index File: The index file contains index entries, which are pointers to blocks (or
records) in the data file. These index entries are also sorted based on the key.
3. Overflow Area: This is used to store new records that are inserted but cannot fit into
the primary data area. The overflow area is also organized in a sorted manner.
Example of ISAM
Let's consider a simplified library database where we have a file containing book information
sorted by ISBN numbers.
Data File
| ISBN | Title |
|-------|--------------------|
| 123 | Book A |
| 234 | Book B |
| 345 | Book C |
| 456 | Book D |
Index File
| ISBN | Block |
|-------|-------|
| 123 | 1 |
| 234 | 2 |
| 345 | 3 |
| 456 | 4 |
Overflow Area
(Initially Empty)
Searching for the book with ISBN 345
• Search the index for ISBN 345, and find that it is located at block 3.
• Go to block 3 in the data file to retrieve the record for ISBN 345 ("Book C").
Inserting a new book with ISBN 133
• The new book would ideally go into block position 1 (between ISBN 123 and 234).
• Since the block is full, the new record with ISBN 133 ("New Book") will go into the
overflow area.
• An index entry for ISBN 133 pointing to the overflow area is added.
Updated Overflow Area
| ISBN | Title |
|-------|-----------|
| 133 | New Book |
ISAM provides a reasonable compromise between efficient use of disk space and fast retrieval
time. However, it's worth noting that modern databases often use more sophisticated indexing
methods like B-trees and B+ trees, which generalize and extend the capabilities of ISAM.
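The library walkthrough above can be sketched as a toy ISAM structure, with one record per block (so every block is full) and a kept-sorted overflow list. For simplicity this sketch scans the overflow area on a miss rather than adding an index entry for it:

```python
# Sketch: a toy ISAM structure mirroring the library example above.
blocks = {1: (123, "Book A"), 2: (234, "Book B"),
          3: (345, "Book C"), 4: (456, "Book D")}
index = {123: 1, 234: 2, 345: 3, 456: 4}   # ISBN -> block number
overflow = []                              # overflow area, kept sorted

def isam_insert(isbn, title):
    """Every primary block is full, so new records go to the overflow area."""
    pos = sum(1 for k, _ in overflow if k < isbn)  # keep overflow sorted
    overflow.insert(pos, (isbn, title))

def isam_search(isbn):
    block = index.get(isbn)
    if block is not None:                  # hit in the primary area
        return blocks[block][1]
    for k, title in overflow:              # otherwise scan the overflow area
        if k == isbn:
            return title
    return None

isam_insert(133, "New Book")
```

As the overflow area grows, searches degrade, which is why ISAM files are periodically reorganized and why B-trees, which rebalance on the fly, superseded it.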
B+ Trees
A B+ Tree is a type of self-balancing tree structure commonly used in databases and file systems
to maintain sorted data in a way that allows for efficient insertion, deletion, and search
operations. B+ Trees are an extension of B-Trees but differ mainly in the way they handle leaf
nodes, which contain all the key values and point to the actual records.
Operations on B+ Trees:
1. Search: Starts at the root and traverses down the tree, guided by the key values in each
node, until it reaches the appropriate leaf node.
2. Insert: Inserts a new key-value pair and then reorganizes the tree as needed to maintain
its properties.
3. Delete: Removes a key-value pair and then reorganizes the tree, again to maintain its
properties.
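The search operation can be sketched on a tiny two-level B+ Tree, shaped like the tree the insertion walkthrough below ends with (internal keys route the search; only leaves hold the actual key values):

```python
import bisect

# Sketch: search descent on a two-level B+ Tree. The internal node holds
# separator keys; children are leaf key lists.
root = {"keys": [10, 17, 20],
        "children": [[5, 6, 7], [10, 12], [17], [20, 30]]}

def bplus_search(key):
    """Pick the child whose key range covers `key`, then scan that leaf.
    bisect_right sends keys >= a separator to the child on its right."""
    child = bisect.bisect_right(root["keys"], key)
    return key in root["children"][child]
```

A real B+ Tree repeats this routing step at every internal level, so the cost is one node per level, i.e. logarithmic in the number of keys.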
Insertion
1. Insert 10:
- The tree is empty, so 10 becomes the root.
[10]
2. Insert 20:
- There's room in the leaf node for 20.
[10, 20]
3. Insert 5:
- Still room in the leaf node for 5.
[5, 10, 20]
4. Insert 6:
- The leaf node is full; split it and promote the smallest key in the right node to be the new root.
[10]
/ \
[5, 6] [10, 20]
5. Insert 12:
- Insert into the appropriate leaf node.
      [10]
     /    \
[5, 6] [10, 12, 20]
6. Insert 30:
- Need to split the right leaf node, promote 20.
      [10, 20]
    /    |    \
[5, 6] [10, 12] [20, 30]
7. Insert 7:
- Insert into the appropriate leaf node.
      [10, 20]
    /    |    \
[5, 6, 7] [10, 12] [20, 30]
8. Insert 17:
- Insert into the appropriate leaf node, which then becomes full:
      [10, 20]
    /    |    \
[5, 6, 7] [10, 12] [17, 20, 30]
- Split the full leaf and promote 17 into the internal node:
      [10, 17, 20]
   /     |     |    \
[5, 6, 7] [10, 12] [17] [20, 30]
Deletion
Delete 10:
1. Remove the key 10 from its leaf node:
      [10, 17, 20]
   /     |     |    \
[5, 6, 7] [12] [17] [20, 30]
2. Since the key 10 is also present in the internal node, we replace it with its in-order
predecessor (or successor, depending on the design), which is 7:
      [7, 17, 20]
   /    |    |    \
[5, 6] [7, 12] [17] [20, 30]
And that's how B+ Trees work for search, insert, and delete operations. B+ Trees are dynamic,
adapting efficiently as keys are added or removed, which makes them quite useful for databases
where high-speed data retrieval is crucial.