Module II

This document covers data storage and indexing, focusing on file organization methods such as heap, sequential, and hash file organizations, and their respective advantages and disadvantages. It also discusses primary and secondary index structures, their use cases, and the impact of indexing on data retrieval performance. Key index types include dense and sparse indexes, along with various structures like B-Trees and dynamic hashing techniques.

MODULE II

Data Storage and Indexes
File Organizations, Primary and Secondary Index Structures
CONTENTS

• Introduction
• File Organization
• Index Types (Primary, Secondary)
• Use Cases
Introduction

🧠 Why Data Storage & Indexing Matter
• Efficient data storage ensures optimal use of memory and disk.
• Fast data retrieval is critical for performance in large-scale databases.

🔍 Role of Indexing
• Indexes accelerate search operations.
• They reduce the need to scan the entire dataset.
• They are crucial for query optimization.
File Organization Overview

📂 What is File Organization?
• The method used to store records on disk.
• Determines how data is accessed, inserted, updated, and deleted.

⚙️ Why It Matters:
• Performance impact on:
  • 🔍 Search speed
  • ➕ Insertion efficiency
  • ❌ Deletion complexity
  • 🔄 Update cost
• Choice of file organization affects the efficiency of queries and maintenance tasks.

🧱 Common Types:
• Heap (Unordered)
• Sequential (Sorted)
• Hash-based
Heap File Organization

• 🔄 Unordered records: new records are placed wherever space is available.
• ➕ Fast insertions: no need to maintain order.
• 🔍 Slow searches: full scan required unless indexed.
• 🧹 Deletion leaves empty spaces that may need periodic cleanup (compaction).
“In heap file organization, records are stored in no specific order.
It’s efficient for inserting new data since there’s no need to sort or shift existing records.
However, searching is inefficient unless an index is in place, because the database may have to scan every single record to find what it needs.
This method is best for workloads with heavy insert operations and minimal searching.”
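The insert and search behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not a real storage engine: the `HeapFile` class, the `block_capacity` parameter, and the `id` field are all hypothetical names, and the sketch always appends to the last block rather than reusing space freed by deletions.

```python
class HeapFile:
    """Toy heap file: records sit in fixed-capacity blocks, in arrival order."""

    def __init__(self, block_capacity=4):
        self.block_capacity = block_capacity
        self.blocks = [[]]

    def insert(self, record):
        # Append to the last block; open a new block when it is full.
        # No sorting or shifting is needed, so inserts are cheap.
        if len(self.blocks[-1]) >= self.block_capacity:
            self.blocks.append([])
        self.blocks[-1].append(record)

    def search(self, key):
        # No ordering to exploit: read blocks one by one until the key turns up.
        blocks_read = 0
        for block in self.blocks:
            blocks_read += 1
            for record in block:
                if record["id"] == key:
                    return record, blocks_read
        return None, blocks_read


heap = HeapFile(block_capacity=2)
for emp_id in [42, 7, 19, 3, 88]:
    heap.insert({"id": emp_id})
```

Searching for a key that was inserted last, or one that is absent, touches every block, which is the full scan the slide warns about.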
Sequential File Organization

• 📈 Records stored in sorted order (typically by primary key).
• 🔍 Efficient for range queries and ordered data retrieval.
• ➕ Fast sequential access (e.g., retrieving top 10 records).
• ❌ Slow insertions/deletions: maintaining order requires shifting data or rewriting.
“Sequential file organization stores records in a sorted manner, usually based on a key like EmployeeID or Name.
This is ideal when you need to process data in order, or handle range-based queries.
The trade-off? Inserting or deleting data can be expensive.
The system might need to shift multiple records or reorganize the file, which adds overhead.
It’s best used for systems with frequent read and range operations but infrequent updates.”
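A small Python sketch can make the trade-off concrete. It assumes an in-memory sorted list stands in for the file, and the `SequentialFile` class and its methods are hypothetical names chosen for illustration. `insert` reports how many existing records had to move to keep the order.

```python
import bisect


class SequentialFile:
    """Toy sequential file: keys are kept sorted at all times."""

    def __init__(self):
        self.keys = []

    def insert(self, key):
        # Binary search finds the slot quickly, but every later record
        # still has to shift one position to make room.
        pos = bisect.bisect_left(self.keys, key)
        self.keys.insert(pos, key)
        return len(self.keys) - 1 - pos  # number of records shifted

    def range_query(self, low, high):
        # Sorted order lets a range query read one contiguous slice.
        start = bisect.bisect_left(self.keys, low)
        end = bisect.bisect_right(self.keys, high)
        return self.keys[start:end]


seq = SequentialFile()
for k in [50, 10, 40, 20, 30]:
    seq.insert(k)
in_range = seq.range_query(20, 40)
shifted = seq.insert(5)  # smallest key: every existing record moves
```

Inserting a key smaller than all others shifts the whole file, while the range query only touches the matching slice.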
Hash File Organization

• 🔑 Uses a hash function to compute the storage location from a key.
  Example: hash(EmployeeID) → bucket number
• ⚡ Fast access for equality searches (e.g., WHERE ID = 123).
• ❌ Inefficient for range queries (no order preserved).
• 🪣 Data stored in buckets, with one or more records per bucket.
• 🚨 Hash collisions may occur; handled using overflow chains or open addressing.
• 🔄 Dynamic hashing (e.g., extendible hashing) can help grow with data.
“In hash file organization, a hash function is applied to a key field, like an employee ID, to determine where the record should go.
This method shines when it comes to equality lookups: it’s extremely fast. But there’s a downside: since data isn't stored in any particular order, range queries get no help from the hash and fall back to scanning.
Another challenge is collisions: multiple keys might hash to the same location.
To handle that, we use techniques like overflow buckets or chaining.
Systems can also use dynamic hashing methods like extendible hashing to expand automatically as more data is added, which keeps overflow chains short.”
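The bucket lookup and collision handling described above might look like the following sketch. It assumes integer keys, uses `key % number_of_buckets` as a stand-in hash function, and handles collisions by chaining inside each bucket; the `HashFile` class and the sample names are hypothetical.

```python
class HashFile:
    """Toy static hash file: a fixed number of buckets, chaining on collision."""

    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_for(self, key):
        # Assumed hash function for integer keys (illustration only).
        return key % len(self.buckets)

    def insert(self, key, record):
        # Colliding keys simply share the bucket's chain.
        self.buckets[self._bucket_for(key)].append((key, record))

    def lookup(self, key):
        # Equality search: hash once, then scan only that bucket.
        for k, record in self.buckets[self._bucket_for(key)]:
            if k == key:
                return record
        return None


hf = HashFile(num_buckets=4)
hf.insert(123, "Asha")
hf.insert(127, "Ben")   # 123 % 4 == 127 % 4 == 3, so these collide
hf.insert(8, "Chen")
```

A range query such as "all IDs between 120 and 130" would have to visit every bucket, because neighbouring keys land in unrelated buckets.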
Indexing Overview

📚 What is an Index?
• A data structure that speeds up data retrieval by providing quick lookup paths to records.
🎯 Why use Indexes?
• Avoid scanning entire files (full table scan).
• Improve performance for searches, joins, and sorting.
🔑 Types of Indexes:
• Primary Index: Based on the primary key, often sorted and unique.
• Secondary Index: Built on non-primary fields, can be non-unique.
📊 Index Structures commonly used:
• B-Trees / B+ Trees
• Hash Indexes
• Bitmap Indexes (for low-cardinality columns)
“Indexes in databases are like the index in a book: they help you find the exact page where information is located without flipping through every page. They drastically improve search speed by providing shortcuts.
Primary indexes are created on the key fields that uniquely identify records and usually correspond to how data is sorted on disk. Secondary indexes let you quickly search based on other attributes, even if the data isn’t stored in that order.
There are different data structures used for indexes, with B-Trees being the most popular because they keep data sorted and balanced for efficient search, insert, and delete operations.”
Primary Index
• Definition: An index built on the primary key of the table, which uniquely identifies each record.
• 📄 File Organization: The data file is usually sorted on this key.
• Types:
  • Sparse Index: Index entries point to blocks, not individual records (requires data sorted on the key).
  • Dense Index: Index entries for every record (required when data is unsorted, optional when it is sorted).
• 🔍 Advantages:
  • Fast access to records by primary key.
  • Enables efficient range queries due to sorted data.
• 🚧 Constraints:
  • Only one primary index per file (due to sorting requirement).

“The primary index is built on the primary key, which means the file itself is sorted on this key. This sorting allows for fast direct access and efficient range queries.
There are two main types: sparse and dense. Sparse indexes only have entries for some blocks (like the first record in each block), so they use less space but require scanning within a block. Dense indexes have entries for every record, giving very fast lookup but using more space.
Because the data must be sorted on the primary key, there can only be one primary index per file.”
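As a rough illustration of a sparse primary index, the sketch below keeps one index entry per block (the block's first key) over a small sorted file. The block contents and the `lookup` helper are made up for the example, and only keys are stored to keep it short.

```python
import bisect

# Sorted data file split into blocks of three records (keys only).
blocks = [[3, 7, 12], [15, 21, 28], [31, 40, 44]]

# Sparse index: one entry per block, holding that block's first key.
sparse_index = [block[0] for block in blocks]  # [3, 15, 31]


def lookup(key):
    """Return (block number, slot) for key, or None if it is absent."""
    # Step 1: find the last block whose first key is <= key.
    i = bisect.bisect_right(sparse_index, key) - 1
    if i < 0:
        return None  # key is smaller than everything in the file
    # Step 2: scan inside that single block.
    if key in blocks[i]:
        return i, blocks[i].index(key)
    return None
```

The index holds three entries for nine records, and each lookup reads exactly one data block; this only works because the file is sorted on the indexed key.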
Secondary Index

• 🔎 Definition: An index built on a non-primary key attribute (non-sorting key).
• File Organization: The data file is not sorted on the secondary index key.
• 🧩 Characteristics:
  • Always dense: contains an index entry for every record.
  • Supports multiple secondary indexes per table.
• 🔄 Use Cases:
  • Querying based on fields other than the primary key (e.g., searching by City or Department).
• ⚠️ Performance Considerations:
  • Can cause additional I/O cost (since data is unordered on this field).
  • Requires more storage and index maintenance.
“Secondary indexes are created on fields other than the primary key. Unlike primary indexes, the data file isn’t sorted on these fields, so the index must contain entries for every record, which is why they are always dense.
You can have many secondary indexes on a table, allowing flexible query capabilities on different attributes. The trade-off is that these indexes can increase storage requirements and slow down insertions and deletions because the index must be updated.
Secondary indexes are essential when you want to search or filter on non-primary key fields efficiently.”
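A secondary index can be pictured as a mapping from an attribute value to the positions of all matching records. The sketch below assumes a tiny in-memory table sorted by `id` and builds a hypothetical `city_index` over the non-key `city` field; the sample rows are invented for illustration.

```python
from collections import defaultdict

# Data file sorted by the primary key "id", not by city.
employees = [
    {"id": 1, "city": "Pune"},
    {"id": 2, "city": "Delhi"},
    {"id": 3, "city": "Pune"},
    {"id": 4, "city": "Kochi"},
]

# Dense secondary index: every record contributes one entry,
# and a non-unique value maps to several record positions.
city_index = defaultdict(list)
for position, row in enumerate(employees):
    city_index[row["city"]].append(position)


def find_by_city(city):
    # Follow each stored position back to the data file.
    return [employees[pos] for pos in city_index.get(city, [])]


pune_ids = [row["id"] for row in find_by_city("Pune")]
```

The matching records for one city are scattered through the file, which is where the extra I/O cost of secondary indexes comes from.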
Dense vs. Sparse Index

Feature            Dense Index                        Sparse Index
Index Entry for    Every record                       Some records (usually one per block)
Space Overhead     High                               Low
Lookup Speed       Faster (direct access)             Slightly slower (needs block scan)
Suitable For       Unsorted data                      Sorted data
Maintenance Cost   Higher (more entries to update)    Lower
“Dense and sparse indexes are two strategies for indexing records.
• Dense indexes have an entry for every record. This means lookups are very fast since you can find exactly where the record is. But they take more space and are more expensive to maintain because every insert or delete affects the index.
• Sparse indexes only have entries for some records, usually the first record in each block. This reduces space and maintenance overhead but means you must scan within the block after locating it, making lookups slightly slower.
Sparse indexes only work well if the data file is sorted on the key.”
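To compare the two side by side, the following sketch builds both kinds of index over the same sorted file. The twelve sample records, `BLOCK_SIZE`, and the helper names are assumptions made for the example.

```python
# Twelve records sorted by key, stored four to a block.
records = [(k, f"row-{k}") for k in range(100, 112)]
BLOCK_SIZE = 4
blocks = [records[i:i + BLOCK_SIZE] for i in range(0, len(records), BLOCK_SIZE)]

# Dense index: one entry per record -> (block number, slot).
dense = {
    key: (b, s)
    for b, block in enumerate(blocks)
    for s, (key, _) in enumerate(block)
}

# Sparse index: one entry per block -> (first key in block, block number).
sparse = [(block[0][0], b) for b, block in enumerate(blocks)]


def dense_find(key):
    # Direct hit: the index already knows the exact slot.
    return dense.get(key)


def sparse_find(key):
    # Pick the last block whose first key is <= key, then scan inside it.
    candidates = [b for first, b in sparse if first <= key]
    if not candidates:
        return None
    b = candidates[-1]
    for s, (k, _) in enumerate(blocks[b]):
        if k == key:
            return b, s
    return None
```

Both lookups return the same location, but the dense index stores twelve entries while the sparse one stores three and pays for it with a scan inside the chosen block.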
Various Index Structures
Index Structures: Hashing, Dynamic Hashing, Multilevel, B & B+ Trees
1. Hash-Based Indexes

• Use hash functions to map keys directly to buckets.
• Fast for equality searches (e.g., WHERE key = value).
• Poor support for range queries.
• Fixed-size buckets can cause overflow.
2. Dynamic Hashing Techniques

• Extendible Hashing: An overflowing bucket is split, and the directory doubles when the split needs more address bits; supports growth.
• Linear Hashing: Buckets split gradually to handle overflow.
• Avoids costly full rehashing.
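The directory-doubling idea behind extendible hashing can be sketched as follows. This is a simplified in-memory illustration under stated assumptions: it uses the low-order bits of Python's built-in `hash` as the directory index, works best with small integer keys, and would loop forever if more keys than one bucket can hold shared an identical hash value. The `Bucket` and `ExtendibleHash` classes are hypothetical.

```python
class Bucket:
    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth  # number of hash bits this bucket uses
        self.capacity = capacity
        self.items = {}


class ExtendibleHash:
    def __init__(self, bucket_capacity=2):
        self.global_depth = 1
        self.capacity = bucket_capacity
        self.directory = [Bucket(1, bucket_capacity), Bucket(1, bucket_capacity)]

    def _slot(self, key):
        # Directory index = lowest global_depth bits of the hash.
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        return self.directory[self._slot(key)].items.get(key)

    def put(self, key, value):
        while True:
            bucket = self.directory[self._slot(key)]
            if key in bucket.items or len(bucket.items) < bucket.capacity:
                bucket.items[key] = value
                return
            self._split(bucket)  # make room, then retry

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Directory doubles; the new half points at the same buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, self.capacity)
        high_bit = 1 << (bucket.local_depth - 1)
        # Slots that pointed at the full bucket and have the new bit set
        # are redirected to the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling
        # Only the overflowing bucket's entries are rehashed.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            self.directory[self._slot(k)].items[k] = v


table = ExtendibleHash(bucket_capacity=2)
for k in range(16):
    table.put(k, k * k)
```

Each split touches a single bucket, which is how the scheme avoids rehashing the whole file as it grows.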
3. Multilevel Indexes

• Indexes built on top of indexes to reduce search time.
• Example: Two-level index where first level points to blocks of second-level index.
• Improves lookup speed by reducing disk I/O.
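The two-level example can be sketched like this. The key values and block numbers are invented for illustration, and `find_data_block` is a hypothetical helper; each inner list stands for one index block on disk.

```python
import bisect

# Second level: a sparse index over the data file, itself split into
# index blocks. Each entry is (first key in data block, data block number).
second_level = [
    [(10, 0), (20, 1), (30, 2)],
    [(40, 3), (50, 4), (60, 5)],
    [(70, 6), (80, 7), (90, 8)],
]

# First level: one entry per second-level block (that block's first key).
first_level = [block[0][0] for block in second_level]  # [10, 40, 70]


def find_data_block(key):
    """Return the data block that could hold key, or None if key is too small."""
    # Level 1: choose which second-level index block to read.
    i = bisect.bisect_right(first_level, key) - 1
    if i < 0:
        return None
    # Level 2: choose the data block inside that index block.
    index_block = second_level[i]
    keys = [k for k, _ in index_block]
    j = bisect.bisect_right(keys, key) - 1
    return index_block[j][1]
```

A lookup reads one first-level block and one second-level block instead of searching the whole nine-entry index, and the saving grows with the size of the file.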
4. B-Trees and B+ Trees

• Balanced tree structures ideal for databases and file systems.
• All leaves at the same depth; supports sorted data storage.
• B-Tree stores keys and records at all nodes.
• B+ Tree stores keys in internal nodes and actual records only in leaf nodes.
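A search through a B+ Tree can be sketched with a small hand-built tree. This example covers lookup only (no insertion or node splitting), and the `Leaf` and `Internal` classes, the keys, and the record labels are all hypothetical.

```python
import bisect


class Leaf:
    def __init__(self, keys, records):
        self.keys = keys
        self.records = records  # records live only in leaves


class Internal:
    def __init__(self, keys, children):
        # children[i] holds keys below keys[i];
        # children[-1] holds keys >= keys[-1].
        self.keys = keys
        self.children = children


def search(node, key):
    """Walk from the root to a leaf; return (record or None, levels descended)."""
    depth = 0
    while isinstance(node, Internal):
        # Internal nodes hold only routing keys, never records.
        node = node.children[bisect.bisect_right(node.keys, key)]
        depth += 1
    if key in node.keys:
        return node.records[node.keys.index(key)], depth
    return None, depth


root = Internal(
    [20, 40],
    [
        Leaf([5, 12], ["r5", "r12"]),
        Leaf([20, 33], ["r20", "r33"]),
        Leaf([40, 47], ["r40", "r47"]),
    ],
)
```

Every search descends the same number of levels because all leaves sit at the same depth, and even a key that appears in an internal node (such as 20) is resolved at a leaf.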
THANK YOU
