Database File Organisation Lecture

The document discusses different methods for physically organizing data on disk in a database management system, including heap files which store records in the order they are inserted, sequential or ordered files which store records based on a sorted field, and hash files which use a hash function to determine physical placement. It covers topics like efficient insertion, searching, and updating of records depending on the file organization method. The file organization method chosen can significantly impact database performance for retrieval and updating of data.

Advanced Databases
DBMS File Organisation
Dr David Hamill
Physical & Logical
• Physical (disk) storage contains all
the files in the database.
• Logical storage structures, such as
tablespaces, segments, extents,
and blocks, appear on the disk
but are not part of the dataset.
• The File Organisation defines
how records are mapped onto
disk blocks.
• There are four types of File
Organisation, these include….
Introduction
• We will focus on how data is physically
stored on secondary storage:
• Different physical organisations of data.
• How physical organisation is managed.

• Physical organisation of data can
significantly affect the performance of retrieving
information and updating a database.
Introduction
• Primary Storage: Fastest storage
medium (cache and main memory). Data
can be accessed directly by the CPU. Limited
capacity.

• Secondary Storage: Media such as
magnetic disks. Normally costs less and has
greater capacity. Slower access to data.
Data must be loaded into primary storage
before being operated on.

• Tertiary Storage: Used mainly for backup
and archival. Tapes/DVD-ROM etc.

Diagram (top to bottom): User, Application Program, DBMS, File Manager, Disk Manager, Stored DB
Introduction
• How can we effectively store large amounts of data on disk?
• Key question for database designers and database
administrators.
• Different options will be available with regards to how
the data can be organised on disk.
• File Organisation
• Data stored on disk will be organised as files of records.
• Each record is a collection of data values interpreted as
facts about entities, their attributes and relationships.
• Storage of records should make it possible to locate
them efficiently when needed.
File Organisation Types
• Heap (or Unordered)
• Records are placed on disk in no
particular order

• Ordered (or Sequential)
• Records are ordered by the value of a
specified field

• Hash
• Records are placed on disk according to
a hash function

A hash function is any function that can be used to map data of arbitrary
size to fixed-size values.
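
To make the definition concrete, here is a minimal Python sketch of a hash function that maps a key of any length to one of a fixed number of buckets. The function name `hash_to_bucket`, the multiplier 31 and the bucket count are illustrative assumptions, not something taken from the slides.

```python
def hash_to_bucket(key: str, num_buckets: int) -> int:
    """Map a key of arbitrary length to a bucket number in [0, num_buckets)."""
    total = 0
    for ch in key:
        # Fold each character into a running value; 31 is an arbitrary multiplier.
        total = (total * 31 + ord(ch)) % num_buckets
    return total


# The same key always maps to the same bucket, whatever its length.
short_bucket = hash_to_bucket("SG37", 8)
long_bucket = hash_to_bucket("a much longer key than the first one", 8)
```

Whatever the input size, the result is always one of the 8 bucket numbers, which is what lets a hash file turn a field value into a disk address.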
Decisions you must take…
• One very important design decision when creating a new
table is whether or not to create a clustered
index.

• A table that does not have a clustered index is referred
to as a HEAP and a table that has a clustered index is
referred to as a clustered table.

• A clustered table provides a few benefits over a heap,
such as controlling how the data is sorted and stored,
the ability to use the index to find rows quickly and the
ability to reorganise the data by rebuilding the clustered
index. Because a heap or a clustered index determines
the physical storage of your table data, there can only
be one of these per table.

• So, a table is either a heap or has exactly one clustered
index.
A clustered Indexed Table
• Data is stored based on the clustered
index key
• Data can be retrieved quickly based on
the clustered index key, if the query
uses the indexed columns
• Data pages are linked for faster
sequential access
• Additional time is needed to maintain
the clustered index based on INSERT,
UPDATE and DELETE activity
• A primary key is enforced by a unique index, which is
clustered by default (as in SQL Server; other DBMSs may differ).
You will also hear the term: Clustered File System
• Definition: Wikipedia (https://en.wikipedia.org/wiki/Clustered_file_system)
• A clustered file system is a file system which is shared by being simultaneously mounted on
multiple servers.
• A clustered file system leverages multiple physical storage servers which simultaneously
mount the file system so that it can be accessed and managed as one single logical
system
• If the cluster is just meant to provide redundancy when one node fails, each server can
operate autonomously and a clustered file system is not required. However, if the clusters
work collaboratively and handle more demanding tasks, a CFS may be necessary. The CFS
allows users to access the same files and data concurrently.
1. Heap File - Insertion
• One of the simplest and most basic types of
file organisation.
• Records are stored in the file in the order in
which they are inserted.
• New records are inserted in the last page of
the file; if there is insufficient space in the
last page, a new page is added to the file.
• This makes insertion very efficient: O(1)
complexity.
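
The insertion rule above can be sketched in a few lines of Python. This is a simplified in-memory model, not a real storage engine: the class name `HeapFile` and the page capacity of 3 records are assumptions made for illustration.

```python
PAGE_CAPACITY = 3  # records per page; an illustrative value


class HeapFile:
    """A heap file modelled as a list of pages, each page a list of records."""

    def __init__(self):
        self.pages = [[]]  # start with one empty page

    def insert(self, record):
        last = self.pages[-1]
        if len(last) >= PAGE_CAPACITY:
            # Insufficient space in the last page: add a new page to the file.
            last = []
            self.pages.append(last)
        last.append(record)  # only the last page is ever touched


heap = HeapFile()
for sno in ["SL21", "SG37", "SG14", "SL20"]:
    heap.insert({"Sno": sno})
```

Records stay in arrival order, and each insert touches only the last page, which is where the O(1) cost comes from.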
1. Heap Files - Searching
• Searching for records is very inefficient though: O(n)
complexity.

• Specific data cannot be retrieved quickly unless
there are also non-clustered indexes.

• Since there is no particular ordering with respect
to field values, a linear search must be performed
to access a record.

• A linear search involves reading pages from the
file until the required record is found.
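
A linear search over a heap file can be sketched as follows. The helper `linear_search` and the sample pages are hypothetical; the point is only to show that the number of pages read grows with the position of the record in the file.

```python
def linear_search(pages, field, value):
    """Scan pages in file order. Returns (record, pages_read); record is None if absent."""
    pages_read = 0
    for page in pages:
        pages_read += 1  # every page up to the match must be read
        for record in page:
            if record[field] == value:
                return record, pages_read
    return None, pages_read


pages = [
    [{"Sno": "SL21"}, {"Sno": "SG37"}],
    [{"Sno": "SG14"}, {"Sno": "SL20"}],
    [{"Sno": "SL66"}],
]
found, reads_for_hit = linear_search(pages, "Sno", "SL20")
missing, reads_for_miss = linear_search(pages, "Sno", "XX99")
```

A search for a value that is not in the file has to read every page, which is the O(n) worst case.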
1. Heap Files - Deletion
• Physical deletion leaves unused space in the block.
• A large amount of space is wasted if there are frequent deletions.
• Because freed slots are not reused, the file keeps growing, so available disk space and performance progressively
deteriorate as deletions occur.
• Heap files will require regular reorganisation to reclaim the unused space.
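
The effect of deletion and reorganisation can be illustrated with a small sketch. It assumes a deleted slot is simply marked free (`None`) and that reorganisation rewrites the whole file; the function names `delete` and `reorganise` are illustrative only.

```python
def delete(pages, field, value):
    """Mark matching slots as free (None); the page keeps its size."""
    for page in pages:
        for i, record in enumerate(page):
            if record is not None and record[field] == value:
                page[i] = None


def reorganise(pages, page_capacity):
    """Rewrite the file, packing the live records into as few pages as possible."""
    live = [r for page in pages for r in page if r is not None]
    return [live[i:i + page_capacity] for i in range(0, len(live), page_capacity)]


pages = [
    [{"Sno": "SL21"}, {"Sno": "SG37"}],
    [{"Sno": "SG14"}, {"Sno": "SL20"}],
]
delete(pages, "Sno", "SG37")
delete(pages, "Sno", "SG14")
compact = reorganise(pages, page_capacity=2)
```

After the two deletions the file still occupies two half-empty pages; only the reorganisation step brings it down to one.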
Advantages of Heap File Organization Method
1. This is a popular method when a huge number of records needs to be added to the
database. Since records are simply assigned to free data blocks, there is no need to
perform any special check for existing records when a new record is added. This
makes it easy to insert many records at once without disturbing
the file organization.
2. When there are few records and the file is small, it is faster to search and retrieve data
using heap file organization than sequential file organization.

Disadvantages of Heap File Organization Method
1. This method is inefficient if the file is big, as search, retrieve and update operations
take more time than with sequential file organization.
2. This method doesn't use storage space efficiently, so it requires periodic cleanup
and optimization to free the unused data blocks.

Source: https://beginnersbook.com/2022/06/heap-file-organization-in-dbms/
2. Ordered/Sequential Files
• The records in a file can be physically ordered based on the
values of one or more of the fields.

• Such a file organisation is called an ordered file (or
sequential file).

• The field(s) that the file is sorted on is called the ordering
field(s).
2. Ordered Files – Search…Order By
• Consider the following SQL query:
SELECT *
FROM Staff
ORDER BY Sno;

• If the tuples are already ordered according to the ordering field Sno it
should be possible to reduce the execution time for the query as no
sorting is necessary.

2. Ordered Files - Search


• Consider the following SQL query:

SELECT *
FROM Staff
WHERE Sno = 'SG37';

• In this case we can use a binary search to execute the query involving
a search condition based on the ordering field Sno
2. Ordered Files - Search
Binary Search Algorithm Example
SELECT *
FROM Staff
WHERE Sno = 'SL20';

Sno    Page
SG14   1
SG21   2
SG24   3
SG36   4
SG37   5
SL20   6
SL21   7
SL37   8
SL66   9

1. Initial mid-page is page 5. 'SG37' is not the
record we are searching for. The value being
searched for is greater than 'SG37' so we
discard the top half of the file.
2. Retrieve the mid-page of the bottom half of
the file, that is page 7. The value of the key
field 'SL21' is greater than 'SL20'.
3. Discard the bottom half of the search space.
4. Retrieve the mid-page of the remaining search
space, that is page 6, which contains the record
we are searching for.
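
The walk-through above can be reproduced with a short Python sketch. It assumes, as in the example, one Sno value per page and pages numbered from 1; the function name `binary_search_pages` is hypothetical.

```python
def binary_search_pages(keys_by_page, target):
    """keys_by_page[i] is the key stored on page i + 1 (pages are 1-based).

    Returns (page_number or None, list of pages read in order).
    """
    low, high = 1, len(keys_by_page)
    visited = []
    while low <= high:
        mid = (low + high) // 2
        visited.append(mid)
        key = keys_by_page[mid - 1]
        if key == target:
            return mid, visited
        if key < target:
            low = mid + 1   # discard the pages up to and including mid
        else:
            high = mid - 1  # discard mid and the pages after it
    return None, visited


staff_pages = ["SG14", "SG21", "SG24", "SG36", "SG37", "SL20", "SL21", "SL37", "SL66"]
page, visited = binary_search_pages(staff_pages, "SL20")
```

For 'SL20' the pages read are 5, 7 and then 6, matching the steps in the example: three page reads instead of the six a linear search would need.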
2. Ordered Files - Search
• In general, a binary search is more efficient than a linear
search: it reads O(log n) pages in the worst case, compared with O(n) for a linear search.
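
As a rough illustration of the difference, the sketch below computes worst-case page reads for both strategies on a file of n pages. It assumes one comparison per page read and ignores caching; the function name `worst_case_page_reads` is illustrative.

```python
def worst_case_page_reads(num_pages: int) -> tuple[int, int]:
    """Return (linear, binary) worst-case page reads for a file of num_pages pages."""
    linear = num_pages
    # For n >= 1, n.bit_length() equals floor(log2(n)) + 1,
    # the maximum number of probes a binary search makes over n items.
    binary = num_pages.bit_length()
    return linear, binary


small = worst_case_page_reads(9)          # the 9-page Staff example
large = worst_case_page_reads(1_000_000)  # a much larger file
```

For the 9-page example the worst cases are 9 and 4 page reads; for a million pages they are 1,000,000 and 20.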
2. Ordered Files – Insertions & Deletions
• To insert a record we must first find its correct position in the ordering. If there is
not sufficient space on that page then it is necessary to move one or
more records onto the next page. This may cause a cascading effect.

• One solution is to use an overflow or transaction file. Insertions are
added to the overflow file and periodically merged with the main file.
• Efficient for insertions
• Inefficient for retrievals

• When deleting a record we must reorganise the records to remove the free
slot.
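
The overflow-file approach can be sketched as below. This is a simplified in-memory model holding keys only; the class name `OrderedFile` and its methods are assumptions for illustration, and the periodic merge is done with a plain re-sort.

```python
import bisect


class OrderedFile:
    """Main file kept in ordering-field order, plus an unsorted overflow file."""

    def __init__(self, sorted_keys):
        self.main = list(sorted_keys)
        self.overflow = []  # new records, in arrival order

    def insert(self, key):
        # Cheap: no records in the main file have to move.
        self.overflow.append(key)

    def search(self, key):
        # Binary search on the main file first.
        i = bisect.bisect_left(self.main, key)
        if i < len(self.main) and self.main[i] == key:
            return True
        # Then a linear scan of the overflow file, which is why retrieval suffers.
        return key in self.overflow

    def merge(self):
        """Periodic reorganisation: fold the overflow records into the main file."""
        self.main = sorted(self.main + self.overflow)
        self.overflow = []


staff = OrderedFile(["SG14", "SG37", "SL21"])
staff.insert("SL20")
staff.insert("SG21")
found_before_merge = staff.search("SL20")
staff.merge()
```

Until `merge` runs, every unsuccessful lookup in the main file falls back to scanning the overflow file, which is the retrieval cost the slide refers to.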
Advantages of Sequential File Organization
1. It is a simple method to adopt. The implementation is simple compared to
other file organization methods.
2. It is fast and efficient when processing huge amounts of data in sequence.
3. This method of file organization is mostly used for generating
reports and performing statistical operations on data.
4. Data can be stored on cheap storage devices.

Disadvantages of Sequential File Organization
1. Sorting the file takes extra time and requires additional storage for
the sorting operation.
2. Searching for a record on a field other than the ordering field is time-consuming,
as the records must be searched in sequential order.
3. Hashing in Database Management Systems

The hashing technique is used to calculate the direct location of a data record on the disk without
using an index structure. In this technique, data is stored in the data blocks whose address is
generated by the hashing function.

The memory location where these records are stored is known as a data bucket or data block.

Types of Hashing – Static Hashing | Dynamic Hashing
So why would you choose to use hashing?

• For a huge database, it is costly to search through the index values at every level
before reaching the destination data block that holds the desired data.
• Hashing is used to index and retrieve items in a database, as it is faster to find a
specific item using the shorter hashed key than using its original value.
• Hashing is an ideal method to calculate the direct location of a data record on the disk
without using an index structure.
• There are two types: Static Hashing and Dynamic Hashing.
• Data buckets are memory locations where the records are stored. A bucket is also known as a Unit of
Storage.
Static Hashing
• Records do not have to be written sequentially to the file.
• A hash function is used to calculate the address of the page
in which the record is to be stored, based on one or more
fields in the record: O(1) lookup complexity. A hash
function is a mapping function which maps the set
of search keys to the addresses where the actual records are
placed.
• The base field is called the hash field.
• If the hash field is also a key field of the file then it is
called the hash key.
• Records in a hash file will appear randomly distributed
across the available file space. For this reason, hash files
are sometimes called random or direct files.
Static Hashing - Functions
• Inserting a record: When a new record needs to be inserted into the table, an
address is generated for the new record using its hash key. The record is
then stored at that location.
• Searching: To retrieve a record, the same hash function is used to
compute the address of the bucket where the record is stored.
• Deleting a record: Using the hash function, first fetch the record you want to delete,
then remove the record at that address.
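
The three operations above can be sketched with a fixed number of buckets. This is a minimal in-memory model, assuming `Sno` as the hash key, 4 buckets and a simple character-sum hash; all of those choices are illustrative rather than taken from the slides.

```python
NUM_BUCKETS = 4  # fixed when the file is created (static hashing)
buckets = [[] for _ in range(NUM_BUCKETS)]


def bucket_address(hash_key: str) -> int:
    """Illustrative hash function: character codes summed, modulo the bucket count."""
    return sum(ord(ch) for ch in hash_key) % NUM_BUCKETS


def insert(record):
    buckets[bucket_address(record["Sno"])].append(record)


def search(sno):
    # Only the one bucket named by the hash function is examined.
    for record in buckets[bucket_address(sno)]:
        if record["Sno"] == sno:
            return record
    return None


def delete(sno):
    bucket = buckets[bucket_address(sno)]
    for i, record in enumerate(bucket):
        if record["Sno"] == sno:
            del bucket[i]
            return True
    return False


insert({"Sno": "SG37", "Name": "Ann"})
insert({"Sno": "SL20", "Name": "Bob"})
hit = search("SL20")
removed = delete("SL20")
after_delete = search("SL20")
```

All three operations use the same hash function, so each one goes straight to a single bucket instead of scanning the file.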
Dynamic Hashing
• Each address generated by a hash function corresponds to a page (or a
bucket) with slots for multiple records. Data buckets are memory locations
where the records are stored. A bucket is also known as a Unit of Storage.

• Within a bucket, records are placed in order of arrival.

• When the same address is generated for two or more records, a collision is
said to have occurred and the records are called synonyms.
• We must insert the new record in another position when a collision occurs.
• Collision management complicates hash file management and degrades overall
performance.
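
The slides do not say which collision-handling method is used. One common technique is open addressing with linear probing, where a colliding record goes into the next free slot; the sketch below assumes that technique, integer keys, and a table of 7 single-record slots.

```python
TABLE_SIZE = 7
slots = [None] * TABLE_SIZE  # one record per slot in this simplified model


def home_address(key: int) -> int:
    return key % TABLE_SIZE


def insert(key):
    """Place key at its home address, or at the next free slot after a collision."""
    addr = home_address(key)
    for step in range(TABLE_SIZE):
        probe = (addr + step) % TABLE_SIZE
        if slots[probe] is None:
            slots[probe] = key
            return probe
    raise RuntimeError("hash file is saturated")


def search(key):
    """Follow the same probe sequence as insert; stop at the first free slot."""
    addr = home_address(key)
    for step in range(TABLE_SIZE):
        probe = (addr + step) % TABLE_SIZE
        if slots[probe] is None:
            return None
        if slots[probe] == key:
            return probe
    return None


first = insert(10)   # home address 3
second = insert(17)  # also hashes to 3: a collision, so a synonym of 10
third = insert(24)   # a third synonym
```

The keys 10, 17 and 24 are synonyms: all hash to address 3, so they end up in slots 3, 4 and 5, and finding 24 takes three probes rather than one. That extra work is the performance cost of collisions.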
Hashing – Static/Dynamic
• The hashing techniques we have considered so far are static in that the
hash address space is fixed when the file is created. When the space
becomes full it is said to be saturated.
• In this case it is necessary to reorganise the hash structure.
• This may involve creating a new file with more space, then choosing a
new hash function and mapping the old file to the new file.

• An alternative is dynamic hashing.
• This allows the file size to change dynamically to accommodate growth
and shrinkage of the database.
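
The reorganisation of a saturated static hash file can be sketched as follows. It assumes integer keys and a modulo hash function, so "choosing a new hash function" simply means using a larger modulus; the function names are hypothetical. Dynamic hashing schemes aim to avoid this kind of full rebuild.

```python
def build_hash_file(keys, num_buckets):
    """Create a static hash file with a fixed number of buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[key % num_buckets].append(key)
    return buckets


def reorganise(old_buckets, new_num_buckets):
    """Create a larger file and map every record from the old file into it."""
    all_keys = [key for bucket in old_buckets for key in bucket]
    return build_hash_file(all_keys, new_num_buckets)


old_file = build_hash_file([3, 7, 11, 15, 19], num_buckets=4)
new_file = reorganise(old_file, new_num_buckets=8)
```

In the old file all five keys land in bucket 3; after the rebuild they are spread over buckets 3 and 7. Every record had to be read and rewritten to get there.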
The limitations of Hashing
• The use of hashing for retrievals depends upon the complete hash
field. In general, hashing is inappropriate for retrievals based on
pattern matching or ranges of values.

• Hashing is also inappropriate for retrievals based on a field other than
the hash field. In this case, it would be necessary to perform a linear
search to find the record.
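
The range limitation can be seen in a small sketch. It assumes integer keys hashed with a modulo function into 4 buckets; `range_query` is a hypothetical helper. Because neighbouring key values land in unrelated buckets, a range query has to read every bucket.

```python
def range_query(buckets, low, high):
    """Return (sorted matches, buckets read) for low <= key <= high."""
    buckets_read = 0
    matches = []
    for bucket in buckets:  # the hash function gives no help here
        buckets_read += 1
        matches.extend(k for k in bucket if low <= k <= high)
    return sorted(matches), buckets_read


NUM_BUCKETS = 4
buckets = [[] for _ in range(NUM_BUCKETS)]
for key in [3, 8, 10, 13, 21]:
    buckets[key % NUM_BUCKETS].append(key)

# An exact-match lookup reads just one bucket...
exact_bucket = buckets[10 % NUM_BUCKETS]
# ...but a range query must read them all.
matches, buckets_read = range_query(buckets, 8, 13)
```

The exact-match lookup for key 10 reads one bucket, while the range 8 to 13 reads all four.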
Advantages of Hash File Organization
1. This method doesn't require explicit sorting, as each record's location is determined
directly by its hash key.
2. Reading and fetching a record is faster than with other methods, as the hash key is used to
locate the data directly.
3. Records do not depend on each other and are not stored in consecutive memory locations, which
helps prevent read, write, update and delete anomalies.

Disadvantages of Hash File Organization
1. Can cause accidental deletion of data if the columns for the hash function are not selected properly. For
example, deleting an employee "Steve" using Employee_Name as the hash column can cause
accidental deletion of other employee records if another employee is also named "Steve". This can
be avoided by selecting the attributes properly; for example, combining age, department
or SSN with Employee_Name for the hash key makes it more likely that the distinct record is found.
2. Memory is not used efficiently in hash file organization, as records are not stored in consecutive
memory locations.
3. If there is more than one hash column, searching for a record using a single attribute will not give
accurate results.
Overview

The End
