ISE VI File Structures (10IS63) Notes
FILE STRUCTURES
Subject Code: 10IS63
PART – A
UNIT – 1 7 Hours
Introduction: File Structures: The Heart of the file structure Design, A Short History of File
Structure Design, A Conceptual Toolkit; Fundamental File Operations: Physical Files and
Logical Files, Opening Files, Closing Files, Reading and Writing, Seeking, Special Characters,
The Unix Directory Structure, Physical devices and Logical Files, File-related Header Files,
UNIX file System Commands; Secondary Storage and System Software: Disks, Magnetic Tape,
Disk versus Tape; CD-ROM: Introduction, Physical Organization, Strengths and Weaknesses;
Storage as Hierarchy, A journey of a Byte, Buffer Management, Input /Output in UNIX.
UNIT – 2 6 Hours
Fundamental File Structure Concepts, Managing Files of Records : Field and Record
Organization, Using Classes to Manipulate Buffers, Using Inheritance for Record Buffer Classes,
Managing Fixed Length, Fixed Field Buffers, An Object-Oriented Class for Record Files,
Record Access, More about Record Structures, Encapsulating Record Operations in a Single
Class, File Access and File Organization.
UNIT – 3 7 Hours
Organization of Files for Performance, Indexing: Data Compression, Reclaiming Space in
files, Internal Sorting and Binary Searching, Keysorting; What is an Index? A Simple Index for
Entry-Sequenced File, Using Template Classes in C++ for Object I/O, Object-Oriented support
for Indexed, Entry-Sequenced Files of Data Objects, Indexes that are too large to hold in
Memory, Indexing to provide access by Multiple keys, Retrieval Using Combinations of
Secondary Keys, Improving the Secondary Index structure: Inverted Lists, Selective indexes,
Binding.
UNIT – 4 6 Hours
Cosequential Processing and the Sorting of Large Files: A Model for Implementing
Cosequential Processes, Application of the Model to a General Ledger Program, Extension of the
Model to include Multiway Merging, A Second Look at Sorting in Memory, Merging as a Way of
Sorting Large Files on Disk.
PART - B
UNIT – 5 7 Hours
Multi-Level Indexing and B-Trees: The invention of B-Tree, Statement of the problem,
Indexing with Binary Search Trees; Multi-Level Indexing, BTrees, Example of Creating a B-
Tree, An Object-Oriented Representation of B-Trees, B-Tree Methods; Nomenclature, Formal
Definition of B-Tree Properties, Worst-case Search Depth, Deletion, Merging and
Redistribution, Redistribution during insertion; B* Trees, Buffering of pages; Virtual BTrees;
Variable-length Records and keys.
UNIT – 6 6 Hours
Indexed Sequential File Access and Prefix B + Trees: Indexed Sequential Access, Maintaining
a Sequence Set, Adding a Simple Index to the Sequence Set, The Content of the Index:
Separators Instead of Keys, The Simple Prefix B+ Tree and its maintenance, Index Set Block
Size, Internal Structure of Index Set Blocks: A Variable-order B- Tree, Loading a Simple Prefix
B+ Trees, B-Trees, B+ Trees and Simple Prefix B+ Trees in Perspective.
UNIT – 7 7 Hours
Hashing: Introduction, A Simple Hashing Algorithm, Hashing Functions and Record
Distribution, How much Extra Memory should be used?, Collision resolution by progressive
overflow, Buckets, Making deletions, Other collision resolution techniques, Patterns of record
access.
UNIT – 8 6 Hours
Extendible Hashing: How Extendible Hashing Works, Implementation, Deletion, Extendible
Hashing Performance, Alternative Approaches.
Text Books:
1. Michael J. Folk, Bill Zoellick, Greg Riccardi: File Structures: An Object-Oriented Approach
with C++, 3rd Edition, Pearson Education, 1998. (Chapters 1 to 12 excluding 1.4, 1.5, 5.5, 5.6,
8.6, 8.7, 8.8)
Reference Books:
1. K.R. Venugopal, K.G. Srinivas, P.M. Krishnaraj: File Structures Using C++, Tata McGraw-
Hill, 2008.
2. Scott Robert Ladd: C++ Components and Algorithms, BPB Publications, 1993.
3. Raghu Ramakrishnan and Johannes Gehrke: Database Management Systems, 3rd Edition,
McGraw-Hill, 2003.
UNIT – 1
Introduction to the Design and Specification of File Structures
x Even with a balanced binary tree, dozens of accesses were required to find a record
in moderate-sized files.
x A method was needed to keep a tree balanced when each node of the tree was not a
single record, as in a binary tree, but a file block containing hundreds of records.
Hence, B-Trees were introduced.
x AVL trees grow from top down as records are added, B-Trees grow from the bottom
up.
x B-Trees provided excellent access performance but, a file could not be accessed
sequentially with efficiency.
x The above problem was solved using B+ tree which is a combination of a B-Tree and
a sequential linked list added at the bottom level of the B-Tree.
x To further reduce the number of disk accesses, hashing was introduced for files that
do not change size greatly over time.
x Extendible, dynamic hashing was introduced for volatile, dynamic files which
change.
Example:
int Input;
Input = open ("Daily.txt", O_RDONLY);
The following flags can be bitwise ored together for the access mode:
O_RDONLY : Read only
O_WRONLY : Write only
O_RDWR : Read or write
O_CREAT : Create file if it does not exist
O_EXCL : Return an error if the file already exists; do not open it. (used only
with O_CREAT)
O_APPEND : Append every write operation to the end of the file
O_TRUNC : Delete any prior file contents
Pmode- protection mode
The security status of a file, defining who is allowed to access a file, and which access
modes are allowed.
x For write, one or more values (as variables or constants) must be supplied to the
write function, to provide the data for the file.
x For unformatted transfers, the amount of data to be transferred must also be
supplied.
2.4.1 Read and Write Functions
Reading
x The C++ read function is used to read data from a file for handle level access.
x The read function must be supplied with (as arguments):
o The source file to read from
o The address of the memory block into which the data will be stored
o The number of bytes to be read (the byte count)
x The value returned by the read function is the number of bytes read.
Read function:
Prototypes:
int read (int Handle, void * Buffer, unsigned Length);
Example:
read (Input, &C, 1);
Writing
x The C++ write function is used to write data to a file for handle level access.
x The handle write function must be supplied with (as arguments):
o The logical file name used for sending data
o The address of the memory block from which the data will be written
o The number of bytes to be written
x The value returned by the write function is the number of bytes written.
Write function:
Prototypes:
int write (int Handle, void * Buffer, unsigned Length);
Example:
write (Output, &C, 1);
2.4.2 Files with C Streams and C++ Stream Classes
x For FILE level access, the logical file is declared as a pointer to a FILE (FILE *)
x The FILE structure is defined in the stdio.h header file.
Opening
The C++ fopen function is used to open a file for FILE level access.
x The FILE fopen function must be supplied with (as arguments):
o The name of the physical file
o The access mode
x The value returned by the fopen is a pointer to an open FILE, and is assigned to the
file variable.
fopen function:
Prototypes:
FILE * fopen (const char * Filename, const char * Access);
Example:
FILE * Input;
Input = fopen ("Daily.txt", "r");
The access mode should be one of the following strings:
r
Open for reading (existing file only) in text mode
r+
Open for update (existing file only)
w
Open (or create) for writing (and delete any previous data)
w+
Open (or create) for update (and delete any previous data)
a
Open (or create) for append with file pointer at current EOF (and keep any previous
data) in text mode
a+
Open (or create) for append update (and keep any previous data)
Closing
The C++ fclose function is used to close a file for FILE level access.
The FILE fclose function must be supplied with (as an argument):
o A pointer to the FILE structure of the logical file
The value returned by fclose is 0 if the close succeeds, and nonzero (EOF) if the close fails.
Prototypes:
int fclose (FILE * Stream);
Example:
fclose (Input);
Reading
The C++ fread function is used to read data from a file for FILE level access.
The FILE fread function must be supplied with (as arguments):
o A pointer to the FILE structure of the logical file
o The address of the buffer into which the data will be read
o The number of items to be read
o The size of each item to be read, in bytes
The value returned by the fread function is the number of items read.
Prototypes:
size_t fread (void * Buffer, size_t Size, size_t Count, FILE * Stream);
Example:
fread (&C, 1, 1, Input);
Writing
The C++ fwrite function is used to write data to a file for FILE level access.
The FILE fwrite function must be supplied with (as arguments):
o A pointer to the FILE structure of the logical file
o The address of the buffer from which the data will be written
o The number of items to be written
o The size of each item to be written, in bytes
The value returned by the fwrite function is the number of items written.
Prototypes:
size_t fwrite (void * Buffer, size_t Size, size_t Count, FILE * Stream);
Example:
fwrite (&C, 1, 1, Output);
2.4.3 Programs in C++ to Display the contents of a File
The first simple file processing program opens a file for input and reads it, character by
character, sending each character to the screen after it is read from the file. This program
includes the following steps
1. Display a prompt for the name of the input file.
2. Read the user’s response from the keyboard into a variable called filename.
3. Open the file for input.
4. While there are still characters to be read from the input file,
Read a character from the file;
Write the character to the terminal screen.
5. Close the input file.
Figures 2.2 and 2.3 are C++ implementations of this program using C streams and C++ stream
classes, respectively.
In the C++ version, the call file.unsetf(ios::skipws) causes operator >> to include white
space (blanks, end-of-line characters, tabs, and so on).
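Figures 2.2 and 2.3 are not reproduced in these notes. As a rough stand-in, steps 3 through 5 of the program can be sketched with C streams as follows (DisplayFile is a hypothetical helper name; steps 1 and 2, the prompt and the keyboard read into filename, would precede the call in main):

```cpp
#include <stdio.h>

// Steps 3-5: open the named file, copy it to the screen
// character by character, then close it.
int DisplayFile (const char * Filename) {
    FILE * Input = fopen (Filename, "r");   // step 3: open for input
    if (Input == NULL) return 1;
    int Ch;
    while ((Ch = fgetc (Input)) != EOF)     // step 4: read a character,
        putchar (Ch);                       //         write it to the screen
    fclose (Input);                         // step 5: close the input file
    return 0;
}
```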
2.4.4 Detecting End of File
end-of-file
A physical location just beyond the last datum in a file.
2.5 Seeking
The action of moving directly to a certain position in a file is called seeking.
seek
To move to a specified location in a file.
byte offset
The distance, measured in bytes, from the beginning of the file.
x Seeking moves an attribute in the file called the file pointer.
x C++ library functions allow seeking.
x In DOS, Windows, and UNIX, files are organized as streams of bytes, and locations
are in terms of byte count.
x Seeking can be specified from one of three reference points:
o The beginning of the file.
o The current position in the file.
o The end of the file.
Example
The C++ fseek function is used to move the file pointer of a file identified by its FILE
structure.
The FILE fseek function must be supplied with (as arguments):
o The file descriptor of the file (file)
o The number of bytes to move from some origin in the file (byte_offset)
o The starting point from which the byte_offset is to be taken (origin)
The Origin argument should be one of the following, to designate the reference point:
SEEK_SET: Beginning of file
SEEK_CUR: Current file position
SEEK_END: End of file
The value returned by the fseek function is 0 if the seek succeeds, and nonzero if it
fails; the new position of the read/write pointer, measured from the beginning of the
file, can then be obtained with ftell.
Prototypes:
int fseek (FILE * file, long Offset, int Origin);
Example:
long pos;
fseek (Output, 100, SEEK_SET);
pos = ftell (Output);
Example:
A connection between standard output of one process and standard input of a second
process.
x In both DOS and UNIX, the standard output of one program can be piped
(connected) to the standard input of another program with the | symbol.
x Example:
cluster
A group of sectors handled as a unit of file allocation. A cluster is a fixed number of
contiguous sectors.
extent
A physical section of a file occupying adjacent clusters.
fragmentation
Unused space within a file.
x Clusters are also referred to as allocation units (ALUs).
x Space is allocated to files as integral numbers of clusters.
x A file can have a single extent, or be scattered in several extents.
x Access time for a file increases as the number of separate extents increases, because
of seeking.
x Defragmentation utilities physically move files on a disk so that each file has a
single extent.
x Allocation of space in clusters produces fragmentation.
x A file of one byte is allocated the space of one cluster.
x On average, fragmentation is one-half cluster per file.
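The one-half-cluster-per-file estimate can be checked with a small calculation (a sketch; the WastedBytes name and the 4096 byte cluster size in the usage are assumptions for illustration):

```cpp
// Bytes of internal fragmentation when a file of FileSize bytes
// is stored in whole clusters of ClusterSize bytes each.
long WastedBytes (long FileSize, long ClusterSize) {
    // Space is allocated in integral numbers of clusters, so round up.
    long Clusters = (FileSize + ClusterSize - 1) / ClusterSize;
    return Clusters * ClusterSize - FileSize;
}
```

For example, WastedBytes (1, 4096) is 4095: a one byte file occupies a whole cluster. Averaged over files of random sizes, the waste approaches half a cluster (2048 bytes here) per file.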
3.1.5 Organizing Tracks by Block
x Mainframe computers typically use variable size physical blocks for disk drives.
x Track capacity is dependent on block size, due to fixed overhead (gap and address
block) per block.
3.1.6 The Cost of a Disk Access
direct access device
A data storage device which supports direct access.
direct access
Accessing data from a file by record position within the file, without accessing intervening
records.
access time
The total time required to store or retrieve data.
transfer time
The time required to transfer the data from a sector, once the transfer has begun.
seek time
The time required for the head of a disk drive to be positioned to a designated cylinder.
rotational delay
The time required for a designated sector to rotate to the head of a disk drive.
x Access time of a disk is related to physical movement of the disk parts.
x Disk access time has three components: seek time, rotational delay, and transfer
time.
x Seek time is affected by the size of the drive, the number of cylinders in the drive,
and the mechanical responsiveness of the access arm.
x Average seek time is approximately the time to move across 1/3 of the cylinders.
x Rotational delay is also referred to as latency.
x Rotational delay is inversely proportional to the rotational speed of the drive.
x Average rotational delay is the time for the disk to rotate 180°.
x Transfer time is inversely proportional to the rotational speed of the drive.
x Transfer time is proportional to the physical length of a sector.
x Transfer time is roughly inversely proportional to the number of sectors per track.
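The three components can be combined into a rough estimate of one disk access (a sketch; the AccessTimeMs name and the drive parameters in the usage are assumptions, not figures from the text):

```cpp
// Estimated time in milliseconds to read one sector:
// seek time + average rotational delay + transfer time for one sector.
double AccessTimeMs (double SeekMs, double RPM, double SectorsPerTrack) {
    double RotationMs = 60000.0 / RPM;                // time for one revolution
    double RotDelayMs = RotationMs / 2.0;             // average: half a turn
    double TransferMs = RotationMs / SectorsPerTrack; // one sector passes the head
    return SeekMs + RotDelayMs + TransferMs;
}
```

For an assumed 8 ms average seek, 7200 RPM (8.33 ms per revolution), and 64 sectors per track, this gives about 8 + 4.17 + 0.13, roughly 12.3 ms per access.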
x CD-ROM is read only. i.e., it is a publishing medium rather than a data storage and
retrieval like magnetic disks.
Pits and lands on the disc surface change the reflected beam's intensity. This pattern of
changing intensity of the reflected beam is converted into binary data.
Buffer management techniques:
- Double Buffering
- Buffer Pooling
- Move mode and Locate mode
- Scatter/Gather I/O
UNIT-2
Fundamental File Structures Concepts
4.1 Field and Record Organization
4.1.1 A Stream File
x In the Windows, DOS, UNIX, and LINUX operating systems, files are not internally
structured; they are streams of individual bytes.
F r e d F l i n t s t o n e 4 4 4 4 G r a n ...
x The only file structure recognized by these operating systems is the separation of a
text file into lines.
o For Windows and DOS, two characters are used between lines, a carriage
return (ASCII 13) and a line feed (ASCII 10);
o For UNIX and LINUX, one character is used between lines, a line feed
(ASCII 10);
x The code in applications programs can, however, impose internal organization on
stream files.
Record Structures
record
A subdivision of a file, containing data related to a single entity.
field
A subdivision of a record containing a single attribute of the entity which the record
describes.
stream of bytes
A file which is regarded as being without structure beyond separation into a sequential set
of bytes.
// Reconstructed context (the Person declaration and main); field sizes are illustrative.
struct Person {
    char FirstName [16], LastName [16], Address [32],
         City [16], State [3], ZIP [10];
};
void Display (Person Someone);
int main () {
    Person Clerk, Customer;   // field values assigned here
    Display (Clerk);
    Display (Customer);
}
void Display (Person Someone) {
    cout << Someone.FirstName << Someone.LastName
         << Someone.Address << Someone.City
         << Someone.State << Someone.ZIP;
}
x In memory, each Person will appear as an aggregate, with the individual values
being parts of the aggregate:
Person
Clerk
FirstName LastName Address City State ZIP
Fred Flintstone 4444 Granite Place Rockville MD 00001
x The output of this program will be:
FredFlintstone4444 Granite PlaceRockvilleMD00001LilyMunster1313
Mockingbird LaneHollywoodCA90210
x Obviously, this output could be improved. It is marginally readable by people, and
it would be difficult to program a computer to read and correctly interpret this
output.
Record 1 # Record 2 # Record 3 # Record 4 # Record 5 #
x The records within a file are followed by a delimiting byte or series of bytes.
x The delimiter cannot occur within the records.
x Records within a file can have different sizes.
110 Record 1 40 Record 2 100 Record 3 80 Record 4 70 Record 5
x The records within a file are prefixed by a length byte or bytes.
x Records within a file can have different sizes.
x Different files can have different length records.
x Programs which access the file must know the size and format of the length prefix.
x Offset, or position, of the nth record of a file cannot be calculated.
x There is external overhead for record separation equal to the size of the length prefix
per record.
x There should be no internal fragmentation (unused space within records.)
x There may be no external fragmentation (unused space outside of records) after file
updating.
x Individual records cannot always be updated in place.
x Algorithms for Accessing Prefixed Variable Length Records
x Code for Accessing PreFixed Variable Length Records
x Example:
0  A 0 46 69 72 73 74 20 4C 69 6E 65 B 0 53 65      ..First Line..Se
10 63 6F 6E 64 20 4C 69 6E 65 1F 0 54 68 69 72 64   cond Line..Third
20 20 4C 69 6E 65 20 77 69 74 68 20 6D 6F 72 65 20  Line with more
30 63 68 61 72 61 63 74 65 72 73                    characters
x Disadvantage: the offset of each record cannot be calculated from its record number.
This makes direct access impossible.
x Disadvantage: there is space overhead for the delimiter suffix.
x Advantage: there will probably be no internal fragmentation (unusable space within
records.)
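A minimal sketch of writing and reading length-prefixed records, assuming the 2 byte little-endian binary length prefix shown in the hex dump above (ReadRecord and WriteRecord are hypothetical helper names, not from the text):

```cpp
#include <cstdio>
#include <cstring>

// Read one length-prefixed record into Buffer (which must hold
// at least Length+1 bytes); returns the record length, or -1 at end of file.
int ReadRecord (FILE * Stream, char * Buffer) {
    unsigned char Prefix [2];
    if (fread (Prefix, 1, 2, Stream) != 2) return -1;
    int Length = Prefix[0] + 256 * Prefix[1];   // 2-byte little-endian length
    fread (Buffer, 1, Length, Stream);
    Buffer[Length] = '\0';
    return Length;
}

// Write one record preceded by its 2-byte little-endian length prefix.
void WriteRecord (FILE * Stream, const char * Data) {
    int Length = (int) strlen (Data);
    unsigned char Prefix [2] = { (unsigned char)(Length % 256),
                                 (unsigned char)(Length / 256) };
    fwrite (Prefix, 1, 2, Stream);
    fwrite (Data, 1, Length, Stream);
}
```

Writing "First Line" this way produces the bytes 0A 00 46 69 72 ..., matching the start of the example dump.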
Indexed Variable Length Records
x Advantage: the offset of each record is contained in the index, and can be looked
up from its record number. This makes direct access possible.
x Disadvantage: there is space overhead for the index file.
x Disadvantage: there is time overhead for the index file.
x Advantage: there will probably be no internal fragmentation (unusable space within
records.)
x The time overhead for accessing the index file can be minimized by reading the
entire index file into memory when the files are opened.
Fixed Field Count Records
x Records can be recognized if they always contain the same (predetermined) number
of fields.
Delineation of Fields in a Record
Fixed Length Fields
Field 1 Field 2 Field 3 Field 4 Field 5
x Each record is divided into fields; corresponding fields in every record have the
same (predetermined) size.
x Different fields within a record can have different sizes.
x Different files can have different length fields.
x Programs which access the record must know the field lengths.
x There is no external overhead for field separation.
x There may be internal fragmentation (unused space within fields.)
Delimited Variable Length Fields
Field 1 ! Field 2 ! Field 3 ! Field 4 ! Field 5 !
x The fields within a record are followed by a delimiting byte or series of bytes.
x Fields within a record can have different sizes.
x Different records can have different length fields.
x Programs which access the record must know the delimiter.
x The delimiter cannot occur within the data.
x If used with delimited records, the field delimiter must be different from the record
delimiter.
x There is external overhead for field separation equal to the size of the delimiter per
field.
x There should be no internal fragmentation (unused space within fields.)
Length Prefixed Variable Length Fields
12 Field 1 4 Field 2 10 Field 3 8 Field 4 7 Field 5
x The fields within a record are prefixed by a length byte or bytes.
x Fields within a record can have different sizes.
x Different records can have different length fields.
x Programs which access the record must know the size and format of the length
prefix.
x There is external overhead for field separation equal to the size of the length prefix
per field.
x There should be no internal fragmentation (unused space within fields.)
Representing Record or Field Length
x Record or field length can be represented in either binary or character form.
x The length can be considered as another hidden field within the record.
x This length field can be either fixed length or delimited.
x When character form is used, a space can be used to delimit the length field.
x A two byte fixed length field could be used to hold lengths of 0 to 65535 bytes in
binary form.
x A two byte fixed length field could be used to hold lengths of 0 to 99 bytes in
decimal character form.
x A variable length field delimited by a space could be used to hold effectively any
length.
x In some languages, such as strict Pascal, it is difficult to mix binary values and
character values in the same file.
x The C++ language is flexible enough so that the use of either binary or character
format is easy.
Tagged Fields
x Tags, in the form "Keyword=Value", can be used in fields.
x Use of tags does not in itself allow separation of fields, which must be done with
another method.
x Use of tags adds significant space overhead to the file.
x Use of tags does add flexibility to the file structure.
x Fields can be added without affecting the basic structure of the file.
x Tags can be useful when records have sparse fields - that is, when a significant
number of the possible attributes are absent.
Byte Order
x The byte order of integers (and floating point numbers) is not the same on all
computers.
x This is hardware dependent (CPU), not software dependent.
x Many computers store numbers as might be expected: 40 (decimal) = 28 (hexadecimal)
is stored in a four byte integer as 00 00 00 28.
x PCs reverse the byte order, and store numbers with the least significant byte first:
40 (decimal) = 28 (hexadecimal) is stored in a four byte integer as 28 00 00 00.
x On most computers, the number 40 would be stored in character form in its ASCII
values: 34 30.
x IBM mainframe computers use EBCDIC instead of ASCII, and would store "40" as
F4 F0.
sequential search
A search which reads each record sequentially from the beginning until the record
or records being sought are found.
x A sequential search is O(n); that is, the search time is proportional to the number of
items being searched.
x For a file of 1000 records and unique random search keys, an average of 500 records
must be read to find the desired record.
x For an unsuccessful search, the entire file must be examined.
x Sequential search is unsatisfactory for most file searches.
x Sequential search is satisfactory for certain special cases:
o Sequential search is satisfactory for small files.
o Sequential search is satisfactory for files that are searched only infrequently.
o Sequential search is satisfactory when a high percentage of the records in a
file will match.
o Sequential search is required for unstructured text files.
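The sequential search described above can be sketched as follows (SequentialSearch is a hypothetical helper name; an array of key strings stands in for the records of a file):

```cpp
#include <cstring>

// Sequential search: examine the keys in order until the target matches.
// Returns the position found, or -1 if the target is not present.
// O(n): on average half the keys are examined for a successful search,
// and all of them for an unsuccessful one.
int SequentialSearch (const char * Keys [], int Count, const char * Target) {
    for (int I = 0; I < Count; I++)
        if (strcmp (Keys[I], Target) == 0)
            return I;
    return -1;
}
```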
5.1.3 Unix Tools for Sequential Processing
x Unix style tools are also available for MS-DOS and Windows (for example, from the
Cygwin project).
x The cat (concatenate) utility can be used to copy files to standard output.
x The cat (concatenate) utility can be used to combine (concatenate) two or more files
into one.
x The grep (general regular expression print) utility prints lines matching a pattern.
x The wc (word count) utility counts characters, words, and lines in a file.
5.1.4 Direct Access
direct access
Accessing data from a file by record position within the file, without accessing
intervening records.
relative record number
An ordinal number indicating the position of a record within a file.
x Direct access allows individual records to be read from different locations in the file
without reading intervening records.
x Files often begin with headers, which describe the data in the file, and the
organization of the file.
0  10 0 56 2 0 44 1C 0 0 0 0 0 0 0 0 0             ..V..D..........
10 2C 0 5 4E 61 6E 63 79 5 4A 6F 6E 65 73 D 31     ,..Nancy.Jones.1
20 32 33 20 45 6C 6D 20 50 6C 61 63 65 8 4C 61 6E  23 Elm Place.Lan
30 67 73 74 6F 6E 2 4F 4B 5 37 32 30 33 32 34 0    gston.OK.720324.
40 6 48 65 72 6D 61 6E 7 4D 75 6E 73 74 65 72 15   .Herman.Munster.
50 31 33 31 33 20 4D 6F 63 6B 69 6E 67 62 69 72 64 1313 Mockingbird
60 20 4C 61 6E 65 5 54 75 6C 73 61 2 4F 4B 5 37    Lane.Tulsa.OK.7
70 34 31 31 34 34 0 5 55 68 75 72 61 5 53 6D 69    41144..Uhura.Smi
80 74 68 13 31 32 33 20 54 65 6C 65 76 69 73 69 6F th.123 Televisio
90 6E 20 4C 61 6E 65 A 45 6E 74 65 72 70 72 69 73  n Lane.Enterpris
A0 65 2 43 41 5 39 30 32 31 30                     e.CA.90210
x The above dump represents a file with a 16 byte (10 00) header, variable length
records with a 2 byte length prefix, and fields delimited by ASCII code 28
(hexadecimal 1C). The actual data begins at byte 16 (hexadecimal 10).
5.3.3 Metadata
metadata
Data which describes the data in a file or table.
5.3.4 Mixing Object Types in One File
5.3.5 Representation Independent File Access
5.3.6 Extensibility
extensibility
Having the ability to be extended (e.g., by adding new fields) without redesign.
sequential access
Access of data in order.
Accessing data from a file whose records are organized on the basis of their
successive physical positions.
direct access
Access of data in arbitrary order, with variable access time.
Accessing data from a file by record position within the file, without accessing
intervening records.
x For direct access to be useful, the relative record number of the record of interest
must be known.
x Direct access is often used to support keyed access.
Keyed Access
keyed access
Accessing data from a file by an alphanumeric key associated with each record.
key
A value which is contained within or associated with a record and which can be used
to identify the record.
UNIT- 3
Organizing Files for Performance, Indexing
6.1. Data Compression
x Compression can reduce the size of a file, improving performance.
x File maintenance can produce fragmentation inside of the file. There are ways to
reuse this space.
x There are better ways than sequential search to find a particular record in a file.
x Keysorting is a way to sort medium size files.
x We have already considered how important it is for the file system designer to
consider how a file is to be accessed when deciding how to create fields, records,
and other file structures. In this chapter, we continue to focus on file organization,
but the motivation is different. We look at ways to organize or reorganize files in
order to improve performance.
x In the first section, we look at how to organize files to make them smaller.
Compression techniques make files smaller by encoding them to remove redundant
or unnecessary information.
data compression
The encoding of data in such a way as to reduce its size.
redundancy reduction
Any form of compression which removes only redundant information.
x In this section, we look at ways to make files smaller, using data compression. As
with many programming techniques, there are advantages and disadvantages to data
compression. In general, the compression must be reversed before the information is
used. For this tradeoff,
o Smaller files use less storage space.
o The transfer time of disk access is reduced.
o The transmission time to transfer files over a network is reduced.
But,
o Program complexity and size are increased.
o Computation time is increased.
o Data portability may be reduced.
o With some compression methods, information is unrecoverably lost.
o Direct access may become prohibitively expensive.
o Data compression is possible because most data contains redundant
(repeated) or unnecessary information.
x Run-length encoding is useful only when the text contains long runs of a single
value.
x Run-length encoding is useful for images which contain solid color areas.
x Run-length encoding may be useful for text which contains strings of blanks.
x Example:
x uncompressed text (hexadecimal format):
x 40 40 40 40 40 40 43 43 41 41 41 41 41 42
x compressed text (hexadecimal format):
x FE 06 40 43 43 FE 05 41 42
where FE is the compression escape code, followed by a length byte, and the byte to
be repeated.
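The escape-code scheme above can be sketched as follows (RunLengthEncode is a hypothetical name; runs shorter than four bytes are left literal since the escape sequence itself costs three bytes, and a real encoder would also have to escape literal FE bytes in the data):

```cpp
// Run-length encode In[0..InLen) into Out, using 0xFE as the escape code
// followed by a count byte and the byte to be repeated.
// Returns the compressed length.
int RunLengthEncode (const unsigned char * In, int InLen, unsigned char * Out) {
    int O = 0;
    for (int I = 0; I < InLen; ) {
        int Run = 1;                         // measure the run of equal bytes
        while (I + Run < InLen && In[I + Run] == In[I] && Run < 255) Run++;
        if (Run >= 4) {                      // worth encoding: FE, count, byte
            Out[O++] = 0xFE;
            Out[O++] = (unsigned char) Run;
            Out[O++] = In[I];
        } else {                             // short run: copy literally
            for (int J = 0; J < Run; J++) Out[O++] = In[I];
        }
        I += Run;
    }
    return O;
}
```

Applied to the uncompressed text above (six 40s, two 43s, five 41s, one 42), this produces exactly the nine compressed bytes FE 06 40 43 43 FE 05 41 42.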
c 011   g 0011   d 0000
Compressed Text (binary):
10100000000110110010110011 (26 bits)
Compressed Text (hexadecimal):
A0 1B 96 60
6.2.4 Irreversible Compression Techniques
irreversible compression
Any form of compression which reduces information.
reversible compression
Compression with no alteration of original information upon reconstruction.
x Irreversible compression goes beyond redundancy reduction, removing information
which is not actually necessary, making it impossible to recover the original form of
the data.
x Irreversible compression is useful for reducing the size of graphic images.
x Irreversible compression is used to reduce the bandwidth of audio for digital
recording and telecommunications.
x JPEG image files use an irreversible compression based on cosine transforms.
x The amount of information removed by JPEG compression is controllable.
x The more information removed, the smaller the file.
x For photographic images, a significant amount of information can be removed
without noticably affecting the image.
x For line graphic images, the JPEG compression may introduce aliasing noise.
x GIF image files irreversibly compress images which contain more than 256 colors.
x The GIF format only allows 256 colors.
x The compression of GIF formatting is reversible for images which have fewer than
256 colors, and lossy for images which have more than 256 colors.
x Recommendation:
o Use JPEG for photographic images.
o Use GIF for line drawings.
6.2.5. Compression in UNIX
x The UNIX pack and unpack utilities use Huffman encoding.
x The UNIX compress and uncompress utilities use Lempel-Ziv encoding.
x Lempel-Ziv is a variable length encoding which replaces strings of characters with
numbers.
x The length of the strings which are replaced increases as the compression advances
through the text.
x Lempel-Ziv compression does not store the compression table with the compressed
text. The compression table can be reproduced during the decompression process.
x Lempel-Ziv compression is used by "zip" compression in DOS and Windows.
x Lempel-Ziv compression is a redundancy reduction compression - it is completely
reversible, and no information is lost.
x The ZIP utilities actually support several types of compression, including
Lempel-Ziv and Huffman.
6.3 Reclaiming Space in Files
6.3.1 Record Deletion and Storage Compaction
external fragmentation
Fragmentation in which the unused space is outside of the allocated areas.
compaction
The removal of fragmentation from a file by moving records so that they are all
physically adjacent.
x As files are maintained, records are added, updated, and deleted.
x The problem: as records are deleted from a file, they are replaced by unused spaces
within the file.
x The updating of variable length records can also produce fragmentation.
x Compaction is a relatively slow process, especially for large files, and is not
routinely done when individual records are deleted.
6.3.2 Deleting Fixed-Length Records for Reclaiming Space Dynamically
linked list
A container consisting of a series of nodes, each containing data and a reference to the
location of the logically next node.
avail list
A list of the unused spaces in a file.
stack
A last-in first-out container, which is accessed only at one end.
Record 1 Record 2 Record 3 Record 4 Record 5
x Deleted records must be marked so that the spaces will not be read as data.
x One way of doing this is to put a special character, such as an asterisk, in the first
byte of the deleted record space.
Record 1 Record 2 * Record 4 Record 5
x If the space left by deleted records could be reused when records are added,
fragmentation would be reduced.
x To reuse the empty space, there must be a mechanism for finding it quickly.
x One way of managing the empty space within a file is to organize as a linked list,
known as the avail list.
x The location of the first space on the avail list, the head pointer of the linked list, is
placed in the header record of the file.
x Each empty space contains the location of the next space on the avail list, except for
the last space on the list.
x The last space contains a number which is not valid as a file location, such as -1.
First Fit
x Avail list before adding a record (the head pointer, 370, is in the header; each
hole holds the offset of the next hole and its own size):

Header: 370
Slot@50:  * next=-1  size=70
Slot@120: Record
Slot@200: * next=50  size=100
Slot@300: Record
Slot@370: * next=200 size=60
Slot@430: Record
x The simplest placement strategy is first fit.
x With first fit, the spaces on the avail list are scanned in their logical order on the
avail list.
x The first space on the list which is big enough for a new record to be added is the
one used.
x The used space is delinked from the avail list, or, if the new record leaves unused
space, the new (smaller) space replaces the old space on the list.
x Adding a 70 byte record, only the first two entries on the list are checked:
Header: 370
Slot@50:  * next=-1  size=70
Slot@120: Record
Slot@200: * next=50  size=30   (remainder of the old 100-byte hole)
Slot@230: Record               (the new 70-byte record)
Slot@300: Record
Slot@370: * next=200 size=60
Slot@430: Record
x As records are deleted, the space can be added to the head of the list, as when the list
is managed as a stack.
Best Fit
x The best fit strategy leaves the smallest space left over when the new record is
added.
x There are two possible algorithms:
1. Manage deletions by adding the new record space to the head of the list, and
scan the entire list for record additions.
2. Manage the avail list as a sorted list; the first fit on the list will then be the
best fit.
x Best Fit, Sorted List:
Header: 370
Slot@50:  * next=200 size=70
Slot@120: Record
Slot@200: * next=-1  size=100
Slot@300: Record
Slot@370: * next=50  size=60
Slot@430: Record
x Adding a 65 byte record, only the first two entries on the list are checked:
Header: 370
Slot@50:  Record               (the new 65-byte record, in the 70-byte hole)
Slot@120: Record
Slot@200: * next=-1  size=100
Slot@300: Record
Slot@370: * next=200 size=60
Slot@430: Record
x Best Fit, Unsorted List:
Header: 200
Slot@50:  * next=370 size=70
Slot@120: Record
Slot@200: * next=50  size=100
Slot@300: Record
Slot@370: * next=-1  size=60
Slot@430: Record
x Adding a 65 byte record, all three entries on the list are checked:
Header: 200
Slot@50:  Record               (the new 65-byte record, in the 70-byte hole)
Slot@120: Record
Slot@200: * next=370 size=100
Slot@300: Record
Slot@370: * next=-1  size=60
Slot@430: Record
x The 65 byte record has been stored in a 70 byte space; rather than create a 5 byte
external fragment, which would be useless, the 5 byte excess has become internal
fragmentation within the record.
Worst Fit
x The worst fit strategy leaves the largest space left over when the new record is
added.
x The rationale is that the leftover space is most likely to be usable for another new
record addition.
key field
x The reference field of a secondary index can be a direct reference to the location of
the entry in the data file.
x The reference field of a secondary index can also be an indirect reference to the
location of the entry in the data file, through the primary key.
x Indirect secondary key references simplify updating of the file set.
x Indirect secondary key references increase access time.
7.7 Retrieval Using Combinations of Secondary Keys
x The search for records by multiple keys can be done on multiple indexes, with the
combination of index entries defining the records matching the key combination.
x If two keys are to be combined, a list of entries from each key index is retrieved.
x For an "or" combination of keys, the lists are merged.
x I.e., any entry found in either list matches the search.
x For an "and" combination of keys, the lists are matched.
x I.e., only entries found in both lists match the search.
7.8 Improving the Secondary Index Structure: Inverted Lists
inverted list
An index in which the reference field is the head pointer of a linked list of reference
items.
7.10 Binding
binding
The association of a symbol with a value.
locality
A condition in which items that are accessed close together in time are also stored
physically close together.
UNIT-4
Cosequential Processing and the Sorting of Large Files
8.1 An Object Oriented Model for Implementing Cosequential Processes
cosequential operations
Operations which involve accessing two or more input files sequentially and in parallel,
resulting in one or more output files produced by the combination of the input data.
8.1.1 Considerations for Cosequential Algorithms
x Initialization - What has to be set up for the main loop to work correctly.
x Getting the next item on each list - This should be simple to do from within the
main algorithm.
x Synchronization - Progress of access in the lists should be coordinated.
x Handling End-Of-File conditions - For a match, processing can stop when the end of
any list is reached.
x Recognizing Errors - Items out of sequence can "break" the synchronization.
8.1.2 Matching Names in Two Lists
match
The process of forming a list containing all items common to two or more lists.
8.1.3 Cosequential Match Algorithm
x Initialize (open the input and output files.)
x Get the first item from each list.
x While there is more to do:
o Compare the current items from each list.
o If the items are equal,
Process the item.
Get the next item from each list.
Set more to true iff neither of the lists is at end-of-file.
o If the item from list A is less than the item from list B,
Get the next item from list A.
Set more to true iff list A is not at end-of-file.
o If the item from list A is more than the item from list B,
Get the next item from list B.
Set more to true iff list B is not at end-of-file.
x Finalize (close the files.)
8.1.4 Cosequential Match Code
void Match (char * InputName1,
            char * InputName2,
            char * OutputName) {
    /* Local Declarations */
    OrderedFile Input1;
    OrderedFile Input2;
    OrderedFile Output;
    int Item1;
    int Item2;
    int more;

    /* Initialization */
    cout << "Data Matched:" << endl;
    Input1.open (InputName1, ios::in);
    Input2.open (InputName2, ios::in);
    Output.open (OutputName, ios::out);

    /* Algorithm */
    Input1 >> Item1;
    Input2 >> Item2;
    more = Input1.good() && Input2.good();
    while (more) {
        cout << Item1 << ':' << Item2 << " => " << flush; /* DEMO only */
        if (Item1 < Item2) {
            Input1 >> Item1;
            cout << '\n'; /* DEMO only */
        } else if (Item1 > Item2) {
            Input2 >> Item2;
            cout << '\n'; /* DEMO only */
        } else {
            Output << Item1 << endl;
            cout << Item1 << endl; /* DEMO only */
            Input1 >> Item1;
            Input2 >> Item2;
        }
        more = Input1.good() && Input2.good();
    }

    /* Finalization */
    Input1.close ();
    Input2.close ();
    Output.close ();
}
8.1.5 OrderedFile Class Declaration
#include <climits>
#include <fstream>
#include <iostream>
using namespace std;

class OrderedFile : public fstream {
public:
    void open (char * Name, ios::openmode Mode);
    int good (void);
    OrderedFile & operator >> (int & Item);
private:
    int last;
    static int HighValue;
    static int LowValue;
};
8.1.6 OrderedFile Class Implementation
int OrderedFile::HighValue = INT_MAX;
int OrderedFile::LowValue = INT_MIN;

void OrderedFile :: open (char * Name, ios::openmode Mode) {
    fstream :: open (Name, Mode);
    last = LowValue;
}

OrderedFile & OrderedFile :: operator >> (int & Item) {
    fstream::operator >> (Item);
    if (eof()) {
        Item = HighValue;
    } else if (Item < last) {
        Item = HighValue;
        cerr << "Sequence Error\n";
    }
    last = Item;
    return *this;
}

int OrderedFile :: good () {
    return fstream::good() && (last != HighValue);
}
8.1.7 Match main Function
int main (int, char * []) {
    /* Local Declarations */
    char OutputName [50];
    char InputName2 [50];
    char InputName1 [50];

    /* Initialization */
    cout << "Input name 1? ";
    cin >> InputName1;
    cout << "Input name 2? ";
    cin >> InputName2;
    cout << "Output name? ";
    cin >> OutputName;

    /* Algorithm */
    Match (InputName1, InputName2, OutputName);

    /* Report to system */
    return 0;
}
8.1.8 Merging Two Lists
merge
The process of forming a list containing all items in any of two or more lists.
8.1.9 Cosequential Merge Algorithm
x Initialize (open the input and output files.)
x Get the first item from each list.
x While there is more to do:
o Compare the current items from each list.
o If the items are equal,
Process (output) the item once.
Get the next item from each list.
o If the item from list A is less than the item from list B,
Process the item from list A, and get the next item from list A.
o If the item from list A is more than the item from list B,
Process the item from list B, and get the next item from list B.
o Set more to true iff either list is not at end-of-file.
x Finalize (close the files.)
x The main function for Merge parallels the Match main function above, ending with:

    /* Algorithm */
    Merge (InputName1, InputName2, OutputName);

    /* Report to system */
    return 0;
}
8.1.12 General Cosequential Algorithm
x Initialize (open the input and output files.)
x Get the first item from each list
x While there is more to do:
o Compare the current items from each list
o Based on the comparison, appropriately process one or all items.
o Get the next item or items from the appropriate list or lists.
o Based on whether there were more items, determine if there is more to
do.
x Finalize (close the files.)
high value
A value greater than any valid key.
low value
A value less than any valid key.
sequence checking
Verification of correct order.
synchronization loop
The main loop of the cosequential processing model, which is responsible for
synchronizing inputs.
8.1.13 Cosequential Algorithm Summary
x Two or more input files are to be processed in a parallel fashion to produce one or
more output files.
o In some cases, an output file may be the same as one of the input files.
x Each file is sorted on one or more key fields, and all files are ordered in the same
way on the same fields.
o It is not necessary that all files have the same record structure.
x In some cases, a HighValue must exist which is greater than all valid key values, and
a LowValue must exist which is less than all valid key values.
run
In a merge sort, an ordered subset of the data.
8.5 Heaps
8.5.1 Logical Structure of a Heap
x A (min) heap is a complete binary tree in which each node's key is less than or
equal to the keys of its children; stored in an array, the children of the node at
position i are at positions 2i and 2i+1.
x Building a heap by insertion: each new key is added as the last leaf and sifted up.
The array contents after each step:

Start with R:   R
Insert L:       R L
Sift L up:      L R
Insert C:       L R C
Sift C up:      C R L
Insert A:       C R L A
Sift A up:      A C L R
Insert H:       A C L R H
Sift H:         A C L R H      (already in place)
Insert V:       A C L R H V
Sift V:         A C L R H V    (already in place)
Insert E:       A C L R H V E
Sift E up:      A C E R H V L
x Algorithm:
o Build a heap from the data.
o While there are items in the heap:
Remove the root from the heap.
Replace the root with the last leaf of the heap.
Sift the root down to restore the array to a heap.
8.5.2 Removing Items from a Heap
x The array contents after each step:

Initial heap:   A C E R H V L
Remove A:       _ C E R H V L  (the root is removed)
Move up L:      L C E R H V    (the last leaf replaces the root)
Sift L down:    C H E R L V
Remove C:       _ H E R L V
Move up V:      V H E R L
Sift V down:    E H V R L
Open a run
While the heap is not empty:
    Remove the root record from the heap
    Write the record to the run
Close the run
x This process will produce equal size runs of n records each.
x For an input file of N records, there will be N/n runs.
8.8 Using Replacement Selection for the Distribution Phase of a Merge Sort
replacement selection
An algorithm for creating the initial runs of a mergesort which is based on a heapsort,
which adds new records to the heap when possible, to lengthen the runs.
x Start with the array full and heaped. In the snapshots below, keys to the left of
the bar are in the heap; keys to the right are held back for the next run:

Array [A C E R H]        Run: (empty)
x Read J from the input; J > A, so output A and add J to the heap:
Array [C H E R J]        Run: A
x Read B from the input; B < C, so output C and place B just past the heap:
Array [E H J R | B]      Run: AC
x Read T from the input; T > E, so output E and add T to the heap:
Array [H R J T | B]      Run: ACE
x Read L from the input; L > H, so output H and add L to the heap:
Array [J L T R | B]      Run: ACEH
x Read D from the input; D < J, so output J and place D just past the heap:
Array [L R T | D B]      Run: ACEHJ
x Read V from the input; V > L, so output L and add V to the heap:
Array [R T V | D B]      Run: ACEHJL
x Read K from the input; K < R, so output R and place K just past the heap:
Array [T V | K D B]      Run: ACEHJLR
x Read N from the input; N < T, so output T and place N just past the heap:
Array [V | N K D B]      Run: ACEHJLRT
x Read Q from the input; Q < V, so output V and place Q just past the (now empty)
heap:
Array [ | Q N K D B]     Run: ACEHJLRTV
x Repeat: rebuild the heap from the held-back keys and begin a new run.
8.8.2 Advantages of Replacement Sort over Heapsort
x Replacement sort runs will be, on average, twice as long as heapsort runs, using the
same size array.
x There will be half as many replacement sort runs, on average, as heapsort runs,
using the same size array.
8.9 Merging as a Way of Sorting Large Files on Disk
8.9.1 Single-Step Merging
k-way merge
A merge of order k.
order of a merge
The number of input lists being merged.
x If the distribution phase creates k runs, a single k-way merge can be used to produce
the final sorted file.
x A significant amount of seeking is used by a k-way merge, assuming the input runs
are on the same disk.
8.9.2 Multistep Merging
multistep merge
A merge which is carried out in two or more stages, with the output of one stage being
the input to the next stage.
x A multistep merge increases the number of times each record will be read and
written.
x Using a multistep merge can decrease the number of seeks, and reduce the overall
merge time.
x When sorting with tape, multiple runs are placed in a single file.
balanced merge
A multistep merge which uses the same number of output files as input files.
int main () {
    Transaction Entry;
    Transactions Journal;
    Account CurrentAccount;
    Accounts Ledger;
    Summary Report;
    bool More;

    Ledger.Open ("ledger.txt");
    Journal.Open ("sorted.txt");
    Report.Open ("summary.txt");

    Journal >> Entry;                  /* prime the loop with the first */
    Ledger >> CurrentAccount;          /* transaction and account      */
    Report.Head (CurrentAccount);
    More = Journal.good () && Ledger.good ();

    while (More) {
        if (Entry.Account == CurrentAccount.Number) {
            CurrentAccount.Post (Entry);
            Report << Entry;
            Journal >> Entry;
            More = Journal.good ();
        } else if (Entry.Account > CurrentAccount.Number) {
            Report.Foot (CurrentAccount);
            Ledger.Update (CurrentAccount);
            Ledger >> CurrentAccount;
            if (Ledger.good ()) {
                CurrentAccount.Extend (Entry);
                Report.Head (CurrentAccount);
            } else {
                More = false;
            }
        } else { /* Entry.Account < CurrentAccount.Number */
            Report.Error (Entry);
            Journal >> Entry;
            More = Journal.good ();
        }
    }
    Report.Foot (CurrentAccount);
    Journal.Close ();
    Ledger.Close ();
    Report.Close ();
    return 0;
}
Source Code - transactions.h
#include "Account.h"
Source Code - account.cpp
#include "Account.h"
UNIT-5
Multilevel Indexing and B-Trees
9.1 Introduction: The Invention of B-Trees
9.2 Statement of the Problem
x Searching an index must be faster than binary searching.
x Inserting and deleting must be as fast as searching.
9.3 Indexing with Binary Search Trees
binary tree
A tree in which each node has at most two children.
leaf
A node at the lowest level of a tree.
height-balanced tree
A tree in which the difference between the heights of subtrees is limited.
x Binary trees grow from the top down: new nodes are added as new leaves.
x Binary trees become unbalanced as new nodes are added.
x The imbalance of a binary tree depends on the order in which nodes are added and
deleted.
x Worst case search with a balanced binary tree is log2 (N + 1) compares.
x Worst case search with a completely unbalanced binary tree is N compares.
9.3.1 AVL Trees
AVL Tree
A binary tree which maintains height balance (to HB(1)) by means of localized
reorganizations of the nodes.
x Balancing a paged binary tree can involve rotations across pages, involving physical
movement of nodes.
9.3.3 Problems with Paged Binary Trees
9.4 Multilevel Indexing: A Better Approach to Tree Indexes
UNIT-6
Indexed Sequential File Access and Prefix B+Trees
10.1 Indexed Sequential Access
indexed sequential access
Access which can be either indexed or sequential.
variable order
Order which does not always have the same value.
shortest separator
The shortest value which can be compared to a key to determine the proper block of an
index.
x The index entries determine which page of the sequence set may contain the search
key.
10.5 The Simple Prefix B+Tree
simple prefix B+ tree
A B+ tree in which the index set contains simple prefix separators.
UNIT-7
Hashing
11.1 Introduction
x Key driven file access should be O(1) - that is, the time to access a record should be a
constant which does not vary with the size of the dataset.
x Indexing can be regarded as a table driven function which translates a key to a numeric
location.
x Hashing can be regarded as a computation driven function which translates a key to a
numeric location.
hashing
The transformation of a search key into a number by means of mathematical calculations.
randomize
To transform in an apparently random way.
x Hashing uses a repeatable pseudorandom function.
x The hashing function should produce a uniform distribution of hash values.
uniform distribution
A randomization in which each value in a range has an equal probability.
x For each key, the result of the hashing function is used as the home address of the record.
home address
The address produced by the hashing of a record key.
x Under ideal conditions, hashing provides O(1) key driven file access.
Combined methods
x Practical hashing functions often combine techniques.
x Example:
Key = 123-45-6789

    123
    456
  + 789
  -----
   1368

1368 % 11 = 4
h(123-45-6789) = 4
x For non-numeric keys, the key is simply treated as though it were a number, using its
internal binary representation.
x Example:
Key = "Kemp"
"Kemp" = 4B656D70 (hexadecimal) = 1264938352 (decimal)
11.3 Hashing Functions and Record Distributions
x The size of the key space is typically much larger than the space of hashed values.
x This means that more than one key will map to the same hash value.
Collisions
x synonyms
Keys which hash to the same value.
x collision
An attempt to store a record at an address which does not have sufficient room.
x packing density
The ratio of used space to allocated space.
x For simple hashing, the probability of a synonym is the same as the packing density.
Open Chaining
x Open chaining forms a linked list, or chain, of synonyms.
x The overflow records can be kept in the same file as the hash table itself:
x The overflow records can be kept in a separate file:
Scatter Tables
x If all records are moved into a separate "overflow" area, with only links being left in the
hash table, the result is a scatter table.
x A scatter table is smaller than an index for the same data.
UNIT-8
Extendible Hashing
12.2 How Extendible Hashing Works
Tries
x trie
A search tree in which the child of each node is determined by successive characters
of the key.
x An alphabetic (radix 26) trie potentially has one child node for each letter of the alphabet.
x A decimal (radix 10) trie has up to 10 children for each node.
x The trie can be shortened by the use of buckets.
x The bucket distribution can be balanced by the use of hashing.
Key       Hash
5554321   100111001
5550123   10111010
5541234   100111100
5551234   1011110
3217654   100111101
1237654   10011011
5557654   101110011
1234567   1101001
Linear Hashing
linear hashing
An application of hashing in which the address space is extended by one bucket each
time an overflow occurs.