Unit 5 Os

The document summarizes the structure of magnetic disks and solid-state disks used for mass storage. It discusses the basic components of magnetic disks including platters, tracks, sectors, read-write heads and cylinders. It describes how disks are accessed, including seek time, rotational latency and transfer rates. Solid-state disks are also introduced as an alternative storage technology that uses flash memory and has no moving parts, making them faster but more expensive than magnetic disks. Traditional head-sector-cylinder addressing is replaced by linear block addressing on modern disks.

Uploaded by

Atharva Shasane

21CS211 – Operating Systems – Unit 5

10. Mass-Storage Structure

10.1 Overview of Mass-Storage Structure

10.1.1 Magnetic Disks

 Traditional magnetic disks have the following basic structure:
o One or more platters in the form of disks covered with magnetic
media. Hard disk platters are made of rigid metal, while "floppy" disks are
made of more flexible plastic.
o Each platter has two working surfaces. Older hard disk drives would
sometimes not use the very top or bottom surface of a stack of platters, as
these surfaces were more susceptible to potential damage.
o Each working surface is divided into a number of concentric rings
called tracks. The collection of all tracks that are the same distance from
the edge of the platter, ( i.e. all tracks immediately above one another in
the following diagram ) is called a cylinder.
o Each track is further divided into sectors, traditionally containing 512
bytes of data each, although some modern disks occasionally use larger
sector sizes. ( Sectors also include a header and a trailer, including
checksum information among other things. Larger sector sizes reduce the
fraction of the disk consumed by headers and trailers, but increase internal
fragmentation and the amount of disk that must be marked bad in the case
of errors. )
o The data on a hard drive is read by read-write heads. The standard
configuration ( shown below ) uses one head per surface, each on a
separate arm, and controlled by a common arm assembly which moves all
heads simultaneously from one cylinder to another. ( Other configurations,
including independent read-write heads, may speed up disk access, but
involve serious technical difficulties. )
o The storage capacity of a traditional disk drive is equal to the number of
heads ( i.e. the number of working surfaces ), times the number of tracks
per surface, times the number of sectors per track, times the number of
bytes per sector. A particular physical block of data is specified by
providing the head-sector-cylinder number at which it is located.
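
The capacity formula above can be checked with a few lines of Python. This is only a minimal sketch: the function name and the example geometry ( 16 heads, 1024 tracks per surface, 63 sectors per track ) are illustrative assumptions, not values taken from these notes.

```python
def disk_capacity_bytes(heads, tracks_per_surface, sectors_per_track,
                        bytes_per_sector=512):
    """Capacity = heads x tracks per surface x sectors per track x bytes per sector."""
    return heads * tracks_per_surface * sectors_per_track * bytes_per_sector


# Hypothetical geometry, chosen only for illustration.
capacity = disk_capacity_bytes(heads=16, tracks_per_surface=1024,
                               sectors_per_track=63)
print(capacity)            # 528482304 bytes
print(capacity / 2**20)    # 504.0 MiB
```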

Figure 10.1 - Moving-head disk mechanism.

 In operation the disk rotates at high speed, such as 7200 rpm ( 120 revolutions per
second. ) The time needed to transfer data from the disk to the computer is made
up of several steps:
o The positioning time, a.k.a. the seek time or random access time is the
time required to move the heads from one cylinder to another, and for the
heads to settle down after the move. This is typically the slowest step in
the process and the predominant bottleneck to overall transfer rates.
o The rotational latency is the amount of time required for the desired
sector to rotate around and come under the read-write head. This can range
anywhere from zero to one full revolution, and on the average will equal
one-half revolution. This is another physical step and is usually the second
slowest step behind seek time. ( For a disk rotating at 7200 rpm, the
average rotational latency would be 1/2 revolution / 120 revolutions per
second, or just over 4 milliseconds, a long time by computer standards. )
o The transfer rate, which determines the time required to move the data
electronically from the disk to the computer. ( Some authors may also use
the term transfer rate to refer to the overall transfer rate, including seek
time and rotational latency as well as the electronic data transfer rate. )
 Disk heads "fly" over the surface on a very thin cushion of air. If they should
accidentally contact the disk, then a head crash occurs, which may or may not
permanently damage the disk or even destroy it completely. For this reason it is
normal to park the disk heads when turning a computer off, which means to move
the heads off the disk or to an area of the disk where there is no data stored.
 Floppy disks are normally removable. Hard drives can also be removable, and
some are even hot-swappable, meaning they can be removed while the computer
is running, and a new hard drive inserted in their place.
 Disk drives are connected to the computer via a cable known as the I/O
Bus. Some of the common interface formats include Enhanced Integrated Drive
Electronics, EIDE; Advanced Technology Attachment, ATA; Serial ATA, SATA,
Universal Serial Bus, USB; Fiber Channel, FC, and Small Computer Systems
Interface, SCSI.
 The host controller is at the computer end of the I/O bus, and the disk
controller is built into the disk itself. The CPU issues commands to the host
controller via I/O ports. Data is transferred between the magnetic surface and
onboard cache by the disk controller, and then the data is transferred from that
cache to the host controller and the motherboard memory at electronic speeds.
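
The rotational-latency arithmetic given earlier in this section ( half a revolution at 7200 rpm ) can be reproduced as follows. This is a rough sketch; the function names are made up for illustration, and the seek and transfer times passed to the second function are assumed example values, not measurements.

```python
def average_rotational_latency_ms(rpm):
    """Average latency is the time for one-half revolution, in milliseconds."""
    revolutions_per_second = rpm / 60
    seconds_per_revolution = 1 / revolutions_per_second
    return 0.5 * seconds_per_revolution * 1000


def average_access_time_ms(seek_ms, rpm, transfer_ms):
    """Sum of the three steps: seek, rotational latency, electronic transfer."""
    return seek_ms + average_rotational_latency_ms(rpm) + transfer_ms


print(round(average_rotational_latency_ms(7200), 2))   # 4.17

# Assumed example values: 9 ms seek, 0.1 ms transfer.
print(round(average_access_time_ms(9.0, 7200, 0.1), 2))
```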

10.1.2 Solid-State Disks

 As technologies improve and economics change, old technologies are often used
in different ways. One example of this is the increasing use of solid state disks,
or SSDs.
 SSDs use memory technology as a small fast hard disk. Specific implementations
may use either flash memory or DRAM chips protected by a battery to sustain the
information through power cycles.
 Because SSDs have no moving parts they are much faster than traditional hard
drives, and certain problems such as the scheduling of disk accesses simply do not
apply.
 However SSDs also have their weaknesses: They are more expensive than hard
drives, generally not as large, and may have shorter life spans.
 SSDs are especially useful as a high-speed cache of hard-disk information that
must be accessed quickly. One example is to store filesystem meta-data, e.g.
directory and inode information, that must be accessed quickly and often. Another
variation is a boot disk containing the OS and some application executables, but
no vital user data. SSDs are also used in laptops to make them smaller, faster, and
lighter.
 Because SSDs are so much faster than traditional hard disks, the throughput of the
bus can become a limiting factor, causing some SSDs to be connected directly to
the system PCI bus for example.

10.1.3 Magnetic Tapes

 Magnetic tapes were once used for common secondary storage before the days of
hard disk drives, but today are used primarily for backups.
 Accessing a particular spot on a magnetic tape can be slow, but once reading or
writing commences, access speeds are comparable to disk drives.
 Capacities of tape drives can range from 20 to 200 GB, and compression can
double that capacity.

10.2 Disk Structure

 The traditional head-sector-cylinder, HSC numbers are mapped to linear block addresses
by numbering the first sector on the first head on the outermost track as sector 0.
Numbering proceeds with the rest of the sectors on that same track, and then the rest of
the tracks on the same cylinder before proceeding through the rest of the cylinders to the
center of the disk. In modern practice these linear block addresses are used in place of the
HSC numbers for a variety of reasons:
1. The linear length of tracks near the outer edge of the disk is much longer than for
those tracks located near the center, and therefore it is possible to squeeze many
more sectors onto outer tracks than onto inner ones.
2. All disks have some bad sectors, and therefore disks maintain a few spare sectors
that can be used in place of the bad ones. The mapping of spare sectors to bad
sectors is managed internally by the disk controller.
3. Modern hard drives can have thousands of cylinders, and hundreds of sectors per
track on their outermost tracks. These numbers exceed the range of HSC numbers
for many ( older ) operating systems, and therefore disks can be configured for
any convenient combination of HSC values that falls within the total number of
sectors physically on the drive.
 There is a limit to how closely individual bits can be packed on a physical medium,
but that limit keeps rising as technological advances are made.
 Modern disks pack many more sectors into outer cylinders than inner ones, using one of
two approaches:
o With Constant Linear Velocity, CLV, the density of bits is uniform from cylinder
to cylinder. Because there are more sectors in outer cylinders, the disk spins
slower when reading those cylinders, causing the rate of bits passing under the
read-write head to remain constant. This is the approach used by modern CDs and
DVDs.
o With Constant Angular Velocity, CAV, the disk rotates at a constant angular
speed, with the bit density decreasing on outer cylinders. ( These disks would
have a constant number of sectors per track on all cylinders. )
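
The numbering scheme described at the start of this section ( sectors within a track first, then tracks within a cylinder, then cylinders ) can be sketched as below. This is a simplified illustration that assumes the same number of sectors on every track and zero-based sector numbers, as in the description above; real drives vary sectors per track and remap bad sectors, so the controller's actual mapping is not this simple. The function names are hypothetical.

```python
def hsc_to_lba(cylinder, head, sector, heads_per_cylinder, sectors_per_track):
    """Map a ( cylinder, head, sector ) triple to a linear block address."""
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + sector


def lba_to_hsc(lba, heads_per_cylinder, sectors_per_track):
    """Inverse mapping: recover ( cylinder, head, sector ) from a block address."""
    cylinder, rest = divmod(lba, heads_per_cylinder * sectors_per_track)
    head, sector = divmod(rest, sectors_per_track)
    return cylinder, head, sector


# Hypothetical geometry: 4 heads, 32 sectors per track.
print(hsc_to_lba(0, 0, 0, 4, 32))   # 0   ( first sector, first head, outermost track )
print(hsc_to_lba(0, 1, 0, 4, 32))   # 32  ( next track on the same cylinder )
print(hsc_to_lba(1, 0, 0, 4, 32))   # 128 ( first track of the next cylinder )
print(lba_to_hsc(129, 4, 32))       # (1, 0, 1)
```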

10.3 Disk Attachment

Disk drives can be attached either directly to a particular host ( a local disk ) or to a network.

10.3.1 Host-Attached Storage

 Local disks are accessed through I/O Ports as described earlier.
 The most common interfaces are IDE or ATA, each of which allows up to two
drives per host controller.
 SATA is similar with simpler cabling.
 High end workstations or other systems in need of a larger number of disks
typically use SCSI disks:
o The SCSI standard supports up to 16 targets on each SCSI bus, one of
which is generally the host adapter and the other 15 of which can be disk
or tape drives.
o A SCSI target is usually a single drive, but the standard also supports up to
8 units within each target. These would generally be used for accessing
individual disks within a RAID array. ( See below. )
o The SCSI standard also supports multiple host adapters in a single
computer, i.e. multiple SCSI busses.
o Modern advancements in SCSI include "fast" and "wide" versions, as well
as SCSI-2.
o SCSI cables may be either 50 or 68 conductors. SCSI devices may be
external as well as internal.
o See Wikipedia for more information on the SCSI interface.
 FC is a high-speed serial architecture that can operate over optical fiber or four-
conductor copper wires, and has two variants:
o A large switched fabric having a 24-bit address space. This variant allows
for multiple devices and multiple hosts to interconnect, forming the basis
for the storage-area networks, SANs, to be discussed in a future section.
o The arbitrated loop, FC-AL, that can address up to 126 devices ( drives
and controllers. )

10.3.2 Network-Attached Storage

 Network attached storage connects storage devices to computers using a remote
procedure call, RPC, interface, typically with something like NFS filesystem
mounts. This is convenient for allowing several computers in a group common
access and naming conventions for shared storage.
 NAS can be implemented using SCSI cabling, or with iSCSI, which uses Internet
protocols and standard network connections, allowing long-distance remote access
to shared files.
 NAS allows computers to easily share data storage, but tends to be less efficient
than standard host-attached storage.

Figure 10.2 - Network-attached storage.

10.3.3 Storage-Area Network

 A Storage-Area Network, SAN, connects computers and storage devices in a
network, using storage protocols instead of network protocols.
 One advantage of this is that storage access does not tie up regular networking
bandwidth.
 SAN is very flexible and dynamic, allowing hosts and devices to attach and
detach on the fly.
 SAN is also controllable, allowing restricted access to certain hosts and devices.

Figure 10.3 - Storage-area network.

10.4 Disk Scheduling

 As mentioned earlier, disk transfer speeds are limited primarily by seek
times and rotational latency. When multiple requests are to be processed there is also
some inherent delay in waiting for other requests to be processed.
 Bandwidth is measured by the amount of data transferred divided by the total amount of
time from the first request being made to the last transfer being completed, ( for a series
of disk requests. )
 Both bandwidth and access time can be improved by processing requests in a good order.
 Disk requests include the disk address, memory address, number of sectors to transfer,
and whether the request is for reading or writing.

10.4.1 FCFS Scheduling

 First-Come First-Serve is simple and intrinsically fair, but not very efficient.
Consider in the following sequence the wild swing from cylinder 122 to 14 and
then back to 124:

Figure 10.4 - FCFS disk scheduling.
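
A minimal sketch of how total head movement under FCFS can be computed. The request queue itself is not reproduced in these notes; the sketch assumes the usual textbook example ( head starting at cylinder 53, requests for 98, 183, 37, 122, 14, 124, 65, 67 ), which includes the 122 to 14 to 124 swing mentioned above. The function name is hypothetical.

```python
def fcfs_head_movement(start, requests):
    """Total cylinders travelled when requests are served in arrival order."""
    total = 0
    position = start
    for cylinder in requests:
        total += abs(cylinder - position)
        position = cylinder
    return total


# Assumed textbook queue; not stated explicitly in these notes.
QUEUE = [98, 183, 37, 122, 14, 124, 65, 67]
print(fcfs_head_movement(53, QUEUE))   # 640
```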

10.4.2 SSTF Scheduling

 Shortest Seek Time First scheduling is more efficient, but may lead to starvation
if a constant stream of requests arrives for the same general area of the disk.
 SSTF reduces the total head movement to 236 cylinders, down from 640 required
for the same set of requests under FCFS. Note, however that the distance could be
reduced still further to 208 by starting with 37 and then 14 first before processing
the rest of the requests.

Figure 10.5 - SSTF disk scheduling.
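
The SSTF figures quoted above can be reproduced with a short greedy loop. As before, this is only a sketch: it assumes the textbook queue ( head at 53, requests 98, 183, 37, 122, 14, 124, 65, 67 ), and the function names are illustrative.

```python
def sstf_order(start, requests):
    """Serve whichever pending request is closest to the current head position."""
    pending = list(requests)
    position = start
    order = []
    while pending:
        nearest = min(pending, key=lambda c: abs(c - position))
        pending.remove(nearest)
        order.append(nearest)
        position = nearest
    return order


def head_movement(start, order):
    """Total cylinders travelled when serving requests in the given order."""
    total, position = 0, start
    for cylinder in order:
        total += abs(cylinder - position)
        position = cylinder
    return total


QUEUE = [98, 183, 37, 122, 14, 124, 65, 67]   # assumed textbook queue
order = sstf_order(53, QUEUE)
print(order)                       # [65, 67, 37, 14, 98, 122, 124, 183]
print(head_movement(53, order))    # 236

# Serving 37 and 14 first, then sweeping upward, is shorter still:
print(head_movement(53, [37, 14, 65, 67, 98, 122, 124, 183]))   # 208
```

The second result shows that SSTF is greedy rather than optimal.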

10.4.3 SCAN Scheduling

 The SCAN algorithm, a.k.a. the elevator algorithm moves back and forth from
one end of the disk to the other, similarly to an elevator processing requests in a
tall building.

Figure 10.6 - SCAN disk scheduling.

 Under the SCAN algorithm, if a request arrives just ahead of the moving head
then it will be processed right away, but if it arrives just after the head has passed,
then it will have to wait for the head to pass going the other way on the return trip.
This leads to a fairly wide variation in access times which can be improved upon.
 Consider, for example, when the head reaches the high end of the disk: Requests
with high cylinder numbers just missed the passing head, which means they are
all fairly recent requests, whereas requests with low numbers may have been
waiting for a much longer time. Making the return scan from high to low then
ends up accessing recent requests first and making older requests wait that much
longer.
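
A sketch of the SCAN sweep, under stated assumptions: cylinders are numbered 0 to 199, the head starts at 53 with the textbook queue used in the earlier examples, and the head travels to the edge of the disk only when requests remain on the other side. All names are hypothetical.

```python
def scan_path(start, requests, direction="down", lowest=0, highest=199):
    """Cylinders visited under SCAN, including the turnaround at the disk edge."""
    if direction == "down":
        first = sorted((c for c in requests if c <= start), reverse=True)
        second = sorted(c for c in requests if c > start)
        edge = lowest
    else:
        first = sorted(c for c in requests if c >= start)
        second = sorted((c for c in requests if c < start), reverse=True)
        edge = highest
    path = list(first)
    if second:
        path.append(edge)      # run to the end of the disk, then reverse
        path.extend(second)
    return path


def head_movement(start, path):
    """Total cylinders travelled along the given path."""
    total, position = 0, start
    for cylinder in path:
        total += abs(cylinder - position)
        position = cylinder
    return total


QUEUE = [98, 183, 37, 122, 14, 124, 65, 67]   # assumed textbook queue
path = scan_path(53, QUEUE, direction="down")
print(path)                      # [37, 14, 0, 65, 67, 98, 122, 124, 183]
print(head_movement(53, path))   # 236
```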

10.4.4 C-SCAN Scheduling

 The Circular-SCAN algorithm improves upon SCAN by treating all requests in a
circular queue fashion - Once the head reaches the end of the disk, it returns to the
other end without processing any requests, and then starts again from the
beginning of the disk:

Figure 10.7 - C-SCAN disk scheduling.
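
C-SCAN can be sketched the same way. The assumptions are again cylinders numbered 0 to 199, an upward sweep, and the textbook queue with the head at 53; whether the return sweep counts toward head movement is a matter of convention, and this sketch counts it. The function names are illustrative.

```python
def cscan_path(start, requests, lowest=0, highest=199):
    """Cylinders visited under C-SCAN: sweep up, jump back to the start, sweep up again."""
    ahead = sorted(c for c in requests if c >= start)
    behind = sorted(c for c in requests if c < start)
    path = list(ahead)
    if behind:
        path += [highest, lowest]   # run to the end, then return without servicing
        path += behind
    return path


def head_movement(start, path):
    """Total cylinders travelled along the given path ( return sweep included )."""
    total, position = 0, start
    for cylinder in path:
        total += abs(cylinder - position)
        position = cylinder
    return total


QUEUE = [98, 183, 37, 122, 14, 124, 65, 67]   # assumed textbook queue
path = cscan_path(53, QUEUE)
print(path)                      # [65, 67, 98, 122, 124, 183, 199, 0, 14, 37]
print(head_movement(53, path))   # 382
```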

10.4.5 LOOK Scheduling

 LOOK scheduling improves upon SCAN by looking ahead at the queue of
pending requests, and not moving the heads any farther towards the end of the
disk than is necessary. The following diagram illustrates the circular form of
LOOK:

Figure 10.8 - C-LOOK disk scheduling.
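
A sketch of C-LOOK for comparison. It differs from C-SCAN only in that the head reverses at the last pending request instead of the physical end of the disk. The queue and starting cylinder are the assumed textbook example ( head at 53 ), and the function names are hypothetical.

```python
def clook_order(start, requests):
    """Serve requests at or above the head in ascending order, then wrap to the lowest."""
    ahead = sorted(c for c in requests if c >= start)
    behind = sorted(c for c in requests if c < start)
    return ahead + behind


def head_movement(start, order):
    """Total cylinders travelled, counting the jump back to the lowest request."""
    total, position = 0, start
    for cylinder in order:
        total += abs(cylinder - position)
        position = cylinder
    return total


QUEUE = [98, 183, 37, 122, 14, 124, 65, 67]   # assumed textbook queue
order = clook_order(53, QUEUE)
print(order)                      # [65, 67, 98, 122, 124, 183, 14, 37]
print(head_movement(53, order))   # 322
```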

10.4.6 Selection of a Disk-Scheduling Algorithm

 With very low loads all algorithms are equal, since there will normally only be
one request to process at a time.
 For slightly larger loads, SSTF offers better performance than FCFS, but may lead
to starvation when loads become heavy enough.
 For busier systems, SCAN and LOOK algorithms eliminate starvation problems.
 The actual optimal algorithm may be something even more complex than those
discussed here, but the incremental improvements are generally not worth the
additional overhead.
 Some improvement to overall filesystem access times can be made by intelligent
placement of directory and/or inode information. If those structures are placed in
the middle of the disk instead of at the beginning of the disk, then the maximum
distance from those structures to data blocks is reduced to only one-half of the
disk size. If those structures can be further distributed and furthermore have their
data blocks stored as close as possible to the corresponding directory structures,
then that reduces still further the overall time to find the disk block numbers and
then access the corresponding data blocks.
 On modern disks the rotational latency can be almost as significant as the seek
time. However, it is not within the OS's control to account for that, because
modern disks do not reveal their internal sector mapping schemes, ( particularly
when bad blocks have been remapped to spare sectors. )
o Some disk manufacturers provide for disk scheduling algorithms directly
on their disk controllers, ( which do know the actual geometry of the disk
as well as any remapping ), so that if a series of requests are sent from the
computer to the controller then those requests can be processed in an
optimal order.
o Unfortunately there are some considerations that the OS must take into
account that are beyond the abilities of the on-board disk-scheduling
algorithms, such as priorities of some requests over others, or the need to
process certain requests in a particular order. For this reason OSes may
elect to spoon-feed requests to the disk controller one at a time in certain
situations.

10.5 Disk Management

10.5.1 Disk Formatting

 Before a disk can be used, it has to be low-level formatted, which means laying
down all of the headers and trailers marking the beginning and ends of each
sector. Included in the header and trailer are the linear sector numbers, and error-
correcting codes, ECC, which allow damaged sectors to not only be detected, but
in many cases for the damaged data to be recovered ( depending on the extent of
the damage. ) Sector sizes are traditionally 512 bytes, but may be larger,
particularly in larger drives.
 ECC calculation is performed with every disk read or write, and if damage is
detected but the data is recoverable, then a soft error has occurred. Soft errors are
generally handled by the on-board disk controller, and never seen by the OS. ( See
below. )
 Once the disk is low-level formatted, the next step is to partition the drive into one
or more separate partitions. This step must be completed even if the disk is to be
used as a single large partition, so that the partition table can be written to the
beginning of the disk.
 After partitioning, the filesystems must be logically formatted, which
involves laying down the master directory information ( FAT table or inode
structure ), initializing free lists, and creating at least the root directory of the
filesystem. ( Disk partitions which are to be used as raw devices are not logically
formatted. This saves the overhead and disk space of the filesystem structure, but
requires that the application program manage its own disk storage requirements. )

10.5.2 Boot Block

 Computer ROM contains a bootstrap program ( OS independent ) with just
enough code to find the first sector on the first hard drive on the first controller,
load that sector into memory, and transfer control over to it. ( The ROM bootstrap
program may look in floppy and/or CD drives before accessing the hard drive,
and is smart enough to recognize whether it has found valid boot code or not. )
 The first sector on the hard drive is known as the Master Boot Record, MBR, and
contains a very small amount of code in addition to the partition table. The
partition table documents how the disk is partitioned into logical disks, and
indicates specifically which partition is the active or boot partition.
 The boot program then looks to the active partition to find an operating system,
possibly loading up a slightly larger / more advanced boot program along the way.
 In a dual-boot ( or larger multi-boot ) system, the user may be given a choice of
which operating system to boot, with a default action to be taken in the event of
no response within some time frame.
 Once the kernel is found by the boot program, it is loaded into memory and then
control is transferred over to the OS. The kernel will normally continue the boot
process by initializing all important kernel data structures, launching important
system services ( e.g. network daemons, sched, init, etc. ), and finally providing
one or more login prompts. Boot options at this stage may include single-
user a.k.a. maintenance or safe modes, in which very few system services are
started - These modes are designed for system administrators to repair problems
or otherwise maintain the system.
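
To make the Master Boot Record description above more concrete, here is a rough sketch that reads the partition table out of a 512-byte sector. It relies on the classic MBR layout ( four 16-byte entries starting at byte 446, an active flag of 0x80, and the signature bytes 0x55 0xAA at offset 510 ). The function names and the demo sector built at the end are made up for illustration; no real disk is read.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446
ENTRY_FORMAT = "<B3sB3sII"    # status, CHS start, type, CHS end, first LBA, sector count
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)   # 16 bytes
BOOT_SIGNATURE = b"\x55\xaa"


def parse_mbr(sector):
    """Return the non-empty partition entries found in an MBR sector."""
    if len(sector) != SECTOR_SIZE or bytes(sector[510:512]) != BOOT_SIGNATURE:
        raise ValueError("not a valid MBR sector")
    partitions = []
    for index in range(4):
        offset = PARTITION_TABLE_OFFSET + index * ENTRY_SIZE
        status, _, ptype, _, first_lba, count = struct.unpack_from(
            ENTRY_FORMAT, sector, offset)
        if ptype == 0:            # type 0 marks an unused entry
            continue
        partitions.append({
            "index": index,
            "active": status == 0x80,   # the boot ( active ) partition
            "type": ptype,
            "first_lba": first_lba,
            "sectors": count,
        })
    return partitions


def build_demo_mbr():
    """Build a fake MBR with one active partition ( hypothetical values )."""
    sector = bytearray(SECTOR_SIZE)
    struct.pack_into(ENTRY_FORMAT, sector, PARTITION_TABLE_OFFSET,
                     0x80, b"\x00" * 3, 0x83, b"\x00" * 3, 2048, 204800)
    sector[510:512] = BOOT_SIGNATURE
    return bytes(sector)


for entry in parse_mbr(build_demo_mbr()):
    print(entry)
```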

Figure 10.9 - Booting from disk in Windows 2000.

10.5.3 Bad Blocks

 No disk can be manufactured to 100% perfection, and all physical objects wear
out over time. For these reasons all disks are shipped with a few bad blocks, and
additional blocks can be expected to go bad slowly over time. If a large number of
blocks go bad then the entire disk will need to be replaced, but a few here and
there can be handled through other means.
 In the old days, bad blocks had to be checked for manually. Formatting of the disk
or running certain disk-analysis tools would identify bad blocks, and attempt to
read the data off of them one last time through repeated tries. Then the bad blocks
would be mapped out and taken out of future service. Sometimes the data could
be recovered, and sometimes it was lost forever. ( Disk analysis tools could be
either destructive or non-destructive. )
 Modern disk controllers make much better use of the error-correcting codes, so
that bad blocks can be detected earlier and the data usually recovered. ( Recall
that blocks are tested with every write as well as with every read, so often errors
can be detected before the write operation is complete, and the data simply written
to a different sector instead. )
 Note that re-mapping of sectors from their normal linear progression can throw
off the disk scheduling optimization of the OS, especially if the replacement
sector is physically far away from the sector it is replacing. For this reason most
disks normally keep a few spare sectors on each cylinder, as well as at least one
spare cylinder. Whenever possible a bad sector will be mapped to another sector
on the same cylinder, or at least a cylinder as close as possible. Sector
slipping may also be performed, in which all sectors between the bad sector and
the replacement sector are moved down by one, so that the linear progression of
sector numbers can be maintained.
 If the data on a bad block cannot be recovered, then a hard error has occurred,
which requires replacing the file(s) from backups, or rebuilding them from
scratch.

10.6 Swap-Space Management

 Modern systems typically swap out pages as needed, rather than swapping out entire
processes. Hence the swapping system is part of the virtual memory management system.
 Managing swap space is obviously an important task for modern OSes.

10.6.1 Swap-Space Use

 The amount of swap space needed by an OS varies greatly according to how it is
used. Some systems require an amount equal to physical RAM; some want a
multiple of that; some want an amount equal to the amount by which virtual
memory exceeds physical RAM, and some systems use little or none at all!
 Some systems support multiple swap spaces on separate disks in order to speed up
the virtual memory system.

10.6.2 Swap-Space Location

Swap space can be physically located in one of two locations:

 As a large file which is part of the regular filesystem. This is easy to
implement, but inefficient. Not only must the swap space be accessed
through the directory system, the file is also subject to fragmentation
issues. Caching the block location helps in finding the physical blocks, but
that is not a complete fix.
 As a raw partition, possibly on a separate or little-used disk. This allows
the OS more control over swap space management, which is usually faster
and more efficient. Fragmentation of swap space is generally not a big
issue, as the space is re-initialized every time the system is rebooted. The
downside of keeping swap space on a raw partition is that it can only be
grown by repartitioning the hard drive.

10.6.3 Swap-Space Management: An Example

 Historically OSes swapped out entire processes as needed. Modern systems swap
out only individual pages, and only as needed. ( For example process code blocks
and other blocks that have not been changed since they were originally loaded are
normally just freed from the virtual memory system rather than copying them to
swap space, because it is faster to go find them again in the filesystem and read
them back in from there than to write them out to swap space and then read them
back. )
 In the mapping system shown below for Linux systems, a map of swap space is
kept in memory, where each entry corresponds to a 4K block in the swap space.
Zeros indicate free slots and non-zeros refer to how many processes have a
mapping to that particular block ( >1 for shared pages only. )

Figure 10.10 - The data structures for swapping on Linux systems.
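
The swap map described above is essentially an array of use counts. The following is a toy sketch of that idea only; the class and method names are invented for illustration and do not correspond to the actual Linux kernel data structures.

```python
class SwapMap:
    """Use counts for swap slots: 0 means free, n > 0 means n processes map the slot."""

    def __init__(self, num_slots):
        self.counts = [0] * num_slots

    def allocate(self):
        """Claim the first free slot and return its index."""
        for slot, count in enumerate(self.counts):
            if count == 0:
                self.counts[slot] = 1
                return slot
        raise MemoryError("swap space exhausted")

    def share(self, slot):
        """Record one more process mapping an in-use slot ( shared pages )."""
        if self.counts[slot] == 0:
            raise ValueError("slot is free")
        self.counts[slot] += 1

    def release(self, slot):
        """Drop one mapping; the slot becomes free again when the count hits zero."""
        if self.counts[slot] == 0:
            raise ValueError("slot is already free")
        self.counts[slot] -= 1


swap_map = SwapMap(4)
slot = swap_map.allocate()
swap_map.share(slot)           # a second process maps the same page
print(swap_map.counts)         # [2, 0, 0, 0]
```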

10.7 RAID Structure

 The general idea behind RAID is to employ a group of hard drives together with some
form of duplication, either to increase reliability or to speed up operations, ( or sometimes
both. )
 RAID originally stood for Redundant Array of Inexpensive Disks, and was designed to
use a bunch of cheap small disks in place of one or two larger more expensive ones.
Today RAID systems employ large possibly expensive disks as their components,
switching the definition to Independent disks.

10.7.1 Improvement of Reliability via Redundancy

 The more disks a system has, the greater the likelihood that one of them will go
bad at any given time. Hence increasing the number of disks on a system
actually decreases the Mean Time To Failure, MTTF of the system.
 If, however, the same data was copied onto multiple disks, then the data would
not be lost unless both ( or all ) copies of the data were damaged simultaneously,
which is a MUCH lower probability than for a single disk going bad. More
specifically, the second disk would have to go bad before the first disk was
repaired, which brings the Mean Time To Repair into play. For example if two
disks were involved, each with a MTTF of 100,000 hours and a MTTR of 10
hours, then the Mean Time to Data Loss would be 500 * 10^6 hours, or 57,000
years!
 This is the basic idea behind disk mirroring, in which a system contains identical
data on two or more disks.
o Note that a power failure during a write operation could cause both disks
to contain corrupt data, if both disks were writing simultaneously at the
time of the power failure. One solution is to write to the two disks in
series, so that they will not both become corrupted ( at least not in the
same way ) by a power failure. An alternate solution involves
non-volatile RAM as a write cache, which is not lost in the event of a power
failure and which is protected by error-correcting codes.
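
The mean-time-to-data-loss figure quoted above for a mirrored pair can be reproduced with the usual approximation MTTF^2 / ( 2 * MTTR ), which assumes the two disks fail independently. The function name is illustrative.

```python
def mean_time_to_data_loss_hours(mttf_hours, mttr_hours):
    """Mirrored pair: data is lost only if the second disk fails during a repair."""
    return mttf_hours ** 2 / (2 * mttr_hours)


hours = mean_time_to_data_loss_hours(100_000, 10)
print(hours)                      # 500000000.0 hours
print(round(hours / (24 * 365)))  # 57078 years, i.e. roughly 57,000
```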

10.7.2 Improvement in Performance via Parallelism

 There is also a performance benefit to mirroring, particularly with respect to
reads. Since every block of data is duplicated on multiple disks, read operations
can be satisfied from any available copy, and multiple disks can be reading
different data blocks simultaneously in parallel. ( Writes could possibly be sped
up as well through careful scheduling algorithms, but it would be complicated in
practice. )
 Another way of improving disk access time is with striping, which basically
means spreading data out across multiple disks that can be accessed
simultaneously.
o With bit-level striping the bits of each byte are striped across multiple
disks. For example if 8 disks were involved, then each 8-bit byte would be
read in parallel by 8 heads on separate disks. A single disk read would
access 8 * 512 bytes = 4K worth of data in the time normally required to
read 512 bytes. Similarly if 4 disks were involved, then two bits of each
byte could be stored on each disk, for 2K worth of disk access per read or
write operation.
o Block-level striping spreads a filesystem across multiple disks on a block-
by-block basis, so if block N were located on disk 0, then block N + 1
would be on disk 1, and so on. This is particularly useful when filesystems
are accessed in clusters of physical blocks. Other striping possibilities
exist, with block-level striping being the most common.
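
Block-level striping as described above amounts to a simple modulo mapping. A minimal sketch, assuming logical block 0 lands on disk 0 and blocks are distributed round-robin; the function name is hypothetical.

```python
def locate_block(block_number, num_disks):
    """Return ( disk index, block offset on that disk ) for a logical block."""
    return block_number % num_disks, block_number // num_disks


# With 4 disks, consecutive logical blocks land on consecutive disks.
for block in range(6):
    print(block, locate_block(block, 4))
# 0 (0, 0)
# 1 (1, 0)
# 2 (2, 0)
# 3 (3, 0)
# 4 (0, 1)
# 5 (1, 1)
```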

10.7.3 RAID Levels

 Mirroring provides reliability but is expensive; Striping improves performance,
but does not improve reliability. Accordingly there are a number of different
schemes that combine the principles of mirroring and striping in different ways, in
order to balance reliability versus performance versus cost. These are described
by different RAID levels, as follows: ( In the diagram that follows, "C" indicates a
copy, and "P" indicates parity, i.e. checksum bits. )
1. RAID Level 0 - This level includes striping only, with no mirroring.
2. RAID Level 1 - This level includes mirroring only, no striping.
3. RAID Level 2 - This level stores error-correcting codes on additional disks,
allowing for any damaged data to be reconstructed by subtraction from the
remaining undamaged data. Note that this scheme requires only three extra
disks to protect 4 disks worth of data, as opposed to full mirroring. ( The
number of disks required is a function of the error-correcting algorithms,
and the means by which the particular bad bit(s) is(are) identified. )
4. RAID Level 3 - This level is similar to level 2, except that it takes
advantage of the fact that each disk is still doing its own error-detection,
so that when an error occurs, there is no question about which disk in the
array has the bad data. As a result a single parity bit is all that is needed to
recover the lost data from an array of disks. Level 3 also includes striping,
which improves performance. The downside with the parity approach is
that every disk must take part in every disk access, and the parity bits must
be constantly calculated and checked, reducing performance.
Hardware-level parity calculations and NVRAM cache can help with both of those
issues. In practice level 3 is greatly preferred over level 2.
5. RAID Level 4 - This level is similar to level 3, employing block-level
striping instead of bit-level striping. The benefits are that multiple blocks
can be read independently, and changes to a block only require writing
two blocks ( data and parity ) rather than involving all disks. Note that
new disks can be added seamlessly to the system provided they are
initialized to all zeros, as this does not affect the parity results.
6. RAID Level 5 - This level is similar to level 4, except the parity blocks are
distributed over all disks, thereby more evenly balancing the load on the
system. For any given block on the disk(s), one of the disks will hold the
parity information for that block and the other N-1 disks will hold the data.
Note that the same disk cannot hold both data and parity for the same
block, as both would be lost in the event of a disk crash.
7. RAID Level 6 - This level extends RAID level 5 by storing multiple bits of
error-recovery codes, ( such as the Reed-Solomon codes ), for each bit
position of data, rather than a single parity bit. In the example shown
below 2 bits of ECC are stored for every 4 bits of data, allowing data
recovery in the face of up to two simultaneous disk failures. Note that this
still involves only 50% increase in storage needs, as opposed to 100% for
simple mirroring which could only tolerate a single disk failure.
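
The single-parity idea behind levels 3 through 5 can be illustrated with a short sketch. It assumes parity is a byte-wise XOR across the data blocks of one stripe; the helper name xor_blocks and the four sample blocks are made up for illustration and do not come from any particular RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Four hypothetical data disks, each holding one 4-byte block of a stripe.
data = [b"\x10\x20\x30\x40", b"\x01\x02\x03\x04",
        b"\xaa\xbb\xcc\xdd", b"\x0f\x0e\x0d\x0c"]
parity = xor_blocks(data)

# Disk 2 fails. Because the failed disk is known, XOR-ing the surviving
# data blocks with the parity block reproduces the lost block.
survivors = [blk for i, blk in enumerate(data) if i != 2]
rebuilt = xor_blocks(survivors + [parity])

# Small write to disk 1 (levels 4 and 5): only the data block and the
# parity block change. New parity = old parity XOR old data XOR new data.
new_block = b"\xde\xad\xbe\xef"
new_parity = xor_blocks([parity, data[1], new_block])
```

The same XOR property explains why a new disk initialized to all zeros can be added without recomputing parity: XOR with zero leaves every parity byte unchanged.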

Figure 10.11 - RAID levels.

 There are also two RAID levels which combine RAID levels 0 and 1 ( striping
and mirroring ) in different combinations, designed to provide both performance
and reliability at the expense of increased cost.
o In RAID level 0 + 1, disks are first striped, and then the stripe set is
mirrored to another set. This level generally provides better performance
than RAID level 5.
o RAID level 1 + 0 mirrors disks in pairs, and then stripes the mirrored
pairs. The storage capacity, performance, etc. are all the same, but there is
an advantage to this approach in the event of multiple disk failures, as
illustrated below:
 In diagram (a) below, the 8 disks have been divided into two sets
of four, each of which is striped, and then one stripe set is used to
mirror the other set.
 If a single disk fails, it wipes out the entire stripe set, but
the system can keep on functioning using the remaining set.
 However if a second disk from the other stripe set now
fails, then the entire system is lost, as a result of two disk
failures.
 In diagram (b), the same 8 disks are divided into four sets of two,
each of which is mirrored, and then the file system is striped across
the four sets of mirrored disks.
 If a single disk fails, then that mirror set is reduced to a
single disk, but the system rolls on, and the other three
mirror sets continue mirroring.
 Now if a second disk fails, ( that is not the mirror of the
already failed disk ), then another one of the mirror sets is
reduced to a single disk, but the system can continue
without data loss.
 In fact the second arrangement could handle as many as
four simultaneously failed disks, as long as no two of them
were from the same mirror pair.
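
The difference between the two arrangements can be checked by brute force. The sketch below assumes the 8-disk layouts described above, with disks numbered 0 to 7 purely for illustration, and counts how many combinations of failed disks each arrangement survives.

```python
from itertools import combinations

DISKS = range(8)  # hypothetical numbering: disks 0..7

def survives_0_plus_1(failed):
    # Two stripe sets (0-3 and 4-7) mirror each other; a stripe set is
    # usable only if none of its disks has failed.
    set_a, set_b = set(range(0, 4)), set(range(4, 8))
    return not (failed & set_a) or not (failed & set_b)

def survives_1_plus_0(failed):
    # Four mirrored pairs, striped together; every pair must keep at
    # least one working disk.
    pairs = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]
    return all(pair - failed for pair in pairs)

def count_survivable(survives, n_failures):
    """Number of n-disk failure combinations the layout survives."""
    return sum(survives(set(c)) for c in combinations(DISKS, n_failures))

two_disk_cases = len(list(combinations(DISKS, 2)))
ok_01 = count_survivable(survives_0_plus_1, 2)
ok_10 = count_survivable(survives_1_plus_0, 2)
```

Under these assumptions, of the 28 possible two-disk failures, 0 + 1 survives 12 and 1 + 0 survives 24.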

Figure 10.12 - RAID 0 + 1 and 1 + 0

10.7.4 Selecting a RAID Level

 Trade-offs in selecting the optimal RAID level for a particular application include
cost, volume of data, need for reliability, need for performance, and rebuild time,
the latter of which can affect the likelihood that a second disk will fail while the
first failed disk is being rebuilt.
 Other decisions include how many disks are involved in a RAID set and how
many disks to protect with a single parity bit. More disks in the set increase
performance but increases cost. Protecting more disks per parity bit saves cost,
but increases the likelihood that a second disk will fail before the first bad disk is
repaired.

10.7.5 Extensions

 RAID concepts have been extended to tape drives ( e.g. striping tapes for faster
backups or parity checking tapes for reliability ), and for broadcasting of data.

10.7.6 Problems with RAID

 RAID protects against physical errors, but not against any number of bugs or
other errors that could write erroneous data.
 ZFS adds an extra level of protection by including data block checksums in all
inodes along with the pointers to the data blocks. If data are mirrored and one
copy has the correct checksum and the other does not, then the data with the bad
checksum will be replaced with a copy of the data with the good checksum. This
increases reliability greatly over RAID alone, at a cost of a performance hit that is
acceptable because ZFS is so fast to begin with.

Figure 10.13 - ZFS checksums all metadata and data.

 Another problem with traditional filesystems is that the sizes are fixed, and
relatively difficult to change. Where RAID sets are involved it becomes even
harder to adjust filesystem sizes, because a filesystem cannot span across multiple
volumes.
 ZFS solves these problems by pooling RAID sets, and by dynamically allocating
space to filesystems as needed. Filesystem sizes can be limited by quotas, and
space can also be reserved to guarantee that a filesystem will be able to grow
later, but these parameters can be changed at any time by the filesystem's owner.
Otherwise filesystems grow and shrink dynamically as needed.

Figure 10.14 - (a) Traditional volumes and file systems. (b) a ZFS pool and file systems.

10.8 Stable-Storage Implementation ( Optional )

 The concept of stable storage ( first presented in chapter 6 ) involves a storage medium in
which data is never lost, even in the face of equipment failure in the middle of a write
operation.
 To implement this requires two ( or more ) copies of the data, with separate failure
modes.
 An attempted disk write results in one of three possible outcomes:
1. The data is successfully and completely written.
2. The data is partially written, but not completely. The last block written may be
garbled.
3. No writing takes place at all.
 Whenever an equipment failure occurs during a write, the system must detect it, and
return the system back to a consistent state. To do this requires two physical blocks for
every logical block, and the following procedure:

1. Write the data to the first physical block.
2. After step 1 has completed, write the data to the second physical block.
3. Declare the operation complete only after both physical writes have completed
successfully.
 During recovery the pair of blocks is examined.
o If both blocks are identical and there is no sign of damage, then no further action
is necessary.
o If one block contains a detectable error but the other does not, then the damaged
block is replaced with the good copy. ( This will either undo the operation or
complete the operation, depending on which block is damaged and which is
undamaged. )
o If neither block shows damage but the data in the blocks differ, then replace the
data in the first block with the data in the second block. ( Undo the operation. )
Because the sequence of operations described above is slow, stable storage usually includes
NVRAM as a cache, and declares a write operation complete once it has been written to the
NVRAM.
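
The write and recovery procedure above can be sketched in a few lines. This is a toy model under stated assumptions: each physical block carries a CRC32 so that a garbled write is detectable, and the names Block, stable_write and recover are hypothetical. The case where both blocks are damaged is not handled.

```python
import zlib

class Block:
    """One physical block: payload plus a CRC used to detect a garbled write."""
    def __init__(self, data=b""):
        self.write(data)

    def write(self, data):
        self.data = data
        self.crc = zlib.crc32(data)

    def damaged(self):
        return zlib.crc32(self.data) != self.crc

def stable_write(first, second, data):
    first.write(data)    # step 1
    second.write(data)   # step 2, only after step 1 has completed
    return True          # step 3: complete only after both writes succeed

def recover(first, second):
    if first.damaged() and not second.damaged():
        first.write(second.data)    # undo: restore the old value
    elif second.damaged() and not first.damaged():
        second.write(first.data)    # complete the interrupted operation
    elif first.data != second.data:
        first.write(second.data)    # both clean but different: undo
```

For example, if the system crashes after step 1 but before step 2, both blocks pass their CRC check yet differ, and recover() restores the first block from the second.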

Summary

Tertiary-Storage Structure

 Primary storage refers to computer memory chips; secondary storage refers to fixed-disk
storage systems ( hard drives ); and tertiary storage refers to removable media, such as
tape drives, CDs, DVDs, and to a lesser extent floppies, thumb drives, and other
detachable devices.
 Tertiary storage is typically characterized by large capacity, low cost per MB, and slow
access times, although there are exceptions in any of these categories.
 Tertiary storage is typically used for backups and for long-term archival storage of
completed work. Another common use for tertiary storage is to swap large little-used files
( or groups of files ) off of the hard drive, and then swap them back in as needed in a
fashion similar to secondary storage providing swap space for primary storage.

12.9.1 Tertiary-Storage Devices

12.9.1.1 Removable Disks

 Removable magnetic disks ( e.g. floppies ) can be nearly as fast as hard
drives, but are at greater risk for damage due to scratches. Variations of
removable magnetic disks up to a GB or more in capacity have been
developed. ( Hot-swappable hard drives? )
 A magneto-optical disk uses a magnetic disk covered in a clear plastic
coating that protects the surface.
o The heads sit a considerable distance away from the magnetic
surface, and as a result do not have enough magnetic strength to
switch bits at normal room temperature.
o For writing, a laser is used to heat up a specific spot on the disk, to
a temperature at which the weak magnetic field of the write head is
able to flip the bits.
o For reading, a laser is shined at the disk, and the Kerr effect causes
the polarization of the light to become rotated either clockwise or
counter-clockwise depending on the orientation of the magnetic
field.
 Optical disks do not use magnetism at all, but instead use special materials
that can be altered ( by lasers ) to have relatively light or dark spots.
o For example the phase-change disk has a material that can be
frozen into either a crystalline or an amorphous state, the latter of
which is less transparent and reflects less light when a laser is
bounced off a reflective surface under the material.
 Three powers of lasers are used with phase-change disks:
(1) a low power laser is used to read the disk, without
affecting the materials, (2) a medium power erases the
disk, by melting and re-freezing the medium into a
crystalline state, and (3) a high power writes to the disk by
melting the medium and re-freezing it into the amorphous
state.
 The most common examples of these disks are re-
writable CD-RWs and DVD-RWs.
 An alternative to the disks described above is the Write-Once Read-Many
( WORM ) drive.
o The original version of WORM drives involved a thin layer of
aluminum sandwiched between two protective layers of glass or
plastic.
 Holes were burned in the aluminum to write bits.
 Because the holes could not be filled back in, there was no
way to re-write to the disk. ( Although data could be erased
by burning more holes. )
o WORM drives have important legal ramifications for data that
must be stored for a very long time and must be provable in court
as unaltered since it was originally written. ( Such as long-term
storage of medical records. )
o Modern CD-R and DVD-R disks are examples of WORM drives
that use organic polymer inks instead of an aluminum layer.
 Read-only disks are similar to WORM disks, except the bits are pressed
onto the disk at the factory, rather than being burned on one by one.

12.9.1.2 Tapes

 Tape drives typically cost more than disk drives, but the cost per MB of
the tapes themselves is lower.
 Tapes are typically used today for backups, and for enormous volumes of
data stored by certain scientific establishments. ( E.g. NASA's archive of
space probe and satellite imagery, which is currently being downloaded
from numerous sources faster than anyone can actually look at it. )
 Robotic tape changers move tapes from drives to archival tape libraries
upon demand.
 ( Never underestimate the bandwidth of a station wagon full of tapes
rolling down the highway! )

12.9.1.3 Future Technology

 Solid State Disks, SSDs, are becoming more and more popular.
 Holographic storage uses laser light to store images in a 3-D structure,
and the entire data structure can be transferred in a single flash of laser
light.
 Micro-Electronic Mechanical Systems, MEMS, employs the technology
used for computer chip fabrication to create VERY tiny little machines.
One example packs 10,000 read-write heads within a square centimeter of
space, and as media are passed over it, all 10,000 heads can read data in
parallel.

12.9.2 Operating-System Support

 The OS must provide support for tertiary storage as removable media, including
the support to transfer data between different systems.

12.9.2.1 Application Interface

 File systems are typically not stored on tapes. ( It might be technically
possible, but it is impractical. )
 Tapes are also not low-level formatted, and do not use fixed-length blocks.
Rather data is written to tapes in variable length blocks as needed.
 Tapes are normally accessed as raw devices, requiring each application to
determine how the data is to be stored and read back. Issues such as
header contents and ASCII versus binary encoding ( and byte-ordering )
are generally application specific.
 Basic operations supported for tapes include locate( ), read( ), write( ), and
read_position( ).
 Because of variable-length writes, writing to a tape erases all data that
follows that point on the tape.
o Writing to a tape places the End of Tape ( EOT ) marker at the end
of the data written.
o It is not possible to locate( ) to any spot past the EOT marker.
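
The tape semantics described above can be modelled with a small sketch. The class below is a toy, not a real tape driver interface: the method names follow the operations listed in the text, and everything else ( the Tape class, its list of blocks, and the exceptions raised ) is assumed for illustration.

```python
class Tape:
    """Toy model of a tape holding variable-length blocks (hypothetical API)."""

    def __init__(self):
        self.blocks = []   # the EOT marker sits just after the last block
        self.pos = 0       # index of the block under the head

    def locate(self, n):
        if not 0 <= n <= len(self.blocks):
            raise ValueError("cannot locate past the EOT marker")
        self.pos = n

    def read_position(self):
        return self.pos

    def read(self):
        if self.pos >= len(self.blocks):
            raise EOFError("head is at the EOT marker")
        block = self.blocks[self.pos]
        self.pos += 1
        return block

    def write(self, data):
        # Writing discards everything after this point and moves EOT here.
        del self.blocks[self.pos:]
        self.blocks.append(data)
        self.pos += 1
```

Writing three blocks, locating back to block 1, and writing again leaves only two blocks on the tape, because the write at position 1 discards the old blocks 1 and 2.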

12.9.2.2 File Naming

 File naming conventions for removable media are not guaranteed to be unique,
nor are they necessarily consistent between different systems. ( Two
removable disks may contain files with the same name, and there is
no clear way for the naming system to distinguish between them. )
 Fortunately music CDs have a common format, readable by all systems.
Data CDs and DVDs have only a few format choices, making it easy for a
system to support all known formats.

12.9.2.3 Hierarchical Storage Management

 Hierarchical storage involves extending file systems out onto tertiary
storage, swapping files from hard drives to tapes in much the same manner
as data blocks are swapped from memory to hard drives.
 A placeholder is generally left on the hard drive, storing information about
the particular tape ( or other removable media ) to which the file has been
swapped out.
 A robotic system transfers data to and from tertiary storage as needed,
generally automatically upon demand of the file(s) involved.

12.9.3 Performance Issues

12.9.3.1 Speed

 Sustained Bandwidth is the rate of data transfer during a large file
transfer, once the proper tape is loaded and the file located.
 Effective Bandwidth is the effective overall rate of data transfer, including
any overhead necessary to load the proper tape and find the file on the
tape.
 Access Latency is all of the accumulated waiting time before a file can be
actually read from tape. This includes the time it takes to find the file on
the tape, the time to load the tape from the tape library, and the time spent
waiting in the queue for the tape drive to become available.
 Clearly tertiary storage access is much slower than secondary access,
although removable disks ( e.g. a CD jukebox ) have somewhat faster
access than a tape library.
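
The relationship between the three measures above can be written as a one-line formula: effective bandwidth is the file size divided by the access latency plus the transfer time at the sustained rate. The sketch below assumes that simple model; the function name and the sample figures ( a 1 GB file, 100 MB/s sustained, 60 seconds of latency ) are made up for illustration.

```python
def effective_bandwidth(file_bytes, sustained_bw, access_latency):
    """Overall bytes per second, counting latency plus transfer time."""
    transfer_time = file_bytes / sustained_bw
    return file_bytes / (access_latency + transfer_time)

# Hypothetical tape library: 60 s to queue, load, and position the tape.
rate = effective_bandwidth(1e9, 100e6, 60)   # roughly 14.3 MB/s
```

With these numbers the drive spends 10 seconds transferring and 60 seconds waiting, so the effective rate is about one seventh of the sustained rate. Larger files amortize the latency better.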

12.9.3.2 Reliability

 Fixed hard drives are generally more reliable than removable drives,
because they are less susceptible to the environment.
 Optical disks are generally more reliable than magnetic media.
 A fixed hard drive crash can destroy all data, whereas an optical drive or
tape drive failure will often not harm the data media, ( and certainly can't
damage any media not in the drive at the time of the failure. )
 Tape drives are mechanical devices, and can wear out tapes over time, ( as
the tape head is generally in much closer physical contact with the tape
than disk heads are with platters. )
o Some drives may only be able to read tapes a few times whereas
other drives may be able to re-use the same tapes millions of times.
o Backup tapes should be read after writing, to verify that the backup
tape is readable. ( Unfortunately that may have been the LAST
time that particular tape was readable, and the only way to be sure
is to read it again, . . . )
o Long-term tape storage can cause degradation, as magnetic fields
"drift" from one layer of tape to the adjacent layers. Periodic fast-
forwarding and rewinding of tapes can help, by changing which
section of tape lies against which other layers.

12.9.3.3 Cost

 The cost per megabyte for removable media is its strongest selling feature,
particularly as the amount of storage involved ( i.e. the number of tapes,
CDs, etc ) increases.
 However the cost per megabyte for hard drives has dropped more rapidly
over the years than the cost of removable media, such that the currently
most cost-effective backup solution for many systems is simply an
additional ( external ) hard drive.
 ( One good use for old unwanted PCs is to put them on a network as a
backup server and/or print server. The downside to this backup solution is
that the backups are stored on-site with the original data, and a fire, flood,
or burglary could wipe out both the original data and the backups. )

Old Figure 12.15 - Price per megabyte of DRAM, from 1981 to 2008

Old Figure 12.16 - Price per megabyte of magnetic hard disk, from 1981 to 2008.

Old Figure 12.17 - Price per megabyte of a tape drive, from 1984 to 2008.

11. File-System Interface

11.1 File Concept

11.1.1 File Attributes

 Different OSes keep track of different file attributes, including:
o Name - Some systems give special significance to names, and particularly
extensions ( .exe, .txt, etc. ), and some do not. Some extensions may be of
significance to the OS ( .exe ), and others only to certain applications
( .jpg ).
o Identifier ( e.g. inode number )
o Type - Text, executable, other binary, etc.
o Location - on the hard drive.
o Size
o Protection
o Time & Date
o User ID

11.1.2 File Operations

 The file ADT supports many common operations:
o Creating a file
o Writing a file
o Reading a file
o Repositioning within a file
o Deleting a file
o Truncating a file.
 Most OSes require that files be opened before access and closed after all access is
complete. Normally the programmer must open and close files explicitly, but
some rare systems open the file automatically at first access. Information about
currently open files is stored in an open file table, containing for example:
o File pointer - records the current position in the file, for the next read or
write access.
o File-open count - How many times has the current file been opened
( simultaneously by different processes ) and not yet closed? When this
counter reaches zero the file can be removed from the table.
o Disk location of the file.
o Access rights
 Some systems provide support for file locking.
o A shared lock is for reading only.
o An exclusive lock is for writing as well as reading.
o An advisory lock is informational only, and not enforced. ( A "Keep Out"
sign, which may be ignored. )
o A mandatory lock is enforced. ( A truly locked door. )
o UNIX uses advisory locks, and Windows uses mandatory locks.
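
The open-file table bookkeeping described earlier in this section can be sketched as follows. This is a simplified model, not any real kernel's data structure: the class name, the dictionary layout, and the use of a per-open handle to carry the file pointer are all assumptions made for illustration.

```python
class OpenFileTable:
    """Toy open-file table keyed by file name (hypothetical layout)."""

    def __init__(self):
        self.entries = {}   # name -> {"open_count": ..., "disk_location": ...}

    def open(self, name, disk_location):
        entry = self.entries.setdefault(
            name, {"open_count": 0, "disk_location": disk_location})
        entry["open_count"] += 1
        # Each open gets its own file pointer, starting at offset 0.
        return {"name": name, "file_pointer": 0}

    def close(self, handle):
        entry = self.entries[handle["name"]]
        entry["open_count"] -= 1
        if entry["open_count"] == 0:
            # Last close: the entry can be removed from the table.
            del self.entries[handle["name"]]
```

Two opens of the same file share one table entry with an open count of 2; the entry disappears only after both handles have been closed.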

Figure 11.2 - File-locking example in Java.

11.1.3 File Types

 Windows ( and some other systems ) use special file extensions to indicate
the type of each file:

Figure 11.3 - Common file types.

 Macintosh stores a creator attribute for each file, according to the program that
first created it with the create( ) system call.
 UNIX stores magic numbers at the beginning of certain files. ( Experiment with
the "file" command, especially in directories such as /bin and /dev )
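
A minimal sketch of magic-number detection, in the spirit of the "file" command, is shown below. The four signatures listed are well-known ones, but the table is far from complete, and the function name guess_type and its descriptions are made up for illustration.

```python
# A few well-known signatures found at the start of files.
MAGIC_NUMBERS = [
    (b"\x7fELF", "ELF executable or object file"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"#!", "script with an interpreter line"),
]

def guess_type(first_bytes):
    """Classify a file from its leading bytes, ignoring its name."""
    for magic, description in MAGIC_NUMBERS:
        if first_bytes.startswith(magic):
            return description
    return "unknown"
```

A caller would typically read the first few bytes of a file opened in binary mode and pass them to guess_type( ).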

11.1.4 File Structure

 Some files contain an internal structure, which may or may not be known to the
OS.
 For the OS to support particular file formats increases the size and complexity of
the OS.
 UNIX treats all files as sequences of bytes, with no further consideration of the
internal structure. ( With the exception of executable binary programs, which it
must know how to load and find the first executable statement, etc. )
 Macintosh files have two forks - a resource fork, and a data fork. The resource
fork contains information relating to the UI, such as icons and button images, and
can be modified independently of the data fork, which contains the code or data as
appropriate.

11.1.5 Internal File Structure

 Disk files are accessed in units of physical blocks, typically 512 bytes or some
power-of-two multiple thereof. ( Larger physical disks use larger block sizes, to
keep the range of block numbers within the range of a 32-bit integer. )
 Internally files are organized in logical units, which may be as small as a
single byte, or may be a larger size corresponding to some data record or structure
size.
 The number of logical units which fit into one physical block determines
its packing, and has an impact on the amount of internal fragmentation ( wasted
space ) that occurs.
 As a general rule, half a physical block is wasted for each file, and the larger the
block sizes the more space is lost to internal fragmentation.
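
The "half a block per file" rule of thumb can be checked with a little arithmetic. The sketch below assumes a file always occupies a whole number of physical blocks; the function names and the sample sizes are chosen for illustration.

```python
def blocks_needed(file_size, block_size):
    """Whole blocks required to hold file_size bytes (ceiling division)."""
    return -(-file_size // block_size)

def internal_fragmentation(file_size, block_size):
    """Bytes allocated to the file but not used by it."""
    return blocks_needed(file_size, block_size) * block_size - file_size

# A 1300-byte file wastes little with 512-byte blocks and much more
# with 4096-byte blocks.
waste_512 = internal_fragmentation(1300, 512)
waste_4096 = internal_fragmentation(1300, 4096)

# Averaged over every file size from 1 to 512 bytes, the waste with
# 512-byte blocks comes to just under half a block.
average_waste = sum(internal_fragmentation(s, 512) for s in range(1, 513)) / 512
```

Here the 1300-byte file wastes 236 bytes with 512-byte blocks and 2796 bytes with 4096-byte blocks.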

11.2 Access Methods

11.2.1 Sequential Access

 A sequential access file emulates magnetic tape operation, and generally supports
a few operations:
o read next - read a record and advance the tape to the next position.
o write next - write a record and advance the tape to the next position.
o rewind
o skip n records - May or may not be supported. N may be limited to
positive numbers, or may be limited to +/- 1.

Figure 11.4 - Sequential-access file.

11.2.2 Direct Access

 Jump to any record and read that record. Operations supported include:
o read n - read record number n. ( Note an argument is now required. )
o write n - write record number n. ( Note an argument is now required. )
o jump to record n - could be 0 or the end of file.
o Query current record - used to return back to this record later.
o Sequential access can be easily emulated using direct access. The inverse
is complicated and inefficient.
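
Emulating sequential access on top of direct access only requires keeping a current-position variable. The sketch below assumes an in-memory list of records standing in for the file; the class names DirectFile and SequentialView and the variable cp are illustrative.

```python
class DirectFile:
    """Direct-access file: any record can be read or written by number."""

    def __init__(self, records):
        self.records = list(records)

    def read(self, n):
        return self.records[n]

    def write(self, n, record):
        if n == len(self.records):
            self.records.append(record)   # writing at the end extends the file
        else:
            self.records[n] = record

class SequentialView:
    """Sequential operations built from direct ones plus a position cp."""

    def __init__(self, direct_file):
        self.file = direct_file
        self.cp = 0

    def read_next(self):
        record = self.file.read(self.cp)
        self.cp += 1
        return record

    def write_next(self, record):
        self.file.write(self.cp, record)
        self.cp += 1

    def rewind(self):
        self.cp = 0
```

Each sequential operation is one direct operation followed by an increment of cp, which is why this direction of emulation is cheap.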

Figure 11.5 - Simulation of sequential access on a direct-access file.

11.2.3 Other Access Methods

 An indexed access scheme can be easily built on top of a direct access system.
Very large files may require a multi-tiered indexing scheme, i.e. indexes of
indexes.

Figure 11.6 - Example of index and relative files.

11.3 Directory Structure

11.3.1 Storage Structure

 A disk can be used in its entirety for a file system.
 Alternatively a physical disk can be broken up into multiple partitions, slices, or
mini-disks, each of which becomes a virtual disk and can have its own filesystem
( or be used for raw storage, swap space, etc. ).
 Or, multiple physical disks can be combined into one volume, i.e. a larger virtual
disk, with its own filesystem spanning the physical disks.

Figure 11.7 - A typical file-system organization.

11.3.2 Directory Overview

 Directory operations to be supported include:
o Search for a file
o Create a file - add to the directory
o Delete a file - erase from the directory
o List a directory - possibly ordered in different ways.
o Rename a file - may change sorting order
o Traverse the file system.

11.3.3 Single-Level Directory

 Simple to implement, but each file must have a unique name.

Figure 11.9 - Single-level directory.

11.3.4 Two-Level Directory

 Each user gets their own directory space.
 File names only need to be unique within a given user's directory.
 A master file directory is used to keep track of each user's directory, and must be
maintained when users are added to or removed from the system.
 A separate directory is generally needed for system ( executable ) files.
 Systems may or may not allow users to access other directories besides their own.
o If access to other directories is allowed, then provision must be made to
specify the directory being accessed.
o If access is denied, then special consideration must be made for users to
run programs located in system directories. A search path is the list of
directories in which to search for executable programs, and can be set
uniquely for each user.

Figure 11.10 - Two-level directory structure.

11.3.5 Tree-Structured Directories

 An obvious extension to the two-tiered directory structure, and the one with
which we are all most familiar.
 Each user / process has the concept of a current directory from which all
( relative ) searches take place.
 Files may be accessed using either absolute pathnames ( relative to the root of the
tree ) or relative pathnames ( relative to the current directory. )
 Directories are stored the same as any other file in the system, except there is a bit
that identifies them as directories, and they have some special structure that the
OS understands.
 One question for consideration is whether or not to allow the removal of
directories that are not empty - Windows requires that directories be emptied first,
and UNIX provides an option for deleting entire sub-trees.

Figure 11.11 - Tree-structured directory structure.

11.3.6 Acyclic-Graph Directories

 When the same files need to be accessed in more than one place in the directory
structure ( e.g. because they are being shared by more than one user / process ), it
can be useful to provide an acyclic-graph structure. ( Note the directed arcs from
parent to child. )
 UNIX provides two types of links for implementing the acyclic-graph structure. (
See "man ln" for more details. )
o A hard link ( usually just called a link ) involves multiple directory entries
that all refer to the same file. Hard links are only valid for ordinary files
in the same filesystem.
o A symbolic link involves a special file, containing information about
where to find the linked file. Symbolic links may be used to link
directories and/or files in other filesystems, as well as ordinary files in the
current filesystem.
 Windows only supports symbolic links, termed shortcuts.
 Hard links require a reference count, or link count for each file, keeping track of
how many directory entries are currently referring to this file. Whenever one of
the references is removed the link count is reduced, and when it reaches zero, the
disk space can be reclaimed.
 For symbolic links there is some question as to what to do with the symbolic links
when the original file is moved or deleted:
o One option is to find all the symbolic links and adjust them also.
o Another is to leave the symbolic links dangling, and discover that they are
no longer valid the next time they are used.
o What if the original file is removed, and replaced with another file having
the same name before the symbolic link is next used?
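
The link-count and dangling-link behaviour described above can be observed directly. The sketch below assumes a POSIX-like system where the temporary directory supports both hard and symbolic links; the file names are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original.txt")
    hard = os.path.join(d, "hard.txt")
    soft = os.path.join(d, "soft.txt")

    with open(original, "w") as f:
        f.write("shared data\n")

    os.link(original, hard)      # second directory entry for the same file
    os.symlink(original, soft)   # special file that stores the path
    links_after_link = os.stat(original).st_nlink

    os.remove(original)          # drops one reference; the data survives
    links_after_remove = os.stat(hard).st_nlink
    with open(hard) as f:
        hard_contents = f.read()

    soft_is_link = os.path.islink(soft)    # the link itself still exists
    soft_resolves = os.path.exists(soft)   # follows the link to its target
```

After the original name is removed, the hard link still reaches the data with a link count of 1, while the symbolic link is left dangling: it still exists as a link but no longer resolves to a file.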

Figure 11.12 - Acyclic-graph directory structure.

11.3.7 General Graph Directory

 If cycles are allowed in the graphs, then several problems can arise:
o Search algorithms can go into infinite loops. One solution is to not follow
links in search algorithms. ( Or not to follow symbolic links, and to only
allow symbolic links to refer to directories. )
o Sub-trees can become disconnected from the rest of the tree and still not
have their reference counts reduced to zero. Periodic garbage collection is
required to detect and resolve this problem. ( chkdsk in DOS and fsck in
UNIX search for these problems, among others, even though cycles are
not supposed to be allowed in either system. Disconnected disk blocks that
are not marked as free are added back to the file systems with made-up file
names, and can usually be safely deleted. )
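
An alternative way to keep a search from looping forever is to remember which directories have already been visited. The sketch below assumes a directory graph stored as nested dictionaries, where each directory maps entry names to node identifiers; this representation, the node names, and the function find are all hypothetical.

```python
def find(graph, root, target):
    """Return paths of entries named target, visiting each directory once."""
    found, visited, stack = [], set(), [(root, "")]
    while stack:
        node, path = stack.pop()
        if node in visited:
            continue          # already searched: this is what breaks cycles
        visited.add(node)
        for name, child in graph.get(node, {}).items():
            child_path = path + "/" + name
            if name == target:
                found.append(child_path)
            if child in graph:            # child is itself a directory
                stack.append((child, child_path))
    return found

# Directory A contains a link back to the root, forming a cycle.
graph = {
    "root": {"a": "A", "b": "B"},
    "A": {"loop": "root", "x.txt": "file1"},
    "B": {"x.txt": "file2"},
}
paths = sorted(find(graph, "root", "x.txt"))
```

Without the visited set, the search would descend through /a/loop/a/loop/... indefinitely.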

Figure 11.13 - General graph directory.

11.4 File-System Mounting

 The basic idea behind mounting file systems is to combine multiple file systems into one
large tree structure.
 The mount command is given a filesystem to mount and a mount point ( directory ) on
which to attach it.
 Once a file system is mounted onto a mount point, any further references to that directory
actually refer to the root of the mounted file system.
 Any files ( or sub-directories ) that had been stored in the mount point directory prior to
mounting the new filesystem are now hidden by the mounted filesystem, and are no
longer available. For this reason some systems only allow mounting onto empty
directories.
 Filesystems can only be mounted by root, unless root has previously configured certain
filesystems to be mountable onto certain pre-determined mount points. ( E.g. root may
allow users to mount floppy filesystems to /mnt or something like it. ) Anyone can run
the mount command to see what filesystems are currently mounted.
 Filesystems may be mounted read-only, or have other restrictions imposed.
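
How a path is routed to a mounted filesystem can be sketched with a longest-prefix match against a mount table. The table contents, the filesystem names, and the function resolve below are assumptions made for illustration, not the interface of any real mount implementation.

```python
import posixpath

def resolve(mount_table, path):
    """Return (filesystem, path within it) using the longest matching mount point."""
    path = posixpath.normpath(path)
    best = max(
        (mp for mp in mount_table
         if path == mp or path.startswith(mp.rstrip("/") + "/")),
        key=len,
    )
    relative = posixpath.relpath(path, best)
    return mount_table[best], "/" if relative == "." else "/" + relative

# Hypothetical mount table: a second filesystem is mounted on /users.
mounts = {"/": "rootfs", "/users": "homefs"}
```

Because every path under /users now resolves to the mounted filesystem, anything the root filesystem previously stored in its own /users directory can no longer be reached, which is the hiding effect described above.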

Figure 11.14 - File system. (a) Existing system. (b) Unmounted volume.

Figure 11.15 - Mount point.

 The traditional Windows OS runs an extended two-tier directory structure, where the first
tier of the structure separates volumes by drive letters, and a tree structure is implemented
below that level.
 Macintosh runs a similar system, where each new volume that is found is
automatically mounted and added to the desktop.
 More recent Windows systems allow filesystems to be mounted to any directory in the
filesystem, much like UNIX.

11.5 File Sharing

11.5.1 Multiple Users

 On a multi-user system, more information needs to be stored for each file:
o The owner ( user ) who owns the file, and who can control its access.
o The group of other user IDs that may have some special access to the file.
o What access rights are afforded to the owner ( User ), the Group, and to
the rest of the world ( the universe, a.k.a. Others. )
o Some systems have more complicated access control, allowing or denying
specific accesses to specifically named users or groups.
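
The owner / group / others scheme can be sketched as a simple check against permission bits. The sketch assumes the classic UNIX rule that exactly one class applies to a request ( owner first, then group, then others ) and ignores the superuser and access control lists; the function name may_access and its parameters are hypothetical.

```python
import stat

def may_access(mode, file_uid, file_gid, uid, gids, want):
    """want is "r", "w", or "x"; exactly one permission class applies."""
    owner_bit, group_bit, other_bit = {
        "r": (stat.S_IRUSR, stat.S_IRGRP, stat.S_IROTH),
        "w": (stat.S_IWUSR, stat.S_IWGRP, stat.S_IWOTH),
        "x": (stat.S_IXUSR, stat.S_IXGRP, stat.S_IXOTH),
    }[want]
    if uid == file_uid:
        return bool(mode & owner_bit)
    if file_gid in gids:
        return bool(mode & group_bit)
    return bool(mode & other_bit)

# Mode 0o640: owner may read and write, group may read, others get nothing.
mode = 0o640
```

With mode 0o640, a user in the file's group can read the file but not write it, and a user who is neither the owner nor a group member is denied all access.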

11.5.2 Remote File Systems

 The advent of the Internet introduces issues for accessing files stored on remote
computers:
o The original method was ftp, allowing individual files to be transported
across systems as needed. Ftp can be either account and password
controlled, or anonymous, not requiring any user name or password.
o Various forms of distributed file systems allow remote file systems to be
mounted onto a local directory structure, and accessed using normal file
access commands. ( The actual files are still transported across the
network as needed, possibly using ftp as the underlying transport
mechanism. )
o The WWW has made it easy once again to access files on remote systems
without mounting their filesystems, generally using ( anonymous ) ftp as
the underlying file transport mechanism.

11.5.2.1 The Client-Server Model

 When one computer system remotely mounts a filesystem that is
physically located on another system, the system which physically owns
the files acts as a server, and the system which mounts them is the client.
 User IDs and group IDs must be consistent across both systems for the
system to work properly. ( I.e. this is most applicable across multiple
computers managed by the same organization, shared by a common group
of users. )
 The same computer can be both a client and a server. ( E.g. cross-linked
file systems. )
 There are a number of security concerns involved in this model:
o Servers commonly restrict mount permission to certain trusted
systems only. Spoofing ( a computer pretending to be a different
computer ) is a potential security risk.
o Servers may restrict remote access to read-only.
o Servers restrict which filesystems may be remotely mounted.
Generally the information within those filesystems is limited,
relatively public, and protected by frequent backups.
 The NFS ( Network File System ) is a classic example of such a system.

11.5.2.2 Distributed Information Systems

 The Domain Name System, DNS, provides for a unique naming system
across all of the Internet.
 Sun's Network Information Service, NIS ( originally called yellow pages ),
centralizes user names, host names, printer information, and similar data, but
unfortunately has several security issues. NIS+ is a more secure
version, but has not yet gained the same widespread acceptance as NIS.
 Microsoft's Common Internet File System, CIFS, establishes a network
login for each user on a networked system with shared file access. Older
Windows systems used domains, and newer systems ( XP, 2000 ),
use active directories. User names must match across the network for this
system to be valid.
 A newer approach is the Lightweight Directory-Access Protocol,
LDAP, which provides a secure single sign-on for all users to access all
resources on a network. This is a secure system which is gaining in
popularity, and which has the maintenance advantage of combining
authorization information in one central location.

11.5.2.3 Failure Modes

 When a local disk file is unavailable, the result is generally known
immediately, and is generally non-recoverable. The only reasonable
response is for the request to fail.
 However when a remote file is unavailable, there are many possible
reasons, and whether or not it is unrecoverable is not readily apparent.
Hence most remote access systems allow for blocking or delayed
response, in the hopes that the remote system ( or the network ) will come
back up eventually.

11.5.3 Consistency Semantics

 Consistency Semantics deals with the consistency between the views of shared
files on a networked system. When one user changes the file, when do other users
see the changes?
 At first glance this appears to have all of the synchronization issues discussed in
Chapter 6. Unfortunately the long delays involved in network operations prohibit
the use of atomic operations as discussed in that chapter.

11.5.3.1 UNIX Semantics

 The UNIX file system uses the following semantics:


o Writes to an open file are immediately visible to any other user
who has the file open.
o One implementation uses a shared location pointer, which is
adjusted for all sharing users.
 The file is associated with a single exclusive physical resource, which may
delay some accesses.

11.5.3.2 Session Semantics

 The Andrew File System, AFS, uses the following semantics:


o Writes to an open file are not immediately visible to other users.
o When a file is closed, any changes made become available only to
users who open the file at a later time.
 According to these semantics, a file can be associated with multiple
( possibly different ) views. Almost no constraints are imposed on
scheduling accesses. No user is delayed in reading or writing their
personal copy of the file.
 AFS file systems may be accessible by systems around the world. Access
control is maintained through ( somewhat ) complicated access control
lists, which may grant access to the entire world ( literally ) or to
specifically named users accessing the files from specifically named
remote environments.
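
The difference between the two sets of semantics can be illustrated with a toy model of session semantics. This is only a sketch: the class and method names ( SessionFileServer, Session ) are hypothetical and are not part of AFS. Each open takes a private snapshot, and a close publishes the writer's copy for sessions opened afterwards.

```python
class SessionFileServer:
    """Toy file server with session semantics (hypothetical names)."""

    def __init__(self, contents=""):
        self.contents = contents  # the copy that newly opened sessions will see

    def open(self):
        return Session(self)


class Session:
    def __init__(self, server):
        self.server = server
        self.copy = server.contents  # private snapshot taken at open time

    def read(self):
        return self.copy

    def write(self, text):
        self.copy = text  # visible only inside this session

    def close(self):
        self.server.contents = self.copy  # published for later opens


server = SessionFileServer("v1")
a = server.open()
b = server.open()
a.write("v2")
b_before_close = b.read()  # b still sees its own snapshot
a.close()
b_after_close = b.read()   # unchanged: b opened before a closed
c = server.open()
c_view = c.read()          # c opened after the close, so it sees a's changes
```

Under UNIX semantics, by contrast, b would have seen "v2" as soon as a wrote it.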

11.5.3.3 Immutable-Shared-Files Semantics

 Under this system, when a file is declared as shared by its creator, it
becomes immutable and the name cannot be re-used for any other
resource. Hence it becomes read-only, and shared access is simple.

11.6 Protection

 Files must be kept safe for reliability ( against accidental damage ), and protection
( against deliberate malicious access. ) The former is usually managed with backup
copies. This section discusses the latter.
 One simple protection scheme is to remove all access to a file. However this makes the
file unusable, so some sort of controlled access must be arranged.

11.6.1 Types of Access

 The following low-level operations are often controlled:


o Read - View the contents of the file.
o Write - Change the contents of the file.
o Execute - Load the file into memory and have the CPU follow the
instructions contained therein.
o Append - Add to the end of an existing file.
o Delete - Remove a file from the system.
o List - View the name and other attributes of files on the system.
 Higher-level operations, such as copy, can generally be performed through
combinations of the above.

11.6.2 Access Control

 One approach is to have complicated Access Control Lists, ACL, which specify
exactly what access is allowed or denied for specific users or groups.
o The AFS uses this system for distributed access.
o Control is very finely adjustable, but may be complicated, particularly
when the specific users involved are unknown. ( AFS allows some wild
cards, so for example all users on a certain remote system may be trusted,
or a given username may be trusted when accessing from any remote
system. )
 UNIX uses a set of 9 access control bits, in three groups of three. These
correspond to R, W, and X permissions for each of the Owner, Group, and Others.
( See "man chmod" for full details. ) The RWX bits control the following
privileges for ordinary files and directories:

bit   Files                                 Directories

R     Read ( view ) file contents.          Read directory contents. Required to get a
                                            listing of the directory.
W     Write ( change ) file contents.       Change directory contents. Required to
                                            create or delete files.
X     Execute file contents as a program.   Access detailed directory information.
                                            Required to get a long listing, or to
                                            access any specific file in the directory.
                                            Note that if a user has X but not R
                                            permissions on a directory, they can still
                                            access specific files, but only if they
                                            already know the name of the file they are
                                            trying to access.

 In addition there are some special bits that can also be applied:
o The set user ID ( SUID ) bit and/or the set group ID ( SGID ) bits applied
to executable files temporarily change the identity of whoever runs the
program to match that of the owner / group of the executable program.
This allows users running specific programs to have access to files ( while
running that program ) which they would normally be unable to
access. Only the file's owner or root can set these two bits, and it must be
done with caution, as it introduces a potential security leak.
o The sticky bit on a directory modifies write permission, allowing users to
only delete files for which they are the owner. This allows everyone to
create files in /tmp, for example, but to only delete files which they have
created, and not anyone else's.
o The SUID, SGID, and sticky bits are indicated with an S, S, and T in the
positions for execute permission for the user, group, and others,
respectively. If the letter is lower case, ( s, s, t ), then the corresponding
execute permission IS also given. If it is upper case, ( S, S, T ), then the
corresponding execute permission is NOT given.
o These bits can be set with either the symbolic form of chmod ( u+s, g+s,
+t ) or the numeric form, using a leading fourth octal digit ( 4, 2, and 1,
respectively ).
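
The way a long listing displays the nine permission bits and the three special bits can be checked with Python's standard stat module. The sketch below builds a few mode values by hand ( the particular octal values are just examples ) and converts them to the familiar ls-style strings.

```python
import stat

# Regular file, rwxr-xr-x, with SUID set (leading octal digit 4).
suid_with_execute = stat.filemode(stat.S_IFREG | 0o4755)     # '-rwsr-xr-x'

# Regular file, rw-r--r--, with SUID set but no owner execute bit.
suid_without_execute = stat.filemode(stat.S_IFREG | 0o4644)  # '-rwSr--r--'

# Directory such as /tmp: rwx for everyone plus the sticky bit (digit 1).
sticky_with_execute = stat.filemode(stat.S_IFDIR | 0o1777)   # 'drwxrwxrwt'

# Same directory without execute permission for others.
sticky_without_execute = stat.filemode(stat.S_IFDIR | 0o1776)  # 'drwxrwxrwT'

for text in (suid_with_execute, suid_without_execute,
             sticky_with_execute, sticky_without_execute):
    print(text)
```

A lower-case letter means the underlying execute bit is also set; an upper-case letter means it is not.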

Sample permissions in a UNIX system.

 Windows adjusts file access through a simple GUI:

Figure 11.16 - Windows 7 access-control list management.

11.6.3 Other Protection Approaches and Issues

 Some systems can apply passwords, either to individual files, or to specific sub-
directories, or to the entire system. There is a trade-off between the number of
passwords that must be maintained ( and remembered by the users ) and the
amount of information that is vulnerable to a lost or forgotten password.
 Older systems which did not originally have multi-user file access permissions (
DOS and older versions of Mac ) must now be retrofitted if they are to share files
on a network.
 Access to a file requires access to all the directories along its path as well. In a
cyclic directory structure, users may have different access to the same file
accessed through different paths.
 Sometimes just the knowledge of the existence of a file of a certain name is a
security ( or privacy ) concern. Hence the distinction between the R and X bits on
UNIX directories.

12. File-System Implementation

12.1 File-System Structure

 Hard disks have two important properties that make them suitable for secondary storage
of files in file systems: (1) Blocks of data can be rewritten in place, and (2) they are direct
access, allowing any block of data to be accessed with only ( relatively ) minor
movements of the disk heads and rotational latency. ( See Chapter 10. )
 Disks are usually accessed in physical blocks, rather than a byte at a time. Block sizes
may range from 512 bytes to 4K or larger.
 File systems organize storage on disk drives, and can be viewed as a layered design:
o At the lowest layer are the physical devices, consisting of the magnetic media,
motors & controls, and the electronics connected to them and controlling them.
Modern disks put more and more of the electronic controls directly on the disk
drive itself, leaving relatively little work for the disk controller card to perform.
o I/O Control consists of device drivers, special software programs ( often written
in assembly ) which communicate with the devices by reading and writing special
codes directly to and from memory addresses corresponding to the controller
card's registers. Each controller card ( device ) on a system has a different set of
addresses ( registers, a.k.a. ports ) that it listens to, and a unique set of command
codes and results codes that it understands.
o The basic file system level works directly with the device drivers in terms of
retrieving and storing raw blocks of data, without any consideration for what is in
each block. Depending on the system, blocks may be referred to with a single
block number, ( e.g. block # 234234 ), or with head-sector-cylinder combinations.
o The file organization module knows about files and their logical blocks, and how
they map to physical blocks on the disk. In addition to translating from logical to
physical blocks, the file organization module also maintains the list of free blocks,
and allocates free blocks to files as needed.
o The logical file system deals with all of the meta data associated with a file ( UID,
GID, mode, dates, etc ), i.e. everything about the file except the data itself. This
level manages the directory structure and the mapping of file names to file control
blocks, FCBs, which contain all of the meta data as well as block number
information for finding the data on the disk.
 The layered approach to file systems means that much of the code can be used uniformly
for a wide variety of different file systems, and only certain layers need to be filesystem
specific. Common file systems in use include the UNIX file system, UFS, the Berkeley
Fast File System, FFS, Windows systems FAT, FAT32, NTFS, CD-ROM systems ISO
9660, and for Linux the extended file systems ext2 and ext3 ( among 40 others
supported. )

Figure 12.1 - Layered file system.

12.2 File-System Implementation

12.2.1 Overview

 File systems store several important data structures on the disk:


o A boot-control block, ( per volume ) a.k.a. the boot block in UNIX or
the partition boot sector in Windows contains information about how to
boot the system off of this disk. This will generally be the first sector of
the volume if there is a bootable system loaded on that volume, or the
block will be left vacant otherwise.
o A volume control block, ( per volume ) a.k.a. the superblock in
UNIX or the master file table in Windows ( NTFS ), which contains information
such as the partition table, number of blocks on each filesystem, and pointers to
free blocks and free FCB blocks.
o A directory structure ( per file system ), containing file names and pointers
to corresponding FCBs. UNIX uses inode numbers, and NTFS uses
a master file table.
o The File Control Block, FCB, ( per file ) containing details about
ownership, size, permissions, dates, etc. UNIX stores this information in
inodes, and NTFS in the master file table as a relational database structure.

Figure 12.2 - A typical file-control block.

 There are also several key data structures stored in memory:


o An in-memory mount table.
o An in-memory directory cache of recently accessed directory information.
o A system-wide open file table, containing a copy of the FCB for every
currently open file in the system, as well as some other related
information.
o A per-process open file table, containing a pointer to the system open file
table as well as some other information. ( For example the current file
position pointer may be either here or in the system file table, depending
on the implementation and whether the file is being shared or not. )
 Figure 12.3 illustrates some of the interactions of file system components when
files are created and/or used:
o When a new file is created, a new FCB is allocated and filled out with
important information regarding the new file. The appropriate directory is
modified with the new file name and FCB information.
o When a file is accessed during a program, the open( ) system call reads in
the FCB information from disk, and stores it in the system-wide open file
table. An entry is added to the per-process open file table referencing the
system-wide table, and an index into the per-process table is returned by
the open( ) system call. UNIX refers to this index as a file descriptor, and
Windows refers to it as a file handle.
o If another process already has a file open when a new request comes in for
the same file, and it is sharable, then a counter in the system-wide table is
incremented and the per-process table is adjusted to point to the existing
entry in the system-wide table.
o When a file is closed, the per-process table entry is freed, and the counter
in the system-wide table is decremented. If that counter reaches zero, then
the system-wide table entry is also freed. Any data currently stored in memory
cache for this file is written out to disk if necessary.
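
The open / close bookkeeping described above can be sketched as a small simulation. This is a simplified model, not how any particular kernel lays out its tables: the class name OpenFileTables and the string used to stand in for an FCB are assumptions made for illustration.

```python
class OpenFileTables:
    """Toy model of the system-wide and per-process open file tables."""

    def __init__(self):
        self.system_table = {}    # file name -> {"fcb": ..., "count": n}
        self.process_tables = {}  # pid -> list of file names; index = descriptor

    def open(self, pid, name):
        # Reuse the system-wide entry if another process already has the file open.
        entry = self.system_table.setdefault(
            name, {"fcb": f"FCB({name})", "count": 0})
        entry["count"] += 1
        table = self.process_tables.setdefault(pid, [])
        table.append(name)
        return len(table) - 1  # index into the per-process table

    def close(self, pid, fd):
        name = self.process_tables[pid][fd]
        self.process_tables[pid][fd] = None  # free the per-process entry
        entry = self.system_table[name]
        entry["count"] -= 1
        if entry["count"] == 0:
            del self.system_table[name]  # last user gone: free the system entry


tables = OpenFileTables()
fd1 = tables.open(pid=100, name="notes.txt")
fd2 = tables.open(pid=200, name="notes.txt")
count_while_shared = tables.system_table["notes.txt"]["count"]
tables.close(100, fd1)
count_after_one_close = tables.system_table["notes.txt"]["count"]
tables.close(200, fd2)
still_open = "notes.txt" in tables.system_table
```

The value returned by open( ) here plays the role of the UNIX file descriptor or Windows file handle.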

Figure 12.3 - In-memory file-system structures. (a) File open. (b) File read.

12.2.2 Partitions and Mounting

 Physical disks are commonly divided into smaller units called partitions. They can
also be combined into larger units, but that is most commonly done for RAID
installations and is left for later chapters.
 Partitions can either be used as raw devices ( with no structure imposed upon
them ), or they can be formatted to hold a filesystem ( i.e. populated with FCBs
and initial directory structures as appropriate. ) Raw partitions are generally used
for swap space, and may also be used for certain programs such as databases that
choose to manage their own disk storage system. Partitions containing filesystems
can generally only be accessed using the file system structure by ordinary users,
but can often be accessed as a raw device also by root.
 The boot block is accessed as part of a raw partition, by the boot program prior to
any operating system being loaded. Modern boot programs understand multiple
OSes and filesystem formats, and can give the user a choice of which of several
available systems to boot.
 The root partition contains the OS kernel and at least the key portions of the OS
needed to complete the boot process. At boot time the root partition is mounted,
and control is transferred from the boot program to the kernel found there. ( Older
systems required that the root partition lie completely within the first 1024
cylinders of the disk, because that was as far as the boot program could reach.
Once the kernel had control, then it could access partitions beyond the 1024
cylinder boundary. )
 Continuing with the boot process, additional filesystems get mounted, adding
their information into the appropriate mount table structure. As a part of the
mounting process the file systems may be checked for errors or inconsistencies,
either because they are flagged as not having been closed properly the last time
they were used, or just on general principle. Filesystems may be mounted either
automatically or manually. In UNIX a mount point is indicated by setting a flag in
the in-memory copy of the inode, so all future references to that inode get
redirected to the root directory of the mounted filesystem.

12.2.3 Virtual File Systems

 Virtual File Systems, VFS, provide a common interface to multiple different
filesystem types. In addition, the VFS provides a unique identifier ( vnode ) for files
across the entire space, including across all filesystems of different types. ( UNIX
inodes are unique only across a single filesystem, and certainly do not carry
across networked file systems. )
 The VFS in Linux is based upon four key object types:
o The inode object, representing an individual file
o The file object, representing an open file.
o The superblock object, representing a filesystem.
o The dentry object, representing a directory entry.
 Linux VFS provides a set of common functionalities for each filesystem, using
function pointers accessed through a table. The same functionality is accessed
through the same table position for all filesystem types, though the actual
functions pointed to by the pointers may be filesystem-specific. See
/usr/include/linux/fs.h for full details. Common operations provided include
open( ), read( ), write( ), and mmap( ).
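
The idea of a table of function pointers can be sketched in a few lines. This is only an analogy written in Python, not the Linux implementation: the filesystem names, mount points, and functions ( ext_read, nfs_read, vfs_read ) are all hypothetical.

```python
def ext_read(path):
    return f"ext: read {path} from local disk"


def nfs_read(path):
    return f"nfs: read {path} over the network"


# One operations table per filesystem type; every table uses the same keys,
# just as every filesystem fills in the same slots of the function table.
OPERATIONS = {
    "ext": {"read": ext_read},
    "nfs": {"read": nfs_read},
}

# Hypothetical mount table: mount point -> filesystem type.
MOUNTS = {"/": "ext", "/remote": "nfs"}


def vfs_read(path):
    """Find the longest mount point containing path, then dispatch."""
    candidates = [m for m in MOUNTS
                  if path == m or path.startswith(m.rstrip("/") + "/")]
    mount = max(candidates, key=len)
    return OPERATIONS[MOUNTS[mount]]["read"](path)


local_result = vfs_read("/home/user/a.txt")
remote_result = vfs_read("/remote/b.txt")
```

The caller uses the same vfs_read( ) call in both cases and never needs to know which filesystem type actually services the request.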

Figure 12.4 - Schematic view of a virtual file system.

12.3 Directory Implementation

 Directories need to be fast to search, insert, and delete, with a minimum of wasted disk
space.

12.3.1 Linear List

 A linear list is the simplest and easiest directory structure to set up, but it does
have some drawbacks.
 Finding a file ( or verifying one does not already exist upon creation ) requires a
linear search.
 Deletions can be done by moving all entries, flagging an entry as deleted, or by
moving the last entry into the newly vacant position.
 Sorting the list makes searches faster, at the expense of more complex insertions
and deletions.
 A linked list makes insertions and deletions into a sorted list easier, with overhead
for the links.
 More complex data structures, such as B-trees, could also be considered.

12.3.2 Hash Table

 A hash table can also be used to speed up searches.
 Hash tables are generally implemented in addition to a linear or other structure.
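
A minimal sketch of a directory that keeps a linear list of entries plus a hash table on top of it is shown below. The class name Directory and the use of plain integers as FCB numbers are assumptions for illustration. Deletion uses the trick mentioned for linear lists, moving the last entry into the vacated slot.

```python
class Directory:
    """Linear list of entries with a hash table used to speed up lookups."""

    def __init__(self):
        self.entries = []  # linear list of (name, fcb_number)
        self.index = {}    # hash table: name -> position in self.entries

    def add(self, name, fcb_number):
        if name in self.index:  # constant-time check for duplicates
            raise FileExistsError(name)
        self.index[name] = len(self.entries)
        self.entries.append((name, fcb_number))

    def lookup(self, name):
        return self.entries[self.index[name]][1]

    def remove(self, name):
        position = self.index.pop(name)
        last = self.entries.pop()
        if position < len(self.entries):
            # Move the last entry into the newly vacant position.
            self.entries[position] = last
            self.index[last[0]] = position


d = Directory()
d.add("a.txt", 11)
d.add("b.txt", 22)
d.add("c.txt", 33)
d.remove("a.txt")
```

After the removal, c.txt occupies slot 0 and both remaining names are still found without a linear search.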

12.4 Allocation Methods

 There are three major methods of storing files on disks: contiguous, linked, and indexed.

12.4.1 Contiguous Allocation

 Contiguous Allocation requires that all blocks of a file be kept together
contiguously.
 Performance is very fast, because reading successive blocks of the same file
generally requires no movement of the disk heads, or at most one small step to the
next adjacent cylinder.
 Storage allocation involves the same issues discussed earlier for the allocation of
contiguous blocks of memory ( first fit, best fit, fragmentation problems, etc. )
The distinction is that the high time penalty required for moving the disk heads
from spot to spot may now justify the benefits of keeping files contiguously when
possible.
 ( Even file systems that do not by default store files contiguously can benefit from
certain utilities that compact the disk and make all files contiguous in the process. )
 Problems can arise when files grow, or if the exact size of a file is unknown at
creation time:
o Over-estimation of the file's final size leaves allocated but unused space
inside the file ( internal fragmentation ) and wastes disk space.
o Under-estimation may require that a file be moved or a process aborted if
the file grows beyond its originally allocated space.

o If a file grows slowly over a long time period and the total final space
must be allocated initially, then a lot of space becomes unusable before the
file fills the space.
 A variation is to allocate file space in large contiguous chunks,
called extents. When a file outgrows its original extent, then an additional one is
allocated. ( For example an extent may be the size of a complete track or even
cylinder, aligned on an appropriate track or cylinder boundary. ) The
high-performance Veritas file system uses extents to optimize performance.
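
As a rough illustration of the first-fit strategy mentioned above, the following sketch scans a free map for the first run of free blocks that is long enough to hold a file. The function name first_fit and the list-of-booleans free map are assumptions for this example; requested lengths are assumed to be at least 1.

```python
def first_fit(free_map, length):
    """Return the start of the first run of `length` free blocks, or None."""
    run_start, run_length = None, 0
    for block, is_free in enumerate(free_map):
        if is_free:
            if run_length == 0:
                run_start = block  # a new run of free blocks begins here
            run_length += 1
            if run_length == length:
                return run_start
        else:
            run_length = 0  # run broken by an allocated block
    return None


# True = free, False = allocated.
free_map = [False, True, True, False, True, True, True]
two_blocks = first_fit(free_map, 2)
three_blocks = first_fit(free_map, 3)
four_blocks = first_fit(free_map, 4)
```

The last request fails even though five blocks are free in total, which is exactly the external fragmentation problem.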

Figure 12.5 - Contiguous allocation of disk space.

12.4.2 Linked Allocation

 Disk files can be stored as linked lists, with the expense of the storage space
consumed by each link. ( E.g. a block may be 508 bytes instead of 512. )
 Linked allocation involves no external fragmentation, does not require pre-known
file sizes, and allows files to grow dynamically at any time.
 Unfortunately linked allocation is only efficient for sequential access files, as
random access requires starting at the beginning of the list for each new location
access.
 Allocating clusters of blocks reduces the space wasted by pointers, at the cost of
internal fragmentation.
 Another big problem with linked allocation is reliability if a pointer is lost or
damaged. Doubly linked lists provide some protection, at the cost of additional
overhead and wasted space.

Figure 12.6 - Linked allocation of disk space.

 The File Allocation Table, FAT, used by DOS is a variation of linked allocation,
where all the links are stored in a separate table at the beginning of the disk. The
benefit of this approach is that the FAT table can be cached in memory, greatly
improving random access speeds.
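
Following a file through the FAT can be sketched as below. The block numbers and the end-of-file marker value are made up for the example; the point is that the whole chain is found by table lookups alone, without reading any data blocks.

```python
END_OF_FILE = -1  # assumed marker for the last block of a file

# Hypothetical FAT: entry i holds the number of the block that follows block i.
fat = {217: 618, 618: 339, 339: END_OF_FILE}


def file_blocks(fat, start_block):
    """Return the list of blocks of a file, in order, by walking the FAT."""
    blocks = []
    block = start_block
    while block != END_OF_FILE:
        blocks.append(block)
        block = fat[block]  # the link lives in the table, not in the data block
    return blocks


chain = file_blocks(fat, 217)
third_block = chain[2]  # random access to the third block of the file
```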

Figure 12.7 File-allocation table.

12.4.3 Indexed Allocation

 Indexed Allocation combines all of the indexes for accessing each file into a
common block ( for that file ), as opposed to spreading them all over the disk or
storing them in a FAT table.

Figure 12.8 - Indexed allocation of disk space.

 Some disk space is wasted ( relative to linked lists or FAT tables ) because an
entire index block must be allocated for each file, regardless of how many data
blocks the file contains. This leads to questions of how big the index block should
be, and how it should be implemented. There are several approaches:
o Linked Scheme - An index block is one disk block, which can be read
and written in a single disk operation. The first index block contains some
header information, the first N block addresses, and if necessary a pointer
to additional linked index blocks.
o Multi-Level Index - The first index block contains a set of pointers to
secondary index blocks, which in turn contain pointers to the actual data
blocks.
o Combined Scheme - This is the scheme used in UNIX inodes, in which
the first 12 or so data block pointers are stored directly in the inode, and
then singly, doubly, and triply indirect pointers provide access to more
data blocks as needed. ( See below. ) The advantage of this scheme is that
for small files ( which many are ), the data blocks are readily accessible
( up to 48K with 4K block sizes ); files up to about 4144K ( using 4K
blocks ) are accessible with only a single indirect block ( which can be
cached ), and huge files are still accessible using a relatively small number
of disk accesses ( larger in theory than can be addressed by a 32-bit
address, which is why some systems have moved to 64-bit file pointers. )
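
The sizes quoted for the combined scheme can be reproduced with a short calculation. The sketch assumes 12 direct pointers, 4K blocks, and 4-byte block pointers ( so 1024 pointers fit in one indirect block ); the function name inode_capacity is made up for this example.

```python
def inode_capacity(block_size=4096, pointer_size=4, direct_pointers=12):
    """Return (direct only, plus single indirect, all levels) sizes in bytes."""
    pointers_per_block = block_size // pointer_size
    direct = direct_pointers * block_size
    single = pointers_per_block * block_size
    double = pointers_per_block ** 2 * block_size
    triple = pointers_per_block ** 3 * block_size
    return direct, direct + single, direct + single + double + triple


direct_only, with_single, everything = inode_capacity()
direct_only_k = direct_only // 1024  # 48
with_single_k = with_single // 1024  # 4144
exceeds_32_bit = everything > 2 ** 32
```

With these assumptions the doubly indirect level alone covers 2^32 bytes, so the total is already beyond what a 32-bit byte offset can address.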

Figure 12.9 - The UNIX inode.

12.4.4 Performance

 The optimal allocation method is different for sequential access files than for
random access files, and is also different for small files than for large files.
 Some systems support more than one allocation method, which may require
specifying how the file is to be used ( sequential or random access ) at the time it
is allocated. Such systems also provide conversion utilities.
 Some systems have been known to use contiguous allocation for small files, and
automatically switch to an indexed scheme when file sizes surpass a certain
threshold.
 And of course some systems adjust their allocation schemes ( e.g. block sizes ) to
best match the characteristics of the hardware for optimum performance.

12.5 Free-Space Management

 Another important aspect of disk management is keeping track of and allocating
free space.

12.5.1 Bit Vector

 One simple approach is to use a bit vector, in which each bit represents a disk
block, set to 1 if free or 0 if allocated.
 Fast algorithms exist for quickly finding contiguous blocks of a given size.

 The down side is the size of the bitmap itself. For example, a 40GB disk with
1KB blocks requires about 5MB just to store the bitmap.
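
Both the bitmap size and a simple search for a free block can be sketched as follows. The 1KB block size is an assumption chosen to match a bitmap of about 5MB, bits are numbered from the most significant bit of each byte, and the function names are invented for the example.

```python
def bitmap_bytes(disk_bytes, block_bytes):
    """Bytes needed for a bit vector with one bit per disk block."""
    blocks = disk_bytes // block_bytes
    return (blocks + 7) // 8  # round up to a whole number of bytes


def first_free(bitmap):
    """Return the number of the first free block (bit set to 1), or None."""
    for byte_index, byte in enumerate(bitmap):
        if byte:  # skip whole bytes that contain no free blocks
            for bit in range(8):
                if byte & (0x80 >> bit):
                    return byte_index * 8 + bit
    return None


size = bitmap_bytes(40 * 2**30, 1024)             # 40GB disk, 1KB blocks
free_block = first_free(bytes([0x00, 0x00, 0x10]))  # only bit 3 of byte 2 is set
```

Skipping zero bytes ( or zero words ) is what makes the search fast in practice, since most of the comparisons examine 8 or more blocks at once.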

12.5.2 Linked List

 A linked list can also be used to keep track of all free blocks.
 Traversing the list and/or finding a contiguous block of a given size are not easy,
but fortunately are not frequently needed operations. Generally the system just
adds and removes single blocks from the beginning of the list.
 The FAT table keeps track of the free list as just one more linked list on the table.

Figure 12.10 - Linked free-space list on disk.

12.5.3 Grouping

 A variation on linked list free lists is to use links of blocks of indices of free
blocks. If a block holds up to N addresses, then the first block in the linked-list
contains up to N-1 addresses of free blocks and a pointer to the next block of free
addresses.

12.5.4 Counting

 When there are multiple contiguous blocks of free space then the system can keep
track of the starting address of the group and the number of contiguous free
blocks. As long as the average length of a contiguous group of free blocks is
greater than two this offers a savings in space needed for the free list. ( Similar to
compression techniques used for graphics images when a group of pixels all the
same color is encountered. )
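
A small sketch of the counting technique: a list of free block numbers is collapsed into ( starting address, count ) pairs. The function name to_runs and the sample block numbers are assumptions for illustration.

```python
def to_runs(free_blocks):
    """Collapse free block numbers into (start, count) pairs."""
    runs = []
    for block in sorted(free_blocks):
        if runs and runs[-1][0] + runs[-1][1] == block:
            start, count = runs[-1]
            runs[-1] = (start, count + 1)  # block extends the current run
        else:
            runs.append((block, 1))        # block starts a new run
    return runs


free_blocks = [2, 3, 4, 5, 8, 9, 10, 11, 12, 13, 17, 18, 25, 26, 27]
runs = to_runs(free_blocks)
```

Here 15 individual block numbers shrink to 4 pairs ( 8 numbers ), a saving that grows with the average run length.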

12.5.5 Space Maps

 Sun's ZFS file system was designed for HUGE numbers and sizes of files,
directories, and even file systems.
 The resulting data structures could be VERY inefficient if not implemented
carefully. For example, freeing up a 1 GB file on a 1 TB file system could involve
updating thousands of blocks of free list bit maps if the file was spread across the
disk.
 ZFS uses a combination of techniques, starting with dividing the disk up into (
hundreds of ) metaslabs of a manageable size, each having their own space map.
 Free blocks are managed using the counting technique, but rather than write the
information to a table, it is recorded in a log-structured transaction record.
Adjacent free blocks are also coalesced into a larger single free block.
 An in-memory space map is constructed using a balanced tree data structure,
constructed from the log data.
 The combination of the in-memory tree and the on-disk log provide for very fast
and efficient management of these very large files and free blocks.

12.6 Efficiency and Performance

12.6.1 Efficiency

 UNIX pre-allocates inodes, which occupy space even before any files are
created.
 UNIX also distributes inodes across the disk, and tries to store data files near their
inode, to reduce the distance of disk seeks between the inodes and the data.
 Some systems use variable size clusters depending on the file size.
 The more data that is stored in a directory ( e.g. last access time ), the more often
the directory blocks have to be re-written.
 As technology advances, addressing schemes have had to grow as well.
o Sun's ZFS file system uses 128-bit pointers, which should theoretically
never need to be expanded. ( The mass required to store 2^128 bytes with
atomic storage would be at least 272 trillion kilograms! )
 Kernel table sizes used to be fixed, and could only be changed by rebuilding the
kernel. Modern tables are dynamically allocated, but that requires more
complicated algorithms for accessing them.

12.6.2 Performance

 Disk controllers generally include on-board caching. When a seek is requested,
the heads are moved into place, and then an entire track is read, starting from
whatever sector is currently under the heads ( reducing latency. ) The requested
sector is returned and the unrequested portion of the track is cached in the disk's
electronics.
 Some OSes cache disk blocks they expect to need again in a buffer cache.

 A page cache connected to the virtual memory system is actually more efficient
as memory addresses do not need to be converted to disk block addresses and
back again.
 Some systems ( Solaris, Linux, Windows 2000, NT, XP ) use page caching for
both process pages and file data in a unified virtual memory.
 Figures 12.11 and 12.12 show the advantages of the unified buffer cache found
in some versions of UNIX and Linux - Data does not need to be stored twice, and
problems of inconsistent buffer information are avoided.

Figure 12.11 - I/O without a unified buffer cache.

Figure 12.12 - I/O using a unified buffer cache.

 Page replacement strategies can be complicated with a unified cache, as one needs
to decide whether to replace process or file pages, and how many pages to
guarantee to each category of pages. Solaris, for example, has gone through many
variations, resulting in priority paging giving process pages priority over file I/O
pages, and setting limits so that neither can knock the other completely out of
memory.
 Another issue affecting performance is the question of whether to
implement synchronous writes or asynchronous writes. Synchronous writes
occur in the order in which the disk subsystem receives them, without caching;
asynchronous writes are cached, allowing the disk subsystem to schedule writes
in a more efficient order ( See Chapter 10. ) Metadata writes are often done
synchronously. Some systems support flags to the open call requiring that writes
be synchronous, for example for the benefit of database systems that require their
writes be performed in a required order.

 The type of file access can also have an impact on optimal page replacement
policies. For example, LRU is not necessarily a good policy for sequential access
files. For these types of files progression normally goes in a forward direction
only, and the most recently used page will not be needed again until after the file
has been rewound and re-read from the beginning, ( if it is ever needed at all. ) On
the other hand, we can expect to need the next page in the file fairly soon. For this
reason sequential access files often take advantage of two special policies:
o Free-behind frees up a page as soon as the next page in the file is
requested, with the assumption that we are now done with the old page
and won't need it again for a long time.
o Read-ahead reads the requested page and several subsequent pages at the
same time, with the assumption that those pages will be needed in the near
future. This is similar to the track caching that is already performed by the
disk controller, except it saves the future latency of transferring data from
the disk controller memory into motherboard main memory.
 The caching system and asynchronous writes speed up disk writes considerably,
because the disk subsystem can schedule physical writes to the disk to minimize
head movement and disk seek times. ( See Chapter 10. ) Reads, on the other hand,
must be done more synchronously in spite of the caching system, with the result
that disk writes can counter-intuitively be much faster on average than disk reads.

12.7 Recovery

12.7.1 Consistency Checking

 The storing of certain data structures ( e.g. directories and inodes ) in memory and
the caching of disk operations can speed up performance, but what happens in the
event of a system crash? All volatile memory structures are lost, and the
information stored on the hard drive may be left in an inconsistent state.
 A Consistency Checker ( fsck in UNIX, chkdsk or scandisk in Windows ) is often
run at boot time or mount time, particularly if a filesystem was not closed down
properly. Some of the problems that these tools look for include:
o Disk blocks allocated to files and also listed on the free list.
o Disk blocks neither allocated to files nor on the free list.
o Disk blocks allocated to more than one file.
o The number of disk blocks allocated to a file inconsistent with the file's
stated size.
o Properly allocated files / inodes which do not appear in any directory
entry.
o Link counts for an inode not matching the number of references to that
inode in the directory structure.
o Two or more identical file names in the same directory.
o Illegally linked directories, e.g. cyclical relationships where those are not
allowed, or files/directories that are not accessible from the root of the
directory tree.

o Consistency checkers will often collect questionable disk blocks into new
files with names such as chk00001.dat. These files may contain valuable
information that would otherwise be lost, but in most cases they can be
safely deleted, ( returning those disk blocks to the free list. )
 UNIX caches directory information for reads, but any changes that affect space
allocation or metadata changes are written synchronously, before any of the
corresponding data blocks are written to.
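
The first three block-level checks listed above can be sketched in a few lines. This is an illustrative toy, not how fsck or chkdsk is actually implemented; the function name check_blocks and the dictionary layout are assumptions made for this example.

```python
from collections import Counter

def check_blocks(total_blocks, files, free_list):
    """Report block-level inconsistencies.

    files maps a file name to the list of block numbers it claims;
    free_list is the collection of block numbers marked free.
    """
    free = set(free_list)
    # Count how many times each block is claimed by some file.
    owners = Counter(b for blocks in files.values() for b in blocks)
    allocated = set(owners)
    return {
        # Blocks allocated to files and also listed on the free list.
        "allocated_and_free": sorted(allocated & free),
        # Blocks neither allocated to files nor on the free list.
        "lost": sorted(set(range(total_blocks)) - allocated - free),
        # Blocks allocated to more than one file.
        "multiply_allocated": sorted(b for b, n in owners.items() if n > 1),
    }
```

A real checker would go on to repair what it finds, for example by collecting the lost blocks into a recovery file as described above.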

12.7.2 Log-Structured File Systems

 Log-based transaction-oriented ( a.k.a. journaling ) filesystems borrow
techniques developed for databases, guaranteeing that any given transaction either
completes successfully or can be rolled back to a safe state before the transaction
commenced:
o All metadata changes are written sequentially to a log.
o A set of changes for performing a specific task ( e.g. moving a file ) is
a transaction.
o As changes are written to the log they are said to be committed, allowing
the system to return to its work.
o In the meantime, the changes from the log are carried out on the actual
filesystem, and a pointer keeps track of which changes in the log have
been completed and which have not yet been completed.
o When all changes corresponding to a particular transaction have been
completed, that transaction can be safely removed from the log.
o At any given time, the log will contain information pertaining to
uncompleted transactions only, e.g. actions that were committed but for
which the entire transaction has not yet been completed.
 From the log, the remaining transactions can be completed,
 or if the transaction was aborted, then the partially completed
changes can be undone.
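
The commit-then-apply cycle above can be sketched as a toy journal. This is a minimal illustration under strong assumptions: metadata is a Python dictionary, each change just sets a key to a value ( so replaying it twice is harmless ), and only the "complete the remaining transactions" path is shown, not the undo path. The class and method names are invented for this example.

```python
class Journal:
    """Toy metadata journal: log first, apply later, replay after a crash."""

    def __init__(self):
        self.log = []          # committed transactions: lists of (key, value)
        self.metadata = {}     # stands in for the on-disk file-system structures

    def commit(self, changes):
        # Writing the whole transaction to the log is the commit point.
        self.log.append(list(changes))

    def apply_next(self, crash_after=None):
        # Carry out the oldest logged transaction on the real structures.
        txn = self.log[0]
        for i, (key, value) in enumerate(txn):
            if crash_after is not None and i >= crash_after:
                raise RuntimeError("simulated crash")
            self.metadata[key] = value
        self.log.pop(0)        # fully applied, so drop it from the log

    def recover(self):
        # Replay every transaction still in the log; re-applying is safe
        # because each change simply sets a key to a value.
        while self.log:
            self.apply_next()
```

If a crash interrupts apply_next( ) part-way through, the transaction is still in the log, so recover( ) finishes it and the metadata ends up in the same state as if no crash had occurred.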

12.7.3 Other Solutions

 Sun's ZFS and Network Appliance's WAFL file systems take a different approach
to file system consistency.
 No blocks of data are ever over-written in place. Rather the new data is written
into fresh new blocks, and after the transaction is complete, the metadata ( data
block pointers ) is updated to point to the new blocks.
o The old blocks can then be freed up for future use.
o Alternatively, if the old blocks and old metadata are saved, then
a snapshot of the system in its original state is preserved. This approach is
taken by WAFL.

 ZFS combines this with check-summing of all metadata and data blocks, and
RAID, to ensure that no inconsistencies are possible, and therefore ZFS does not
incorporate a consistency checker.

12.7.4 Backup and Restore

 In order to recover lost data in the event of a disk crash, it is important to conduct
backups regularly.
 Files should be copied to some removable medium, such as magnetic tapes, CDs,
DVDs, or external removable hard drives.
 A full backup copies every file on a filesystem.
 Incremental backups copy only files which have changed since some previous
time.
 A combination of full and incremental backups can offer a compromise between
full recoverability, the number and size of backup tapes needed, and the number
of tapes that need to be used to do a full restore. For example, one strategy might
be:
o At the beginning of the month do a full backup.
o At the end of the first and again at the end of the second week, back up all
files which have changed since the beginning of the month.
o At the end of the third week, back up all files that have changed since the
end of the second week.
o Every day of the month not listed above, do an incremental backup of all
files that have changed since the most recent of the weekly backups
described above.
 Backup tapes are often reused, particularly for daily backups, but there are limits
to how many times the same tape can be used.
 Every so often a full backup should be made that is kept "forever" and not
overwritten.
 Backup tapes should be tested, to ensure that they are readable!
 For optimal security, backup tapes should be kept off-premises, so that a fire or
burglary cannot destroy both the system and the backups. There are companies (
e.g. Iron Mountain ) that specialize in the secure off-site storage of critical backup
information.
 Keep your backup tapes secure - The easiest way for a thief to steal all your data
is to simply pocket your backup tapes!
 Storing important files on more than one computer can be an alternate though less
reliable form of backup.
 Note that incremental backups can also help users to get back a previous version
of a file that they have since changed in some way.
 Beware that backups can help forensic investigators recover e-mails and other
files that users thought they had deleted!

12.8 NFS

12.8.1 Overview

Figure 12.13 - Three independent file systems.

Figure 12.14 - Mounting in NFS. (a) Mounts. (b) Cascading mounts.

12.8.2 The Mount Protocol

 The NFS mount protocol is similar to the local mount protocol, establishing a
connection between a specific local directory ( the mount point ) and a specific
device from a remote system.
 Each server maintains an export list of the local filesystems ( directory sub-trees )
which are exportable, who they are exportable to, and what restrictions apply (
e.g. read-only access. )
 The server also maintains a list of currently connected clients, so that they can be
notified in the event of the server going down and for other reasons.
 Automount and autounmount are supported.

12.8.3 The NFS Protocol

 Implemented as a set of remote procedure calls ( RPCs ):
o Searching for a file in a directory
o Reading a set of directory entries
o Manipulating links and directories
o Accessing file attributes
o Reading and writing files

Figure 12.15 - Schematic view of the NFS architecture.

12.8.4 Path-Name Translation

12.8.5 Remote Operations

 Buffering and caching improve performance, but can cause a disparity in local
versus remote views of the same file(s).

12.9 Example: The WAFL File System ( Optional )

 Write Anywhere File Layout
 Designed for a specific hardware architecture.
 Snapshots record the state of the system at regular or irregular intervals.
o The snapshot just copies the inode pointers, not the actual data.
o Used pages are not overwritten, so updates are fast.
o Blocks keep counters for how many snapshots are pointing to that block - When
the counter reaches zero, then the block is considered free.
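
The snapshot bookkeeping described above ( copy the pointers, never overwrite used blocks, free a block when its counter reaches zero ) can be sketched as follows. This is a toy model and not WAFL's actual on-disk format: each file is assumed to occupy exactly one block, and the class name CowStore and its fields are invented for this example.

```python
class CowStore:
    """Toy copy-on-write store with reference-counted blocks and snapshots."""

    def __init__(self):
        self.blocks = {}       # block id -> data
        self.refs = {}         # block id -> number of pointer tables using it
        self.live = {}         # file name -> block id (the active pointers)
        self.snapshots = []    # each snapshot is a copy of the pointer table
        self._next_id = 0

    def write(self, name, data):
        # Never overwrite in place: put the data in a fresh block first.
        block = self._next_id
        self._next_id += 1
        self.blocks[block] = data
        self.refs[block] = 1
        old = self.live.get(name)
        self.live[name] = block          # then switch the pointer
        if old is not None:
            self._release(old)

    def snapshot(self):
        # Copy only the pointers, and count the extra reference to each block.
        snap = dict(self.live)
        for block in snap.values():
            self.refs[block] += 1
        self.snapshots.append(snap)
        return snap

    def _release(self, block):
        self.refs[block] -= 1
        if self.refs[block] == 0:        # nobody points here: block is free
            del self.refs[block]
            del self.blocks[block]
```

After a snapshot, rewriting a file leaves the old block in place because the snapshot still references it; blocks written after the snapshot are freed as soon as they are superseded.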

Figure 12.16 - The WAFL file layout.

Figure 12.17 - Snapshots in WAFL

13. I/O Systems

13.1 Overview

 Management of I/O devices is a very important part of the operating system - so
important and so varied that entire I/O subsystems are devoted to its operation. ( Consider
the range of devices on a modern computer: mice, keyboards, disk drives, display
adapters, USB devices, network connections, audio I/O, printers, special devices for the
handicapped, and many special-purpose peripherals. )
 I/O Subsystems must contend with two ( conflicting? ) trends: (1) The gravitation
towards standard interfaces for a wide range of devices, making it easier to add newly
developed devices to existing systems, and (2) the development of entirely new types of
devices, for which the existing standard interfaces are not always easy to apply.
 Device drivers are modules that can be plugged into an OS to handle a particular device
or category of similar devices.

13.2 I/O Hardware

 I/O devices can be roughly categorized as storage, communications, user-interface, and
other.
 Devices communicate with the computer via signals sent over wires or through the air.
 Devices connect with the computer via ports, e.g. a serial or parallel port.
 A common set of wires connecting multiple devices is termed a bus.
o Buses include rigid protocols for the types of messages that can be sent across the
bus and the procedures for resolving contention issues.
o Figure 13.1 below illustrates three of the four bus types commonly found in a
modern PC:
1. The PCI bus connects high-speed high-bandwidth devices to the memory
subsystem ( and the CPU. )
2. The expansion bus connects slower low-bandwidth devices, which
typically deliver data one character at a time ( with buffering. )
3. The SCSI bus connects a number of SCSI devices to a common SCSI
controller.
4. A daisy-chain bus ( not shown ) is a string of devices connected to each
other like beads on a chain, with only one of the devices directly
connected to the host.

Figure 13.1 - A typical PC bus structure.

 One way of communicating with devices is through registers associated with each port.
Registers may be one to four bytes in size, and may typically include ( a subset of ) the
following four:
1. The data-in register is read by the host to get input from the device.
2. The data-out register is written by the host to send output.
3. The status register has bits read by the host to ascertain the status of the device,
such as idle, ready for input, busy, error, transaction complete, etc.
4. The control register has bits written by the host to issue commands or to change
settings of the device such as parity checking, word length, or full- versus half-
duplex operation.
 Figure 13.2 shows some of the most common I/O port address ranges.

Figure 13.2 - Device I/O port locations on PCs ( partial ).

 Another technique for communicating with devices is memory-mapped I/O.
o In this case a certain portion of the processor's address space is mapped to the
device, and communications occur by reading and writing directly to/from those
memory areas.
o Memory-mapped I/O is suitable for devices which must move large quantities of
data quickly, such as graphics cards.
o Memory-mapped I/O can be used either instead of or more often in combination
with traditional registers. For example, graphics cards still use registers for
control information such as setting the video mode.
o A potential problem exists with memory-mapped I/O, if a process is allowed to
write directly to the address space used by a memory-mapped I/O device.
o ( Note: Memory-mapped I/O is not the same thing as direct memory access,
DMA. See section 13.2.3 below. )

13.2.1 Polling

 One simple means of device handshaking involves polling:
1. The host repeatedly checks the busy bit on the device until it becomes
clear.
2. The host writes a byte of data into the data-out register, and sets the write
bit in the command register ( in either order. )
3. The host sets the command ready bit in the command register to notify the
device of the pending command.
4. When the device controller sees the command-ready bit set, it first sets the
busy bit.
5. Then the device controller reads the command register, sees the write bit
set, reads the byte of data from the data-out register, and outputs the byte
of data.
6. The device controller then clears the error bit in the status register, the
command-ready bit, and finally clears the busy bit, signaling the
completion of the operation.
 Polling can be very fast and efficient, if both the device and the controller are fast
and if there is significant data to transfer. It becomes inefficient, however, if the
host must wait a long time in the busy loop waiting for the device, or if frequent
checks need to be made for data that is infrequently there.
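
The six handshaking steps above can be mimicked with a software model of the registers. This is only a simulation sketch: the "controller" runs synchronously inside step( ) rather than in parallel hardware, so the busy-wait loop never actually spins, and all names ( Device, polled_write ) are invented for this example.

```python
class Device:
    """Toy device controller exposing status, command and data-out registers."""

    def __init__(self):
        self.busy = False
        self.command_ready = False
        self.write_bit = False
        self.error = False
        self.data_out = 0
        self.output = []       # bytes the device has "printed"

    def step(self):
        # Steps 4-6: run by the controller when it notices command-ready.
        if self.command_ready:
            self.busy = True
            if self.write_bit:
                self.output.append(self.data_out)
            self.error = False
            self.command_ready = False
            self.busy = False


def polled_write(dev, data):
    for byte in data:
        while dev.busy:            # step 1: busy-wait until the device is idle
            dev.step()
        dev.data_out = byte        # step 2: load the byte and request a write
        dev.write_bit = True
        dev.command_ready = True   # step 3: tell the controller to go
        dev.step()                 # let the simulated controller run
```

On real hardware the loop in step 1 is where the host can waste time, which is the inefficiency noted above.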

13.2.2 Interrupts

 Interrupts allow devices to notify the CPU when they have data to transfer or
when an operation is complete, allowing the CPU to perform other duties when no
I/O transfers need its immediate attention.
 The CPU has an interrupt-request line that is sensed after every instruction.
o A device's controller raises an interrupt by asserting a signal on the
interrupt request line.

o The CPU then performs a state save, and transfers control to the interrupt
handler routine at a fixed address in memory. ( The CPU catches the
interrupt and dispatches the interrupt handler. )
o The interrupt handler determines the cause of the interrupt, performs the
necessary processing, performs a state restore, and executes a return from
interrupt instruction to return the CPU to the interrupted computation. ( The
interrupt handler clears the interrupt by servicing the device. )
 ( Note that the state restored does not need to be the same state as
the one that was saved when the interrupt went off. See below for
an example involving time-slicing. )
 Figure 13.3 illustrates the interrupt-driven I/O procedure:

Figure 13.3 - Interrupt-driven I/O cycle.

 The above description is adequate for simple interrupt-driven I/O, but there are
three needs in modern computing which complicate the picture:
1. The need to defer interrupt handling during critical processing,
2. The need to determine which interrupt handler to invoke, without having
to poll all devices to see which one needs attention, and

3. The need for multi-level interrupts, so the system can differentiate
between high- and low-priority interrupts for proper response.
 These issues are handled in modern computer architectures with interrupt-
controller hardware.
o Most CPUs now have two interrupt-request lines: One that is non-
maskable for critical error conditions and one that is maskable, that the
CPU can temporarily ignore during critical processing.
o The interrupt mechanism accepts an address, which is usually one of a
small set of numbers for an offset into a table called the interrupt
vector. This table ( usually located at physical address zero ? ) holds the
addresses of routines prepared to process specific interrupts.
o The number of possible interrupt handlers still exceeds the range of
defined interrupt numbers, so multiple handlers can be interrupt chained.
Effectively the addresses held in the interrupt vectors are the head pointers
for linked-lists of interrupt handlers.
o Figure 13.4 shows the Intel Pentium interrupt vector. Interrupts 0 to 31 are
non-maskable and reserved for serious hardware and other errors.
Maskable interrupts, including normal device I/O interrupts begin at
interrupt 32.
o Modern interrupt hardware also supports interrupt priority levels,
allowing systems to mask off only lower-priority interrupts while
servicing a high-priority interrupt, or conversely to allow a high-priority
signal to interrupt the processing of a low-priority one.
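
The interrupt vector, handler chaining, and masking described above can be modelled in software. This is a conceptual sketch only, since real dispatch is done by hardware: the class name, the handler convention ( return True when the handler recognises its own device ), and the table size of 256 are assumptions for this example, while the boundary at 32 follows the text.

```python
class InterruptController:
    """Toy interrupt vector whose entries are chains of handlers."""

    FIRST_MASKABLE = 32        # vectors below this are treated as non-maskable

    def __init__(self, size=256):
        self.vector = [[] for _ in range(size)]   # one handler chain per number
        self.masked = False

    def register(self, number, handler):
        self.vector[number].append(handler)       # chain onto existing handlers

    def raise_interrupt(self, number):
        if self.masked and number >= self.FIRST_MASKABLE:
            return False                          # ignored while masked
        for handler in self.vector[number]:
            if handler():                         # True means "this was mine"
                return True
        return False
```

Each handler on a chain is tried in turn until one claims the interrupt, which is how several devices can share one interrupt number.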

Figure 13.4 - Intel Pentium processor event-vector table.

 At boot time the system determines which devices are present, and loads the
appropriate handler addresses into the interrupt table.
 During operation, devices signal errors or the completion of commands via
interrupts.
 Exceptions, such as dividing by zero, invalid memory accesses, or attempts to
execute privileged ( kernel-mode ) instructions can be signaled via interrupts.
 Time slicing and context switches can also be implemented using the interrupt
mechanism.
o The scheduler sets a hardware timer before transferring control over to a
user process.
o When the timer raises the interrupt request line, the CPU performs a state-
save, and transfers control over to the proper interrupt handler, which in
turn runs the scheduler.
o The scheduler does a state-restore of a different process before resetting
the timer and issuing the return-from-interrupt instruction.
 A similar example involves the paging system for virtual memory - A page fault
causes an interrupt, which in turn issues an I/O request and a context switch as
described above, moving the interrupted process into the wait queue and selecting
a different process to run. When the I/O request has completed ( i.e. when the
requested page has been loaded up into physical memory ), then the device
interrupts, and the interrupt handler moves the process from the wait queue into
the ready queue, ( or depending on scheduling algorithms and policies, may go
ahead and context switch it back onto the CPU. )
 System calls are implemented via software interrupts, a.k.a. traps. When a
( library ) program needs work performed in kernel mode, it sets command
information and possibly data addresses in certain registers, and then raises a
software interrupt. ( E.g. 21 hex in DOS. ) The system does a state save and then
calls on the proper interrupt handler to process the request in kernel mode.
Software interrupts generally have low priority, as they are not as urgent as
devices with limited buffering space.
 Interrupts are also used to control kernel operations, and to schedule activities for
optimal performance. For example, the completion of a disk read operation
involves two interrupts:
o A high-priority interrupt acknowledges the device completion, and issues
the next disk request so that the hardware does not sit idle.
o A lower-priority interrupt transfers the data from the kernel memory space
to the user space, and then transfers the process from the waiting queue to
the ready queue.
 The Solaris OS uses a multi-threaded kernel and priority threads to assign
different threads to different interrupt handlers. This allows for the
"simultaneous" handling of multiple interrupts, and the assurance that high-
priority interrupts will take precedence over low-priority ones and over user
processes.

13.2.3 Direct Memory Access

 For devices that transfer large quantities of data ( such as disk controllers ), it is
wasteful to tie up the CPU transferring data in and out of registers one byte at a
time.
 Instead this work can be off-loaded to a special processor, known as the Direct
Memory Access, DMA, Controller.
 The host issues a command to the DMA controller, indicating the location where
the data is located, the location where the data is to be transferred to, and the
number of bytes of data to transfer. The DMA controller handles the data transfer,
and then interrupts the CPU when the transfer is complete.
 A simple DMA controller is a standard component in modern PCs, and many bus-
mastering I/O cards contain their own DMA hardware.
 Handshaking between DMA controllers and their devices is accomplished
through two wires called the DMA-request and DMA-acknowledge wires.
 While the DMA transfer is going on the CPU does not have access to the PCI bus
( including main memory ), but it does have access to its internal registers and
primary and secondary caches.
 DMA can be done in terms of either physical addresses or virtual addresses that
are mapped to physical addresses. The latter approach is known as Direct Virtual
Memory Access, DVMA, and allows direct data transfer from one memory-
mapped device to another without using the main memory chips.
 Direct DMA access by user processes can speed up operations, but is generally
forbidden by modern systems for security and protection reasons. ( I.e. DMA is a
kernel-mode operation. )
 Figure 13.5 below illustrates the DMA process.

Figure 13.5 - Steps in a DMA transfer.

13.2.4 I/O Hardware Summary

13.3 Application I/O Interface

 User application access to a wide variety of different devices is accomplished through
layering, and through encapsulating all of the device-specific code into device drivers,
while application layers are presented with a common interface for all ( or at least large
general categories of ) devices.

Figure 13.6 - A kernel I/O structure.

 Devices differ on many different dimensions, as outlined in Figure 13.7:

Figure 13.7 - Characteristics of I/O devices.

 Most devices can be characterized as either block I/O, character I/O, memory mapped file
access, or network sockets. A few devices are special, such as time-of-day clock and the
system timer.
 Most OSes also have an escape, or back door, which allows applications to send
commands directly to device drivers if needed. In UNIX this is the ioctl( ) system call (
I/O Control ). ioctl( ) takes three arguments: the file descriptor for the device driver
being accessed, an integer indicating the desired function to be performed, and an address
used for communicating or transferring additional information.
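
As a small illustration of the three-argument form, Python's fcntl.ioctl( ) wraps the UNIX call directly. The sketch below assumes a Linux-like system where the FIONREAD request ( "how many bytes are waiting?" ) is supported on pipes; the helper name bytes_waiting is made up for this example.

```python
import array
import fcntl
import os
import termios

def bytes_waiting(fd):
    """Ask the driver behind fd how many bytes are ready to be read."""
    count = array.array("i", [0])             # buffer the driver fills in
    # Arguments: file descriptor, function code, address of extra data.
    fcntl.ioctl(fd, termios.FIONREAD, count)
    return count[0]

r, w = os.pipe()
os.write(w, b"hello")
waiting = bytes_waiting(r)    # number of unread bytes sitting in the pipe
os.close(r)
os.close(w)
```

The same call shape is used for device-specific requests; only the function code and the meaning of the third argument change.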

13.3.1 Block and Character Devices

 Block devices are accessed a block at a time, and are indicated by a "b" as the first
character in a long listing on UNIX systems. Operations supported include read( ),
write( ), and seek( ).
o Accessing blocks on a hard drive directly ( without going through the
filesystem structure ) is called raw I/O, and can speed up certain
operations by bypassing the buffering and locking normally conducted by
the OS. ( It then becomes the application's responsibility to manage those
issues. )
o A new alternative is direct I/O, which uses the normal filesystem access,
but which disables buffering and locking operations.
 Memory-mapped file I/O can be layered on top of block-device drivers.
o Rather than reading in the entire file, it is mapped to a range of memory
addresses, and then paged into memory as needed using the virtual
memory system.
o Access to the file is then accomplished through normal memory accesses,
rather than through read( ) and write( ) system calls. This approach is
commonly used for executable program code.
 Character devices are accessed one byte at a time, and are indicated by a "c" in
UNIX long listings. Supported operations include get( ) and put( ), with more
advanced functionality such as reading an entire line supported by higher-level
library routines.
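
Memory-mapped file I/O, as described above, can be tried out with Python's standard mmap module. This is a minimal sketch using a temporary file; the file contents and variable names are arbitrary choices for this example.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello world")
    with mmap.mmap(fd, 0) as view:        # map the whole file into memory
        first = view[:5]                  # read via slicing, no read() call
        view[0:5] = b"HELLO"              # write via assignment, no write() call
        view.flush()                      # push dirty pages back to the file
    os.lseek(fd, 0, os.SEEK_SET)
    on_disk = os.read(fd, 11)             # confirm the file itself changed
finally:
    os.close(fd)
    os.remove(path)
```

The program touches the file only through ordinary memory accesses on view; the virtual memory system does the paging behind the scenes.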

13.3.2 Network Devices

 Because network access is inherently different from local disk access, most
systems provide a separate interface for network devices.
 One common and popular interface is the socket interface, which acts like a cable
or pipeline connecting two networked entities. Data can be put into the socket at
one end, and read out sequentially at the other end. Sockets are normally full-
duplex, allowing for bi-directional data transfer.
 The select( ) system call allows servers ( or other applications ) to identify sockets
which have data waiting, without having to poll all available sockets.
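
The following sketch shows select( ) picking out the one socket that has data waiting. It uses socket.socketpair( ) to create connected endpoints locally, so no network is needed; the variable names and the one-second timeout are arbitrary choices for this example.

```python
import select
import socket

a, b = socket.socketpair()        # two connected, full-duplex endpoints
c, d = socket.socketpair()        # a second pair that stays silent
try:
    a.sendall(b"ping")            # only the a<->b pair has data in flight
    # Ask which of b and d are readable, waiting at most one second.
    ready, _, _ = select.select([b, d], [], [], 1.0)
    received = [s.recv(16) for s in ready]
finally:
    for s in (a, b, c, d):
        s.close()
```

Only b is reported as ready, so the program never has to try reading from d.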

13.3.3 Clocks and Timers

 Three types of time services are commonly needed in modern systems:
o Get the current time of day.
o Get the elapsed time ( system or wall clock ) since a previous event.
o Set a timer to trigger event X at time T.
 Unfortunately time operations are not standard across all systems.
 A programmable interval timer ( PIT ) can be used to trigger operations and to
measure elapsed time. It can be set to trigger an interrupt at a specific future time,
or to trigger interrupts periodically on a regular basis.
o The scheduler uses a PIT to trigger interrupts for ending time slices.
o The disk system may use a PIT to schedule periodic maintenance cleanup,
such as flushing buffers to disk.
o Networks use a PIT to abort or repeat operations that are taking too long to
complete, e.g. resending packets if an acknowledgement is not received
before the timer goes off.
o More timers than actually exist can be simulated by maintaining an
ordered list of timer events, and setting the physical timer to go off when
the next scheduled event should occur.
 On most systems the system clock is implemented by counting interrupts
generated by the PIT. Unfortunately this is limited in its resolution to the interrupt
frequency of the PIT, and may be subject to some drift over time. An alternate
approach is to provide direct access to a high frequency hardware counter, which
provides much higher resolution and accuracy, but which does not support
interrupts.
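
The idea of simulating many timers with one physical timer and an ordered list of events can be sketched with a heap. This is an illustrative model only: time is a plain number supplied by the caller, and the class and method names are invented for this example.

```python
import heapq

class TimerQueue:
    """Many logical timers multiplexed onto one (simulated) hardware timer."""

    def __init__(self):
        self._events = []      # min-heap of (fire_time, sequence, callback)
        self._seq = 0          # tie-breaker so callbacks are never compared

    def set_timer(self, fire_time, callback):
        heapq.heappush(self._events, (fire_time, self._seq, callback))
        self._seq += 1

    def next_deadline(self):
        # Value the single physical timer would be programmed with.
        return self._events[0][0] if self._events else None

    def on_interrupt(self, now):
        # Called when the physical timer fires: run everything that is due.
        while self._events and self._events[0][0] <= now:
            _, _, callback = heapq.heappop(self._events)
            callback()
```

After each interrupt the kernel would reprogram the physical timer with next_deadline( ), so only the earliest pending event ever needs hardware support.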

13.3.4 Blocking and Non-blocking I/O

 With blocking I/O a process is moved to the wait queue when an I/O request is made, and
moved back to the ready queue when the request completes, allowing other processes to
run in the meantime.
 With non-blocking I/O the I/O request returns immediately, whether the requested I/O
operation has ( completely ) occurred or not. This allows the process to check for
available data without getting hung completely if it is not there.
 One approach for programmers to implement non-blocking I/O is to have a multi-
threaded application, in which one thread makes blocking I/O calls ( say to read a
keyboard or mouse ), while other threads continue to update the screen or perform other
tasks.
 A subtle variation of the non-blocking I/O is the asynchronous I/O, in which the I/O
request returns immediately allowing the process to continue on with other tasks, and
then the process is notified ( via changing a process variable, or a software interrupt, or a
callback function ) when the I/O operation has completed and the data is available for
use. ( The regular non-blocking I/O returns immediately with whatever results are
available, but does not complete the operation and notify the process later. )
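
The difference between a blocking and a non-blocking read can be seen on an ordinary pipe. The sketch below assumes a UNIX-like system, where os.set_blocking( ) works on pipe descriptors; the variable names are arbitrary.

```python
import os

r, w = os.pipe()
os.set_blocking(r, False)          # reads on r now return immediately

try:
    os.read(r, 16)                 # nothing has been written yet
    first_attempt = "data"
except BlockingIOError:
    first_attempt = "would block"  # the process stays runnable and can retry

os.write(w, b"key")
second_attempt = os.read(r, 16)    # data is there now, so this succeeds
os.close(r)
os.close(w)
```

With the descriptor left in blocking mode, the first read would instead have put the process to sleep until data arrived.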

Figure 13.8 - Two I/O methods: (a) synchronous and (b) asynchronous.

13.3.5 Vectored I/O ( NEW )

13.4 Kernel I/O Subsystem

13.4.1 I/O Scheduling

 Scheduling I/O requests can greatly improve overall efficiency. Priorities can also
play a part in request scheduling.
 The classic example is the scheduling of disk accesses, as discussed in detail in
Chapter 10.
 Buffering and caching can also help, and can allow for more flexible scheduling
options.
 On systems with many devices, separate request queues are often kept for each
device:

Figure 13.9 - Device-status table.

13.4.2 Buffering

 Buffering of I/O is performed for ( at least ) 3 major reasons:
1. Speed differences between two devices. ( See Figure 13.10 below. ) A
slow device may write data into a buffer, and when the buffer is full, the
entire buffer is sent to the fast device all at once. So that the slow device
still has somewhere to write while this is going on, a second buffer is used,
and the two buffers alternate as each becomes full. This is known
as double buffering. ( Double buffering is often used in ( animated )
graphics, so that one screen image can be generated in a buffer while the
other ( completed ) buffer is displayed on the screen. This prevents the
user from ever seeing any half-finished screen images. )
2. Data transfer size differences. Buffers are used in particular in networking
systems to break messages up into smaller packets for transfer, and then
for re-assembly at the receiving side.
3. To support copy semantics. For example, when an application makes a
request for a disk write, the data is copied from the user's memory area
into a kernel buffer. Now the application can change its copy of the
data, but the data which eventually gets written out to disk is the version
of the data at the time the write request was made.
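
Copy semantics ( reason 3 above ) can be illustrated with a toy model of the write path. This is not real kernel code: the class KernelWriter and its fields are invented for this example, and a Python bytes( ) copy stands in for the copy into a kernel buffer.

```python
class KernelWriter:
    """Toy model of write(): copy into a kernel buffer, flush to disk later."""

    def __init__(self):
        self.pending = []      # kernel buffers awaiting the physical write
        self.disk = []         # what has actually reached the disk

    def write(self, user_data):
        # bytes(...) takes a private copy, so later changes to the
        # caller's bytearray cannot affect what gets written.
        self.pending.append(bytes(user_data))

    def flush(self):
        self.disk.extend(self.pending)
        self.pending.clear()
```

If the caller modifies its buffer between write( ) and flush( ), the disk still receives the version that existed when write( ) was called.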

Figure 13.10 - Sun Enterprise 6000 device-transfer rates ( logarithmic ).

13.4.3 Caching

 Caching involves keeping a copy of data in a faster-access location than where the
data is normally stored.
 Buffering and caching are very similar, except that a buffer may hold the only
copy of a given data item, whereas a cache is just a duplicate copy of some other
data stored elsewhere.
 Buffering and caching go hand-in-hand, and often the same storage space may be
used for both purposes. For example, after a buffer is written to disk, the
copy in memory can be used as a cached copy ( until that buffer is needed for
other purposes. )

13.4.4 Spooling and Device Reservation

 A spool ( Simultaneous Peripheral Operations On-Line ) buffers data for
( peripheral ) devices such as printers that cannot support interleaved data streams.
 If multiple processes want to print at the same time, they each send their print data
to files stored in the spool directory. When each file is closed, then the application
sees that print job as complete, and the print scheduler sends each file to the
appropriate printer one at a time.
 Support is provided for viewing the spool queues, removing jobs from the queues,
moving jobs from one queue to another queue, and in some cases changing the
priorities of jobs in the queues.
 Spool queues can be general ( any laser printer ) or specific ( printer number 42. )
 OSes can also provide support for processes to request / get exclusive access to a
particular device, and/or to wait until a device becomes available.

13.4.5 Error Handling

 I/O requests can fail for many reasons, either transient ( buffers overflow ) or
permanent ( disk crash ).
 I/O requests usually return an error bit ( or more ) indicating the problem. UNIX
systems also set the global variable errno to one of a hundred or so well-defined
values to indicate the specific error that has occurred. ( See errno.h for a complete
listing, or man errno. )
 Some devices, such as SCSI devices, are capable of providing much more detailed
information about errors, and even keep an on-board error log that can be
requested by the host.
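
The errno mechanism is visible from most languages. In Python a failed system call raises OSError, which carries the errno value; the sketch below tries to open a path that is assumed not to exist ( the path itself is made up for this example ).

```python
import errno
import os

try:
    os.open("/no/such/dir/file.txt", os.O_RDONLY)
    code = 0
except OSError as exc:
    code = exc.errno                   # numeric value the kernel reported
    name = errno.errorcode[code]       # symbolic name, as found in errno.h
    message = os.strerror(code)        # human-readable description
```

Here the symbolic name is ENOENT, the standard code for "no such file or directory".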

13.4.6 I/O Protection

 The I/O system must protect against either accidental or deliberate erroneous I/O.
 User applications are not allowed to perform I/O in user mode - All I/O requests
are handled through system calls that must be performed in kernel mode.
 Memory mapped areas and I/O ports must be protected by the memory
management system, but access to these areas cannot be totally denied to user
programs. ( Video games and some other applications need to be able to write
directly to video memory for optimal performance for example. ) Instead the
memory protection system restricts access so that only one process at a time can
access particular parts of memory, such as the portion of the screen memory
corresponding to a particular window.

Figure 13.11 - Use of a system call to perform I/O.

13.4.7 Kernel Data Structures

 The kernel maintains a number of important data structures pertaining to the I/O
system, such as the open file table.
 These structures are object-oriented, and flexible to allow access to a wide variety
of I/O devices through a common interface. ( See Figure 13.12 below. )
 Windows NT carries the object-orientation one step further, implementing I/O as
a message-passing system from the source through various intermediaries to the
device.

Figure 13.12 - UNIX I/O kernel structure.

13.4.8 Kernel I/O Subsystem Summary

13.5 Transforming I/O Requests to Hardware Operations

 Users request data using file names, which must ultimately be mapped to specific blocks
of data from a specific device managed by a specific device driver.
 DOS uses the colon separator to specify a particular device ( e.g. C:, LPT:, etc. )
 UNIX uses a mount table to map filename prefixes ( e.g. /usr ) to specific mounted
devices. Where multiple entries in the mount table match different prefixes of the
filename the one that matches the longest prefix is chosen. ( e.g. /usr/home instead of /usr
where both exist in the mount table and both match the desired file. )
 UNIX uses special device files, usually located in /dev, to represent and access physical
devices directly.
o Each device file has a major and minor number associated with it, stored and
displayed where the file size would normally go.
o The major number is an index into a table of device drivers, and indicates which
device driver handles this device. ( e.g. the disk drive handler. )
o The minor number is a parameter passed to the device driver, and indicates which
specific device is to be accessed, out of the many which may be handled by a
particular device driver. ( e.g. a particular disk drive or partition. )
 A series of lookup tables and mappings makes the access of different devices flexible,
and somewhat transparent to users.
 Figure 13.13 illustrates the steps taken to process a ( blocking ) read request:


Figure 13.13 - The life cycle of an I/O request.
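
The mapping from a filename to a driver and device described in this section can be sketched as below. The table contents are invented for illustration: MOUNT_TABLE, DRIVER_TABLE, the device numbers, and the function names are all hypothetical, and a real kernel stores these in its own internal structures. The sketch picks the longest matching mount prefix, then uses the major number to select a driver and passes the minor number along to it.

```python
# Hypothetical mount table: filename prefix -> (major, minor) device number.
MOUNT_TABLE = {
    "/": (8, 1),
    "/usr": (8, 2),
    "/usr/home": (8, 17),
}

# Hypothetical driver table indexed by major number.
DRIVER_TABLE = {
    8: "disk_driver",
}


def is_prefix(mount_point, path):
    """True if mount_point is a whole-component prefix of path."""
    if mount_point == "/":
        return path.startswith("/")
    return path == mount_point or path.startswith(mount_point + "/")


def resolve(path):
    """Return (mount_point, driver_name, minor) for an absolute path."""
    matches = [m for m in MOUNT_TABLE if is_prefix(m, path)]
    if not matches:
        raise FileNotFoundError(path)
    mount_point = max(matches, key=len)      # longest matching prefix wins
    major, minor = MOUNT_TABLE[mount_point]
    # The major number selects the driver; the minor number is its parameter.
    return mount_point, DRIVER_TABLE[major], minor
```

With these tables, resolve( "/usr/home/alice/notes.txt" ) selects the /usr/home entry rather than /usr, while /usr/homework falls under /usr because prefixes are compared one whole path component at a time.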

13.6 STREAMS ( Optional )

 The streams mechanism in UNIX provides a bi-directional pipeline between a user
process and a device driver, onto which additional modules can be added.
 The user process interacts with the stream head.
 The device driver interacts with the device end.
 Zero or more stream modules can be pushed onto the stream, using ioctl( ). These
modules may filter and/or modify the data as it passes through the stream.
 Each module has a read queue and a write queue.
 Flow control can be optionally supported, in which case each module will buffer data
until the adjacent module is ready to receive it. Without flow control, data is passed along
as soon as it is ready.
 User processes communicate with the stream head using either read( ) and write( ), or
putmsg( ) and getmsg( ) for message passing.
 Streams I/O is asynchronous ( non-blocking ), except for the interface between the user
process and the stream head.
 The device driver must respond to interrupts from its device - If the adjacent module is
not prepared to accept data and the device driver's buffers are all full, then data is
typically dropped.
 Streams are available in most UNIX variants, and are the preferred approach for
protocols and device drivers. For example, System V UNIX and Solaris implement sockets
using streams.
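
The structure described above can be modeled roughly as follows. This is a toy sketch rather than the real STREAMS interface: the Stream and StreamModule classes are hypothetical, a plain method call stands in for pushing a module with ioctl( ), flow control is omitted, and a deque stands in for the driver end. Each module has a write queue for data heading toward the device and a read queue for data heading back toward the stream head.

```python
from collections import deque


class StreamModule:
    """One module with a write queue (downstream) and a read queue (upstream)."""
    def __init__(self, on_write=lambda d: d, on_read=lambda d: d):
        self.write_queue = deque()
        self.read_queue = deque()
        self.on_write = on_write
        self.on_read = on_read


class Stream:
    def __init__(self):
        self.modules = []       # index 0 is nearest the stream head
        self.device = deque()   # stands in for the driver end

    def push(self, module):
        # A pushed module is placed directly beneath the stream head.
        self.modules.insert(0, module)

    def write(self, data):
        # Data flows from the stream head down through each write queue.
        for module in self.modules:
            module.write_queue.append(data)
            data = module.on_write(module.write_queue.popleft())
        self.device.append(data)

    def read(self):
        # Data flows from the driver end up through each read queue.
        data = self.device.popleft()
        for module in reversed(self.modules):
            module.read_queue.append(data)
            data = module.on_read(module.read_queue.popleft())
        return data


stream = Stream()
stream.push(StreamModule(on_write=lambda d: d + b"\n"))  # ends up nearer the driver
stream.push(StreamModule(on_write=bytes.upper))          # ends up nearer the head
stream.write(b"abc")
sent = list(stream.device)
received = stream.read()
```

Because the second module pushed sits closer to the stream head, written data is upper-cased first and only then has the newline appended on its way to the device.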


Figure 13.14 - The STREAMS structure.

13.7 Performance ( Optional )

 The I/O system is a major factor in overall system performance, and can place heavy
loads on other major components of the system ( interrupt handling, process switching,
memory access, bus contention, and CPU load for device drivers just to name a few. )
 Interrupt handling can be relatively expensive ( slow ), so programmed I/O can be
faster than interrupt-driven I/O when the time spent busy waiting is not excessive.
 Network traffic can also put a heavy load on the system. Consider, for example, the
sequence of events that occurs when a single character is typed in a telnet session, as
shown in Figure 13.15. ( A similar set of events must then happen in reverse to echo
back the character that was typed. ) Sun uses in-kernel threads for the telnet
daemon, increasing the supportable number of simultaneous telnet sessions from the
hundreds to the thousands.


Figure 13.15 - Intercomputer communications.
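
The trade-off between programmed and interrupt-driven I/O noted earlier in this section can be expressed as a simple comparison. This is only a rough cost model under stated assumptions: programmed I/O is charged the full device wait as wasted CPU time, interrupt-driven I/O is charged a fixed overhead per interrupt, and the function name and any numbers passed to it are hypothetical.

```python
def cheaper_io_method(device_wait_us, interrupt_overhead_us):
    """Pick the method that wastes less CPU time for one transfer.

    Programmed I/O busy-waits for the whole device wait, while
    interrupt-driven I/O pays a fixed overhead (state save, handler
    dispatch, state restore) and leaves the CPU free in between.
    """
    if device_wait_us <= interrupt_overhead_us:
        return "programmed"
    return "interrupt-driven"
```

For a device that is ready within 2 microseconds against an assumed 10 microsecond interrupt overhead, the model favors programmed I/O; for a 5000 microsecond wait it favors interrupts.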

 Other systems use front-end processors to off-load some of the work of I/O processing
from the CPU. For example, a terminal concentrator can multiplex hundreds of
terminals onto a single port on a large computer.
 Several principles can be employed to increase the overall efficiency of I/O processing:
1. Reduce the number of context switches.
2. Reduce the number of times data must be copied.
3. Reduce interrupt frequency, using large transfers, buffering, and polling where
appropriate.
4. Increase concurrency using DMA.
5. Move processing primitives into hardware, allowing their operation to be
concurrent with CPU and bus operations.
6. Balance CPU, memory, bus, and I/O operations, so a bottleneck in one does not
idle all the others.
 The development of new I/O algorithms often follows a progression from application
level code to on-board hardware implementation, as shown in Figure 13.16. Lower-level
implementations are faster and more efficient, but higher-level ones are more flexible and
easier to modify. Hardware-level functionality may also be harder for higher-level
authorities ( e.g. the kernel ) to control.

Figure 13.16 - Device functionality progression.
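
The benefit of large transfers listed among the efficiency principles above can be seen at the system-call level by counting how many read( ) calls it takes to consume the same file with different buffer sizes. This is only an analogy for interrupt and context-switch counts, not a measurement of them; the file size, the chunk sizes, and the count_reads name are arbitrary choices for illustration.

```python
import os
import tempfile


def count_reads(path, chunk_size):
    """Read the whole file with os.read() and count every call made."""
    fd = os.open(path, os.O_RDONLY)
    calls = 0
    try:
        while True:
            calls += 1
            if not os.read(fd, chunk_size):   # empty result means end of file
                break
    finally:
        os.close(fd)
    return calls


# Create an 8192-byte scratch file to read back.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 8192)
    path = tmp.name

small = count_reads(path, 1)      # one call per byte, plus the end-of-file call
large = count_reads(path, 4096)   # far fewer calls for the same data
os.remove(path)
```

Each call is a transition into the kernel, so moving the same 8192 bytes in 4096-byte chunks needs only a handful of calls where byte-at-a-time reads need thousands.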
