Storage Systems: 1. Explain Various Types of Storage Devices
Storage Systems
• Long-term, nonvolatile storage for files, even when no programs are running
• A level of the memory hierarchy below main memory, used as a backing store for virtual memory during program execution
The disk industry has concentrated on improving the capacity of disks. Improvement
in capacity is customarily expressed as improvement in areal density, measured in bits
per square inch:
Areal density = (tracks/inch on a disk surface) × (bits/inch on a track)
Through about 1988 the rate of improvement of areal density was 29% per year, thus
doubling density every three years. Between then and about 1996, the rate improved to
60% per year, quadrupling density every three years and matching the traditional rate of
DRAMs. From 1997 to 2001 the rate increased to 100%, or doubling every year. In 2001,
the highest density in commercial products is 20 billion bits per square inch, and the lab
record is 60 billion bits per square inch.
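As a rough illustration of how the areal-density formula and these growth rates combine, the sketch below projects density over time in Python. The starting track and bit densities and the exact year spans are assumptions chosen for the example, not figures from the text.

    # Sketch: areal density = (tracks/inch) x (bits/inch), compounded by an
    # annual improvement rate. Starting values are assumptions for illustration.

    def areal_density(tracks_per_inch, bits_per_inch):
        """Bits per square inch on a disk surface."""
        return tracks_per_inch * bits_per_inch

    def project(density, annual_rate, years):
        """Compound an annual improvement rate over a number of years."""
        return density * (1 + annual_rate) ** years

    d = areal_density(2_000, 30_000)   # hypothetical late-1980s disk: 60 Mbits/sq. inch
    d = project(d, 0.60, 8)            # ~60% per year through the mid-1990s
    d = project(d, 1.00, 5)            # ~100% per year, 1997-2001
    print(f"Projected density: {d / 1e9:.1f} billion bits per square inch")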
Optical Disks:
One challenger to magnetic disks is the optical compact disk, or CD, and its successor, first called the Digital Video Disc and then the Digital Versatile Disc, or just DVD. Both the CD-ROM and DVD-ROM are removable and inexpensive to manufacture, but they are read-only media. These 4.7-inch diameter disks hold 0.65 and 4.7 GB,
respectively, although some DVDs write on both sides to double their capacity. Their
high capacity and low cost have led to CD-ROMs and DVD-ROMs replacing floppy
disks as the favorite medium for distributing software and other types of computer data.
The popularity of CDs and music that can be downloaded from the WWW led to
a market for rewritable CDs, conveniently called CD-RW, and write once CDs, called
CD-R. In 2001, there is a small cost premium for drives that can record on CD-RW. The
media itself costs about $0.20 per CD-R disk or $0.60 per CD-RW disk. CD-RWs and CD-Rs read at about half the speed of CD-ROMs, and they write at about a quarter the speed of CD-ROMs.
Magnetic Tape:
Magnetic tapes have been part of computer systems as long as disks because they use technology similar to that of disks, and hence historically have followed the same density
improvements. The inherent cost/performance difference between disks and tapes is
based on their geometries:
• Fixed rotating platters offer random access in milliseconds, but disks have a limited storage area and the storage medium is sealed within each reader.
• Long strips wound on removable spools of “unlimited” length mean many tapes can be used per reader, but tapes require sequential access that can take seconds.
One of the limits of tapes had been the speed at which the tapes can spin without
breaking or jamming. A technology called helical scan tapes solves this problem by
keeping the tape speed the same but recording the information on a diagonal to the tape
with a tape reader that spins much faster than the tape is moving. This technology
increases recording density by about a factor of 20 to 50. Helical scan tapes were
developed for low-cost VCRs and camcorders, which brought down the cost of the tapes
and readers.
Flash Memory
Embedded devices also need nonvolatile storage, but premiums placed on space
and power normally lead to the use of Flash memory instead of magnetic recording. Flash
memory is also used as a rewritable ROM in embedded systems, typically to allow
software to be upgraded without having to replace chips. Applications are typically
prohibited from writing to Flash memory in such circumstances.
Like electrically erasable and programmable read-only memories (EEPROM),
Flash memory is written by inducing the tunneling of charge from the transistor channel to a
floating gate. The floating gate acts as a potential well which stores the charge, and the
charge cannot move from there without applying an external force. The primary
difference between EEPROM and Flash memory is that Flash restricts writes to multi-
kilobyte blocks, increasing memory capacity per chip by reducing area dedicated to
control. Compared to disks, Flash memories offer low power consumption (less than 50
milliwatts), can be sold in small sizes, and offer read access times comparable to
DRAMs. In 2001, a 16 Mbit Flash memory has a 65 ns access time, and a 128 Mbit Flash
memory has a 150 ns access time.
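As a way to picture the block-oriented write restriction, the toy model below (a sketch with an assumed 64 KB block size, not the behavior of any particular part) shows that rewriting even a single byte costs a whole-block erase.

    # Toy model of Flash's erase-before-write restriction. The block size and
    # the erase counter are illustrative assumptions, not device specifications.

    BLOCK_SIZE = 64 * 1024                 # assume 64 KB erase blocks

    class ToyFlash:
        def __init__(self, num_blocks):
            self.blocks = [bytearray(BLOCK_SIZE) for _ in range(num_blocks)]
            self.erase_count = 0

        def write_byte(self, addr, value):
            """Rewrite one byte: read the block, erase it, program it back."""
            block, offset = divmod(addr, BLOCK_SIZE)
            data = bytearray(self.blocks[block])   # read-modify-write buffer
            data[offset] = value
            self.erase_count += 1                  # whole block erased for one byte
            self.blocks[block] = data

    flash = ToyFlash(num_blocks=4)
    flash.write_byte(70_000, 0xAB)
    print(f"bytes changed: 1, blocks erased: {flash.erase_count}")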
The next item in the table concerns the number of bus masters. Bus masters are devices that can initiate a read or write transaction; the CPU, for instance, is always a bus master. A bus
has multiple masters when there are multiple CPUs or when I/O devices can initiate a bus
transaction. With multiple masters, a bus can offer higher bandwidth by using packets, as
opposed to holding the bus for the full transaction. This technique is called split
transactions.
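A back-of-the-envelope comparison makes the bandwidth argument concrete; the cycle counts below are made-up values for illustration, not measurements of any real bus.

    # Rough comparison of a held (connected) bus transaction vs. a split
    # transaction. All cycle counts are illustrative assumptions.

    request_cycles = 1      # cycles to send the address/command packet
    memory_latency = 10     # cycles the target needs before data is ready
    reply_cycles   = 4      # cycles to return the data packet

    # Connected transaction: the bus is held for the whole round trip.
    held_busy = request_cycles + memory_latency + reply_cycles

    # Split transaction: the bus is busy only for the request and reply packets;
    # the latency in between is available to other bus masters.
    split_busy = request_cycles + reply_cycles

    print(f"bus busy per access: held={held_busy}, split={split_busy} cycles")
    print(f"cycles freed for other masters: {held_busy - split_busy}")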
The final item in Figure 7.8, clocking, concerns whether a bus is synchronous or
asynchronous. If a bus is synchronous, it includes a clock in the control lines and a fixed
protocol for sending address and data relative to the clock. Since little or no logic is
needed to decide what to do next, these buses can be both fast and inexpensive.
Bus Standards
Standards that let the computer designer and I/O-device designer work independently
play a large role in buses. As long as both designers meet the requirements, any I/O
device can connect to any computer. The I/O bus standard is the document that defines
how to connect devices to computers.
• The Good
– Let the computer and I/O-device designers work independently
– Provides a path for second party (e.g. cheaper) competition
• The Bad
– Become major performance anchors
– Inhibit change
• How to create a standard
– Bottom-up
Company tries to get a standards committee to approve its latest
philosophy in hopes that they’ll get the jump on the others (e.g. S
bus, PC-AT bus, ...)
• De facto standards
– Top-down
• Design by committee (PCI, SCSI, ...)
Some sample bus designs are shown below. How the I/O bus connects to the main memory bus is shown in Figure 7.15.
The processor can interface with the I/O bus using two techniques: one using interrupts and the other using memory-mapped I/O.
• I/O Control Structures
– Polling
– Interrupts
– DMA
– I/O Controllers
– I/O Processors
The simple interface, in which the CPU periodically checks status bits to see if it is
time for the next I/O operation, is called polling.
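A minimal sketch of a polling loop is shown below; the status register, its READY bit, and the helper names are hypothetical stand-ins, not the interface of any real device.

    # Sketch of polling: the CPU spins on a device status bit before each
    # transfer. read_status() stands in for reading a memory-mapped register.

    READY_BIT = 0x1

    def read_status():
        """Placeholder for reading the device's status register."""
        return READY_BIT                  # pretend the device is always ready

    def poll_and_transfer(data):
        while not (read_status() & READY_BIT):
            pass                          # CPU cycles burned doing no useful work
        # ... issue the actual data transfer here ...
        return len(data)

    print(poll_and_transfer(b"hello"))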
Interrupt-driven I/O, used by most systems for at least some devices, allows the CPU
to work on some other process while waiting for the I/O device. For example, the LP11
has a mode that allows it to interrupt the CPU whenever the done bit or error bit is set. In
general-purpose applications, interrupt-driven I/O is the key to multitasking operating
systems and good response times.
The drawback to interrupts is the operating system overhead on each event. In
real-time applications with hundreds of I/O events per second, this overhead can be
intolerable. One hybrid solution for real-time systems is to use a clock to periodically
interrupt the CPU, at which time the CPU polls all I/O devices.
The DMA hardware is a specialized processor that transfers data between
memory and an I/O device while the CPU goes on with other tasks. Thus, it is external to
the CPU and must act as a master on the bus. The CPU first sets up the DMA registers,
which contain a memory address and number of bytes to be transferred. More
sophisticated DMA devices support scatter/gather, whereby a DMA device can write or
read data from a list of separate addresses. Once the DMA transfer is complete, the DMA
controller interrupts the CPU. There may be multiple DMA devices in a computer
system.
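Scatter/gather can be pictured as a list of (address, length) descriptors that the DMA engine walks on its own. The sketch below is a software model with a flat bytearray standing in for main memory; the names and layout are hypothetical.

    # Toy model of scatter/gather DMA: the controller is handed a descriptor
    # list of (start address, byte count) pairs and copies each region without
    # further CPU involvement.

    memory = bytearray(1024)              # stand-in for main memory
    memory[100:104] = b"ABCD"
    memory[300:303] = b"XYZ"

    def dma_gather(descriptors):
        """Gather scattered regions of memory into one contiguous buffer."""
        device_buffer = bytearray()
        for addr, length in descriptors:
            device_buffer += memory[addr:addr + length]
        return device_buffer              # real hardware would interrupt the CPU here

    print(dma_gather([(100, 4), (300, 3)]))   # bytearray(b'ABCDXYZ')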
No Redundancy (RAID 0)
This notation refers to a disk array in which data is striped but there is no
redundancy to tolerate disk failure. Striping across a set of disks makes the collection
appear to software as a single large disk, which simplifies storage management. It also
improves performance for large accesses, since many disks can operate at once. Video
editing systems, for example, often stripe their data.
RAID 0 is something of a misnomer, as there is no redundancy; it is not in the
original RAID taxonomy, and striping predates RAID. However, RAID levels are often
left to the operator to set when creating a storage system, and RAID 0 is often listed as
one of the options. Hence, the term RAID 0 has become widely used.
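How striping makes a set of disks look like one large disk can be seen from the address mapping below, a sketch assuming a simple round-robin layout with made-up stripe-unit size and disk count.

    # Sketch of RAID 0 address mapping with a round-robin stripe layout.

    STRIPE_UNIT = 64 * 1024               # bytes per stripe unit (assumed)
    NUM_DISKS = 4                         # disks in the array (assumed)

    def locate(logical_byte):
        """Map a logical byte offset to (disk index, offset on that disk)."""
        unit, offset = divmod(logical_byte, STRIPE_UNIT)
        disk = unit % NUM_DISKS
        offset_on_disk = (unit // NUM_DISKS) * STRIPE_UNIT + offset
        return disk, offset_on_disk

    # A large sequential access touches every disk, so all can work in parallel.
    for unit in range(NUM_DISKS):
        print(locate(unit * STRIPE_UNIT))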
Mirroring (RAID 1)
This traditional scheme for tolerating disk failure, called mirroring or shadowing,
uses twice as many disks as does RAID 0. Whenever data is written to one disk, that data
is also written to a redundant disk, so that there are always two copies of the information.
If a disk fails, the system just goes to the “mirror” to get the desired information.
Mirroring is the most expensive RAID solution, since it requires the most disks.
Mirroring can be combined with striping in two ways: the array can stripe data across mirrored pairs of disks, or it can mirror two complete striped arrays. The RAID terminology has evolved to call the former RAID 1+0 or RAID 10 (“striped mirrors”) and the latter RAID 0+1 or RAID 01 (“mirrored stripes”).
Bit-Interleaved Parity (RAID 3)
The cost of higher availability can be reduced to 1/N, where N is the number of
disks in a protection group. Rather than have a complete copy of the original data for
each disk, we need only add enough redundant information to restore the lost information
on a failure. Reads or writes go to all disks in the group, with one extra disk to hold the
check information in case there is a failure. RAID 3 is popular in applications with large
data sets, such as multimedia and some scientific codes.
Parity is one such scheme. Readers unfamiliar with parity can think of the redundant
disk as having the sum of all the data in the other disks. When a disk fails, then you
subtract all the data in the good disks from the parity disk; the remaining information
must be the missing information. Parity is simply the sum modulo two. The assumption
behind this technique is that failures are so rare that taking longer to recover from failure
but reducing redundant storage is a good trade-off.
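The “subtraction” described above is just XOR. The sketch below, with made-up block contents, reconstructs a failed disk's block by XORing the surviving data blocks with the parity block.

    # Sketch: parity is the XOR (sum modulo two) of the data blocks, so the
    # contents of a failed disk are the XOR of all surviving blocks plus parity.

    def xor_blocks(blocks):
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                result[i] ^= byte
        return bytes(result)

    d0, d1, d2, d3 = b"disk", b"arra", b"y da", b"ta!!"   # illustrative data
    parity = xor_blocks([d0, d1, d2, d3])

    # Suppose the disk holding d2 fails: XOR everything that survives.
    recovered = xor_blocks([d0, d1, d3, parity])
    print(recovered == d2)                                # True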
In RAID 3, every access went to all disks. Some applications would prefer to do
smaller accesses, allowing independent accesses to occur in parallel. That is the purpose
of the next RAID levels. Since error-detection information in each sector is checked on
reads to see if data is correct, such “small reads” to each disk can occur independently as
long as the minimum access is one sector.
Writes are another matter. It would seem that each small write would demand that all
other disks be accessed to read the rest of the information needed to recalculate the new
parity, as in Figure 7.18. A “small write” would require reading the old data and old
parity, adding the new information, and then writing the new parity to the parity disk and
the new data to the data disk.
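The new parity can in fact be computed from just the old data and the old parity: new parity = old parity XOR old data XOR new data. The sketch below, with made-up block contents, checks this shortcut against recomputing parity over all the data disks.

    # Sketch of the RAID 4/5 small-write shortcut: two reads (old data, old
    # parity) and two writes (new data, new parity), leaving other disks alone.

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    old_data   = b"\x01\x02\x03\x04"                 # illustrative old D0
    other_data = [b"\x10\x11\x12\x13", b"\x20\x21\x22\x23", b"\x40\x41\x42\x43"]
    old_parity = old_data
    for block in other_data:
        old_parity = xor_bytes(old_parity, block)

    new_data = b"\xAA\xBB\xCC\xDD"
    new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)

    # Cross-check against parity recomputed over all four data blocks.
    full_parity = new_data
    for block in other_data:
        full_parity = xor_bytes(full_parity, block)
    print(new_parity == full_parity)                 # True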
RAID 4 efficiently supports a mixture of large reads, large writes, small reads, and
small writes. One drawback to the system is that the parity disk must be updated on every
write, so it is the bottleneck for back-to-back writes. To fix the parity-write bottleneck,
the parity information can be spread throughout all the disks so that there is no single
bottleneck for writes. The distributed parity organization is RAID 5.
(Figure 7.18 sketches the small write: from the old stripe D0, D1, D2, D3, P and the new data D0', an XOR produces the new stripe D0', D1, D2, D3, P'.)
Figure 7.19 shows how data are distributed in RAID 4 vs. RAID 5. As the organization
on the right shows, in RAID 5 the parity associated with each row of data blocks is no
longer restricted to a single disk. This organization allows multiple writes to occur
simultaneously as long as the stripe units are not located in the same disks. For example,
a write to block 8 on the right must also access its parity block P2, thereby occupying the
first and third disks. A second write to block 5 on the right, implying an update to its
parity block P1, accesses the second and fourth disks and thus could occur at the same
time as the write to block 8. Those same writes to the organization on the left would
result in changes to blocks P1 and P2, both on the fifth disk, which would be a
bottleneck.
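The rotation can be expressed as a small mapping from stripe number to the disk holding that stripe's parity. The sketch below uses one simple rotation rule; real arrays differ in the exact layout, so treat it as an illustration only.

    # Sketch of a rotated-parity (RAID 5) layout: for each stripe, one disk
    # holds the parity block (P) and the others hold data (D).

    NUM_DISKS = 5                         # assumed array width

    def parity_disk(stripe):
        """Disk index holding the parity block for a given stripe."""
        return (NUM_DISKS - 1 - stripe) % NUM_DISKS

    for stripe in range(5):
        row = ["P" if d == parity_disk(stripe) else "D" for d in range(NUM_DISKS)]
        print(f"stripe {stripe}: {' '.join(row)}")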
P+Q redundancy (RAID 6)
When protection against a single disk failure is not sufficient, the parity idea can be generalized: a second, independent check block is computed for each protection group, so the array can recover from two simultaneous disk failures at the cost of one more check disk per group. This P+Q scheme is called RAID 6.

Errors and Failures in Real Systems
Publications of real error rates are rare, for two reasons. First, academics rarely have access to significant hardware resources to measure. Second, industrial researchers are rarely allowed to publish failure information for fear that it would be used against their companies in the marketplace. Below are four exceptions.
Component                        Total in System   Total Failed   % Failed
SCSI Controller                         44               1           2.3%
SCSI Cable                              39               1           2.6%
SCSI Disk                              368               7           1.9%
IDE Disk                                24               6          25.0%
Disk Enclosure - Backplane              46              13          28.3%
Disk Enclosure - Power Supply           92               3           3.3%
Ethernet Controller                     20               1           5.0%
Ethernet Switch                          2               1          50.0%
Ethernet Cable                          42               1           2.3%
CPU/Motherboard                         20               0           0%

FIGURE 7.20 Failures of components in Tertiary Disk over eighteen months of operation.
Figure 7.20 shows the failure rates of the various components of Tertiary Disk. In
advance of building the system, the designers assumed that data disks would be the least
reliable part of the system, as they are both mechanical and plentiful. As Tertiary Disk
was a large system with many redundant components, it had the potential to survive this
wide range of failures. Components were connected and mirrored images were placed so that no single failure could make any image unavailable. This strategy, which initially appeared
to be overkill, proved to be vital.
This experience also demonstrated the difference between transient faults and hard faults. Transient faults are faults that come and go, at least temporarily fixing themselves. Hard faults stop the device from working properly and will continue to misbehave until repaired.
Tandem
The next example comes from industry. Gray [1990] collected data on faults for Tandem
Computers, which was one of the pioneering companies in fault tolerant computing.
Figure 7.21 graphs the faults that caused system failures between 1985 and 1989, in absolute faults per system and in percentage of faults encountered. The data shows a clear improvement in the reliability of hardware and maintenance. Disks in 1985 needed yearly service by Tandem, but they were replaced by disks that needed no scheduled maintenance. The shrinking number of chips and connectors per system, plus software's ability to tolerate hardware faults, reduced hardware's contribution to only 7% of failures by 1989. And when hardware was at fault, software embedded in the hardware device (firmware) was often the culprit. The data indicates that software in 1989 was the major source of reported outages (62%), followed by system operations (15%).
The problem with any such statistics is that these data only refer to what is
reported; for example, environmental failures due to power outages were not reported to
Tandem because they were seen as a local problem.
VAX
The next example is also from industry. Murphy and Gent [1995] measured faults
in VAX systems. They classified faults as hardware, operating system, system
management, or application/networking. Figure 7.22 shows their data for 1985 and 1993.
They tried to improve the accuracy of data on operator faults by having the system
automatically prompt the operator on each boot for the reason for that reboot. They also
classified consecutive crashes to the same fault as operator faults. Note that hardware and the operating system together went from causing 70% of the failures in 1985 to 28% in
1993. Murphy and Gent expected system management to be the primary dependability
challenge in the future.
FCC
The final set of data comes from the government. The Federal Communications
Commission (FCC) requires that all telephone companies submit explanations when they
experience an outage that affects at least 30,000 people or lasts thirty minutes. These
detailed disruption reports do not suffer from the self-reporting problem of earlier figures,
as investigators determine the cause of the outage rather than operators of the equipment.
Kuhn [1997] studied the causes of outages between 1992 and 1994 and Enriquez [2001]
did a follow-up study for the first half of 2001. In addition to reporting number of
outages, the FCC data includes the number of customers affected and how long they were
affected. Hence, we can look at the size and scope of failures, rather than assuming that
all are equally important. Figure 7.23 plots the absolute and relative number of customer-outage minutes for those years, broken into four categories.
These four examples and others suggest that the primary cause of failures in large
systems today is faults by human operators. Hardware faults have declined due to a
decreasing number of chips in systems, reduced power, and fewer connectors. Hardware
dependability has improved through fault tolerance techniques such as RAID. At least
some operating systems are considering reliability implications before adding new features, so in 2001 the failures largely occur elsewhere.
FIGURE 7.31 Transaction Processing Council Benchmarks. The summary results include
both the performance metric and the price-performance of that metric. TPC-A, TPC-B,
and TPC-D were retired.
The TPC benchmarks were the first, and in some cases are still the only ones, to have these unusual characteristics:
• Price is included with the benchmark results. The cost of hardware, software, and five-year maintenance agreements is included in a submission, which enables evaluations based on price-performance as well as high performance.
• The data set generally must scale in size as the throughput increases. The benchmarks are trying to model real systems, in which the demand on the system and the size of the data stored in it increase together. It makes no sense, for example, to have thousands of people per minute access hundreds of bank accounts.
• The benchmark results are audited. Before results can be submitted, they must be approved by a certified TPC auditor, who enforces the TPC rules that try to make sure that only fair results are submitted. Results can be challenged and disputes resolved by going before the TPC council.
• Throughput is the performance metric, but response times are limited. For example, with TPC-C, 90% of the New-Order transaction response times must be less than 5 seconds.
• An independent organization maintains the benchmarks. Dues collected by TPC pay for an administrative structure including a Chief Operating Office. This organization settles disputes, conducts mail ballots on approval of changes to benchmarks, holds board meetings, and so on.
The SPEC benchmarking effort is best known for its characterization of processor
performance, but has created benchmarks for other fields as well. In 1990 seven
companies agreed on a synthetic benchmark, called SFS, to evaluate systems running the
Sun Microsystems network file service NFS. This benchmark was upgraded to SFS 2.0
(also called SPEC SFS97) to include support for NFS version 3, using TCP in addition to
UDP as the transport protocol, and making the mix of operations more realistic.
Figure 7.32 shows average response time versus throughput for four systems.
Unfortunately, unlike the TPC benchmarks, SFS does not normalize for different price
configurations. The fastest system in Figure 7.32 has 7 times the number of CPUs and
disks as the slowest system, but SPEC leaves it to you to calculate price versus
performance. As performance scaled to new heights, SPEC discovered bugs in the
benchmark that impact the amount of work done during the measurement periods. Hence,
it was retired in June 2001.
SPEC WEB is a benchmark for evaluating the performance of World Wide Web
servers. The SPEC WEB99 workload simulates accesses to a web service provider, where
the server supports home pages for several organizations. Each home page is a collection
of files ranging in size from small icons to large documents and images, with some files being more popular than others. The workload defines four sizes of files and their frequency of activity.
Figure 7.33 shows results for Dell computers. The performance result represents the
number of simultaneous connections the web server can support using the predefined
workload. As the disk system is the same, it appears that the large memory is used for a
file cache to reduce disk I/O.
System Name          Result   CPUs   Result/CPU   HTTP Version/OS             Pentium III    DRAM
PowerEdge 2400/667      732     1         732     IIS 5.0/Windows 2000        667 MHz EB     2 GB
PowerEdge 2400/667     1270     1        1270     TUX 1.0/Red Hat Linux 6.2   667 MHz EB     2 GB
PowerEdge 4400/800     1060     2         530     IIS 5.0/Windows 2000        800 MHz EB     4 GB
PowerEdge 4400/800     2200     2        1100     TUX 1.0/Red Hat Linux 6.2   800 MHz EB     4 GB
PowerEdge 6400/700     1598     4         400     IIS 5.0/Windows 2000        700 MHz Xeon   8 GB
PowerEdge 6400/700     4200     4        1050     TUX 1.0/Red Hat Linux 6.2   700 MHz Xeon   8 GB
FIGURE 7.33 SPEC WEB99 results in 2000 for Dell computers. Each machine uses five 9 GB, 10,000 RPM disks except the fifth system, which had seven disks. The first four have 256 KB of L2 cache while the last two have 2 MB of L2 cache.
The art of I/O system design is to find a design that meets goals for cost, dependability, and variety of devices while avoiding bottlenecks to I/O performance. Avoiding
bottlenecks means that components must be balanced between main memory and the I/O
device, because performance and hence effective cost/performance can only be as good
as the weakest link in the I/O chain. Finally, storage must be dependable, adding new
constraints on proposed designs.
In designing an I/O system, analyze performance, cost, capacity, and availability
using varying I/O connection schemes and different numbers of I/O devices of each type.
Here is one series of steps to follow in designing an I/O system. The answers for each
step may be dictated by market requirements or simply by cost, performance, and
availability goals.
1 List the different types of I/O devices to be connected to the machine, or list the
standard buses that the machine will support.
2 List the physical requirements for each I/O device. Requirements include size,
power, connectors, bus slots, expansion cabinets, and so on.
3 List the cost of each I/O device, including the portion of cost of any controller
needed for this device.
4 List the reliability of each I/O device.
5 Record the CPU resource demands of each I/O device. This list should include
CPU clock cycles to recover from an I/O activity, such as a cache flush.
6 List the memory and I/O bus resource demands of each I/O device. Even when the CPU is not using memory, the bandwidth of main memory and the I/O bus is limited.
7 The final step is assessing the performance and availability of the different ways to organize these I/O devices. Performance can only be properly evaluated with simulation, though it may be estimated using queuing theory. Reliability can be calculated assuming that I/O devices fail independently and that MTTFs are exponentially distributed. Availability can be computed from reliability by estimating MTTF for the devices, taking into account the time from failure to repair.
You then select the best organization, given your cost, performance, and availability
goals.
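The reliability arithmetic in step 7 can be sketched as follows, assuming independent failures with exponentially distributed lifetimes; the component counts, MTTFs, and MTTR below are made-up values for illustration.

    # Sketch of system MTTF and availability under the usual assumptions:
    # independent failures, exponential lifetimes (so failure rates add).

    components = {                        # name: (count, MTTF in hours) -- assumed
        "disk":         (20, 1_000_000),
        "controller":   (2,    500_000),
        "power supply": (2,    200_000),
    }

    failure_rate = sum(count / mttf for count, mttf in components.values())
    mttf_system = 1 / failure_rate

    mttr = 24                             # assumed hours from failure to repair
    availability = mttf_system / (mttf_system + mttr)

    print(f"system MTTF: {mttf_system:,.0f} hours")
    print(f"availability: {availability:.5f}")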
Cost/performance goals affect the selection of the I/O scheme and physical design.
Performance can be measured either as megabytes per second or I/Os per second,
depending on the needs of the application. For high performance, the only limits should
be speed of I/O devices, number of I/O devices, and speed of memory and CPU. For low
cost, the only expenses should be those for the I/O devices themselves and for cabling to
the CPU. Cost/performance design, of course, tries for the best of both worlds.
Availability goals depend in part on the cost of unavailability to an organization.
To make these ideas clearer, the next dozen pages go through five examples. Each
looks at constructing a disk array with about 2 terabytes of capacity for user data with
two sizes of disks. To offer a gentle introduction to I/O design and evaluation, the
examples evolve in realism.
To try to avoid getting lost in the details, let’s start with an overview of the five
examples:
1 Naive cost-performance design and evaluation: The first example calculates cost-
performance of an I/O system for the two types of disks. It ignores dependability
concerns, and makes the simplifying assumption of allowing 100% utilization of I/O
resources. This example is also the longest.
2 Availability of the first example: The second example calculates the poor
availability of this naive I/O design.
3 Response times of the first example: The third example uses queuing theory to
calculate the impact on response time of trying to use 100% of an I/O resource.
4 More realistic cost-performance design and evaluation: Since the third example shows the folly of 100% utilization, the fourth example changes the design to obey common rules of thumb on utilization of I/O resources. It then evaluates cost-performance.
5 More realistic design for availability and its evaluation: Since the second example shows the poor availability when dependability is ignored, this final example uses a RAID 5 design. It then calculates availability and performance.