A Flash-Memory Based File System
Since flash memory can be accessed directly through a processor's memory bus, other approaches (such as tightly coupling a flash memory system and a buffer cache) might perform better by reducing memory-to-memory copy operations. Such an approach, however, would require a large number of kernel modifications, because flash memory's erase and write properties differ greatly from those of main memory.

2.1 Overview

Our driver must record which regions contain invalid data and reclaim the invalid regions by erasing them. Furthermore, since the number of erase operations in each region is limited, the driver must at least monitor that number in order to assure reliable operation. In some cases, wear-leveling should be provided.

Our driver maintains a sequential data structure similar to that of LFS [3][4]. It handles a write request by appending the requested data to the tail of the structure. To enable later retrieval of the data, it maintains a translation table for translating between the physical block number and the flash memory address. Because the translation is made at the level of the physical block number, our driver can be viewed, from the file-system side, as an ordinary block device.

As write operations take place, the flash memory system becomes fragmented into valid and invalid data blocks. To reclaim invalid data blocks, we use a cleaner that selects an erase sector, collects the valid data blocks in the sector, copies their contents to the tail of the log, invalidates the copied blocks (and consequently makes the entire sector invalid), and issues an erase command to reclaim the sector. The functions of this cleaner are identical to those of LFS's cleaner. Our prototype does not implement wear-leveling, although it does maintain a log of each erased sector.

2.2 Flash Memory Capability

Early generations of flash memory products prohibit read or write operations while they are performing a write or an erase operation. That is, when such a flash memory chip is performing an erase operation on an erase sector, data can neither be read from nor written to other erase sectors until the erase operation completes. A naive file system implementation ignoring this prohibition would not be feasible in a multitasking environment, because it would unpredictably block operations whenever a program wanted data from a flash memory chip that was performing an erase operation triggered by another program. This problem can be avoided by temporarily caching in a buffer all valid data in the flash memory chip to be erased. This caching operation, however, could consume a significant amount of processor resources.

Recent flash memory products provide an erase-suspend capability that enables sectors that are not being erased to be read during an erase operation. Some new products also support write operations during the erase suspension. Our driver assumes the underlying flash memory system to be capable of erase-suspended read operations.

Flash memory generally takes more time for a write operation than for a read operation. It provides a write bandwidth of about 100 Kbytes/s per flash memory chip, whereas a conventional SCSI HDD provides a peak write bandwidth 10 to 100 times higher. Some recent flash memory products incorporate page buffers for write operations. These buffers enable a processor to send block data to a flash memory chip faster: after sending the data, the processor issues a "Page Buffer Write" command to the chip, and the chip performs the write operations while the processor does other jobs.

Our driver assumes that the underlying flash memory system consists of several banks of memory that support concurrent write operations on each chip. This assumption reduces the need for an on-chip page buffer because the concurrent write operations can provide a higher transfer rate.
2.3 On Flash Memory Data Structure

Figure 1 depicts our driver's data structure built on an underlying flash memory system. The flash memory system is logically handled as a collection of banks. A bank corresponds to a set of flash memory chips, and each set can perform erase or write operations independently. The banks are in turn divided into segments, each of which corresponds to an erase sector of the flash memory system. Each segment consists of a segment summary and an array of data blocks. The segment summary contains segment information and an array of block information entries; the segment information includes the number of blocks the segment contains and the number of times the segment has been erased.

Figure 1. On-chip data structure (banks divided into segments; each segment summary records a magic number, the number of blocks, the number of erase operations, and per-block information entries holding a physical block number and flags).
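The following C sketch restates this layout. The field names and widths, and the 512-byte data-block size, are illustrative assumptions rather than the driver's actual declarations.

    /* Sketch of the on-flash data structures of Figure 1 (assumed layout). */
    #include <stdint.h>

    #define BLOCKS_PER_SEGMENT 504          /* data blocks per erase sector */
    #define BLOCK_SIZE         512          /* assumed data-block size */

    struct block_info {                     /* one entry per data block */
        uint32_t phys_block_no;             /* block number used by the file system */
        uint8_t  flags[4];                  /* allocated, pre-valid, invalid, valid */
    };

    struct segment_summary {                /* stored at the head of each segment */
        uint32_t magic;                     /* marks an initialized segment */
        uint32_t nblocks;                   /* number of data blocks in the segment */
        uint32_t nerase;                    /* number of erase operations so far */
        struct block_info binfo[BLOCKS_PER_SEGMENT];
    };

    struct segment {                        /* one erase sector */
        struct segment_summary summary;
        uint8_t data[BLOCKS_PER_SEGMENT][BLOCK_SIZE];
    };

The flags are kept as separate fields because, as Section 2.4 describes, they are written one at a time without erasing the segment.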
2.4 Flag Update

Each block information entry contains flags and the physical block number to which the data block corresponds. The physical block number is provided to the driver by the file-system module when it issues a write request. The flags are written sequentially so that the driver can record changes of block status without erasing the segment.

The driver uses four flags to minimize the possibility of inconsistency and to make recovery easier. When a logical block is overwritten, the driver invalidates the old block, allocates a new block, and writes the new data to the newly allocated block. The driver updates the flags on the flash memory in the following order:

Step 1. Mark the newly allocated block as allocated.
Step 2. Write the block number and then write the new data to the allocated block.
Step 3. Mark the allocated block as pre-valid.
Step 4. Mark the invalidated block as invalid.
Step 5. Mark the allocated block as valid.

These steps guarantee that the flag values of the newly allocated and invalidated blocks never become the same under any circumstances. Therefore, even after a crash (e.g., a power failure) during any one of the above steps, the driver can choose whichever of the two blocks holds the fully written data. This method is of course not sufficient to maintain complete file-system consistency, but it helps suppress unnecessary ambiguity at the device level.
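A small user-space sketch of this ordering is shown below. The helper functions and the flag encoding are stand-ins chosen for illustration; only the order of the on-flash updates reflects the five steps above.

    /* Sketch of the flag-update order of Section 2.4 (illustrative helpers). */
    #include <stdio.h>

    enum blk_flag { F_ALLOCATED, F_PREVALID, F_INVALID, F_VALID };

    static void set_flag(int blk, enum blk_flag f) { printf("block %d: flag %d\n", blk, f); }
    static void write_no(int blk, int phys_no)     { printf("block %d: phys no %d\n", blk, phys_no); }
    static void write_data(int blk, const void *d) { (void)d; printf("block %d: data\n", blk); }

    /* Overwrite logical block phys_no: oldblk holds the current copy,
     * newblk has just been allocated from the active bank. */
    static void overwrite_block(int phys_no, int oldblk, int newblk, const void *data)
    {
        set_flag(newblk, F_ALLOCATED);      /* step 1: mark the new block allocated   */
        write_no(newblk, phys_no);          /* step 2: record the block number ...    */
        write_data(newblk, data);           /*         ... and then the new data      */
        set_flag(newblk, F_PREVALID);       /* step 3: new block holds complete data  */
        set_flag(oldblk, F_INVALID);        /* step 4: retire the old copy            */
        set_flag(newblk, F_VALID);          /* step 5: commit the new copy            */
    }

    int main(void)
    {
        char buf[512] = { 0 };
        overwrite_block(7, 100, 101, buf);
        return 0;
    }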
2.5 Bank Management

To manage block allocation and cleaning, the driver maintains a bank list and a cleaning bank list. Figure 2 shows the relationship between these lists. The driver allocates a new data block from the active bank, so data write operations take place only on the active bank. When the free blocks in the active bank are exhausted, the driver selects from the bank list the bank that has the most free segments (i.e., free blocks) and makes it the new active bank.

Figure 2. Bank list and cleaning bank list.

When a segment is selected to be cleaned, the bank containing that segment is moved to the cleaning bank list. The bank stays in that list until the erase operation on the segment finishes. Because the bank is no longer on the bank list, it never becomes the active bank, and thus avoids being written during the erase operation.

The driver maintains a flag update queue to handle the flag update procedure, described in the previous section, for blocks in a bank whose segment is being erased. The driver avoids issuing data writes on a bank being cleaned by separating the bank list and the cleaning bank list. However, when a block is logically overwritten, the invalidated block might belong to that bank. In such a case, the driver postpones steps 4 and 5 of the flag update procedure by entering the pair of the newly allocated and the invalidated blocks into the queue. All the pairs are processed when the erasure finishes. Note that even if the pairs are not processed because of a crash during the erasure, the driver can recover flag consistency because of the flag update order (step 3 for each pair has been completed before the crash occurs).

The queue should be able to hold the number of pairs that are expected to be entered during an erasure. For example, the current implementation can generate 500 pairs per erasure [i.e., 500 blocks (250 Kbytes) per second], and thus has 600 entries in the queue. Should the queue be exhausted, the driver will stop writing until the erasure is complete. We have not yet experienced this condition.
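The deferral can be sketched as follows. The queue layout and helper names are assumptions (the 600-entry capacity is the figure quoted above), and set_flag() stands in for the on-flash flag write of Section 2.4.

    /* Sketch of the flag update queue of Section 2.5 (assumed structures). */
    enum blk_flag { F_ALLOCATED, F_PREVALID, F_INVALID, F_VALID };
    static void set_flag(int blk, enum blk_flag f) { (void)blk; (void)f; /* on-flash flag write */ }

    #define FLAG_QUEUE_LEN 600              /* sized for ~500 pairs per erasure */

    struct flag_pair {
        int newblk;                         /* block that received the new data      */
        int oldblk;                         /* block whose invalidation is postponed */
    };

    static struct flag_pair queue[FLAG_QUEUE_LEN];
    static int queue_len;

    /* Called instead of steps 4 and 5 when oldblk lies in the bank being cleaned. */
    static int defer_flag_update(int newblk, int oldblk)
    {
        if (queue_len == FLAG_QUEUE_LEN)
            return -1;                      /* queue exhausted: writer waits for the erasure */
        queue[queue_len].newblk = newblk;
        queue[queue_len].oldblk = oldblk;
        queue_len++;
        return 0;
    }

    /* Called when the erase operation on the segment finishes. */
    static void flush_flag_updates(void)
    {
        for (int i = 0; i < queue_len; i++) {
            set_flag(queue[i].oldblk, F_INVALID);   /* postponed step 4 */
            set_flag(queue[i].newblk, F_VALID);     /* postponed step 5 */
        }
        queue_len = 0;
    }

    int main(void)
    {
        defer_flag_update(101, 100);
        flush_flag_updates();
        return 0;
    }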
2.6 Translation Table

The translation table data structure contains all the information needed to manage the translation of a block number to an address and to manage the erase log of each segment. At system boot time, the driver scans the flash memory system and constructs this translation table and other structures from the on-chip segment summaries.

Figure 3 shows the relationship between the translation table and the block information entries. During the boot-time scan, the driver examines all the segment summaries one by one. If it finds a valid block, it records a triplet (bank no., segment no., block no.) describing the block in a table entry indexed by the physical block number.

Figure 3. Relationship between the block-address translation table and a block information entry.

After the boot, the driver refers only to the translation table when a read operation is requested. The address of each block can be computed from the triplet: the driver translates the requested physical block number to the address of the corresponding flash memory data block and simply copies the contents of the data block to the upper layer.

When a write operation is requested, the driver checks whether it has already allocated a flash memory data block for the requested physical block. If it has, the allocated block is invalidated. The driver then allocates a new flash memory data block, updates the translation table, and copies the requested data to the newly allocated block, while updating the flags.
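The translation can be sketched as follows. The table layout, the simulated flash array, and the helper names are assumptions for illustration; block allocation and the flag updates of Section 2.4 are omitted.

    /* Sketch of the block-address translation of Section 2.6 (assumed layout). */
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE   512
    #define NBANKS       8
    #define NSEGS        8                  /* segments per bank */
    #define NBLKS        504                /* data blocks per segment */
    #define NPHYS_BLOCKS 32256

    struct triplet {
        uint8_t  bank, segment;
        uint16_t block;
        uint8_t  mapped;                    /* set once a flash block is assigned */
    };

    static struct triplet xlate[NPHYS_BLOCKS];              /* indexed by physical block no. */
    static uint8_t flash[NBANKS][NSEGS][NBLKS][BLOCK_SIZE]; /* simulated flash array */

    /* Read path: translate the triplet to an address and copy the block out. */
    static int read_block(int phys_no, void *buf)
    {
        struct triplet *t = &xlate[phys_no];

        if (!t->mapped)
            return -1;                      /* no data has been written for this block */
        memcpy(buf, flash[t->bank][t->segment][t->block], BLOCK_SIZE);
        return 0;
    }

    /* Write path: record the newly allocated location, then copy the data in.
     * (Invalidating the old copy and updating the flags are not shown.) */
    static int write_block(int phys_no, const void *buf,
                           uint8_t bank, uint8_t segment, uint16_t block)
    {
        xlate[phys_no] = (struct triplet){ bank, segment, block, 1 };
        memcpy(flash[bank][segment][block], buf, BLOCK_SIZE);
        return 0;
    }

    int main(void)
    {
        char in[BLOCK_SIZE] = "example", out[BLOCK_SIZE];
        write_block(7, in, 0, 3, 42);
        return read_block(7, out);
    }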
2.7 Cleaner

The segment cleaning operation takes place during the allocation process, when the number of flash memory blocks available for writing becomes low. This operation selects a segment to be cleaned, copies all valid data in the segment to another segment, and issues a block erase command for the selected segment. The cleaning process is the same as that of LFS except that it explicitly invokes the erase operation on the segment.

The cleaner is divided into three parts: policy, copying and erasing, and internal data maintenance. All jobs are executed in kernel space, though copying and erasing are driven by a daemon running in user space.

As discussed in [4], implementing a cleaner as a user process makes the system flexible when changing or adding a cleaning policy, an algorithm, or both. By communicating with the kernel through system calls and the ifile, the BSD LFS cleaner does almost all of its work in user space. Our driver, in contrast, does the cleaning work in kernel space, as Sprite LFS does; we use a daemon only to make the copying concurrent with other processes. We took this approach for its ease of implementation.

While data is being written, the cleaning policy code is executed whenever a block is invalidated. If the cleaning policy conditions are satisfied for a segment, the driver adds it to the cleaning list and wakes up the cleaner daemon to start copying valid blocks. Upon awakening, the cleaner daemon invokes the copy command repeatedly until all valid blocks are copied to clean segments. Then it invokes the erase command, and the driver starts erasing the segment by issuing an erase command to the flash memory. The copying itself is performed by code within the driver; the cleaner daemon only controls when the copying starts, which makes it concurrent with other processes.

We added three ioctl commands for the cleaner daemon (Table 2). The daemon first invokes FLIOCCWAIT and then (usually) waits until a segment to be cleaned is selected. As an application program writes or updates data in the file system, the device driver eventually encounters a segment that needs to be cleaned. The driver then wakes up the cleaner daemon and continues its execution. Eventually the daemon starts running and invokes FLIOCCCBLK repeatedly until all the valid blocks are copied to a new segment. On finishing the copy operation, the daemon invokes FLIOCCIERS, which causes the driver to issue an erase command to the flash memory. The daemon then invokes FLIOCCWAIT again and waits until another segment needs to be cleaned.

    Command      Description
    FLIOCCWAIT   Wait until a segment is selected to be cleaned.
    FLIOCCCBLK   Copy 16 valid blocks of the selected segment. Return 0 if no more
                 valid blocks exist in the segment.
    FLIOCCIERS   Start erasing the segment. Return 0 when the erasure is complete.

Table 2. ioctl commands for the cleaner.
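In user space the daemon therefore reduces to a small loop over these three commands. The sketch below assumes a device node name, ioctl request encodings, and a return-value convention that are not given in the paper; only the command names and their roles come from Table 2.

    /* Sketch of the cleaner daemon of Section 2.7 (assumed device node and encodings). */
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #define FLIOCCWAIT  _IO('f', 1)         /* illustrative encodings; the real values */
    #define FLIOCCCBLK  _IO('f', 2)         /* are defined by the driver               */
    #define FLIOCCIERS  _IO('f', 3)

    int main(void)
    {
        int fd = open("/dev/flash0", O_RDWR);   /* hypothetical device node */

        if (fd < 0)
            return 1;
        for (;;) {
            if (ioctl(fd, FLIOCCWAIT) < 0)      /* sleep until a segment is selected   */
                break;
            while (ioctl(fd, FLIOCCCBLK) > 0)   /* copy 16 valid blocks at a time;     */
                ;                               /* assumed to return 0 when none remain */
            ioctl(fd, FLIOCCIERS);              /* erase the segment; returns when done */
        }
        close(fd);
        return 0;
    }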
2.8 Cleaning Policy

For our driver, the cleaning policy concerns:
• when the cleaner executes, and
• which segments are to be cleaned.

The flash memory hardware permits multiple segments to be erased simultaneously as long as each segment belongs to a different bank. This simultaneous erasure provides a higher block-reclaim rate. For simplicity, however, the current implementation cleans one segment at a time. The cleaner never tries to enhance logical block locality during its copying activity; it simply collects the live data in the segment being cleaned and copies it to a free segment.

To select a segment to clean, the driver is equipped with two policies: the "greedy" and the "cost-benefit" [3] policies. The driver provides ioctl commands to choose the policy. The greedy policy selects the segment containing the least amount of valid data, and the cost-benefit policy chooses the most valuable segment according to the formula:

    benefit / cost = age × (1 − u) / 2u,

where u is the utilization of the segment and age is the time since the most recent modification (i.e., the last block invalidation). The terms 2u and 1 − u respectively represent the cost of copying (u to read the valid blocks in the segment and u to write them back) and the free space reclaimed. Note that LFS uses 1 + u for the copying cost because it reads the whole segment in order to obtain the valid blocks for cleaning.

The cleaning threshold defines when the cleaner starts running. From the point of view of the load put upon the cleaner, cleaning should be delayed as long as possible: the delay allows more blocks to be invalidated and consequently reduces the number of valid blocks that must be copied during the cleaning activity. Delaying the cleaning too much, however, reduces the chance of the cleaning being done in parallel with other processes, and this reduction may markedly slow the system.

Our driver therefore uses a gradually falling threshold value: as the number of free segments becomes smaller, the threshold becomes lower, so that more segments can be chosen with lower policy accounts.

Figure 4. Cleaning threshold (policy accounts versus the number of free segments).

For the greedy policy of the current implementation, N is 12 and Th is 455 invalid blocks. That is, when the number of free segments becomes 12, segments that contain more than 455 invalid blocks are cleaned. For the cost-benefit policy, N is 12 and Th is set to the value equivalent to being unmodified for 30 days with one invalid block. For both policies, segments having no valid blocks are always cleaned before other segments.

The threshold curve enables the driver to stop the cleaner as long as enough free segments are available and also to start the cleaner at a slow pace. For example, suppose the driver employed a policy such as "when the number of free segments becomes Nb, start the cleaner" (represented by threshold curve B in Figure 4). When the number of free segments became Nb, the cleaner would start cleaning even if the most invalidated segment had only one invalid block. Furthermore, if the live data in the file system occupied more than Ns − Nb segments (where Ns is the total number of segments), the cleaner would run every time a block was invalidated. This would result in a file system that was impractically slow and had an impractically short lifetime.
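The selection step can be sketched as follows. The per-segment bookkeeping, the parameter names, and the way the threshold is applied are illustrative assumptions; the score itself is the benefit/cost expression given above.

    /* Sketch of cost-benefit segment selection of Section 2.8 (assumed bookkeeping). */
    #include <time.h>

    #define NSEGMENTS          64
    #define BLOCKS_PER_SEGMENT 504

    struct seg_stat {
        int    valid_blocks;                /* live data still in the segment */
        time_t last_invalidation;           /* reference point for "age"      */
    };

    static struct seg_stat seg[NSEGMENTS];

    /* benefit/cost = age * (1 - u) / 2u, where u is the segment utilization. */
    static double cost_benefit(const struct seg_stat *s, time_t now)
    {
        double u   = (double)s->valid_blocks / BLOCKS_PER_SEGMENT;
        double age = (double)(now - s->last_invalidation);

        if (u == 0.0)
            return 1e30;                    /* segments with no valid blocks go first */
        return age * (1.0 - u) / (2.0 * u);
    }

    /* Return the most valuable segment, or -1 if none beats the current
     * threshold.  The driver lowers `threshold` as free segments run out,
     * which is how the gradually falling threshold curve is realized. */
    static int select_segment(double threshold)
    {
        time_t now = time(NULL);
        double best_score = threshold;
        int    best = -1;

        for (int i = 0; i < NSEGMENTS; i++) {
            double score = cost_benefit(&seg[i], now);
            if (score > best_score) {
                best_score = score;
                best = i;
            }
        }
        return best;
    }

    int main(void)
    {
        for (int i = 0; i < NSEGMENTS; i++)
            seg[i].valid_blocks = BLOCKS_PER_SEGMENT;   /* full of live data   */
        seg[5].valid_blocks = 50;                       /* mostly invalidated  */
        seg[5].last_invalidation = time(NULL) - 3600;   /* untouched for 1 hour */
        return select_segment(1.0) == 5 ? 0 : 1;
    }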
3. Performance Measurements and Discussion

Unlike an HDD-based file system, the prototype is free from seek latency, and it is thus expected to show nearly the same performance for both sequential and random read operations. In fact, for reading 4-Kbyte blocks from 12.6 Mbytes of data, the sequential and random throughputs of the driver are respectively 644 and 707 Kbytes/s. (For the same tasks, the throughputs of MFS [7] are 647 and 702 Kbytes/s.) The write performance of the prototype, on the other hand, is affected by the cleaning, as is the case with LFS.

The benchmark platform consists of our hand-made computer running 4.4BSD UNIX with a 40-MHz R3081 processor and 64 Mbytes of main memory. The size of the buffer cache is 6 Mbytes. Table 3 summarizes the platform specifications†.

† The actual hardware had 128 segments (16 segments/bank), but in the work reported here we used only half the segments of each bank.

    Flash Memory System
      Flash memory       Intel 28F008SA (1 Mbyte/chip)
      Banks              8
      Segments           64 (8 segments/bank)
      Segment size       256 Kbytes
      Data blocks        32256 (504 blocks/segment)
      Erase cycle        1 sec.
      Erase bandwidth    252 Kbytes/sec.
      Write bandwidth    400 Kbytes/sec.
      Read bandwidth     4 Mbytes/sec.

    CPU and Main Memory System
      CPU                IDT R3081 (R3000 compatible)
      Cache              Instruction 16 Kbytes, Data 4 Kbytes
      Memory bandwidth   Instruction read 20 Mbytes/sec.,
                         Data read 7 Mbytes/sec., Data write 5 Mbytes/sec.

Table 3. Test-platform specifications.

3.1 Sequential Write Performance

The goal of our sequential write performance test was to measure the maximum throughput that can be expected under certain conditions. When a large amount of data is written sequentially, our driver invalidates the blocks in each segment sequentially. The driver therefore needs no copying to clean a segment, and the maximum write performance can be obtained.

Figure 5 shows the sequential write throughput as a function of cumulative data written, and Table 4 summarizes the results. The results were obtained by first writing a certain amount of data and then repeatedly overwriting that initial data. The curves show results based on different amounts of initial data: 4.2 Mbytes (30% of the file system capacity), 8.4 Mbytes (60%), and 12.6 Mbytes (90%). The greedy policy was used for cleaning.

Figure 5. Sequential write performance (write throughput in Kbytes/s versus cumulative Mbytes written, for 30%, 60%, and 90% initial data).

    Initial   Average throughput   Number of erased   Number of copied   Total data
    data      (Kbytes/s)           segments           blocks             written (Mbytes)
    30%       231                  749                0                  192
    60%       230                  766                0                  192
    90%       199                  889                57 266             192

Table 4. Summary of sequential write performance.

The results obtained with the 90% initial data load were unexpected. Although the data were overwritten sequentially, many blocks were copied for cleaning. This copying was an effect of the cleaning threshold described earlier. Since the live data occupies more than 53 segments for the 90% data, the cleaning threshold was kept near 410 invalid blocks per segment throughout the test. Consequently, the cleaner copied an average of 64 blocks per erasure and lowered the write throughput.

3.2 Random Write Performance

The random write performance test evaluated the worst case for our driver. When a randomly selected portion of a large amount of data is overwritten, all the segments are invalidated equally. If the invalidation takes place unevenly (e.g., sequentially), some segments are heavily invalidated and thus can be cleaned with a small amount of copying. The even invalidation caused by random updates, however, leaves less chance to clean segments that are particularly highly invalidated. Therefore, the cleaning cost approaches a constant value for all segments.

For our driver, the copying cost is expected to be a function of the ratio of used space to free space in the file system. As new data are written to the free segments, the used segments are invalidated evenly. The free segments are eventually exhausted and the cleaner starts cleaning. Consequently, the ratio of valid to invalid blocks in each segment approaches the ratio of used to free space in the file system.

Figure 6 and Table 5 show the results of the random write test. These results were obtained by writing a 4-Kbyte data block to a randomly selected position within various amounts of initial data: again 4.2, 8.4, and 12.6 Mbytes. And again the greedy policy was used for cleaning.

Figure 6. Random write performance (write throughput in Kbytes/s versus cumulative Mbytes written, for 30%, 60%, and 90% initial data).

    Initial   Average throughput   Number of erased   Number of copied   Total data
    data      (Kbytes/s)           segments           blocks             written (Mbytes)
    30%       222                  801                26 383             192
    60%       147                  1066               155 856            192
    90%       40                   2634               938 294            192

Table 5. Summary of random write performance.
3.3 Hot-and-Cold Write Performance

This test evaluated performance in cases where the write accesses exhibit a certain amount of locality of reference. In such cases, we can expect the cost-benefit policy to force the segments into a bimodal distribution in which most of the segments are nearly full of valid blocks and a few are nearly full of invalid blocks [3]. Consequently, the cost-benefit policy should result in a low copying overhead. We will see that our first results did not match this expectation; we then analyze why the anomaly occurred and what measures we took to address it.

Figure 7 and Table 6 show the results of the test. "640-116" means that 60% of the write accesses go to one-eighth of the initial data and the other 40% go to a second eighth, leaving the remaining three-fourths of the data unmodified; this distribution was chosen with reference to the measurements reported in [5], which found that 67-78% of writes are …

Figure 7. Hot-and-cold write performance (write throughput versus cumulative Mbytes written for the 640-116 test under the greedy and cost-benefit policies).

    Test      Cleaning policy   Average throughput   Number of erased   Number of copied
                                (Kbytes/s)           segments           blocks
    640-116   Greedy            51                   2617               925 193
    640-116   Cost-Benefit      43                   3085               1 161 090

    Initial data: 90% (12.6 Mbytes). Total data written: 192 Mbytes.

Table 6. Summary of hot-and-cold write performance.

3.4 Separate Segment for Cleaning

The initial results obtained in the hot-and-cold write test were far worse than we had expected. The write throughput of the 640-116 test was nearly the same as that of the random test using the greedy policy. Furthermore, the greedy policy worked better than the cost-benefit policy for the 640-116 test.

Figure 8 shows the distribution of segment utilization after the 640-116 test. In the figure, we can observe only a weak bimodal distribution of segment utilization. Since 60% of the data was left unmodified, more fully valid segments should be present.

Figure 8. Distribution of segment utilization after the 640-116 test (number of segments versus utilization).

We traced the blocks that the cost-benefit policy once judged as cold during the test, and Figure 9 shows the distribution of the cold and the not-cold blocks in the segments after executing the 640-116 test. The data in this figure were obtained by marking a block as "cold" when the segment to which the block belongs was chosen to be cleaned and its utilization was less than the average utilization in the file system. We can see that some segments contain both cold and not-cold blocks. Furthermore, the number of cold blocks is much smaller than expected: since three-fourths of the 12.6 Mbytes of initial data were left unmodified, we would expect, in the best case, about 19 000 cold blocks (i.e., about 38 cold segments). In the test, however, the actual number of cold blocks was 2579.

Figure 9. Distribution of cold and not-cold blocks after the 640-116 test.
The reason we determined for these results is that the driver uses one segment for both the data writes and the cleaning operations; the valuable, potentially cold blocks are mixed with data being written to the segment. The number of cold blocks therefore does not increase over time.

To address this problem, we modified the driver so that it uses two segments: one for cleaning cold segments, and one for writing new data and cleaning the not-cold segments. Table 7 summarizes the results of 640-116 tests on both the modified and the original drivers. The effect of the separate cleaning segment becomes notable as the initial utilization grows, and the write throughput was improved by more than 40% for the 90% initial data. Figure 10 shows the distribution of cold blocks after the 640-116 test using the modified driver; many cold segments are observed.

Figure 10. Distribution of cold and not-cold blocks after the 640-116 test using the separate cleaning segment.

    Initial   Separate   Average throughput   Number of erased   Number of copied
    data      segment    (Kbytes/s)           segments           blocks
    30%       No         241                  742                0
              Yes        239                  744                0
    60%       No         197                  888                63 627
              Yes        198                  832                33 997
    70%       No         135                  1195               214 894
              Yes        143                  883                57 205
    80%       No         81                   1855               544 027
              Yes        127                  1089               157 922
    90%       No         43                   3085               1 161 090
              Yes        60                   2218               723 582

    Total data written: 192 Mbytes.

Table 7. Summary of 640-116 tests using the separate cleaning segment.

3.5 Andrew Benchmark

Table 8 lists the results of the Andrew benchmark [6] for MFS and for our prototype. The results were obtained by repeating the benchmark run 60 times. The output data files and directories of each run were stored in a separate directory, and to limit the file system usage the oldest directory was removed before each run: after 14 contiguous runs for the 52-56% test, after 24 for the 92-96% test, and after 9 for the MFS test. Note that, as pointed out in [4], phases 3 and 4 performed no I/O because all the data accesses were satisfied by the higher-level buffer and inode caches.

                                           MFS      Prototype  Prototype
                                                    52-56%     92-96%
    Average elapsed time    Phase 1        1.3      2.0        2.9
    for each run            Phase 2        8.0      9.5        11.5
    (seconds)               Phase 3        13.1     13.5       13.5
                            Phase 4        16.9     16.9       17.1
                            Phase 5        80.6     81.8       84.0
                            Total          119.9    123.7      129.0
    Number of written blocks for data      -        251 818    233 702
    Number of copied blocks                -        75 227     255 020
    Number of erased segments              -        578        903

Table 8. Andrew Benchmark results.

The benchmark consists of many read operations and leaves a total of only about 560 Kbytes of data for each run. As a result, there are many chances to clean segments without disturbing data write operations. Therefore, our prototype shows performance nearly equivalent to that of MFS. We expect that similar access patterns appear often in a personal computing environment. Note that the cleaner erased 903 segments for the 92-96% test; under the same load (15 segments in 2 minutes), our prototype would survive about 850 000 minutes (590 days).
4. Related Work

Logging has been widely used in certain kinds of devices, in particular in Write-Once Read-Many (WORM) optical disk drives. WORM media are especially suitable for logging because of their append-only writing. OFC [8] is a WORM-based file system that supports a standard UNIX file system transparently. Although its data structures differ from those of our prototype, our prototype's block-address translation scheme is very similar to that of OFC. OFC is self-contained in that it stores all data structures on a WORM medium and needs no read-write medium such as an HDD. To get around the large memory requirement, it manages its own cache to provide efficient access to the structures. Our prototype, however, needs improvement with regard to its memory requirement (about 260 Kbytes for a 16-Mbyte flash memory system).

LFS [3] uses the logging concept for HDDs. Our prototype is similar to LFS in many aspects, such as the segment, the segment summary, and the segment cleaner, but LFS does not use block-address translation. LFS incorporates the FFS index structure into the log so that data retrieval can be made in the same fashion as in the FFS; that is, each inode contains pointers that directly point to data blocks. Our prototype, on the other hand, keeps a log of physical block modifications. LFS gathers as many data blocks as possible to be written at once in order to maximize write throughput by minimizing seek operations. Since flash memory is free from seek penalties, maximizing the write size does not necessarily improve performance.

The paper on BSD-LFS [4] reports that the effect of the cleaner is significant when data blocks are updated randomly. Under these conditions, each segment tends to contain fewer invalid data blocks and the cleaner's copying overhead accounts for more than 60% of the total writes. With our prototype, this overhead accounts for about 70% on the 90%-utilized file system.

Since flash memory offers a limited number of write/erase cycles on each memory cell, our driver requires the block translation mechanism. Logical Disk (LD) [9] uses the same technique to make a disk-based file system log-structured transparently. Although the underlying storage media and the goals are different, the driver and LD function similarly. LD does, though, provide one unique abstract interface called block lists. Block lists enable a file system module to specify logically related blocks, such as an inode and its indirect blocks. Such an interface might be useful for our driver by enabling it to cluster hot and cold data.

Douglis et al. [10] have examined three devices from the viewpoint of mobile computing: an HDD, a flash disk, and a flash memory. Their simulation results show that the flash memory can use 90% less energy than a disk-based file system and respond up to two orders of magnitude faster for reads, but up to an order of magnitude slower for writes. They also found that, at 90% utilization or above, a flash memory erasure unit that is much larger than the file system block size results in unnecessary copying for cleaning and degrades performance.

The flash-memory-based storage system eNVy [11] tries to provide high performance in a transaction-type application area. It consists of a large amount of flash memory, a small amount of battery-backed SRAM for write buffering, a large-bandwidth parallel data path between them, and a controller for page mapping and cleaning. In addition to the hardware support, it uses a combination of two cleaning policies, FIFO and locality gathering, in order to minimize the cleaning costs for both uniform and hot-and-cold access distributions. Simulation results show that at a utilization of 80% it can handle 30 000 transactions per second while spending 30% of its processing time on cleaning.

Microsoft Flash File System (MFFS) [2] provides MS-DOS-compatible file system functionality with a flash memory card. It uses data regions of variable size rather than data blocks of fixed length. Files in MFFS are chained together by address pointers located within the directory and file entries. Douglis et al. [10] observed that MFFS write throughput decreased significantly with more cumulative data and with more storage consumed.

SunDisk manufactures a flash disk card that has a small erasure unit of 576 bytes [12]. Each unit takes less time to erase than does Intel's 16-Mbit flash memory, and the small unit size enables the card to replace an HDD directly: the driver of the card erases data blocks before writing new data into them. Although this erase operation reduces the effective write performance, the flash disk card shows stable performance under high utilization because there is no need to copy live data [10]. In a UNIX environment with FFS, simply replacing the HDD with the flash disk would result in an unexpectedly short lifetime, because FFS metadata such as inodes are located at fixed blocks and are updated more often than user data blocks. The flash disk card might perform well in a UNIX environment if a proper wear-leveling mechanism were provided.
5. Conclusion

Our prototype shows that it is possible to implement a flash-memory-based file system for UNIX. The benchmark results show that the proposed system avoids many of the problems expected to result from flash memory's inability to overwrite data in place.

The device driver approach made it easy to implement the prototype system by using the existing FFS module. But because the FFS is designed for HDD storage, the prototype must use a portion of the underlying flash memory to hold data structures tuned for an HDD. Furthermore, the separation of the device driver from the file system module makes the prototype system's management difficult and inefficient. For example, there is no way for the driver to know whether or not a block is actually invalid until the FFS module requests a write on the block, even if the file for which the block was allocated had been removed 15 minutes before. A file system module dedicated to flash memory is therefore needed.

References

[8] T. Laskodi, B. Eifrig, and J. Gait, "A UNIX File System for a Write-Once Optical Disk", Proc. '88 Summer USENIX, 1988.

[9] W. de Jonge, M. F. Kaashoek, and W. C. Hsieh, "Logical Disk: A Simple New Approach to Improving File System Performance", Technical Report MIT/LCS/TR-566, Massachusetts Institute of Technology, 1993.

[10] F. Douglis, R. Cáceres, F. Kaashoek, K. Li, B. Marsh, and J. A. Tauber, "Storage Alternatives for Mobile Computers", Proc. 1st Symposium on Operating Systems Design and Implementation, 1994.

[11] M. Wu and W. Zwaenepoel, "eNVy: A Non-Volatile, Main Memory Storage System", Proc. 6th International Conference on Architectural Support for Programming Languages and Operating Systems, 1994.

[12] "Operating system now has flash EEPROM management software for external storage devices" (in Japanese), Nikkei Electronics, No. 605, 1994.