10 File Systems
Computer Systems
Extended Partitions
• In some cases, you may want >4 partitions
• Modern OSes support extended partitions
[Figure: Disk 1 with an MBR, a primary partition (NTFS), and an extended partition containing logical partitions]
9
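For context, a minimal C sketch of a classic MBR partition-table entry; the struct and field names are illustrative (the real MBR is just four 16-byte raw entries), and an entry of type 0x05/0x0F marks the extended partition.

#include <stdint.h>

/* Classic MBR layout: 446 bytes of boot code, four 16-byte partition
 * entries, then the 0x55AA signature.  Field names are illustrative. */
struct mbr_partition_entry {
    uint8_t  status;        /* 0x80 = bootable, 0x00 = inactive        */
    uint8_t  chs_first[3];  /* legacy CHS address of the first sector  */
    uint8_t  type;          /* e.g. 0x07 = NTFS, 0x05/0x0F = extended  */
    uint8_t  chs_last[3];   /* legacy CHS address of the last sector   */
    uint32_t lba_first;     /* LBA of the first sector (little-endian) */
    uint32_t num_sectors;   /* partition length in sectors             */
} __attribute__((packed));

/* An extended partition is itself subdivided into logical partitions,
 * each described by its own boot record inside the extended partition. */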
VFS Flowchart
• Processes (usually) don’t need to know about low-level file system details
• Relatively simple to add additional file system drivers
10
Mount isn’t Just for Bootup
• When you plug storage devices into your running system, mount is executed in the background
• Example: plugging in a USB stick
• What does it mean to “safely eject” a device?
– Flush cached writes to that device
– Cleanly unmount the file system on that device
11
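As a rough illustration, a minimal C sketch of what happens under the hood, using the Linux mount(2) and umount(2) system calls; the device path, mount point, and file system type are made-up values for the example.

#include <stdio.h>
#include <unistd.h>      /* sync()            */
#include <sys/mount.h>   /* mount(), umount() */

int main(void) {
    /* Hypothetical device and mount point, for illustration only. */
    const char *dev = "/dev/sdb1";
    const char *dir = "/mnt/usb";

    /* Mount the FAT file system on the USB stick. */
    if (mount(dev, dir, "vfat", 0, NULL) != 0) {
        perror("mount");
        return 1;
    }

    /* ... read and write files under /mnt/usb ... */

    /* "Safely eject": flush cached writes, then cleanly unmount. */
    sync();                        /* push dirty buffers to the device */
    if (umount(dir) != 0) {
        perror("umount");
        return 1;
    }
    return 0;
}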
• Partitions and Mounting
• Basics (FAT)
• inodes and Blocks (ext)
• Block Groups (ext2)
• Journaling (ext3)
• Extents and B-Trees (ext4)
• Log-based File Systems
12
Status Check
• At this point, the OS can locate and mount
partitions
• Next step: what is the on-disk layout of the file
system?
– We expect certain features from a file system
• Named files
• Nested hierarchy of directories
• Meta-data like creation time, file permissions, etc.
– How do we design on-disk structures that support
these features?
13
The Directory Tree
[Figure: an example directory tree on disk. / (root) contains bin, home, and tmp; bin contains python; home contains cbw and amislove; cbw contains cs5600]
18
Mapping Files to Blocks
• Every file is composed of >=1 blocks
• Key question: how do we map a file to its blocks?
• Option 1: a list of blocks
– Problem? Really large files
• Option 2: (start, length) pairs
– Problem? Fragmentation, e.g. try to add a new file with 3 blocks
[Figure: two 10-block disks illustrating the two mapping schemes]
19
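A small C sketch of the two representations compared above; these structures are illustrative, not any real on-disk format.

#include <stdint.h>

/* Option 1: an explicit list of block numbers.
 * Simple, but a really large file needs a really long list. */
struct block_list_file {
    uint32_t num_blocks;
    uint32_t blocks[];      /* one entry per block of the file */
};

/* Option 2: (start, length) pairs, i.e. extents.
 * Compact when the file is contiguous, but the file system has to
 * fight fragmentation to keep runs of free blocks available. */
struct extent {
    uint32_t start;         /* first block of a contiguous run */
    uint32_t length;        /* number of blocks in the run     */
};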
Directories
• Traditionally, file systems have used a hierarchical,
tree-structured namespace
– Directories are objects that contain other objects
• i.e. a directory may (or may not) have children
– Files are leaves in the tree
• By default, directories contain at least two entries
– “.” is a self pointer
– “..” points to the parent directory
[Figure: / (root) and bin each contain “.” and “..” entries; bin also contains python]
20
More on Directories
• Directories have associated meta-data
– Name, number of entries
– Created time, modified time, access time
– Permissions (read/write), owner, and group
• The file system must encode directories and store
them on the disk
– Typically, directories are stored as a special type of file
– File contains a list of entries inside the directory, plus
some meta-data for each entry
21
Example Directory File
[Figure: a 10-block disk holding the directory file for C:\, which contains an entry for Windows]
22
Directory File Implementation
• Each directory file stores many entries
• Key Question: how do you encode the entries?
Unordered List of Entries (entries appear in insertion order):

Name          Index  Dir?  Perms
.             2      Y     rwx
..            2      Y     rwx
Windows       3      Y     rwx
Users         4      Y     rwx
pagefile.sys  5      N     r

Sorted List of Entries (same entries, sorted by name):

Name          Index  Dir?  Perms
.             2      Y     rwx
..            2      Y     rwx
pagefile.sys  5      N     r
Users         4      Y     rwx
Windows       3      Y     rwx

• Other alternatives: hash tables, B-trees
– More on B-trees later…
• In practice, implementing directory files is complicated
– Example: do filenames have a fixed, maximum length or variable length?
[Figure: the directory file stored in blocks on disk, alongside the super block]
25
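A minimal C sketch of an unordered directory file and the O(n) linear scan it implies; the struct and entry values simply mirror the example table above (real file systems use a more compact, variable-length on-disk encoding).

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One directory entry, following the columns on the slide
 * (Name, Index, Dir?, Perms). */
struct dir_entry {
    const char *name;
    uint32_t    index;    /* inode / FAT index of the entry   */
    int         is_dir;   /* 1 = directory, 0 = regular file  */
    const char *perms;
};

/* Unordered list of entries: lookup is a linear scan, O(n). */
static int dir_lookup(const struct dir_entry *dir, int n, const char *name) {
    for (int i = 0; i < n; i++)
        if (strcmp(dir[i].name, name) == 0)
            return (int)dir[i].index;
    return -1;    /* not found */
}

int main(void) {
    struct dir_entry dir[] = {
        { ".",            2, 1, "rwx" },
        { "..",           2, 1, "rwx" },
        { "Windows",      3, 1, "rwx" },
        { "Users",        4, 1, "rwx" },
        { "pagefile.sys", 5, 0, "r"   },
    };
    printf("Users -> index %d\n", dir_lookup(dir, 5, "Users"));
    return 0;
}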
• Directories are special files
– File contains a list of entries inside the directory
• Possible values for FAT entries:
– 0 – entry is empty
– 1 – reserved by the OS
– 1 < N < 0xFFFF – next block in a chain
– 0xFFFF – end of a chain
[Figure: the FAT and data blocks 2-9 on disk, with chains for the C:\, Windows, and Users directory files]
27
Fragmentation
• Blocks for a file need not be contiguous
• Example: a file whose chain of blocks is 63 → 58 → 65 → 67 → 61 (end of chain)

Block #:    56  57  58  59  60  61      62  63  64  65  67  68
FAT entry:   0   0  65   0   0  0xFFFF   0  58   0  67  61   0
30
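A minimal C sketch (using a simplified, FAT16-style table) of how a chain like the one above is followed; the table and helper are illustrative, not a real FAT driver.

#include <stdint.h>
#include <stdio.h>

#define FAT_FREE     0x0000   /* entry is empty       */
#define FAT_RESERVED 0x0001   /* reserved by the OS   */
#define FAT_EOC      0xFFFF   /* end of a chain       */

/* Walk the chain of blocks belonging to one file, starting from the
 * first block recorded in its directory entry. */
static void print_chain(const uint16_t *fat, uint16_t first_block) {
    uint16_t b = first_block;
    while (b != FAT_EOC) {
        printf("block %u\n", (unsigned)b);
        b = fat[b];           /* each FAT entry names the next block */
    }
}

int main(void) {
    /* The chain from the fragmentation example: 63 -> 58 -> 65 -> 67 -> 61. */
    uint16_t fat[70] = { FAT_FREE };
    fat[63] = 58; fat[58] = 65; fat[65] = 67; fat[67] = 61; fat[61] = FAT_EOC;
    print_chain(fat, 63);
    return 0;
}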
• Partitions and Mounting
• Basics (FAT)
• inodes and Blocks (ext)
• Block Groups (ext2)
• Journaling (ext3)
• Extents and B-Trees (ext4)
• Log-based File Systems
31
Status Check
• At this point, we have on-disk structures for:
– Building a directory tree
– Storing variable length files
• But, the efficiency of FAT is very low
– Lots of seeking over file chains in FAT
– Only way to identify free space is to scan over the
entire FAT
• Linux file system uses more efficient structures
– Extended File System (ext) uses index nodes (inodes)
to track files and directories
32
Size Distribution of Files
• FAT uses a linked list for all files
– Simple and uniform mechanism
– … but, it is not optimized for short or long files
• Question: are short or long files more common?
– Studies over the last 30 years show that short files
are much more common
– 2KB is the most common file size
– Average file size is 200KB (biased upward by a few
very large files)
• Key idea: optimize the file system for many small files
33
• Super block, storing:
– Size and location of bitmaps
– Number and location of inodes
– Number and location of data blocks
– Index of the root inode
• Bitmap of free & used inodes
• Bitmap of free & used data blocks
• Table of inodes
– Each inode is a file/directory
– Includes meta-data and lists of associated data blocks
• Data blocks (4KB each)
34
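A minimal C sketch of the arithmetic a file system like ext uses to locate inode number N inside the on-disk inode table; the block size, inode size, and table location are illustrative constants (real values come from the super block).

#include <stdint.h>
#include <stdio.h>

/* Illustrative constants; real ext reads these from the super block. */
#define BLOCK_SIZE        4096u
#define INODE_SIZE         128u   /* common ext2 on-disk inode size */
#define INODE_TABLE_START   10u   /* first block of the inode table */

/* Which block holds inode `inum`, and at what offset inside it? */
static void locate_inode(uint32_t inum, uint32_t *block, uint32_t *offset) {
    uint32_t byte = inum * INODE_SIZE;   /* offset into the inode table */
    *block  = INODE_TABLE_START + byte / BLOCK_SIZE;
    *offset = byte % BLOCK_SIZE;
}

int main(void) {
    uint32_t block, offset;
    locate_inode(42, &block, &offset);
    printf("inode 42 -> block %u, offset %u\n", block, offset);
    return 0;
}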
• Directories are files
– Contains the list of entries inside the directory
• Each inode can directly point to 12 blocks
• Can also indirectly point to blocks at 1, 2, and 3 levels of depth
[Figure: the on-disk layout (SB, inode bitmap, data bitmap, inodes, data blocks) and the directory file for / (root inode = 0), with entries . → 0, bin → 1, home → 2, initrd.img → 3]
35
ext2 inodes
Size (bytes) Name What is this field for?
2 mode Read/write/execute?
2 uid User ID of the file owner
4 size Size of the file in bytes
4 time Last access time
4 ctime Creation time
4 mtime Last modification time
4 dtime Deletion time
2 gid Group ID of the file
2 links_count How many hard links point to this file?
4 blocks How many data blocks are allocated to this file?
4 flags File or directory? Plus, other simple flags
60 block 15 direct and indirect pointers to data blocks
36
inode Block Pointers
• Each inode is the root of an unbalanced tree of data blocks
– 15 total pointers: 12 direct pointers, plus single, double, and triple indirect pointers
– The 12 direct pointers reach 12 blocks * 4KB = 48KB of data
[Figure: an inode fanning out to data blocks through its direct, single indirect, double indirect, and triple indirect pointers]
38
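To make the reach of those pointers concrete, a small C calculation of how much data one such inode can address, assuming 4KB blocks and 4-byte block pointers (so one indirect block holds 1024 pointers); this is the standard back-of-the-envelope result, not anything read from a real disk.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t block = 4096, ptrs = block / 4;        /* 1024 pointers/block */

    uint64_t direct = 12 * block;                   /* 48 KB */
    uint64_t single = ptrs * block;                 /* 4 MB  */
    uint64_t dbl    = ptrs * ptrs * block;          /* 4 GB  */
    uint64_t triple = ptrs * ptrs * ptrs * block;   /* 4 TB  */

    printf("max file size ~= %llu bytes\n",
           (unsigned long long)(direct + single + dbl + triple));
    return 0;
}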
File Reading Example
[Timeline figure: columns for the data/inode bitmaps, the root, tmp, and file inodes, and the root/tmp directory and file data blocks. open(“/tmp/file”) triggers a chain of reads down the path (root inode, root data, tmp inode, tmp data, file inode); each subsequent read() reads the file’s inode and one data block and updates the inode’s last access time]

Write Example
[Timeline figure: same columns; creating and writing “/tmp/file” requires reads and writes of the bitmaps, the inodes, and the directory data, including updating the modified time of the directory; each write() performs several reads before the data and inode are written]
ext2 inodes, Again
Size (bytes) Name What is this field for?
2 mode Read/write/execute?
2 uid User ID of the file owner
4 size Size of the file in bytes
4 time Last access time
4 ctime Creation time
4 mtime Last modification time
4 dtime Deletion time
2 gid Group ID of the file
2 links_count How many hard links point to this file?
4 blocks How many data blocks are allocated to this file?
4 flags File or directory? Plus, other simple flags
60 block 15 direct and indirect pointers to data blocks
41
Hard Link Example
• Multiple directory entries may point to the same
inode
[amislove@ativ9 ~] ln -T ../cbw/my_file cbw_file
[Figure: on-disk layout (SB, inode bitmap, data bitmap, inodes, data blocks); after the ln, two directory entries point to the same inode]
42
Hard Link Details
• Hard links give you the ability to create many
aliases of the same underlying file
– Can be in different directories
• Target file will not be marked invalid (deleted)
until link_count == 0
– This is why POSIX “delete” is called unlink()
• Disadvantage of hard links
– Inodes are only unique within a single file system
– Thus, can only point to files in the same partition
43
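A minimal C sketch of creating and removing a hard link with the POSIX link() and unlink() calls; the paths are made up to match the earlier ln example.

#include <stdio.h>
#include <unistd.h>   /* link(), unlink() */

int main(void) {
    /* Hypothetical paths, mirroring the earlier example layout. */
    const char *target = "../cbw/my_file";
    const char *alias  = "cbw_file";

    /* Add a second directory entry that points at the same inode. */
    if (link(target, alias) != 0) { perror("link"); return 1; }

    /* "Deleting" a file just removes one name; the inode and its data
     * survive until links_count drops to 0. */
    if (unlink(alias) != 0) { perror("unlink"); return 1; }
    return 0;
}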
Soft Links
• Soft links are special files that include the path
to another file
– Also known as symbolic links
– On Windows, known as shortcuts
– File may be on another partition or device
44
Soft Link Example
[amislove@ativ9 ~] ln -s ../cbw/my_file cbw_file
1. Create a soft link file
2. Add it to the current directory
[Figure: on-disk layout (SB, inode bitmap, data bitmap, inodes, data blocks); cbw_file in amislove’s home directory is a new file whose contents are the path to cbw’s my_file]
45
ext: The Good and the Bad
• The Good – ext file system (inodes) support:
– All the typical file/directory features
– Hard and soft links
– More performant (less seeking) than FAT
• The Bad: poor locality
– ext is optimized for a particular file size distribution
– However, it is not optimized for spinning disks
– inodes and associated data are far apart on the disk!
[Figure: ext on-disk layout (SB, inode bitmap, data bitmap, inodes, data blocks)]
46
• Partitions and Mounting
• Basics (FAT)
• inodes and Blocks (ext)
• Block Groups (ext2)
• Journaling (ext3)
• Extents and B-Trees (ext4)
• Log-based File Systems
47
Status Check
• At this point, we’ve moved from FAT to ext
– inodes are imbalanced trees of data blocks
– Optimized for the common case: small files
• Problem: ext has poor locality
– inodes are far from their corresponding data
– This is going to result in long seeks across the disk
• Problem: ext is prone to fragmentation
– ext chooses the first available blocks for new data
– No attempt is made to keep the blocks of a file
contiguous
48
Fast File System (FFS)
• FFS developed at Berkeley in 1984
– First attempt at a disk aware file system
– i.e. optimized for performance on spinning disks
• Observation: processes tend to access files that
are in the same (or close) directories
– Spatial locality
• Key idea: place groups of directories and their
files into cylinder groups
– Introduced into ext2, called block groups
49
Block Groups
• In ext, there is a single set of key data structures
– One data bitmap, one inode bitmap
– One inode table, one array of data blocks
• In ext2, each block group contains its own key
data structures
[Figure: each block group contains its own inode bitmap, data bitmap, inode table, and data blocks]
52
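A quick C calculation of why an ext2 block group covers 128MB with 4KB blocks: the group’s data-block bitmap must fit in one block, so it can track at most 4096 * 8 blocks. This is a sketch of the arithmetic; a real implementation reads the actual group geometry from the super block.

#include <stdio.h>

int main(void) {
    unsigned long block_size       = 4096;
    unsigned long blocks_per_group = block_size * 8;        /* 32768  */
    unsigned long long group_bytes =
        (unsigned long long)blocks_per_group * block_size;  /* 128 MB */

    printf("blocks per group: %lu (%llu MB)\n",
           blocks_per_group, group_bytes / (1024 * 1024));
    return 0;
}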
• Partitions and Mounting
• Basics (FAT)
• inodes and Blocks (ext)
• Block Groups (ext2)
• Journaling (ext3)
• Extents and B-Trees (ext4)
• Log-based File Systems
53
Status Check
55
File Append Example
• Appending a block to a file requires three writes:
– Update the data bitmap
– Update the inode
– Write the data
• These three operations can potentially be done in any order
• … but the system can crash at any time
[Figure: the file’s inode (owner: christo, permissions: rw, size: 1 → 2, block pointers 4 and 5) plus the bitmaps and data blocks, before (v1, D1) and after (v2, D2) the append]
• Crash after only the data is written:
– Result: file system is consistent, but the data is lost
• Crash after only the inode is updated:
– Result: inode points to garbage data, and file system is inconsistent (data bitmap vs. inode)
60
fsck Tasks
• Superblock: validate the superblock, replace it
with a backup if it is corrupted
• Free blocks and inodes: rebuild the bitmaps by
scanning all inodes
• Reachability: make sure all inodes are reachable
from the root of the file system
• inodes: delete all corrupted inodes, and rebuild
their link counts by walking the directory tree
• directories: verify the integrity of all directories
• … and many other minor consistency checks 61
fsck: the Good and the Bad
• Advantages of fsck
– Doesn’t require the file system to do any work to
ensure consistency
– Makes the file system implementation simpler
• Disadvantages of fsck
– Very complicated to implement the fsck program
• Many possible inconsistencies that must be identified
• Many difficult corner cases to consider and handle
– fsck is super slow
• Scans the entire file system multiple times
• Imagine how long it would take to fsck a 40 TB RAID array
62
Approach 2: Journaling
• Problem: fsck is slow because it checks the entire
file system after a crash
– What if we knew where the last writes were before
the crash, and just checked those?
• Key idea: make writes transactional by using a
write-ahead log
– Commonly referred to as a journal
• Ext3 and NTFS use journaling
[Figure: on-disk layout: Superblock | Journal | Block Group 0 | Block Group 1 | … | Block Group N]
64
Write-Ahead Log
• Key idea: writes to disk are first written into a log
– After the log is written, the writes execute normally
– In essence, the log records transactions
• What happens after a crash…
– If the writes to the log are interrupted?
• The transaction is incomplete
• The user’s data is lost, but the file system is consistent
– If the writes to the log succeed, but the normal
writes are interrupted?
• The file system may be inconsistent, but…
• The log has exactly the right information to fix the problem
65
Data Journaling Example
• Assume we are appending to a file
– Three writes: inode v2, data bitmap v2, data D2
• Before executing these writes, first log them
Journal: TxB (ID=1) | I v2 | B v2 | D2 | TxE (ID=1)
68
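A runnable C sketch of the protocol just described: log the transaction (TxB, the three blocks, TxE) and only then checkpoint the blocks to their final locations. The write_block()/flush_disk() helpers and all block addresses are made up; a real file system would issue actual block I/O and write barriers here.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical, print-only "disk" helpers so the sketch is runnable. */
static void write_block(uint64_t addr, const char *what) {
    printf("write %-12s -> block %llu\n", what, (unsigned long long)addr);
}
static void flush_disk(void) { printf("flush (barrier)\n"); }

/* Data journaling for one append: log TxB, the three blocks, and TxE,
 * and only then checkpoint the blocks to their real locations. */
static void journal_append(uint64_t journal, uint64_t inode_addr,
                           uint64_t bitmap_addr, uint64_t data_addr) {
    uint64_t j = journal;

    write_block(j++, "TxB (ID=1)");
    write_block(j++, "inode v2");
    write_block(j++, "bitmap v2");
    write_block(j++, "data D2");
    flush_disk();                     /* the body must be on disk first */
    write_block(j++, "TxE (ID=1)");   /* commit point                   */
    flush_disk();

    /* Checkpoint: write the same blocks to their final locations. */
    write_block(inode_addr,  "inode v2");
    write_block(bitmap_addr, "bitmap v2");
    write_block(data_addr,   "data D2");
}

int main(void) {
    journal_append(1000, 12, 4, 52000);   /* made-up block addresses */
    return 0;
}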
Data Journaling Timeline
69
Crash Recovery (1)
• What if the system crashes during logging?
– If the transaction is not committed, data is lost
– But, the file system remains consistent
Journal: TxB | I v2 | B v2 | D2 (TxE was never written, so the transaction is not committed)
[Figure: on disk, the inode bitmap, data bitmap, inode (v1), and data (D1) are unchanged]
70
Crash Recovery (2)
• What if the system crashes during the checkpoint?
– File system may be inconsistent
– During reboot, transactions that are committed but not yet freed are replayed in order
– Thus, no data is lost and consistency is restored
71
Corrupted Transactions
• Problem: the disk scheduler may not execute
writes in-order
– Transactions in the log may appear committed, when
in fact they are invalid
Journal: TxB | I v2 | B v2 | D2 | TxE
73
Making Journaling Faster
• Journaling adds a lot of write overhead
• OSes typically batch updates to the journal
– Buffer sequential writes in memory, then issue one
large write to the log
– Example: ext3 batches updates for 5 seconds
• Tradeoff between performance and persistence
– Long batch interval = fewer, larger writes to the log
• Improved performance due to large sequential writes
– But, if there is a crash, everything in the buffer will be
lost
74
Meta-Data Journaling
• The most expensive part of data journaling is
writing the file data twice
– Meta-data is small (~1 sector), file data is large
• ext3 implements meta-data journaling
Journal: TxB | I v2 | B v2 | TxE (the data D2 is not logged)
[Figure: on disk, the bitmaps and inode go from v1 to v2 via the journal, while D2 is written directly to the data blocks next to D1]
75
Meta-Journaling Timeline
[Timeline figure: the journal writes and the data write are issued and completed; the transaction commits only after both the journaled meta-data and the data are written]
76
Crash Recovery Redux (1)
• What if the system crashes during logging?
– If the transaction is not committed, data is lost
– D2 will eventually be overwritten
– The file system remains consistent
Journal: TxB | I v2 | B v2 (TxE was never written, so the transaction is not committed)
[Figure: on disk, the bitmaps and inode are unchanged (v1), though D2 may already sit next to D1 in the data blocks]
77
Crash Recovery Redux (2)
• What if the system crashes during the checkpoint?
– File system may be inconsistent
– During reboot, transactions that are committed but not yet freed are replayed in order
– Thus, no data is lost and consistency is restored
78
Delete and Block Reuse
• Problem case with meta-data journaling: a directory’s data block is journaled (directory data is meta-data), the directory is then deleted, and a new file f1 reuses the same block
• If the journal is replayed after a crash, the old directory data overwrites f1’s data
[Figure: the journal holds transactions for the directory (TxB | dir | dir | TxE) and later for the new file (TxB | f1 | f1 | TxE); dir’s old data block and f1’s new data block are the same physical block]
80
Handling Delete
• Strategy 1: don’t reuse blocks until the delete is
checkpointed and freed
• Strategy 2: add a revoke record to the log
– ext3 used revoke records
81
Journaling Wrap-Up
• Today, most OSes use journaling file systems
– ext3/ext4 on Linux
– NTFS on Windows
• Provides excellent crash recovery with relatively
low space and performance overhead
• Next-gen OSes will likely move to file systems
with copy-on-write semantics
– btrfs and zfs on Linux
82
• Partitions and Mounting
• Basics (FAT)
• inodes and Blocks (ext)
• Block Groups (ext2)
• Journaling (ext3)
• Extents and B-Trees (ext4)
• Log-based File Systems
83
Status Check
• At this point:
– We not only have a fast file system
– But it is also resilient against corruption
• What’s next?
– More efficiency improvements!
84
Revisiting inodes
• Recall: inodes use indirection to acquire
additional blocks of pointers
• Problem: inodes are not efficient for large files
– Example: for a 100MB file, you need 25600 block
pointers (assuming 4KB blocks)
• This is unavoidable if the file is 100% fragmented
– However, what if large groups of blocks are
contiguous?
85
From Pointers to Extents
• Modern file systems try hard to minimize
fragmentation
– Since it results in many seeks, thus low performance
• Extents are better suited for contiguous files
• Each extent includes a block pointer and a length
[Figure: an inode holding six individual block pointers vs. an inode holding three extents, each a (block, length) pair]
86
Implementing Extents
• ext4 and NTFS use extents
• ext4 inodes include 4 extents instead of block
pointers
– Each extent can address at most 128MB of
contiguous space (assuming 4KB blocks)
– If more extents are needed, a data block is allocated
– Similar to a block of indirect pointers
87
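For reference, a C sketch of an extent record loosely modeled on ext4’s on-disk ext4_extent structure (see the kernel’s ext4 headers for the authoritative definition); the field names here are simplified.

#include <stdint.h>

/* Each extent maps a run of logical (file) blocks to a run of
 * physical (disk) blocks. */
struct extent_record {
    uint32_t logical_block;   /* first file block covered by the extent */
    uint16_t length;          /* number of blocks in the run            */
    uint16_t phys_hi;         /* high 16 bits of the physical block #   */
    uint32_t phys_lo;         /* low 32 bits of the physical block #    */
};

/* With 4KB blocks, a single extent of up to 32768 blocks covers
 * 32768 * 4KB = 128MB of contiguous space, matching the per-extent
 * limit mentioned above. */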
Revisiting Directories
• In ext, ext2, and ext3, each directory is a file with
a list of entries
– Entries are not stored in sorted order
– Some entries may be blank, if they have been deleted
• Problem: searching for files in large directories
takes O(n) time
– Practically, you can’t store >10K files in a directory
– It takes way too long to locate and open files
88
From Lists to B-Trees
• ext4 and NTFS encode directories as B-Trees to
improve lookup time to O(log N)
• A B-Tree is a type of balanced tree that is
optimized for storage on disk
– Items are stored in sorted order in blocks
– Each block stores between m and 2m items
• Suppose items i and j are in the root of the tree
– The root must have 3 children, since it has 2 items
– The three child groups contain items a < i, i < a < j,
and a > j
89
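A compact C sketch of the lookup such a directory B-Tree performs; the node layout is an in-memory simplification (real on-disk nodes are packed into fixed-size blocks), but it shows why lookups cost O(log N).

#include <string.h>

#define MAX_KEYS 8   /* each node holds between m and 2m keys; here 2m = 8 */

struct btree_node {
    int nkeys;                                /* number of keys in this node   */
    const char *keys[MAX_KEYS];               /* keys stored in sorted order   */
    struct btree_node *child[MAX_KEYS + 1];   /* nkeys + 1 children (if inner) */
    int leaf;                                 /* 1 if the node has no children */
};

/* Returns 1 if `name` is present in the (sub)tree rooted at `n`.
 * Each step descends one level, so lookup costs O(log N). */
int btree_lookup(const struct btree_node *n, const char *name) {
    while (n != NULL) {
        int i = 0;
        while (i < n->nkeys && strcmp(name, n->keys[i]) > 0)
            i++;                              /* find the first key >= name */
        if (i < n->nkeys && strcmp(name, n->keys[i]) == 0)
            return 1;                         /* found it in this node      */
        if (n->leaf)
            return 0;                         /* nowhere left to descend    */
        n = n->child[i];                      /* names < keys[i] live here  */
    }
    return 0;
}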
Example B-Tree
• ext4 uses a B-Tree variant known as an H-Tree
– The H stands for hash (sometimes called a B+Tree)
• Suppose you try to open(“my_file”, “r”)
– hash(“my_file”) = 0x0000C194
[Figure: the H-Tree root holds hash ranges (0x00AD1102, 0xCFF1A412) that direct the search to a leaf; the leaf entry for 0x0000C194 points to my_file’s inode]
90
ext4: The Good and the Bad
• The good – ext4 (and NTFS) supports:
– All of the basic file system functionality we require
– Improved performance from ext3’s block groups
– Additional performance gains from extents and B-
Tree directory files
• The bad:
– ext4 is an incremental improvement over ext3
– Next-gen file systems have even nicer features
• Copy-on-write semantics (btrfs and ZFS)
91
• Partitions and Mounting
• Basics (FAT)
• inodes and Blocks (ext)
• Block Groups (ext2)
• Journaling (ext3)
• Extents and B-Trees (ext4)
• Log-based File Systems
92
Status Check
• At this point:
– We have arrived at a modern file system like ext4
• What’s next?
– Go back to the drawing board and reevaluate from first principles
93
Reevaluating Disk Performance
• How has computer hardware been evolving?
– RAM has become cheaper and grown larger :)
– Random access seek times have remained very slow :(
• This changing dynamic alters how disks are used
– More data can be cached in RAM = fewer disk reads
– Thus, writes will dominate disk I/O
• Can we create a file system that is optimized for
sequential writes?
94
Log-structured File System
• Key idea: buffer all writes (including meta-data)
in memory
– Write these long segments to disk sequentially
– Treat the disk as a circular buffer, i.e. don’t overwrite
• Advantages:
– All writes are large and sequential
• Big question:
– How do you manage meta-data and maintain
structure in this kind of design?
95
Treating the Disk as a Log
• Same concept as data journaling
– Data and meta-data get appended to a log
– Stale data isn’t overwritten, it’s replaced
96
Buffering Writes
• LFS buffers writes in-memory into chunks
[Figure: writes are buffered in memory, then appended as one large segment to the giant on-disk log]
98
inode Maps
[Figure: in memory, LFS buffers data blocks 1-5, inodes 1 and 2, and an inode map; the whole segment is then appended to the giant on-disk log, which begins with a checkpoint region (CR)]
100
How to Read a File in LFS
• Suppose you want to read inode 1
1. Look up inode 1 in the checkpoint region
• inode map containing inode 1 is in sector X
2. Read the inode map at sector X
• inode 1 is in sector Y
3. Read inode 1
• File data is in sectors A, B, C, etc.
[Figure: the on-disk log: CR, data blocks 1-5, inodes 1 and 2, and the inode map]
101
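A toy, runnable C model of the three lookup steps above; the checkpoint region, inode map, and inode contents are tiny in-memory arrays standing in for sectors on disk, and all the numbers are made up.

#include <stdio.h>
#include <stdint.h>

#define NUM_INODES 4

/* Checkpoint region: for each inode number, the sector of the inode
 * map piece that covers it. */
static uint64_t checkpoint_region[NUM_INODES] = { 900, 900, 901, 901 };

/* Inode map: for each inode number, the sector that holds the inode. */
static uint64_t inode_map[NUM_INODES] = { 0, 740, 741, 880 };

/* Inodes: for this sketch, just the sector of the file's first block. */
static uint64_t first_data_sector[NUM_INODES] = { 0, 700, 701, 860 };

static void read_first_block(uint32_t inum) {
    uint64_t imap_sector  = checkpoint_region[inum];  /* step 1 */
    uint64_t inode_sector = inode_map[inum];          /* step 2: read at imap_sector  */
    uint64_t data_sector  = first_data_sector[inum];  /* step 3: read at inode_sector */

    printf("inode %u: imap @ %llu, inode @ %llu, data @ %llu\n",
           inum, (unsigned long long)imap_sector,
           (unsigned long long)inode_sector,
           (unsigned long long)data_sector);
}

int main(void) { read_first_block(1); return 0; }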
Directories in LFS
• Directories are stored just like in typical file systems
– Directory data stored in a file
– inode points to the directory file
– Directory file contains name → inode mappings
[Figure: the on-disk log: CR, the directory’s data block (name → inode entries), the directory and file inodes, the file’s data blocks, and the inode map]
102
Garbage
• Over time, the log is going to fill up with stale
data
– Highly fragmented: live data mixed with stale data
• Periodically, the log must be garbage collected
103
Garbage Collection in LFS
• Each cluster has a summary block
– Contains the block → inode mapping for each block in the cluster
• Which blocks are stale?
– Pointers from other clusters are invisible
• To check liveness, the GC reads each file with blocks in the cluster
– If the current info doesn’t match the summary, the blocks are stale
[Figure: each cluster in memory and on disk begins with a summary block (S) followed by its data blocks and inodes, e.g. S D1 D1 i1 D2 i2]
105
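A toy C sketch of that liveness check: the summary records which (inode, offset) each block belonged to, and a block is live only if the inode still maps that offset to this exact disk address. Everything here (structures, addresses, the current_addr() stand-in) is made up for illustration.

#include <stdio.h>
#include <stdint.h>

struct summary_entry { uint32_t inum; uint32_t offset; };

/* Pretend "current" file system state: where inode 7's blocks live now. */
static uint64_t current_addr(uint32_t inum, uint32_t offset) {
    if (inum == 7 && offset == 0) return 2048;   /* still at 2048: live  */
    if (inum == 7 && offset == 1) return 4096;   /* rewritten elsewhere  */
    return 0;
}

/* Live only if the owning file still points at this exact address. */
static int block_is_live(struct summary_entry e, uint64_t block_addr) {
    return current_addr(e.inum, e.offset) == block_addr;
}

int main(void) {
    struct summary_entry a = { 7, 0 }, b = { 7, 1 };
    printf("block 2048 live? %d\n", block_is_live(a, 2048));  /* 1        */
    printf("block 2049 live? %d\n", block_is_live(b, 2049));  /* 0: stale */
    return 0;
}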
File Systems for SSDs
• SSD hardware constraints
– To implement wear leveling, writes must be spread
across the blocks of flash
– Periodically, old blocks need to be garbage collected
to prevent write-amplification
• Does this sound familiar?
• LFS is the ideal file system for SSDs!
• Internally, SSDs manage data in an LFS-like manner
– This is transparent to the OS and end-users
– Ideal for wear-leveling and avoiding write-
amplification
106
Copy-on-write
• Modern file systems incorporate ideas from LFS
• Copy-on-write semantics
– Updated data is written to empty space on disk,
rather than overwriting the original data
– Helps prevent data corruption, improves sequential
write performance
• Pioneered by LFS, now used in ZFS and btrfs
– btrfs will probably be the next default file system in
Linux
107
Versioning File Systems
• LFS keeps old copies of data by default
• Old versions of files may be useful!
– Example: accidental file deletion
– Example: accidentally doing open(file, ‘w’) on a file
full of data
• This turns an LFS flaw into a virtue
• Many modern file systems are versioned
– Old copies of data are exposed to the user
– The user may roll-back a file to recover old versions
108