Linux VFS and Block Layers
1
Rights to Copy
Attribution – ShareAlike 2.0
You are free
to copy, distribute, display, and perform the work
to make derivative works
to make commercial use of the work
Under the following conditions
Attribution. You must give the original author credit.
Share Alike. If you alter, transform, or build upon this work, you may
distribute the resulting work only under a license identical to this one.
For any reuse or distribution, you must make clear to others the
license terms of this work.
Any of these conditions can be waived if you get permission from
the copyright holder.
Your fair use and other rights are in no way affected by the above.
License text: http://creativecommons.org/licenses/by-sa/2.0/legalcode
This kit contains work by the following authors:
© Copyright 2004-2009 Michael Opdenacker / Free Electrons, http://www.free-electrons.com
© Copyright 2003-2006 Oron Peled, http://www.actcom.co.il/~oron
© Copyright 2004–2008 Codefidence ltd., http://www.codefidence.com
© Copyright 2009–2017 Bina ltd., http://www.bna.co.il
2
Block Device Drivers
Linux driver types:
Character Device Drivers
Block Device Drivers
Network Device Drivers
3
Architecture
Layers, from top to bottom:
User space: user applications
Kernel: Virtual File System
Kernel: individual filesystems
Kernel: buffer/page cache
Kernel: block layer
Hardware
4
VFS
Linux provides a unified Virtual File System interface:
The VFS layer supports abstract operations.
Specific file systems implement them.
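The split between abstract operations and per-filesystem implementations can be modelled in user space with a table of function pointers. The sketch below is only an analogy: struct fs_ops, demo_read, demo_ops and generic_read are hypothetical names, not kernel APIs.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a kernel operations table. */
struct fs_ops {
    /* Copy up to len bytes into buf; return the number of bytes copied. */
    size_t (*read)(char *buf, size_t len);
};

/* One "filesystem" implementing the abstract read operation. */
static size_t demo_read(char *buf, size_t len)
{
    static const char data[] = "hello";
    size_t n = sizeof(data) - 1;        /* exclude the trailing NUL */

    if (n > len)
        n = len;
    memcpy(buf, data, n);
    return n;
}

static const struct fs_ops demo_ops = { .read = demo_read };

/* The generic layer: checks its arguments, then dispatches via the table. */
static size_t generic_read(const struct fs_ops *ops, char *buf, size_t len)
{
    if (ops == NULL || ops->read == NULL || buf == NULL)
        return 0;
    return ops->read(buf, len);
}
```

Calling generic_read(&demo_ops, buf, sizeof(buf)) never names demo_read directly; a second "filesystem" would only need to supply its own table.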
5
Example - read
User space:
x = read(fd, buffer, size);
Kernel:
sys_read(fd, buffer, size);
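On the user-space side, read(2) may return fewer bytes than requested, so callers often wrap it in a loop. A minimal sketch, assuming a blocking file descriptor; read_full is a hypothetical helper name, not part of any library.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Read up to size bytes from fd, retrying on short reads and EINTR.
 * Returns the number of bytes read (less than size only at end of file),
 * or -1 on error with errno set by read(2).
 */
static ssize_t read_full(int fd, void *buffer, size_t size)
{
    char *p = buffer;
    size_t done = 0;

    while (done < size) {
        ssize_t x = read(fd, p + done, size - done);

        if (x < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            return -1;
        }
        if (x == 0)
            break;                  /* end of file */
        done += (size_t)x;
    }
    return (ssize_t)done;
}
```

Each call to read() in the loop enters the kernel through the system call shown above.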
6
vfs_read
Performs some checks and calls __vfs_read
7
VFS
The major VFS abstract objects:
File - an open file (referenced by a file descriptor).
Dentry - a directory entry; maps names to inodes.
Inode - a file's inode; contains persistent information.
Superblock - the descriptor of a mounted filesystem.
8
File Object
Information about an open file
Mode
Position
…
The file descriptor table is per process
Its maximum size can be set using ulimit -n
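The limit that ulimit -n reports can also be queried from a program with getrlimit(2). A small sketch; open_files_limit is a hypothetical helper name.

```c
#include <sys/resource.h>

/*
 * Fetch the soft and hard limits on the number of open file descriptors
 * for the calling process. The soft limit is the value that `ulimit -n`
 * prints by default. Returns 0 on success, -1 on error.
 */
static int open_files_limit(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;
    *hard = rl.rlim_max;
    return 0;
}
```

An unprivileged process can raise its soft limit only up to the hard limit.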
9
Dentry Object
Information about a directory entry
Name
Pointer to the inode
10
dcache
The VFS implements the open(2), stat(2), chmod(2),
and similar system calls.
The pathname argument that is passed to them is
used by the VFS to search through the directory entry
cache (dcache)
11
Inode Object
unique descriptor of a file or directory
contains permissions, timestamps, block map (data)
Usually stored in special block(s) on the disk
inode#: integer (unique per mounted filesystem)
Filesystem: fn(inode#) => data
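From user space, the inode number is visible through stat(2). The sketch below returns it together with the device ID, since the number is unique only within one mounted filesystem; inode_of is a hypothetical helper name.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Look up the inode number and owning device of path.
 * Returns 0 on success, -1 on error (errno set by stat(2)).
 */
static int inode_of(const char *path, dev_t *dev, ino_t *ino)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    *dev = st.st_dev;   /* identifies the mounted filesystem */
    *ino = st.st_ino;   /* unique within that filesystem */
    return 0;
}
```

Two names that refer to the same file, for example two hard links, report the same (st_dev, st_ino) pair.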
12
13
Superblock
The file system metadata
Defines the file system type, size, status, and
information about other metadata structures
14
struct vfsmount
Represents a mounted instance of a particular file
system
15
VFS Structures
17
Registering a new FS
Call register_filesystem() and pass a pointer to:
struct file_system_type
Fields:
Name (listed in /proc/filesystems)
Flags
Mount callback
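Whether a filesystem type has been registered can be checked from user space by scanning /proc/filesystems. A minimal sketch, assuming the usual one-entry-per-line layout (an optional "nodev" flag, a tab, then the name); fs_type_listed is a hypothetical helper and takes an already opened stream so it can be tried on any file.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan a stream in /proc/filesystems format for the type `name`.
 * Each line is "<flag>\t<name>\n", where <flag> is "nodev" or empty.
 * Returns 1 if the type is listed, 0 otherwise.
 */
static int fs_type_listed(FILE *f, const char *name)
{
    char line[256];

    while (fgets(line, sizeof(line), f) != NULL) {
        char *tab = strchr(line, '\t');
        char *fs = tab ? tab + 1 : line;

        fs[strcspn(fs, "\n")] = '\0';   /* strip the trailing newline */
        if (strcmp(fs, name) == 0)
            return 1;
    }
    return 0;
}
```

Typical use is to open /proc/filesystems with fopen() and pass the stream together with the name the driver registered.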
18
Mount
mount -t myfs /dev/myblk /myfs
The mount callback is called
Typical implementation:
mount_bdev/mount_nodev/mount_mtd
19
Filling the super block
Extends the super block to add private data
20
Filling the super block (2)
Create a root inode
Set inode_operations
Set file_operations
Set address_space_operations
21
super_operations
alloc/read/write/clear/delete inode
put_super (release)
freeze/unfreeze/remount/sync the file system
show_options (/proc/[pid]/mounts)
statfs – statfs(2)
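The figures a filesystem reports through its statfs operation can be read from user space. The sketch below uses the POSIX statvfs(3) interface rather than statfs(2) itself, which is an assumption made for portability; fs_space is a hypothetical helper name.

```c
#include <sys/statvfs.h>

/*
 * Report the total size and the space available to unprivileged users,
 * in bytes, of the filesystem containing path.
 * Returns 0 on success, -1 on error (errno set by statvfs(3)).
 */
static int fs_space(const char *path, unsigned long long *total,
                    unsigned long long *avail)
{
    struct statvfs sv;

    if (statvfs(path, &sv) != 0)
        return -1;
    /* Block counts are expressed in units of f_frsize bytes. */
    *total = (unsigned long long)sv.f_blocks * sv.f_frsize;
    *avail = (unsigned long long)sv.f_bavail * sv.f_frsize;
    return 0;
}
```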
22
inode_operations
create – new inode for regular file
link/unlink/rename – add/remove/modify dir entry
symlink, readlink, get_link – soft link ops
mkdir/rmdir – new inode for directory file
mknod – new inode for device file
permission – check access permissions
lookup – called when the VFS needs to look up an
inode in a parent directory
23
file_operations
open/release
read/write
read_iter/write_iter – async ops
iterate – directory content (ls)
mmap/lock/sync/poll
*_ioctl
…
24
dentry_operations
The filesystem can overload the standard dentry
operations
Special cases
MS-DOS 8.3 name limit
FAT case-insensitive names
25
ls /myfs
stat(2) the path
26
iterate
Checks the requested position (the user buffer can
be smaller than the directory content)
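The position check can be illustrated with a user-space model in which the caller's buffer holds fewer entries than the directory, so the listing needs several calls. This is only a sketch of the idea: struct dir_pos and iterate_dir are hypothetical names and do not mirror the kernel's types.

```c
#include <stddef.h>

/* Hypothetical iteration context: pos is the index of the next entry. */
struct dir_pos {
    size_t pos;
};

/*
 * Copy directory entry names into out[], starting at ctx->pos, until
 * either the directory or the caller's buffer (cap slots) is exhausted.
 * Returns the number of entries emitted; 0 means end of directory.
 */
static size_t iterate_dir(const char *const *names, size_t count,
                          struct dir_pos *ctx,
                          const char **out, size_t cap)
{
    size_t n = 0;

    if (ctx->pos >= count)          /* position check: nothing left */
        return 0;
    while (ctx->pos < count && n < cap)
        out[n++] = names[ctx->pos++];
    return n;
}
```

Because the position is stored in the context, each call resumes where the previous one stopped instead of starting over.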
27
Simple example
28
inode_operations lookup
For each directory entry name, we need to find the entry and
fill in a dentry object and an inode object
29
Open a file
sys_open
do_sys_open
do_filp_open
path_openat
get_empty_filp
do_last
vfs_open
do_dentry_open
Set the file_operations (fops_get)
Call the open callback if it exists
30
Read/Write
sys_read -> vfs_read -> __vfs_read
sys_write -> vfs_write -> __vfs_write
31
Generic Functions
32
libfs
fs/libfs.c – a library for filesystem writers
simple_* functions
simple_lookup, simple_rmdir, …
Simple file_operations
Simple inode_operations
33
Integration with Memory Subsystem
The address_space object
Used to group and manage pages in the page cache
One per file
The “physical analogue” to the virtual vm_area_struct
Radix tree enables quick searching for the desired page, given only the file offset
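The key used for that search is the page index derived from the file offset. A sketch of the arithmetic, assuming 4 KiB pages; DEMO_PAGE_SHIFT, page_index and offset_in_page are hypothetical names chosen for the example.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12u                     /* assumes 4 KiB pages */
#define DEMO_PAGE_SIZE  (UINT64_C(1) << DEMO_PAGE_SHIFT)

/* Index of the page-cache page that holds byte `offset` of a file. */
static uint64_t page_index(uint64_t offset)
{
    return offset >> DEMO_PAGE_SHIFT;
}

/* Position of byte `offset` inside that page. */
static uint64_t offset_in_page(uint64_t offset)
{
    return offset & (DEMO_PAGE_SIZE - 1);
}
```

With these assumptions, byte 10000 of a file lives in page 2 at offset 1808.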
35
address_space_operations
readpage – read a page from backing store
writepage - write a dirty page to backing store
readpages/writepages
set_page_dirty
write_begin - Called by the generic buffered write code to
ask the filesystem to prepare to write len bytes at the
given offset in the file
36
The Page Cache
Page cache can read individual disk blocks whose size is
determined by the filesystem
37
The Page Cache
Use mark_buffer_dirty to flag a buffer as dirty
Dirty buffers need their data to be synced to disk
38
Read/Write Examples
39
Simple readpage()
40
bio – IO Request
Historically, a buffer_head was used to map a single
block within a page, and of course as the unit of I/O
through the filesystem and block layers.
Nowadays the basic I/O unit is the bio
See EXT4 readpages for example (submit_bio)
buffer_heads are used for:
extracting block mappings (via a get_block_t call),
tracking state within a page (via a page_mapping)
wrapping bio submission for backward compatibility
reasons (e.g. submit_bh).
41
submit_bh
Calls submit_bh_wbc
From here on down, it's all bio
bio_alloc
bio_add_page
…
submit_bio
To the request layer
42
Inside the block layer
Buffer/page cache
Block layer, including the I/O scheduler
Block drivers
Hardware
44
Inside the block layer (2)
The block layer allows block device drivers to receive
I/O requests, and is in charge of I/O scheduling
46
Available I/O schedulers
I/O schedulers in current kernels
Noop, for non-disk based block devices
Deadline, tries to guarantee that an I/O will be served
within a deadline
CFQ, Completely Fair Queuing, the default
scheduler, tries to guarantee fairness between users of a
block device
47
Looking at the code
The block device layer is implemented in the block/
directory of the kernel source tree
This directory also contains the I/O scheduler code, in
the *-iosched.c files.
50
Registering the major
The first step in the initialization of a block device driver
is the registration of the major number
int register_blkdev(unsigned int major,
const char *name);
Major can be 0, in which case it is dynamically allocated
Once registered, the driver appears in /proc/devices with the
other block device drivers
52
Initializing a disk
Allocate a gendisk structure
struct gendisk *alloc_disk(int minors)
53
Initializing a disk (2)
Initialize the gendisk structure
Fields major, first_minor, fops, disk_name and queue
should at the minimum be initialized
private_data can be used to store a pointer to some
private information for the disk
Set the capacity
void set_capacity(struct gendisk *disk, sector_t size)
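The size passed to set_capacity() is counted in 512-byte sectors, whatever the hardware block size of the device. A sketch of the conversions a driver might need; DEMO_SECTOR_SHIFT, bytes_to_sectors and blocks_to_sectors are hypothetical names, and block sizes are assumed to be multiples of 512.

```c
#include <stdint.h>

#define DEMO_SECTOR_SHIFT 9     /* 512-byte sectors */

/* Number of whole 512-byte sectors in a device of `bytes` bytes. */
static uint64_t bytes_to_sectors(uint64_t bytes)
{
    return bytes >> DEMO_SECTOR_SHIFT;
}

/* Sector count for a device made of nblocks blocks of block_size bytes. */
static uint64_t blocks_to_sectors(uint64_t nblocks, uint32_t block_size)
{
    return nblocks * (block_size >> DEMO_SECTOR_SHIFT);
}
```

For instance, a 1 MiB device made of 256 blocks of 4096 bytes has a capacity of 2048 sectors.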
55
Unregistering a disk
Unregister the disk
void del_gendisk(struct gendisk *gp);
56
block_device_operations
A set of function pointers
open() and release(), called when a device handled
by the driver is opened and closed
ioctl() for driver-specific operations. unlocked_ioctl()
is the non-BKL variant, and compat_ioctl() is for 32-bit
processes running on a 64-bit kernel
direct_access() required for XIP support, see
http://lwn.net/Articles/135472/
media_changed() and revalidate() required for
removable media support
getgeo(), to provide geometry information to
userspace
57
A simple request() function
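static void foo_request(struct request_queue *q)
{
        /* ... for each request req taken from q (loop elided in the source) ... */
        if (!blk_fs_request(req))
                continue;       /* skip non-filesystem requests */
}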
58
A simple request() function (2)
Information about the transfer is available in the
struct request
sector, the position in the device at which the transfer
should be made
current_nr_sectors, the number of sectors to transfer
buffer, the location in memory where the data should be
read or written to
rq_data_dir(), the type of transfer, either READ or WRITE
60
Request queue configuration (1)
blk_queue_bounce_limit(queue, u64)
Tells the kernel the highest physical address that the
device can handle. Above that address, bouncing will
be made. BLK_BOUNCE_HIGH, BLK_BOUNCE_ISA and
BLK_BOUNCE_ANY are special values
HIGH: will bounce if the pages are in high-memory
ISA: will bounce if the pages are not in the ISA 16 MB
zone
ANY: will not bounce
61
Request queue configuration (2)
blk_queue_max_sectors(queue, unsigned int)
Tell the kernel the maximum number of 512-byte
sectors for each request.
blk_queue_max_phys_segments(queue, unsigned short)
blk_queue_max_hw_segments(queue, unsigned short)
Tell the kernel the maximum number of non-memory-
adjacent segments that the driver can handle in a
single request (default 128).
63
Inside a request
A request is composed of several segments, that are
contiguous on the block device, but not necessarily
contiguous in physical memory
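A loose user-space analogy for this layout is gather I/O with writev(2): several separate memory buffers are written as one contiguous run of bytes. The sketch below is only that analogy and not block-layer code; write_two_segments is a hypothetical helper name.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>     /* ssize_t */

/*
 * Write two separate memory buffers to fd as one contiguous run of bytes.
 * Returns the number of bytes written, or -1 on error.
 */
static ssize_t write_two_segments(int fd, const void *a, size_t alen,
                                  const void *b, size_t blen)
{
    struct iovec segs[2];

    segs[0].iov_base = (void *)a;   /* iov_base is not const-qualified */
    segs[0].iov_len  = alen;
    segs[1].iov_base = (void *)b;
    segs[1].iov_len  = blen;
    return writev(fd, segs, 2);
}
```

The two buffers can live anywhere in memory, yet the destination receives them back to back, much as the segments of a request land on consecutive sectors.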
65
Request example
66
Request Hooks
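struct block_device *blkdev;
struct request_queue *blkdev_queue;
request_fn_proc *original_request_fn;
blkdev = lookup_bdev("/dev/sda", 0);
blkdev_queue = bdev_get_queue(blkdev);
/* Save the original handler, then install the hook */
original_request_fn = blkdev_queue->request_fn;
blkdev_queue->request_fn = my_request_fn;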
67
Asynchronous operations
If you handle several requests at the same time, which is
often the case when handling them in an asynchronous manner,
you must dequeue the requests from the queue:
void blkdev_dequeue_request(struct request *req);
68
Asynchronous operations (2)
Once the request is outside the queue, it's the
responsibility of the driver to process all segments of
the request
71
DMA (3)
After the DMA transfer completion, the segments must
be unmapped, using
void dma_unmap_sg(struct device *dev,
struct scatterlist *sglist,
int nents,
enum dma_data_direction dir)
nents must be the count originally passed to dma_map_sg(),
not the number of entries it returned
72
MMC / SD
Block layer
MMC Core
CONFIG_MMC
drivers/mmc/core/
73
MMC host driver
For each host
struct mmc_host *mmc_alloc_host(int extra,
struct device *dev)
Initialize struct mmc_host fields: caps, ops,
max_phys_segs, max_hw_segs, max_blk_size,
max_blk_count, max_req_size
int mmc_add_host(struct mmc_host *host)
At unregistration
void mmc_remove_host(struct mmc_host *host)
void mmc_free_host(struct mmc_host *host)
74
MMC host driver (2)
The mmc_host->ops field points to an mmc_host_ops
structure
Handle an I/O request
void (*request)(struct mmc_host *host,
struct mmc_request *req);
Set configuration settings
void (*set_ios)(struct mmc_host *host,
struct mmc_ios *ios);
Get read-only status
int (*get_ro)(struct mmc_host *host);
Get the card presence status
int (*get_cd)(struct mmc_host *host);
75