osbook-v0.72
Kernel-Oriented Approach
(Partially written. Expect grammatical mistakes
and minor technical errors.
Updates are released every week on Fridays.)
Send bug reports/suggestions to
[email protected]
Version 0.72
Smruti R. Sarangi
January 5, 2025
This work is licensed under a Creative Commons Attribution-NoDerivs 4.0
International License. URL: https://creativecommons.org/licenses/by-nd/4.0/deed.en
List of Trademarks
• Linux is a registered trademark owned by Linus Torvalds.
• Microsoft and Windows are registered trademarks of Microsoft Corpora-
tion.
• Android, Chrome and Chrome OS are registered trademarks of Google
LLC.
• The Unix trademark is owned by the Open Group.
1 Introduction
1.1 Types of Operating Systems
1.2 The Linux OS
1.2.1 Versions, Statistics and Conventions
1.3 Organization of the Book
3 Processes
3.1 The Notion of a Process
3.2 The Process Descriptor
3.2.1 struct task_struct
3.2.2 struct thread_info
3.2.3 Task States
3.2.4 Kernel Stack
3.2.5 Task Priorities
3.2.6 Computing Actual Task Priorities
Figure 1.1: Diagram of the overall system (CPU, memory, hard disk and I/O devices)
Figure 1.2: Place of the OS in the overall system (programs P1, P2 and P3 run on the operating system, which manages the hardware: CPUs, memory, and I/O & storage)
Summary 1.0.1
• Programs share hardware such as the CPU, the memory and stor-
age devices. These devices have to be fairly allocated to different
programs based on user-specified priorities. The job of the OS is
to do a fair resource allocation.
• There may be many running programs that are trying to access the same set of
memory locations. This may be a security violation, or it may be a genuine
shared memory-based communication pattern. There is a need to differentiate
between the two by providing neat and well-defined mechanisms.
• Power, temperature and security concerns have become very im-
portant over the last decade. Any operating system that is being
designed today needs to run on very small devices such as mobile
phones, tablets and even smartwatches. In the foreseeable future, it may run
on even smaller devices such as smart glasses or devices that are embedded
within the body. Hence, it is important for an OS to be extremely power-aware.
experience. Moreover, in this case the code size and the memory footprint of
the OS need to be very small.
proprietary walls; this allowed the community to make rapid progress because
all incremental improvements had to be shared. However, at that point in time,
this was not the case with other pieces of software. Users or developers were not
duty-bound to contribute back to the mother repository. As a result, a lot
of the innovations that were made by large research groups and multinational
companies were not given back to the community.
Over the years, Linux has grown by leaps and bounds in terms of function-
ality and popularity. By 2000, it had established itself as a worthy desktop and
server operating system. People started taking it seriously and many academic
groups started moving away from UNIX to adopt Linux. Given that Linux was
reasonably similar to UNIX in terms of the interface and some other high-level
design decisions, it was easy to migrate from Unix to Linux. The year 2003
was a pivotal year for the Linux community. This year Linux kernel version 2.6
was released. It had a lot of advanced features and was very different from the
previous kernel versions. After this, Linux started being taken very seriously in
both academic and industry circles. In a certain sense, it had come of age and
had entered the big league. Many companies sprang up offering Linux-based
products, which included the kernel bundled with a set of packages
(software programs) along with custom support.
Over the years, Linux distributions such as Red Hat® , Suse® and Ubuntu®
(Canonical® ) have come to dominate the scene. As of writing this book, circa
2024, they continue to be major Linux vendors. Since 2003, a lot of other
changes have also happened. Linux has found many new applications – it has
made major inroads into the mobile and handheld market. The Android operating
system, which as of 2023 dominates the mobile operating system space,
is based on Linux. Many of the operating systems for smart devices and other
wearable gadgets are based on Android. In addition, Google® ’s Chrome OS
is also a Linux-derived variant. So are other operating systems for Smart TVs
such as LG® ’s webOS and Samsung® ’s Tizen.
Point 1.2.1
It is important to understand the economic model. Especially in the
early stages, the GPL licensing model made a lot of difference and
was very successful in single-handedly propelling the Linux movement.
We need to understand that Linux carved a niche of its own in terms
of performance roughly a decade after the project began. The reason
it grew and was sustained by a large team of developers in its first,
formative decade is that they saw a future in it. This also included large
for-profit companies. The financial logic behind such extraordinary
foresight is quite straightforward.
As of today, Linux is not the only free open-source operating system. There
are many others that are derived from classical UNIX, notably the BSD
(Berkeley Software Distribution) family of operating systems. Important
members of this family are FreeBSD®, OpenBSD and NetBSD®. Akin to Linux, their
code is also free to use and distribute. Of course, they follow a different licensing
mechanism, which is not as restrictive as the GPL. They are very good
operating systems in their own right. They have their niche markets, and they
have a large developer community that actively adds features and ports them
to new hardware. A paper by your author and his student S. S. Singh [Singh
and Sarangi, 2020] nicely compares three operating systems – Linux, FreeBSD
and OpenBSD – in terms of their performance across different workloads.
Example: Linux Versions
A Linux version string has the form x.y.z-rc<num>, where x is the major version number, y the minor version number, z the patch number, and the optional -rc<num> suffix denotes a release candidate (a test release).
Consider Linux version 6.2.16. Here 6 is the major version number, 2 is the mi-
nor version number and 16 is the patch number. Every ⟨major, minor⟩ version
pair has multiple patch numbers associated with it. A major version represents
important architectural changes. The minor version adds important bug fixes
and feature additions. A patch mostly focuses on minor issues and security-
related bug fixes. Every time there is an important feature-related commit, a
patch is created. Prior to 2004, even minor version numbers were associated with
stable releases and odd minor version numbers with development releases.
Ever since Linux kernel version 3.0, this practice has not been adhered to. Every
version is stable now. Development versions are now release candidates that
predate stable versions.
Every new patch is associated with multiple release candidates. A release
candidate does not have major bugs; it incorporates multiple smaller fixes and
feature additions that are not fully verified. These release candidates are con-
sidered experimental and are not fully ready to be used in a production setting.
They are numbered -rc1, -rc2, and so on. They are mainly aimed at other
Linux developers, who can download these release candidates, test their features,
suggest improvements and initiate a process of (mostly) online discussion. Once
the discussions have converged, the release candidates are succeeded by a stable
version (read patch or major/minor version).
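To make this concrete, here is a small user-space sketch (plain POSIX C, not kernel code) that prints the release string of the running kernel; on a machine running the release discussed above, it would print a string such as 6.2.16, possibly followed by a distribution-specific suffix.

#include <stdio.h>
#include <sys/utsname.h>

int main(void)
{
    struct utsname info;

    /* uname() fills in the kernel name, release and related fields. */
    if (uname(&info) != 0) {
        perror("uname");
        return 1;
    }

    /* info.release holds the version string, e.g., "6.2.16". */
    printf("Kernel release: %s\n", info.release);
    return 0;
}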
Let us now provide an overview of the Linux code base (see Figure 1.4). The
architecture subsystem of the kernel contains all the code that is architecture
specific. The Linux kernel has a directory called arch that contains various
subdirectories. Each subdirectory corresponds to a distinct architecture such as
x86, ARM, Sparc, etc. An OS needs to rely on processor-specific code for various
critical actions like booting, device drivers and access to privileged hardware
operations. All of this code is nicely bundled up in the arch directory. The
rest of the code of the operating system is independent of the architecture. It
is not dependent on the ISA or the machine. It relies on primitives, macros
and functions defined in the corresponding arch subdirectory. All the operating
system code relies on these abstractions such that developers do not have to
concern themselves with details of the architecture such as whether it is 16-bit
or 32-bit, little endian or big endian, CISC or RISC. This subsystem contains
more than 1.7 million lines of code.
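As a toy illustration of what such an abstraction looks like, consider endianness. The sketch below uses a hypothetical macro name (the Linux kernel provides analogous helpers such as cpu_to_le32()); generic code simply calls the macro, and its definition depends on the machine.

#include <stdint.h>

/* Hypothetical helper; shown only to illustrate the idea of an
 * architecture-specific primitive hidden behind a common name. */
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#define MY_CPU_TO_LE32(x) __builtin_bswap32(x)  /* big-endian CPU: swap bytes */
#else
#define MY_CPU_TO_LE32(x) (x)                   /* little-endian CPU: no-op */
#endif

/* Architecture-independent code: it never checks the endianness itself. */
uint32_t pack_length_field(uint32_t length)
{
    return MY_CPU_TO_LE32(length);
}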
The other subsystems that contain large volumes of code are the code
bases for the file system and the network stack. Note that a popular OS such
as Linux needs to support many file systems and network protocols. As a result,
the code base for these directories is quite large. The other subsystems, such as
the memory and security modules, are comparatively much smaller.
Figure 1.5 shows the list of prominent directories in the Linux kernel. The
kernel directory contains all the core features of the Linux kernel. Some of the
most important subsystems are the scheduler, time manager, synchronization
manager and debugging subsystem. The kernel directory is by far the most
important subsystem, i.e., the core kernel. We will focus a lot on this subsystem.
We have already seen the arch directory. A related directory is the init
directory that contains all the booting code. Both these directories are hardware
dependent.
The mm, fs, block and io_uring directories contain important code for the
memory subsystem, the file system, the block layer and asynchronous I/O,
respectively. The code for
virtualizing an operating system is resident in the virt directory. Virtualizing
the OS means that we can run an OS as a regular program on top of the Linux
OS. This subsystem is tightly coupled with the memory, file and I/O subsystems.
Finally, note that the largest directory is drivers, which contains drivers (specialized
programs for talking to devices) for a large number of I/O devices. This
directory is so large because an operating system such as Linux needs to support
a large amount of hardware. For every hardware device, we should not expect
the user to browse the web, locate its driver and install it. Hence, there is a
need to include its code in the code base of the kernel itself. At the same time,
we do not want to include the code of every single device driver on the planet
in the code base of the kernel; its code would become prohibitively large. Rarely
used and obsolescent devices can be left out. Hence, the developers of the ker-
nel need to judiciously choose the set of drivers that need to be included in
the kernel’s code base, which is released and distributed. These devices should
be reasonably popular, and the drivers should be deemed to be safe (devoid of
security issues).
Figure 1.6 shows the list of chapters and appendixes in the book. All the
chapters use concepts that may require the user to refer to the appendixes.
There are three appendixes in the book. Appendix A introduces the x86 as-
sembly language (the 64-bit variant). We shall refer to snippets of assembly
code throughout the text. To understand them thoroughly, it is necessary to be
familiar with x86 assembly. Most of the critical routines in operating systems
are still written in assembly language for speed and efficiency. Appendix B
describes the compiling, linking and loading process. This appendix should be
read thoroughly because it is important to understand how large C-based software
projects are structured. Readers should know the specific roles of C files,
header files, .o files, and statically and dynamically linked libraries. These concepts
are described in detail in that appendix. Finally, Appendix C introduces the most
commonly used data structures in the Linux kernel. A lot of the data structures
that we typically study in a basic undergraduate data structures course have
the urgent work is completed immediately and the rest of the work is completed
later. There are different types of kernel tasks that can do such deferred work.
They run with different priorities and have different features. Specifically, we
shall introduce softirqs, threaded IRQs and work queues. Finally, we shall
introduce signals, which are the reverse of system calls. The OS uses signals to
send messages to running processes. For example, if we press a mouse button,
then a message goes to the running process regarding the mouse click and its
coordinates. This happens via signals. Here also there is a need to change the
context of the user application because it now needs to start processing the
signal.
Chapter 5 is a long chapter on synchronization and scheduling. In any
modern OS, we have hundreds of running tasks that often try to access shared
resources concurrently. Many such shared resources can only be accessed by
one thread at a time. Hence, there is a need for synchronizing the accesses.
This is known as locking in the context of operating systems. Locking is a large
and complex field that has a fairly strong overlap with advanced multiprocessor
computer architecture. We specifically need to understand it in the context of
memory models and data races. Memory models determine the valid outcomes
of concurrent programs on a given architecture. We shall observe that it is
often necessary to restrict the space of outcomes using special instructions to
correctly implement locks. If locks are correctly implemented and used, then
uncoordinated accesses known as data races will not happen. Data races are
the source of a lot of synchronization-related bugs. Once the basic primitive
has been designed, we shall move on to discussing different types of locks and
advanced synchronization mechanisms such as semaphores, condition variables,
reader-writer locks and barriers. The kernel needs many concurrent data struc-
tures such as producer-consumer queues, mutexes, spinlocks and semaphores to
do its job. We shall look at their design and implementation in detail.
Next, we shall move on to explaining a very interesting synchronization
primitive that is extremely lightweight and derives its correctness by stopping
task preemption at specific times. It is known as the read-copy-update (RCU)
mechanism, which is widely used in the kernel code. It is arguably one of the
most important innovations made by the designers of the kernel, which has had
far-reaching implications. It has obviated the need for a garbage collector. We
shall then move on to discussing scheduling algorithms. After a cursory intro-
duction to trivial algorithms like shortest-job first and list scheduling, we shall
move on to algorithms that are actually used in the kernel such as completely
fair scheduling (CFS). This discussion will segue into a deeper discussion on
real-time scheduling algorithms where concrete guarantees can be made about
schedulability and tasks getting a specific pre-specified amount of CPU time.
In the context of real-time systems, another important family of algorithms
deals with locking and acquiring resources exclusively. It is possible that a low-
priority process may hold a resource for a long time while a high-priority process
is waiting for it. This is known as priority inversion, which needs to be avoided.
We shall study a plethora of mechanisms to avoid this and other problems in
the domain of real-time scheduling and synchronization.
Chapter 6 discusses the design of the memory system in the kernel. We shall
start with extending the concepts that we studied in Chapter 2 (architecture
fundamentals). The role of the page table, TLB, address spaces, pages and folios
will be made clear. For a course on operating systems, understanding these
Basically, the CPU and the devices are being virtualized here. As of today,
virtualization and its lightweight version namely containers are the most popular
technologies in the cloud computing ecosystem. Some popular virtualization
software packages are VMware vSphere®, Oracle VirtualBox® and XenServer®. They
are also known as hypervisors. Linux has a built-in hypervisor known as Linux
KVM (kernel virtual machine). We will study more about them in this chapter.
We will also look at lightweight virtualization techniques using containers that
virtualize processes, users, the network, file systems, configurations and devices.
Docker® and Podman® are important technologies in this space. In the last
part of this chapter we shall look at specific mechanisms for virtualizing the I/O
system and file systems, and finally conclude.
Exercises
Ex. 1 — What are the roles and functions of a modern operating system?
Ex. 2 — Is a system call like a regular function call? Why or why not?
Ex. 3 — Why is the drivers directory the largest directory in the kernel’s code
base?
Ex. 4 — What are the advantages of having a single arch directory that stores
all the architecture-specific code? Does it make writing the rest of the kernel
easier?
Ex. 5 — Write a report about all the open-source operating systems in use
today. Trace their evolution.
(Figure: the memory hierarchy – core, caches and main memory)
The point to keep in mind here is that it is only the main memory – DRAM memory located outside
the chip – that is visible to software, notably the OS. The rest of the smaller
memory elements within the chip, such as the L1, L2 and L3 caches, are normally
not visible to the OS. Some ISAs have specialized instructions that can flush
certain levels of the cache hierarchy either fully or partially. Sometimes even
user applications can use these instructions. However, this is the only notable
exception. Otherwise, we can safely assume that almost all software, including
privileged software like the operating system, is unaware of the caches. Let us
live with the assumption that the highest level of memory that an OS can see
or access is the main memory.
Let us define the term memory space as the set of all addressable memory
locations. A software program, including the OS, perceives this memory space to
be one large array of bytes. Any location in this space can be accessed and
modified at will. Later on, when we discuss virtual memory, we
will refine this abstraction.
address is as follows. The base address A is stored in the %esp register and the
offset is 4. The memory address is equal to (A + 4). Given the speed and ease of
access, registers are ubiquitous. They are additionally used to access privileged
locations and I/O addresses, as we shall see later.
Next, let us differentiate between CISC and RISC processors. RISC stands
for “Reduced Instruction Set Computer”. A lot of the modern ISAs such as
ARM and RISC-V are RISC instruction sets, which are regular and simple.
RISC ISAs and processors tend to use registers much more than their CISC
(complex instruction set) counterparts. CISC instructions can have long
immediates (constants) and may also use more than one memory operand. The
instruction set used by Intel and AMD processors, x86, is a CISC ISA. Regard-
less of the type of the ISA, registers are central to the operation of any program
(be it RISC or CISC). The compiler needs to manage them efficiently.
2.1.3 Registers
General Purpose Registers
Let us look at the space of registers in some more detail. All the registers that
regular programs use are known as general purpose registers. They are visible
to all software including the compiler. Note that almost all the programs that
are compiled today use registers and the author is not aware of any compilation
model or any architectural model that does not rely on registers.
Privileged Registers
A core also has a set of registers known as privileged registers, which only the OS
or software with similar privileges can access. In Chapter 8, we shall look at hy-
pervisors or virtual machine managers (VMMs) that run with OS privileges. All
such software are known as system software or privileged mode software. They
are given special treatment by the CPU – they can access privileged registers.
For instance, an ALU has a flags register that stores its state, especially the
state of instructions that have executed in the past such as comparison instruc-
tions. Often these flags registers are not fully visible to regular application-level
software. However, they are visible to the OS and anything else that runs with
OS privileges such as VMMs. It is necessary to have full access to these registers
to enable multitasking: run multiple programs on a core one after the other.
We also have control registers that can enable or disable specific hardware
features such as the fan, LED lights on the chassis and can even turn off the
system itself. We do not want all the instructions that change the values stored
in these registers to be visible to regular programs because then a single appli-
cation can create havoc. Hence, we entrust only a specific set of programs (OS
and VMM) with access rights to these registers.
Then, there are debug registers that are meant to debug hardware and sys-
tem software. Given the fact that they are privy to additional information and
can be used to extract information out of running programs, we do not allow
regular programs to access these registers. Otherwise, there will be serious se-
curity violations. However, from a system designer’s point of view or from the
OS’s point of view these registers are very important. This is because they
give us an insight into how the system is operating before and after an error
is detected – this information can potentially allow us to find the root cause of
bugs.
Finally, we have I/O registers that are used to communicate with externally
placed I/O devices such as the monitor, printer and network card. Here again,
we need privileged access. Otherwise, we can have serious security violations,
and different applications may try to monopolize an I/O resource. They may
not allow other applications to access them. Hence, the OS needs to act as a
broker. Its job is to manage, restrict and regulate accesses.
Given the fact that we have discussed so much about privileged registers, let
us see how the notion of privileges is implemented. Note that we need to ensure
that only the OS and related system software such as the VMM can have access
to privileged resources such as the privileged registers.
x86 processors define four privilege rings – Ring 0 (OS) to Ring 3 (application) (refer to Figure 2.2). The primary
role of Rings 1 and 2 is to run guest operating systems and other software that
do not require as much privileged access as the software running in Ring 0.
Nevertheless, they enjoy more privileges than regular application code. They
are typically used while running guest OSes on virtual machines.
draw the attention of the CPU such that it can process the interrupt. This
would entail stopping the execution of the currently executing program
and jumping to a memory location that contains the code of the interrupt
handler (a specialized routine in the OS that handles the interrupt).
System Call If an application needs some service from the OS such as creating
a file or sending a network packet, then it cannot use the conventional
mechanism, which is to make a function call. OS functions cannot directly
be invoked by the application. Hence, there is a need to generate a dummy
interrupt such that the same set of actions that takes place when an external
interrupt is received can happen here as well. In this case, a specialized system
call handler takes over and satisfies the request made by the application.
Signal A system call is a message that is sent from the application to the OS.
A signal is the reverse. It is a message that is sent from the OS to the
application. An example of this would be a key press. In this case, an
interrupt is generated, which is processed by the OS. The OS reads the
key that was pressed, and then figures out the process that is running
in the foreground. The value of this key needs to be communicated to
this process. The signal mechanism is the method that is used. In this
case, a function registered by the process with the OS to handle a “key
press” event is invoked. The running application process then gets to know
that a certain key was pressed and depending upon its logic, appropriate
action is taken. A signal is basically a callback function that an application
registers with the OS. When an event of interest happens (pertaining to
that signal), the OS calls the callback function in the application context.
This callback function is known as the signal handler.
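As a concrete illustration of the mechanism just described, the sketch below registers a handler for SIGINT (sent, for example, when the user presses Ctrl-C) using the standard sigaction() interface; it is a minimal user-space example, not the kernel-side delivery code.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* The callback (signal handler) that the OS invokes in the application's
 * context when the signal is delivered. */
static void on_sigint(int signum)
{
    (void)signum;
    /* Only async-signal-safe functions should be called here; write() is safe. */
    const char msg[] = "Caught SIGINT\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}

int main(void)
{
    struct sigaction sa = { 0 };
    sa.sa_handler = on_sigint;      /* register the callback with the OS */
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGINT, &sa, NULL) != 0) {
        perror("sigaction");
        return 1;
    }

    printf("Press Ctrl-C to deliver SIGINT...\n");
    for (;;)
        pause();                    /* sleep until a signal arrives */
}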
As we can see, communicating with the OS does require some novel and un-
conventional mechanisms. Traditional methods of communication that include
writing to shared memory or invoking functions are not used because the OS
runs in a separate address space and also switching to the OS is an onerous
activity. It also involves a change in the privilege level and a fair amount of
bookkeeping is required at both the hardware and software levels, as we shall
see in subsequent chapters.
system calls. Library calls almost never change their signature because they are
designed to be very flexible. Flexibility is not a feature of system calls because
parameter passing is complicated. Consequently, library calls remain portable
across versions of the same operating system and also across different variants
of an operating system such as the different distributions of Linux.
In the header file /usr/include/asm/unistd_64.h, 286 system calls are
defined. The standard way to make a system call is as follows.

mov $<sys_call_number>, %rax
syscall
As we can see, all that we need to do is load the number of the system call
into the rax register. The syscall instruction subsequently
does the rest. We generate a dummy interrupt, store some data corresponding
to the state of the executing program (for more details, refer to [Sarangi, 2021])
and load the appropriate system call handler. An older approach is to directly
generate an interrupt using the instruction int 0x80. Here, the code 0x80
stands for a system call. However, as of today, this method is not used for x86
processors.
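For illustration, the user-space sketch below issues the write system call through the C library's generic syscall() wrapper; on x86-64, the wrapper ends up loading the number into rax and executing the syscall instruction, exactly as in the snippet above.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    const char msg[] = "hello via syscall\n";

    /* SYS_write is the system call number taken from the kernel headers
     * (1 on x86-64). The arguments are: file descriptor, buffer, length. */
    syscall(SYS_write, 1, msg, sizeof(msg) - 1);
    return 0;
}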
The state of the running program is known as its context. Whenever we have
an interrupt, exception or a system call, there is a need to store the context,
jump to the respective handler, finish some additional work in the kernel (if
there is any), restore the context and start the original program at exactly the
same point. The caveat is that all of these actions need to happen without the
explicit knowledge of the program that was interrupted. Its execution should
be identical to a situation where it was not interrupted by an external event.
Of course, if the execution has led to an exception or system call, then the
corresponding event/request will be handled. In any case, we need to return to
exactly the same point at which the context was switched.
(Figure 2.3: the context of a program – the register state: general purpose registers, flags and special registers; the memory; and the PC)
Figure 2.3 shows an overview of the process of storing the context of a running
program. The state of the running program comprises the contents of the
general purpose registers, contents of the flags and special purpose registers,
the memory and the PC (program counter). Towards the end of this chapter,
we shall see that the virtual memory mechanism stores the memory state very
effectively. Hence, we need not bother about storing and restoring the mem-
ory state because there is already a mechanism, namely virtual memory, that
takes care of it completely. Insofar as the remaining three elements are concerned,
we can think of all of them as the volatile state of the program that
is erased when there is a context switch. As a result, a hardware mechanism is
needed to read all of them and store them in memory locations that are known
a priori. We shall see that there are many ways of doing this and there are
specialized/privileged instructions that are used.
For more details about what exactly the hardware needs to do, readers can
refer to the computer architecture text by your author [Sarangi, 2021]. In the
example pipeline in the reference, the reader will appreciate the need for having
specialized hardware instructions for automatically storing the PC, the flags and
special registers, and possibly the stack pointer in either privileged registers or
a dedicated memory region. Regardless of the mechanism, we have a known
location where the volatile state of the program is stored, and it can later on
be retrieved by the interrupt handler. Note that for the sake of readability, we
will use the term interrupt handler to refer to a traditional interrupt handler as
well as exception handlers and system call handlers wherever this is clear from
the context.
Subsequently, the first task of the interrupt handler is to retrieve the program
state of the executing program – either from specialized registers or a dedicated
memory area. Note that these temporary locations may not store the entire
state of the program, for instance they may not store the values of all the
general purpose registers. The interrupt handler will thus have to be more
work and retrieve the full program state. In any case, the role of the interrupt
handler is to collect the full state of the executing program and ultimately store
it somewhere in memory, from where it can easily be retrieved later.
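To give a flavor of what this stored state looks like, here is a simplified, hypothetical structure for a saved context (the kernel's real x86-64 structure, struct pt_regs, is laid out to match what the hardware and the entry code push onto the kernel stack and contains more fields).

#include <stdint.h>

/* A simplified, illustrative snapshot of a task's volatile state. The field
 * names are chosen for readability; this is not the kernel's actual layout. */
struct saved_context {
    uint64_t gpr[16];  /* general purpose registers (rax, rbx, ..., r15) */
    uint64_t rip;      /* program counter at the point of interruption   */
    uint64_t rflags;   /* flags register (condition codes, etc.)         */
    uint64_t rsp;      /* user stack pointer                             */
    uint64_t cs, ss;   /* code and stack segment selectors               */
};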
Restoring the context of a program is quite straightforward. We need to
follow the reverse sequence of steps.
The life cycle of a process can thus be visualized as shown in Figure 2.4. The
application program executes, it is interrupted for a certain duration after the
OS takes over, then the application program is resumed at the point at which
it was interrupted. Here, the word “interrupted” needs to be understood in a
very general sense. It could be a hardware interrupt, a software interrupt like a
system call or an exception.
Figure 2.4: The life cycle of a process (active and interrupted phases)
Timer Interrupts
There is an important question to think about here. What if there is a program
that does not see any interrupts and there are no system calls or exceptions?
This means that the OS will never get executed if all the cores are occupied
by different instances of such programs. Kindly note that the operating system
is never executing in the background (as one would naively want to believe)
– it is a separate program that needs to be invoked by a very special method,
namely a system call, an exception or an interrupt. Let us refer to system calls,
exceptions and interrupts as events of interest. The OS cannot come into the picture
(execute on a core) any other way. Now, we are looking at a very peculiar
situation where all the cores are occupied with programs that do none of the
above. There are no events of interest. The key question that we need to answer
is whether the system becomes unresponsive and if these programs decide to run
for a long time, is rebooting the system the only option?
Question 2.1.1
(Figure: a timer chip periodically sends timer interrupts to the CPU's cores)
The key insight is that timer interrupts are needed to ensure that the system
remains responsive and that the OS code executes periodically.
over the processes that run on cores, the memory, storage devices and I/O
systems. Hence, it needs to run periodically such that it can effectively manage
the system and provide a good quality of experience to users.
We divide time into jiffies, where there is a timer interrupt at the end of
every jiffy. The number of jiffies (jiffy count) is incremented by one when a
timer interrupt is received. The duration of a jiffy has been reducing over the
course of time. It used to be 10 ms in the Linux kernel around a decade ago
and as of 2023, it is 1 ms. It can be controlled by the compile-time parameter
HZ. If HZ=1000, it means that the duration of a jiffy is 1 ms. We do not want
a jiffy to be too long, otherwise the system will take a fair amount of time to
respond. Simultaneously, we also do not want it to be too short, otherwise a lot
of time will be spent in servicing timer interrupts.
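A small sketch of the arithmetic, with HZ as an ordinary compile-time constant (inside the kernel one would use the global jiffies counter and helpers such as jiffies_to_msecs(); the names below are just for illustration).

#include <stdint.h>
#include <stdio.h>

#define HZ 1000   /* timer interrupts per second: 1 jiffy = 1/HZ s = 1 ms here */

/* Convert an elapsed number of jiffies into milliseconds. */
static uint64_t jiffies_to_ms(uint64_t njiffies)
{
    return njiffies * 1000ULL / HZ;
}

int main(void)
{
    /* With HZ = 1000, 2500 timer ticks correspond to 2500 ms. */
    printf("%llu ms\n", (unsigned long long)jiffies_to_ms(2500));
    return 0;
}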
Inter-processor interrupts
As we have seen, the OS gets invoked on one core and now its job is to take
control of the system and basically manage everything including running pro-
cesses, waiting processes, cores, devices and memory. Often there is a need to
ascertain if a process has been running for a long time or not and whether it
needs to be swapped out or not. If there is a need to swap it out, then the OS
always chooses the most eligible process (using its scheduler) and runs it on a
core.
If the new process runs on the core on which the OS is executing, then it is
simple. All that needs to be done is that the OS needs to load the context of
the process that it wants to run. However, if a process on some other core needs
to be swapped out, and it needs to be replaced with the chosen process, then
the procedure is more elaborate. It is necessary to send an interrupt to that core
Note that a program is compiled only once on the developers’ machines and
then distributed to the world. If a million copies are running, then we can
rest assured that they are running on a very large number of heterogeneous
devices. These devices can be very different from each other. Of course, they
will have to share the same ISA, but they can have radically different main
memory sizes and even cache sizes. Unless we assume that all the 2^n addresses
are accessible to a program, no other assumption can be made. This may sound
impractical on 64-bit machines, but this is the most elegant assumption that
can be made. Of course, how we design a memory system whose size is much
smaller than 2^64 bytes remains a problem to be solved.
What if we made another assumption? If we assumed that the program can
access 2 GB at will, it would not run on a system with 1 GB of memory, unless
we find a mechanism to make this possible. If we can find such a mechanism, then
we can always scale the system to assume that the address space is 2^32 or 2^64 bytes
and still manage to run on physical systems with far lower memory. We thus
have a compatibility problem here, where we want our program to assume that
addresses are n bits wide (typically 32 or 64 bits), yet run on machines with all
memory sizes (typically much lower than the theoretical maximum).
Processes assume that they can access any byte in large memory regions
of size 2^32 or 2^64 bytes at will (for 32-bit and 64-bit systems, respectively).
Even if processes are actually accessing very little data, there is a need
to create a mechanism to run them on physical machines with far lower
memory (let’s say a few GBs). The fact is that the addresses they assume
are not compatible with physical addresses (on real machines). This is
the compatibility problem.
that programs and compilers remain simple and assume that the entire memory
space is theirs. This is a very convenient abstraction. However, on a real system,
we also want different processes to access a different set of addresses such that
there is no overlap between the sets. This is known as the overlap problem.
Unless adequate steps are taken, it is possible for two processes to access
overlapping regions of memory, and also it is possible to get unauthorized
access to other processes’ data by simply reading values that they write
to memory. This is known as the overlap problem.
What we can see is that the memory map is partitioned into distinct zones.
The memory map starts from address zero. Then after a fixed offset, the text
section starts, which contains all the program’s instructions. The processor
starts executing the first instruction at the beginning of the text section and
then starts fetching subsequent instructions as per the logic of the program.
Once the text section ends, the data section begins. It stores initialized data
that comprises global and static variables that are typically defined outside the
scope of functions. After this, we have the bss (block starting symbol) section,
which stores the same kind of variables; however, they are uninitialized. Note
that each of these sections in the memory map is basically a range of memory
addresses and this range varies from process to process. It is possible that one
process has a very small data section and another process has a very large data
section – it all depends upon how the program is written.
Then we have the heap and the stack. The heap is a memory region that
stores dynamically allocated variables and data structures, which are typically
allocated using the malloc call in C and the new call in C++ and Java. Tra-
ditionally, the heap section has grown upwards (towards increasing addresses).
As and when we allocate new data, the heap size increases. It is also possible
for the heap size to decrease as we free or dynamically delete allocated data
structures. Then there is a massive hole, which basically means that there is
a very large memory region that doesn’t store anything. Particularly, in 64-bit
machines, this region is indeed extremely large.
Next, at a very high memory location (0xC0000000 in 32-bit Linux), the
stack starts. The stack typically grows downwards (grows towards decreasing
addresses). Given the fact that there is a huge gap between the end of the heap
and the top of the stack, both of them can grow to be very large. If we consider the
value 0xC0000000, it is actually 3 GB. This basically means that on a 32-bit
system, an application is given 3 GB of memory at most. This is why the stack
section starts at this point. Of course, one can argue that if the size of the stack,
heap and other sections combined exceeds 3 GB, we shall run out of space. This
indeed can happen and that is why we typically use a 64-bit machine where the
likelihood of this happening is very low because our programs are not that large
at the moment.
The last unanswered question is what happens to the one GB that is remaining
(recall 2^32 bytes = 4 GB)? This is a region that is typically assigned
to the operating system kernel for storing all of its runtime state. As we shall
see in later chapters, there is a need to split the address space between user
applications and the kernel.
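The layout can be observed from an ordinary user program. In the sketch below, the printed addresses typically fall in the text, data, bss, heap and stack regions, respectively; the exact values vary across systems and across runs (for example, due to address space randomization).

#include <stdio.h>
#include <stdlib.h>

int initialized_global = 42;   /* data section (initialized globals)   */
int uninitialized_global;      /* bss section (uninitialized globals)  */

int main(void)
{
    int local_variable = 0;                 /* stack                   */
    int *dynamic = malloc(sizeof(int));     /* heap                    */

    printf("text (code): %p\n", (void *)main);
    printf("data       : %p\n", (void *)&initialized_global);
    printf("bss        : %p\n", (void *)&uninitialized_global);
    printf("heap       : %p\n", (void *)dynamic);
    printf("stack      : %p\n", (void *)&local_variable);

    free(dynamic);
    return 0;
}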
Now, the interesting thing is that all processes share the same structure of
the memory map. This means that the chances of them destructively interfering
with each other are even higher because most variables will have similar addresses:
they will be stored in roughly the same region of the memory map. Even if two
processes are absolutely innocuous (harmless), they may still end up corrupting
each other’s state, which is definitely not allowed. As a result, ensuring a degree
of separation is essential. Another point that needs to be mentioned with regard
to the kernel memory is that it is an invariant across process memory maps. It
is something that pretty much remains constant and in the case of a 32-bit
system, occupies the top one GB of the memory map of every process. In a
certain sense, processes assume that their range of operation is the first 3 GB
and the top one GB is beyond their jurisdiction.
The advantage of having a fixed memory map structure is that it is very easy
to generate code, binaries can have a fixed format that is correlated with
the memory map, and operating systems know how to lay out code and data in
memory. Regardless of the elegance, simplicity and standardization, we need to
solve the overlap problem. Having a standard memory map structure makes this
problem worse because now regardless of the process, the variables are stored
in roughly the same set of addresses. Therefore, the chances of destructive
interference become very high. Additionally, this problem creates a security
nightmare.
or even access each other’s memory regions. Clearly, the simplest solution is to
somehow restrict the memory regions that a process can access.
Let us look at a simple implementation of this idea. Assume that we have two
registers associated with each process: base and limit. The base register stores
the first address that is assigned to a process and the limit register stores the last
address. Between base and limit, the process can access every memory address.
In this case, we are constraining the addresses that a process can access and via
this we are ensuring that no overlap is possible. We observe that the value of the
base register need not be known to the programmer or the compiler. All that
needs to be specified is the difference between the limit and base (maximum
number of bytes a process can access).
The first step is to find a free memory region when a process is loaded. Its
size needs to be more than the maximum size specified by the process. The
starting address of this region is set as the contents of the base register. An
address computed by the CPU is basically an offset that is added to the contents of the
base register. The moment the CPU sends an address to the memory system,
it is translated using the base register of the currently running process.
Note that the contents of the base registers vary depending upon the process. In
this system, if the process accesses an address that is beyond the limit register,
then a fault is generated. A graphical description of the system is shown in
Figure 2.8.
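A minimal sketch of the check just described, assuming the hardware holds a base register (first address) and a limit register (last address) for the running process; every CPU-generated offset is relocated by adding the base, and anything beyond the limit raises a fault.

#include <stdbool.h>
#include <stdint.h>

struct base_limit {
    uint64_t base;    /* first physical address assigned to the process */
    uint64_t limit;   /* last physical address assigned to the process  */
};

/* Translate a CPU-generated offset; returns false to indicate a fault. */
static bool translate(const struct base_limit *bl, uint64_t offset,
                      uint64_t *phys_addr)
{
    uint64_t addr = bl->base + offset;   /* relocate the offset */
    if (addr > bl->limit)
        return false;                    /* beyond the limit: raise a fault */
    *phys_addr = addr;
    return true;
}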
We can clearly see that there are many processes, and they have their mem-
ory regions clearly demarcated. Therefore, there is no chance of an overlap.
This idea does seem encouraging, but this is not going to work in practice for
a combination of several reasons. The biggest problem is that neither the programmer
nor the compiler knows for sure how much memory a program requires
at run time. This is because for large programs, the user inputs are not known,
and thus the total memory footprint is not predictable. Even if it is predictable,
we will have to budget for a very large footprint (conservative maximum). In
most cases, this conservative estimate is going to be much larger than the mem-
ory footprints we may see in practice. We may thus end up wasting a lot of
memory. Hence, in the memory region that is allocated to a process between
the base and limit registers, there is a possibility of a lot of memory getting
wasted. This is known as internal fragmentation.
Let us again take a deeper look at Figure 2.8. We see that there are holes or
unallocated memory regions between allocated memory regions. Whenever we
want to allocate memory for a new process, we need to find a hole that is larger
than what we need and then split it into an allocated region and a smaller hole.
Very soon we will have many of these holes in the memory space, which cannot
be used for allocating memory to any other process. It may be the case that
we have enough memory available, but it is just that it is partitioned among so
many processes that we do not have a contiguous region that is large enough.
This situation where a lot of memory is wasted in such holes is known as external
fragmentation. Of course, there are many ways of solving this problem. Some
may argue that periodically we can compact the memory space by reading data
and transferring them to a new region by updating the base and limit registers
for each process. In this case, we can essentially merge holes and create enough
space by creating one large hole. Of course, the problem is that a lot of reads
and writes will be involved in this process and during that time the process
needs to remain mostly stalled.
Another problem is that the prediction of the maximum memory usage may
be wrong. A process may try to access memory that is beyond the limit register.
As we have argued, in this case a fault is generated. However, this can be avoided
if we allocate another memory region and link the second memory region to the
first (using a linked list like structure). The algorithm now is that we first
access the memory region that is allocated to the process, and if the offset is
beyond the limit register, then we access a second memory region. The
second memory region will also have its own base and limit registers. We can
extend this idea and create a linked list of such memory regions. We can also
save time by having a lookup table. It will not be necessary to traverse linked
lists. Given an address, we can quickly figure out in which memory region it
lies. Many of the early approaches focused on such techniques, and they
grew to become very complex, but soon the community realized that this is not
a scalable solution, and it is definitely not elegant.
address to a physical address such that we can access memory and solve the
overlap problem, as well as the compatibility problem.
A few ideas emerge from this discussion. Given a virtual address, there
should be some sort of table that we can look up, and find the physical address
that it maps to. Clearly, one virtual address will always be mapped to one
physical address. This is a common sense requirement. However, if we can
also ensure that every physical address maps to one virtual address, or in other
words there is a strict one-to-one mapping, then we observe that no overlaps
between processes are possible. Regardless of how hard a process tries, it will
not be able to access or overwrite the data that belongs to any other process
in memory. In this case we are using the term data in the general sense – it
encompasses both code and data. Recall that in the memory system, code is
actually stored as data.
The crux of the entire definition of virtual memory (see Definition 2.2.5) is
that we have a mapping table that maps each virtual address (that is used by
the program) to a physical address. If the mapping satisfies some conditions,
then we can solve all three problems. So the main technical challenge in
front of us is to properly and efficiently create the mapping table to implement
an address translation system.
Figure 2.9: Conceptual overview of the virtual memory-based page mapping system (each process's virtual pages are mapped to physical frames)
32-bit memory system, we thus need 20 bits to specify a page address. We will
thus have 2^20 or roughly a million pages in the system. For each page, we need
to store a 20-bit physical frame address. The total storage overhead is thus 20
bits (2.5 bytes) multiplied by one million, which turns out to be 2.5 MB.
This is the storage overhead per process, because every process needs its own
page table. Now assume that we have 100 processes in the system; we therefore
need 250 MB just to store page tables!
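Spelling out the arithmetic (assuming 4 KB pages and 1 MB = 2^20 bytes):

number of pages = 2^32 bytes / 2^12 bytes per page = 2^20 pages
per-process page table = 2^20 entries × 2.5 bytes per entry = 2.5 MB
100 processes ⇒ 100 × 2.5 MB = 250 MB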
This is a lot, and it represents a tremendous wastage of physical memory
space. If we think about it, we shall observe that most of the virtual address
space is actually not used. In fact, it is quite sparse particularly between the
stack and the heap, which can actually be quite large. This problem is still
manageable for 32-bit memory systems, especially if we don’t have a lot of con-
currently running processes. However, if we consider a 64-bit memory system,
then the page table storage overhead is prohibitively large and clearly this idea
will not work. Hence, we need a far more efficient way of storing our mappings.
We need to take a serious look at the memory map of a process and understand the
structure of the sparsity to design a better page table (refer to Section 2.2.1).
Begin by noting that a very small part of the virtual address space is actually
populated. The beginning of the virtual address space is populated with the text,
data, bss and heap sections. Then there is a massive gap. Finally, the stack
is situated at the highest end of the allowed virtual memory addresses. There
is nothing in between. Later on, we will see that other memory regions such as
memory-mapped files can occupy a part of this region. But we will still have
large gaps and thus there will be a significant amount of sparsity. This insight
can be used to design a multilevel page table, which can leverage this
pattern.
The design of a multilevel page table is shown in Figure 2.10. It shows an
example address translation system for a 64-bit machine. We typically observe
that we don't need that large a virtual address space. 2^64 bytes is more than a billion
gigabytes, and no practical system (as of today) will ever have so much
memory. Hence, most practical systems, as of 2023, use a 48-bit virtual address.
That is sufficient. The top 16 (MSB) bits are assumed to be zero.
Figure 2.10: A four-level page table for a 48-bit virtual address. Bits 48-40, 39-31, 30-22 and 21-13 index levels 1 to 4 of the page table (the walk starts from the CR3 register), the bottom 12 bits are the intra-page offset, the top 16 bits of the VA are assumed to be zero, and the output is a 52-bit frame address.
We can always break this assumption and have more levels in a multilevel page table.
This is seldom required. Let us thus proceed assuming a 48-bit virtual address.
We however assume a full 64-bit physical address in our examples. Note that
the physical address can be as wide as possible because we are just storing a
few additional bits per entry – we are not adding new levels in the page table.
Given that 12 bits are needed to address a byte in a 4 KB page, we are left with
52 bits. Hence, a physical frame number is specified using 52 bits. Figure 2.11
shows the memory map of a process assuming that the lower 48 bits of a memory
address are used to specify the virtual memory address.
Figure 2.11: The memory map of a process with a 48-bit virtual address (the stack sits near the top of the address space, close to 2^48 - 1)
In our 48-bit virtual address, we use the bottom 12 bits to specify the address
of the byte within the 4 KB page. Recall that 212 bytes = 4 KB. We are left
with 36 bits. We partition them into four blocks of 9 bits each. If we count from
1, then these are bit positions 40-48, 31-39, 22-30 and 13-21. Let us consider the
topmost level, i.e., the top 9 bits (bits 40-48). We expect the least amount of
randomness in these bits. The reason is obvious. In any system with temporal
and spatial locality, we expect most addresses to be close by. They may vary
in their lower bits, however, in all likelihood their more significant bits will be
level than at the cache line level. As a result, we do not need to cache a lot of
entries. In fact, it has been shown that caching 64 to 128 entries is good enough.
Hence, almost all processors have a small hardware cache called a TLB
(Translation Lookaside Buffer). It caches 64 to 128 translation entries. It can
also have two levels (L1 TLB and L2 TLB) and cache more entries (roughly
a thousand). We can also have two different TLBs per core: one for instructions
and one for data. Given that the L1 TLB is quite small, it can be accessed
very quickly – typically in less than a cycle. In a large out-of-order pipeline
in a modern core, this small latency is hardly perceptible and as a result, the
translation basically comes for free.
In the rare case when there is a miss in the TLB, it is necessary to
access the page table, which is a slow process. It can take hundreds to thousands
of cycles to access the page table. Note that the translated address in this case
is a frame address, which can then be directly read from the memory system
(caches and main memory).
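As a sketch of the splitting scheme described earlier, the code below pulls the four 9-bit page table indices and the 12-bit intra-page offset out of a 48-bit virtual address (bits are numbered from 0 here, so the fields are 47-39, 38-30, 29-21, 20-12 and 11-0).

#include <stdint.h>
#include <stdio.h>

/* The four 9-bit page table indices and the 12-bit offset of a 48-bit VA. */
struct va_parts {
    unsigned level1, level2, level3, level4;  /* 9 bits each */
    unsigned offset;                          /* 12 bits     */
};

static struct va_parts split_va(uint64_t va)
{
    struct va_parts p;
    p.offset = va & 0xFFF;            /* bits 11..0  : byte within the page */
    p.level4 = (va >> 12) & 0x1FF;    /* bits 20..12 : level 4 index        */
    p.level3 = (va >> 21) & 0x1FF;    /* bits 29..21 : level 3 index        */
    p.level2 = (va >> 30) & 0x1FF;    /* bits 38..30 : level 2 index        */
    p.level1 = (va >> 39) & 0x1FF;    /* bits 47..39 : level 1 index        */
    return p;
}

int main(void)
{
    struct va_parts p = split_va(0x00007fffdeadbeefULL);
    printf("L1=%u L2=%u L3=%u L4=%u offset=0x%x\n",
           p.level1, p.level2, p.level3, p.level4, p.offset);
    return 0;
}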
network. Hence, a page table entry can additionally store the location of the
frame and the device that contains it. The device itself can have a complex
description that could be a combination of an IP address and a device id. All
of this information is stored in the page table entry. However, it is not there in
the TLB entry because in this case we assume that the frame is there in main
memory. Let us look at some of the other fields that are there in a page table
entry, notably the permission bits.
program ran, the base register for the code section was the code segment register
(cs register). Similarly, for data and stack variables, the data and stack segment
registers were the base registers, respectively. In this case, different physical
addresses were computed based on the contents of the segment registers. The
physical addresses sometimes mapped to different physical regions of memory
devices or sometimes even different devices.
Given that base-limit addressing has now become obsolete, segment registers
have lost their original utility. However, there are new uses. The first and
foremost is security. We can for instance prohibit any data access from accessing
code page. This is easy to do. We have seen that in the memory map the lower
addresses are code addresses and once the end, the data region begins. These
sections store read-only constants and values stored on the heap. Any data
address is a positive offset from the location stored in the data segment register.
This means that a data page address will always be greater than any code page
address, and thus it is not possible for any data access to modify the regions of
memory that store instructions. Most malwares try to access the code section
and change instructions such that they can hijack the program and make it do
what they want. Segmentation is an easy way of preventing such attacks. In
many other attacks, it is assumed that the addresses of variables in the virtual
address space are known. For example, the set of attacks that try to modify
return addresses stored on the stack need to know the memory address at which
the return address is stored. Using the stack segment register it is possible to
obfuscate these addresses and confuse the attacker. In every run, the operating
system can randomly set the contents of the stack segment register. This is
known as stack obfuscation. The attacker will thus not be able to guess what is
stored in a given address on the stack – it will change in every run. Note that
program correctness will not be hampered because a program is compiled in a
manner where it is assumed that all addresses will be computed as offsets from
the contents of their respective segment registers.
There are some other ingenious uses as well. It is possible to define small
memory regions that are private to each core (per-core regions). The idea here
is that it is often necessary to store some information in a memory region that
is private to each core and should not be accessed by processes running on other
cores, notably kernel processes. These regions are accessed by kernel threads
mainly for the purposes of memory management and scheduling. An efficient
way of implementing this is by associating an unused segment register with such
a region. All accesses to this region can then use this segment register as the
base address. The addresses (read offsets) can be simple and intuitive such as
0, 4, 8, 12, etc. In practice, these offsets will be added to the contents of the
segment register and the result will be a full 64-bit virtual address.
(Figure: the x86 segment registers – cs, ds, ss, es, fs and gs)
Figure 2.13 shows the way that segmented memory is addressed. The phi-
losophy is broadly similar to the paging system for virtual memory where the
insight is to leverage temporal and spatial locality as much as possible. There
is a segment descriptor cache (SDC) that caches the ids of the segment registers
for the current process. Most of the time, the values of the segment registers
will be found in the SDC. Segment register values get updated far more infre-
quently as compared to TLB values. Hence, we expect hits here almost all the
time other than some rare instances when a process is loaded either for the first
time or after a context switch.
Similar to a TLB miss, if there is a miss in the SDC, then there is a need to
search in a larger structure for a given segment register belonging to a process.
In older days there used to be an LDT (local descriptor table) and a global
descriptor table (GDT). We could think of the LDT as the L1 level and the
GDT as the L2 level. However, nowadays the LDT is mostly not used. If there
is a miss in the SDC, then a dedicated piece of hardware searches for the value in
the GDT, which is a hardware structure. Of course, it also has a finite capacity,
and if there is a miss there then an interrupt is raised. The operating system
needs to populate the GDT with the correct value. It maintains all the segment
register related information in a dedicated data structure.
[Figure 2.13: segmented memory addressing. The segment register is looked up in the SDC (backed by the GDT), and its base address is added to the virtual address.]
As seen in Figure 2.13, the base address stored in the relevant segment reg-
ister is added to the virtual address. This address further undergoes translation
to a physical address before it can be sent to the physical memory system.
2.3.1 Overview
[Figure 2.14: the Northbridge-Southbridge organization. The CPU connects to the GPU and main memory via the Northbridge, and to the PCI slots, keyboard, mouse, USB ports and other I/O chips via the Southbridge.]
Any processor chip has hundreds of pins. Complex designs have more than
a thousand pins. Most of them are there to supply current to the chip: power
and ground pins. The reason we need so many pins is that modern processors
draw a lot of current, and a single pin has a limited current delivery capacity.
However, a few hundred pins are typically left for communication with external
entities such as the memory chips, off-chip GPUs and I/O devices.
Memory chips have their dedicated memory controllers on-chip. These mem-
ory controllers are aware of the number of memory chips that are connected and
how to interact with them. This happens at the hardware level and the OS is
blissfully unaware of what goes on here. Depending on the motherboard, there
could be a dedicated connection to an off-chip GPU. An ultra-fast and high
bandwidth connection is required to a GPU that is housed separately on the
motherboard. Such buses (sets of copper wires) have their own controllers that
are typically on-chip.
Figure 2.14 shows a traditional design where the dedicated circuitry for com-
municating with the main memory modules and the GPU is combined and
added to the Northbridge chip. The Northbridge chip traditionally resided on
the motherboard (outside the processor chip). However, in most modern pro-
cessors, this logic has moved into the processor chip itself. It is much
faster for the cores and caches to communicate with an on-chip component.
Given that both the main memory and GPU have very high bandwidth require-
ments, this design decision makes sense. Alternative designs are also possible
where the Northbridge logic is split into two and is placed at different ends of the
chip. One part communicates with the GPU and the other part communicates
with the memory modules.
To communicate with other, slower I/O devices such as the keyboard, mouse
and hard disk, a dedicated controller chip called the Southbridge chip is used.
In most modern designs, this chip sits outside the processor – it is placed on the
motherboard. The simplest way is to have a Northbridge-Southbridge connection.
However, this is not mandatory. There could be a separate connection to the
Southbridge chip, and in high-performance implementations we can have the
Southbridge logic inside the CPU chip. Let us however stick to the simplistic
design shown in Figure 2.14.
The Southbridge chip is further connected to dedicated chips in the chipset
whose job is to route messages to the large number of I/O devices that are
present in a typical system. In fact, we can have a tree of such chips, where
messages are progressively routed to the I/O devices through the different levels
of the tree. For example, the Southbridge chip may send the messages to the
PCI-X chip, which subsequently sends them down the PCI-X buses to
the target I/O device. The Southbridge chip may also choose to send a message
to the USB ports, and a dedicated controller may then route the message to the
specific USB port that the message is meant to be sent to.
The question that we need to answer is how do we programmatically interact
with these I/O ports? It should be possible for assembly programs to read and
write from I/O ports easily. There are several methods in modern processors.
There is a trade-off between the ease of programming, latency and bandwidth.
The oldest method uses a separate space of numbered I/O ports: each port is
one byte wide, and a 4-byte access reads/writes four consecutive I/O ports. There
are I/O controller chips in the chipset such as the Northbridge and Southbridge
chips that know the locations of the I/O ports on the motherboard and can
route the traffic to/from the CPUs.
The device drivers incorporate assembly code that uses variants of the in
and out instructions to access the I/O ports corresponding to the devices. User-
level programs request I/O services from the operating system, i.e., they request
the OS to effect a read or write. The OS in turn passes on the request to the
device drivers, which use a series of I/O instructions to interact with the devices.
Once the read/write operation is done, the data read from the device and the
status of the operation are passed on to the program that requested the I/O
operation.
If we dive in further, we observe that an in instruction is a message that is
sent to the chip on the motherboard that is directly connected to the I/O device.
Its job is to further interpret this instruction and send device-level commands
to the device. It is the chip on the motherboard that knows which device-level
message needs to be sent. The OS need not concern itself with such low-level details.
For example, a small chip on the motherboard knows how to interact with USB
devices. It handles all the I/O. It just exposes a set of I/O ports to the CPU
that are accessible via in/out instructions. Similar is the case for out instructions,
where the device drivers simply write data to I/O ports. The corresponding
chip on the motherboard knows how to translate this to device-level commands.
Using I/O ports is the oldest method to realize I/O operations and has
been around for the last fifty years. It is however a very slow method and the
amount of data that can be transferred is very little. Also, for transferring a
small amount of data (1-4 bytes), there is a need to issue a new I/O instruction.
This method is alright for control messages but not for data messages in high
bandwidth devices like the network cards. There is a need for a faster method.
This is known as port-mapped I/O (PMIO).
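For concreteness, here is a minimal sketch of how a driver-level routine might wrap the in and out instructions for one-byte ports (C with GCC inline assembly on x86; the function names are illustrative):

static inline unsigned char read_port_byte(unsigned short port)
{
        unsigned char val;
        /* inb: read one byte from the given I/O port into register al */
        asm volatile("inb %1, %0" : "=a"(val) : "Nd"(port));
        return val;
}

static inline void write_port_byte(unsigned short port, unsigned char val)
{
        /* outb: write one byte from register al to the given I/O port */
        asm volatile("outb %0, %1" : : "a"(val), "Nd"(port));
}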
[Figure 2.15: memory-mapped I/O. Regions of the virtual address space are mapped to I/O device addresses/ports.]
The faster method is to directly map regions of the virtual address
space to an I/O device. Insofar as the OS is concerned, it makes regular reads
and writes. The TLB however stores an additional bit indicating that the page
is an I/O page. The hardware automatically translates memory requests to I/O
requests. There are several advantages of this scheme (refer to Figure 2.15).
The first is that we can send a large amount of data in one go. The x86
architecture has instructions that allow the programmer to move hundreds of
bytes between addresses in one go. These instructions can be used to trans-
fer hundreds of bytes or a few kilobytes to/from I/O space. The hardware
can then use fast mechanisms to ensure that this happens as soon as possible.
This would mean reading or writing a large amount of data from memory and
communicating with I/O devices.
On the processor side, we can clearly see the advantage. All that
we need is a few instructions to transfer a large amount of data. This reduces
the instruction processing overhead on the CPU and keeps the program simple.
I/O devices and chips in the chipset have also evolved to support
memory-mapped I/O. Along with their traditional port-based interface, they
also incorporate small memories that are accessible to other chips in the
chipset. The data that is in the process of being transferred to/from I/O devices
can be temporarily buffered in these small memories.
A combination of these technologies makes memory-mapped I/O very effi-
cient. Hence, it is very popular as of 2023. In many reference manuals, it is
conveniently referred to via its acronym MMIO.
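As a rough illustration of memory-mapped I/O (the address and the register layout are purely hypothetical, and in practice the region would be obtained by mapping the device's I/O range), a device register can be accessed like ordinary memory through a volatile pointer:

/* Hypothetical status register of a device mapped into the address space */
#define DEV_STATUS_REG ((volatile unsigned int *) 0xFEC00000u)

static unsigned int read_device_status(void)
{
        return *DEV_STATUS_REG;   /* an ordinary load that targets an I/O page */
}

static void clear_device_status(void)
{
        *DEV_STATUS_REG = 0;      /* an ordinary store that targets an I/O page */
}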
[Figure 2.16: DMA-based I/O. The CPU programs the DMA chip, which transfers data between memory and the I/O device and interrupts the CPU when the transfer is done.]
Even though memory-mapped I/O is much more efficient than the older
method that relied on primitive instructions and basic I/O ports, it turns out
that we can do far better. Even in the case of memory-mapped I/O, the proces-
sor needs to wait for the load-store instruction that is doing the I/O to finish.
Given that I/O operations take a lot of time, the entire pipeline will fill up and
the processor will remain stalled until the outstanding I/O operations complete.
Of course, one simple solution is that we do the memory mapped I/O operations
in smaller chunks; however, some part of this problem will still remain. We can
also remove write operations from the critical path and assume that they are
done asynchronously. Still the problem of slow reads will be there.
Our main objective here is that we would like to do other work while I/O
operations are in progress. We can extend the idea of asynchronous writes to
also have asynchronous reads. In this model, the processor does not wait for
the read or write operation to complete. The key idea is shown in Figure 2.16,
where there is a separate DMA (direct memory access) chip that effects the
transfers between the I/O device and memory. The CPU basically outsources
the I/O operation to the DMA chip. The chip is provided the addresses in
memory as well as the addresses on the I/O device along with the direction of
data transfer. Subsequently, the DMA chip initiates the process of data transfer.
In the meanwhile, the CPU can continue executing programs without stalling.
Once the DMA operation completes, it is necessary to let the OS know about
it.
Hence, the DMA chip issues an interrupt, the OS comes into play, and then
it realizes that the DMA operation has completed. Since user programs cannot
directly issue DMA requests, they instead just make system calls and let the
OS know about their intent to access an I/O device. This interface can be kept
simple primarily because it is only the OS’s device drivers that interact with
the DMA chip. When the interrupt arrives, the OS knows what to do with it
and how to signal the device drivers that the I/O operation is done, and they
can either read the data that has been fetched from an I/O device or assume
that the write has completed. In many cases, it is important to let the user
program also know that the I/O operation has completed. For example, when
the printer successfully finishes printing a page the icon changes from “printing
in progress” to “printing complete”.
Exercises
Ex. 1 — Why are there multiple rings in an x86 processor? Isn't having just
two rings enough?
Ex. 2 — How does a process know that it is time for another process to run
in a multitasking system? Explain the mechanism in detail.
Ex. 3 — Assume a 16-core system. There are 25 active threads that are purely
computational. They do not make system calls. The I/O activity in the system
is negligible. Answer the following questions:
a) How will the scheduler get invoked?
b) Assume that the scheduler has a special feature. Whenever it is invoked,
it will schedule a new thread on the core on which it was invoked and
replace the thread running on a different core with another active (ready
to run) thread. How do we achieve this? What kind of hardware support
is required?
Ex. 4 — What is the need for having privileged registers in a system? How
does Intel avoid them to a large extent?
Ex. 5 — How can we design a virtual memory system for a machine that does
not have any kind of storage device such as a hard disk attached to it? How do
we boot such a system?
Ex. 7 — Do the processor and compiler work with physical addresses or vir-
tual addresses?
Ex. 8 — How does the memory map of a process influence the design of a
page table for 64-bit systems?
storage area)?
Ex. 12 — When is it preferred to use an inverted page table over the tradi-
tional (tree-based) page table?
Ex. 13 — Why are the memory contents not a part of a process’s context?
Ex. 14 — Assume two processes access a file in read-only mode. They use
memory-mapped I/O. Is there a possibility of saving physical memory space
here?
Chapter 3
Processes
one place. The key components of the task struct data structure are shown
in Table 3.1. Linux internally refers to every process as a task.
Field                                     Description
struct thread_info thread_info            Low-level information
uint state                                Process state
void *stack                               Kernel stack
prio, static_prio, normal_prio            Priorities
struct sched_info sched_info              Scheduling information
struct mm_struct *mm, *active_mm          Pointer to memory information
pid_t pid                                 Process id
struct task_struct *parent                Parent process
struct list_head children, sibling        Child and sibling processes
Other fields                              File system, I/O, synchronization and debugging fields

Table 3.1: Key components of the task struct
For instance, we cannot assume that an integer is four bytes on every platform or
that a long integer is eight bytes on every platform. These things are quite important
for implementing an operating system because many a time we are interested
in byte-level information. Hence, to be 100% sure, it is a good idea to define
all the primitive data types in the arch directories.
For example, if we are interested in defining an unsigned 32-bit integer,
we should not use the classic unsigned int primitive because we never know
whether an int is 32 bits or not on the architecture on which we are compiling
the kernel. Hence, it is a much better idea to define custom data types that for
instance guarantee that regardless of the architecture, a data type will always be
an unsigned integer (32 bits long). Courtesy the C preprocessor, this can easily
be done. We can define types such as u32 and u64 that correspond to unsigned
32-bit and 64-bit integers, respectively, on all target architectures. It is the job
of the architecture-specific programmers to include the right kind of code in the
arch folder to implement these virtual data types (u32 and u64). Once this is
done, the rest of the kernel code can use these data types seamlessly.
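For instance, the architecture-specific headers might contain definitions along the following lines (a sketch; the kernel's actual type names and header layout differ slightly):

/* On an architecture where int is 32 bits and long long is 64 bits */
typedef unsigned int       u32;
typedef unsigned long long u64;

/* The rest of the kernel can now use these types without worrying
 * about the underlying architecture. */
u32 flags;
u64 timestamp;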
Similar abstractions and virtualization are required to implement other parts
of the booting subsystem, and other low-level services such as memory manage-
ment and power management. Basically, anything that is architecture-specific
needs to be defined in the corresponding subfolder in the arch directory and
then a generic interface needs to be exposed to the rest of the kernel code.
/* current CPU */
u32 cpu ;
}
This structure basically stores the current state of the thread, the state of
the executing system call and synchronization-related information. Along with
that, it stores another vital piece of information, which is the number of the
CPU on which the thread is running or is scheduled to run at a later point
in time. We shall see in later sections that finding the id of the current CPU
(and the state associated with it) is a very frequent operation and thus there is a
pressing need to realize it as efficiently as possible. In this context, thread info provides a convenient, low-level place to keep such per-thread information.
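A simplified sketch of the structure is shown below; the exact fields vary across architectures and kernel versions, so the names are illustrative.

struct thread_info {
        unsigned long flags;   /* low-level status flags (e.g., rescheduling needed) */
        u32           status;  /* state related to the executing system call */
        u32           cpu;     /* CPU on which the thread runs or will run */
};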
[Figure: task state transitions. A newly created task enters TASK_RUNNING (ready but not running); the scheduler moves it to TASK_RUNNING (currently running); SIGSTOP moves it to TASK_STOPPED; when it finishes or is terminated, it becomes TASK_ZOMBIE.]
Here is the fun part in Linux. A task that is currently running and a
task that is ready to run have the same state: TASK RUNNING. There are historical
reasons for this, as well as simple common-sense reasons in terms of
efficiency. We are basically saying that a task that is ready to run and one that
is running have the same state and thus, in a certain sense, are indistinguishable.
This little trick allows us to use the same queue for maintaining all such tasks
that are ready to run or are currently running. We shall see later that this
simplifies many design decisions. Furthermore, if there is a context switch,
then there is no need to change the status of the task that was swapped out.
Of course, someone may argue that using the same state (TASK RUNNING)
introduces ambiguity. To a certain extent it is true, but it does simplify a lot of
things and does not appear to be a big hindrance in practice.
Now it is possible that a running task may keep on running for a long time
and the scheduler may decide that it is time to swap it out so that other tasks
get a chance. In this case, the task is said to be “preempted”. This means that
it is forcibly displaced from a core (swapped out). However, it is still ready to
run, hence its state remains TASK RUNNING. Its place is taken by another task
– this process thus continues.
Let us look at a few other interactions. A task may be paused using the
SIGSTOP signal. Specifically, the kill system call can be used to send the stop
signal to a task. We can also issue the following command on the command line:
kill -STOP pid. Another approach is to send the SIGTSTP signal by pressing
Ctrl-z on the terminal. The only difference here is that this signal can be
ignored. Sometimes there is a need for doing this, especially if we want to run
the task at a later point of time when sufficient CPU and memory resources
are available. In this case, we can just pause the task. Note that SIGSTOP is a
special type of signal that cannot simply be discarded or caught by the process
that corresponds to this task. In this case, this is more of a message to the
kernel to actually pause the task. It has a very high priority. At a later point
of time, the task can be resumed using the SIGCONT signal. Needless to say,
the task resumes at the same point at which it was paused. The correctness of
the process is not affected unless it relies on some aspect of the environment
that possibly got changed while it was in a paused state. The fg command line
utility can also be used to resume such a suspended task.
Let us now come to the two interrupted states: INTERRUPTIBLE and UNINTERRUPTIBLE.
A task enters these states when it requests for some service like reading an I/O
device, which is expected to take a lot of time. In the first state, INTERRUPTIBLE,
the task can still be resumed to act on a message sent by the OS, which we refer
to as a signal. For instance, it is possible for other tasks to send the interrupted
process a message (via the OS) and in response it can invoke a signal handler.
Recall that a signal handler is a specific function defined in the program that
is conceptually similar to an interrupt handler, however, the only difference is
that it is implemented in user space. In comparison, in the UNINTERRUPTIBLE
state, the task does not respond to signals.
Zombie Tasks
The process of wrapping up a task is quite elaborate in Linux. Recall that the
processor has no way of knowing when a task has completed. It is thus necessary
to explicitly inform it by making the exit system call. However, a task’s state
is not cleaned up at this stage. Instead, the task’s parent is informed using the
SIGCHLD signal. The parent then needs to call the system call wait to read the
exit status of the child. It is important to understand that every time the exit
system call is called, the exit status is passed as an argument. Typically, the
value zero indicates that the task completed successfully. On the other hand,
a non-zero status indicates that there was an error. The status in this case
represents the error code.
Here again, there is a convention. The exit status ‘1’ indicates that there
was an error, however it does not provide any additional details. We can refer to
this situation as a non-specific error. Given that we have a structured hierarchy
of tasks with parent-child relationships, Linux explicitly wants every parent to
read the exit status of all its children. Until a parent task has read the exit
status of the child, the child remains a zombie task – neither dead nor alive.
[Figure 3.2: the kernel stack in older kernels. The thread_info structure sits at the lowest address, is pointed to by current, and contains a pointer to the task_struct.]
For a long time, the kernel stack was limited to two pages, i.e., 8 KB. It
contained useful data about the running thread. These are basically per-thread
stacks. In addition, the kernel maintains a few other stacks, which are CPU
specific. The CPU-specific stacks are used to run interrupt handlers, for in-
stance. Sometimes we have very high-priority interrupts, and some interrupts
cannot be ignored (masked). The latter kind of interrupts are known as
NMIs (non-maskable interrupts). This basically means that if a higher-priority
interrupt arrives while we are executing an interrupt handler, we need to do a
context switch and run the handler of the higher-priority interrupt.
There is thus a need to switch to a new interrupt stack. Hence, each CPU has
an interrupt stack table with seven entries. This means that Linux can han-
dle deeply nested interrupts (up to 7 levels). This is conceptually similar to the
regular context switch process for user-level tasks.
Figure 3.2 shows the structure of the kernel stack in older kernels. The
thread info structure was kept at the lowest address and there was a dedicated
current pointer that pointed to the thread info structure. This is a very
quick method of retrieving the thread info associated with the current task.
In fact from any stack address, we can quickly compute the address stored in the
current pointer using the fact that the starting address of thread info needs
to be a multiple of 8 KB. Simple bitwise operations on the address can be used
to find this value (left as an exercise for the reader). Once, we get the address
of the thread info structure, we can get the pointer to the task struct.
The kernel stack as of today looks more or less the same. It is still limited to 8
KB in size. However, the trick of placing thread info at the lowest address and
using that to reference the corresponding task struct is not needed anymore.
We can use a better method that relies on segmentation. This is one of the
rare instances in which x86 segmentation proves to be extremely beneficial. It
provides a handy reference point in memory for storing specific data that is
highly useful and has the potential for being frequently used. Furthermore, to
use segmentation, we do not need any extra instructions (see Appendix A). The
segment information can be embedded in the memory address itself. Hence, this
part comes for free and the Linux kernel designers leverage this to the hilt.
Refer to the code in Listing 3.2. It defines a macro current that returns a
pointer to the current task struct via a chain of macros and in-line functions
(works like a macro from a performance point of view). The most important
thing is that this pointer to the current task struct, which is accessible using the
current variable, is actually stored in the gs segment [Lameter and Kumar,
2014]. This serves as a dedicated region, which is specific to a CPU. We can
treat it as a per-CPU storage area that stores a lot of information that is relevant
to that particular CPU only.
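A simplified sketch of the mechanism is shown below. It is modeled on the x86-64 approach described above; the kernel's actual macro chain and variable names differ slightly across versions.

/* Each CPU keeps its own copy of current_task in its per-CPU region,
 * whose base address is held in the gs segment register. Reading it
 * therefore compiles down to a single gs-relative load. */
DECLARE_PER_CPU(struct task_struct *, current_task);

static inline struct task_struct *get_current(void)
{
        return this_cpu_read(current_task);
}

#define current get_current()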
Note that here we are using the term “CPU” as a synonym for a “core”.
This is Linux’s terminology. We can store a lot of important information in a
dedicated per-CPU/per-core area notably the current (task) variable, which
is needed very often. It is clearly a global variable insofar as the kernel code
running on the CPU is concerned. We thus want to access it with as few memory
accesses as possible. In our current solution with segmentation, we are reading
the variable with just a single instruction. This was made possible because of
the segmentation mechanism in x86. An astute reader can clearly make out that
this mechanism is more efficient than the earlier method that used a redirection
via the thread info structure. The slower redirection based mechanism is still
used in architectures that do not have support for segmentation.
There are many things to be learned here. The first is that for something as
important as the current task, which is accessed very frequently, and is often on
the critical path, there is a need to devise a very efficient mechanism. Further-
more, we also need to note the diligence of the kernel developers in this regard
and appreciate how much they have worked to make each and every mechanism
as efficient as possible – save memory accesses wherever and whenever possible.
In this case, several conventional solutions are clearly not feasible such as storing
the current task pointer in CPU registers, a privileged/model-specific regis-
ter (may not be portable), or even a known memory address. The issue with
storing this pointer at a known memory address is that it significantly limits
our flexibility in using the virtual address space and there are portability issues
across architectures. As a result, the developers chose the segmentation-based
method for x86 hardware.
There is a small technicality here. We need to note that different CPUs
(cores on a machine) will have different per-CPU regions. This, in practice,
can be realized very easily with this scheme because different CPUs have dif-
ferent segment registers. We also need to ensure that these per-CPU regions
are aligned to cache line boundaries. This means that a cache line is uniquely
allocated to a per-CPU region. No cache line stores data corresponding to a
per-CPU region and other data. If this is the case, we will have a lot of false
sharing misses across the CPU cores, which will prove to be very detrimental
to the overall performance. Recall that false sharing misses are an artifact of
cache coherence. A cache line may end up continually bouncing between cores
if they are interested in accessing different non-overlapping chunks of that same
cache line.
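As an illustration (the structure and its fields are hypothetical), a per-CPU structure can be aligned so that it starts on its own cache line and no line is shared with data belonging to another CPU:

/* Hypothetical per-CPU bookkeeping structure, aligned to a 64-byte cache line */
struct percpu_area {
        struct task_struct *current_task;  /* task running on this CPU */
        unsigned long       nr_switches;   /* context switches on this CPU */
} __attribute__((aligned(64)));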
Linux uses 140 task priorities. The priority range as shown in Table 3.2 is
from 0 to 139. The priorities 0-99 are for real-time tasks. These tasks are for
mission-critical operations, where deadline misses are often not allowed. The
scheduler needs to execute them as soon as possible.
The reason we have 100 different priorities for such real-time processes is
because we can have real-time tasks that have different degrees of importance.
We can have some that have relatively “soft” requirements, in the sense that
it is fine if they are occasionally delayed. Whereas, we may have some tasks
where no delay is tolerable. The way we interpret the priority range 0-99 is as
follows. In this space 0 corresponds to the least priority real-time task and the
task with priority 99 has the highest priority in the overall system.
Some kernel threads run with real-time priorities, especially if they are in-
volved in important bookkeeping activities or interact with sensitive hardware
devices. Their priorities are typically in the range of 40 to 60. In general, it
is not advisable to have a lot of real-time tasks with very high priorities (more
than 60) because the system tends to become quite unstable. This is because the
CPU time is completely monopolized by these real-time tasks, resulting in the
rest of the tasks, including many OS tasks, not getting enough time to execute.
Hence, a lot of important kernel activities get delayed.
Now, for regular user-level tasks, we interpret the priority slightly differ-
ently. In this case, the higher the priority number, the lower the actual priority. This
basically means that in the entire system, the task with priority 139 has the
least priority. On the other hand, the task with priority 100 has the highest
priority among all regular user-level tasks. It still does not have a real-time
priority but among non-real-time tasks it has the highest priority. The impor-
tant point to understand is that the way that we understand these numbers
is quite different for real-time and non-real-time tasks. We interpret them in
diametrically opposite manners in both the cases.
There are two concepts here. The first is the number that we assign in the
range 0-139, and the second is the way that we interpret the number as a task
priority. It is clear from the preceding discussion that the number is interpreted
differently for regular and real-time tasks. However, if we consider the kernel, it
needs to resolve the ambiguity and use a single number to represent the priority
of a task. We would ideally like to have some degree of monotonicity. Ideally, we
want that either a lower value should always correspond to a higher priority or
the reverse, but we never want a combination of the two in the actual kernel code.
This is exactly what is done in the code snippet that is shown in Listing 3.3. We
need to note that there are historical reasons for interpreting user and real-time
priority numbers at the application level differently, but in the kernel code this
ambiguity needs to be removed. We need to ensure monotonicity.
In line with this philosophy, let us consider the first else if condition that
corresponds to real-time tasks. In this case, the value of MAX RT PRIO is 100.
Hence, the range [0-99] gets translated to [99-0]. This basically means that lower
the value of prio, greater the priority. We would want user-level priorities
to be interpreted similarly. Hence, let us proceed to the body of the else
statement. Here, the macro NICE TO PRIO is used. Before expanding the macro,
it is important to understand the notion of being nice in Linux.
The default user-level priority associated with a regular task is 120. Given
a choice, every user would like to raise the priority of her task to be as high
as possible. After all everybody wants their task to finish quickly. Hence,
the designers of Linux decided (rightfully so) to not give users the ability to
arbitrarily raise the priorities of their tasks. Instead, they allowed users to do
the reverse, which was to reduce the priority of their tasks. It is a way to be
nice to others. There are many instances where it is advisable to do so. For
instance, there are many tasks that do routine bookkeeping activities. They
are not very critical to the operation of the entire system. In this case, it is
a good idea for users to be courteous and let the operating system know that
their task is not very important. The scheduler can thus give more priority to
other tasks. There is a formal method of doing this, which is known as the
nice mechanism. As the name suggests, the user can increase the priority value
from 120 to any number in the range 121-139 by specifying a nice value. The
nice value in this case is a positive number, which is added to the number 120.
The final value represents the priority of the process. The macro NICE TO PRIO
effects the addition – it adds the nice value to 120.
There is a mechanism to also have a negative nice value. This mechanism
is limited to the superuser, who is also known as the root user in Linux-based
systems. The superuser is a special kind of user who has more privileges and
is supposed to play the role of a system administrator. She still does not have
kernel privileges, but she can specify a negative nice value between -1 and -20.
She still cannot raise the priority of a regular user-level process to that of a
real-time process, but she can arbitrarily alter the priority of any user-level
process. We are underscoring the fact that regular users who are
not superusers cannot access this facility. Their nice values are strictly positive
and are in the range [1-19].
Now we can fully make sense of the code shown in Listing 3.3. We have
converted the user or real-time priority to a single number prio. The lower it is,
the greater the actual priority. This number is henceforth used throughout the
kernel code to represent actual task priorities. We will see that when we discuss
schedulers, the process priorities will be very important and shall play a vital
role in making scheduling decisions.
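The following self-contained sketch captures the mapping described above; it is not the kernel's exact code, but the arithmetic is the same. Lower numbers mean higher effective priority throughout.

#define MAX_RT_PRIO   100
#define DEFAULT_PRIO  120
#define NICE_TO_PRIO(nice) (DEFAULT_PRIO + (nice))  /* nice lies in [-20, 19] */

/* Returns the internal priority used by the kernel (lower = higher) */
static int effective_prio(int is_realtime, int rt_priority, int nice)
{
        if (is_realtime)
                return MAX_RT_PRIO - 1 - rt_priority;  /* RT 0..99 maps to 99..0 */
        return NICE_TO_PRIO(nice);                     /* regular tasks: 100..139 */
}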
/* Timestamps : */
The structure sched info shown in Listing 3.4 contains some meta information
about the overall scheduling process. The variable pcount denotes the number
of times this task has run on the CPU. run delay is the time spent waiting in
the run queue. Note that the run queue is a structure that stores all the tasks
whose status is TASK RUNNING. As we have discussed earlier, this includes
tasks that are currently running on CPUs as well as tasks that are ready to run.
Then we have a bunch of timestamps. The most important timestamps are
last arrival and last queued, which store when a task last ran on the CPU
and when it was last queued to run, respectively. In general, the unit of time within a CPU is
either in milliseconds or in jiffies (refer to Section 2.1.4).
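Roughly, the structure looks as follows (a sketch based on the fields discussed above; the exact types may differ across kernel versions):

struct sched_info {
        unsigned long      pcount;        /* number of times run on a CPU */
        unsigned long long run_delay;     /* time spent waiting on the run queue */
        /* Timestamps: */
        unsigned long long last_arrival;  /* when the task last ran on a CPU */
        unsigned long long last_queued;   /* when the task was last queued to run */
};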
Key components
Listing 3.5 shows the code of vm area struct that represents a contiguous
virtual memory region. As we can see from the code, it maintains the details
of each virtual memory (VM) region including its start and end addresses. It
also contains a pointer to the original mm struct. Let us now introduce the two
kinds of memory regions in Linux: anonymous and file-backed.
Anonymous Memory Region These are memory regions that are not
mirrored or copied from a file, such as the stack and the heap.
These memory regions are created during the execution of the process
and store dynamically allocated data. Hence, these are referred to as
anonymous memory regions. They have a dynamic existence and do not
have a file-backed copy and are not linked to specific sections in a binary
or object file.
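For reference, a simplified view of the structure from Listing 3.5 is sketched below; only a representative subset of the fields is shown.

struct vm_area_struct {
        unsigned long     vm_start;  /* first address of the region */
        unsigned long     vm_end;    /* first address after the end of the region */
        struct mm_struct *vm_mm;     /* the address space the region belongs to */
        unsigned long     vm_flags;  /* read/write/execute permissions, etc. */
        struct file      *vm_file;   /* backing file; NULL for anonymous regions */
};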
Let us answer the first question here. We shall defer answering the rest of
the questions until we have explained some additional concepts. For mapping a
pid to a struct pid, we use a Radix tree (see Appendix C).
A natural question that will arise here is why not use a hash table? The
kernel developers conducted a lot of experiments and tested a lot of data struc-
tures. They found that most of the time, processes share prefixes in terms of
their more significant digits. This is because, most of the time, the processes
that are active have roughly similar pid ranges. As a result, if, say, we
have looked up one process's entry, the relevant part of the data structure is
still present in the processor's caches. It can be used to quickly realize a lookup,
given that the subsequent pids are expected to share prefix digits. Hence, in
practice, such Radix trees were found to be faster than hash tables.
Keep in mind that a container at any point of time can be stopped, suspended and
migrated to another machine. On the other machine, the container can be
restarted. We want processes to continue to have the same process ids even on
the new machine. This will ensure that processes execute oblivious to the
fact that they have actually been migrated. Furthermore, if they would like
to communicate with each other, it is much more convenient if they retain the
same pids. This is where the notion of a namespace comes in handy.
In this case, a pid is defined only within the context of a namespace. Hence,
when we migrate a namespace along with its container, and then the container
is restarted on a remote machine, it is tantamount to reinstating the namespace
and all the processes within it. Given that we wish to operate in a closed
sandbox, they can continue to use the same pids and the operating system on
the target machine will respect them because they are running within their
separate namespace.
A namespace itself can be embedded in a hierarchy of namespaces. This is
done for the ease of management. It is possible for the system administrator to
provide only a certain set of resources to the parent namespace. Then the parent
namespace needs to appropriately partition them among its child namespaces.
This allows for fine-grained resource management and tracking.
struct pid_namespace {
    struct idr idr ;                  /* Radix tree of allocated pid structures */
    struct kmem_cache * pid_cachep ;  /* object pool (cache) for struct pid */
    unsigned int level ;              /* depth in the namespace hierarchy */
    struct pid_namespace * parent ;   /* enclosing namespace */
};
The code of struct pid namespace is shown in Listing 3.6. The most
important structure that we need to consider is idr (IDR tree). This is an
annotated Radix tree (of type struct idr) and is indexed by the pid. The
reason that there is such a sophisticated data structure here is because, in
principle, a namespace could contain a very large number of processes. Hence,
there is a need for a very fast data structure for storing and indexing them.
We need to understand that often there is a need to store additional data
associated with a process. It is stored in a dedicated structure called (struct
pid). The idr tree returns the pid structure for a given pid number. We need
to note that a little bit of confusion is possible here given that both are referred
to using the same term “pid”.
Next, we have a kernel object cache (kmem cache) or pool called pid cachep.
It is important to understand what a pool is. Typically, free and malloc calls for
allocating and deallocating memory in C take a lot of time. There is also a need
for maintaining a complex heap memory manager, which needs to find a hole of
a suitable size for allocating a new data structure. It is a much better idea to
have a set of pre-allocated objects of the same type in a 1D array called a pool.
It is a generic concept and is used in a lot of software systems including the
kernel. Here, allocating a new object is as simple as fetching it from the pool
and deallocating it is also simple – we need to return it back to the pool. These
are very fast calls and do not involve the action of the heap memory manager,
which is far slower. Furthermore, it is very easy to track memory leaks. If we
forget to return objects back to the pool, then in due course of time the pool
will become empty. We can then throw an exception, and let the programmer
know that this is an unforeseen condition and is most likely caused by a memory
leak. The programmer must have forgotten to return objects back to the pool.
To initialize the pool, the programmer should have some idea about the
maximum number of instances of objects that may be active at any given point
of time. After adding a safety margin, the programmer needs to initialize the
pool and then use it accordingly. In general, it is not expected that the pool
will become empty because as discussed earlier it will lead to memory leaks.
However, there could be legitimate reasons for this to happen such as a wrong
initial estimate. In such cases, one of the options is to automatically enlarge
the pool size up till a certain limit. Note that a pool can store only one kind of
an object. In almost all cases, it cannot contain two different types of objects.
Sometimes exceptions to this rule are made if the objects are of the same size.
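As an illustration of the pool concept using the kernel's object-cache interface (error handling is omitted and the cache name is arbitrary):

static struct kmem_cache *pid_cachep;

static void pid_pool_init(void)
{
        /* Create a pool of pre-allocated struct pid objects */
        pid_cachep = kmem_cache_create("pid_pool", sizeof(struct pid),
                                       0, SLAB_HWCACHE_ALIGN, NULL);
}

static struct pid *pid_obj_alloc(void)
{
        return kmem_cache_alloc(pid_cachep, GFP_KERNEL);  /* fetch from the pool */
}

static void pid_obj_free(struct pid *p)
{
        kmem_cache_free(pid_cachep, p);                   /* return to the pool */
}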
Next, we store the level field that indicates the level of the namespace.
Recall that namespaces are stored in a hierarchical fashion. This is why, every
namespace has a parent field.
Listing 3.7: The struct pid
source : include/linux/pid.h
struct upid {
int nr ; /* pid number */
struct pid_namespace * ns ; /* namespace pointer */
};
struct pid
{
    refcount_t count ;
    unsigned int level ;
    struct hlist_head tasks [ PIDTYPE_MAX ]; /* tasks using this pid */
    struct upid numbers [1];  /* one (pid number, namespace) tuple per level */
};
Let us now look at the code of struct pid in Listing 3.7. As discussed
earlier, often there is a need to store additional information regarding a process,
which may possibly be used after the pid has been reused, and the process has
terminated. The count field refers to the number of resources that are using the
process. Ideally, it should be 0 when the process is freed. Also, every process
has a default level, which is captured by the level field.
A struct pid can also be used to identify a group of tasks. This is why
a linked list of the type tasks is maintained. The last field numbers is very
interesting. It is an array of struct upid data structures (also defined in
Listing 3.7). Each struct upid is a tuple of the pid number and a pointer
to the namespace. Recall that we had said that a pid number makes sense in
only a given namespace. In other namespaces, the same process (identified with
struct pid) can have a different process id number (pid value). Hence, there
is a need to store such ⟨pid,namespace⟩ tuples.
Consider a small example with four allocated pids: 1, 3, 4 and 7. Their binary representations are 001, 011, 100 and 111, respectively.
Given that we start from the most significant bit, we can create a Radix tree as
shown in Figure 3.4.
The internal nodes are shown as ovals and the leaf nodes are rectangles.
Each leaf node is uniquely identified by the path from the root leading to it (the
pid number). Each leaf node additionally points to the struct pid associated
with the process. Along with that each leaf node implicitly points to a location
in the bit map. If a leaf node corresponds to pid 3 (011), then we can say that
it maps to the 3rd bit in the bit vector.
Figure 3.4: Example of an IDR tree
Let us now interpret the IDR tree differently. Each leaf node corresponds to
a range of pid numbers. For instance, we can associate pid 1 with the indices
0 and 1 in the bit vector. Similarly, for pid 3, we can associate all the indices
after the previous pid 1. These will be indices 2 and 3. Similarly, for pid 4, we
can associate index 4 and for pid 7, we can associate 5, 6 and 7.
This is just an example; we can associate larger indices until the next pid
(in the ascending/preorder sequence of leaves). The exact convention that is used
does not matter. We are basically creating non-overlapping partitions of the bit
vector that are in sorted order.
We interpret the bit vector as follows. If a bit is equal to FREE (logical 1 in
the case of x86), then it means that the corresponding pid number is free, and
vice versa. Each internal node stores similar information as a van Emde Boas
tree (vEB tree). The aim is to find the index of the leftmost (lowest) entry in
the bit vector that stores FREE. Note that in this case a node can have more
than 2 children. For each child we store a bit indicating if it has a free entry in
the subtree rooted at it or not. Let us say that there are k children. The parent
stores a k-bit vector, where we start searching it from index 0 to k − 1. We
find the earliest entry whose status is FREE. The algorithm recursively proceeds
using the subtree rooted at the chosen child. At the end, the algorithm reaches
the leaf node.
Each leaf node corresponds to a contiguous region of bits in the bit vector.
This region can be very large especially if a lot of contiguous pids are deallocated.
Now, we need to note that there is no separate vEB tree that additionally
indexes this region. Other than the Radix tree nodes, we don't have additional
nodes. Hence, we need a fast method to find the first location that contains a
FREE entry in a large contiguous chunk of bits in the bit vector.
Clearly, using a naive method that scans every bit sequentially will take a
lot of time. However, some smart solutions are possible here. We can start with
dividing the set of contiguous bits into chunks of 32 or 64 bits (depending on
the architecture). Let us assume that each chunk is 64 bits. We can typecast
this chunk to an unsigned long integer and compare it with 0. If the comparison
succeeds, then it means that all the bits are 0 and there is no 1 (FREE). If it is
non-zero, then it means that the 64-bit chunk has at least one 1. Fortunately,
x86 machines have an instruction called bsf (bit scan forward) that returns
the position of the first (least significant) 1. This is a very fast hardware in-
struction that executes in 1-2 cycles. The kernel uses this instruction to almost
instantaneously find the location of the first 1 bit (FREE bit).
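A self-contained sketch of this idea is shown below; gcc's __builtin_ctzll builtin stands in for the bsf instruction, and FREE bits are represented by 1s.

#include <stdint.h>

/* Returns the index of the first FREE (set) bit, or -1 if none is free */
static long find_first_free(const uint64_t *bitmap, long nwords)
{
        for (long i = 0; i < nwords; i++) {
                if (bitmap[i] != 0)  /* at least one FREE bit in this 64-bit chunk */
                        return i * 64 + __builtin_ctzll(bitmap[i]);
        }
        return -1;
}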
As soon as a FREE bit is found in the bit vector, it is set to 0, and the
corresponding pid number is deemed to be allocated. This is equivalent to
converting a 1 to a 0 in a vEB tree (see Appendix C). There is a need to traverse
the path from the leaf to the root and change the status of nodes accordingly.
Similarly, when a pid is deallocated, we convert the entry from 0 to 1 (FREE)
and appropriate changes are made to the nodes in the path from the leaf to the
root.
/* I/O device context */
struct io_context * io_context ;
There are historical reasons for this, and over time programmers have learned how to
leverage it to design efficient systems. Other operating systems, notably Windows,
use other mechanisms.
This model is simple, and to many programmers it is much more intuitive.
The kernel defines a few special processes, notably the idle process that does
nothing and the init process that is the mother process of all the processes
in the system. The idle process is basically more of a concept than an actual
process. It represents the fact that nothing active is being done or in other
words no program is running. Its pid is 0. The init process on the other hand
is the first process to run (started at boot time). It spawns the rest of the
processes, and its pid is 1.
1 #include <stdio.h>
2 #include <unistd.h>
3
4
5 int main(void) {
6     int pid = fork();
7
8     if (pid == 0) {
9         printf("I am the child\n");
10    } else {
11        printf("I am the parent: child = %d\n", pid);
12    }
13 }
An example of using the fork system call is shown in Listing 3.9. Here, the
fork library call is used, which encapsulates the fork system call. The fork library
call returns a process id (pid in the code) after creating the child process.
Note that inside the code of the forking procedure, a new process is created,
which is a child of the parent process that made the fork call. It is a perfect
copy of the parent process. It inherits the parent’s code as well as its memory
state. In this case, inheriting means that all the memory regions and the state
is fully copied to the child. For example, if a variable x is defined to be 7 in the
code prior to executing the fork call, then after the call is over and the child
is created, both of them can read x. They will see its value to be 7. However,
there is a point to note here. The variable x is different for both the processes
– it has different physical addresses even though its starting value is the same
(i.e., 7). This means that if the parent changes x to 19, the child will still read
it to be 7 (because it possesses a copy of x and not x itself). Basically, the child
gets a copy of the value of x, not a reference to it. Even though the name x is
the same across the two processes, the variables themselves are different.
Now that we have clarified the meaning of copying the entire memory space,
let us look at the return value. Both the child and the parent will return from
the fork call. A natural question that can be asked here is that prior to executing
the fork call, there was no child, how can it return from the fork call? This is
the fun and tricky part. When the child is created deep in the kernel’s process
cloning logic, a full task along with all of its accompanying data structures is
created. The memory space is fully copied including the register state and the
value of the return address. The state of the task is also fully copied. Since all
the addresses are virtual, creating a copy does not hamper correctness. Insofar
as the child process is concerned, all the addresses that it needs are a part of
its address space. It is, at this point of time, indistinguishable from the parent.
The same way that the parent will eventually return from the fork call, the child
also will. The child will get the return address from either the register or the
stack (depending upon the architecture). This return address, which is virtual,
will be in its own address space. Given that the code is fully copied the child
will place the return value in the variable pid and start executing Line 8 in
Listing 3.9. The notion of forking a new process is shown in Figure 3.5.
Herein lies the brilliance of this mechanism – the parent and child are re-
turned different values.
The child is returned 0 and the parent is returned the pid of the child.
This part is crucial because it helps the rest of the code differentiate between
[Figure 3.5: forking a new process. The original process creates a copy of itself; fork() returns the child's pid to the parent and 0 to the child.]
the parent and the child. A process knows whether it is the parent process or
the child process from the return value: 0 for the child and the child's pid for the
parent. After this happens, the child and parent go their separate ways. Based
on the return value of the fork call, the if statement is used to differentiate
between the child and parent. Both can execute arbitrary code beyond this
point, and their behavior can completely diverge. In fact, we shall see that the
child can completely replace its memory map and execute some other binary.
However, before we go that far let us look at how the address space of one
process is completely copied. This is known as the copy-on-write mechanism.
Copy on Write
Figure 3.6(a) shows the copy-on-write mechanism. In this case, we simply copy
the page tables. The child inherits a verbatim copy of the parent’s page table
even though it has a different virtual address space. This mechanism ensures
that the same virtual address in both the child and parent’s virtual address space
points to the same physical address. This basically means that no memory is
wasted in the copying process and the size of the memory footprint remains
exactly the same. Note that copying the page table implies copying the entire
memory space including the text, data, bss, stack and heap. Other than the
return value of the fork call, nothing else differentiates the child and parent.
They can seamlessly read any address in their virtual address space and they
will get the same value. However, note that this is an implementation hack.
Conceptually, we do not have any shared variables. As we have discussed earlier,
if a variable x is defined before the fork call, after the call it actually becomes
two variables: x in the parent’s address space and x in the child’s address space.
It is true that to save space as well as for performance reasons, the same piece
of physical memory real estate is being used, however conceptually, these are
different variables. This becomes very clear when we consider write operations.
[Figure 3.6: (a) the parent and child sharing the same physical page after a fork; (b) separate physical copies of the page after a write.]
This part is shown in Figure 3.6(b), where we see that upon a write, a new
physical copy of the frame that contains the variable that was written to is
created. The child and parent now have different mappings in their page tables.
Their page tables are updated to reflect this change. For the same virtual
address, they now point to different physical addresses. Assume that the child
initiated the write, then after a new physical copy of the page is created and the
respective mapping is made, the write is realized. In this case, the child writes
to its copy of the page. This write is not visible to the parent.
As the name suggests, this is a copy-on-write mechanism where the child
and parent continue to use the same physical page (frame) until there is no
write. This saves space and also allows us to fork a new process very quickly.
However, the moment there is a write, there is a need to create a new physical
copy of the page and realize the write on the respective copy (copy on write).
This does increase the performance overheads when it comes to the first write
operation after a fork call, however, a lot of this overhead gets amortized and is
often not visible.
There are several reasons for this. The first is that the parent or child may
not subsequently write to a large part of the memory space. In that case, the
copy-on-write mechanism will never kick in. The child may overwrite its memory
image with that of another binary and continue to execute it. In this case also,
there will be no need for a copy-on-write. Furthermore, lazily creating copies
of pages as and when there is a demand distributes the overhead over a long
period of time. Most applications can absorb this very easily. Hence, the fork
mechanism remains fast and efficient in practice.
Details
Let us now delve into some of the details of the forking mechanism. We would
like to draw your attention to the file in the kernel that lists all the supported
system calls: include/linux/syscalls.h. It has a long list of system calls. How-
ever, the system calls of our interest are clone and vfork. The clone system call
is the preferred mechanism to create a new process or thread in a thread group.
It is extremely flexible and takes a wide variety of arguments. However, the
vfork call is optimized for the case when the child process immediately makes
an exec call. In this case, there is no need to fully initialize the child and copy
the page tables of the parent. Finally, note that in a multithreaded process
(thread group), only the calling thread is forked.
Inside the kernel, all of them ultimately end up calling the copy process
function in kernel/fork.c. The signature of the function is as follows:
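The signature listing is not reproduced here; in recent kernels it looks roughly as follows, where the ellipsis stands for the remaining arguments (the exact argument list varies across kernel versions):

static struct task_struct *copy_process(struct pid *pid, int trace,
                                        int node, ...);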
Here, the ellipses . . . indicate that there are more arguments, which we are
not specifying for the sake of readability. The main tasks that are involved in
copying a process are as follows:
2. Copy of all the information about open files, network connections, I/O,
and other resources from the original task
(a) Copy all the connections to open files. This means that from now
on the parent and child can access the same open file (unless it is
exclusively locked by the parent)
(b) Copy a reference to the current file system
(c) Copy all information regarding signal handlers to the child
(d) Copy the virtual address map and other memory related information
(struct mm struct). This also copies the page table.
(e) Recreate all namespace memberships and copy all the I/O permis-
sions. By default, the child has the same level of permissions as the
parent.
(a) Add the new task to the list of children in the parent task
(b) Fix the parent and sibling list of the newly added child task
(c) Add the thread group information in a task that belongs to a multi-
threaded process.
The exec family of system calls is used to achieve this. In Listing 3.10,
an example is shown where the child process runs the execv library call. Its
arguments are a null-terminated string representing the path of the executable
and an array of arguments. The first argument by default is the file name
– “pwd” in this case. The next few arguments should be the command-line
arguments to the executable and the last argument needs to be NULL. There
are a bunch of library calls in the exec family. All of them wrap the exec system
call.
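Listing 3.10 itself is not reproduced here; a minimal sketch along the same lines is shown below (the path /bin/pwd is assumed):

#include <stdio.h>
#include <unistd.h>

int main(void) {
        if (fork() == 0) {
                char *args[] = { "pwd", NULL };  /* argv[0] is the file name */
                execv("/bin/pwd", args);
                perror("execv");                 /* reached only if execv fails */
        }
        return 0;
}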
There are many steps involved in this process. The first action is to clean
up the memory space (process map) of a process and reinitialize all the data
structures. We need to then load the starting state of the new binary in the
process’s memory map. This includes the contents of the text, data and bss
sections. Then there is a need to initialize the stack and heap sections as well
as initializing the starting value of the stack pointer. In general, connections to
external resources such as files and network addresses are preserved in an exec
call. Hence, there is no need to modify, cleanup or reinitialize them. We can
now start executing the process with an updated memory map from the start
of the text section. This is like regular program execution. The fact that we are
starting from a forked process is conveniently forgotten.
1. All the general-purpose registers including the stack pointer and return
address register (if the architecture has one)
2. Program counter (instruction pointer in x86)
3. Segment registers
4. Privileged registers such as CR3 (starting address of the page table)
5. ALU and floating-point unit flags
There are many other minor components of the hardware context in a large
and complex processor like an x86-64 machine. However, we have listed the
main components for the sake of readability. This context needs to be stored
and later restored.
It is important to discuss the two virtual memory related structures here
namely the TLB and page table. The TLB stores the most frequently (or
recently) used virtual-to-physical mappings. There is a need to flush the TLB
when the process changes, because the new process will have a new virtual
memory map. We do not want it to use the mappings of the previous process.
They will be incorrect and this will also be a serious security hazard because now
the new process can access the memory space of the older process. Hence, once
a process is swapped out, at least no other user-level process should have access
to its TLB contents. An easy solution is to flush the TLB upon a context switch.
However, as we shall see later, there is a more optimized solution, which allows
us to append a process identifier (an address-space ID) to each TLB entry. This does
not require the system to flush the TLB upon a context switch. Instead, every
process is made to use only its own mappings. Because of the identifier that is
present, a process cannot access the mappings of any other process. This mechanism
reduces the number of TLB misses; as a result, there is a net performance
improvement.
The page table, open files, network connections and the details of other
resources that a process uses are a part of the software context. They remain in
the process's task struct. They need not be explicitly saved and restored upon
a context switch because they already reside in memory.
Also, when a user process makes a system call (a soft switch), there is no need
to flush the TLB or change the page table. If the kernel uses a different set of
virtual addresses, then there is no problem. This is because the kernel code that
will be executed will only use virtual addresses that are in the kernel space and
have no overlap with user-level virtual addresses. This kernel code is loaded
separately when the soft switch happens – it is not a part of the original binary.
Hence, there is no need to
flush any TLB or freshly load any page table. For example, in 32-bit Linux, the
virtual address space was limited to 4 GB. User processes could only use the
lower 3 GB, and the kernel threads could only use the upper 1 GB. Of course,
there are special mechanisms for a kernel thread to read data from the user
space. However, the reverse is not possible because that would be a serious
security lapse. There is a similar split in 64-bit kernels as well – separate user
and kernel virtual address spaces.
There may not be any relationship between the user thread that was interrupted
and the interrupt handler. Hence, in all likelihood, the interrupt handler will
require a dedicated kernel thread to run. As you may recall, we discussed the
idea of an interrupt stack in Section 3.2.4, where we mentioned that each
CPU maintains a stack of interrupt threads. Whenever an interrupt arrives, we
pop the stack and assign the task of running the interrupt handler to the thread
that was just popped.
An advantage of doing this is that such threads would already have been
initialized to have a high priority. Furthermore, interrupt handling threads
have other restrictions, such as not being allowed to hold locks or use any kind
of blocking synchronization. All of these rules and restrictions can be built into them at
the time of creation. Interrupt handling in Linux follows the classical top half
and bottom half paradigm. Here, the interrupt handler, which is known as the
top half, does basic interrupt processing. However, it may be the case that a
lot more work is required. This work is deferred to a later time and is assigned
to a lower-priority thread, which is classically referred to as the bottom half.
Of course, this mechanism has become more sophisticated now; however, the
basic idea is still the same: do the high-priority work immediately and defer
the low-priority work to a later point in time. The bottom half thread does not
have the same restrictions that the top half thread has. It thus has access to a
wider array of features. Here also, the interrupt handler’s (top half’s) code and
variables are in a different part of the virtual address space (not in a region that
is accessible to any user process). Hence, there is no need to flush the TLB or
reload any page table. This speeds up the context switch process.
interrupt processing a little bit; however, we can be sure that the process's
context was saved correctly.
1. The hardware stores the program counter (rip register) in the register
rcx and stores the flags register rflags in r11. Clearly, prior to making
a system call, it is assumed that the two general-purpose registers rcx and
r11 do not contain any useful data.
4. Almost all x86 and x86-64 processors define a special segment per CPU
known as the Task State Segment or TSS. The size of the TSS segment is
small, but it is used to store important information regarding the context
switch process. It was previously used to store the entire context of the
task. However, these days it is used to store a part of the overall hardware
context of a running task. On x86-64, a system call handler stores the stack
pointer (rsp) in it. There is sadly no other choice. We cannot use the
kernel stack because for that we would need to update the stack pointer – the old
value would get lost. We also cannot use a general-purpose register. Hence,
a separate memory region is necessary to act as temporary storage.
5. Finally, the stack of the current process can be set to the kernel stack.
6. We can now push the rest of the state to the kernel stack. This will include
the following:
Additional Context
Along with the conventional hardware context, there are additional parts of
the hardware context that need to be stored and restored. Because the size of
the kernel stack is limited, it is not possible to store a lot of information there.
Hence, a dedicated structure called a thread struct is defined to store all extra
and miscellaneous information. It is defined in the following file:
arch/x86/include/asm/processor.h.
Every thread defines a TLS (thread-local storage) region. It stores variables
specific to a thread. The thread struct stores a list of such TLS regions
(the starting address and size of each), the stack pointer (optionally), the segment
registers (ds, es, fs and gs), I/O permissions and the state of the floating-point
unit.
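The following is a simplified, illustrative sketch of the kind of state that thread struct holds (the field names and sizes here are ours, not the kernel's exact definitions):

#include <stdint.h>

struct thread_struct_sketch {
    struct { uint64_t base; uint64_t size; } tls[3];  /* TLS regions */
    uint64_t sp;                     /* saved stack pointer (optional) */
    uint16_t ds, es, fs, gs;         /* segment register selectors */
    uint8_t  io_bitmap[128];         /* I/O port permission bitmap */
    uint8_t  fpu_state[512];         /* saved floating-point unit state */
};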
• arch start context switch: initialize the architectural state (if required).
At the moment, x86 architectures do basic sanity checks in this stage.
• Manage the mm struct structures (memory maps) for the previous and
next tasks
Listing 3.11: Code for switching the memory structures (partial code shown
with adaptations)
source : kernel/sched/core.c
if (!next->mm) {                    /* next is a kernel thread */
    next->active_mm = prev->active_mm;
    if (prev->mm)                   /* prev is a user thread */
        mmgrab(prev->active_mm);
}
if (!prev->mm) {                    /* prev is a kernel thread */
    prev->active_mm = NULL;
}
middle, but finally the same user thread is coming back. Since its page table
and TLB state have been maintained, there is no need to flush the TLB. This
improves performance significantly.
The switch to function accomplishes this task by executing the context-saving
steps in the reverse order. The first step is to extract all the information
in the thread struct structures and restore them. They are not very critical
to the execution and thus can be restored first. Then the thread local state
and segment registers other than the code segment register can be restored.
Lastly, the current task pointer, a few of the registers and the stack pointer
are restored.
The function finish task switch completes the process. It updates the process
states of the prev and next tasks and takes care of the timing information
associated with tasks. Note that tasks in the kernel maintain detailed timing
information about the time that they have executed. This information is used
by the scheduler. A few more memory-related adjustments are done such as
adjusting the pages mapped into kernel space (known as kmap in Linux).
Finally, we are ready to start the new task! We set the values of the rest of
the flags, the registers, the code segment register and finally the instruction pointer.
Trivia 3.4.1
One will often find statements of the following form in the kernel code:
if (likely(<some condition>)) { ... }
if (unlikely(<some condition>)) { ... }
These are hints to the compiler (and, indirectly, to the CPU's branch prediction
machinery). The term likely means that the condition is expected to be true
most of the time, and the term unlikely means that it is expected to be false
most of the time. The compiler uses these hints to lay out the common-case
path efficiently, which is vital to good performance.
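For reference, these macros map to a GCC builtin; the kernel's actual definitions live in include/linux/compiler.h, and the following is a simplified sketch of them together with a small usage example:

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int err)
{
    if (unlikely(err))   /* the error path is expected to be rare */
        return -1;
    return 0;            /* the common path is laid out as fall-through */
}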
Trivia 3.4.2
One often finds statements of the form:
static __latent_entropy struct task_struct *
copy_process(...) { ... }
The __latent_entropy attribute marks this function as a source of randomness:
a compiler plugin instruments it so that, as it executes, it contributes
hard-to-predict values to the kernel's entropy pool. Many such random sources
are combined in the kernel to create a good random number generating source
that can be used to generate cryptographic keys.
Exercises
Ex. 2 — Why do we use the term “kernel thread” as opposed to “kernel pro-
cess”?
Ex. 3 — How does placing a limit on the kernel thread stack size make kernel
memory management easy?
Ex. 4 — If the kernel wants to access physical memory directly, how does it
do so using the conventional virtual memory mechanism?
Ex. 5 — Explain the design and operation of the kernel linked list structure
in detail.
Ex. 6 — Why can the kernel code not use the same user-level stack and simply
delete its contents before a context switch?
Ex. 7 — What are the advantages of creating a child process with fork and
exec, as compared to a hypothetical mechanism that can directly create a pro-
cess given the path to the binary?
Ex. 8 — Assume that there are some pages in a process such as code pages
that need to be read-only all the time. How do we ensure that this holds during
the forking process as well? How do we ensure that the copy-on-write mechanism
does not convert these pages to “non-read-only”?
Ex. 9 — What is the role of the TSS segment in the overall context switch
process?
Ex. 13 — Why do we use registers to store the values of rip and rflags in
the case of system calls, whereas we use the stack for interrupts?
* Ex. 14 — To save the context of a program, we need to read all of its reg-
isters, and store them in the kernel’s memory space. The role of the interrupt
handler is to do this by sequentially transferring the values of registers to ker-
nel memory. Sadly, the interrupt handler needs to access registers for its own
execution. We thus run the risk of inadvertently overwriting the context of the
original program, specifically the values that it saved in the registers. How do
we stop this from happening?
Ex. 16 — Consider a situation where a process exits, yet a few threads of that
process are still running. Will those threads continue to run? Explain briefly.
Ex. 17 — Why are idr trees used to store pid structures? Why can’t we use
BSTs, B-Trees, and hash tables? Why is it effective?
Ex. 18 — Which two trees does the idr tree combine? How and why?
Ex. 25 — How are the Radix and Van Emde Boas trees combined? What
is the need for combining them? Answer the latter question in the context of
process management.
Open-Ended Questions
In this chapter, we will study the details of system calls, interrupts, exceptions
and signals. The first three are the only methods to invoke the OS. Normally,
the OS code lies dormant. It comes into action only after one of three events: system
calls, interrupts and exceptions. In common parlance all three of these events
are often referred to as interrupts even though sometimes distinctions are made
such as using the terms “hardware interrupts” and “software interrupts”. It is
important to note that in the general interrupt-processing mechanism, an inter-
rupt can be generated by either external hardware or internally by the program
itself. Internal interrupts comprise software interrupts that effect system calls, as
well as exceptions.1 For all these mechanisms, including the older way of making system
calls, i.e., invoking the instruction int 0x80 that simply generates a dummy
interrupt with interrupt code 0x80, the generic interrupt processing mechanism
is used. The processor even treats exceptions as a special kind of interrupt.
All interrupts have their own interrupt codes; they are also known as interrupt
vectors. An interrupt vector is used to index interrupt handler tables. Let us
elaborate.
Figure 4.1 shows the structure of the Interrupt Descriptor Table (IDT) that
is pointed to by the idtr register. As we can see, regardless of the source of the
interrupt, ultimately an integer code called an interrupt vector gets associated
with it. It is the job of the hardware to assign the correct interrupt vector to
an interrupting event. Once this is done, a hardware circuit is ready to access
the IDT using the interrupt vector as the index.
Accessing the IDT is a simple process. A small module in hardware simply
finds the starting address of the IDT by reading the contents of the idtr register
and then accesses the relevant entry using the interrupt vector. The output is
the address of the interrupt handler, whose code is subsequently loaded. The
handler finishes the rest of the context switch process and begins to execute the
code to process the interrupt. Let us now understand the details of the different types of handlers.
1 Note that the new syscall instruction for 64-bit processors does not use this mechanism.
Instead, it directly transitions the execution to the interrupt entry point after a ring level
change and some basic context saving.
Figure 4.1: Structure of the IDT. Exceptions, system calls (via int 0x80) and interrupts from hardware devices (identified by their IRQs) are all assigned an interrupt vector; the idtr register points to the IDT, which maps each interrupt vector to the address of its handler.
Let us begin with system calls. Consider the following simple C program that prints "Hello World" using the printf library function; we will trace how this call ultimately results in a write system call.
#include <stdio.h>

int main() {
    printf("Hello World\n");
}
Let us now understand this process in some detail. The signature of the
printf function is as follows: int printf(const char* format, ...). The
format string is of the form "The result is %d, %s". It is succeeded by
a sequence of arguments, which replace the format specifiers (like "%d" and
"%s") in the format string. The ellipsis (...) indicates a variable number of
arguments.
A sequence of functions is called in the glibc code. The sequence is as
follows: printf → printf → vfprintf → printf positional → outstring
→ PUT. Gradually the signature changes – it becomes more and more generic.
This ensures that other calls like fprintf that write to a file are all covered by
the same function as special cases. Note that Linux treats every device as a file
including the terminal. The terminal is a special kind of file, which is referred to
as stdout. The function vfprintf accepts a generic file as an argument, which
it can write to. This generic file can be a regular file in the file system or the
terminal (stdout). The signature of vfprintf is as follows:
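(The standard signature is shown below; note that recent glibc versions route the work through an internal variant that takes an additional flags argument – the internal name and flags here are approximate and vary across versions.)

#include <stdarg.h>
#include <stdio.h>

int vfprintf(FILE *s, const char *format, va_list ap);

/* glibc-internal variant (approximate): */
int __vfprintf_internal(FILE *s, const char *format, va_list ap,
                        unsigned int mode_flags);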
Note the generic file argument FILE *s, the format string, the list of ar-
guments and the flags that specify the nature of the I/O operation. Every
subsequent call generalizes the function further. Ultimately, the control reaches
the new do write in the glibc code (fileops.c). It makes the write system
call, which finally transfers control to the OS. At this point, it is important to
digress and make a quick point about the generic principles underlying library
design.
instruction. Regardless of the method used, we arrive at the system call entry
code defined in arch/x86/entry/entry 64.S, which quickly transfers control to the
do syscall 64 function.
At this point there is a ring level switch and interrupts are switched off. The
reason to turn off interrupts is to ensure that the context is saved correctly. If
there is an interrupt in the middle of saving the context, there is a possibility
that an error will be induced. Hence, interrupts are masked so that the context
saving process cannot be cut short. Saving the context is a short process and masking interrupts
during this process will not create a lot of performance issues in handling even
very critical tasks. Interrupts can be enabled as soon as the context is saved.
Linux has a standard system call format. Table 4.1 shows which register stores
which argument. For instance, rax stores the system call number. Up to six
arguments can be supplied via registers, as shown in Table 4.1. If there are
more arguments, then they need to be transferred
via the user stack. The kernel can read user memory, and thus it can easily
retrieve these arguments. However, passing arguments using the stack is not
the preferred method. It is much slower than passing values via registers.
Note that a system call is a planned activity, unlike an interrupt.
Hence, user code can ensure that registers such as rcx and r11 do not hold
useful data (spilling their contents if necessary). Recall that the PC and the flags are automatically stored in these
registers once a system call is made. The system call handler subsequently
stores the contents of these registers on the kernel stack.
Attribute              Register
System call number     rax
Arg. 1                 rdi
Arg. 2                 rsi
Arg. 3                 rdx
Arg. 4                 r10
Arg. 5                 r8
Arg. 6                 r9
Table 4.1: Registers used to pass the system call number and arguments
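To make the register convention concrete, here is a hedged user-space sketch that invokes the write system call directly through the syscall instruction (x86-64 Linux with GCC-style inline assembly is assumed; system call number 1 corresponds to write):

#include <stddef.h>

static long raw_write(int fd, const void *buf, size_t len)
{
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                      /* return value comes back in rax */
        : "0"(1L),                       /* rax: system call number (1 = write) */
          "D"((long)fd),                 /* rdi: Arg. 1 */
          "S"(buf),                      /* rsi: Arg. 2 */
          "d"(len)                       /* rdx: Arg. 3 */
        : "rcx", "r11", "memory"         /* clobbered by the syscall instruction */
    );
    return ret;
}

int main(void)
{
    raw_write(1, "Hello via syscall\n", 18);
    return 0;
}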
Let us now discuss the do syscall 64 function in more detail. After basic context
saving, interrupts are enabled, and then the function accesses a system call table
as shown in Table 4.2. Given a system call number, the table lists the pointer to
the function that handles the specific type of system call. This function is
subsequently called. For instance, the write system call ultimately gets handled
by the ksys write function, where all the arguments are processed, and the real
work is done.
We start out by checking the TIF NEED RESCHED bit in the flags stored in
the thread info structure (accessible via task struct). This flag is set by the
scheduler when it feels that the current task has executed for a long time, and it
needs to give way to other processes or there are other higher priority processes
that are waiting. Sometimes threads explicitly request to be preempted
so that other threads get a chance to execute. Other threads may produce
some value that is useful to the current thread. In this case also, the thread
that wishes to yield the CPU gets this flag set.
If this flag is set, then the scheduler needs to run and find the most worthy
process to run next. The scheduler uses very complex algorithms to decide this.
The scheduler treats the TIF NEED RESCHED flag as a coarse-grained estimate.
Nevertheless, it makes its independent decision. It may decide to continue with
the same task, or it may decide to start a new task on the same core. This is
purely its prerogative.
The context restore mechanism follows the reverse sequence vis-à-vis the
context switch process. Note that there are some entities such as segment
registers that normally need not be stored but have to be restored. The reason
is that they are not transient (ephemeral) in character. We don't expect
them to be changed often, especially by the user task. Once they are set, they
typically continue to retain their values till the end of the execution. Their
values can be stored in the task struct. At the time of restoring a context, if
the task changes, we can read the respective values from the task struct and
set the values of the segment registers.
Finally, the kernel executes the sysret instruction, which sets the value of the PC
and completes the control transfer back to the user process. It also changes the
ring level or, in other words, effects a mode switch (from kernel mode to user
mode).
(Figure: each CPU has its own local APIC (LAPIC); a single I/O APIC connects the I/O device controllers to the CPUs.)
4.2.1 APICs
Figure 4.4 represents the flow of actions. We need to distinguish between two
terms: interrupt request and interrupt number/vector. The interrupt number or
interrupt vector is a unique identifier of the interrupt and is used to identify the
interrupt service routine that needs to run whenever the interrupt is generated.
The IDT is indexed by this number.
The interrupt request (IRQ), on the other hand, is a hardware signal that
is sent to the CPU indicating that a certain hardware device needs to be serviced.
There are different IRQ lines (see Figure 4.4). For example, one line may be
for the keyboard, another one for the mouse, so on and so forth. In older
systems, the number/index of the IRQ line was the same as the interrupt vector.
However, with the advent of programmable interrupt controllers (read APICs),
this has been made more flexible. The mapping can be changed dynamically.
For example, we can program a new device such as a USB device to actually
act as a mouse. It will generate exactly the same interrupt vector. In this way,
it is possible to obfuscate a device and make it present itself as a different or
somewhat altered device to software.
Here again there is a small distinction between the LAPIC and I/O APIC.
The LAPIC directly generates interrupt vectors and sends them to the CPU.
The flow of actions (for the local APIC) is shown in Figure 4.4. ❶ The first
step is to check if interrupts are enabled or disabled. Recall that we discussed
that often there are sensitive sections in the execution of the kernel where it
is a wise idea to disable interrupts such that no correctness problems are in-
troduced. Interrupts are typically not lost. They are queued in the hardware
queue in the respective APIC and processed in priority order when interrupts
are enabled again.
Figure 4.4: IRQ lines feed the APIC, which in turn interrupts the CPU. An IRQ (interrupt request) is the kernel's identifier for a HW interrupt source; an interrupt vector (INT) identifies any kind of interrupting event: interrupt, system call, exception, fault, etc. The LAPIC sends an interrupt vector to the CPU (not an IRQ).
I/O APIC
In the full system, there is only one I/O APIC chip. It is typically not a part
of the CPU; instead, it is a chip on the motherboard. It mainly contains a
redirection table. Its role is to receive interrupt requests from different devices,
process them and dispatch the interrupts to different LAPICs. It is essentially
an interrupt router. Most I/O APICs typically have 24 interrupt request lines.
Typically, each device is assigned its own IRQ number – the lower the number,
the higher the priority. A noteworthy mention is the timer interrupt, whose IRQ number
is typically 0.
Inter-Processor Interrupts (IPIs)
Consider an example: a kernel thread is running on CPU 5 and the kernel decides to preempt the task
running on CPU 1. Currently, we are not aware of any method of doing so.
The kernel thread only has control over the current CPU, which is CPU 5. It
does not seem to have any control over what is happening on CPU 1. The IPI
(inter-processor interrupt) mechanism, which is a hardware mechanism, is precisely designed to facilitate
this. CPU 5, at the behest of the kernel thread running on it, can instruct
its LAPIC to send an IPI to the LAPIC of CPU 1. This will be delivered to
CPU 1, which will then run a kernel thread. After doing the necessary
bookkeeping steps, this kernel thread will realize that it was brought in because
the kernel thread on CPU 5 wanted to replace the task running on CPU 1 with
some other task. In this manner, one kernel thread can exercise its control over
all CPUs. It does however need the IPI mechanism to achieve this. Often, the
timer chip is a part of the LAPIC. Depending upon the needs of the kernel, its
interrupt frequency can be configured or even changed dynamically. We have
already described the flow of actions in Figure 4.4.
Distribution of Interrupts
The next question that we need to address is how interrupts are distributed
among the LAPICs. There are regular I/O interrupts, timer interrupts and
IPIs. We can either have a static distribution or a dynamic distribution. In
the static distribution, one specific core or a set of cores are assigned the role
of processing a given interrupt. Of course, there is no flexibility when it comes
to IPIs. Even in the case of timer interrupts, it is typically the case that each
LAPIC generates periodic timer interrupts to interrupt its local core. However,
this is not absolutely necessary, and some flexibility is provided. For instance,
instead of generating periodic interrupts, it can be programmed to generate an
interrupt at a specific point of time. In this case, this is a one-shot interrupt and
periodic interrupts are not generated. This behavior can change dynamically
because LAPICs are programmable.
In the dynamic scheme, it is possible to send the interrupt to the core that is
running the task with the lowest priority. This again requires hardware support.
Every core on an Intel machine has a task priority register, where the
kernel writes the priority of the current task that is executing on it. This
information is used by the I/O APIC to deliver the interrupt to the core that
is running the lowest-priority process. This is a very efficient scheme, because it
allows higher-priority processes to run unhindered. If there are idle cores, then
the situation is even better. They can be used to process all the I/O interrupts
and sometimes even timer interrupts (if they can be rerouted to a different core).
4.2.2 IRQs
The file /proc/interrupts contains the details of all the IRQs and how they
are being processed (refer to Figure 4.3). Note that the contents shown are
specific to the author's machine, as of 2023.
The first column is the IRQ number. As we see, the timer interrupt is IRQ#
0. The next four columns show the count of timer interrupts received at each
CPU. Note that the count is small. This is because any modern machine has
a variety of timers. It has the low-resolution LAPIC timer; in this case, a
higher-resolution timer was used instead. Modern kernels prefer high-resolution timers
because they can dynamically configure the interrupt interval based on the
processes that are executing in the kernel. This interrupt is originally processed
by the I/O APIC. The term “2-edge” means that this is an edge-triggered
interrupt on IRQ line 2. Edge-triggered interrupts are activated when there is
a signal transition (0 → 1 or 1 → 0). The handler is the generic
function associated with the timer interrupt.
The “fasteoi” interrupts are level-triggered. Instead of being based on an
edge (a signal transition), they depend upon the level of the signal in the in-
terrupt request line. “eoi” stands for “End of Interrupt”. The line remains
asserted until the interrupt is acknowledged.
For every request that comes from an IRQ, an interrupt vector is generated.
Table 4.4 shows the range of interrupt vectors. Non-maskable interrupts (NMIs)
and exceptions fall in the range 0-19. The interrupt numbers 20-31 are reserved
by Intel for later use. The range 32-127 corresponds to interrupts generated by
external sources (typically I/O devices). We are all familiar with interrupt num-
ber 128 (0x80 in hex), which is a software-generated interrupt corresponding to a
system call. Most modern machines have stopped using this mechanism because
they now have a faster method based on the syscall instruction. Vector 239 is the local
APIC (LAPIC) timer interrupt. As we have argued, many IRQs can generate
this interrupt vector because there are many timers in modern systems with
different resolutions. Lastly, the range 251-253 corresponds to inter-processor
interrupts (IPIs). A disclaimer is due here. This is the interrupt vector range in
the author’s Intel i7-based system as of 2023. This in all likelihood may change
in the future. Hence, a request to the reader is to treat this data as an example.
Table 4.5 summarizes our discussion quite nicely. It shows the IRQ number,
interrupt vector and the hardware device. We see that IRQ 0 for the default
timer corresponds to interrupt vector 32. The keyboard, system clock, network
interface and USB ports have their IRQ numbers and corresponding interrupt
vector numbers. One advantage of separating the two concepts – IRQ and in-
terrupt vector – is clear from the case of timers. We can have a wide variety
of timers with different resolutions. However, they can be mapped to the same
interrupt vector. This ensures that whenever an interrupt arrives from any
one of them, the timer interrupt handler can be invoked. The kernel can dy-
namically decide which timer to use depending on the requirements and load
on the system.
Given that HW IRQs are limited in number, it is possible that we may have
more devices than the number of IRQs. In this case, several devices have to
share the same IRQ number. We can do our best to dynamically manage the
IRQs such as deallocating the IRQ when a device is not in use or dynamically
allocating an IRQ when a device is accessed for the first time. In spite of that we
still may not have enough IRQs. Hence, there is a need to share an IRQ between
multiple devices. Whenever an interrupt is received from an IRQ, we need to
check which device generated it by running all the handlers corresponding to
each connected device that shares the same IRQ. These handlers will query the
individual devices or inspect the data and find out. Ultimately, we will find the
device that is responsible for the interrupt. This is a slow but unavoidable task.
Listing 4.2 shows the important fields in struct irq desc. It is the nodal
data structure for all IRQ-related data. It stores all the information regarding
the hardware device, the interrupt vector, CPU affinities (which CPUs process
it), pointer to the handler, special flags and so on.
Akin to process namespaces, IRQs are subdivided into domains. This is
especially necessary given that modern processors have a lot of devices and
interrupt controllers. We can have a lot of IRQs, but at the end of the day, the
processor will use the interrupt vector (a simple number between 0 and 255). It still
needs to retain its meaning and be unique.
A solution similar to hierarchical namespaces is as follows: assign each in-
terrupt controller a domain. Within a domain, the IRQ numbers are unique.
Recall that we followed a similar logic in process namespaces – within a names-
pace pid numbers are unique. The IRQ number (like a pid) is in a certain sense
getting virtualized. Similar to a namespace’s IDR tree whose job was to map pid
numbers to struct pid data structures, we need a similar mapping structure
here per domain. It needs to map IRQ numbers to irq desc data structures.
This is known as reverse mapping (in this specific context). Such a mapping
mechanism allows us to quickly retrieve an irq desc data structure given an
IRQ number. Before that we need to add the interrupt controller to an IRQ
domain. Typically, the irq domain add function is used to realize this. This is
similar to adding a process to a namespace first before starting any operation
on the process.
In the case of an IRQ domain, we have a more nuanced solution. If there are
fewer than 256 IRQs, then the kernel uses a simple linear list; otherwise, it uses a
radix tree. This gives us the best of both worlds.
The domains are organized hierarchically. We have an I/O APIC domain
whose parent is a larger domain known as the interrupt remapping domain – its
job is to virtualize the multiple I/O APICs. This domain forwards the interrupt
to the controllers in the LAPIC domain that further virtualize the IRQs, map
them to interrupt vectors and present them to the cores.
An astute reader will quickly notice the difference between hierarchical names-
paces and hierarchical IRQ domains. In the former, the aim is to make a child
process a member of the parent namespace such that it can access resources
that the parent owns. However, in the case of IRQ domains, interrupts flow
from the child to the parent. There is some degree of virtualization and remapping
at every stage. For example, one of the domains in the middle could send all
keyboard interrupts to only one VM (virtual machine) running on the system.
This is because the rest of the VMs may not be allowed to read entries from the
keyboard. Such policies can be enforced with IRQ domains.
The IDT maps the interrupt vector to the address of the handler.
The initial IDT is set up by the BIOS. During the process of the kernel
booting up, it is sometimes necessary to process user inputs or other important
system events like a voltage or thermal emergency. Also, in many cases, prior
to the OS booting up, the bootloader shows a menu on the screen; it asks the user
which kernel she would like to boot. For all of this, we need a bare-bones
IDT that is already set up. However, once the kernel boots, it needs
to reinitialize or overwrite it. For every single device and exception-generating
situation, entries need to be made. These will be custom entries and only the
kernel can make them because the BIOS would simply not be aware of them –
they are very kernel specific. Furthermore, the interrupt handlers will be in the
kernel’s address space and thus only the kernel will be aware of their locations.
In general, interrupt handlers are not kept in a memory region that can be
relocated or swapped out. The pages are locked and pinned in physical memory
(see Section 3.2.8).
The kernel maps the IDT to the idt table data structure. Each entry of this
table is indexed by the interrupt vector. Each entry points to the corresponding
interrupt handler. It basically contains two pieces of information: the value
of the code segment register and an offset within the code segment. This is
sufficient to load the interrupt handler. Even though this data structure is
set up by the kernel, it is actually looked up in hardware. There is a simple
mechanism to enable this. There is a special register called the IDTR register.
Similar to the CR3 register for the page table, it stores the base address of the
IDT. Thus, the processor knows where to find the IDT in physical memory.
The rest of the lookup can be done in hardware and interrupt handlers can be
automatically loaded by a hardware circuit. The OS need not be involved in
this process. Its job is to basically set up the table and let the hardware do the
rest.
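To make the hardware's view concrete, the following is an illustrative sketch of a 64-bit IDT entry (an interrupt gate) and of the small descriptor that the idtr register holds; the field names are ours, but the layout follows the x86-64 specification:

#include <stdint.h>

/* One IDT entry (interrupt gate) on x86-64: the handler address is split
   across three fields, together with the kernel code segment selector. */
struct idt_entry64 {
    uint16_t offset_low;    /* handler address, bits 0-15  */
    uint16_t segment;       /* code segment selector       */
    uint16_t flags;         /* IST index, gate type, DPL, present bit */
    uint16_t offset_mid;    /* handler address, bits 16-31 */
    uint32_t offset_high;   /* handler address, bits 32-63 */
    uint32_t reserved;
} __attribute__((packed));

/* The operand of the lidt instruction: base address and size of the IDT. */
struct idt_register {
    uint16_t limit;         /* size of the IDT in bytes, minus one */
    uint64_t base;          /* starting address of the IDT         */
} __attribute__((packed));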
such as a graphics card or a network card. These devices are essential to the
operation of the full system.
Next, it makes a call to init IRQ to set up the per-CPU interrupt stacks
(Section 3.2.4) and the basic IDT. Once that is done, the LAPICs and the I/O
APIC can be set up along with all the connected devices. The apic bsp setup
function realizes this task. All the platform specific initialization functions for
x86 machines are defined in a structure .irqs that contains a list of function
pointers as shown in Listing 4.3. The function apic intr mode init specifically
initializes the APICs on x86 machines.
Listing 4.3: The function pointers associated with IRQ handling
source : arch/x86/kernel/x86 init.c
.irqs = {
    .pre_vector_init       = init_ISA_irqs,
    .intr_init             = native_init_IRQ,
    .intr_mode_select      = apic_intr_mode_select,
    .intr_mode_init        = apic_intr_mode_init,
    .create_pci_msi_domain = native_create_pci_msi_domain,
}
TPR Task priority register. This stores the priority of the task. When we
are dynamically assigning interrupts to cores, the priority stored in this
register comes in handy.
The entry point to the IDT is shown in Listing 4.4. The vector irq array
is a table that uses the interrupt vector (vector) as an index to fetch the
corresponding irq desc data structure. This array is stored in the per-CPU
region, hence the this cpu read macro is used to access it. Once we obtain the
irq desc data structure, we can process the interrupt by calling the handle irq
function. Recall that the interrupt descriptor stores a pointer to the function
that is meant to handle the interrupt. The array regs contains the values of
all the CPU registers. This was populated in the process of saving the context
of the running process that was interrupted. Let us now look at an interrupt
handler, referred to as an IRQ handler in the parlance of the Linux kernel. The
specific interrupt handlers are called from the handle irq function.
Recall that an IRQ can be shared across devices. When an IRQ is set to
high, we don’t know which device has raised an interrupt. It is important to call
all the handlers associated with the IRQ one after the other, until one of them
associates the interrupt with its corresponding device. This interrupt handler
can then proceed to handle the interrupt. The rest of the interrupt handlers
associated with the IRQ will return NONE, whereas the handler that handles it
returns either HANDLED or WAKE THREAD.
The structure irq desc has a linked list comprising struct irqaction*
elements. It is necessary to walk this list and find the handler that is associated
with the device that raised the interrupt.
Note that all of these handlers (of type irq handler t) are function point-
ers. They can either be generic interrupt handlers defined in the kernel or
device-specific handlers defined in the device driver code (the drivers direc-
tory). Whenever a device is connected or at boot time, the kernel locates the
device drivers for every such device. The device driver registers a list of func-
tions with the kernel. Whenever an interrupt arrives on a given IRQ line, it is
necessary to invoke the interrupt handlers of all the devices that share that IRQ
line. Basically, we traverse a list of function pointers and invoke one after the
other until the interrupt is successfully handled.
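The following is a simplified, self-contained sketch of this traversal (the types and constants are modeled on the kernel's irqaction mechanism but are defined locally here; this is not the kernel's exact code):

#include <stddef.h>

typedef enum { IRQ_NONE, IRQ_HANDLED, IRQ_WAKE_THREAD } irqreturn_t;
typedef irqreturn_t (*irq_handler_t)(int irq, void *dev_id);

struct irqaction {
    irq_handler_t     handler;  /* device driver's handler */
    void             *dev_id;   /* identifies the device instance */
    struct irqaction *next;     /* next handler sharing this IRQ */
};

/* Call each registered handler until one of them claims the interrupt. */
static irqreturn_t handle_shared_irq(int irq, struct irqaction *action)
{
    irqreturn_t ret = IRQ_NONE;
    for (; action != NULL; action = action->next) {
        ret = action->handler(irq, action->dev_id);
        if (ret != IRQ_NONE)    /* IRQ_HANDLED or IRQ_WAKE_THREAD */
            break;              /* simplification: stop at the first claimant */
    }
    return ret;
}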
The interrupt handler that does the basic interrupt processing is conven-
tionally known as the top half. Its primary job is to acknowledge the receipt of
the interrupt to the APIC and exchange urgent data with the device. Note
that such interrupt handlers are not allowed to make blocking calls or use locks.
Given that they are very high-priority threads, we do not want them to wait for a long
time to acquire a lock or potentially get into a deadlock situation. Hence, they
are not allowed to use any form of blocking synchronization. Let us elaborate.
Top-half interrupt handlers run in a specialized interrupt context. In the
interrupt context, blocking calls such as lock acquisition are not allowed, pre-
emption is disabled, there are limitations on the stack size (similar to other
kernel threads) and access to user-space memory is not allowed. These are
clearly attributes of very high-priority threads that you want to run and finish
quickly.
If the interrupt processing work is very limited, then the basic top half
interrupt handler is good enough. Otherwise, it needs to schedule a bottom half
thread for deferred interrupt processing. We schedule the work for a later point
in time. The bottom half thread does not have the same restrictions as the top
half thread. It can acquire locks, perform complex synchronization and can take
a long time to complete. Moreover, interrupts are enabled when a bottom half
thread is running. Given that such threads have a low priority, they can execute
for a long time. They will not make the system unstable.
4.2.6 Exceptions
The Intel processor on your author’s machine defines 24 types of exceptions.
These are treated exactly the same way as interrupts and similar to an interrupt
vector, an exception number is generated.
Even though interrupts and exceptions are conceptually different, they are
still handled by the same mechanism, i.e., the IDT. Hence, from the stand-
point of interrupt handling, they are the same (they index the IDT in the same
manner), however, later on within the kernel their processing is very different.
Table 4.6 shows a list of some of the most common exceptions supported by
Intel x86 processors and the latest version of the Linux kernel.
Many of the exceptions are self-explanatory. However, some need some ad-
ditional explanation as well as justification. Let us consider the “Breakpoint”
Exception Handling
Let us now look at exception handling (also known as trap handling). For every
exception, we define a macro of the form (refer to Listing 4.5) –
We are declaring a macro for division errors. It is named exc divide error.
It is defined in Listing 4.6. The generic function do error trap handles all the
traps (to begin with). Along with details of the trap, it takes all the CPU
registers as an input.
Listing 4.6: Definition of a trap handler
source : arch/x86/kernel/traps.c
DEFINE_IDTENTRY(exc_divide_error)
{
    do_error_trap(regs, 0, "divide error", X86_TRAP_DE,
                  SIGFPE, FPE_INTDIV, error_get_trap_addr(regs));
}
There are several things that an exception handler can do. The various
options are shown in Figure 4.5.
The first option is clearly the most innocuous, which is to simply send a
signal to the process and not take any other kernel-level action. This can for
instance happen in the case of debugging, where the processor will generate an
exception upon the detection of a debug event. The OS will then be informed,
and the OS needs to send a signal to the debugging process. This is exactly
how breakpoints and watchpoints work.
The second option is not an exclusive option – it can be clubbed with the
other options. The exception handler can additionally print messages to the
kernel logs using the built-in printk function. This is a kernel-specific print
function that writes to the logs. These logs can be viewed using the dmesg
command and are typically also found in the /var/log/messages file. Many times,
understanding the reasons behind an exception is very important, particularly
when kernel code is being debugged.
The third option is meant to genuinely be an exceptional case. It is a dou-
ble fault – an exception within an exception handler. This is never supposed
to happen unless there is a serious bug in the kernel code. In this case, the
recommended course of action is to halt the system and restart the kernel. This
event is also known as a kernel panic (source : kernel/panic.c).
The fourth option is very useful. For example, assume that a program has
been compiled for a later version of a processor that provides a certain instruc-
tion that an earlier version does not. For instance, processor version 10 in the
processor family provides the cosine instruction, which version 9 does not. In
this case, it is possible to create a very easy patch in software such that code
that uses this instruction can still seamlessly run on a version 9 processor.
The idea is as follows. We allow the original code to run. When the CPU
encounters an unknown instruction (in this case the cosine instruction), it
generates an exception – illegal instruction. The kernel's exception handler
can then analyze the nature of the exception and figure out that it was actually
the cosine operation that the instruction was trying to compute. However, that
instruction is not a part of the ISA of the current processor. In this case, it
is possible to use other existing instructions to
compute the cosine of the argument and populate the destination register with
the result. The running program can be restarted at the exact point at which it
trapped. The destination register will have the correct result. It will not even
perceive the fact that it was running on a CPU that did not support the cosine
instruction. Hence, from the point of view of correctness, there is no issue.
Of course, there is a performance penalty – this is a much slower solution
as compared to having a dedicated instruction. However, the code now be-
comes completely portable. Had we not implemented this patching mechanism
via exceptions, the entire program would have been rendered useless. A small
performance penalty is a very small price to pay in this case.
The last option is known as the notify die mechanism, which implements
the classic observer pattern in software engineering.
One interested listener can, for example, run a piece of code to fix any errors that
may have occurred. Another interested listener can log the event. These two
listeners are clearly doing different things, which was the original intention. We
can add more listeners to the chain and make them do many other things.
The return values of the different handlers are quite relevant and important
here. This process is similar in character to the irqaction mechanism, where
we invoke all the interrupt handlers that share an IRQ line in sequence. The
return value indicates whether the interrupt was successfully handled or not.
In that case, we would like to handle the interrupt only once. However, in the
case of an exception, multiple handlers can be invoked, and they can perform
different kinds of processing. They may not enjoy a sense of exclusivity (as
in the case of interrupts). Let us elaborate on this point by looking at the
return values of exception handlers that use the notify die mechanism (shown
in Table 4.7). We can either continue traversing the chain of listeners/observers
after processing an event or stop calling any more functions. All the options
have been provided.
Value          Meaning
NOTIFY DONE    Do not care about this event. However, other functions in the chain can be invoked.
NOTIFY OK      Event successfully handled. Other functions in the chain can be invoked.
NOTIFY STOP    Do not call any more functions.
NOTIFY BAD     Something went wrong. Stop calling any more functions.
Table 4.7: Status values returned by exception handlers that have subscribed
to the notify die mechanism. source : include/linux/notifier.h
4.3.1 Softirqs
A regular interrupt’s top-half handler is known as a hard IRQ. It is bound by a
large number of rules and constraints regarding what it can and cannot do. A
softirq, on the other hand, is a bottom half handler. There are two ways that it
can be invoked (refer to Figure 4.6).
Figure 4.6: Two paths lead to do softirq – one from a hard IRQ (a top-half handler) and one from system management kernel threads.
The first method (on the left) starts with a regular I/O interrupt (hard IRQ).
After basic interrupt processing, a softirq request is raised. This means that a
work parcel is created that needs to be executed later using a softirq thread.
It is important to call the function local bh enable after this such that the
processing of bottom-half threads like softirq threads is enabled.
Then at a later point of time the function do softirq is invoked whose job
is to check all the deferred work items and execute them one after the other
using specialized high-priority threads.
There is another mechanism of doing this type of work (the right path in the
figure). It is not necessary that top-half interrupt handlers raise softirq requests.
They can be raised by regular kernel threads that want to defer some work for
later processing. It is important to note that there may be more urgent needs
in the system that must be handled immediately. Hence, the less urgent kernel
work can be packaged as a deferred work item and stored as a softirq request.
A dedicated kernel thread called ksoftirqd runs periodically and checks for
pending softirq requests. These threads are called daemons. Daemons are ded-
icated kernel threads that typically run periodically and check/process pending
requests. ksoftirqd periodically follows the same execution path and calls the
function do softirq where it picks an item from a softirq queue and executes
a function to process it.
The net summary is that softirqs are generic mechanisms that can be used
by both top-half interrupt handlers and specialized kernel threads; both can
insert softirq requests in dedicated queues. They are processed later on when
CPU time is available.
Raising a softirq
Many kinds of interrupt handlers can raise softirq requests. They all invoke the
raise softirq function whenever they need to add a softirq request. Instead
of using a software queue, there is a faster method to record this information.
A fast method is to store a word in memory in the per-CPU region. Each bit
of this memory word corresponds to a specific type of softirq. If a
bit is set, then it means that a softirq request of the specific type is pending at
the corresponding CPU.
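A minimal user-space sketch of this idea follows (the names raise_softirq, do_softirq and the softirq types here are defined locally for illustration; they mimic, but are not, the kernel's own code):

#include <stdio.h>

enum { HI_SOFTIRQ, TIMER_SOFTIRQ, NET_TX_SOFTIRQ, NR_SOFTIRQS };
#define NR_CPUS 4

static unsigned long softirq_pending[NR_CPUS];   /* one word per CPU */

static void raise_softirq(int cpu, int nr)
{
    softirq_pending[cpu] |= 1UL << nr;           /* mark the request as pending */
}

static void do_softirq(int cpu)
{
    unsigned long pending = softirq_pending[cpu];
    softirq_pending[cpu] = 0;                    /* clear before processing */
    for (int nr = 0; nr < NR_SOFTIRQS; nr++)
        if (pending & (1UL << nr))
            printf("CPU %d: processing softirq %d\n", cpu, nr);
}

int main(void)
{
    raise_softirq(0, TIMER_SOFTIRQ);
    raise_softirq(0, NET_TX_SOFTIRQ);
    do_softirq(0);
    return 0;
}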
Here are examples of some types of softirqs (defined in include/linux/interrupt.h):
HI SOFTIRQ, TIMER SOFTIRQ, NET TX SOFTIRQ, BLOCK SOFTIRQ, SCHED SOFTIRQ
and HRTIMER SOFTIRQ. As the names suggest, for different kinds of inter-
rupts, we have different kinds of softirqs defined. Of course, the list is limited
and so is flexibility. However, the softirq mechanism was never meant to be very
generic in the first place. It was always meant to offload deferred work for a few
well-defined classes of interrupts and kernel tasks. It is not meant to be used
by device drivers.
Broad Overview
Figure 4.7: A work queue contains several wrappers, each of which wraps a worker pool; a worker pool consists of a set of threads, a linked list of active work items (work structs) and a set of inactive items.
Let us provide a brief overview of how a work queue works (refer to Fig-
ure 4.7).
A work queue is typically associated with a certain class of tasks such as high-
priority tasks, batch jobs, bottom halves, etc. This is not a strict requirement,
however, in terms of software engineering, this is a sensible decision.
Each work queue contains a bunch of worker pool wrappers that each wrap
a worker pool. Let us first understand what is a worker pool, and then we will
discuss the need to wrap it (create additional code to manage it). A worker
pool has three components: a set of inactive work items, a group of threads that
process the work in the pool, and a linked list of work items that need to be
processed (executed).
The main role of the worker pool is to basically store a list of work items
that need to be completed at some point of time in the future. Consequently,
it has a set of ready threads to perform this work and to also guarantee some
degree of timely completion. This is why, it maintains a set of threads that can
immediately be given a work item to process. A work item contains a function
pointer and the arguments of the function. A thread executes the function with
the arguments that are stored in the work item (referred to as a work struct
in the kernel code).
It may appear that all that we need for creating such a worker pool is a
bunch of threads and a linked list of work items. However, there is a little bit
of additional complexity here. It is possible that a given worker pool may be
overwhelmed with work. For instance, we typically associate a worker pool with
a CPU or a group of CPUs. It is possible that a lot of work is being added to
it and thus the linked list of work items ends up becoming very long. Hence,
there is a need to limit the size of the work that is assigned to a worker pool.
We do not want to traverse long linked lists.
An ingenious solution to limit the size of the linked list is as follows. We
tag some work items as active and put them in the linked list of work items
and tag the rest of the work items as inactive. The latter are stored in another
data structure, which is specialized for storing inactive work items (meant to
be processed much later). The advantage that we derive here is that for the
regular operation of the worker pool, we deal with smaller data structures.
Given that now there is an explicit size limitation, whenever there is an
overflow in terms of adding additional work items, we can safely store them in
the set of inactive items. When we have processed a sizeable number of active
items, we can bring in work items from the inactive list into the active list. It is
the role of the wrapper of a worker pool to perform this activity. Hence, there
is a need to wrap it.
The worker pool along with its wrapper can be thought of as one cohesive
unit. Now, we may need many such wrapped worker pools because in a large
system we shall have a lot of CPUs, and we may want to associate a worker
pool with each CPU or a group of CPUs. This is an elegant way of partitioning
the work and also doing some load-balancing.
Let us now look at the kernel code that is involved in implementing a work
queue.
(Figure: the work queue implementation spans kernel/workqueue.c and kernel/workqueue internal.h, and defines the workqueue struct, pool workqueue and worker pool structures, along with the kernel tasks that process the work.)
Let us start with a basic work item – the work struct data structure. Its fields are shown in
Listing 4.8.
There is not much to it. The member struct list head entry indicates
that this is a part of a linked list of work structs. This is per se not a field
that indicates the details of the operation that needs to be performed. The only
two operational fields of importance are data (data to be processed) and the
function pointer (func). The data field can be a pointer as well to an object that
contains all the information about the arguments. A work struct represents
the basic unit of work.
The advantage of work queues is that they are usable by third-party code and
device drivers as well. This is quite unlike threaded IRQs and softirqs that are
not usable by device drivers. Any entity can create a struct work struct and
insert it in a work queue. The work is executed later on when the kernel has the
bandwidth to execute such work.
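As an illustration, a driver might defer work using the standard work queue interface roughly as follows (a minimal sketch of a kernel-module-style snippet; the function names my_deferred_fn and raise_deferred_work are ours):

#include <linux/workqueue.h>
#include <linux/printk.h>

static void my_deferred_fn(struct work_struct *work)
{
    pr_info("deferred work running in process context\n");
}

/* Statically define and initialize a work item bound to my_deferred_fn. */
static DECLARE_WORK(my_work, my_deferred_fn);

/* Called, for example, from a top-half interrupt handler. */
static void raise_deferred_work(void)
{
    schedule_work(&my_work);   /* queue the item on the system work queue */
}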
Let us now take a deeper look at a worker pool (represented by struct
worker pool). Its job is to maintain a pool of worker threads that process work
queue items. Whenever a thread is required it can be retrieved from the pool.
There is no need to continually allocate new worker threads. We maintain a
pool of threads that are pre-initialized. Recall that we discussed the notion of
a pool in Section 3.2.11 and talked about its advantages in terms of eliminating
the time to allocate and deallocate objects.
Every worker pool has an associated CPU that it has an affinity with, a list of
work structs, a list of worker threads, and a mapping between each work struct
and the worker thread that it is assigned to. Whenever a new work item arrives,
a corresponding work struct is allocated, and a worker thread is assigned
to it.
The relationship between the pool of workers, the work queue and the worker
pool is shown in Figure 4.9. The apex data structure that represents the entire
work queue is the struct workqueue struct. It has a member, which is a
wrapper class called struct pool workqueue. It wraps the worker pool.
Let us explain the notion of a wrapper class. This wrapper class wraps the
worker pool (struct worker pool). This means that it intercepts every call to
the worker pool, checks it and appropriately modifies it. Its job is to restrict
the size of the pool and limit the amount of work it does at any point of time.
It is not desirable to overload a worker pool with a lot of work – it will perform
inefficiently. An overloaded pool also indicates that either the kernel itself is doing a lot of work
or that it is not distributing the work efficiently amongst the CPUs (different worker
pools).
The standard approach used by this wrapper class is to maintain two lists:
active and inactive. If there is more work than what the queue can handle, we
put the additional work items in the inactive list. When the size of the pool
reduces, items can be moved from the inactive list to the list of active work
items.
In addition, each CPU has two work queues: one for low-priority tasks and
one for high-priority tasks.
Listing 4.10 shows the code of a signal handler. Here, the handler function is
named handler; it takes as input a single argument: the number of the signal. Then
the function executes like any other function. It can make library calls and also
call other functions. In this specific version of the handler, we are making an
exit call. This kills the thread that is executing the signal handler. However,
this is not strictly necessary.
Let us assume that we did not make the call to the exit library function.
Then one of the following could have happened: if the signal blocked other signals
or interrupts, then their respective handlers would be executed; if the signal was
associated with process or thread termination (examples: SIGSEGV and SIGABRT),
then the respective thread (or thread group) would be terminated upon
returning from the handler. If the thread was not meant to be terminated,
then it resumes executing from the point at which it was paused and the signal
handler's execution began. From the thread's point of view this is like a regular
context switch.
Now let us look at the rest of the code. Refer to the main function. We need
to register the signal handler. This is done in Line 10. After that, we fork the
process. It is important to bear in mind that signal handling information is also
copied. In this case, for the child process its signal handler will be the copy of
the handler function in its address space. The child process prints that it is
the child and then goes into an infinite while loop.
The parent process on the other hand has more work to do. First, it waits
for the child to get fully initialized. There is no point in sending a signal to a
process that has not been fully initialized. Otherwise, it will ignore the signal.
It thus sleeps for 2 seconds, which is deemed to be enough. It then sends a signal
to the child using the kill library call that in turns makes the kill system call,
which is used to send signals to processes. In this case, it sends the SIGUSR1
signal. SIGUSR1 has no particular significance otherwise – it is meant to be
defined by user programs for their internal use.
When the parent process sends the signal, the child at that point of time is
stuck in an infinite loop. It subsequently wakes up and runs the signal handler.
The logic of the signal handler is quite clear – it prints the fact that it is the
child along with its process id and then makes the exit call. The parent in turn
waits for the child to exit, and then it collects the pid of the child process along
with its exit status. The WEXITSTATUS macro can be used to parse the exit value
(extract its lower 8 bits).
The output of the program clearly indicates that the child was stuck in
an infinite loop, that the parent then sent it a signal and waited for it to exit,
and that the child finally ran the signal handler and exited.
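A condensed sketch of such a program is shown below; the handler name, the messages and the 2-second sleep are illustrative choices of this sketch.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <sys/wait.h>

void handler(int signum) {
    printf("Child %d got signal %d\n", getpid(), signum);
    exit(1);                      /* kills the thread running the handler */
}

int main() {
    signal(SIGUSR1, handler);     /* register the handler */
    pid_t pid = fork();           /* the child inherits a copy of the handler */
    if (pid == 0) {
        printf("I am the child\n");
        while (1);                /* spin until the signal arrives */
    }
    sleep(2);                     /* let the child initialize */
    kill(pid, SIGUSR1);           /* kill() library call -> kill system call */
    int status;
    pid_t child = wait(&status);
    printf("Child %d exited with status %d\n", child, WEXITSTATUS(status));
    return 0;
}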
Please refer to Table 4.8 that shows some of the most common signals used
in the Linux operating system. Many of them can be handled and blocked.
However, signals like SIGSTOP and SIGKILL are never delivered to the process;
the kernel directly stops or kills the process, respectively.
In this sense, the SIGKILL signal is meant for all the threads of a multi-
threaded process. But, as we can see, in general, a signal is meant to be handled
only by a single thread in a thread group. There are different ways of sending a
signal to a thread group. One of the simplest approaches is the kill system call
that can send any given signal to a thread group. One of the threads handles
the signal. There are many versions of this system call. For example, the tkill
call can send a signal to a specific thread within a process, whereas the tgkill
call takes care of a corner case: the thread id specified in a tkill call may be
recycled, i.e., the thread completes and then a new thread is spawned with the
same id. This can lead to the signal being sent to the wrong thread. To guard
against this, the tgkill call takes an additional argument, the thread group id.
It is very unlikely that both ids will be recycled and still form the same
(thread group id, thread id) pair.
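As a small illustration (glibc has traditionally not provided a wrapper for tgkill, so the raw system call is used here; the tgid and tid values are assumed to have been obtained beforehand, e.g., via getpid() and gettid()):

#define _GNU_SOURCE
#include <signal.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Send SIGUSR1 to thread tid, but only if it still belongs to thread
   group tgid; a recycled tid in some other process makes the call fail. */
int send_to_thread(pid_t tgid, pid_t tid) {
    return syscall(SYS_tgkill, tgid, tid, SIGUSR1);
}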
Regardless of the method that is used, it is very clear that signals are sent
to a thread group; they are not meant to be sent to a particular thread unless
the tgkill call is used. Sometimes, there is an arithmetic exception in a thread
and thus there is a need to call the specific handler for that thread only. In this
case, it is neither possible nor advisable to call the handler associated with another
thread in the same thread group.
Furthermore, signals can be blocked as well as ignored. When a signal is
blocked and such a signal arrives, it is queued. All such queued/pending signals
are handled once they are unblocked. Here also there is a caveat: no two signals
of the same type can be pending for a process at the same time.
Moreover, when a signal handler executes, it blocks the corresponding signal.
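The following sketch shows how a user program might block a signal around a sensitive region using sigprocmask; SIGUSR1 and the region itself are illustrative (a multithreaded program would use pthread_sigmask instead).

#include <signal.h>

void run_with_sigusr1_blocked(void) {
    sigset_t set, oldset;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    sigprocmask(SIG_BLOCK, &set, &oldset);    /* block: arrivals become pending */
    /* ... sensitive region ... */
    sigprocmask(SIG_SETMASK, &oldset, NULL);  /* unblock: any pending SIGUSR1 is delivered now */
}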
There are several ways in which a signal can be handled.
The first option is to ignore the signal – it means that the signal is not
important, and no handler is registered for it. In this case, the signal can be
happily ignored. On the other hand, if the signal is important and can lead to
process termination, then the action that needs to be taken is to kill the process.
Examples of such signals are SIGKILL and SIGINT (refer to Table 4.8). There
can also be a case where the process is terminated, but an additional file called
a core dump is created. The core dump can be used by a debugger to inspect
the state of the process at the point at which it was stopped because of the receipt
of the signal. For instance, we can find the values of all the local variables, the
stack's contents and the memory contents.
We have already seen the process stop and resume signals earlier. Here, the
stop action is associated with suspending a process indefinitely until the resum-
ing action is initiated. The former corresponds to the SIGSTOP and SIGTSTP
signals, whereas the latter corresponds to the SIGCONT signal. It is important
to understand that, like SIGKILL, the SIGSTOP signal is never delivered to a handler
in the process; the kernel intercepts it and stops the process directly.
SIGKILL and SIGSTOP in particular cannot be ignored, handled or blocked.
Finally, the last method is to handle the signal by registering a handler. Note
that in many cases this may not be possible, especially if the signal arose because
of an exception. The same exception-causing instruction will execute after the
handler returns and again cause an exception. In such cases, process termination
or stopping that thread and spawning a new thread are good options. In some
cases, if the circumstances behind an exception can be changed, then the signal
handler can do so – for example, by remapping a memory page or changing the value of
a variable. Making such changes in a signal handler is quite risky and is only
meant for black belt programmers.
Let us now look at the relevant kernel code. The apex data structure in signal
handling is signal struct (refer to Listing 4.11). The information about the
signal handler is kept in struct sighand struct. The two important fields
that store the set of blocked/masked signals are blocked and real blocked.
They are of the type sigset t, which is nothing but a bit vector: one bit for
each signal. It is possible that a lot of signals have been blocked by the process
because it is simply not interested in them. All of these signals are stored in
the variable real blocked. Now, during the execution of any signal handler,
typically more signals are blocked, including the signal that is being handled.
All of these additional signals are added to the set real blocked; the expanded
set is called blocked. In other words, blocked is a superset of real blocked –
it contains all the signals that we do not want to handle while a signal handler
is executing. After the handler finishes executing, the kernel sets blocked =
real blocked.
struct sigpending stores the list of pending/queued signals that have not
been handled by the process yet. We will discuss its intricacies later.
Finally, consider the last field, which is quite interesting. A signal handler
may either use the stack of the thread that was interrupted or a different one.
Using the same stack poses no problem; alternatively, a different stack in the
thread's address space can be used, in which case its starting address and size
need to be specified. Using such an alternative stack, which is different from
the real stack that the thread was using, creates no correctness problem: the
original thread is stopped in any case, and thus the stack that is used does not matter.
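In user space, such an alternative stack is set up with the sigaltstack call and requested per handler via the SA_ONSTACK flag. A minimal sketch (the choice of SIGSEGV and the use of malloc are illustrative):

#include <signal.h>
#include <stdlib.h>

void setup_alt_stack(void (*handler)(int)) {
    stack_t ss;
    ss.ss_sp = malloc(SIGSTKSZ);   /* starting address of the alternative stack */
    ss.ss_size = SIGSTKSZ;         /* its size */
    ss.ss_flags = 0;
    sigaltstack(&ss, NULL);

    struct sigaction sa;
    sa.sa_handler = handler;
    sa.sa_flags = SA_ONSTACK;      /* run this handler on the alternative stack */
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
}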
Listing 4.12 shows the important fields in the main signal-related struc-
ture signal struct. It mainly contains process-related information such as the
number of active threads in the thread group, linked list of all the threads (in
the thread group), list of all the constituent threads that are waiting on the
wait system call, the last thread that processed a signal and the list of pending
signals (shared across all the threads in a thread group).
Let us now look at the next data structure – the signal handler.
Listing 4.13 shows the wrapper of signal handlers of the entire multithreaded
process. It actually contains a lot of information in these few fields. Note that
this structure is shared by all the threads in the thread group.
The first field count maintains the number of task struct s that use this
handler. The next field signalfd wqh is a queue of waiting processes. At this
stage, it is fundamental to understand that there are two ways of sending a
signal to a process. We have already seen the first approach, which involves
calling the signal handler directly. This is a straightforward approach and uses
the traditional paradigm of using callback functions, where a callback function
is a function pointer that is registered with the caller. In this case, the caller
(invoker) is the signal handling subsystem of the OS.
It turns out that there is a second mechanism, which is not used that widely.
As compared to the default mechanism, which is asynchronous (signal handlers
can be run any time), this is a synchronous mechanism. In this case, signal
handling is a planned process. It is not the case that signals can arrive at any
point of time, and then they need to be handled immediately. This notion is
captured in the field signalfd wqh. The idea is that the process registers a file
descriptor with the OS – we refer to this as the signalfd file. Whenever a signal
needs to be sent to the process, the OS writes the details of the signal to the
signalfd file. Processes, in this case, typically wait for signals to arrive and hence
need to be woken up. At their leisure, they can check the contents of
the file and process the signals accordingly.
Now, it is possible that multiple processes are waiting for something to be
written to the signalfd file. Hence, there is a need to create a queue of waiting
processes. This wait queue is the signalfd wqh field.
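A small sketch of this synchronous style using the signalfd system call is shown below; the choice of SIGUSR1 is illustrative. Note that the signal must first be blocked so that it is not delivered asynchronously.

#include <sys/signalfd.h>
#include <signal.h>
#include <unistd.h>
#include <stdio.h>

void read_one_signal(void) {
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGUSR1);
    sigprocmask(SIG_BLOCK, &mask, NULL);   /* stop normal (asynchronous) delivery */

    int fd = signalfd(-1, &mask, 0);       /* -1: create a new signalfd file */
    struct signalfd_siginfo si;
    /* read() blocks (the process sleeps on a wait queue) until a signal arrives */
    if (read(fd, &si, sizeof(si)) == sizeof(si))
        printf("Got signal %u from pid %u\n", si.ssi_signo, si.ssi_pid);
    close(fd);
}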
However, the more common method of handling signals is the regular
asynchronous mechanism. All that we need to store here is an array of 64
(_NSIG) signal handlers – 64 is the maximum number of signals that
Linux on x86 supports. Each signal handler is wrapped using the k sigaction
structure. On most architectures, this simply wraps the sigaction structure,
which we shall describe next.
struct sigaction
The important fields of struct sigaction are shown in Listing 4.14. The
fields are reasonably self-explanatory. sa handler is the function pointer in the
thread’s user space memory. flags represents the parameters that the kernel
uses to handle the signal such as whether a separate stack needs to be used or
not. Finally, we have the set of masked signals.
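On the user-space side, these fields are filled in via the sigaction(2) call. A small sketch (my_handler is a hypothetical function, and the choice of SIGINT, SIGTERM and SA_RESTART is illustrative):

#include <signal.h>
#include <string.h>

void my_handler(int sig);              /* assumed to be defined elsewhere */

void install_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = my_handler;        /* function pointer in user space */
    sigemptyset(&sa.sa_mask);
    sigaddset(&sa.sa_mask, SIGTERM);   /* additionally mask SIGTERM while the handler runs */
    sa.sa_flags = SA_RESTART;          /* flags interpreted by the kernel */
    sigaction(SIGINT, &sa, NULL);
}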
struct sigpending
The final data structure that we need to define is the list of pending signals
(struct sigpending). This data structure is reasonably complicated. It uses
some of the tricky features of linked lists, which we have very nicely steered
away from up till now.
struct sigpending {
    struct list_head list;
    sigset_t signal;
};

struct sigqueue {
    struct list_head list; /* Pointing to its current position in the
                              queue of sigqueues */
    kernel_siginfo_t info; /* signal number, signal source, etc. */
};
Refer to Figure 4.10. The structure sigpending wraps a linked list that
contains all the pending signals. The name of the list is as simple as it can be,
list. The other field of interest is signal that is simply a bit vector whose ith
bit is set if the ith signal is pending for the process. Note that this is why there
is a requirement that two signals of the same type can never be pending for the
same process.
Each entry of the linked list is of type struct sigqueue. Note that we
discussed in Appendix C that in Linux different kinds of nodes can be part
of a linked list. Hence, in this case we have the head of the linked list as a
structure of type sigpending, whereas all the entries are of type sigqueue. As
non-intuitive as this may seem, this is indeed possible in Linux’s linked lists.
Each sigqueue structure is a part of a linked list; hence, it is mandated to
have an element of type struct list head, which points to the neighboring list
elements (previous and next). Each entry encapsulates the
signal in the kernel siginfo t structure.
This structure contains the following fields: signal number, number of the
error or exceptional condition that led to the signal being raised, source of the
signal and the sending process’s pid (if relevant). This is all the information
that is needed to store the details of a signal that has been raised.
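To make this concrete, the following kernel-style sketch (assuming a variable pending of type struct sigpending *) walks the pending list; the list_for_each_entry macro internally uses container_of to go from the embedded list head back to the surrounding sigqueue entry.

struct sigqueue *q;
list_for_each_entry(q, &pending->list, list) {
    int signo = q->info.si_signo;   /* the signal carried by this entry */
    /* ... process the pending signal ... */
}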
struct ucontext {
unsigned long uc_flags ;
stack_t uc_stack ; /* user's stack pointer */
struct sigcontext uc_mcontext ; /* Snapshot of all the
                                   registers and the process's state */
};
struct rt sigframe keeps all the information required to store the context
of the thread that was signaled. The context per se is stored in the structure
struct ucontext. Along with some signal handling flags, it stores two vital
pieces of information: the pointer to the user thread’s stack and the snapshot
of all the user thread’s registers and its state. The stack pointer can be in
the same region of memory as the user thread’s stack or in a separate memory
region. Recall that it is possible to specify a separate address for storing the
signal handler’s stack.
The next argument is the signal information that contains the details of the
signal: its number, the relevant error code and the details of the source of the
signal.
The last argument is the most interesting. The question is where should the
signal handler return to? It cannot return to the point at which the original
thread stopped executing. This is because its context has not been restored
yet. Hence, we need to return to a special function that needs to do a host of
things such as restoring the user thread's context. Hence, here is the “million
dollar” idea: before launching the signal handler, we deliberately tweak the
return address so that control returns to a custom function that can restore the
user thread's context. Note that on x86 machines, the return address is stored
on the stack. All that we need to do is change it to point to a specific function,
which is the restore rt function in the glibc standard library.
When the signal handler returns, it starts executing the restore rt function.
This function does some bookkeeping and makes the all-important sigreturn
system call, which transfers control back to the kernel. Only the kernel can
restore the context of a process – this cannot be done in user space without
hardware support – hence, it is necessary to bring the kernel into the picture.
The kernel's system call handler copies the context stored on the user process's
stack to the kernel's address space using the copy from user function. The
context is then restored in exactly the same way as when a process is loaded
on a core: the context collected from user space is handed to the same kernel
subsystem, which restores the user thread's context. The kernel populates all
the registers of the user thread, including the PC and the stack pointer, and
the thread starts from exactly the same point at which it was paused to handle
the signal.
To summarize, a signal handler is like a small, short-lived process within a
process. It ceases to exist once the signal handling function finishes its
execution, and the original thread then resumes.
Exercises
Ex. 4 — The way that we save the context in the case of interrupts and system
calls is slightly different. Explain the nature of the difference. Why is this the
case?
Ex. 5 — If we want to use the same assembly language routine to store the
context after both an interrupt and a system call, what kind of support is
required (in SW and/or HW)?
Ex. 10 — What is the philosophy behind having sets like blocked and real-blocked
in signal-handling structures? Explain with examples.
Ex. 11 — How does the interrupt controller ensure real-time task execution
on an x86 system? It somehow needs to respect real-time process priorities.
Ex. 15 — What are the beneficial features of softirqs and threaded IRQs?
In this chapter, we will discuss two of the most important concepts in operating
systems, namely synchronization and scheduling. The first deals with managing
resources that are common to a bunch of processes or threads (shared between
them). There can be competition amongst the threads or processes to access such
a resource concurrently – this can lead to a race condition, and such
data races can lead to errors. As a result, only one of the processes should
access the shared resource at any point of time.
Once all such synchronizing conditions have been worked out, it is the role
of the operating system to ensure that all the computing resources namely the
cores and accelerators are optimally used. There should be no idleness or exces-
sive context switching. Therefore, it is important to design a proper scheduling
algorithm such that tasks can be efficiently mapped to the available compu-
tational resources. We shall see that there are a wide variety of scheduling
algorithms, constraints and possible scheduling goals. Given that there are such
a wide variety of practical use cases, situations and circumstances, there is no
one single universal scheduling algorithm that outperforms all the others. In
fact, we shall see that for different situations, different scheduling algorithms
perform very differently.
5.1 Synchronization
5.1.1 Introduction to Data Races
Consider the case of a multicore CPU. We want to do a very simple operation,
which is just to increment the value of the count variable that is stored in
memory. It is a regular variable and incrementing it should be easy. Listing 5.1
shows that it translates to three assembly-level instructions. We are showing C-
like code without the semicolon for the sake of enhancing readability. Note that
each line corresponds to one line of assembly code (or one machine instruction)
in this code snippet. count is a global variable that can be shared across
threads. t1 corresponds to a register (private to each thread and core). The
first instruction loads the variable count to a register, the second line increments
the value in the register and the third line stores the incremented value back
in the memory location corresponding to count.
This code is very simple, but when multiple threads execute it concurrently,
we can have several correctness problems.
Consider the scenario shown in Figure 5.1. Note again that we first load
the value into a register, then we increment the contents of the register and
finally save the contents of the register in the memory address corresponding to
the variable count. This makes a total of three instructions that are not executed
atomically – they execute at three different instants of time. Here there is a possibility
of multiple threads trying to execute the same code snippet at the same point
of time and also update count concurrently. This situation is called a data race
(a more precise and detailed definition follows later).
count = 0
Thread 1 Thread 2
t1 = count t2 = count
t1 = t1 + 1 t2 = t2 + 1
count = t1 count = t2
Figure 5.1: Incrementing the count variable in parallel (two threads). The
threads run on two different cores. t1 and t2 are thread-specific variables mapped to
registers
Before we proceed towards that and elaborate on how and why a data race
can be a problem, we need to list a couple of assumptions.
❶ The first assumption is that each basic statement in Listing 5.1 corre-
sponds to one line of assembly code, which is assumed to execute atomically.
This means that it appears to execute at a single instant of time.
❷ The second assumption here is that the delay between two instructions
can be indefinitely long (arbitrarily large). This could be because of hardware-
level delays or could be because there is a context switch and then the context
is restored after a long time. We cannot thus assume anything about the timing
of the instructions, especially the timing between consecutive instructions given
that there could be indefinite delays for the aforementioned reasons.
Now given these assumptions, let us look at the example shown in Figure 5.1
and one possible execution in Figure 5.2. Note that a parallel program can have
many possible executions. We are showing one of them, which is particularly
problematic. We see that the two threads read the value of the variable count
137 © Smruti R. Sarangi
count = 0
t2 = t2 + 1
count = t1 count = 1
count = t2 count = 1
Should be 2 5
Figure 5.2: An execution that leads to the wrong value of the count variable
Figure 5.4 shows the execution of the code snippet count++ by two threads.
Note the critical sections, the use of the lock and unlock calls. Given that the
critical section is protected with locks, there are no data races here. The final
value is correct: count = 2.
Thread 1 Thread 2
t1 = count
t1 = t1 + 1
count = t1
t2 = count
t2 = t2 + 1
count = t2
Figure 5.4: Two threads incrementing count by wrapping the critical section
within a lock-unlock call pair
When a waiting thread finds that the lock variable has changed back to 0 (free),
it tries to set it to 1 (test-and-set phase). In this case, it is inevitable that
there will be a competition or a race among the threads to acquire the lock (set
the value at the lock's address to 1). Regular reads or writes cannot be used to
implement such locks.
It is important to use an atomic synchronizing instruction that almost all
the processors provide as of today. For instance, we can use the test-and-set
instruction that is available on most hardware. This instruction checks the value
of the variable stored in memory and if it is 0, it atomically sets it to 1 (appears
to happen instantaneously). If it is able to do so successfully (0 → 1), it returns
a 1, else it returns 0. This basically means that if two threads are trying to set
the value of a lock variable from 0 to 1, only one of them will be successful. The
hardware guarantees this.
The test-and-set instruction returns 1 if it is successful, and it returns 0
if it fails (cannot set 0 → 1). Clearly we can extend the argument and observe
that if there are n threads that all want to convert the value of the lock variable
from 0 to 1, then only one of them will succeed. The thread that was successful
is deemed to have acquired the lock. For the rest of the threads that were
unsuccessful, they need to keep trying (iterating). This process is also known
as busy waiting, and a lock that involves busy waiting is also called a spin lock.
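A minimal spin lock along these lines can be sketched with C11 atomics. Note that atomic_flag_test_and_set returns the old value, so a return value of 0 (rather than 1) indicates that this thread performed the 0 → 1 transition and thus acquired the lock; the names below are illustrative.

#include <stdatomic.h>

atomic_flag lock_bit = ATOMIC_FLAG_INIT;   /* 0: free, 1: taken */

void spin_lock(void) {
    while (atomic_flag_test_and_set(&lock_bit))
        ;   /* busy wait (spin) until we see 0 and atomically set it to 1 */
}

void spin_unlock(void) {
    atomic_flag_clear(&lock_bit);   /* set it back to 0 (free) */
}

Since C11 atomic operations default to sequentially consistent ordering, these calls also act as fences, which ties into the point about fences discussed shortly.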
It is important to note that we are relying on a hardware instruction that
atomically sets the value in a memory location to another value and indicates
whether it was successful in doing so or not. There is a lot of theory around this
and there are also a lot of hardware primitives that play the role of atomic oper-
ations. Many of them fall in the class of read-modify-write (RMW) operations.
They read the value stored at a memory location, sometimes test if it satisfies
a certain property or not, and then they modify the contents of the memory
location accordingly. These RMW operations are typically used in implement-
ing locks. The standard method is to keep checking whether the lock variable
is free or not. The moment the lock is found to be free, threads compete to
acquire the lock using atomic instructions. Atomic instructions guarantee that
only one instruction is successful at a time. Once a thread acquires the lock, it
can proceed to safely access the critical section.
After executing the critical section, unlocking is quite simple. All that needs
to be done is to set the lock variable back to 0 (free). However, bear in mind
that if one takes a computer architecture course, one will realize that this is not
that simple. This is because all the memory operations that have been performed
in the critical section should be visible to all the threads running on other cores
once the lock has been unlocked. This normally does not happen because
architectures and compilers tend to reorder instructions. Also, it is possible that
the instructions in the critical section become visible to other threads before the
lock is fully acquired, unless additional precautions are taken. This is again an
unintended consequence of the reordering that is done by compilers and machines
for performance reasons. Such reordering needs to be kept in check.
Point 5.1.1
Fence instructions are expensive in terms of performance. Hence, we
need to minimize them. They are however required to ensure correct-
ness in multithreaded programs and to implement lock-unlock operations
correctly.
This is why most atomic instructions additionally act as fence instructions,
or a separate fence instruction is added by the library code to the lock/unlock
functions.
A happens-before relationship is thus created between this unlock operation and
the subsequent lock operation, which leads to the second update of the count
variable. Therefore, we can say that there is a
happens-before relationship between the first and second updates to the count
variable. Note that this relationship is a property of a given execution. In a
different execution, a different happens-before relationship may be visible. A
happens-before relationship by definition is a transitive relationship.
The moment we do not have such happens-before relationships between ac-
cesses, they are deemed to be concurrent. Note that in our example, such
happens-before relationships are being enforced by the lock/unlock operations
and their inherent fences. Happens-before order: updates in the critical section
→ unlock operation → lock operation → reads/writes in the second critical
section (so on and so forth). Encapsulating critical sections within lock-unlock
pairs creates such happens-before relationships. Otherwise, we have data races.
Such data races are clearly undesirable as we saw in the case of count++.
Hence, concurrent and conflicting accesses to the same shared variable should
not exist. With data races, it is possible that we may have hard-to-detect
bugs in the program. Also, data races have a much deeper significance in terms
of the correctness of the execution of parallel programs. At this point, we are not
in a position to appreciate all of this. All that can be said is that data-race-
free programs have a lot of nice and useful properties, which are very important
in ensuring the correctness of parallel programs. Hence, data races should be
avoided for a wide variety of reasons. Refer to the book by your author on
Advanced Computer Architecture [Sarangi, 2023] for a detailed explanation of
data races, and their implications and advantages.
Point 5.1.2
An astute reader may argue that there have to be data races in the code
to acquire the lock itself. However, those happen in a very controlled
manner, and they don’t pose a correctness problem. This part of the
code is heavily verified and is provably correct. The same cannot be said
about data races in regular programs.
Properly-Labeled Programs
Now, to avoid data races, it is important to create properly labeled programs.
In a properly labeled program, the same shared variable should be locked by
the same lock or the same set of locks. This will avoid concurrent accesses to
the same shared variable. For example, the situation shown in Figure 5.6 has
a data race on the variable C because it is not protected by the same lock in
both the cases. Hence, we may observe a data race because this program is
not properly labeled. This is why it is important that we ensure that the same
variable is protected by the same lock (could also be the same set of multiple
locks).
5.1.4 Deadlocks
Using locks sadly does not come for free; they can lead to a situation known as
deadlocks. A deadlock is defined as a situation where one thread is waiting on
another thread, that thread is waiting on another thread, so on and so forth –
Figure 5.6: A figure showing a situation with two critical sections. The first
is protected by lock X and the second is protected by lock Y . Address C is
common to both the critical sections. There may be a data race on address C.
we have a circular or cyclic wait. This basically means that in a deadlocked sit-
uation, no thread can make any progress. In Figure 5.7, we see such a situation
with locks.
Thread 1 Thread 2
Lock X Lock Y
Lock Y Lock X
It shows that one thread holds lock X, and it tries to acquire lock Y . On the
other hand, the second thread holds lock Y and tries to acquire lock X. There
is a clear deadlock situation here. It is not possible for any thread to make
progress because they are waiting on each other. This is happening because we
are using locks, and a thread cannot make any progress unless it acquires the
lock that it is waiting for. Code with locks may thus lead to such deadlocks,
which are characterized by circular waits. Let us elaborate.
There are four conditions for a deadlock to happen. This is why if a deadlock
is supposed to be avoided or prevented, one of these conditions needs to be
prevented/avoided. The conditions are as follows:
1. Mutual exclusion: A resource (such as a lock) can be held by only one
thread at a time.
2. Hold-and-wait: A thread holds one resource while it waits to acquire
another.
3. No preemption: A resource cannot be forcibly taken away from the thread
that currently holds it.
4. Circular wait: As we can see in Figure 5.7, all the threads are waiting
on each other and there is a circular or cyclic wait. A cyclic wait ensures
that no thread can make any progress.
Consider the classic dining philosophers problem: philosophers sit around a
table, each needs the two forks adjacent to them in order to eat, and a deadlock
arises when every philosopher has picked up one fork and waits for the remaining
fork to be put back on the table. These forks have sadly been picked up from the table
by their respective neighbors. Clearly, a circular wait has been created. Let
us look at the rest of the deadlock conditions, which are no preemption and
hold-and-wait, respectively. Clearly, mutual exclusion will always have to hold
because a fork cannot be shared between neighbors at the same moment of time.
difficult because the neighbor can also do the same. Designing a protocol around
this idea seems to be difficult. Let us try to relax hold-and-wait. A philosopher
may give up after a certain point of time and put the fork that he has acquired
back on the table. Again creating a protocol around this appears to be difficult
because it is very easy to get into a livelock.
Hence, the simplest way of dealing with this situation is to try to avoid the
circular wait condition. In this case, we would like to introduce the notion of
asymmetry, where we can change the rules for just one of the philosophers. Let
us say that the default algorithm is that a philosopher picks the left fork first
and then the right one. We change the rule for one of the philosophers: he
acquires his right fork first and then the left one.
It is possible to show that a circular wait cannot form. Let us number the
philosophers from 1 to n. Assume that the nth philosopher is the one that has
the special privilege of picking up the forks in the reverse order (first right and
then left). In this case, we need to show that a cyclic wait can never form.
Assume that a cyclic wait has formed. This means that every philosopher (other
than the last one) has picked up the left fork and is waiting for the right fork to be
put on the table. This is the case for philosophers 1 to n − 1. Consider what
is happening between philosophers n − 1 and n. The (n − 1)th philosopher
picks up his left fork and waits for the right one. The fact that he is waiting basically
means that the nth philosopher has picked it up – it is his left fork. This means
that he has also picked up his right fork, because he picks up the forks in the
reverse order: first the right fork and then the left one. The nth philosopher has
thus acquired both forks and is eating his food; he is not waiting. We therefore
do not have a deadlock situation here.
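A sketch of this asymmetric scheme using pthread mutexes (one mutex per fork) is shown below; the array name and the indexing convention are illustrative.

#include <pthread.h>

#define N 5
pthread_mutex_t fork_mtx[N];   /* one mutex per fork; initialized in main */

/* Philosophers 0 .. N-2 pick the left fork first; philosopher N-1
   picks the forks up in the reverse order, breaking the circular wait. */
void pick_forks(int i) {
    int left = i, right = (i + 1) % N;
    if (i == N - 1) {
        pthread_mutex_lock(&fork_mtx[right]);
        pthread_mutex_lock(&fork_mtx[left]);
    } else {
        pthread_mutex_lock(&fork_mtx[left]);
        pthread_mutex_lock(&fork_mtx[right]);
    }
}

void put_forks(int i) {
    pthread_mutex_unlock(&fork_mtx[i]);
    pthread_mutex_unlock(&fork_mtx[(i + 1) % N]);
}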
Note that starvation freedom implies deadlock freedom – a starvation-free algorithm does not
allow processes to deadlock and wait forever. However, the converse is not true:
deadlock freedom does not imply starvation freedom because starvation freedom is a
much stronger guarantee.
A related situation is a livelock, where processes continuously take steps
and execute statements but do not make any tangible progress. This means
that even if processes continually change their state, they do not reach the final
end state – they continually cycle between interim states. Note that they are not
in a deadlock in the sense that they can still take some steps and keep changing
their state. However, the states do not converge to the final state, which would
indicate a desirable outcome.
For example, consider two people trying to cross each other in a narrow
corridor. A person can either be on the left side or on the right side of the
corridor. So it is possible that both are on the left side, and they see each other
face to face. Hence, they cannot cross each other. Then they decide to either
stay there or move to the right. It is possible that both of them move to the
right side at the same point of time, and they are again face to face. Again
they cannot cross each other. This process can continue indefinitely. In this
case, the two people can keep moving from left to right and back. However,
they are not making any progress because they are not able to cross each other.
This situation is a livelock, where threads move in terms of changing states, but
nothing useful gets ultimately done.
Listing 5.2: Code to create two pthreads and collect their return values
# include < stdio .h >
# include < pthread .h >
# include < stdlib .h >
# include < unistd .h >
In the main function, two pthreads are created. The arguments to the
pthread create function are a pointer to the pthread structure, a pointer to
a pthread attribute structure that shall control its behavior (NULL in this ex-
ample), the function pointer that needs to be executed and a pointer to its sole
argument. If the function takes multiple arguments, then we need to put all of
them in a structure and pass a pointer to that structure.
The return value of the func function is quite interesting. It is a void *,
which is a generic pointer. In our example, it is a pointer to an integer that is
equal to 2 times the thread id. When a pthread function (like func) returns, akin
to a signal handler, it returns to the address of a special routine that cleans
up the state and tears down the thread. Once the
thread finishes, the parent thread that spawned it can wait for it to finish using
the pthread join call.
This is similar to the wait call invoked by a parent process, when it waits
for a child to terminate in the regular fork-exec model. In the case of a regular
process, we collect the exit code of the child process. However, in the case
of pthreads, the pthread join call takes two arguments: the pthread, and the
address of a pointer variable (&result). The value filled in the address is exactly
the pointer that the pthread function returns. We can proceed to dereference
the pointer and extract the value that the function wanted to return.
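A condensed sketch along these lines is shown below; allocating the return value on the heap is a choice made for this sketch.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *func(void *arg) {
    int id = *(int *)arg;
    int *ret = malloc(sizeof(int));
    *ret = 2 * id;              /* return value: 2 times the thread id */
    return ret;                 /* collected later via pthread_join */
}

int main(void) {
    pthread_t threads[2];
    int ids[2] = {1, 2};
    for (int i = 0; i < 2; i++)
        pthread_create(&threads[i], NULL, func, &ids[i]);

    for (int i = 0; i < 2; i++) {
        void *result;
        pthread_join(threads[i], &result);   /* akin to wait() for processes */
        printf("Thread %d returned %d\n", ids[i], *(int *)result);
        free(result);
    }
    return 0;
}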
Given that we have now created a mechanism to create pthread functions
that can be made to run in parallel, let us implement a few concurrent algo-
rithms. Let us try to increment a count.
Let us now use the CAS method to increment count (code shown in List-
ing 5.6).
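A sketch in the same spirit, using the C11 atomic_compare_exchange_strong function, is shown below (the function refreshes the expected value on failure, so the loop simply retries).

#include <stdatomic.h>

atomic_int count = 0;

void cas_increment(void) {
    int old = atomic_load(&count);
    /* retry until no other thread modified count between the read and the CAS */
    while (!atomic_compare_exchange_strong(&count, &old, old + 1))
        ;   /* on failure, old now holds the latest value of count */
}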
Hypothetical sequential order (queue contents after each operation): {1} {3,1} {3} {} {2} {} {4} {5,4} {5}
Figure 5.9: A parallel execution and its equivalent sequential execution. Every
event has a distinct start time and end time. In this figure, we assume that we
know the completion time. We arrange all the events in ascending order of their
completion times in a hypothetical sequential order at the bottom. Each point
in the sequential order shows the contents of the queue after the respective
operation has completed. Note that the terminology enq: 3 means that we
enqueue 3, and similarly deq: 4 means that we dequeue 4.
The key question that needs to be answered is where this point of completion
lies vis-à-vis the start and end points. If it always lies between them, then
we can always claim that before a method call ends, it is deemed to have fully
completed – its changes to the global state are visible to all the threads. This
is a very strong correctness criterion for a parallel execution. We are, of course,
assuming that the equivalent sequential execution is legal. This correctness
criterion is known as linearizability.
Linearizability
Linearizability is the de facto criterion used to prove the correctness of con-
current data structures that are of a non-blocking nature. If all the executions
corresponding to a concurrent algorithm are linearizable, then the algorithm
itself is said to satisfy linearizability. In fact, the execution shown in Figure 5.9
is linearizable.
This notion of linearizability is summarized in Definition 5.1.1. Note that
the term “physical time” in the definition refers to real time that we read off a
wall clock. Later on, while discussing progress guarantees, we will see that the
notion of physical time has limited utility. We would alternatively prefer to use
the notion of a logical time instead. Nevertheless, let us stick to physical time
for the time being.
Now, let us address the last conundrum. Even if the completion times are not
known, which is often the case, as long as we can show that distinct completion
points appear to exist for each method (between its start and end), the execution
is deemed to be linearizable. Mere existence of completion points is what needs
to be shown. Whether the method actually completes at that point or not is
unimportant. This is why we keep using the word “appears” throughout the
definitions.
Figure 5.10: A write is placed in the CPU's write buffer and, after a delay, is written to the L1 cache.
Now consider the other case when the point of completion may be after the
end of a method. For obvious reasons, it cannot be before the start point of
a method. An example of such an execution, which is clearly atomic but not
linearizable, is a simple write operation in multicore processors (see Figure 5.10).
The write method returns when the processor has completed the write operation
and has written it to its write buffer. This is also when the write operation is
removed from the pipeline. However, that does not mean that the write operation has
completed. It completes when it is visible to all the threads, which can happen
much later – when the write operation leaves the write buffer and is written to
a shared cache. This is thus a case when the completion time is beyond the end
time of the method. The word “beyond” is being used in the sense that it is
“after” the end time in terms of the real physical time.
We now enter a world of possibilities. Let us once again consider the simple
read and write operations that are issued by cores in a multicore system.
Sequential Consistency
Along with atomicity, SC mandates that in the equivalent sequential order of
events, methods invoked by the same thread appear in program order. The
program order is the order of instructions in the program that will be perceived
by a single-cycle processor, which will pick an instruction, execute it completely,
proceed to the next instruction, so on and so forth. SC is basically atomicity +
intra-thread program order.
Consider the following execution. Assume that x and y are initialized to 0.
They are global variables. t1 and t2 are local variables. They are stored in
registers (not shared across threads).
Thread 1 Thread 2
x=1 y=1
t1 = y t2 = x
Note that if we run this code many times on a multicore machine, we shall
see different outcomes. It is possible that Thread 1 executes first and completes
both of its instructions and then Thread 2 is scheduled on another core, or vice
versa, or their execution is interleaved. Regardless of the scheduling policy, we
will never observe the outcome t1 = t2 = 0 if the memory model is SC or
linearizability. The reason is straightforward. All SC and linearizable executions
respect the per-thread order of instructions, and the first instruction of each
thread is a write. If Thread 1 reads y = 0, then its read precedes the write
y = 1 in the equivalent sequential order; since x = 1 precedes the read of y in
program order, x = 1 also precedes y = 1, which in turn precedes Thread 2's
read of x. That read must therefore return 1. Hence, t1 and t2 cannot both be 0.
Point 5.1.3
All linearizable executions are also sequentially consistent. All sequen-
tially consistent executions also satisfy the requirements of weak memory
models. Note that the converse is not true.
In release consistency (RC), the acquire operation corresponds to acquiring a
lock. Note that an acquire is weaker than a full fence, which also specifies the
ordering of operations before the fence (in program order). Similarly, the release
operation corresponds to lock release. As per RC, the release operation
can complete only if all the operations before it have fully completed. Again,
this also makes sense, because when we release the lock, we want the rest of the
threads to see all the changes that have been made in the critical section.
Obstruction Freedom
Obstruction freedom basically says that in an n-thread system,
if we put any set of (n − 1) threads to sleep, then the only thread that is active
will be able to complete its execution in a bounded number of internal steps.
This means that we cannot use locks because if the thread that has acquired
the lock gets swapped out or goes to sleep, no other thread can complete the
operation.
Wait Freedom
Now, let us look at another progress guarantee, which is at the other end of
the spectrum. It is known as wait freedom. In this case, we avoid all forms
of starvation. Every thread completes the operation within a bounded number
of internal steps. So in this case, starvation is not possible. The code shown
in Listing 5.4 is an example of a wait-free algorithm because regardless of the
number of threads and the amount of contention, it completes within a bounded
number of internal steps. However, the code shown in Listing 5.6 is not a wait-
free algorithm. This is because there is no guarantee that the compare and swap
will be successful in a bounded number of attempts. Thus, we cannot guarantee
wait freedom. However, this code is obstruction free because if any set of (n − 1)
threads go to sleep, then the only thread that is active will succeed in the CAS
operation and ultimately complete the overall operation in a bounded number
of steps.
Lock Freedom
Given that we have now defined what an obstruction-free and a wait-free al-
gorithm is, we can now tackle the definition of lock freedom, which is slightly
more complicated. In this case, let us count the cumulative number of steps
that all the n threads in the system execute. We have already mentioned that
there is no correlation between the time it takes to complete an internal step
across the n threads. Nevertheless, we can still take a system and count
the cumulative number of internal steps taken by all the threads together. Lock
freedom basically says that if this cumulative number is above a certain thresh-
old or a bound, then we can say for sure that at least one of the operations has
completed successfully. Note that in this case, we are saying that at least one
thread will make progress and there can be no deadlocks.
All the threads also cannot get stuck in a livelock. However, there can be
starvation because we are taking a system-wide view and not a thread-specific
view here. As long as one thread makes progress by completing operations, we
do not care about the rest of the threads. This was not the case in wait-free
algorithms. The code shown in Listing 5.6 is lock free, but it is not wait free.
The reason is that the compare and exchange has to be successful for at least one
of the threads and that thread will successfully move on to complete the entire
count increment operation. The rest of the threads will fail in that iteration.
However, that is not of great concern here because at least one thread achieves
success.
It is important to note that every program that is wait free is also lock free.
This follows from the definition of lock freedom and wait freedom, respectively.
If we are saying that in less than k internal steps, every thread is guaranteed
to complete its operation, then in nk system-wide steps, at least one thread is
guaranteed to complete its operation. By the pigeonhole principle, at least one
thread must have taken k steps and completed its operation. Thus wait freedom
implies lock freedom.
Similarly, every program that is lock free is also obstruction free, which again
follows very easily from the definitions. This is the case because we are saying
that if the system as a whole takes a certain number of steps (let’s say k ′ ), then
at least one thread successfully completes its operation. Now, if n − 1 threads
in the system are quiescent, then only one thread is taking steps and within k ′
steps it has to complete its operation. Hence, the algorithm is obstruction free.
Obstruc�on free
Lock free
Wait free
Figure 5.11: Venn diagram showing the relationship between different progress
guarantees
However, the converse is not true in the sense that it is possible to find a
lock-free algorithm that is not wait free and an obstruction free algorithm that is
not lock free. This can be visualized in a Venn diagram as shown in Figure 5.11.
All of these algorithms cannot use locks. They are thus broadly known as non-
blocking algorithms even though they provide very different kinds of progress
guarantees.
An astute reader may ask why not use wait-free algorithms every time be-
cause after all there are theoretical results that say that any algorithm can be
converted to a parallel wait-free variant, which is also provably correct. This
is true; however, wait-free algorithms tend to be very slow and are also
very difficult to write and verify. Hence, in most practical cases, a lock-free
implementation is much faster and is far easier to code and verify. In general,
obstruction freedom is too weak as a progress guarantee. Thus, it is hard to find
a practical system that uses an obstruction-free algorithm. In most practical
systems, lock-free algorithms are used, which optimally trade off performance,
correctness and complexity.
There is a fine point here. Many authors have replaced the bounded property
in the definitions with finite. The latter property is more theoretical and often
does not gel well with practical implementations. Hence, we have decided not
to use it in this book. We will continue with bounded steps, where the bound
can be known in advance.
5.1.8 Semaphores
Let us now consider another synchronization primitive called a semaphore. We
can think of it as a generalization of a lock. It is a more flexible variant of a
lock, which admits more than two states. Recall that a lock has just two states:
locked and unlocked.
pthread_cond_t cond ;
pthread_cond_init (& cond , NULL ) ;
Point 5.1.4
A condition variable is not a semaphore. A semaphore has a notion of
memory – it stores a count. The count can be incremented even if there
is no waiting thread. However, in the case of a condition variable, there
is a much stronger coupling. Whenever a pthread signal or broadcast
call is made, the threads that are waiting on the condition variable at
that exact point of time are woken up. Condition variables do not per se
have a notion of memory. They don’t maintain any counts. They simply
act as a rendezvous mechanism (meeting point) between signaling and
waiting threads. Hence, in this case, it is possible that a signal may be
made but at that point of time there is no waiting thread, and thus the
signal will be lost. This is known as the lost wakeup problem.
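The standard way to avoid the lost wakeup problem is to pair the condition variable with a mutex and an explicit predicate that remembers the event, as in the following sketch (the names mtx, cond and ready are illustrative).

#include <pthread.h>

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int ready = 0;   /* the "memory" lives in this predicate, not in cond */

void waiter(void) {
    pthread_mutex_lock(&mtx);
    while (!ready)                       /* re-check: guards against lost and
                                            spurious wakeups */
        pthread_cond_wait(&cond, &mtx);  /* atomically releases mtx and sleeps */
    pthread_mutex_unlock(&mtx);
}

void signaler(void) {
    pthread_mutex_lock(&mtx);
    ready = 1;                           /* record the event before signaling */
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mtx);
}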
Clearly, a reader and writer cannot operate concurrently at the same point
of time without synchronization because of the possibility of data races.
We thus envision two smaller locks as a part of the locking mechanism: a
read lock and a write lock. The read lock allows multiple readers to operate in
parallel on a concurrent object, which means that we can invoke a read method
concurrently. We need a write lock that does not allow any other readers or
writers to work on the queue concurrently. It just allows one writer to change
the state of the queue.
void get_write_lock () {
    LOCK ( __rwlock ) ;
}

void release_write_lock () {
    UNLOCK ( __rwlock ) ;
}

void get_read_lock () {
    LOCK ( __rdlock ) ;
    if ( readers == 0) LOCK ( __rwlock ) ;
    readers ++;
    UNLOCK ( __rdlock ) ;
}

void release_read_lock () {
    LOCK ( __rdlock ) ;
    readers - -;
    if ( readers == 0)
        UNLOCK ( __rwlock ) ;
    UNLOCK ( __rdlock ) ;
}
The code for the locks is shown in Listing 5.10. We are assuming two macros
LOCK and UNLOCK. They take a lock (mutex) as their argument, and invoke the
methods lock and unlock, respectively. We use two internal locks: __rwlock (for both
readers and writers) and __rdlock (only for readers). The __ prefix signifies
that these are internal locks within the reader-writer lock. These locks are
meant for implementing the logic of the reader-writer lock, which provides two
key functionalities: get or release a read lock (allow a process to only read),
and get or release a write lock (allow a process to read/write). Even though the
names appear similar, the internal locks are very different from the functionality
that the composite reader-writer lock provides, which is providing a read lock
(multiple readers) and a write lock (single writer only).
Let’s first look at the code of a writer. There are two methods that it
can invoke: get write lock and release write lock. In this case, we need a
global lock that needs to stop both reads and writes from proceeding. This is
why in the function get write lock, we wait on the lock rwlock.
The read lock, on the other hand, is slightly more complicated. Refer to
the function get read lock in Listing 5.10. We use another mutex lock called
rdlock. A reader waits to acquire it. The idea is to maintain a count of the
number of readers. Since there are concurrent updates to the readers variable
involved, it needs to be protected by the rdlock mutex. After acquiring
rdlock, it is possible that the lock acquiring process may find that a writer is
active. We need to explicitly check for this by checking if the number of readers,
readers, is equal to 0 or not. If it is equal to 0, then it means that other readers
are not active – a writer could be active. Otherwise, it means that other readers
are active, and a writer cannot be active.
If readers = 0 we need to acquire rwlock to stop writers. The rest of the
method is reasonably straightforward. We increment the number of readers and
finally release rdlock such that other readers can proceed.
Releasing the read lock is also simple. We subtract 1 from the number of
readers after acquiring rdlock. Now, if the number of readers becomes equal
to 0, then there is no reason to hold the global rwlock. It needs to be released
such that writers can potentially get a chance to complete their operation.
A discerning reader at this point of time will clearly see that if readers are
active, then new readers can keep coming in and the waiting write operation
will never get a chance. This means that there is a possibility of starvation.
Because readers may never reach 0, rwlock will never be released by the
reader holding it. The locks themselves could be fair, but overall we cannot
guarantee fairness for writes. Hence, this version of the reader-writer lock’s
design needs improvement. Starvation-freedom is needed, especially for write
operations. Various solutions to this problem are proposed in the reference
[Herlihy and Shavit, 2012].
Consider a parallel program that computes a large sum by splitting the work
across threads (map phase). The master thread waits for all the threads to
finish so that it can collect all the partial sums and add them to produce the
final result (reduce phase). This is a rendezvous point insofar as all the threads
are concerned, because all of them need to reach this point before they can
proceed to do other work. Such a point arises very commonly in a lot of
scientific kernels that involve linear algebra.
Hence, it is very important to optimize such operations, which are known
as barriers. Note that this barrier is different from a memory barrier (discussed
earlier), which is a fence operation. They just happen to share the same name
(unfortunately so). We can conceptually think of a barrier as a point that
stops threads from progressing unless all the threads that are a part of the
thread group associated with the barrier reach it (see Figure 5.12). Almost
all programming languages, especially parallel programming languages provide
support for barriers. In fact, supercomputers have special dedicated hardware
for barrier operations. They can be realized very quickly, often in less than a
few milliseconds.
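The pthread library exposes this primitive directly. A small sketch (NTHREADS and the worker body are illustrative):

#include <pthread.h>

#define NTHREADS 4
pthread_barrier_t barrier;

void *worker(void *arg) {
    /* ... compute a partial result (e.g., a partial sum) ... */
    pthread_barrier_wait(&barrier);   /* block until all NTHREADS arrive */
    /* ... everyone's partial results can safely be used from here on ... */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}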
There is a more flexible version of a barrier known as a phaser (see Fig-
ure 5.13). It is somewhat uncommon, but many languages like Java define them
and in many cases they prove to be very useful. In this case, we define two
points in the code: Point 1 and Point 2. The rule is that no thread can cross
Point 2 unless all the threads have arrived at Point 1. Point 1 is a point in
the program, which in a certain sense precedes Point 2 or is before Point 2 in
program order. Often when we are pipelining computations, there is a need for
using phasers. We want some amount of work to be completed before some new
work can be assigned to all the threads. Essentially, we want all the threads to
complete the phase prior to Point 1, and enter the phase between Points 1 and
2, before a thread is allowed to enter the phase that succeeds Point 2.
5.2 Queues
Let us now see how to use all the synchronization primitives introduced in
Section 5.1.
One of the most important data structures in a complex software system like
an OS is a queue. All practical queues have a bounded size. Hence, we shall not
differentiate between a queue and a queue with a maximum or bounded size.
Typically, to communicate messages between different subsystems, queues are
Producers Consumers
Figure 5.14: A bounded queue
Using this restriction (a single enqueuer and a single dequeuer), it turns out
that we can easily create a wait-free queue. There is no need to use any locks –
operations complete within bounded time.
Listing 5.11: A simple wait-free queue with one enqueuer and one dequeuer
# define BUFSIZE 10
# define INC ( x ) (( x +1) % BUFSIZE )
# define NUM 25
void nap () {
struct timespec rem ;
int ms = rand () % 100;
struct timespec req = {0 , ms * 1000 * 1000};
nanosleep (& req , & rem ) ;
}
return 0; /* success */
}
int deq () {
int cur_head = atomic_load (& head ) ;
int cur_tail = atomic_load (& tail ) ;
int new_head = INC ( cur_head ) ;
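For reference, a condensed sketch of the enq and deq operations under the same single-enqueuer/single-dequeuer restriction is shown below; it uses the BUFSIZE and INC macros defined above, and the convention that -1 signals a full or empty queue is an assumption of this sketch.

#include <stdatomic.h>

int queue[BUFSIZE];
atomic_int head = 0, tail = 0;   /* head: next entry to dequeue; tail: next free slot */

int enq(int val) {               /* called only by the single enqueuer */
    int cur_tail = atomic_load(&tail);
    int cur_head = atomic_load(&head);
    if (INC(cur_tail) == cur_head)
        return -1;               /* queue is full */
    queue[cur_tail] = val;
    atomic_store(&tail, INC(cur_tail));   /* publish the new entry */
    return 0;                    /* success */
}

int deq(void) {                  /* called only by the single dequeuer */
    int cur_head = atomic_load(&head);
    int cur_tail = atomic_load(&tail);
    if (cur_head == cur_tail)
        return -1;               /* queue is empty */
    int val = queue[cur_head];
    atomic_store(&head, INC(cur_head));
    return val;
}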
The main function creates two threads. The odd-numbered thread enqueues
by calling enqfunc, and the even-numbered thread dequeues by calling deqfunc.
These functions invoke the enq and deq functions, respectively, NUM times. Be-
tween iterations, the threads take a nap for a random duration.
The exact proof of wait freedom can be found in textbooks on this topic
such as the book by Herlihy and Shavit [Herlihy and Shavit, 2012]. Given that
there are no loops, we don’t have a possibility of looping endlessly. Hence, the
enqueue and dequeue operations will complete in bounded time. The proof of
linearizability and correctness needs more understanding and thus is beyond the
scope of this book.
Note the use of atomics. They are a staple of modern programming languages
such as C++20 and other recent languages. Along with atomic load and store
operations, the library provides many more functions such as atomic_fetch_add,
atomic_flag_test_and_set and atomic_compare_exchange_strong. Depending
upon the architecture and the function arguments, their implementations
come with different memory ordering guarantees (they embed different kinds of fences).
int deq () {
int val ;
do {
LOCK ( qlock ) ;
if ( tail == head ) val = -1;
else {
val = queue [ head ];
head = INC ( head ) ;
}
UNLOCK ( qlock ) ;
} while ( val == -1) ;
return val ;
}
int main () {
...
pthread_mutex_init (& qlock , NULL ) ;
...
pthread_mutex_destroy (& qlock ) ;
}
int main () {
sem_init (& qlock , 0 , 1) ;
...
sem_destroy (& qlock ) ;
}
Listing 5.14: A queue with semaphores but does not have busy waiting
# define WAIT ( x ) ( sem_wait (& x ) )
# define POST ( x ) ( sem_post (& x ) )
sem_t qlock , empty , full ;

int enq ( int val ) {
    WAIT ( empty ) ;   /* wait for a free entry */
    WAIT ( qlock ) ;
    queue [ tail ] = val ;   /* critical section */
    tail = INC ( tail ) ;
    POST ( qlock ) ;
    POST ( full ) ;
    return 0; /* success */
}

int deq () {
    WAIT ( full ) ;    /* wait for at least one entry */
    WAIT ( qlock ) ;
    int val = queue [ head ];   /* critical section */
    head = INC ( head ) ;
    POST ( qlock ) ;
    POST ( empty ) ;
    return val ;
}
int main () {
sem_init (& qlock , 0 , 1) ;
sem_init (& empty , 0 , BUFSIZE ) ;
sem_init (& full , 0 , 0) ;
...
sem_destroy (& qlock ) ;
sem_destroy (& empty ) ;
sem_destroy (& full ) ;
}
We use three semaphores here. We still use qlock, which is needed to pro-
tect the shared variables. Additionally, we use the semaphore empty that is
initialized to BUFSIZE (maximum size of the queue) and the full semaphore
that is initialized to 0. These will be used for waking up threads that are wait-
ing. We define the WAIT and POST macros that wrap sem wait and sem post,
respectively.
Consider the enq function. We first wait on the empty semaphore. There
need to be free entries available. Initially, we have BUFSIZE free entries. Every
time a thread waits on the semaphore, it decrements the number of free entries
by 1 until the count reaches 0. After that the thread waits. Then we enter
the critical section that is protected by the binary semaphore qlock. There is
no need to perform any check on whether the queue is full or not. We know
that it is not full because the thread successfully acquired the empty semaphore.
This means that at least one free entry is available in the array. After releasing
qlock, we signal the full semaphore. This indicates that an entry has been
added to the queue.
Let us now look at the deq function. It follows the reverse logic. We start
out by waiting on the full semaphore. There needs to be at least one entry
in the queue. Once this semaphore has been acquired, we are sure that there
is at least one entry in the queue, and it will remain there until it is dequeued
(property of the semaphore). The critical section again need not have any
checks regarding whether the queue is empty or not. It is protected by the
qlock binary semaphore. Finally, we complete the function by signaling the
empty semaphore. The reason for this is that we are removing an entry from
the queue, or creating one additional free entry. Waiting enqueuers will get
signaled.
Note that there is no busy waiting. Threads either immediately acquire the
semaphore if the count is non-zero or are swapped out. They are put in a wait
queue inside the kernel. They thus do not monopolize CPU resources and more
useful work is done. We are also utilizing the natural strength of semaphores.
int peak () {
/* This is a read function */
get_read_lock () ;
int val = ( head == tail ) ? -1 : queue [ head ];
release_read_lock () ;
return val ;
}
int enq (int val) {
    WAIT(empty);
    get_write_lock();
    queue[tail] = val;
    tail = INC(tail);
    release_write_lock();   /* counterpart of release_read_lock */
    POST(full);
    return 0; /* success */
}
int deq () {
    int val;
    WAIT(full);
    get_write_lock();
    val = queue[head];
    head = INC(head);
    release_write_lock();
    POST(empty);
    return val;
}
The code of the enq and deq functions remains more or less the same. We
wait and signal the same set of semaphores: empty and full. The only difference
is that we do not acquire a generic lock, but we acquire the write lock using the
get write lock function.
It is just that we are using a different set of locks for the peak function and
the enq/deq functions. We allow multiple readers to work in parallel.
For a large number of data structures, writing correct and efficient lock-free code is
very difficult, and writing wait-free code is even more difficult. Hence, a large
part of the kernel still uses regular spinlocks; however, they come with a twist.
Kernel spinlocks are regular spinlocks that rely on busy waiting, but they come with a
few additional restrictions. Unlike regular mutexes that are used in user space,
the thread holding the lock is not allowed to go to sleep or get swapped out.
This means that interrupts need to be disabled in the critical section protected
by a kernel spinlock. This further implies that these locks can also be used
in the interrupt context. A thread holding such a lock will complete in a finite
amount of time unless it is a part of a deadlock (discussed later). On a multicore
machine, it is possible that a thread may wait for the lock to be released by
a thread running on another core. Given that the lock holder cannot block or
sleep, this mechanism is definitely lock free. We are assuming that the lock
holder will complete the critical section in a finite amount of time. This will
indeed be the case given our restrictions on blocking interrupts and disallowing
preemption.
If we were to allow context switching after a spinlock has been acquired, then
we may have a deadlock situation. The new thread may have a higher priority.
To make matters worse, it may try to acquire the lock. Given that we shall have
busy waiting, it will continue to loop and wait for the lock to get freed. But the
lock may never get freed because the thread that is holding the lock may never
get a chance to run. The reason it may not get a chance to run is because it has
a lower priority than the thread that is waiting on the lock. Hence, kernel-level
spinlocks need these restrictions. It effectively locks the CPU. The lock-holding
thread does not migrate, nor does it allow any other thread to run until it has
finished executing the critical section and released the spinlock.
#define preempt_enable() \
do { \
    barrier(); \
    if (unlikely(preempt_count_dec_and_test())) \
        __preempt_schedule(); \
} while (0)
The core idea is a preemption count variable. If the count is non-zero, then
it means that preemption is not allowed. Whereas if the count is 0, it means
that preemption is allowed. If we want to disable preemption, all that we have
to do is increment the count and also insert a fence operation, which is also
known as a memory barrier. The reason for a barrier is to ensure that the code
in the critical section is not reordered and brought before the lock acquire. Note
that this is not the same barrier that we discussed in the section on barriers
and phasers (Section 5.1.11). They just happen to share the same name. These
are synchronization operations, whereas the memory barrier is akin to a fence,
which basically disables memory reordering. The preemption count is stored in
a per-CPU region of memory (accessible via a segment register). Accessing it
is a very fast operation and requires very few instructions.
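For completeness, a minimal sketch of the counterpart macro, preempt_disable, is shown below. It is based purely on the description above (increment the count and insert a barrier); the actual kernel definition differs slightly across configurations.

#define preempt_disable() \
do { \
    preempt_count_inc(); /* increment the per-CPU preemption count */ \
    barrier();           /* keep the critical section's code after the increment */ \
} while (0)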
The code for enabling preemption is shown in Listing 5.16. In this case, we
do more or less the reverse. We have a fence operation to ensure that all the
pending memory operations (executed in the critical section) completely finish
and are visible to all the threads. After that, we decrement the count using an
atomic operation. If the count reaches zero, it means that now preemption is
allowed, so we call the schedule function. It finds a process to run on the core.
An astute reader will make out that this is like a semaphore, where if preemption
is disabled n times, it needs to be enabled n times for the task running on the
core to become preemptible.
The code for a spinlock is shown in Listing 5.17. We see that the spinlock
structure encapsulates an arch spinlock t lock and a dependency map (struct
lockdep map). The raw lock member is the actual spinlock. The dependency
map is used to check for deadlocks (we will discuss that later).
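A simplified sketch that is consistent with this description is shown below (the actual kernel type nests a raw spinlock and has additional configuration-dependent debug fields).

typedef struct spinlock {
    arch_spinlock_t raw_lock;   /* the actual architecture-specific spinlock */
#ifdef CONFIG_DEBUG_LOCK_ALLOC
    struct lockdep_map dep_map; /* dependency map used to check for deadlocks */
#endif
} spinlock_t;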
Listing 5.18: Inner workings of a spinlock
source: include/asm-generic/spinlock.h
void arch_spin_lock(arch_spinlock_t *lock) {
    u32 val = atomic_fetch_add(1 << 16, lock);
    u16 ticket = val >> 16;   /* upper 16 bits of lock */
    if (ticket == (u16)val)   /* ticket id == ticket next in line */
        return;
    atomic_cond_read_acquire(lock, ticket == (u16)VAL);
    smp_mb();                 /* barrier instruction */
}
Let us understand the design of the spinlock. Its code is shown in List-
ing 5.18. It is a classic ticket lock that has two components: a ticket, which
acts like a coupon, and the id of the next ticket (next). Every time that a thread
tries to acquire a lock, it gets a new ticket. It is deemed to have acquired the
lock when ticket == next.
Consider a typical bank where we go to meet a teller. We first get a coupon,
which in this case is the ticket. Then we wait for our coupon number to be
displayed. Once that happens, we can go to the counter at which a teller is
waiting for us. The idea here is quite similar. If you think about it, you will
conclude that this lock guarantees fairness. Starvation is not possible. The
way that this lock is designed in practice is quite interesting. Instead of using
multiple fields, a single 32-bit unsigned integer is used to store both the ticket
and the next field. We divide the 32-bit unsigned integer into two smaller
unsigned integers that are 16 bits wide. The upper 16 bits store the ticket id.
The lower 16 bits store the value of the next field.
When a thread arrives, it tries to get a ticket. This is achieved by adding
2^16 (1 << 16) to the lock variable. This basically increments the ticket stored in
the upper 16 bits by 1. The atomic fetch-and-add instruction is used to achieve
this. This instruction has a built-in memory barrier as well (more about this
later). Now, the original ticket can be extracted quite easily by right shifting
the value returned by the fetch-and-add instruction by 16 positions.
The next task is to extract the lower 16 bits (the next field). This is the number
of the ticket that currently holds the lock, which basically means that if the
current ticket is equal to the lower 16 bits, then we can go ahead and execute
the critical section. This is easy to do using a simple typecast operation. Here,
the type u16 refers to a 16-bit unsigned integer. Simply typecasting val to
u16 retrieves the lower 16 bits as an unsigned integer. This is all that
we need to do. Then, we need to compare this value with the thread's ticket,
which is also a 16-bit unsigned integer. If both are equal, then the spinlock has
effectively been acquired and the method can return.
Now, assume that they are not equal. Then there is a need to wait – there is a
need to busy wait. This is where we call the macro atomic cond read acquire,
which requires two arguments: the lock value and the condition that needs to
be true. This condition checks whether the obtained ticket is equal to the next
field in the lock variable. It ends up calling the macro smp cond load relaxed,
whose code is shown next.
Listing 5.19: The code for the busy-wait loop
source: include/asm-generic/barrier.h
#define smp_cond_load_relaxed(ptr, cond_expr) ({    \
    typeof(ptr) __PTR = (ptr);                      \
    __unqual_scalar_typeof(*ptr) VAL;               \
    for (;;) {                                      \
        VAL = READ_ONCE(*__PTR);                    \
        if (cond_expr)                              \
            break;                                  \
        cpu_relax();   /* insert a delay */         \
    }                                               \
    (typeof(*ptr)) VAL;                             \
})
The kernel code for the macro is shown in Listing 5.19. In this case, the
inputs are a pointer to the lock variable and an expression that needs to evaluate
to true. Then we have an infinite loop where we dereference the pointer and
fetch the current value of the lock. Next, we evaluate the conditional expression
(ticket == (u16)VAL). If the conditional expression evaluates to true, then it
means that the lock has been implicitly acquired. We can then break from the
infinite loop and resume the rest of the execution. Note that we cannot return
from a macro because a macro is just a piece of code that is copy-pasted by the
preprocessor with appropriate argument substitutions.
In case the conditional expression evaluates to false, then of course, there
is a need to keep iterating. But along with that, we would not like to contend
for the lock all the time. This would lead to a lot of cache line bouncing across
cores, which is detrimental to performance. We are unnecessarily increasing the
memory and on-chip network traffic. It is a better idea to wait for some time
and try again. This is where the function cpu relax is used. It makes the
thread back off for some time.
Given that fairness is guaranteed, we will ultimately exit the infinite loop,
and we will come back to the main body of the arch_spin_lock function. In this
case, there is a need to introduce a memory barrier. Note that this is a generic
pattern: whenever we acquire a lock, there is a need to insert a
memory barrier after it. This ensures that prior to entering the critical section all
the reads and writes are fully completed and are visible to all the threads in the
SMP system. Moreover, no instruction in the critical section can complete before
the memory barrier has completed its operation. This ensures that changes
made in the critical section get reflected only after the lock has been acquired.
Listing 5.20: The code for unlocking a spinlock
source: include/asm-generic/spinlock.h
void arch_spin_unlock(arch_spinlock_t *lock)
{
    u16 *ptr = (u16 *)lock + IS_ENABLED(CONFIG_CPU_BIG_ENDIAN);
    u32 val = atomic_read(lock);
    smp_store_release(ptr, (u16)val + 1);   /* store with release consistency semantics */
}
Let us now come to the unlock function. This is shown in Listing 5.20. It
is quite straightforward. The first task is to find the address of the next field.
This needs to be incremented to let the new owner of the lock know that it
can now proceed. There is a complication here. We need to see if the machine
is big endian or little endian. If it is a big endian machine, which basically
means that the lower 16 bits are actually stored at the higher addresses, then
a small correction to the address needs to be made. This logic is embedded in
the IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) macro. In any case, at the end of this statement,
the address of the next field is stored in the ptr variable. Next, we read the
lock variable, extract the current value of the next field (its lower 16 bits),
increment it by 1, and store it at the address pointed to by ptr, which is
nothing but the address of the next field.
Now if there is a thread whose ticket number is equal to the contents of the
next field, then it knows that it is the new owner of the lock. It can proceed
with completing the process of lock acquisition and start executing the critical
section. At the very end of the unlock function, we need to execute a memory
barrier known as an SMP store release, which basically ensures that all the
writes made in the critical section are visible to the rest of the threads after the
lock has been released. This completes the unlock process.
struct mutex {
    atomic_long_t owner;
    raw_spinlock_t wait_lock;
    struct list_head wait_list;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
    struct lockdep_map dep_map;
#endif
};
The code of the kernel mutex is shown in Listing 5.22. Along with a spinlock
(wait lock), it contains a pointer to the owner of the mutex and a waiting list
of threads. Additionally, to prevent deadlocks it also has a pointer to a lock
dependency map. However, this field is optional – it depends on the compilation
parameters. Let us elaborate.
The owner field is a pointer to the task_struct of the owner. An astute
reader may wonder why it is an atomic_long_t and not a task_struct *.
Herein lies a small and neat trick. We wish to provide a fast-path mechanism
to acquire the lock. We would like the owner field to contain the value
of the task_struct pointer of the lock-holding thread if the lock is currently
acquired and held by a thread. Otherwise, its value should be 0. This neat
trick allows us to do a compare-and-exchange on the owner field in the hope
of acquiring the lock quickly. We try the fast path only once. To acquire the
lock, we compare the value stored in owner with 0. If they are equal, we store
a pointer to the currently running thread's task_struct in its place.
Otherwise, we enter the slow path. In this case, the threads waiting to
acquire the lock are stored in wait list, which is protected by the spinlock
wait lock. This means that before enqueueing the current thread in wait list,
we need to acquire the spinlock wait lock first.
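To make the fast path concrete, a minimal sketch of this compare-and-exchange step is shown below. The helper name mutex_fastpath_lock is illustrative; the real routine in kernel/locking/mutex.c also encodes some flag bits in the lower bits of owner.

/* Sketch: if owner == 0 (lock free), atomically install a pointer to the
   current task's task_struct as the owner and return true. */
static inline bool mutex_fastpath_lock(struct mutex *lock)
{
    unsigned long zero = 0UL;
    unsigned long curr = (unsigned long)current; /* the running task */

    return atomic_long_try_cmpxchg_acquire(&lock->owner, &zero, curr);
}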
Listing 5.23 shows the code of the lock function (mutex lock) in some more
detail. Its only argument is a pointer to the mutex. First, there is a need to
check if this call is being made in the right context or not. For example, the
kernel defines an atomic context in which the code cannot be preempted. In this
context, sleeping is not allowed. Hence, if the mutex lock call has been made
in this context, it is important to flag this event as an error and also print the
stack trace (the function call path leading to the current function).
Assume that the check passes, and we are not in the atomic context, then
we first make an attempt to acquire the mutex via the fast path. If we are not
successful, then we try to acquire the mutex via the slow path using the function
mutex lock slowpath.
In the slow path, we first try to acquire the spinlock, and if that is not
possible then the process goes to sleep. In general, the task is put to sleep in the
UNINTERRUPTIBLE state. This is because we don't want to wake it up to
process signals. When the lock is released, the sleeping processes are woken up
such that they can contend for the lock. The process that is successful in acquiring
the spinlock wait_lock adds itself to wait_list and goes to sleep. This is done
by setting its state (in general) to UNINTERRUPTIBLE.
Note that this code runs inside the kernel. Going to sleep does not mean being
suspended immediately. It just means setting the status of the task to either
INTERRUPTIBLE or UNINTERRUPTIBLE. The task still runs. It needs to
subsequently invoke the scheduler such that the scheduler can find the most eligible task to
run on the core. Given that the status of the current task is set to a sleep state, the
scheduler will not choose it for execution.
The unlock process pretty much does the reverse. We first check if there are
waiting tasks in the wait list. If there are no waiting tasks, then the owner
field can directly be set to 0, and we can return. However, if there are waiting
tasks, then there is a need to do much more processing. We first have to acquire
the spinlock associated with the wait list (list of waiting processes). Then, we
remove the first entry and extract the task, next, from it. The task next needs
to be woken up in the near future such that it can access the critical section.
However, we are not done yet. We need to set the owner field to next such that
incoming threads know that the lock is held by some thread and is not free.
Finally, we release the spinlock and hand over the woken-up task next
to the scheduler.
Note that the kernel code can use many other kinds of locks. Their code is
available in the directory kernel/locking.
A notable example is a queue-based spinlock (the MCS lock, qspinlock in
the kernel code). It is generally known to be a very scalable lock that is also quite fast.
It also minimizes cache line bouncing (movement of the cache line containing
the lock variable across cores). The idea is that we create a linked list of nodes
where the tail pointer points to the end of the linked list. We then add the
current node (wrapper of the current task) to the very end of this list. This
process requires two operations: make the current tail node point to the new
node (containing our task), and modify the tail pointer to point to the new
node. Both of these operations need to execute atomically – it needs to appear
that both of them executed at a single point of time, instantaneously. The MCS
lock is a classical lock, and almost all texts on concurrent systems discuss its
design in great detail. Hence, we shall not delve further into it (see [Herlihy and
Shavit, 2012]). It suffices to say that it uses complex lock-free programming,
and we do not perform busy waiting on a single location, instead a thread only
busy waits on a Boolean variable declared within its node structure. When its
predecessor in the list releases the lock, it sets this variable to false, and the
current thread can then acquire the lock. This eliminates cache line bouncing
to a very large extent.
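For concreteness, a user-space sketch of the MCS idea using C11 atomics is shown below. This is illustrative code only; the kernel's qspinlock packs the queue state into a single 32-bit word and is considerably more involved.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;            /* the only flag this thread busy waits on */
};

struct mcs_lock {
    struct mcs_node *_Atomic tail; /* points to the last node in the queue */
};

void mcs_acquire(struct mcs_lock *lock, struct mcs_node *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* Atomically append our node to the end of the queue. */
    struct mcs_node *prev = atomic_exchange(&lock->tail, me);
    if (prev == NULL)
        return;                    /* the queue was empty: we hold the lock */
    atomic_store(&prev->next, me); /* link ourselves behind the previous holder */
    while (atomic_load(&me->locked))
        ;                          /* busy wait only on our own node's flag */
}

void mcs_release(struct mcs_lock *lock, struct mcs_node *me)
{
    struct mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        struct mcs_node *expected = me;
        /* No visible successor: try to mark the queue as empty. */
        if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
            return;
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                      /* a successor is still linking itself in */
    }
    atomic_store(&succ->locked, false); /* hand the lock over to the successor */
}

Note how a waiting thread spins only on its own node's flag, which is why the cache line containing the lock variable does not keep bouncing between cores.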
There are a few more variants like the osq lock (variant of the MCS lock)
and the qrwlock (reader-writer lock that gives priority to readers).
The kernel code has its version of semaphores (see Listing 5.24). It has a spin
lock (lock), which protects the semaphore variable count. Akin to user-level
semaphores, the kernel semaphore supports two methods that correspond to
wait and post, respectively. They are known as down (wait) and up (post/signal).
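A sketch of the structure consistent with this description is shown below (it mirrors the definition in include/linux/semaphore.h).

struct semaphore {
    raw_spinlock_t   lock;       /* protects count and wait_list */
    unsigned int     count;      /* the semaphore's value */
    struct list_head wait_list;  /* tasks sleeping on this semaphore */
};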
The kernel semaphore functions in exactly the same manner as the user-level
semaphore. After acquiring the lock, the count variable is either incremented
or decremented. However, if the count variable is already zero, then it is not
possible to decrement it and the current task needs to wait. This is the point
at which it is added to the list of waiting processes (wait list) and the task
state is set to UNINTERRUPTIBLE. Similar to the case of unlocking a spin
lock, here also, if the count becomes non-zero from zero, we pick a process
from the wait list and set its task state to RUNNING. Given that all of this
is happening within the kernel, setting the task state is very easy. All of this
is very hard at the user level for obvious reasons. We need a system call for
everything. However, in the kernel, we do not have those restrictions and thus
these mechanisms are much faster.
In any unsafe state, it is possible that the thread gets preempted and an
interrupt handler runs. This interrupt handler may try to acquire the lock. Note
that any softirq-unsafe state is hardirq-unsafe as well. This is because hardirq
interrupt handlers have a higher priority than softirq handlers.
We define hardirq-safe and hardirq-unsafe analogously. These states will be
used to flag potential deadlock-causing situations.
We next validate the chain of lock acquire calls that have been made. We check
for trivial deadlocks first (fairly common in practice): A → B → A. Such
trivial deadlocks are also known as lock inversions. Let us now use the states.
No path can contain a hardirq-unsafe lock followed by a hardirq-safe lock. Such
an ordering would allow an interrupt handler that acquires the latter lock to
interrupt the critical section associated with the former lock. This may lead to
a lock-inversion deadlock.
Let us now look at the general case in which we search for cyclic (circular)
waits. We need to create a global graph where each lock instance is a node, and
if the process holding lock A waits to acquire lock B, then there is an arrow
from A to B. If we have V nodes and E edges, then the time complexity is
O(V + E). This is quite slow. Note that we need to check for cycles before
acquiring every lock.
Let us use a simple caching-based technique. Consider a chain of lock acqui-
sitions, where the lock acquire calls can possibly be made by different threads.
Given that the same kind of code sequences tend to repeat in the kernel code,
we can cache a full sequence of lock acquisition calls. If the entire sequence is
devoid of cycles, then we can deem the corresponding execution to be deadlock
free. Hence, the brilliant idea here is as follows.
Figure 5.15: A hash table that stores an entry for every chain of lock acquisitions
Instead of checking for a deadlock on every lock acquire, we check for dead-
locks far more infrequently. We consider a long sequence (chain) of locks and
hash all of them. A hash table stores the “deadlock status” associated with
such chains (see Figure 5.15). It is indexed with the hash of the chain. If the
chain has been associated with a cyclic wait (deadlock) in the past, then the
hash table stores a 1, otherwise it stores a 0. This is a much faster mechanism
for checking for deadlocks and the overheads are quite limited. Note that if no
entry is found in the hash table, then either we keep building the chain and
try later, or we run a cycle detection algorithm immediately. This is a generic
mechanism that is used to validate spinlocks, mutexes and reader-writer locks.
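A toy sketch of this chain-caching idea is shown below. The names and the hash function are purely illustrative (they are not the kernel's lockdep implementation), and has_cycle stands in for the full cycle-detection pass.

#include <stdbool.h>
#include <stdint.h>

#define CHAIN_TABLE_SIZE (1u << 15)

enum chain_status { CHAIN_UNKNOWN = 0, CHAIN_SAFE, CHAIN_DEADLOCK };

static uint8_t chain_table[CHAIN_TABLE_SIZE];

/* Assumed helper: runs the full O(V + E) cycle detection on this chain. */
extern bool has_cycle(const uint32_t *lock_ids, int n);

static uint32_t hash_chain(const uint32_t *lock_ids, int n)
{
    uint32_t h = 2166136261u;            /* FNV-1a style hash of the lock ids */
    for (int i = 0; i < n; i++) {
        h ^= lock_ids[i];
        h *= 16777619u;
    }
    return h & (CHAIN_TABLE_SIZE - 1);
}

/* Returns true if this chain of lock acquisitions is deemed deadlock free. */
bool validate_chain(const uint32_t *lock_ids, int n)
{
    uint32_t idx = hash_chain(lock_ids, n);
    if (chain_table[idx] == CHAIN_SAFE)
        return true;                     /* cached: no cyclic wait */
    if (chain_table[idx] == CHAIN_DEADLOCK)
        return false;                    /* cached: a cyclic wait was found earlier */

    bool safe = !has_cycle(lock_ids, n); /* cache miss: run the expensive check once */
    chain_table[idx] = safe ? CHAIN_SAFE : CHAIN_DEADLOCK;
    return safe;
}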
The process of allocating and freeing objects is the most interesting. Allo-
cation is per se quite straightforward – we can use the regular malloc call. The
object can then be used by multiple threads. However, freeing the object is rel-
atively more difficult. This is because threads may have references to it. They
may try to access the fields of the object after it has been freed. We thus need
to free the allocated object only when no thread is holding a valid reference
to it or is holding a reference but promises never to use it in the future. In
C, it is always possible to arrive at the old address of an object using pointer
arithmetic. However, let us not consider such tricky situations because RCU
requires a certain amount of disciplined programming.
One may be tempted to use conventional reference counting, which is rather
slow and complicated in a concurrent, multiprocessor setting. A thread needs
to register itself with an object, and then it needs to deregister itself once
it is done using it. Registration and deregistration increment and decrement
the reference count, respectively. Any deallocation can happen only when the
reference count reaches zero. This is a complicated mechanism. The RCU
mechanism [McKenney, 2007] is far simpler.
It needs to have the following features:
Note that we only focus on the free part because it is the most difficult.
Consider the example of a linked list (see Figure 5.16).
Figure 5.16 shows the sequence pictorially: readers may still be reading the deleted node, then we synchronize and reclaim the space, and finally we reach the final state.
In this case even though we delete a node from the linked list, other threads
may still have references to it. The threads holding a reference to the node will
not be aware that the node has been removed from the linked list. Hence, after
deletion from the linked list we still cannot free the associated object.
Waiting for a reference count to reach zero requires busy waiting. We already know the problems with busy waiting
such as cache line bouncing and doing useless work. This is precisely what we
would like to avoid via the synchronize_rcu call.
Writing is slightly different here – we create a copy of an object, modify it,
and assign the new pointer to a field in the encapsulating data structure. Note
that a pointer is referred to as RCU-protected when it can be assigned and
dereferenced with special RCU-based checks (as we shall see later).
Listing 5.25: Example code that traverses a list within an RCU read context
source: include/linux/rcupdate.h
rcu_read_lock();
list_for_each_entry_rcu(p, head, list) {
    t1 = p->a;
    t2 = p->b;
}
rcu_read_unlock();
Listing 5.26: Replace an item in a list and then wait till all the readers finish
list_replace_rcu(&p->list, &q->list);
synchronize_rcu();
kfree(p);
Listing 5.26 shows a piece of code that waits till all the readers complete. In
this case, one of the threads calls the list replace rcu function that replaces
an element in the list. It is possible that there are multiple readers who currently
have a reference to the old element (p->list) and are currently reading it. We
need to wait for them to finish the read operation. The only assumption that
can be made here is that all of them are accessing the list in an RCU context –
the code is wrapped between the RCU read lock and read unlock calls.
The function synchronize_rcu makes the thread wait for all the readers to
complete. Once all the readers have completed, we can be sure that the old
pointer will not be read again. This is because the readers will check if the node
pointed to by the pointer is still a part of the linked list or not. This is not
enforced by RCU per se. Coders nevertheless have to observe such rules if they
want to use RCU correctly.
After this, we can free the object pointed to by p using the kfree call.
Let us now consider an example that uses the rcu_assign_pointer
function in the context of the list_replace_rcu function (see Listing 5.28). In
a doubly-linked list, we need to replace the old entry with new. We first
set the next and prev pointers of new (making them the same as those of old).
Note that at this point, the node is not yet added to the list.
It is added when the next pointer of new->prev is set to new. This
is the key step that adds the new node to the list. This pointer assignment is
done in an RCU context because it delinks the earlier node from the list. There
may be references to the earlier node that are still held by readers. This is
why this pointer assignment has to be done in an RCU context. We need to wait for those
readers to complete before old is deallocated.
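Putting these steps together, a sketch of the publish-then-reclaim pattern on a simplified singly-linked structure is shown below. The struct item type and the replace_item function are illustrative; the actual list_replace_rcu code manipulates the kernel's struct list_head nodes.

struct item {
    int a, b;
    struct item *next;
};

/* Replace the node old (currently pointed to by *slot) with new, then reclaim old. */
void replace_item(struct item **slot, struct item *old, struct item *new)
{
    new->next = old->next;          /* copy the link fields of the old node first */
    rcu_assign_pointer(*slot, new); /* publish the new node (includes a release barrier) */
    synchronize_rcu();              /* wait for all pre-existing readers to finish */
    kfree(old);                     /* no reader can now hold a reference to old */
}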
Dereferencing a Pointer
At this point the object can be freed, and its space can be reclaimed. This
method is simple and slow.
Figure 5.17: Removal and reclamation of an object (within the RCU context)
Let us now understand when the grace period (from the point of view of a
thread) ends and the period of quiescence starts. One of the following conditions
needs to be satisfied.
1. When a thread blocks: If there is a restriction that blocking calls are not
allowed in an RCU read block, then if the thread blocks we can be sure
that it is in the quiescent state.
3. If the kernel enters an idle loop, then we can also be sure that the read
block is over.
Whenever any of these conditions is true, we set a bit that indicates that the
CPU is out of the RCU read block – it is in the quiescent state. The reason that
this enables better performance is as follows. There is no need to send costly
inter-processor interrupts to each CPU and wait for a task to execute. Instead,
we adopt a more proactive approach. The moment a thread leaves the read
block, the CPU enters the quiescent state and this fact is immediately recorded
by setting a corresponding per-CPU bit. Note the following: this action is off
the critical path and there is no shared counter.
Once all the CPUs enter a quiescent state, the grace period ends and the
object can be reclaimed. Hence, it is important to answer only one question
when a given CPU enters the quiescent state: is this the last CPU to have
entered the quiescent state? If the answer is “Yes”, then we can go forward
and declare that the grace period has ended. The object can then be reclaimed.
This is because we can be sure that no thread holds a valid reference to the
object (see the answer to Question 5.3.5).
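The following toy sketch captures this last-CPU check with a flat per-CPU bitmask. It is illustrative user-space code: the kernel instead uses the hierarchical rcu_node tree described below, and end_grace_period is an assumed callback.

#include <stdatomic.h>

#define NR_CPUS 8

/* Assumed callback invoked when the grace period ends. */
extern void end_grace_period(void);

static atomic_uint quiescent_mask;   /* bit i set => CPU i has passed a quiescent state */

void report_quiescent_state(int cpu)
{
    unsigned int all  = (1u << NR_CPUS) - 1;
    unsigned int prev = atomic_fetch_or(&quiescent_mask, 1u << cpu);

    /* Was this the last CPU to report? If so, the grace period has ended. */
    if ((prev | (1u << cpu)) == all)
        end_grace_period();
}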
Question 5.3.1
What if there are more threads than CPUs? It is possible that all of
them hold references to an object. Why are we maintaining RCU state
at a CPU level?
Answer: We assume that whenever a thread accesses the object that
is RCU-protected, it is accessed only within an RCU context (within a
read block). Furthermore, a check is also made within the read block
that it is very much a part of the containing data structure. It cannot
access the object outside the RCU context. Now, once a thread enters
an RCU read block, it cannot be preempted until it has finished executing
the read block.
It is not possible for the thread to continue to hold a reference and use it.
This is because it can be used once again only within the RCU context,
and there it will be checked if the object is a part of its containing data
structure. If it has been removed, then the object’s reference has no
value.
For a similar reason, no other thread running on the CPU can access
the object once the object has been removed and the quiescent state has
been reached on the CPU. Even if another thread runs on the CPU, it
will not be able to access the same object because it will not find it in
the containing data structure.
Tree RCU
Let us now suggest an efficient method of managing the quiescent state of all
CPUs. The best way to do so is to maintain a tree. Trees have natural paral-
lelism; they avoid centralized state.
struct rcu state is used to maintain quiescence information across the
cores. Whenever the grace period ends (all the CPUs are quiescent at least
once), a callback function may be called. This will let the writer know that the
object can be safely reclaimed.
Figure: struct rcu_state at the root with a tree of struct rcu_node structures below it. A callback function can be registered; it is called when the grace period ends.
Preemptible RCU
Sadly, RCU stops preemption and migration when the control is in an RCU block
(read-side critical section). This can be detrimental to real-time programs as
they come with strict timing requirements and deadlines. For real-time versions
of Linux, there is a need to have a preemptible version of RCU where preemption
is allowed within an RCU read-side block. Even though doing this is a good
idea for real-time systems, it can lead to many complications.
In classical RCU, read-side critical sections had almost zero overhead. Even
on the write side, all that we had to do was read the current data structure, make a
copy, make the changes (update), and add it to the encapsulating data structure
(such as a linked list or a tree). The only challenge was to wait for all the
outstanding readers to complete, which has been solved very effectively.
Here, there are many new complications. If there is a context switch in the
middle of a read block, then the read-side critical section gets “artificially length-
ened”. We can no longer use the earlier mechanisms for detecting quiescence. In
this case, whenever a process enters a read block, it needs to register itself, and
then it needs to deregister itself when it exits the read block. Registration and
deregistration can be implemented using counter increments and decrements,
respectively. The rcu read lock function needs to increment a counter and
the rcu read unlock function needs to decrement a counter. These counters
are now a part of a process’s context, not the CPU’s context (unlike classical
RCU). This is because we may have preemption and subsequent migration. It
is also possible for two concurrent threads to run on a CPU that access RCU-
protected data structures. Note that this was prohibited earlier. We waited
for a read block to completely finish before running any other thread. In this
case, two read blocks can run concurrently (owing to preemption). Once pre-
empted, threads can also migrate to other CPUs. Hence, counters can no longer
be per-CPU counters. State management thus becomes more complex.
We have thus enabled real-time execution and preemption at the cost of
making RCU slower and more complex.
5.4 Scheduling
Scheduling is one of the most important activities performed by an OS. It is a
major determinant of the overall system’s responsiveness and performance.
the same for all the jobs. Here again, there are two types of problems. In one
case, the jobs that will arrive in the future are known. In the other case, we
have no idea – jobs may arrive at any point of time.
Figure 5.19: Example of a set of jobs that are waiting to be scheduled (processing times: J1 = 3, J2 = 2, J3 = 4, J4 = 1)
Figure 5.19 shows an example where we have a bunch of jobs that need to
be scheduled. We assume that the time that a job needs to execute (processing
time) is known (shown in the figure).
In Figure 5.20, we introduce an objective function, which is the mean job
completion time. The completion time is the duration between the arrival time
and the time at which the job fully completes. This determines the responsive-
ness of the system. It is possible for a system to minimize the makespan yet
unnecessarily delay a lot of jobs, which shall lead to an adverse mean completion
time value.
Figure 5.20: Mean completion time µ = (Σi ti)/n
We can thus observe that the problem of scheduling is a very fertile ground
for proposing and solving optimization problems. We can have a lot of con-
straints, settings and objective functions.
To summarize, we have said that in any scheduling problem, we have a list
of jobs. Each job has an arrival time, which may either be equal to 0 or some
other time instant. Next, we typically assume that we know how long a job shall
take to execute. Then in terms of constraints, we can either have preemptible
jobs or we can have non-preemptible jobs. The latter means that the entire
job needs to execute in one go without any other intervening jobs. Given these
constraints, there are a couple of objective functions that we can minimize. One
would be to minimize the makespan, which is basically the time from the start
of scheduling till the time it takes for the last job to finish execution. Another
objective function is the average completion time, where the completion time is
again defined as the time at which a job completes minus the time at which it
arrived (measure of the responsiveness).
For scheduling such a set of jobs, we have a lot of choice. We can use many
simple algorithms, which in some cases, can also be proven to be optimal. Let us
start with the random algorithm. It randomly picks a job and schedules it on a
free core. There is a lot of work that analyzes the performance of such algorithms
and many times such random choice-based algorithms perform quite well. In
the space of deterministic algorithms, the shortest job first (SJF) algorithm is
preferred. It schedules all the jobs in ascending order of their execution times.
It is a non-preemptible algorithm. We can prove that it minimizes the average
completion time.
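As a tiny worked example, the following program computes the mean completion time of SJF for the four jobs of Figure 5.19 (processing times 3, 2, 4 and 1, all arriving at time 0). The SJF order yields completion times 1, 3, 6 and 10, i.e., a mean of 5.

#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;  /* ascending processing time */
}

int main(void)
{
    int p[] = {3, 2, 4, 1};                    /* processing times from Figure 5.19 */
    int n = sizeof(p) / sizeof(p[0]);
    qsort(p, n, sizeof(int), cmp);             /* SJF: run the shortest job first */

    int t = 0, sum = 0;
    for (int i = 0; i < n; i++) {
        t += p[i];                             /* completion time of the i-th scheduled job */
        sum += t;
    }
    printf("mean completion time = %.2f\n", (double)sum / n); /* prints 5.00 */
    return 0;
}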
KSW Model
Let us now introduce a more formal way of thinking and introduce the Karger-
Stein-Wein (KSW) model [Karger et al., 1999]. It provides an abstract or generic
framework for all scheduling problems. It essentially divides the space of prob-
lems into large classes and finds commonalities in between problems that belong
to the same class. Specifically, it requires three parameters: α, β and γ.
The first parameter α determines the machine environment. It specifies the
number of cores, the number of jobs, and the execution time of each job. The second parameter
β specifies the constraints. For example, it specifies whether preemption is al-
lowed or not, whether the arrival times are all the same or are different, whether
the jobs have dependencies between them or whether there are job deadlines.
A dependency between a pair of jobs can exist in the sense that we can specify
that job J1 needs to complete before J2 . Note that in real-time systems, jobs
come with deadlines, which basically means that jobs have to finish before a
certain time. A deadline is thus one more type of constraint.
Finally, the last parameter is γ, which is the optimality criterion. We have
already discussed the average mean completion time and makespan criteria.
We can also define a weighted completion time – a weighted mean of completion
times. Here a weight in a certain sense represents a job’s priority. Note that
the mean completion time metric is a special case of the weighted completion
time metric – all the weights are equal to 1. Let the completion time of job i
be Ci . The cumulative completion time is equivalent to the mean completion
time in this case because the number of jobs behaves like a constant. We can
represent this criterion as ΣCi . The makespan is represented as Cmax (maximum
completion time of all jobs).
We can consequently have a lot of scheduling algorithms for every scheduling
problem, which can be represented using the 3-tuple α | β | γ as per the KSW
formulation.
We will describe two kinds of algorithms in this book. The most popular
algorithms are quite simple, and are also provably optimal in some scenarios.
We will also introduce a bunch of settings where finding the optimal schedule is
an NP-complete problem [Cormen et al., 2009]. There are good approximation
algorithms for solving such problems.
Let us define the problem 1 || ΣCj in the KSW model. We are assuming
that there is a single core. The objective function is to minimize the sum of
completion times (Cj ). Note that minimizing the sum of completion times is
equivalent to minimizing the mean completion time because the number of tasks
is known a priori and is a constant.
Figure 5.21: Shortest job first scheduling – the jobs run in the order J4, J2, J1, J3 (processing times 1, 2, 3 and 4, respectively)
The claim is that the SJF (shortest job first) algorithm is optimal in this
case (example shown in Figure 5.21). Let us outline a standard approach for
proving that a scheduling algorithm is optimal with respect to the criterion that
is defined in the KSW problem. Here we are minimizing the mean completion
time.
Let the SJF algorithm be algorithm A. Assume that a different algorithm A′ is
optimal and that it violates the SJF order. Then there must be a pair of jobs j and k
such that j immediately precedes k and the processing time of j is greater than that
of k, i.e., pj > pk. Note that such a pair of jobs will not be found in the schedule
produced by algorithm A. Assume that job j starts at time t. Let us exchange jobs
j and k, with the rest of the schedule remaining the same. Let this new schedule be
produced by another algorithm A′′.
Next, let us evaluate the contribution to the cumulative completion time by
jobs j and k in algorithm A′. It is (t + pj) + (t + pj + pk). Let us evaluate
the contribution of these two jobs in the schedule produced by A′′. It is (t +
pk) + (t + pj + pk). Given that pj > pk, we can conclude that the schedule
produced by algorithm A′ has a higher cumulative completion time. This
cannot be the case because we assumed A′ to be optimal. We have a contradiction:
A′′ is strictly better than A′, which violates our assumption.
Hence, A′ or any algorithm that violates the SJF order cannot be optimal.
Thus, algorithm A (SJF) is optimal.
Weighted Jobs
Let us now define the problem where weights are associated with jobs. It will
be 1 || Σwj Cj in the KSW formulation. If ∀j, wj = 1, we have the classical
unweighted formulation for which SJF is optimal.
For the weighted version, let us schedule jobs in descending order of (wj /pj ).
Clearly, if all wj = 1, this algorithm is the same as SJF. We can use the same
exchange-based argument to prove that using (wj /pj ) as the job priority yields
an optimal schedule.
EDF Algorithm
Let us next look at the EDF (Earliest Deadline First) algorithm. It is one of
the most popular algorithms in real-time systems. Here, each job is associated
with a distinct non-zero arrival time and deadline. Let us define the lateness as
⟨completion time⟩ - ⟨deadline⟩. Let us define the problem as follows:
We are still considering a single core machine. The constraints are on the
arrival time and deadline. The constraint ri represents the fact that job i is
associated with arrival time ri – it can start only after it has arrived (ri ). Jobs
can arrive at any point of time (dynamically). The dli constraint indicates
that job i has deadline dli associated with it – it needs to complete before it.
Preemption is allowed (pmtn). We wish to minimize the maximum lateness
(Lmax ). This means that we would like to ensure that jobs complete as soon as
possible, with respect to their deadline. Note that in this case, we care about
the maximum value of the lateness, not the mean value. This means that we
don’t want any single job to be delayed significantly.
The algorithm schedules the job whose deadline is the earliest. Assume that
a job is executing, and a new job arrives that has an earlier deadline. Then the
currently running job is swapped out, and the new job that now has the earliest
deadline executes.
If the set of jobs are schedulable, which means that it is possible to find
a schedule such that no job misses its deadline, then the EDF algorithm will
produce such a schedule. If they are not schedulable, then the EDF algorithm
will broadly minimize the time by which jobs miss their deadline (minimize
Lmax ).
The proof is on similar lines and uses exchange-based arguments (refer to
[Mall, 2009]).
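A minimal sketch of the EDF dispatch decision is shown below. The structures are illustrative; a real implementation would typically keep the ready jobs in a priority queue keyed by the deadline.

struct job {
    int  id;
    long deadline;   /* absolute deadline */
    long remaining;  /* remaining execution time */
};

/* Return the index of the ready job with the earliest deadline, or -1 if none. */
int edf_pick(const struct job *ready, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (ready[i].remaining <= 0)
            continue;                   /* this job has already finished */
        if (best < 0 || ready[i].deadline < ready[best].deadline)
            best = i;
    }
    return best;
}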
SRTF Algorithm
Let us continue our journey and consider another problem: 1 | ri , pmtn | ΣCi .
Consider a single core machine where the jobs arrive at different times and
preemption is allowed. We aim to minimize the mean/cumulative completion
time.
In this case, the optimal algorithm is SRTF (shortest remaining time
first). For each job, we keep a tab on the time that is left for it to finish
execution. We sort this list in ascending order and choose the job that has the
shortest amount of time left. If a new job arrives, we compute its remaining time,
and if that number happens to be the lowest, then we preempt the currently
running job and execute the newly arrived job.
We can prove that this algorithm minimizes the mean (cumulative) completion
time using a similar exchange-based argument.
• 1 | ri | ΣCi : In this case, preemption is not allowed and jobs can arrive at
any point of time. There is much less flexibility in this problem setting.
This problem is provably NP-complete.
• 1 | ri | Lmax : This problem is similar to the former. Instead of the average
(cumulative) completion time, we have lateness as the objective function.
• 1 | ri , pmtn | Σwi Ci : This is a preemptible problem that is a variant
of 1 | ri , pmtn | ΣCi , which has an optimal solution – SRTF. The only
addendum is the notion of the weighted completion time. It turns out
that for generic weights, this problem becomes NP-complete.
We thus observe that making a small change to the problem renders it NP-
complete. This is how sensitive these scheduling problems are.
Practical Considerations
All the scheduling problems that we have seen assume that the job execution
(processing) time is known. This may be the case in really well-characterized
and constrained environments. However, in most practical settings, this is not
known.
Figure 5.22 shows a typical scenario. Any task typically cycles between two
bursts of activity: a CPU-bound burst and an I/O burst. The task typically
does a fair amount of CPU-based computation, and then makes a system call.
This initiates a burst where the task waits for some I/O operation to complete.
We enter an I/O bound phase in which the task typically does not actively
execute. We can, in principle, treat each CPU-bound burst as a separate job.
Each task thus yields a sequence of jobs that have their distinct arrival times.
The problem reduces to predicting the length of the next CPU burst.
We can use classical time-series methods to predict the length of the CPU
burst. We predict the length of the nth burst tn as a function of tn−1, tn−2, . . ., tn−k.
For example, tn could be expressed as a weighted combination of the previous
burst lengths. Armed with these predictions, the algorithms listed in the previous
sections such as EDF, SJF and SRTF can be used. At least some degree of
near-optimality can be achieved.
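One common instance of such a predictor is exponential averaging, sketched below. Here alpha is a tunable smoothing factor; this particular formula is only an illustration of the general idea.

/* Predict the next CPU burst as a weighted combination of the last observed
   burst and the previous prediction. */
double predict_next_burst(double predicted_prev, double observed_prev, double alpha)
{
    return alpha * observed_prev + (1.0 - alpha) * predicted_prev;
}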
Let us consider the case when we have a poor prediction accuracy. We need
to then rely on simple, classical and intuitive methods.
Conventional Algorithms
We can always make a random choice; however, that is definitely not desirable
here. Something that is much fairer is a simple FIFO algorithm. To implement
it, we just need a queue of jobs. It guarantees the highest priority to
the job that arrived the earliest. A problem with this approach is the “convoy
effect”. A long-running job can delay a lot of smaller jobs. They will get unnec-
essarily delayed. If we had scheduled them first, the average completion time
would have been much lower.
We can alternatively opt for round-robin scheduling. We schedule a job for
one time quantum. After that we preempt the job and run another job for one
time quantum, so on and so forth. This is at least fairer to the smaller jobs.
They complete sooner.
There is thus clearly a trade-off between the priority of a task and system-
level fairness. If we boost the priority of a task, it may be unfair to other tasks
(refer to Figure 5.23).
Figure 5.23: The trade-off between the priority of a task and system-level fairness
Queue-based Scheduling
A standard method of scheduling tasks that have different priorities is to use
a multilevel feedback queue as shown in Figure 5.24. Different queues in this
composite queue are associated with different priorities. We start with the
highest-priority queue and start scheduling tasks using any of the algorithms
that we have studied. If empty cores are still left, then we move down the
priority order of queues: schedule tasks from the second-highest priority queue,
third-highest priority queue and so on. Again note that we can use a different
scheduling algorithm for each queue. They are independent in that sense.
Depending upon the nature of the task and for how long it has been waiting,
tasks can migrate between queues. To provide fairness, tasks in low-priority
queues can be moved to high-priority queues. If a background task suddenly
comes into the foreground and becomes interactive, its priority needs to be
boosted, and the task needs to be moved to a higher priority queue. On the
other hand, if a task stays in the high-priority queues for a long time, we can
demote it to ensure fairness. Such movements ensure both high performance
and fairness.
Let us now come to the issue of multicore scheduling. The big picture is
shown in Figure 5.25. We have a global queue of tasks that typically contains
newly created tasks or tasks that need to be migrated. A dispatcher module
sends the tasks to different per-CPU task queues. Theoretically, it is possible
to have different scheduling algorithms for different CPUs. However, this is not
a common pattern. Let us again look at the space of problems in the multicore
domain.
Bin Packing Problem: We have a finite number of bins, where each bin
has a fixed capacity S. There are n items. The size of the ith item is si .
We need to pack the items in bins without exceeding any bin’s capacity.
The objective is to minimize the number of bins and find an optimal
mapping between items to bins.
List Scheduling
Let us consider one of the most popular non-preemptive scheduling algorithms
in this space known as list scheduling. We maintain a list of ready jobs. They
are sorted in descending order according to some priority scheme. The priority
here could be the user’s job priority or could be some combination of the arrival
time, deadline, and the time that the job has waited for execution. When a
CPU becomes free, it fetches the highest-priority task from the list. In case it
is not possible to execute that job, the CPU walks down the list and finds
a job to execute. The only condition here is that we cannot return without a
job if the list is non-empty. Moreover, all the machines are considered to be
identical in terms of computational power.
Let us take a deeper look at the different kinds of priorities that we can use.
We can order the jobs in descending order of arrival time or job processing time.
We can also consider dependencies between jobs. In this case, it is important to
find the longest path in the graph (jobs are nodes and dependency relationships
are edges). The longest path is known as the critical path. The critical path often
determines the overall makespan of the schedule assuming we have adequate
compute resources. This is why in almost all scheduling problems, a lot of
emphasis is placed on the critical path. We always prefer scheduling jobs on
the critical path as opposed to jobs off the critical path. We can also consider
attributes associated with nodes in this graph. For example, we can set the
priority to be the out-degree (number of outgoing edges). If a job has a high
out-degree, then it means that a lot of other jobs are dependent on it. Hence,
if this job is scheduled, many other jobs will benefit – they will have one
less dependency.
It is possible to prove that list scheduling is often near-optimal in some cases
using theoretical arguments [Graham, 1969]. Consider the problem P || Cmax .
Let the makespan (Cmax ) produced by an optimal scheduling algorithm OPT
have a length C ∗ . Let us compute the ratio of the makespan produced by list
scheduling and C ∗ . Our claim is that regardless of the priority that is used, we
are never worse off by a factor of 2.
Proof: Let there be n jobs and m CPUs. Let the execution times of the jobs
be p1 . . . pn , and let job k (with execution time pk ) complete last. Assume it started
at time t. Then Cmax = t + pk .
Given that there is no idleness in list scheduling, we can conclude that till t
all the CPUs were 100% busy. This means that if we add all the work done by
all the CPUs till point t, it will be mt. This comprises the execution times of
a subset of jobs that does not include job k (the one that completes last). We
thus arrive at the following inequality.
Σ_{i≠k} pi ≥ mt
⇒ Σi pi − pk ≥ mt
⇒ t ≤ (Σi pi)/m − pk/m        (5.2)
⇒ t + pk = Cmax ≤ (Σi pi)/m − pk/m + pk
⇒ Cmax ≤ (Σi pi)/m + pk (1 − 1/m)
Now, C∗ ≥ pk and C∗ ≥ (Σi pi)/m. The first inequality holds because job k cannot
be split across CPUs (no preemption), and the second because the total work Σi pi
has to be shared by m CPUs. We thus have,
Cmax ≤ (Σi pi)/m + pk (1 − 1/m)
     ≤ C∗ + C∗ (1 − 1/m)        (5.3)
⇒ Cmax / C∗ ≤ 2 − 1/m
we discussed the lockdep map in Section 5.3.4. The Banker’s algorithm, which
we will introduce in this section, uses a more generalized form of the lockdep
map algorithm where we can have multiple copies of resources. It is a very
classical algorithm in this space, and can be used as a basis for generating many
practically relevant algorithms.
The key insight is as follows. Finding circular waits in a graph is sufficient for
cases where we have a single copy of a resource, however, when we have multiple
copies of a resource, a circular wait is not well-defined. Refer to Figure 5.26. We
show a circular dependency across processes and resources. However, because
of multiple copies, a deadlock does not happen. Hence, the logic for detecting
deadlocks when we have multiple copies of resources available is not as simple
as finding cycles in a graph. We need a different algorithm.
Figure 5.26: A circular dependency across processes (P1, P2 and P3) and resources (A, B and C)
Let us look at the data structures used in the Banker's algorithm (see Table 5.2).
There are n processes and m types of resources. The entry avlbl[i]
stores the number of free copies of resource i.
In Algorithm 1, we first initialize the cur cnt array and set it equal to avlbl
(count of free resources). At the beginning, the request of no process is assumed
to be satisfied (allotted). Hence, we set the value of all the entries in the array
done to false.
Next, we need to find a process with id i such that it is not done yet (done[i]
== false) and its requirements stored in the need[i] array are element-wise less
than cur cnt. Let us define some terminology here before proceeding forward.
need[][] is a 2-D array. need[i] is a 1-D array that captures the resource
requirements for process i – it is the ith row in need[n][m] (row-column format).
For two 1-D arrays A and B of the same size, the expression A ≺ B means that
∀i, A[i] ≤ B[i] and ∃j, A[j] < B[j]. This means that each element of A is less
than or equal to the corresponding element of B. Furthermore, there is at least
one entry in A that is strictly less than the corresponding entry in B. If both
the arrays are element-wise identical, we write A = B. Now, if either of the
cases is true – A ≺ B or A = B – we write A ⪯ B.
Let us now come back to need[i] ⪯ cur cnt. It means that the maximum
requirement of a process is less than the currently available count of resources
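A compact sketch of this safety check (Algorithm 1) is shown below, using the arrays discussed above. N and M are illustrative constants, and the array names follow Table 5.2.

#include <stdbool.h>
#include <string.h>

#define N 5   /* number of processes (illustrative) */
#define M 3   /* number of resource types (illustrative) */

bool is_safe(int avlbl[M], int need[N][M], int acq[N][M])
{
    int  cur_cnt[M];
    bool done[N] = { false };
    memcpy(cur_cnt, avlbl, sizeof(cur_cnt));

    bool progress = true;
    while (progress) {
        progress = false;
        for (int i = 0; i < N; i++) {
            if (done[i])
                continue;
            bool fits = true;
            for (int j = 0; j < M; j++)
                if (need[i][j] > cur_cnt[j]) { fits = false; break; }
            if (!fits)
                continue;
            /* Process i can run to completion; it then returns all its resources. */
            for (int j = 0; j < M; j++)
                cur_cnt[j] += acq[i][j];
            done[i] = true;
            progress = true;
        }
    }
    for (int i = 0; i < N; i++)
        if (!done[i])
            return false;  /* some process can never be satisfied: the state is unsafe */
    return true;           /* a safe completion order exists */
}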
Let us now look at the resource request algorithm (Algorithm 2). We start
out with introducing a new array called req, which holds process i’s require-
ments. For example, if req[j] is equal to k, it means that process i needs k
copies of resource j.
Let us now move to the check phase. Consider the case where need[i] ≺ req,
which basically means that every entry of req is greater than or equal to the cor-
responding entry of need[i], and at least one entry is strictly greater than the
corresponding entry in need[i]. In this case, there are clearly more require-
ments than what was declared a priori (stored in the need[i] array). Such
requests cannot be satisfied. We need to return false. On the other hand, if
avlbl ≺ req, then it means that we need to wait for resource availability, which
may happen in the future. In this case, we are clearly not exceeding pre-declared
thresholds, as we were doing in the former case.
Next, let us make a dummy allocation once enough resources become avail-
able (allocate). The first step is to subtract req from avlbl. This basically
means that we satisfy the request for process i. The resources that it requires
are not free anymore. Then we add req to acq[i], which basically means that
the said resources have been acquired. We then proceed to subtract req from
need[i]. This is because at all points of time, max=acq + need.
After this dummy allocation, we check if the state is safe or not by invoking
Algorithm 1. If the state is not safe, then it means that the current resource
allocation request should not be allowed – it may lead to a deadlock.
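Combining these steps, a sketch of the resource-request check (Algorithm 2) is shown below. It reuses the is_safe sketch given earlier; for brevity, the distinction between denying a request and making the process wait is left to the comments.

/* Returns true if process i's request req[] can be granted right away while
   keeping the system in a safe state. */
bool request_resources(int i, const int req[M],
                       int avlbl[M], int need[N][M], int acq[N][M])
{
    for (int j = 0; j < M; j++)
        if (req[j] > need[i][j])
            return false;              /* asking for more than was declared a priori */
    for (int j = 0; j < M; j++)
        if (req[j] > avlbl[j])
            return false;              /* not enough free resources now: the process must wait */

    /* Tentative (dummy) allocation. */
    for (int j = 0; j < M; j++) {
        avlbl[j]   -= req[j];
        acq[i][j]  += req[j];
        need[i][j] -= req[j];
    }
    if (is_safe(avlbl, need, acq))
        return true;                   /* the allocation keeps the system in a safe state */

    /* Unsafe: roll back the dummy allocation and do not grant the request. */
    for (int j = 0; j < M; j++) {
        avlbl[j]   += req[j];
        acq[i][j]  -= req[j];
        need[i][j] += req[j];
    }
    return false;
}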
The 2-D array reqs represents the current requests (reqs[i] is the request of
process i). We have the following relationship between them: reqs ⪯ need.
Let us now understand the expression reqs[i] ⪯ cur cnt. This basically
means that for some process i, we can satisfy its request at that point of time.
We subsequently move to update, where we assume that i’s request has been
satisfied. Therefore, similar to the safety checking algorithm, we return the
resources that i had held. We thus add acq[i] to cur cnt. This process is done
now (done[i] ← true). We go back to the find procedure and keep iterating till
we can satisfy the requests of as many processes as possible. When this is not
possible anymore, we jump to deadlock check.
Now, if done[i] == true for all processes, then it means that we were able
to satisfy the requests of all processes. There cannot be a deadlock. However,
if this is not the case, then it means that there is a dependency between pro-
cesses because of the resources that they are holding. This indicates a potential
deadlock situation.
There are several ways of avoiding a deadlock. The first is that before every
resource/lock acquisition we check the request using Algorithm 2. We do not
acquire the resource if we are entering an unsafe state. If the algorithm is
more optimistic, and we have entered an unsafe state already, then we perform
a deadlock check, especially when the system does not appear to make any
progress. We kill one of the processes involved in a deadlock and release its
resources. We can choose one of the processes that has been waiting for a long
time or has a very low priority.
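To make this concrete, the following is a minimal C sketch (ours, not taken from the kernel or any real system) of the resource-request check and the accompanying safety test. It assumes NPROC processes and NRES resource types, and mirrors the need, acq and avlbl arrays used above.

#include <stdbool.h>

#define NPROC 8   /* assumed number of processes */
#define NRES  4   /* assumed number of resource types */

/* Global state mirroring the arrays used in the text */
int need[NPROC][NRES];   /* remaining declared requirement */
int acq[NPROC][NRES];    /* resources currently acquired */
int avlbl[NRES];         /* currently available resources */

/* Safety check in the spirit of Algorithm 1: repeatedly find a process
   whose remaining need can be met, pretend it finishes, and return its
   resources. If all processes can finish, the state is safe. */
static bool is_safe(void)
{
    int cur_cnt[NRES];
    bool done[NPROC] = { false };

    for (int j = 0; j < NRES; j++)
        cur_cnt[j] = avlbl[j];

    for (int finished = 0; finished < NPROC; ) {
        bool progress = false;
        for (int i = 0; i < NPROC; i++) {
            if (done[i])
                continue;
            bool ok = true;
            for (int j = 0; j < NRES; j++)
                if (need[i][j] > cur_cnt[j]) { ok = false; break; }
            if (ok) {
                for (int j = 0; j < NRES; j++)
                    cur_cnt[j] += acq[i][j];   /* release its resources */
                done[i] = true;
                finished++;
                progress = true;
            }
        }
        if (!progress)
            return false;   /* some processes can never finish: unsafe */
    }
    return true;
}

/* Request check in the spirit of Algorithm 2: returns true if process i's
   request req[] can be granted safely. */
bool request_resources(int i, int req[NRES])
{
    /* Check phase */
    for (int j = 0; j < NRES; j++) {
        if (req[j] > need[i][j])
            return false;   /* exceeds the pre-declared threshold */
        if (req[j] > avlbl[j])
            return false;   /* must wait for resources to become available */
    }

    /* Dummy allocation */
    for (int j = 0; j < NRES; j++) {
        avlbl[j]   -= req[j];
        acq[i][j]  += req[j];
        need[i][j] -= req[j];
    }

    if (is_safe())
        return true;

    /* Unsafe: roll back the dummy allocation */
    for (int j = 0; j < NRES; j++) {
        avlbl[j]   += req[j];
        acq[i][j]  -= req[j];
        need[i][j] += req[j];
    }
    return false;
}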
do {
    preempt_disable();
    __schedule(SM_NONE);
    sched_preempt_enable_no_resched();
} while (need_resched());
There are several ways in which the schedule function can be called. If
a task makes a blocking call to a mutex or semaphore, then there is a pos-
sibility that it may not acquire the mutex/semaphore. In this case, the task
needs to be put to sleep. The state will be set to either INTERRUPTIBLE or
UNINTERRUPTIBLE. Since the current task is going to sleep, there is a need
to invoke the schedule function such that another task can execute.
The second case is when a process returns after processing an interrupt or
system call. The kernel checks the TIF NEED RESCHED flag. If it is set to true,
then it means that there are waiting tasks and there is a need to schedule them.
On similar lines, if there is a timer interrupt, there may be a need to swap the current task out and bring a new task in (preemption). Again, we need to call the schedule function to pick a new task to execute on the current core.
Every CPU has a runqueue where tasks are added. This is the main data
structure that manages all the tasks that are supposed to run on a CPU. The
apex data structure here is the runqueue (struct rq) (see kernel/sched/sched.h).
Linux defines different kinds of schedulers (refer to Table 5.3). Each sched-
uler uses a different algorithm to pick the next task that needs to run on a
core. The internal schedule function is a wrapper function on the individual
scheduler-specific function. There are many types of runqueues – one for each
type of scheduler.
Scheduling Classes
Let us introduce the notion of scheduling classes. A scheduling class represents
a class of jobs that need to be scheduled by a specific type of scheduler. Linux
defines a hierarchy of scheduling classes. This means that if there is a pending
job in a higher scheduling class, then we schedule it first before scheduling a job
in a lower scheduling class.
The classes are as follows in descending order of priority.
Stop Task This is the highest priority task. It stops everything and executes.
DL This is the deadline scheduling class that is used for real-time tasks. Every
task is associated with a deadline. Typically, audio and video encoders
create tasks in this class. This is because they need to finish their work
in a bounded amount of time. For 60-Hz video, the deadline is 16.66 ms.
RT These are regular real-time threads that are typically used for processing
interrupts (top or bottom halves), for example softIRQs.
Fair This is the default scheduler that the current version of the kernel uses
(v6.2). It ensures a degree of fairness among tasks where even the lowest
priority task gets some CPU time.
Idle This scheduler runs the idle process, which means it basically accounts for
the time in which the CPU is not executing anything – it is idle.
In Listing 5.33, we observe that most of the functions have the same broad
pattern. The key argument is the runqueue struct rq that is associated with
each CPU. It contains all the task structs scheduled to run on a given CPU. In
any scheduling operation, it is mandatory to provide a pointer to the runqueue
such that the scheduler can find a task among all the tasks in the runqueue to
execute on the core. We can then perform several operations on it such as en-
queueing or dequeueing a task: enqueue task and dequeue task, respectively.
The most important functions in any scheduler are the functions pick task
and pick next task – they select the next task to execute. These functions are
scheduler specific. Each type of scheduler maintains its own data structures and
has its own internal notion of priorities and fairness. Based on the scheduler’s
task selection algorithm an appropriate choice is made. The pick task function
is the fast path that finds the highest priority task (all tasks are assumed to
be separate), whereas the pick next task function is on the slow path. The
slow path incorporates some additional functionality, which can be explained
as follows. Linux has the notion of control groups (cgroups). These are groups
of processes that share scheduling resources. Linux ensures fairness across pro-
cesses and cgroups. In addition, it ensures fairness between processes in a
cgroup. cgroups further can be grouped into hierarchies. The pick next task
function ensures fairness while also considering cgroup information.
Let us consider a few more important functions. migrate task rq mi-
grates the task to another CPU – it performs the crucial job of load balancing.
update curr performs some bookkeeping for the current task – it updates its
runtime statistics. There are many other functions in this class such as func-
tions to yield the CPU, check for preemptibility, set CPU affinities and change
priorities.
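The following is an abridged sketch of what such a scheduling class interface looks like. The members follow the functions discussed above; the real struct sched class in kernel/sched/sched.h has many more members and slightly different signatures.

/* Abridged sketch of the scheduling class interface (simplified from
   kernel/sched/sched.h); not the exact kernel definition. */
struct rq;
struct task_struct;

struct sched_class {
    /* add/remove a task from this class's runqueue */
    void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);

    /* fast path and slow path (cgroup-aware) task selection */
    struct task_struct *(*pick_task)(struct rq *rq);
    struct task_struct *(*pick_next_task)(struct rq *rq);

    /* load balancing: move a task to another CPU's runqueue */
    void (*migrate_task_rq)(struct task_struct *p, int new_cpu);

    /* bookkeeping: update runtime statistics of the current task */
    void (*update_curr)(struct rq *rq);
};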
These scheduling classes are defined in the kernel/sched directory. Each
scheduling class has an associated scheduler, which is defined in a separate C
file (see Table 5.3).
Table 5.3: Schedulers and their implementation files (in kernel/sched)

Scheduler                          File
Stop task scheduler                stop_task.c
Deadline scheduler                 deadline.c
Real-time scheduler                rt.c
Completely fair scheduler (CFS)    fair.c
Idle scheduler                     idle.c
The runqueue
Let us now take a deeper look at a runqueue (struct rq) in Listing 5.34. The
entire runqueue is protected by a single spinlock lock. It is used to lock all key
operations on the runqueue. Such a global lock that protects all the operations
on a data structure is known as a monitor lock.
The next few fields are basic CPU statistics. The field nr running is the
number of runnable processes in the runqueue. nr switches is the number of
process switches on the CPU and the field cpu is the CPU number.
The runqueue is actually a container of individual scheduler-specific run-
queues. It contains three fields that point to runqueues of different schedulers:
cfs, rt and dl. They correspond to the runqueues for the CFS, real-time and
deadline schedulers, respectively. We assume that in any system, at the min-
imum we will have three kinds of tasks: regular (handled by CFS), real-time
tasks and tasks that have a deadline associated with them. These scheduler
types are hardwired into the logic of the runqueue.
It holds pointers to the current task (curr), the idle task (idle) and the mm struct of the previously running task (prev mm). The task that is chosen to execute (in core scheduling) is stored in the field core pick, which is a pointer to a struct task struct.
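The fields discussed above can be summarized in the following simplified sketch (ours); the real struct rq in kernel/sched/sched.h is much larger, and the per-scheduler runqueues are embedded structures rather than pointers.

typedef unsigned long long u64;
typedef struct { int locked; } raw_spinlock_t;   /* stand-in for the kernel type */

struct cfs_rq; struct rt_rq; struct dl_rq;
struct task_struct; struct mm_struct;

struct rq {
    raw_spinlock_t lock;          /* monitor lock protecting the runqueue */

    unsigned int nr_running;      /* number of runnable tasks */
    u64 nr_switches;              /* number of context switches on this CPU */
    int cpu;                      /* CPU number */

    struct cfs_rq *cfs;           /* CFS runqueue */
    struct rt_rq  *rt;            /* real-time runqueue */
    struct dl_rq  *dl;            /* deadline runqueue */

    struct task_struct *curr;     /* currently running task */
    struct task_struct *idle;     /* idle task of this CPU */
    struct mm_struct   *prev_mm;  /* mm of the previously running task */

    struct task_struct *core_pick; /* task chosen to execute (core scheduling) */
};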
Scheduling-related Statistics
/* Preferred CPU */
struct rb_node core_node;

/* statistics */
u64 exec_start;
u64 sum_exec_runtime;
u64 vruntime;
u64 prev_sum_exec_runtime;
u64 nr_migrations;
struct sched_avg avg;

/* runqueue */
struct cfs_rq *cfs_rq;
};
Notion of vruntimes
Equation 5.4 shows the relation between the actual runtime and the vruntime. The increment in the vruntime, δvruntime, is the actual runtime multiplied by a scaling factor. Let δ be equal to the time interval between the current time and the time at which the current task started executing. If δvruntime is equal to δ, then it means that we are not using a scaling factor. The scaling factor is equal to the weight associated with a nice value of 0 divided by the weight associated with the real nice value. We clearly expect this ratio to be less than 1 for high-priority tasks and greater than 1 for low-priority tasks.
δ_vruntime = δ × weight(nice = 0) / weight(nice)                (5.4)
Listing 5.37 shows the mapping between nice values and weights. The weight is 1024 for the nice value 0, which is the default. For every increase in the nice value by 1, the weight reduces by a factor of 1.25. For example, if the nice value is 5, the weight is 335, and δvruntime = 3.05δ. Clearly, we have an exponential decrease in the weight as we increase the nice value. For a nice value of n, the weight is roughly 1024/(1.25)^n. The highest priority user task (nice value −20) has a weight equal to 88761 (86.7× the default weight). This means that it gets significantly more runtime as compared to a task that has the default priority.
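The following user-space sketch illustrates Equation 5.4 and the 1024/(1.25)^n weight rule using floating-point arithmetic. The kernel instead uses a precomputed integer table (sched_prio_to_weight) and fixed-point arithmetic; the function names here are ours.

#include <math.h>
#include <stdio.h>

/* Weight associated with a nice value: roughly 1024 / (1.25)^nice */
static double nice_to_weight(int nice)
{
    return 1024.0 / pow(1.25, nice);
}

/* delta_vruntime = delta * weight(nice = 0) / weight(nice)  (Equation 5.4) */
static double delta_vruntime(double delta_ns, int nice)
{
    return delta_ns * nice_to_weight(0) / nice_to_weight(nice);
}

int main(void)
{
    /* A nice 5 task accumulates vruntime roughly 3.05x faster than a
       nice 0 task; a nice -20 task accumulates it about 86.7x slower. */
    printf("nice 0  : %.0f ns\n", delta_vruntime(1000000.0, 0));
    printf("nice 5  : %.0f ns\n", delta_vruntime(1000000.0, 5));
    printf("nice -20: %.0f ns\n", delta_vruntime(1000000.0, -20));
    return 0;
}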
Let us use the three mnemonics SP , N and G for the sake of readability.
Please refer to the code snippet shown in Listing 5.38. If the number of runnable tasks is more than N (the limit on the number of runnable tasks that can be considered in a scheduling period SP), then it means that the system is swamped with tasks. We clearly have more tasks than what we can run – a crisis situation, and a rather unlikely one. The only option in this case is to increase the scheduling period by multiplying nr running with G (the minimum task execution time).
Let us consider the else part, which is the more likely case. In this case, we
set the scheduling period as SP .
Listing 5.38: Implementation of scheduling quanta in CFS
source : kernel/sched/fair.c
u64 __sched_period(unsigned long nr_running)
{
    if (unlikely(nr_running > sched_nr_latency))
        return nr_running * sysctl_sched_min_granularity;
    else
        return sysctl_sched_latency;
}
Once the scheduling period has been set, we set the scheduling slice for
each task as shown in Equation 5.5 (assuming we have the normal case where
nr running ≤ N).
slice_i = SP × weight(task_i) / Σ_j weight(task_j)              (5.5)
We basically partition the scheduling period based on the weights of the con-
stituent tasks. Clearly, high-priority tasks get larger scheduling slices. However,
if we have the unlikely case where nr running > N, then each slice is equal to
G.
The scheduling algorithm works as follows. We find the task with the least
vruntime in the red-black tree. We allow it to run until it exhausts its scheduling
slice. This logic is shown in Listing 5.39. Here, if the CFS queue is non-
empty, we compute the time for which a task has already executed (ran). If
slice > ran, then we execute the task for slice − ran time units by setting
the timer accordingly, otherwise we reschedule the current task.
Listing 5.39: hrtick start fair
source : kernel/sched/fair.c
if (rq->cfs.h_nr_running > 1) {
    u64 slice = sched_slice(cfs_rq, se);
    u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
    s64 delta = slice - ran;

    if (delta < 0) {
        if (task_current(rq, p))
            resched_curr(rq);
        return;
    }
    hrtick_start(rq, delta);
}
Clearly, once a task has exhausted its slice, its vruntime has increased and
its position needs to be adjusted in the RB tree. In any case, every time we
need to schedule a task, we find the task with the least vruntime in the RB tree
and check if it has exhausted its time slice or not. If it has, then we mark it as
a candidate for rescheduling (if there is spare time left in the current scheduling
period) and move to the next task in the RB tree with the second-smallest
vruntime. If that also has exhausted its scheduling slice or is not ready for some
reason, then we move to the third-smallest, and so on.
Once all tasks are done, we try to execute tasks that are rescheduled, and
then start the next scheduling period.
Special care needs to be taken for new tasks, tasks waking up after a long sleep and tasks getting migrated. They will start with a zero vruntime and shall continue to have the minimum vruntime for a long time. This has to be prevented – it is unfair for existing tasks. Also, when tasks move from a heavily-loaded CPU to a lightly-loaded CPU, they should not have an unfair advantage there. The following safeguards are in place.

1. If an old task is being restored or a new task is being added, then we set se->vruntime += cfs_rq->min_vruntime. This maintains some degree of a level playing field and ensures that other existing tasks have a fair chance of getting scheduled.

2. We always ensure that all vruntimes monotonically increase (in the cfs rq and sched entity structures).
loadavg = u_0 + u_1 × y + u_2 × y² + · · ·                       (5.6)

This is a time-series sum with a decay factor y. The decay rate is quite slow: y^32 = 0.5, or in other words, y = 2^(−1/32). This is known as per-entity load tracking (PELT, kernel/sched/pelt.c), where the number of intervals for which we compute the load average is a configurable parameter.
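A floating-point sketch of this decayed sum is shown below. The kernel's implementation in kernel/sched/pelt.c uses 1 ms segments and fixed-point arithmetic; the function name and array layout here are purely illustrative.

#include <math.h>

/* u[i] is the utilization of the i-th most recent interval; y = 2^(-1/32) */
static double pelt_load_avg(const double *u, int n)
{
    const double y = pow(2.0, -1.0 / 32.0);   /* y^32 = 0.5 */
    double load = 0.0, decay = 1.0;

    for (int i = 0; i < n; i++) {
        load += u[i] * decay;   /* u0 + u1*y + u2*y^2 + ... */
        decay *= y;
    }
    return load;
}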
Real-time Scheduler
The real-time scheduler has one queue for every real-time priority. In addition, we have a bit vector – one bit for each real-time priority – that records whether the corresponding queue is non-empty. The scheduler finds the highest-priority non-empty queue and starts picking tasks from it. If there is a single task in that queue, then that task executes. The scheduling is clearly not fair: there is no notion of fairness across real-time priorities.
However, for tasks having the same real-time priority, there are two op-
tions: FIFO and round-robin (RR). In the real-time FIFO option, we break ties
between two equal-priority tasks based on when they arrived (first-in first-out
order). In the round-robin (RR) algorithm, we check if a task has exceeded its
allocated time slice. If it has, we put it at the end of the queue (associated
with the real-time priority). We find the next task in this queue and mark it
for execution.
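The following sketch captures the idea of a per-priority queue array with a bitmap of non-empty queues. It is a simplification of the kernel's rt_prio_array; the names are ours, a smaller index denotes a higher priority in this sketch, and 64-bit words are assumed for the bitmap.

#include <stdbool.h>

#define MAX_RT_PRIO 100   /* real-time priorities 0..99 (0 = highest here) */

struct rt_task {
    struct rt_task *next;
    /* ... other fields ... */
};

struct rt_runqueue {
    unsigned long   bitmap[(MAX_RT_PRIO + 63) / 64];  /* non-empty queues */
    struct rt_task *queue[MAX_RT_PRIO];               /* one FIFO queue per priority */
};

static bool bit_is_set(const unsigned long *bm, int i)
{
    return (bm[i / 64] >> (i % 64)) & 1UL;
}

/* Find the highest-priority non-empty queue and return its first task */
struct rt_task *pick_next_rt_task(struct rt_runqueue *rq)
{
    for (int prio = 0; prio < MAX_RT_PRIO; prio++) {
        if (bit_is_set(rq->bitmap, prio))
            return rq->queue[prio];   /* FIFO: head of the queue */
    }
    return 0;   /* no runnable real-time task */
}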
W_i(t) = Σ_{j=1}^{i} ⌈t/P_j⌉ × d_j
Q_i(t) = W_i(t) / t
Q_i = min_{0 < t ≤ P_i} Q_i(t)                                   (5.7)
Q = max_{1 ≤ i ≤ n} Q_i
We consider a time interval t, and find the number of periods of job j that
are contained within it. Then we multiply the number of periods with the
execution time of a task (in job j). This is the total CPU load for the j th job. If
we aggregate the loads for the first i jobs in the system (arranged in descending
order of RMS priority), we get the cumulative CPU load Wi (t). Let us next
compute Qi (t) = Wi (t)/t. It is the mean load of the first i tasks over the time
interval t.
Next, let us minimize this quantity over the time period t and a given i. Let
this quantity be Qi . If Qi ≤ 1, then the ith task is schedulable and vice versa.
Let us define Q = max(Qi ). If Q ≤ 1, then it means that all the tasks are
schedulable. It turns out that this is both a necessary and sufficient condition.
For obvious reasons, it is not as elegant and easy to compute as the Liu-Layland
bound. Nevertheless, this is a more exact expression and is often used to assess
schedules.
only when Ii (t) + di > t. We subsequently set the new value of t to be equal to
the sum of the interference (Ii(t)) and the execution time of the ith task (di). We
basically set t ← Ii (t) + di .
Before proceeding to the next iteration, it is necessary to perform a sanity
check. We need to check if t exceeds the deadline Di or not. If it exceeds the
deadline, clearly the ith task is not schedulable. We return false. If the deadline
has not been exceeded, then we can proceed to the next iteration.
We perform the same set of steps in the next iteration. We compute the new
value of the interference using the new value of t. Next, we add the execution
time di of the ith task to it. Now, if the sum is less than or equal to the value
of t, we are done. We can declare the ith task to be schedulable, subject to the
fact that t ≤ Di . Otherwise, we increment t and proceed to the next iteration.
Given the fact that in every iteration we increase t, we will either find task i to
be schedulable or t will ultimately exceed Di .
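The iterative procedure described above can be summarized in the following sketch. The array names d, P and D (execution times, periods and deadlines) are ours, and tasks are assumed to be sorted in descending order of priority.

#include <stdbool.h>

/* Interference on task i over an interval of length t:
   sum over all higher-priority tasks j of ceil(t / P[j]) * d[j] */
static long interference(int i, long t, const long d[], const long P[])
{
    long I = 0;
    for (int j = 0; j < i; j++)
        I += ((t + P[j] - 1) / P[j]) * d[j];
    return I;
}

/* Returns true if task i is schedulable */
bool response_time_test(int i, const long d[], const long P[], const long D[])
{
    long t = d[i];                 /* initial guess: its own execution time */
    while (true) {
        long I = interference(i, t, d, P);
        if (I + d[i] <= t)         /* fixed point reached */
            return t <= D[i];
        t = I + d[i];              /* next iterate */
        if (t > D[i])
            return false;          /* deadline exceeded */
    }
}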
Let us first consider a simple setting where there are two tasks in the system.
The low-priority task happens to lock a resource first. When the high-priority task tries to access the resource, it blocks. However, in this case the blocking time is
predictable – it is the time that the low-priority task will take to finish using the
resource. After that the high-priority task is guaranteed to run. This represents
the simple case and is an example of bounded priority inversion.
Let us next consider the more complicated case. If a high-priority task is
blocked by a low-priority task, a medium priority task can run in its place.
This ends up blocking the low-priority task, which is holding the resource. If
such medium-priority tasks continue to run, the low-priority task may remain
blocked for a very long time. Here, the biggest loser is the high-priority task
because the time for which it will remain inactive is not known and dependent
on the behavior of many other tasks. Hence, this scenario is known as unbounded
priority inversion.
Next, assume that a task needs access to k resources, which it needs to
acquire sequentially (one after the other). It may undergo priority inversion
(bounded or unbounded) while trying to acquire each of these k resources. The
total amount of time that the high-priority task spends in the blocked state
may be prohibitive. This is an example of chain blocking, which needs to be
prevented. Assume that these are nested locks – the task acquires resources
without releasing previously held resources.
To summarize, the main issues that arise out of priority inversion related
phenomena are unbounded priority inversion and chain blocking. Coupled with
known issues like deadlocks, we need to design protocols such that all three
scenarios are prevented by design.
Point 5.5.1
• Let the pri function denote the priority of a task. For example,
pri(T ) is the instantaneous priority of task T . Note that this can
change over time. Furthermore, let us assume that the initial pri-
orities assigned to tasks are fully comparable – no two priorities
are equal. This can easily be accomplished by breaking ties (same
priority numbers) using task numbers.
• Assume a uniprocessor system.
Let us now describe the priority inheritance protocol (PIP). Let the priority of the resource-holding task Thld be phld and the priority of the resource-requesting task Treq be preq . If phld < preq , we temporarily raise the priority of Thld to preq . However, if phld > preq , nothing needs to be done.
Note that this is a temporary action. Once the contended resource is released,
the priority of Thld reverts to phld . Now, it is possible that phld may not be
the original priority of Thld because this itself may be a boosted priority that
Thld may have inherited because it held some other resource. We will not be
concerned about that and just revert the priority to the value that existed just
before the resource was acquired, which is phld in this case.
Note that a task can inherit priorities from different tasks in the interval of
time in which it holds a resource. Every time a task is blocked because it cannot
access a resource, it tries to make the resource-holding task inherit its priority
if its priority is greater than the priority of the resource-holding task.
Let us explain with an example. Assume that the real-time priority of the
low-priority task Tlow is 5. The priority of a medium-priority task Tmed is 10,
and the priority of the high-priority task Thigh is 15. These are all real-time priorities: the higher the number, the greater the priority. Now assume that Tlow is the
first to acquire the resource. Next, Tmed tries to acquire the resource. Due to
priority inheritance, the priority of Tlow now becomes 10. Next, Thigh tries to
acquire the resource. The priority of Tlow ends up getting boosted again. It is
now set to 15. After releasing the resource, the priority of Tlow reverts back to
5.
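The following conceptual sketch captures the two actions in this example: boosting the holder's priority when a higher-priority task blocks, and reverting to the pre-acquisition priority on release. It is not the kernel's rt-mutex code; all the structure and function names are illustrative, a larger number means a higher priority (as in the example), and the scheduler side is omitted.

struct task {
    int prio;         /* current (possibly boosted) priority */
    int saved_prio;   /* priority to restore on release */
};

struct resource {
    struct task *holder;   /* NULL if free */
};

/* Called when 'req' blocks on a resource held by 'res->holder' */
void inherit_priority(struct resource *res, struct task *req)
{
    struct task *hld = res->holder;
    if (hld && req->prio > hld->prio)
        hld->prio = req->prio;          /* temporary priority boost */
}

void acquire(struct resource *res, struct task *t)
{
    t->saved_prio = t->prio;            /* remember the pre-acquisition value */
    res->holder = t;
}

void release(struct resource *res)
{
    struct task *t = res->holder;
    t->prio = t->saved_prio;            /* revert to the pre-acquisition priority */
    res->holder = 0;
}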
This is a very effective idea, and it is simple to implement. This is why many
versions of Linux support the PIP protocol. Sadly, the PIP protocol suffers from
deadlocks, unbounded priority inversion and chain blocking. Let us explain.
Point 5.5.2
Deadlocks and chain blocking are the major issues in the PIP protocol.
In the HLP protocol, we assume that we know beforehand the requests that a task will make. Moreover, let us define a ceil(resource) function, which is defined as the priority of the highest priority task that can possibly acquire a resource (some time in the future).
Priority Inversion
Deadlocks
Next, let us consider deadlocks. Let us consider the same example that we
considered in the case of the priority inheritance protocol. In this case Tlow
acquires R1 first. It is not possible for Thigh to run. This is because the priority
of Tlow will become at least phigh + 1. It is thus not possible for the scheduler to
choose Thigh . In fact, it is possible to easily prove that the moment a resource
is acquired, no other task that can possibly acquire the resource can run – all
their priorities are less than ceil(resource) + 1. Hence, resource contention is
not possible because the contending task can never execute after the resource
acquisition. If there is no contention, a deadlock is not possible.
For every resource, we can define a set of tasks that may possibly acquire
it. Let us refer to it as the resource’s request set. The moment one of them
acquires the resource, the priority of that task gets set to ceil(resource) + 1.
This means that no other task in the request set can start or resume execution
after the resource R has been acquired. This ensures that henceforth there will
be no contention because of R. No contention ⇒ No deadlock.
Chain Blocking
The key question that we need to answer is after task T has acquired a resource
R, can it get blocked when it tries to acquire more resources? Assume that T
has acquired R, and then it tries to acquire R′ . If R′ is free, then there is no
chain blocking. Now, assume that R′ is already acquired by task T ′ – this leads
to T getting blocked. Let this be the first instance of such a situation, where
a task gets blocked while trying to acquire a resource after already acquiring
another resource. The following relationships hold:
Let pri denote the instantaneous priority function. It is clear that T ′ is not
blocked (given our assumption). Now, given that T is running, its priority must
be more than that of T ′ . Next, from the definition of the HLP protocol, we
have pri(T ′ ) > ceil(R′ ). Note that this relationship holds because the resource
R′ has already been acquired by T ′ . The priority of T ′ can be further boosted
because T ′ may have acquired other resources with higher ceilings. In any case,
after resource acquisition pri(T ′ ) > ceil(R′ ) holds.
If we combine these equations, we have pri(T ) > ceil(R′ ). Note that at this
point of time, T has not acquired R′ yet. Given that the ceiling is defined as
the maximum priority of any interested task, we shall have pri(T ) ≤ ceil(R′ )
(before R′ has been acquired by T ). We thus have a contradiction here.
Hence, we can conclude that it never shall be the case that a task that has
already acquired one resource is waiting to acquire another. There will thus be
no chain blocking. Let us now quickly prove a lemma about chain blocking and
deadlocks.
Lemma 1
If there is no chain blocking, there can be no deadlocks.
Proof: Deadlocks happen because a task holds on to one resource, and tries
to acquire another (hold and wait condition). Now, if this process is guaranteed
to happen without blocking (no chain blocking), then a hold-and-wait situation
will never happen. Given that hold-and-wait is one of the necessary conditions
for a deadlock, there will be no deadlocks.
Inheritance Blocking
This protocol however does create an additional issue namely inheritance block-
ing. Assume that the priority of task T is 5 and the resource ceiling is 25. In
this case, once T acquires the resource, its priority becomes 26. This is very
high because 25 is a hypothetical maximum that may get realized very rarely.
Because of this action all the high-priority tasks with priorities between 6 and
25 get preempted. They basically get blocked because T inherited the priority
26. The sad part is that there may be no other process that is interested in
acquiring the resource regardless of its priority. We still end up blocking a lot
of other processes.
Point 5.5.3
Inheritance blocking is the major issue in the HLP protocol. It does not
suffer from chain blocking, deadlocks and unbounded priority inversion.
Let us outline the two basic rules that determine the behavior of the PCP protocol. Here, the current system ceiling (CSC) is the maximum ceiling among all the resources that are currently held in the system.

Inheritance Clause The task holding a resource inherits the priority of the blocked task, if its priority is lower.

Resource Grant Clause A task is allowed to acquire a resource if it already holds the resource that has set the CSC, or if its priority is strictly greater than the CSC.
Let us understand the resource grant clause in some further detail. Let
us call a resource that has set the CSC a critical resource. If a task T owns a
critical resource, then the resource grant clause allows it to acquire an additional
resource. There are two cases that arise after the resource has been acquired.
Either the existing critical resource continues to remain critical or the new
resource that is going to be acquired becomes critical. In both cases, T continues
to own the critical resource.
In the other subclause, a task can acquire a resource if it has a priority
greater than the CSC. It will clearly have the highest priority in the system. It
is also obvious that it hasn’t acquired any resource yet. Otherwise, its priority
would not have been greater than the CSC. Let us state a few lemmas without
proof. It is easy to prove them using the definition of CSC.
Lemma 2
The CSC is greater than or equal to the priority of any task that currently
holds a resource in the system.
Lemma 3
The moment a task whose priority is greater than the current CSC ac-
quires a resource, it sets the CSC and that resource becomes critical.
Lemma 4
After a task acquires a resource, no other task with a priority less than
or equal to the CSC at that point of time can acquire any resource until
the CSC is set to a lower value.
Proof: A resource is acquired when the task priority is either more than the
CSC or the CSC has already been set by a resource that the task has acquired
in the past. In either case, we can be sure that the task has acquired a critical
resource (see Lemma 3). Subsequently, no new task with a priority less than
CSC can acquire a resource. It will not pass any of the subclauses of the re-
source grant clause – it will not have any resource that has set the CSC and its
priority is also less than the CSC.
acquiring R′ , it got blocked while trying to acquire some other resource R′′ . This
is not possible, because that instance of chain blocking will precede the instance
of chain blocking of task T , which we assume to be the first such instance in the
system. It violates our assumption. Hence, there is a contradiction here and
this case is also not possible.
We thus observe that both Cases I and II are not possible. Given that they
are exhaustive, we can conclude that chain blocking is not possible in the PCP
protocol.
Next, note that in the PCP protocol we do not elevate the priority to very high levels, as we did in the HLP protocol. The priority inheritance mechanism is the same as in the PIP protocol. Hence, we can conclude that inheritance blocking is far more controlled.
Point 5.5.4
The PCP protocol does not suffer from deadlocks, chain blocking and
unbounded priority inversion. The problem of inheritance blocking is
also significantly controlled.
Exercises
Ex. 1 — What are the four necessary conditions for a deadlock? Briefly ex-
plain each condition.
Ex. 2 — Assume a system with many short jobs with deterministic execution
times. Which scheduler should be used?
Ex. 3 — Design a concurrent stack using the compare-and-set (CAS) primitive. Use a linked list as a baseline data structure to store the stack (do not use an array).
int CAS(int *location, int old_value, int value) {
    if (*location == old_value) {
        *location = value;
        return 1;
    } else
        return 0;
}
Do not use any locks (in any form). In your algorithm, there can be starvation;
however, no deadlocks. Provide the code for the push and pop methods. They
need to execute atomically. Note that in any real system there can be arbitrary
delays between consecutive instructions.
Ex. 4 — Explain why spinlocks are not appropriate for single-processor sys-
tems yet are often used in multiprocessor systems.
Ex. 8 — Consider the kernel mutex. It has an owner field and a waiting queue.
A process is added to the waiting queue only if the owner field is populated
(mutex is busy). Otherwise, it can become the owner and grab the mutex.
However, it is possible that the process saw that the owner field is populated,
added itself to the waiting queue but by that time the owner field became empty
– the previous mutex owner left without informing the current process. There
is thus no process to wake it up now, and it may wait forever. Assume that
there is no dedicated thread to wake processes up. The current owner wakes up
one waiting process when it releases the mutex (if there is one).
Sadly, because of such race conditions, processes may wait forever. Design a
kernel-based mutex that does not have this problem. Consider all race condi-
tions. Assume that there can be indefinite delays between instructions. Try
to use atomic instructions and avoid large global locks. Assume that task ids
require 40 bits.
Ex. 9 — The Linux kernel has a policy that a process cannot hold a spinlock
while attempting to acquire a semaphore. Explain why this policy is in place.
Ex. 11 — Explain the spin lock mechanism in the Linux kernel (based on
ticket locks). In the case of a multithreaded program, how does the spin lock
mechanism create an order for acquiring the lock? Do we avoid starvation?
Ex. 13 — Why are memory barriers present in the code of the lock and
unlock functions?
Ex. 14 — Write a fair version of the reader-writer lock.
Ex. 15 — What is the lost wakeup problem? Explain from a theoretical per-
spective with examples.
Ex. 16 — Does the Banker’s algorithm prevent starvation? Justify your an-
swer.
Ex. 17 — We wish to reduce the amount of jitter (non-determinism in exe-
cution). Jitter arises due to interrupts, variable execution times of system calls
and the kernel opportunistically scheduling its own work when it is invoked.
This makes the same program take different amounts of time when it is run
with the same inputs. How can we create an OS that reduces the amount of
jitter? What are the trade-offs?
Ex. 21 — Show the pseudocode for registering and deregistering readers, and
the synchronize rcu function.
Ex. 23 — Why is it advisable to use RCU macros like rcu assign pointer and rcu dereference check? Why can we not read or write to the memory locations directly using simple assignment statements?
/* ... */
struct foo *p = kmalloc(sizeof(struct foo), GFP_KERNEL); /* kernel malloc */
p->a = 1;
p->b = 2;
p->c = 3;
Ex. 25 — Correct the following piece of code in the context of the RCU mech-
anism.
p = gp;
if (p != NULL) {
    myfunc(p->a, p->b, p->c);
}
Ex. 27 — How can we modify the CFS scheduling policy to fairly allocate
processing time among all users instead of processes? Assume that we have a
single CPU and all the users have the same priority (they have an equal right
to the CPU regardless of the processes that they spawn). Each user may spawn
multiple processes, where each process will have its individual CFS priority
between 100 and 139. Do not consider the real-time or deadline scheduling
policies.
Ex. 28 — How does the Linux kernel respond if the current task has exceeded
its allotted time slice?
Ex. 29 — The process priorities vary exponentially with the nice values. Why
is this the case? Explain in the context of a mix of compute and I/O-bound
jobs where the nice values change over time.
** Ex. 33 — Prove that any algorithm that uses list scheduling will have a competitive ratio (Clist /C ∗ ) that is less than or equal to (2 − 1/m). There are m processors, Clist is the makespan of the list schedule and C ∗ is the optimal makespan.
Ex. 34 — For a system with periodic and preemptive jobs, what is the uti-
lization bound (maximum value of U till which the system remains schedulable)
for EDF?
Ex. 35 — Prove that in the PCP algorithm, once the first resource is acquired,
there can be no more priority inversions (provide a very short proof).
Chapter 6
The Memory System
regions (see Figure 6.1). There are holes between the allocated regions. If a
new process is created, then its memory needs to be allocated within one of the
holes. Let us say that a process requires 100 KB and the size of a hole is 150
KB, then we are leaving 50 KB free. We basically create a new hole that is 50
KB long. This phenomenon of having holes between regions and not using that
space is known as external fragmentation. On the other hand, leaving space
empty within a page in a regular virtual memory system is known as internal
fragmentation.
[Figure 6.1: Memory divided into allocated regions and holes; each allocated region is described by a base and a limit]
The next question that we need to answer is that if we are starting a new
process, and we are exactly aware of the maximum amount of memory that it
requires, then which hole do we select for allocating its memory? Clearly the size
of the hole needs to be more than the amount of requested memory. However,
there could be multiple such holes, and we need to choose one of them. Our
choice really matters because it determines the efficiency of the entire process.
It is very well possible that later on we may not be able to satisfy requests
primarily because we will not have holes of adequate size left. Hence, designing
a proper heuristic in this space is important particularly in anticipation of the
future. There are several heuristics in this space. Let us say that we need R
bytes.
Best Fit Choose the smallest hole that is just about larger than R.
Worst Fit Choose the largest hole.
Next Fit Start searching from the last allocation that was made and move
towards higher addresses (with wraparounds).
First Fit Choose the first available hole that is large enough.
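As an illustration, a best-fit search over a list of holes can be written as follows (the hole structure and names are ours). The other heuristics differ only in the comparison that is used and in where the search starts.

#include <stddef.h>

struct hole {
    size_t base;   /* starting address of the hole */
    size_t size;   /* size of the hole in bytes */
};

/* Returns the index of the smallest hole that can satisfy a request of
   R bytes, or -1 if no hole is large enough. */
int best_fit(const struct hole *holes, int nholes, size_t R)
{
    int best = -1;
    for (int i = 0; i < nholes; i++) {
        if (holes[i].size >= R &&
            (best == -1 || holes[i].size < holes[best].size))
            best = i;
    }
    return best;
}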
In the worst case, adversarial request sequences can make their performance quite suboptimal. It is also possible to prove that they are
optimal in some cases assuming some simple distribution of memory request
sizes in the future. In general, we do not know how much memory a process is
going to access. Hence, declaring the amount of memory that a process requires
upfront is quite difficult. This information is not available to the compiler or even
the user. In today’s complex programs, the amount of memory that is going to
be used is a very complicated function of the input, and it is thus not possible
to predict it beforehand. As a result, these schemes are seldom used as of
today. They are nevertheless still relevant for very small embedded devices that
cannot afford virtual memory. However, by and large, the base-limit scheme is
consigned to the museum of virtual memory schemes.
The stack distance typically has a distribution that is similar to the one
shown in Figure 6.3. Note that we have deliberately not shown the units of the
x and y axes because the aim was to just show the shape of the figure and not
[Figure 6.3: A representative plot of the stack distance distribution (x-axis: stack distance, y-axis: probability)]
A distribution of this shape is typically used to model the stack distance curve because it captures the fact that very low stack distances are rare, then there is a strong peak and finally there is a heavy tail. This is easy to interpret and also easy to use as a theoretical tool. Furthermore, we can use it to perform some straightforward mathematical analyses as well as realize practical algorithms that rely on some form of caching or some other mechanism to leverage temporal locality.
Stack-based Algorithms
WS-Clock Algorithm
Let us now implement the approximation of the LRU protocol. A simple im-
plementation algorithm is known as the WS-Clock page replacement algorithm,
which is shown in Figure 6.4. Here WS stands for “working set”, which we shall
discuss later in Section 6.1.3.
Every physical page in memory is associated with an access bit. This bit is set to either 0 or 1 and is stored along with the corresponding page table entry. A
pointer like the minute hand of a clock points to a physical page; it is meant
to move through all the physical pages one after the other (in the list of pages)
until it wraps around.
If the access bit of the page pointed to by the pointer is equal to 1, then it is
set to 0 when the pointer traverses it. There is no need to periodically scan all
the pages and set their access bits to 0. This will take a lot of time. Instead, in
this algorithm, once there is a need for replacement, we check the access bit and
if it is set to 1, we reset it to 0. However, if the access bit is equal to 0, then we
select that page for replacement. For the time being, the scan stops at that point. The next time, the pointer starts from the same point and keeps traversing the list of pages, wrapping around when it reaches the end.
This algorithm can approximately find the pages that are not recently used
and select one of them for eviction. It turns out that we can do better if we
differentiate between unmodified and modified pages in systems where the swap
space is inclusive – every page in memory has a copy in the swap space, which
could possibly be stale. The swap space in this case acts as a lower-level cache.
Consider the pair ⟨access bit, modified bit⟩ for each page. If both the bits are equal to 0, then they remain so, and we go ahead and select that page as a candidate for replacement. On the other hand, if they are equal to ⟨0, 1⟩, which means that the page has been modified and after that its
access bit has been set to 0, then we perform a write-back and move forward.
The final state in this case is set to ⟨0, 0⟩ because the data is not deemed to be modified anymore since it has been written back to the swap space. Note that every modified
page in this case has to be written back to the swap space whereas unmodified
pages can be seamlessly evicted given that the swap space has a copy. As a
result we prioritize unmodified pages for eviction.
Next, let us consider the combination ⟨1, 0⟩. Here, the access bit is 1, so we
set it to 0. The resulting combination of bits is now ⟨0, 0⟩; we move forward. We
are basically giving the page a second chance in this case as well because it was
accessed in the recent past.
Finally, if the combination of these 2 bits is ⟨1, 1⟩, then we perform the
write-back, and reset the new state to ⟨1, 0⟩. This means that this is clearly a
frequently used frame that gets written to, and thus it should not be evicted or
downgraded (access bit set to 0).
This is per se a simple algorithm, which takes the differing overheads of
reads and writes into account. For writes, it gives a page a second chance in a
certain sense.
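The scan described above can be summarized in the following sketch. Each frame carries an ⟨access, dirty⟩ bit pair, hand is the clock pointer, and the writeback function (which copies the page to the swap space) is assumed to exist; none of this is the kernel's actual code.

#include <stdbool.h>

struct frame {
    bool access;
    bool dirty;
};

void writeback(int frame_no);   /* assumed: copy the page to the swap space */

/* Returns the index of the frame to evict */
int select_victim(struct frame *frames, int nframes, int *hand)
{
    while (true) {
        struct frame *f = &frames[*hand];
        int cur = *hand;
        *hand = (*hand + 1) % nframes;          /* advance the clock hand */

        if (!f->access && !f->dirty)            /* <0,0>: evict this frame */
            return cur;
        if (!f->access && f->dirty) {           /* <0,1>: write back -> <0,0> */
            writeback(cur);
            f->dirty = false;
        } else if (f->access && !f->dirty) {    /* <1,0>: second chance -> <0,0> */
            f->access = false;
        } else {                                /* <1,1>: write back -> <1,0> */
            writeback(cur);
            f->dirty = false;
        }
    }
}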
We need to understand that such LRU approximating algorithms are quite
heavy. They introduce artificial page access faults. Of course, they are not
as onerous as full-blown page faults because they do not fetch data from the
underlying storage device that takes millions of cycles. Here, we only need to
perform some bookkeeping and change the page access permissions. This is
much faster than fetching the entire page from the hard disk or NVM drive.
It is also known as a soft page fault. They however still lead to an exception
and require time to service. There is some degree of complexity involved in this
mechanism. But at least we are able to approximate LRU to some extent.
FIFO Algorithm
The queue-based FIFO (first-in first-out) algorithm is one of the most popular
algorithms in this space, and it is quite easy to implement because it does not
require any last-usage tracking or access bit tracking. It is easy to implement primarily because all that we need is a simple queue in memory that orders the physical pages by the time at which they were brought into memory. The page that was brought in the earliest is the replacement candidate. There is no runtime overhead in maintaining or
updating this information. We do not spend any time in setting and resetting
access bits or in servicing page access faults. Note that this algorithm is not
stack based, and it does not follow the stack property. This is not a good thing
as we shall see shortly.
Even though this algorithm is simple, it suffers from a very interesting
anomaly known as Belady's anomaly [Belady et al., 1969]. Let us understand it better by looking at the two examples shown in Figures 6.5 and 6.6.
In Figure 6.5, we show an access sequence of physical page ids (shown in square
boxes). The memory can fit only four frames. If there is a page fault, we mark
the entry with a cross otherwise we mark the box corresponding to the access
with a tick. The numbers at the bottom represent the contents of the FIFO
queue after considering the current access. After each access, the FIFO queue
is updated.
If the memory is full, then one of the physical pages (frames) in memory
needs to be removed. It is the page that is at the head of the FIFO queue –
the earliest page that was brought into memory. The reader should take some
time and understand how this algorithm works and mentally simulate it. She
needs to understand and appreciate how the FIFO information is maintained
and why this algorithm is not stack based.
Figure 6.5: FIFO replacement with 4 frames
Access sequence:   1  2  3  4  1  2  5  1  2  3  4  5
FIFO queue after each access (most recently inserted page on top):
                   1  2  3  4  4  4  5  1  2  3  4  5
                      1  2  3  3  3  4  5  1  2  3  4
                         1  2  2  2  3  4  5  1  2  3
                            1  1  1  2  3  4  5  1  2

Figure 6.6: FIFO replacement with 3 frames
Access sequence:   1  2  3  4  1  2  5  1  2  3  4  5
FIFO queue after each access (most recently inserted page on top):
                   1  2  3  4  1  2  5  5  5  3  4  4
                      1  2  3  4  1  2  2  2  5  3  3
                         1  2  3  4  1  1  1  2  5  5
In this particular example shown in Figure 6.5, we see that we have a total
of 10 page faults. Surprisingly, if we reduce the number of physical frames in
memory to 3 (see Figure 6.6), we have a very counter-intuitive result. We would
ideally expect the number of page faults to increase because the memory size is
smaller. However, we observe an anomalous result. We have 9 page faults (one
page fault less than the larger memory with 4 frames)!
The reader needs to go through this example in great detail. She needs to
understand the reasons behind this anomaly. These anomalies are only seen in
algorithms that are not stack-based. Recall that in a stack-based algorithm,
we have the stack property – at all points of time the set of pages in a larger
memory are a superset of the pages that we would have in a smaller memory.
Hence, we cannot observe such an anomaly. Now, we may be tempted to believe
that this anomaly is actually limited to small discrepancies. This means that if
we reduce the size of the memory, maybe the size of the anomaly is quite small
(limited to a very few pages).
However, this presumption is sadly not true. It was shown in a classic paper
by Fornai et al. [Fornai and Iványi, 2010a, Fornai and Iványi, 2010b] that a
sequence always exists that can make the discrepancy arbitrarily large. In other
words, it is unbounded. This is why Belady's anomaly renders many of these non-stack-based algorithms ineffective. They perform very badly in the worst case. One may argue that such “bad” cases are pathological and rare. But in reality, such bad cases do occur to a limited extent. This significantly
reduces the performance of the system because page faults are associated with
massive overheads.
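The anomaly is easy to reproduce with a few lines of code. The following self-contained program (ours) simulates FIFO replacement for the access sequence of Figures 6.5 and 6.6 and prints 10 faults with 4 frames and 9 faults with 3 frames.

#include <stdio.h>

/* Simulate FIFO page replacement and count the page faults
   (assumes nframes <= 16). */
static int fifo_faults(const int *refs, int n, int nframes)
{
    int frames[16], head = 0, used = 0, faults = 0;

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (frames[j] == refs[i]) { hit = 1; break; }
        if (hit)
            continue;
        faults++;
        if (used < nframes) {
            frames[used++] = refs[i];       /* free frame available */
        } else {
            frames[head] = refs[i];         /* evict the oldest page */
            head = (head + 1) % nframes;
        }
    }
    return faults;
}

int main(void)
{
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    int n = sizeof(refs) / sizeof(refs[0]);

    printf("4 frames: %d faults\n", fifo_faults(refs, n, 4));  /* 10 */
    printf("3 frames: %d faults\n", fifo_faults(refs, n, 3));  /*  9 */
    return 0;
}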
Figure 6.7: Page fault rate versus the working set size (x-axis: #pages in memory)
Thrashing
Consider a system with a lot of processes. If the space that is allocated to a
process is less than the size of its working set, then the process will suffer from
a high page fault rate. Most of its time will be spent in fetching its working set
and servicing page faults. The CPU performance counters will sadly indicate
that there is a low CPU utilization. The CPU utilization will be low primarily
because most of the time is going into I/O: servicing page faults. Consequently, the kernel's load calculator will also observe that the CPU load is low. Recall that we
had computed the CPU load in Section 5.4.6 (Equation 5.6) using a similar
logic.
Given that the load average is below a certain threshold, the kernel will try
to spawn more processes to increase the average CPU utilization. This will
actually exacerbate the problem and make it even worse. Now the memory that
is available to a given process will further reduce.
Alternatively, the kernel may try to migrate processes to the current CPU
that is showing reduced activity. Of course, here we are assuming a non-uniform
memory access machine (NUMA machine), where a part of the physical memory
is “close” to the given CPU. This proximate memory will now be shared between
many more processes.
In both cases, we are increasing the pressure on memory. Processes will spend
most of their time in fetching their working set into memory – the system will
thus become quite slow and unresponsive. This process can continue and become
a vicious cycle. In the extreme case, this will lead to a system crash because
key kernel threads will not be able to finish their work on time.
This phenomenon is known as thrashing. Almost all modern operating sys-
tems have a lot of counters and methods to detect thrashing. The only practical remedy is to reduce the pressure on memory: swap out or suspend some processes such that the remaining ones can fit their working sets in memory.

[Figure: Layout of the virtual address space, showing the 64 PB user space, the holes and the kernel region]
Kernel modules traditionally enjoyed more or less unfettered access to the kernel's data structures; however, of late this is changing.
Modules are typically used to implement device drivers, file systems, and
cryptographic protocols/mechanisms. They help keep the core kernel code
small, modular and clean. Of course, security is a big concern while load-
ing kernel modules and thus module-specific safeguards are increasingly getting
more sophisticated – they ensure that modules have limited access to only the
functionalities that they need. With novel module signing methods, we can
ensure that only trusted modules are loaded. 1520 MB is a representative figure
for the size reserved for storing module-related code and data in kernel v6.2.
Note that this is not a standardized number; it can vary across Linux versions
and is also configurable.
struct mm_struct {
    ...
    /* Pointer to the page table. The CR3 register is set to this value
       (a 64-bit quantity). */
    pgd_t *pgd;
    ...
};
Figure 6.9: The high-level organization of the page table (57-bit address)
Figure 6.9 shows the mm struct structure that we have seen before. It specif-
ically highlights a single field, which stores the page table (pgd t *pgd). The
page table is also known as the page directory in Linux. There are two virtual
memory address sizes that are commonly supported: 48 bits and 57 bits. We
have chosen to describe the 57-bit address in Figure 6.9. We observe that there
are five levels in a page table. The highest level of the page table is known as the
page directory (PGD). Its starting address is stored in the CR3 control register. CR3 stores the starting address of the page table (highest
level) and is specific to a given process. This means that when the process
changes, the contents of the CR3 register also need to change. It needs to point
to the page table of the new process. There is a need to also flush the TLB. This
is very expensive. Hence, various kinds of optimizations have been proposed.
We shall quickly see that the contents of the CR3 register do not change
when we make a process-to-kernel transition or in some cases in a kernel-to-
process transition as well. Here the term process refers to a user process. The
main reason for this is that changing the virtual memory context is associated
with a lot of performance overheads and thus there is a need to minimize such
events as much as possible.
The page directory is indexed using the top 9 bits of the virtual address
(bits 49-57). Then we have four more levels. For each level, the next 9 bits
(towards the LSB) are used to address the corresponding table. The reason
that we have a five-level page table here is because we have 57 virtual address
bits and thus there is a need to have more page table levels. Our aim is to
reduce the memory footprint of page tables as much as possible and properly
leverage the sparsity in the virtual address space. The details of all of these
tables are shown in Table 6.2. We observe that the last level entry is the page
table entry, which contains the mapping between the virtual page number and
the page frame number (or the number of the physical page) along with some
page protection information.
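The bit-level split described above can be illustrated as follows. The constants mirror the layout just discussed (9 bits per level, 12-bit page offset); the structure and function names are ours and are not taken from the kernel headers.

#include <stdint.h>

#define PAGE_SHIFT 12
#define IDX_MASK   0x1FFULL          /* 9 bits per level */

struct va_indices {
    unsigned pgd, p4d, pud, pmd, pte;
    unsigned offset;
};

/* Split a 57-bit virtual address into its five table indices and the
   page offset (bit ranges below use the 1-indexed convention of the text). */
static struct va_indices split_va(uint64_t va)
{
    struct va_indices ix;
    ix.offset = va & ((1ULL << PAGE_SHIFT) - 1);   /* bits 1-12  */
    ix.pte = (va >> 12) & IDX_MASK;                /* bits 13-21 */
    ix.pmd = (va >> 21) & IDX_MASK;                /* bits 22-30 */
    ix.pud = (va >> 30) & IDX_MASK;                /* bits 31-39 */
    ix.p4d = (va >> 39) & IDX_MASK;                /* bits 40-48 */
    ix.pgd = (va >> 48) & IDX_MASK;                /* bits 49-57 */
    return ix;
}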
Listing 6.1: The follow pte function (assume the entry exists)
source : mm/memory.c
int follow_pte(struct mm_struct *mm, unsigned long address,
               pte_t **ptepp, spinlock_t **ptlp) {
    pgd_t *pgd;
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *ptep;

    /* walk the five levels; error checks elided (the entry is assumed to exist) */
    pgd = pgd_offset(mm, address);
    p4d = p4d_offset(pgd, address);
    pud = pud_offset(p4d, address);
    pmd = pmd_offset(pud, address);
    ptep = pte_offset_map_lock(mm, pmd, address, ptlp); /* locks *ptlp */

    *ptepp = ptep;
    return 0;
}
Listing 6.1 shows the code for traversing the page table (follow pte func-
tion) assuming that an entry exists. We first walk the top-level page directory,
and find a pointer to the next level table. Next, we traverse this table, find a
pointer to the next level, so on and so forth. Finally, we find the pointer to the
page table entry. However, in this case, we also pass a pointer to a spinlock.
It is locked prior to returning a pointer to the page table entry. This allows us
to make changes to the page table entry. It needs to be subsequently unlocked
after it has been used/modified by another function.
Let us now look slightly deeper into the code that looks up a table in the 5-level page table. A representative example – the pmd offset function – is shown in Listing 6.2. Recall that each entry of the PUD table points to a PMD table. The pud pgtable function returns the base (virtual) address of the PMD table that the given PUD entry points to. We find the index of the PMD entry using the function pmd index and add it to this base address. This gives us a pointer to the PMD entry (of type pmd t *). Let us elaborate.
Listing 6.2: Accessing the page table at the PMD level
/* include/linux/pgtable.h */
pmd_t *pmd_offset(pud_t *pud, unsigned long address) {
    return pud_pgtable(*pud) + pmd_index(address);
}
First consider the pmd index inline function that takes the virtual address as input. We need to extract bits 22-30 (the PMD index). This is achieved by shifting the address to the right by 21 positions and then extracting the bottom 9 bits (using a bitwise AND operation). The function returns the entry number in the PMD table. This is multiplied with the size of a PMD entry and then added to the base address of the PMD table that is obtained using the pud pgtable function. Note that the multiplication is implicit primarily because the return type of the pud pgtable function is pmd t *.
Let us now look at the pud pgtable function. It relies on the va inline
function that takes a physical address as input and returns the virtual address.
The reverse is done by the pa inline function (or macro). In va(x), we
simply add the argument x to an address called PAGE OFFSET. This is not the
offset within a page, as the name may suggest. It is an offset into a memory
region where the page table entries are stored. These entries are stored in the
direct-mapped region of kernel memory. The PAGE OFFSET variable points to the
starting point of this region or some point within this region (depending upon
the architecture). Note the linear conversion between a physical and virtual
address.
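Conceptually, the conversion is a simple linear shift by PAGE OFFSET, along the lines of the following simplified definitions (the kernel's real pa additionally handles kernel-text addresses differently).

/* PAGE_OFFSET marks the start of the direct-mapped region (kernel-defined) */
#define __va(x) ((void *)((unsigned long)(x) + PAGE_OFFSET))
#define __pa(x) ((unsigned long)(x) - PAGE_OFFSET)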
The inline pud pgtable function invokes the va function with an argument that is constructed as follows. The pud val function returns the raw bits of the PUD entry, which contain the physical address of the next-level (PMD) table. We compute a bitwise AND between this value and a constant that has all 1s between bit positions 13 and 52 (rest 0s). The reason is that the maximum physical address size is assumed to be 2^52 bytes in Linux. Furthermore, we are aligning the address with a page boundary; hence, the first 12 bits (offset within the page) are set to 0. This yields the physical address of the PMD table, which is assumed to be aligned with a page boundary.
This physical address is then converted to a virtual address using the va
function. We then add the PMD index to it and find the virtual address of the
PMD entry.
struct page
struct page is defined in include/linux/mm types.h. It is a fairly complex data
structure that extensively relies on unions. Recall that a union in C is a data
type that can store multiple types of data in the same memory location. It is a
good data type to use if we want it to encapsulate many types of data, where
only one type is used at a time.
The page structure begins with a set of flags that indicate the status of the
page. They indicate whether the page is locked, modified, in the process of being
written back, active, already referenced or reserved for special purposes. Then
there is a union whose size can vary from 20 to 40 bytes depending upon the
configuration. We can store a bunch of things such as a pointer to the address
space (in the case of I/O devices), a pointer to a pool of pages, or a page
map (to map DMA pages or pages linked to an I/O device). Then we have a
reference count, which indicates the number of entities that are currently holding
a reference of the page. This includes regular processes, kernel components or
even external devices such as DMA controllers.
We need to ensure that before a page is recycled (returned to the pool of
pages), its reference count is equal to zero. It is important to note that the
page structure is ubiquitously used for numerous purposes; hence, it needs to have a very flexible structure. This is where using a union with the
large number of options for storing diverse types of data turns out to be very
useful.
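The reference counting idea can be sketched as follows. This is a conceptual illustration using C11 atomics; the kernel instead uses get_page and put_page on struct page.

#include <stdatomic.h>
#include <stdbool.h>

struct my_page {
    atomic_int refcount;
    /* flags, union with mapping/pool/DMA information, etc. */
};

static void page_get(struct my_page *p)
{
    atomic_fetch_add(&p->refcount, 1);
}

/* Returns true if the caller dropped the last reference; only then can
   the page be returned to the free pool. */
static bool page_put(struct my_page *p)
{
    return atomic_fetch_sub(&p->refcount, 1) == 1;
}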
Folios
Let us now discuss folios [Corbet, 2022, Corbet, 2021]. A folio is a compound or
aggregate page that comprises two or more contiguous pages. The reason that
folios were introduced is because memories are very large as of today, and it
is very difficult to handle the millions of pages that they contain. The sheer
translation overhead and overhead for maintaining page-related metadata and
information is quite prohibitive. Hence, a need was felt to group consecutive
pages into larger units called folios. Specifically, a folio points to the first page
in a group of pages (compound page). Additionally, it stores the number of
pages that are a part of it.
The earliest avatars of folios were meant to be a contiguous set of virtual
pages, where the folio per se is identified by a pointer to the head page (first
page). It is a single entity insofar as the rest of the kernel code is concerned. This
in itself is a very useful concept because in a sense we are grouping contiguous
virtual memory pages based on some notion of application-level similarity.
Now if the first page of the folio is accessed, then in all likelihood the rest
of the pages will also be accessed very soon. Hence, it makes a lot of sense to
prefetch these pages to memory in anticipation of being used in the near future.
However, over the years the thinking has somewhat changed even though folios
are still in the process of being fully integrated into the kernel. Now most
interpretations try to also achieve contiguity in the physical address space as
well. This has a lot of advantages with respect to I/O, DMA accesses and
reduced translation overheads. Let us discuss another angle.
Almost all server-class machines as of today have support for huge pages,
which have sizes ranging from 2 MB to 1 GB. They reduce the pressure on the
TLB and page tables, and also increase the TLB hit rate. We maintain a single entry for the entire huge page. Consider a 1 GB huge page. It can store 2^18 4 KB pages. If we store a single mapping for it, then we are basically
reducing the number of entries that we need to have in the TLB and page table
substantially. Of course, this requires hardware support and also may sometimes
be perceived to be wasteful in terms of memory. However, in today’s day and
age we have a lot of physical memory. For many applications this is a very
useful facility and the entire 1 GB region can be represented by a set of folios –
this simplifies its management significantly.
Furthermore, I/O and DMA devices do not use address translation. They
need to access physical memory directly, and thus they benefit by having a
large amount of physical memory allocated to them. It becomes very easy to
transfer a huge amount of data directly to/from physical memory if they have
a large contiguous allocation. Additionally, from the point of view of software
it also becomes much easier to interface with I/O devices and DMA controllers
because this entire memory region can be mapped to a folio. The concept
of a folio along with a concomitant hardware mechanism such as huge pages
enables us to perform such optimizations quite easily. We thus see the folio as
a multifaceted mechanism that enables prefetching and efficient management of
I/O and DMA device spaces.
Given that a folio is perceived to be a single entity, all usage and replacement-
related information (LRU stats) are maintained at the folio level. It basically
acts like a single page. It has its own permission bits as well as copy-on-write
status. Whenever a process is forked, the entire folio acts as a single unit like
a page and is copied in totality when there is a write to any constituent page.
LRU information and references are also tracked at the folio level.
Mapping the struct page to the Page Frame Number (and vice versa)
Let us now discuss how to map a page or folio structure to a page frame number
(pfn). There are several simple mapping mechanisms. Listing 6.3 shows the code
for extracting the pfn from a page table entry (pte pfn macro). We simply right
shift the address by 12 positions (PAGE SHIFT).
Listing 6.3: Converting the page frame number to the struct page and vice
versa
source: include/asm-generic/memory_model.h
#define pte_pfn(x)       phys_to_pfn((x).pte)
#define phys_to_pfn(p)   ((p) >> PAGE_SHIFT)
The next macro pfn to page has several variants. A simpler avatar of
this macro simply assumes a linear array of page structures. There are n such
structures, where n is the number of frames in memory. The code in Listing 6.3
shows a more complex variant where we divide this array into a bunch of sec-
tions. We figure out the section number from the pfn (page frame number), and
every section has a section-specific array. We find the base address of this array
and add the page frame number to it to find the starting address of the corre-
sponding struct page. The need for having sections will be discussed when we
introduce zones in physical memory (in Section 6.2.5).
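For reference, the section-based variant is implemented along the following lines (a lightly simplified sketch of the SPARSEMEM version of the pfn to page macro in include/asm-generic/memory model.h; the exact code differs slightly across kernel versions):

    #define __pfn_to_page(pfn)                                   \
    ({  unsigned long __pfn = (pfn);                             \
        struct mem_section *__sec = __pfn_to_section(__pfn);     \
        __section_mem_map_addr(__sec) + __pfn;                   \
    })

Here, __pfn_to_section locates the section that the frame belongs to, and __section_mem_map_addr returns the (suitably offset) base of that section's array of struct page structures, to which the pfn is added.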
ASIDs
Intel x86 processors have the notion of the processor context ID (PCID), which
in software parlance is also known as the address space ID (ASID). We can take
some important user-level processes that are running on a CPU and assign them
a PCID each. Then their corresponding TLB entries will be tagged/annotated
with the PCID. Furthermore, every memory access will now be annotated with
the PCID (conceptually). Only those TLB entries will be considered that match
the given PCID. Intel CPUs typically provide $2^{12}$ (= 4096) PCIDs. One of them
is reserved, hence practically 4095 PCIDs can be supported. There is no separate
register for the PCID. Instead, the lower 12 bits of the CR3 register are used to
store the current PCID.
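Conceptually, CR3 thus packs the physical base address of the top-level page table together with the 12-bit PCID of the current address space. A minimal illustration is shown below (this is not kernel code; the helper is purely for exposition):

    #include <stdint.h>

    /* Build a CR3 value from a 4 KB-aligned page-table base address and a
       PCID. Bits 11:0 hold the PCID; the higher bits hold the physical
       address of the top-level page table. */
    static inline uint64_t make_cr3(uint64_t pgd_phys, uint16_t pcid)
    {
        return (pgd_phys & ~0xFFFULL) | (pcid & 0xFFFu);
    }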
Now let us come to the Linux kernel. It supports the generic notion of ASIDs
(address space IDs), which are meant to be architecture independent. Note that
it is possible that an architecture does not even provide ASIDs.
In the specific case of Intel x86-64 architectures, an ASID is the same as a
PCID. This is how we align a software concept (ASID) with a hardware concept
(PCID). Given that the Linux kernel needs to run on a variety of machines and
all of them may not have support for so many PCIDs, it needs to be slightly
more conservative, and it needs to find a common denominator across all the
architectures that it is meant to run on. For the current kernel (v6.2), the
developers decided to support only 6 ASIDs, which they deemed to be enough.
This means that out of the 4095 available PCIDs on an Intel CPU, only 6 are
actually used.
Point 6.2.1
Kernel threads do not have separate page tables. A common kernel page
table is appended to all user-level page tables. At a high level, there is a
pointer to the kernel page table from every user-level page table. Recall
that the kernel and user virtual addresses differ only in their most significant
bit (MSB), and thus a pointer to the kernel-level page table needs to be
present at the highest level of the five-level composite page table.
Let us now do a case-by-case analysis. Assume that the kernel in the course
of execution tries to access the invalidated page – this will create a correctness
issue if the mapping is still there. Note that since we are in the lazy TLB
mode, the mapping is still valid in the TLB of the CPU on which the kernel
thread is executing. Hence, in theory, the kernel may access the user-level page
that is not valid at the moment. However, this cannot happen in the current
implementation of the kernel. This is because access to user-level pages does not
happen arbitrarily. Instead, such accesses happen via functions with well-defined
entry points in the kernel. Some examples of such functions are copy from user
and copy to user. At these points, special checks can be made to find out if the
pages that the kernel is trying to access are currently valid or not. If they are
not valid because another core has invalidated them, then an exception needs
to be thrown.
Next, assume that the kernel switches to another user process. In this case,
either we flush all the pages of the previous user process (solves the problem) or
if we are using ASIDs, then the pages remain but the current task’s ASID/PCID
changes. Now consider shared memory-based inter-process communication that
involves the invalidated page. This happens through well-defined entry points.
Here checks can be carried out – the invalidated page will thus not be accessed.
Finally, assume that the kernel switches back to a thread that belongs to
the same multithreaded user-level process. In this case, prior to doing so, the
kernel checks if the CPU is in the lazy TLB mode and if any TLB invalidations
have been deferred. If this is the case, then all such deferred invalidations are
completed immediately prior to switching from the kernel mode. This finishes
the work.
The sum total of this discussion is that to maintain TLB consistency, we do
not have to do it in mission mode. There is no need to immediately interrupt all
the other threads running on the other CPUs and invalidate some of their TLB
entries. Instead, this can be done lazily and opportunistically as and when there
is sufficient computational bandwidth available – critical high-priority processes
need not be interrupted for this purpose.
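The last case can be summarized by the following conceptual sketch (the real kernel tracks TLB generation numbers; the structure and helper names used here are hypothetical):

    struct cpu_tlb_state {
        int lazy;            /* this CPU is in the lazy TLB mode (kernel thread) */
        int flush_pending;   /* another CPU deferred a TLB invalidation to us    */
    };

    static void local_tlb_flush(void);  /* assumed helper: flushes this CPU's TLB */

    /* Called just before switching back to a user thread of the same process. */
    static void finish_lazy_tlb(struct cpu_tlb_state *ts)
    {
        if (ts->lazy && ts->flush_pending) {
            local_tlb_flush();          /* complete the deferred invalidations */
            ts->flush_pending = 0;
        }
        ts->lazy = 0;
    }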
Refer to Figure 6.10, which shows a NUMA machine where multiple chips
(groups of CPUs) are connected over a shared interconnect. They are typically
organized into clusters of chips/CPUs, and there is a notion of local memory
within a cluster, which is much faster than remote memory (present in another
cluster). We would thus like the data and code that is accessed within a cluster
to stay in the local memory. We need to minimize the number
of remote memory accesses as far as possible. This needs to be explicitly done to
guarantee the locality of data and ensure a lower average memory access time.
In the parlance of NUMA machines, each cluster of CPUs or chips is known as
a node. All the computing units (e.g. cores) within a node have roughly the
same access latency to local memory as well as remote memory. We thus need
to organize the physical address space hierarchically: local memory forms the
lowest level, and the next level comprises references to remote memory.
Zones
Given that the physical address space is not flat, there is a need to partition
it. Linux refers to each partition as a zone [Rapoport, 2019]. The aim is to
partition the set of physical pages (frames) in the physical address space into
different nonoverlapping sets.
Each such set is referred to as a zone. They are treated separately and dif-
ferently. This concept can easily be extended to also encompass frames that
are stored on different kinds of memory devices. We need to understand that
in modern systems, we may have memories of different types. For instance,
we could have regular DRAM memory, flash/NVMe drives, plug-and-play USB
memory, and so on. This is an extension of the NUMA concept where we have
different kinds of physical memories, and they clearly have different characteris-
tics with respect to the latency, throughput and power consumption. Hence, it
makes a lot of sense to partition the frames across the devices and assign each
group of frames (within a memory device) to a zone. Each zone can then be
managed efficiently and appropriately (according to the device that it is associ-
ated with). Memory-mapped I/O and pages reserved for communicating with
the DMA controller can also be brought within the ambit of such zones.
Listing 6.4 shows the details of the enumeration type zone type. It lists the
different types of zones that are normally supported in a regular kernel.
The first is ZONE DMA, which is a memory area that is reserved for physical
pages that are meant to be accessed by the DMA controller. It is a good idea to
partition the memory and create an exclusive region for the DMA controller. It
can then access all the pages within its zone freely, and we can ensure that data
in this zone is not cached. Otherwise, we will have a complex sequence of cache
evictions to maintain consistency with the DMA device. Hence, partitioning the
set of physical frames helps us clearly mark a part of the memory that needs to
remain uncached as is normally the case with DMA pages. This makes DMA
operations fast and reduces the number of cache invalidations and writebacks
substantially.
Next, we have ZONE NORMAL, which is for regular kernel and user pages.
Sometimes we may have a peculiar situation where the size of the physical
memory actually exceeds the total size of the virtual address space. This can
happen on some older 32-bit processors and also on some embedded systems
with narrow virtual addresses. In such special cases, we would like to have a
separate zone of the physical memory that keeps all the pages that are currently
not mapped to virtual addresses. This zone is known as ZONE HIGHMEM.
User data pages, anonymous pages (stack and heap), regions of memory used
by large applications, and regions created to handle large file-based applications
can all benefit from placing their pages in contiguous zones of physical memory.
For example, if we want to design a database’s data structures, then it is a good
idea to create a large folio of pages that are contiguous in physical memory. The
database code can lay out its data structures accordingly. Contiguity in physical
addresses ensures better prefetching performance. A hardware prefetcher can
predict the next frame very accurately. The other benefit is a natural alignment
with huge pages, which leads to reduced TLB miss rates and miss penalties. To
create such large contiguous regions in physical memory, pages have to be freely
movable – they cannot be pinned to physical addresses. If they are movable, then
pages can dynamically be consolidated at runtime and large holes – contiguous
regions of free pages – can be created. These holes can be used for subsequent
allocations. It is possible for one process to play spoilsport by pinning a page.
Most often these are kernel processes. These actions militate against the creation
of large contiguous physical memory regions. Hence, it is a good idea to group
all movable pages and assign them to a separate zone where no page can be
pinned. Linux defines such a special zone called ZONE MOVABLE that comprises
pages that can be easily moved or reclaimed by the kernel.
The next zone pertains to novel memory devices that cannot be directly man-
aged by conventional memory management mechanisms. This includes parts of
the physical address space stored on nonvolatile memory devices (NVMs), mem-
ory on graphics cards, Intel’s Optane memory (persistent memory) and other
novel memory devices. A dedicated zone called ZONE DEVICE is thus created
to encompass all these physical pages that are stored on a device that is not
conventional DRAM.
Such unconventional devices have many peculiar features. For example, they
can be removed at any point of time without prior notice. This means that no
copy of pages stored in this zone should be kept in regular DRAM – they will
become inconsistent. Page caching is therefore not allowed. This zone also
allows DMA controllers to directly access device memory. The CPU need not
be involved in such DMA transfers. If a page is in ZONE DEVICE, we can safely
assume that the device that hosts the pages will manage them.
It plays an important role while managing nonvolatile memory (NVM) de-
vices because now the hardware can manage the pages in NVMs directly. They
are all mapped to this zone and there is a notion of isolation between device
pages and regular memory pages. The key idea here is that device pages need to
be treated differently in comparison to regular pages stored on DRAM because
of device-specific idiosyncrasies.
Point 6.2.2
NVM devices are increasingly being used to enhance the capacity of
the total available memory. We need to bear in mind that, in terms of
performance, nonvolatile memory devices lie between hard disks and regular
DRAM. The latency of a hard disk is in milliseconds,
whereas the latency of nonvolatile memory is typically in microseconds
or in the 100s of nanoseconds range. The DRAM memory on the other
hand has a sub 100-ns latency. The advantage of nonvolatile memories
is that even if the power is switched off, the contents still remain in the
device (persistence). The other advantage is that it also doubles up as a
storage device, and there is no need to pay the penalty of page faults when
a new process starts or the system boots up. Given the
increasing use of nonvolatile memory in laptops, desktops and server-
    /* Normal pages */
    ZONE_NORMAL,
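For reference, the zone type enumeration that Listing 6.4 refers to looks roughly as follows (a condensed sketch based on include/linux/mmzone.h; the exact set of members is configuration dependent, and ZONE DMA32 has been omitted):

    enum zone_type {
        ZONE_DMA,        /* pages reserved for DMA-capable devices           */
        ZONE_NORMAL,     /* normal kernel and user pages                     */
        ZONE_HIGHMEM,    /* pages that are not permanently mapped (32-bit)   */
        ZONE_MOVABLE,    /* pages that can be migrated or reclaimed          */
        ZONE_DEVICE,     /* pages residing on device memory (NVM, GPUs, ...) */
        __MAX_NR_ZONES
    };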
Sections
Recall that in Listing 6.3, we had talked about converting page frame numbers
to page structures and vice versa. We had discussed the details of a simple linear
layout of page structures and then a more complicated hierarchical layout that
divides the zones into sections.
It is necessary to take a second look at this concept now (refer to Figure 6.11).
To manage all of this memory efficiently, it is sometimes necessary to divide it
into sections and create a 2-level hierarchical structure. The first reason
is that we can efficiently manage the list of free frames within a section because
we use smaller data structures. Second, sometimes zones can be noncontiguous.
It is thus a good idea to break a noncontiguous zone into a set of sections,
where each section is a contiguous chunk of physical memory. Finally, sometimes
there may be intra-zone heterogeneity: the latencies of different memory regions
within a zone may differ slightly, or some part of the zone may be considered
volatile, especially if the underlying device tends to be frequently removed.
Given such intra-zone heterogeneity, it is a good idea to partition a zone
into sections such that different sections can be treated differently by the kernel
and respective memory management routines. Next, recall that the code in
Listing 6.3 showed that each section has its mem map that stores the mapping
between page frame numbers (pfns) and struct pages. This map is used to
convert a pfn to a struct page.
Figure 6.11: The one-to-one mapping between page frame numbers (PFNs) and struct page structures within a zone. (A related figure shows the mapping field of a struct page pointing to the associated anon vma.)
associated with two vmas across two processes (the parent and the child). The
relationship is as follows: 2 vma ↔ 1 anon vma (refer to Figure 6.13(a)).
Now, consider another case, where a parent process has forked a child process.
In this case, they have their separate vmas that point to
the same anon vma. This is the one that the shared pages also point to. Now,
assume a situation where the child process writes to a page that is shared with
the parent. This means that a new copy of the page has to be created for the
child due to the copy-on-write mechanism. This new page needs to point to
an anon vma, which clearly cannot be the one that the previously shared page
was pointing to. It needs to point to a new anon vma that corresponds to pages
exclusive to the child. There is an important question that needs to be answered
here. What happens to the child’s vma? Assume it had 1024 pages, and the
write access was made to the 500th page. Do we then split it into three parts:
0-499, 500, 501-1023? The first and last chunks of pages are unmodified up till
now. However, we made a modification in the middle, i.e., to the 500th page.
This page is now pointing to a different anon vma.
Figure 6.13: (a) Many-to-one mapping (b) One-to-many mapping
Splitting the vma is not a good idea. This is because a lot of pages in a
vma may see write accesses when they are in a copy-on-write (COW) mode. We
cannot keep on splitting the vma into smaller and smaller chunks. This is a lot
of work and will prohibitively increase the number of vma structures that need
to be maintained. Hence, as usual, the best solution is to do nothing, i.e., not
split the vma. Instead, we maintain the vma as it is but assume that all the
pages in its range may not be mapped to the same anon vma. We thus have the
following relationship: 1 vma ↔ 2 anon vma (refer to Figure 6.13(b)). Recall
that we had earlier shown a case with the opposite relationship: 2 vma ↔ 1
anon vma (Figure 6.13(a)).
Let us now summarize our learnings.
Point 6.3.1
This is what we have understood about the relationship between a vma
and anon vma.
• For a given virtual address region, every process has its own private
vma.
We thus observe that a complex relationship between the anon vma and vma
has developed at this point (refer to Figure 6.14 for an example). Maintaining
this information and minimizing runtime updates is not easy. There is a classical
time and space trade-off at play here. If we want to minimize
time, we should increase space. Moreover, we desire a data structure that
captures the dynamic nature of the situation. The events of interest that we
have identified up till now are as follows: a fork operation, a write to a COW
page, splitting or merging a vma and killing a process.
Let us outline our requirements using a few 1-line principles.
1. Every anon vma should know which vma structures it is associated with.
2. Every vma should know which anon vma structures it is associated with.
3. The question that we need to answer now is whether these two structures should be linked to each other directly or via a level of indirection.
We shall show the C code of an anon vma after we describe another structure
known as the anon vma chain because both are quite intimately connected. It
is not possible to completely explain the former without explaining the latter.
Figure 6.15: The relationship between vma, anon vma and anon vma chain
(avc). A dashed arrow indicates that anon vma does not hold a direct pointer
to the avc, but holds a reference to a red-black tree that in turn has a pointer
to the avc.
We can think of the anon vma chain structure as a link between a vma and
an anon vma. We had aptly referred to it as a level of indirection. An advantage
of this structure is that we can link it to other anon vma chain nodes via the
same vma list (refer to Figure 6.16). All of them correspond to the same vma.
They are thus stored in a regular doubly-linked list. We shall see later that for
faster access, it is also necessary to store anon vmas in a red-black tree. Hence,
an anon vma chain points to a red-black tree node.
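For reference, the structure that implements this level of indirection looks roughly as follows (a condensed sketch based on include/linux/rmap.h; locking annotations and interval-tree bookkeeping fields have been elided):

    struct anon_vma_chain {
        struct vm_area_struct *vma;   /* the vma this avc belongs to            */
        struct anon_vma *anon_vma;    /* the anon_vma that the vma is linked to */
        struct list_head same_vma;    /* list of all the avcs of the same vma   */
        struct rb_node rb;            /* node in the anon_vma's red-black tree  */
    };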
Figure 6.16: anon vma chain nodes connected together as a list (all correspond-
ing to the same vma)
Now, let us look at the code for the anon vma in Listing 6.8.
Listing 6.8: anon vma
source : include/linux/rmap.h
struct anon_vma {
    /* Arrange all the anon_vmas corresponding to a vma hierarchically.
       The root node holds the lock for accessing the chain of anon_vmas. */
    struct anon_vma *root;
    struct anon_vma *parent;

    /* Read-write semaphore used to lock the chain of avcs */
    struct rw_semaphore rwsem;

    /* Red-black (interval) tree of the avcs that point to this anon_vma */
    struct rb_root_cached rb_root;
};
We don’t actually store any state. We just store pointers to other data
structures. All that we want is that all the pages in a virtual memory region with
similar access policies point to the same anon vma. Now, given the anon vma,
we need to quickly access all the vma structures that may contain the page. This
is where we use a red-black tree (rb root) that actually stores three pieces of
information: anon vma chain nodes (the value), and the start and end virtual
addresses of the associated vma (the key). The red-black tree can quickly be
used to find all the vma structures that contain a page.
Additionally, we organize all the anon vma structures as a tree, where each
node also has a direct pointer to the root. The root node stores a read-write
semaphore that is used to lock the list of anon vma chain nodes such that
changes can be made (add/delete/etc.).
Within the vma, the exclusive anon vma is stored in a field named anon vma,
and the list of anon vma chain nodes is represented by its namesake field,
anon vma chain.
Figure 6.17: Updated relationship between vma, anon vma and avc
Next, it is possible that because of fork operations, many other vma struc-
tures (across processes) point to the same anon vma via avcs – the anon vma
that is associated with the vma of the parent process. Recall that we had cre-
ated a linked list of anon vma chain nodes (avcs) precisely for this purpose.
Figure 6.18 shows an example where an anon vma is associated with multiple
vmas across processes.
Figure 6.18: Example of a scenario with multiple processes where an anon vma
is associated with multiple vmas
Figure 6.19: Reverse map structures after a fork operation
Let us now consider the case of a fork operation. The reverse map (rmap)
structures are shown in Figure 6.19 for both the parent and child processes.
The parent process in this case has one vma and an associated anon vma. The
fork operation starts out by creating a copy of all the rmap structures of the
parent. The child thus gets an avc that links its vma to the parent’s anon vma.
This ensures that all shared pages point to the same anon vma and the structure
is accessible from both the child and parent processes.
The child process also has its own private anon vma that is pointed to by its
vma. This is for pages that are exclusive to it and not shared with its parent
process. Let us now look at the classical reverse mapping problem. Up till now,
we have discussed a one-way mapping from vmas to anon vmas. But we have
not discussed how we can locate all vmas of different processes given a page.
This is precisely the reverse mapping problem that we shall solve next.
Given a page frame number, we can locate its associated struct page, and
then using its mapping field, we can retrieve a pointer to the anon vma. The next
problem is to locate all the avcs that point to the anon vma. Every anon vma
stores a red-black tree that stores the list of avcs that point to it. Specifically,
each node of the red-black tree stores a pointer to an avc (value) and the range
of virtual addresses it covers (key). The latter is the key used to access the
red-black tree and the former is the value.
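The walk can be summarized by the following toy model (simplified stand-in types, not the kernel's; in particular, the kernel stores the avcs in an interval tree keyed by virtual address ranges rather than in a plain list):

    struct vma_toy;                                /* stands in for a vma       */
    struct avc_toy {                               /* stands in for an avc      */
        struct vma_toy *vma;
        struct avc_toy *next_in_tree;
    };
    struct anon_vma_toy { struct avc_toy *tree; }; /* stands in for an anon_vma */
    struct page_toy { struct anon_vma_toy *mapping; };

    /* Visit every vma that may map the given page: follow page->mapping to
       the anon_vma and enumerate the avcs registered with it. */
    static void rmap_walk_toy(struct page_toy *pg, void (*visit)(struct vma_toy *))
    {
        struct avc_toy *a;
        for (a = pg->mapping->tree; a != NULL; a = a->next_in_tree)
            visit(a->vma);
    }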
The child process now has multiple avcs. They are connected to each other
and the child process’s vma using a doubly linked list.
Let us now consider the case of a shared page that points to the anon vma
of the parent process (refer to Figure 6.20). After a fork operation, this page is
stored in copy-on-write (COW) mode. Assume that the child process writes to
the page. In this case, a new copy of the page needs to be made and attached
to the child process. This is shown in Figure 6.21.
The new page now points to the private anon vma of the child process. It is
now the exclusive property of the child process.
Next, assume that the child process is forked. In this case, the rmap struc-
tures are replicated, and the new grandchild process is also given its private vma
and anon vma (refer to Figure 6.22). In this case, we create two new avcs. One
avc points to the anon vma of the child process and the other avc points to the
anon vma of the original parent process. We now have two red-black trees: one
Figure 6.20: A page pointing to the parent’s anon vma
Figure 6.21: A new page pointing to the child’s anon vma
corresponds to the parent process’s anon vma and the other corresponds to the
child process’s anon vma. The avcs of each process are also nicely linked using
a doubly linked list, which also includes its vma.
After repeated fork operations, it is possible that a lot of avcs and anon vmas
get created. This can lead to a storage space blowup. Modern kernels optimize
this. Consider the anon vma (+ associated avc) that is created for a child process
such that pages that are exclusive to the child can point to it. In some cases,
instead of doing this, an existing anon vma along with its avc can be reused.
This introduces an additional level of complexity; however, the space savings
justify this design decision to some extent.
Figure 6.22: The structure of the rmap structures after a second fork operation
(fork of the child)
• The algorithm divides pages into different generations based on the recency
of the last access. If a page is accessed, there is a fast algorithm to upgrade
it to the latest generation.
• The algorithm reclaims pages in the background by swapping them out to
the disk. It swaps pages that belong to the oldest generations.
• It ages the pages very intelligently. This is workload dependent.
• It is tailored to running large workloads and integrates well with the notion
of folios.
Listing 6.10: struct lruvec
source: include/linux/mmzone.h
struct lruvec {
    /* describes the physical memory layout of the NUMA node */
    struct pglist_data *pgdat;

    /* number of refaults */
    unsigned long refaults[ANON_AND_FILE];

    /* LRU state */
    struct lru_gen_struct lrugen;
    struct lru_gen_mm_state mm_state;
};
Linux uses the lruvec structure to store important LRU replacement-related
information. Its code is shown in Listing 6.10. The first key field is a pointer
to a pglist data structure that stores the details of the zones in the current
NUMA node (discussed in Section 6.2.5).
Next, we store the number of refaults for anonymous and file-backed pages. A
refault is a page access after it has been evicted. We clearly need to minimize the
number of refaults. If it is high, it means that the page replacement and eviction
algorithms are suboptimal – they evict pages that have a high probability of
being accessed in the near future.
The next two fields lrugen and mm state store important LRU-related state.
lrugen is of type lru gen struct (shown in Listing 6.11). mm state is of type
lru gen mm state (shown in Listing 6.12).
Listing 6.11: lru gen struct
source: include/linux/mmzone.h
struct lru_gen_struct {
    unsigned long max_seq;                  /* youngest generation           */
    unsigned long min_seq[ANON_AND_FILE];   /* oldest generations            */
    unsigned long timestamps[MAX_NR_GENS];  /* birth time of each generation */
    /* 3D array of lists */
    struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
};
A lru gen struct structure stores a set of sequence numbers: maximum and
minimum (for anonymous and file-backed pages, resp.), an array of timestamps
(one per generation) and a 3D array of linked lists. This array will prove to
be very useful very soon. It is indexed by the generation number, type of page
(anonymous or file) and zone number. Each entry in this 3D array is a linked
list whose elements are struct pages. The idea is to link all the pages of the
same type that belong to the same generation in a single linked list. We can
traverse these lists to find pages to evict based on additional criteria.
Let us next discuss the code of lru gen mm state (see Listing 6.12). This
structure stores the current state of a page walk – a traversal of all the pages
to find pages that should be evicted and written to secondary storage (swapped
out). At any point in time, multiple threads may be performing page walks
(their count is stored in the nr walkers variable).
The field seq is the current sequence number that is being considered in
the page walk process. Each sequence number corresponds to a generation –
the lower the sequence number, the earlier the generation. The head and tail pointers
point to consecutive elements of a linked list of mm struct structures. In a
typical page walk, we traverse all the pages that satisfy certain criteria of a
given process (its associated mm struct), then we move to the next process (its
mm struct), and so on. This process is easily realized by storing a linked list of
mm struct structures. The tail pointer points to the mm struct structure that
was just processed (its pages traversed). The head pointer points to the mm struct
that needs to be processed next.
Finally, we use an array of Bloom filters to speed up the page walk process
(we shall see how later). Whenever the term Bloom filter comes up, the key
property to keep in mind is that a Bloom filter can never produce a false
negative, but it can produce a false positive.
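A condensed sketch of lru gen mm state, consistent with the description above, is shown below (based on kernel v6.1/v6.2; member names and their order are approximate, and debugging/statistics fields are omitted):

    struct lru_gen_mm_state {
        unsigned long seq;          /* sequence number of the current walk     */
        struct list_head *head;     /* the mm_struct to be processed next      */
        struct list_head *tail;     /* the mm_struct that was just processed   */
        unsigned long *filters[2];  /* two Bloom filters, used alternately     */
        int nr_walkers;             /* number of concurrent page-table walkers */
    };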
Most modern processors provide such a facility in hardware: they can automatically set this flag (the accessed bit) in the page table entry. A simple scan of the page table
can yield the list of pages that have been recently accessed (after the last time
that the bits were cleared). If hardware support is not available, then there is
a need to mark the pages as inaccessible. This will lead to a page fault, when
the page is subsequently accessed. This is not a full-scale (hard) page fault,
where the contents of the page need to be read from an external storage device.
It is instead, a soft page fault, where after recording the fact that there was a
page fault, its access permissions are changed – the page is made accessible once
again. We basically deliberately induce fake page faults to record page accesses.
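The same trick can be demonstrated in user space with mprotect and a SIGSEGV handler: revoke access to a region, and the first touch of each page raises a fault that can be recorded before the permission is restored. This is only an analogy for the kernel's soft page faults (the kernel does the equivalent bookkeeping internally); error handling is omitted for brevity.

    #include <signal.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static size_t page_sz;

    /* Record the access and make the page accessible again. */
    static void handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        char *page = (char *)((unsigned long)si->si_addr & ~(page_sz - 1));
        /* ... record the access to this page here ... */
        mprotect(page, page_sz, PROT_READ | PROT_WRITE);
    }

    int main(void)
    {
        page_sz = (size_t)sysconf(_SC_PAGESIZE);
        char *region = mmap(NULL, 4 * page_sz, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa = { .sa_flags = SA_SIGINFO, .sa_sigaction = handler };
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        mprotect(region, 4 * page_sz, PROT_NONE);  /* clear the access state    */
        region[0] = 1;   /* triggers one soft fault, which gets recorded        */
        region[0] = 2;   /* no further fault                                    */
        return 0;
    }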
can have plain old-fashioned eviction, or we can reclaim pages from specialized
page buffers. The sizes of the latter can be adjusted dynamically to release
pages and make them available to other processes.
Let us first understand whether we need to run aging or not in the first
place. The logic is shown in Listing 6.14. The aim is to maintain the following
relationship: min seq + MIN NR GENS == max seq. This means that we wish
to ideally maintain MIN NR GENS+1 sequence numbers (generations). The check
in Line 2 tests whether there are too few generations; if so, there is definitely a
need to run the aging algorithm. On similar lines, if the check in Line 7 is true,
then it means that there are too many generations. There is no need to run the
aging algorithm.
Next, let us consider the corner case when there is equality. First, let us
define what it means for a page to be young or old. A page is said to be young
if its associated sequence number is equal to max seq. This means that it belongs
to the latest generation. On similar lines, a page is said to be old if its sequence
number follows this relationship: seq + MIN NR GENS == max seq. Given that
we would ideally like to maintain the number of generations at MIN NR GENS+1,
we track two important pieces of information – the number of young and old
pages, respectively.
The first check young × MIN NR GENS > total ensures that if there are too
many young pages, there is a need to run aging. The reason is obvious. We
want to maintain a balance between the young and not-so-young pages. Let
us consider the next inequality: old × (MIN NR GENS + 2) < total. This clearly
says that if the number of old pages is lower than what is expected (too few),
then also we need to age. An astute reader may notice that here we add
an offset of 2, whereas we did not add such an offset in the case of young
pages. There is an interesting explanation here, which will help us appreciate
the nuances involved in designing practical systems.
As mentioned before, we wish to ideally maintain only MIN NR GENS+1 gen-
erations. There is a need to provide a small safety margin here for young pages
because we do not want to run aging very frequently. Hence, the multiplier is
set to MIN NR GENS. In the case of old pages, the safety margin works in the
reverse direction. We can allow it to go as low as total / (MIN NR GENS + 2).
This is because we do not want to age too frequently, and in this case aging
will cause old pages to get evicted. We would also like to reduce unnecessary
eviction. Hence, we set the safety margin differently in this case.
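The decision logic described above can be summarized as follows (a sketch that follows the text; it mirrors the structure of the check in Listing 6.14 but is not a verbatim copy, and the value of MIN NR GENS is the assumed kernel default):

    #define MIN_NR_GENS 2   /* assumed default value */

    /* Should the aging algorithm be run? */
    static int need_aging(unsigned long min_seq, unsigned long max_seq,
                          unsigned long young, unsigned long old,
                          unsigned long total)
    {
        if (min_seq + MIN_NR_GENS > max_seq)
            return 1;                          /* too few generations: age     */
        if (min_seq + MIN_NR_GENS < max_seq)
            return 0;                          /* too many generations: do not */

        /* Exactly MIN_NR_GENS + 1 generations: look at the young/old balance. */
        if (young * MIN_NR_GENS > total)
            return 1;                          /* too many young pages         */
        if (old * (MIN_NR_GENS + 2) < total)
            return 1;                          /* too few old pages            */
        return 0;
    }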
actually old. To eliminate this possibility, there is therefore a need to scan all
the PMD’s constituent pages and check if the pages were recently accessed or
not. The young/old status of the PMD can then be accurately determined.
This entails extra work. However, if a PMD address is not found, then it means
that it predominantly contains old pages for sure (the PMD is old). Given
that Bloom filters do not produce false negatives, we can skip such PMDs with
certainty because they are old.
When we walk through the page tables, the idea is to skip unaccessed pages.
To accelerate this process, we can skip full PMDs (512 pages) if they are not
found in the Bloom filter. For the rest of the young pages, we clear the accessed
bit and set the generation of the page or folio to max seq.
Both the arguments are correct in different settings. Hence, Linux pro-
vides both the options. If nothing is specified, then the first argument holds –
file-backed pages. However, if there is more information and the value of the
swappiness is in the range 1-200, then anonymous pages are slightly depriori-
tized as we shall see.
Let us next compare the minimum sequence numbers for both the types.
If the anonymous pages have a lower generation, then it means that they are
more aged, and thus should be evicted. However, if the reverse is the case –
file-backed pages have a lower generation (sequence number) – then we don’t
evict them outright (as per the second aforementioned argument).
Listing 6.15: The algorithm that chooses the type of the pages/folios to evict
source : mm/vmscan.c
1 if (! swappiness )
2 type = LRU_GEN_FILE ;
3 else if ( min_seq [ LRU_GEN_ANON ] < min_seq [ LRU_GEN_FILE ])
4 type = LRU_GEN_ANON ;
5 else if ( swappiness == 1)
6 type = LRU_GEN_FILE ;
7 else if ( swappiness == 200)
8 type = LRU_GEN_ANON ;
9 else if (!( sc -> gfp_mask & __GFP_IO ) ) /* I / O is not allowed */
10 type = LRU_GEN_FILE ;
11 else
12 type = get_type_to_scan ( lruvec , swappiness , & tier ) ;
least). Next, we define two ctrl pos variables: sp (set point) and pv (process
variable). The goal of any such algorithm based on control theory is to set the
process variable equal to the set point (a pre-determined state of the system).
We shall stick to this generic terminology because such an algorithm is valuable
in many scenarios, not just in finding the type of pages to evict. It tries to bring
two quantities closer in the real world. Hence, we would like to explain it in
general terms.
The parameter gain plays an important role here. For anonymous pages
it is defined as the swappiness (the higher it is, the more anonymous pages are
evicted), and for file-backed pages it is 200 − swappiness. The gain is indirectly
a measure of how aggressively we want to evict a given type of pages. If the
swappiness approaches 200, then we wish to evict anon pages, and if it
approaches 1, we wish to evict file pages. It quantifies the preference.
Next, we initialize the sp and pv variables. The set point is set equal to the
eviction statistics ⟨#refaults, #evictions, gain⟩ of anon pages. The process
variable is set to the eviction statistics of file pages. We now need to compare
sp and pv. Note that we are treating sp (anon) as the reference point here and
trying to ensure that, in some sense, pv (file) approaches sp.
This will balance both of them, and we will be equally fair to both types.
Point 6.3.2
In any control-theoretic algorithm, our main aim is to bring pv as close
to sp as possible. In this case also we wish to do so and in the process
ensure that both file and anon pages are treated fairly.
\frac{pv.refaulted}{pv.total \times pv.gain} \;\leq\; \frac{sp.refaulted}{sp.total \times sp.gain} \qquad (6.1)
This rationale is captured in Equation 6.1. If the normalized refault rate of
pv divided by its gain is less than the corresponding quantity of the set point,
then pv wins; in other words, we choose pages to evict from the pv
class (the file class).
\frac{pv.refaulted}{pv.total \times pv.gain} \;\leq\; \frac{sp.refaulted + \alpha}{(sp.total + \beta) \times sp.gain} \qquad (6.2)
Equation 6.1 is an idealized equation. In practice, Linux adds a couple
of constants to the numerator and the denominator to incorporate practical
considerations and maximize the performance of real workloads. Hence, the
exact version of the formula implemented in the Linux kernel v6.2 is shown in
Equation 6.2. The designers of Linux set α = 1 and β = 64. Note that these
constants are based on experimental results, and it is hard to explain them
logically.
Linux has another small trick. If the absolute number of file refaults is low
(< 64), then it chooses to evict file-backed pages. The reason is that most likely
the application either does not access a file or the file accessed is very small.
On the other hand, every application will have a large amount of anonymous
memory comprising its stack and heap sections. Even if it is a small application,
the code to initialize it is sizable. Hence, it is a good idea to evict anon
pages only if the file refault count is above a certain threshold.
Now, we clearly have a winner. If the inequality in Equation 6.2 is true,
then pv is chosen (file), else sp is chosen (anon).
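Putting Equation 6.2 and the file-refault threshold together, the decision can be written as follows (a sketch that follows the text; the kernel's implementation in mm/vmscan.c is similar in spirit but not identical):

    struct ctrl_pos_sketch {
        unsigned long refaulted;   /* number of refaults                          */
        unsigned long total;       /* number of evictions (plus related activity) */
        unsigned long gain;        /* swappiness for anon, 200 - swappiness
                                      for file                                    */
    };

    /* Return 1 if the pv (file) class should be evicted, with alpha = 1 and
       beta = 64 as in Equation 6.2. */
    static int evict_pv(const struct ctrl_pos_sketch *sp,
                        const struct ctrl_pos_sketch *pv)
    {
        /* Very few file refaults: evict file pages anyway. */
        if (pv->refaulted < 64)
            return 1;
        /* Cross-multiplied form of Equation 6.2 (avoids integer division). */
        return pv->refaulted * (sp->total + 64) * sp->gain <=
               (sp->refaulted + 1) * pv->total * pv->gain;
    }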
Assume that the type of folios chosen for eviction is T, and the other type
(not chosen for eviction) is T ′ . Clearly, the choice was made based on average
statistics. Let us now do a tier-wise analysis, and compare their normalized re-
fault rates by taking the gain into account tier-wise using Equation 6.2. Instead
of comparing average statistics, we perform the same comparison tier-wise. We
may find that till a certain tier k, folios of type T need to be evicted. However,
for tier k + 1 the reverse may be the case. It means that folios of type T ′ need
to be evicted as per the logic in Equation 6.2. If no such k exists, then nothing
needs to be done. Let us consider the case, when we find such a value of k.
In this case, the folios in tiers [0, k] should be considered for eviction as long
as they are not pinned, being written back, involved in race conditions, etc.
However, the folios in tiers k + 1 and beyond, should be given a second chance
because they have seen more references. We already know that folios in tier
k + 1 should not be evicted because if we compare their statistics with those of
the corresponding folios in type T , it is clear that by Equation 6.2 they should
remain in memory. Instead of doing the same computation for the rest of the
folios, we can simply assume that they also need to be kept in memory for the
time being. This is done in the interest of time and there is a high probability of
this being a good decision. Note that higher-numbered tiers correspond to
exponentially more references. Hence, we increment the generations of all folios
in the tier range [k + 1, MAX TIERS]. They do not belong to the oldest generation
anymore. They enjoy a second chance. Once a folio is promoted to a new
generation, we can optionally clear its reference count. The philosophy here is
that the folio gained because of its high reference count once. Let it not benefit
once again. Let it start afresh after getting promoted to the higher generation.
This is depicted pictorially in Figure 6.23.
Eviction of a Folio
Now, we finally have a list of folios that can be evicted. Note that some folios
did get rescued and were given a second chance, though.
The process of eviction can now be started folio after folio. Sometimes it
is necessary to insert short delays in this process, particularly if there are high
priority tasks that need to access storage devices or some pages in the folio are
being swapped out. This is not a one-shot process; it is punctuated with periods
of activity initiated by other processes.
Once a folio starts getting evicted, we can do some additional bookkeeping.
We can scan proximate (nearby) virtual addresses. The idea here is that pro-
grams tend to exhibit spatial locality. If one folio was found to be old, then
pages in the same vicinity should also be scrutinized. We may find many more
candidates that can possibly be evicted in the near future. For such candidate
pages, we can mark them as old (clear the accessed bit) and also note down
PMD (Page Middle Directory) entries that mostly comprise young pages.
These can be added to a Bloom filter, which will prove to be very useful later
when we discuss the page walk process. We can also slightly reorganize the
folios here. If a folio is very large, it can be split into several smaller folios.
Point 6.3.3
Such additional bookkeeping actions that are piggybacked on regular
operations like a folio eviction are a common pattern in modern operating
systems. Instead of operating on large data structures like the page table
in one go, it is much better to slightly burden each operation with a small
amount of additional bookkeeping work. For example, folio eviction is
not on the critical path most of the time, and thus we can afford to do
some extra work.
Once the extra work of bookkeeping is done, the folio can be written back
to the storage device. This would involve clearing its kernel state, freeing all
the buffers that were storing its data (like the page cache for file-backed pages),
flushing the relevant entries in the TLB and finally writing the folio back.
We enter PMD addresses (2nd lowest level of the page table) in a Bloom filter
if they predominantly contain young pages. Now, we know that in a Bloom filter,
there is no chance of a false negative. This means that if it says that a given
PMD address is not there, it is not there for sure. Now, when we walk the page
table, we query the Bloom filter to check if a given PMD address is present. If
the answer is in the negative, then we know that the answer is correct, and the
PMD is not there because it
contains mostly old pages. Given that there is no possibility of an error, we can
confidently skip scanning all the constituent page table entries that are covered
by the PMD entry. This will save us a lot of time.
Let us consider the other case, when we find a PMD address in the Bloom
filter. It is most likely dominated by young pages. The reason we use the term
“most likely” is that Bloom filters can produce false positives. We scan
the pages in the PMD region – either all or a subset of them at a time based
on performance considerations. This process of looking around marks young
folios as old on the lines of classic clock-based page replacement algorithms.
Moreover, note that when a folio is marked, all its constituent pages are also
marked. At this point, we can do some additional things. If we find a PMD
region that mostly comprises young pages, then the PMD address can be added
to the Bloom filter. Furthermore, young folios in this region can be promoted to
the latest generation – their generation/sequence number can be set to max seq.
This is because they are themselves young, and they also lie in a region that
mostly comprises young pages. We can use spatial and temporal locality based
arguments to justify this choice.
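A minimal Bloom filter along these lines is sketched below (a generic illustration, not the kernel's implementation). Two hash functions set and test two bits per key; if either bit is clear, the key was certainly never added (no false negatives), whereas two coincidentally set bits can yield a false positive.

    #include <stdint.h>

    #define BLOOM_BITS (1u << 15)                 /* 32768 bits */

    struct bloom {
        uint64_t bits[BLOOM_BITS / 64];
    };

    /* Two cheap multiplicative hash functions (illustrative only). */
    static uint32_t h1(uint64_t key) { return (uint32_t)((key * 0x9E3779B97F4A7C15ull) >> 49); }
    static uint32_t h2(uint64_t key) { return (uint32_t)((key * 0xC2B2AE3D27D4EB4Full) >> 49); }

    static void bloom_add(struct bloom *b, uint64_t pmd_addr)
    {
        uint32_t i = h1(pmd_addr), j = h2(pmd_addr);
        b->bits[i / 64] |= 1ull << (i % 64);
        b->bits[j / 64] |= 1ull << (j % 64);
    }

    static int bloom_maybe_contains(const struct bloom *b, uint64_t pmd_addr)
    {
        uint32_t i = h1(pmd_addr), j = h2(pmd_addr);
        return ((b->bits[i / 64] >> (i % 64)) & 1) &&
               ((b->bits[j / 64] >> (j % 64)) & 1);
    }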
6.3.3 Thrashing
Your author is pretty sure that everybody is guilty of the following performance
crime. The user boots the machine and tries to check her email. She finds
it to be very slow because the system is booting up and all the pages of the
email client are not in memory. She grows impatient, and tries to start the web
browser as well. Even that is slow. She grows even more impatient and tries to
write a document using MS Word. Things just keep getting slower. Ultimately,
she gives up and waits. After a minute or two, all the applications come up and
the system stabilizes. Sometimes if she is unlucky, the system crashes.
What exactly is happening here? Let us look at it from the point of view of
paging. Loading the pages for the first time into memory from a storage device
such as a hard disk or even a flash drive takes time. Storage is, after all, several
orders of magnitude slower than main memory. During this time, if another application
is started, its pages also start getting loaded. This reduces the available bandwidth to
the storage device and both applications get slowed down. However, this is
not the only problem. If these are large programs, whose working set (refer
to Section 6.1.3) is close to the size of main memory, then they need to evict
each other’s pages. As a result, when we start a new application it evicts pages
of the applications that are already running. Then, when there is a context
switch, existing applications stall because crucial pages from their working set
were evicted. They suffer from page faults, and their pages are then fetched
back from the storage device. However, this has the same effect again. These pages displace the
pages of other applications. This cycle continues. This phenomenon is known
as thrashing. A system goes into thrashing when there are too many applications
running at the same time and most of them require a large amount of memory.
They end up evicting pages from each other’s working sets, which just increases
the page fault rate without any beneficial outcome.
It turns out that things can get even worse. The performance counters
detect that there is low CPU activity. This is because most of the time is
spent servicing page faults. As a result, the scheduler tries to schedule even
more applications to increase the CPU load. This increases the thrashing even
further. This can lead to a vicious cycle, which is why thrashing needs to be
detected and avoided at all costs.
Linux has a pretty direct solution to stop thrashing. It tries to keep the
working set of an application in memory. This means that once a page is brought
in, it is not evicted very easily. The page that is brought in (upon a page fault)
is most likely a part of the working set. Hence, it makes little sense to evict
it. The MGLRU algorithm already ensures this to some extent. A page that is
brought into main memory has the latest generation. It takes time for it to age
and be a part of the oldest generation and become eligible for eviction. However,
when there are a lot of applications, the code in Listing 6.14 can trigger the aging
process relatively quickly because we will just have a lot of young pages. This
is not a bad thing when there is no thrashing. We are basically weeding out old
pages. However, when thrashing sets in, such mechanisms can behave in erratic
ways.
There is thus a need for a master control. The eviction algorithm simply
does not allow a page to be evicted if it was brought into memory in the last N
ms. In most practical implementations, N = 1000. This means that every page
is kept in memory for at least 1 second. This ensures that evicting pages in the
working set of any process is difficult. Thrashing can be effectively prevented
in this manner.
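Conceptually, the check is as simple as the following sketch (the names are illustrative; this resembles the minimum time-to-live that MGLRU can enforce on the youngest generations):

    /* Skip folios brought into memory less than min_residency_ms ago. */
    static int too_young_to_evict(unsigned long now_ms,
                                  unsigned long folio_birth_ms,
                                  unsigned long min_residency_ms /* e.g., 1000 */)
    {
        return (now_ms - folio_birth_ms) < min_residency_ms;
    }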
Figure 6.24: Allocating a 20 KB block in the buddy system by repeatedly splitting a free region into halves (64 KB, then 32 KB); a subsequent 64 KB request is satisfied by the remaining 64 KB block, which then becomes full
Now we can clearly see that 20 KB is between two powers of two: 16 KB and
32 KB. Hence, we take the leftmost 32 KB region and out of that we allocate
20 KB to the current request. We basically split a large free region into two
equal-sized smaller regions until the request lies between the region size and the
region size divided by two. We are basically overlaying a binary tree on top of
a linear array of pages.
If we traverse the leaves of this buddy tree from left to right, then they
essentially form a partition of the single large region. An allocation can only be
made at the leaves. If the request size is less than half the size of a leaf node
that is unallocated, then we split it into two equal-sized regions (contiguous in
memory), and continue to do so until we can just about fit the request. Note
that throughout this process, the size of each subregion is still a power of 2.
Now assume that after some time, we get a request for a 64 KB block of
memory. Then as shown in the second part of Figure 6.24, we allocate the
remaining 64 KB region (right child of the parent) to the request.
Figure 6.25: Freeing the 20 KB region allocated earlier
Let us now free the 20 KB region that was allocated earlier (see Figure 6.25).
In this case, we will have two 32 KB regions that are free and next to each other
(they are siblings in the tree). There is no reason to have two free regions at
the same level. Instead, we can get rid of them and just keep the parent, whose
size is 64 KB. We are essentially merging free regions (holes) and creating a
larger free region. In other words, we can say that if both the children of a
parent node are free (unallocated), they should be removed, and we should only
have the parent node that coalesces the full region. Let us now look at the
implementation. We refer to the region represented by each node in the buddy
tree as a block.
Implementation
Let us look at the implementation of the buddy allocator by revisiting the
free area array in struct zone (refer to Section 6.2.5). We shall define the
order of a node in the buddy tree. The order of a leaf node that corresponds
to the smallest possible region – one page – is 0. Its parent has order 1. The
order keeps increasing by 1 till we reach the root. Let us now represent the tree
as an array of lists: one list per order. All the nodes of the tree (of the same
order) are stored one after the other (left to right) in an order-specific list. A
node represents an aggregate page, which stores a block of memory depending
upon the order. Thus, we can say that each linked list is a list of pages, where
each page is actually an aggregate page that may point to N contiguous 4 KB
pages, where N is a power of 2.
The buddy tree is thus represented by an array of linked lists – struct
free area free area[MAX ORDER]. Refer to Listing 6.16, where each struct
free area is a linked list of nodes (of the same order). The root’s order is
limited to MAX ORDER - 1. In each free area structure, the member nr free
refers to the number of free blocks (=number of pages in the associated linked
list).
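The free area structure itself is quite small; a sketch consistent with include/linux/mmzone.h is shown below (MIGRATE TYPES is the number of migration types, which the next paragraph discusses):

    struct free_area {
        struct list_head free_list[MIGRATE_TYPES]; /* one list per migration type */
        unsigned long    nr_free;                  /* number of free blocks       */
    };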
There is a subtle twist involved here. We actually have multiple linked
lists – one for each migration type. The Linux kernel classifies pages based on
their migration type, i.e., whether they can move once they have been allocated.
One class of pages cannot move after allocation; then there are pages that can
freely move around physical memory, pages that can be reclaimed, and pages
reserved for specific purposes. These are different
examples of migration types. We maintain separate lists for different migration
types. It is as if their memory is managed separately.
Figure 6.26: Buddies within a zone. The type refers to the migration type
Listing 6.18 shows the code for freeing an aggregate page (block in the buddy
system). In this case, we start from the block that we want to free and keep
proceeding towards the root. Given the page, we find the page frame number
of the buddy. If the buddy is not free then the find buddy page pfn returns
NULL. Then, we exit the for loop and go to label done merging. If this is not
the case, we delete the buddy and coalesce the page with the buddy.
Let us explain this mathematically. Assume that pages with frame numbers
A and B are buddies of each other. Let the order be ϕ. Without loss of
generality, let us assume that A < B. Then we can say that $B = A + 2^{\phi}$, where
ϕ = 0 for the lowest level (the unit here is pages). Now, if we want to combine
A and B and create one single block that is twice the block size of A and B,
then it needs to start at A and its size needs to be $2^{\phi+1}$ pages.
Let us now remove the restriction that A < B. Let us just assume that they
are buddies of each other. We then have $A = B \oplus 2^{\phi}$. Here ⊕ stands for the XOR
operator. Then, if we coalesce them, the aggregate page corresponding to the
parent node needs to have its starting pfn (page frame number) at min(A, B).
This is the same as A & B, where & stands for the bitwise AND operation. This
is because they differ at a single bit: the (ϕ + 1)th bit (the LSB is bit #1). If we
compute a bitwise AND, then this bit gets set to 0, and we get the minimum of
the two pfns. Let us now compute min(A, B) − A. It can either be 0 or $-2^{\phi}$,
where the order is ϕ.
We implement exactly the same logic in Listing 6.18, where A and B are
pfn and buddy pfn, respectively. The combined pfn represents the minimum:
starting address of the new aggregate page. The expression combined pfn -
pfn is the same as min(A, B) − A. If A < B, it is equal to 0, which means
that the aggregate page (corresp. to the parent) starts at struct page* page.
However, if A > B, then it starts at page minus an offset. The offset should
be equal to A − B multiplied by the size of struct page. In this case A − B
is equal to pfn - combined pfn. The reason that this offset gets multiplied
with struct page is because when we do pointer arithmetic in C, any constant
that gets added or subtracted to a pointer automatically gets multiplied by the
size of the structure (or data type) that the pointer is pointing to. In this case,
the pointer is pointing to data of type struct page. Hence, the negative offset
combined pfn - pfn also gets multiplied with sizeof(struct page). This is
the starting address of the aggregate page (corresponding to the parent node).
done_merging :
    /* set the order of the new (coalesced) page and
       add it to the corresponding free list */
    set_buddy_order ( page , order ) ;
    add_to_free_list ( page , zone , order , migratetype ) ;
}
Once we combine a page and its buddy, we increment the order and try to
combine the parent with its buddy and so on. This process continues until we are
successful. Otherwise, we break from the loop and reach the label done merging.
Here we set the order of the merged (coalesced) page and add it to the free list
at the corresponding order. This completes the process of freeing a node in the
buddy tree.
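In code, the index manipulation described above boils down to a few lines per merge step (a sketch modeled on the description; kernel types such as struct page are assumed to be available):

    /* One merge step: given the page being freed, its pfn and the current
       order, compute the buddy and the start of the coalesced (parent) block. */
    static struct page *merge_step(struct page *page, unsigned long *pfn,
                                   unsigned int *order)
    {
        unsigned long buddy_pfn    = *pfn ^ (1UL << *order);  /* flip one bit     */
        unsigned long combined_pfn = *pfn & buddy_pfn;        /* min of both pfns */

        page   = page + (combined_pfn - *pfn);  /* pointer arithmetic scales by
                                                   sizeof(struct page)           */
        *pfn   = combined_pfn;
        *order = *order + 1;                    /* try merging at the next level */
        return page;
    }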
The buddy system overlays a possibly unbalanced binary tree over a lin-
ear array of pages. Each node of the tree corresponds to a set of contigu-
ous pages (the number is a power of 2). The range of pages represented
by a node is equally split between its children (left-half and right-half).
This process continues recursively. The allocations are always made at
the leaf nodes that are also constrained to have a capacity of N pages,
where N is a power of 2. It is never the case that two children of the
same node are both free (unallocated): if this happens, we delete them
and make the parent a leaf node. Whenever an allocation is made in
a leaf node that exceeds the minimum page size, the allocated memory
always exceeds 50% of the capacity of that node (otherwise we would
have split that node).
Figure 6.27: The slab allocator: a kmem cache node maintains lists of full, partial and free slabs; each slab occupies a memory region and stores a set of objects
The slab cache has a per-CPU array of free objects (array cache). These
are recently freed objects, which can be quickly reused. This is a very fast way
of allocating an object without accessing other data structures to find which
object is free. Every object in this array is associated with a slab. Sadly, when
such an object is allocated or freed, the state in its encapsulating slab needs to
also be changed. We will see later that this particular overhead is not there in
the slub allocator.
Now, if there is a high demand for objects, then we may run out of free
objects in the per-CPU array cache. In such a case, we need to find a slab
that has a free object available.
It is very important to appreciate the relationship between a slab and the
slab cache at this point of time. The slab cache is a system-wide pool whose job
is to provide a free object and also take back an object after it has been used
(added back to the pool). A slab on the other hand is just a storage area for
storing a set of k objects: both active and inactive.
The slab cache maintains three kinds of slab lists – full, partial and free –
for each NUMA node. The full list contains only slabs that do not have any
free object. The partial list contains a set of partially full slabs and the free
list contains a set of slabs that do not even have a single allocated object. The
algorithm is to first query the list of partially full slabs and find a partially full
slab. Then in that slab, it is possible to find an object that has not been allo-
cated yet. The state of the object can then be initialized using an initialization
function whose pointer must be provided by the user of the slab cache. The
object is now ready for use.
However, if there are no partially full slabs, then one of the empty slabs
needs to be taken and converted to a partially full slab by allocating an object
within it.
We follow the reverse process when returning an object to the slab cache.
Specifically, we add it to the array cache and update the state of the slab that
the object is a part of. The slab can easily be found by looking at the address
of the object and then doing a little bit of pointer math to find the nearest
slab boundary. If the slab was full, then now it is partially full. It needs to be
removed from the full list and added to the partially full list. If this was the
only allocated object in a partially full slab, then the slab is empty now.
We assume that a dedicated region in the kernel’s memory map is used to
store the slabs. Clearly all the slabs have to be in a contiguous region of the
memory such that we can do simple pointer arithmetic to find the encapsulating
slab. The memory region corresponding to the slabs and the slab cache can be
allocated in bulk using the high-level buddy allocator.
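The pointer arithmetic alluded to above can be sketched as follows. This is a user-space illustration; the slab size, the alignment assumption and the object address are all made up.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                          /* 4 KB pages (assumption) */
#define SLAB_ORDER 2                           /* hypothetical: 4 pages per slab */
#define SLAB_BYTES (1UL << (PAGE_SHIFT + SLAB_ORDER))

/* Given the address of an object, find the start of its enclosing slab.
 * This only works because slabs are assumed to come from a dedicated,
 * size-aligned region of memory (as described in the text). */
static uintptr_t slab_base(uintptr_t obj_addr)
{
    return obj_addr & ~(SLAB_BYTES - 1);       /* round down to the slab boundary */
}

int main(void)
{
    uintptr_t obj = 0x700000013a40UL;          /* made-up object address */

    printf("object %#lx lies in the slab starting at %#lx\n",
           (unsigned long)obj, (unsigned long)slab_base(obj));
    return 0;
}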
This is a nice, flexible and rather elaborate way of managing physical memory
for storing objects of only a particular type. A criticism of this approach is that
there are too many lists, and we frequently need to move slabs from one list to
the other.
[Figure 6.28: Structure of the slub allocator – the kmem_cache has a per-CPU kmem_cache_cpu (cpu_slab) with a freelist pointer to free objects and a pointer to the current slab; it also stores the object size, the object constructor (ctor) and an array of kmem_cache_node pointers (node[NUMA_NODES]). Each slab records the number of objects in use (inuse), its freelist and a pointer back to its slab_cache. Empty slabs are returned to the memory system, and full slabs are simply forgotten.]
We reuse the same slab structure that was used in the slab allocator. We
specifically make use of the inuse field, which tracks the number of objects that are
currently in use, and the freelist. Note that we have compressed the slab
part in Figure 6.28 and just summarized it. This is because it has been shown
in its full glory in Figure 6.27.
Here also every slab has a pointer to the slab cache (kmem_cache). However,
the slab cache is architected differently. Every CPU in this case is given a private
slab that is stored in its per-CPU region. We do not have a separate set of free
objects for quick allocation. It is necessary to prioritize regularity for achieving high performance.
There are performance benefits because there is more per-CPU space, and
it is quite easy to manage it. Recall that in the case of the slab allocator, we
had to also go and modify the state of the slabs that encapsulated the allocated
objects. Here we maintain state at only one place, and we never separate an
object from its slab. All the changes are confined to a slab and there is no need
to go and make changes at different places. We just deal in terms of slabs and
assign them to the CPUs and slab caches at will. Given that a slab is never split
into its constituent objects, their high-level management is quite straightforward.
If the per-CPU slab becomes full, all that we need to do in this case is simply
forget about it and find a new free slab to assign to the CPU. In this case, we
do not maintain a list of fully free and full slabs. We just forget about them.
We only maintain a list of partially full slabs, and query this list of partially full
slabs, when we do not find enough objects in the per-CPU slab. The algorithm
is the same. We find a partially full slab and allocate a free object. If the
partially full slab becomes full, then we remove it from the list and forget about
it. This makes the slab cache much smaller and more memory efficient. Let us
now see where pointer math is used. Recall that the slub allocator heavily relies
on pointer arithmetic.
Note that we do not maintain a list of full slabs nor empty slabs. Instead,
we chose to just forget about them. Now if an object is deallocated, we need
to return it back to the pool. From the object’s address, we can figure out that
it was a part of a slab. This is because slabs are stored in a dedicated memory
region. Hence, the address is sufficient to figure out that the object is a part of
a slab, and we can also find the starting address of the slab by computing the
nearest “slab boundary”. We can also figure out that the object is a part of a
full slab because the slab is not present in the slab cache. Now that the object
is being returned to the pool, a full slab becomes partially full. We can then
add it to the list of partially full slabs in the slab cache.
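The following toy user-space model (not the kernel's actual code) captures the fast path described above: objects are popped off the per-CPU slab's freelist, and when the slab runs out we simply forget it and move on to a fresh one. The structure names and sizes are made up for illustration.

#include <stdio.h>
#include <stdlib.h>

#define OBJS_PER_SLAB 4
#define OBJ_SIZE      64

struct slab {
    void *freelist;                       /* next free object; objects link to each other */
    unsigned int inuse;
    unsigned char mem[OBJS_PER_SLAB * OBJ_SIZE];
};

/* Create a fresh slab: chain all of its objects into a freelist. */
static struct slab *new_slab(void)
{
    struct slab *s = calloc(1, sizeof(*s));
    if (!s)
        exit(1);
    for (int i = 0; i < OBJS_PER_SLAB; i++) {
        void *obj = s->mem + i * OBJ_SIZE;
        *(void **)obj = s->freelist;      /* store the next pointer inside the object */
        s->freelist = obj;
    }
    return s;
}

/* Per-CPU state: just a pointer to the current slab. */
static struct slab *cpu_slab;

static void *slub_alloc(void)
{
    if (!cpu_slab || !cpu_slab->freelist) {
        /* Per-CPU slab exhausted: forget it and grab a new one.
         * (The real allocator would first consult the per-node partial list.) */
        cpu_slab = new_slab();
    }
    void *obj = cpu_slab->freelist;
    cpu_slab->freelist = *(void **)obj;
    cpu_slab->inuse++;
    return obj;
}

int main(void)
{
    for (int i = 0; i < 6; i++)
        printf("allocated object %d at %p (inuse = %u)\n",
               i, slub_alloc(), cpu_slab->inuse);
    return 0;
}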
Exercises
Ex. 4 — Let us say that we want to switch between user-mode processes with-
out flushing the TLB or splitting the virtual address space among user processes.
How can we achieve this with minimal hardware support?
* Ex. 5 — We often transfer data between user programs and the kernel. For
example, if we want to write to a device, we first store our data in a character
array, and transfer a pointer to the array to the kernel. In a simple implemen-
tation, the kernel first copies data from the user space to the kernel space, and
then proceeds to write the data to the device. Instead of having two copies of the
same data, can we have a single copy? This will lead to a higher-performance
implementation. How do we do it, without compromising on security?
Now, consider the reverse problem, where we need to read a device. Here also,
the kernel first reads data from the device, and then transfers data to the user’s
memory space. How do we optimize this, and manage with only a single copy
of data?
Ex. 6 — Prove the optimality of the optimal page replacement algorithm.
Ex. 9 — When and how is the MRU page replacement policy better than the
LRU page replacement policy?
Ex. 10 — What is the reason for setting the page size to 4 KB? What happens
if the page size is higher or lower? List the pros and cons.
Ex. 11 — Consider a memory that can hold only 3 frames. We have a choice
of two page-replacement algorithms: LRU and LFU.
a) Show a page access sequence where LRU performs better than LFU.
b) Show a page access sequence where LFU performs better than LRU.
Explain the insights as well.
Ex. 13 — What are the causes of thrashing? How can we prevent it?
Ex. 14 — What is the page walking process used for in the MG-LRU algo-
rithm? Answer in the context of the lru_gen_mm_state structure.
Ex. 15 — How is a Bloom filter used to reduce the overhead of page walking?
Ex. 16 — What is the need to deliberately mark actively used pages as “non-
accessible”?
Ex. 17 — What is the swappiness variable used for, and how is it normally
interpreted? When would you prefer evicting FILE pages as opposed to ANON
pages, and vice versa? Explain with use cases.
Ex. 18 — Let us say that you want to “page” the page table. In general, the
page table is stored in memory, and it is not removed or swapped out – it is
basically pinned to memory at a pre-specified set of addresses. However, now let
us assume that we are using a lot of storage space to store page tables, and we
would like to page the page tables such that parts of them, that are not being
used very frequently, can be swapped out. Use concepts from folios, extents and
inodes to create such a swappable page table.
Ex. 19 — How is reverse mapping done for ANON and FILE pages?
Ex. 20 — How many anon vma structures is an anon vma chain connected
to?
Ex. 21 — Why do we need separate anon vma chain structures for shared
COW pages and private pages?
Ex. 22 — Given a page, what is the algorithm for finding the pfn number of
its buddy page, and the pfn number of its parent?
Ex. 23 — What are the possible advantages of handing over full slabs to the
baseline memory allocation system in the SLUB allocator?
Ex. 24 — Compare the pros and cons of all the kernel-level memory alloca-
tors.
Chapter 7
The I/O System, Storage Devices
and Device Drivers
There are three key functions of an OS: process manager, memory manager and
device manager. We have already discussed the role of the OS for the former
two functionalities in earlier chapters. Let us now come to the role of the OS in
managing devices, especially storage devices. As a matter of fact, most low-level
programmers who work with kernel code actually work in the space of writing
device drivers for I/O devices. Core kernel developers are, in comparison, far
fewer in number; roughly 70% of the overall kernel code is accounted for by device
drivers. This is expected mainly because a modern OS supports a very large
number of devices and each device pretty much needs its own custom driver.
Of course with the advent of USB technology, some of that is changing in the
sense that it is possible for a single USB driver to handle multiple devices.
For example, a generic keyboard driver can take care of a large number of
USB-based keyboards. Nevertheless, given the sheer diversity of devices, driver
development still accounts for the majority of “OS work”.
In the space of devices, storage devices such as hard disks and flash/NVM
drives have a very special place. They are clearly the most important citizens
in the device world. Other devices such as keyboards, mice and web cameras
are nonetheless important, but they are clearly not in the same league as stor-
age devices. The reasons are simple. Storage devices are often needed for a
computing system to function. Such a device stores all of its data when the
system is powered off (provides nonvolatile storage). It plays a vital role in the
boot process, and also stores the swap space, which is a key component of the
overall virtual memory system. Hence, any text on devices and drivers always
has a dedicated set of sections that particularly look at storage devices and the
methods of interfacing with them.
Linux distinguishes between two kinds of devices: block and character. Block
devices read and write a large block of data at a time. For example, storage
devices are block devices that often read and write 512-byte chunks of data in
one go. On the other hand, character devices read and write a single character
or a set of few characters at a time. Examples of character devices are keyboards
and mice. For interfacing with character devices, a device driver is sufficient;
it can be connected to the terminal or the window manager. This provides the
user a method to interact with the underlying OS and applications.
We shall see that for managing block devices, we need to create a file system
that typically has a tree-structured organization. The internal nodes of this file
system are directories (folders in Windows). The leaf nodes are the individual
files. The file is defined as a set of bytes that has a specific structure based on
the type of data it contains. For instance, we can have image files, audio files,
document files, etc. A directory or a folder on the other hand has a fixed tabular
structure that just stores the pointers to every constituent file or directory within
it. Linux generalizes the concept of a file. For it, everything is a file including
a directory, device, regular file and process. This allows us to interact with all
kinds of entities within Linux using regular file-handling mechanisms.
This chapter has four parts: basics of the I/O system, details of storage
devices, structure of character and block device drivers and the design of file
systems.
[Figure 7.1: Organization of the chipset – the processor connects over the frontside bus to the North Bridge chip, which in turn connects to the graphics processor (over the PCI Express bus), the memory modules and the South Bridge chip.]
The South Bridge chip connects to slower peripherals such as USB devices and
disk drives. Its role is quite important in the sense
that it needs to interface with numerous controllers corresponding to a diverse
set of buses. Note that we need additional chips corresponding to each kind of
bus, which is why when we look at the picture of a motherboard, we see many
chips. Each chip is customized for a given bus (set of devices). These chips
are together known as the chipset. They are a basic necessity in a large system
with a lot of peripherals. Without a chipset, we will not be able to connect to
external devices, notably I/O and storage devices. Over the last decade, the
North Bridge functionality has moved on-chip. Connections to the GPU and
the memory modules are also more direct in the sense that they are directly
connected to the CPUs via either dedicated buses or memory controllers.
However, the South Bridge functionality has remained as an off-chip entity in
many general purpose processors on server-class machines. It is nowadays (as of
2024) referred to as the Platform Controller Hub (PCH). Modern motherboards
still have a lot of chips including the PCH primarily because there are limits to
the functionality that can be added on the CPU chip. Let us elaborate.
I/O controller chips sometimes need to be placed close to the corresponding
I/O ports to maintain signal integrity. For example, the PCI-X controller and
the network card (on the PCI-X bus) are in close proximity to the Ethernet
port. The same is the case for USB devices and audio inputs/ outputs. The
Figure 7.2: Flow of actions in the kernel: application → kernel → device driver
→ CPU → I/O device (and back)
Figure 7.2 shows the flow of actions when an application interacts with an
I/O device. The application makes a request to the kernel via a system call.
This request is forwarded to the corresponding device driver, which is the only
subsystem in the kernel that can interact with the I/O device. The device driver
issues specialized instructions to initiate a connection with the I/O device. A
request gets routed to the I/O device by a set of chips that are a part of the
chipset; the South Bridge chip is one of them. Depending upon the request type, read or write, an
appropriate response is sent back. In the case of a read, it is a chunk of data
and in the case of a write, it is an acknowledgment.
The response follows the reverse path. Here there are several options. If it
was a synchronous request, then the processor waits for the response. Once it
is received, the response (or a pointer to it) is put in a register, which is visible
to the device driver code. However, given that I/O devices can take a long time
to respond, a synchronous mechanism is not always the best. Instead, an asyn-
chronous mechanism is preferred where an interrupt is raised when the response
is ready. The CPU that handles the interrupt fetches the data associated with
the response from the I/O system.
This response is then sent to the interrupt handler, which forwards it to the
device driver. The device driver processes the response. After processing the
received data, a part of it can be sent back to the application via other kernel
subsystems.
[Figure 7.3: Layers of the I/O protocol stack – the protocol layer (the device-driver-level transmission protocol), the network layer (routing messages to the right I/O device), the data link layer and the physical layer.]
The physical layer is concerned with converting bits into electrical signals. In
cases when a long string of identical 0s or 1s is sent, artificial transitions are inserted into
the data for easier clock recovery.
The data link layer has a similar functionality as the corresponding layer in
networks. It performs the key tasks of error correction and framing (chunk data
into fixed sets of bytes).
Finally, the protocol layer is concerned with the high-level data transfer
protocol. There are many methods of transferring data such as interrupts,
polling and DMA. Interrupts are a convenient mechanism. Whenever there
is any new data at an I/O device, it simply raises an interrupt. Interrupt
processing has its overheads.
On the other hand polling can be used where a thread continuously polls
(reads the value) an I/O register to see if new data has arrived. If there is new
data, then the I/O register stores a logical 1. The reading thread can reset this
value and read the corresponding data from the I/O device. Polling is a good
idea if there is frequent data transfer. We do not have to pay the overhead of
interrupts. We are always guaranteed to read or write some data. On the other
hand, the interrupt-based mechanism is useful when data transfer is infrequent.
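A schematic polling loop looks as follows; the register layout, the status bit and the fake device in main() are all made up for illustration.

#include <stdbool.h>
#include <stdint.h>

#define STATUS_DATA_READY 0x1u       /* hypothetical "new data" status bit */

struct io_regs {
    volatile uint32_t status;        /* bit 0: new data has arrived */
    volatile uint32_t data;          /* the data itself */
};

/* Poll until data arrives, then read it and clear the status bit. */
static uint32_t poll_read(struct io_regs *regs)
{
    while (!(regs->status & STATUS_DATA_READY))
        ;                                    /* busy-wait: burns CPU cycles */
    uint32_t value = regs->data;
    regs->status &= ~STATUS_DATA_READY;      /* reset so the device can signal again */
    return value;
}

int main(void)
{
    struct io_regs fake = { .status = STATUS_DATA_READY, .data = 42 };
    return poll_read(&fake) == 42 ? 0 : 1;
}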
The last method is outsourcing the entire process to a DMA (Direct Memory
Access) controller. It performs the full I/O access (read or write) on its own
and raises an interrupt when the overall operation has completed. This is useful
for reading/ writing large chunks of data.
[Figure 7.4: I/O ports – each port has a software interface (a set of input and output registers), a port controller and a physical port connector.]
Instruction              Semantics
in r1, ⟨i/o port⟩        r1 ← contents of ⟨i/o port⟩
out r1, ⟨i/o port⟩       contents of ⟨i/o port⟩ ← r1
This mechanism is known as port-mapped I/O. An I/O request contains the address of the I/O port. We
can use the in and out instructions to read the contents of an I/O port or
write to it, respectively. The pipeline of a processor sends an I/O request to
the North Bridge chip, which in turn forwards it to the South bridge chip. The
latter forwards the request to the destination – the target I/O device. This uses
the routing resources available in the chipset. This pretty much works like a
conventional network. Every chip in the chipset maintains a small routing table;
it knows how to forward a request given its target I/O device. The response
follows the reverse path, which is towards the CPU.
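On an x86 Linux machine, port-mapped I/O can even be exercised from a privileged user-space program via ioperm() and the inb()/outb() wrappers, as in the sketch below, which reads the RTC seconds register through the classic CMOS index/data ports 0x70/0x71. Inside the kernel, a device driver would issue the same in/out operations directly; this is only an illustration and requires root privileges.

#include <stdio.h>
#include <sys/io.h>     /* inb(), outb(), ioperm() -- x86 Linux, needs root */

int main(void)
{
    /* Request access to I/O ports 0x70 and 0x71 (the CMOS/RTC index and
     * data ports on a PC). */
    if (ioperm(0x70, 2, 1) != 0) {
        perror("ioperm");
        return 1;
    }

    outb(0x00, 0x70);                  /* select CMOS register 0 (RTC seconds) */
    unsigned char seconds = inb(0x71); /* read it via the data port */

    printf("RTC seconds register: 0x%02x\n", seconds);
    return 0;
}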
This is a simple mechanism that has its share of problems. The first is that
it has very high overheads. An I/O port is 8 to 32 bits wide, which means
that we can only read or write 1 to 4 bytes of data at a time. This basically
means that if we want to access a high-bandwidth device such as a scanner or
a printer, a lot of I/O instructions need to be issued. This puts a lot of load on
the CPU’s pipeline and prevents the system from doing any other useful work.
We need to also note that such I/O instructions are expensive instructions in
the sense that they need to be executed sequentially. They have built-in fences
(memory barriers) and do not allow reordering. I/O instructions permanently
change the system state, and thus no other instruction – I/O or regular memory
read/write – can be reordered with respect to them.
Along with bandwidth limitations and performance overheads, using such
instructions makes the code less portable across architectures. Even if the code is
migrated to another machine, it is not guaranteed to work because the addresses
of the I/O ports assigned to a given device may vary. The assignment of I/O
port numbers to devices is a complicated process. For devices that are integrated
into the motherboard, the port numbers are assigned at the manufacturing time.
For other devices that are inserted into expansion slots, PCI Express buses, etc.,
the assignment is done at boot time by the BIOS. Many modern systems can
modify the assignments after booting. This is why there can be a lot of variance
in the port numbers across machines, even of the same type.
Now, if we try to port the code to a different kind of machine, for example,
if we try to port the code from an Intel machine to an ARM machine, then
pretty much nothing will work. ARM has a very different I/O port architecture.
Note that the in and out assembly instructions are not supported on ARM
machines. At the code level, we thus desire an architecture-independent solution
for accessing I/O devices. This will allow the kernel or device driver code to be
portable to a large extent. The modifications to the code required to port it to
a new architecture will be quite limited.
Note that the I/O address space is only 64 KB using this mechanism. Often
there is a need for much more space. Imagine we are printing a 100 MB file; we
would need a fair amount of buffering capacity on the port controller. This is
why many modern port controllers include some amount of on-device memory.
It is possible to write to the memory in the port controller directly using conven-
tional instructions or DMA-based mechanisms. GPUs are prominent examples
in this space. They have their own memory, and the CPU can write to it. Many modern
devices have started to include such on-device memory. USB 3.0, for example,
has about 250 KB of buffer space on its controllers.
Over the decades, the hard disk became the dominant storage technology in all kinds of computing
devices, from laptops to desktops to servers. After 2015, it started to get
challenged in a big way by other technologies that rely on nonvolatile memory.
However, hard disks are still extremely popular as of 2024, given their scalability
and cost advantages.
[Figure 7.5: The clock signal and the corresponding data line for the bit sequence 0 1 1 1 0 1 0 1 0 0 0]
We need to understand that the hard disk's read/write head moves very quickly
over the recording surface. At periodic intervals, it needs to read the bits stored
on the recording surface. Note that the magnetic field is typically not directly
measured; instead, the change in the magnetic field is noted. It is much easier to
do so. Given that a changing magnetic field induces a current across terminals
on the disk head, this can be detected very easily electronically. Let us say this
happens at the negative edge of the clock. We need perfect synchronization
here. This means that whenever the clock has a negative edge, that is exactly
when a magnetic field transition should be happening. We can afford to have
a very accurate clock but placing magnets, which are physical devices, such
accurately on the recording surface is difficult. There will be some variation in
the production process. Hence, there is a need to periodically resynchronize the
clock with the magnetic field transitions recorded by the head while traversing
over the recording surface. Some minor adjustments are continuously required.
If there are frequent 0 → 1 and 1 → 0 transitions in the stored data, then such
resynchronization can be done.
However, it is possible that the data has a long sequence of 0s and 1s. In
this case, it is often necessary to introduce dummy transitions for the purpose
of synchronization. In the light of this discussion, let us try to understand the
NRZI protocol. A value equal to 0 maintains the voltage value. Whereas, a
value equal to 1, flips the voltage. If the voltage is high, it becomes low, and
vice versa. A logical 1 thus represents a voltage transition, whereas a logical 0
simply maintains the value of the voltage. It is true that there are transitions in
this protocol whenever there is a logical 1; however, there could still be a long
run of 0s. This is where it is necessary to introduce a few dummy transitions.
The dummy data is discarded later.
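A minimal sketch of the NRZI encoding rule just described, assuming (arbitrarily) that the line starts at the low level:

#include <stdio.h>

/* NRZI: a logical 1 flips the line level, a logical 0 keeps it unchanged. */
static void nrzi_encode(const int *bits, int n, int *levels)
{
    int level = 0;                    /* assume the line starts low */
    for (int i = 0; i < n; i++) {
        if (bits[i] == 1)
            level = !level;           /* 1 => transition */
        levels[i] = level;            /* 0 => no transition */
    }
}

int main(void)
{
    int bits[] = { 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0 };
    int n = sizeof(bits) / sizeof(bits[0]);
    int levels[sizeof(bits) / sizeof(bits[0])];

    nrzi_encode(bits, n, levels);
    for (int i = 0; i < n; i++)
        printf("%d ", levels[i]);
    printf("\n");   /* the long run of 0s at the end produces no transitions */
    return 0;
}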
[Figure 7.6: Arrangement of tiny magnets on a hard disk's recording surface – the sequence of N-S orientations encodes the bits 0 1 0 1]
[Figure 7.7: A platter divided into concentric tracks, each of which is further divided into sectors]
Let us now understand how these small magnets are arranged on a circular
disk that is known as a platter. As we can see in Figure 7.7, the platter is divided
into concentric rings that contain such tiny magnets. Each such ring is called a
track. It is further divided into multiple sectors. Each sector typically has the
same size: 512 bytes. In practice, a few more bytes are stored for the sake of
error correction. To maximize the storage density, we would like each individual
magnet to be as small as possible. However, there is a trade-off here. If the
magnets are very small, then the EMF that will be induced will be very small
and will become hard to detect. As a result, there are technological limitations
on the storage density.
Hence, it is a wise idea to store different numbers of sectors per track. The
number of sectors that we store per track depends on the latter’s circumference.
The tracks towards the periphery have more sectors and tracks towards the
center will have fewer sectors. Modern hard disks are actually slightly smarter.
They divide the set of tracks into contiguous sets of rings called zones. Each zone
has the same number of sectors per track. The advantage of this mechanism is
that the electronic circuits get slightly simplified given that the platters rotate
at a constant angular velocity. Within a zone, we can assume that the same
number of sectors pass below the head each second.
[Figure 7.8: A hard disk – a set of platters mounted on a spindle, with read/write heads attached to a common arm]
The structure of a hard disk is shown in Figures 7.8 and 7.9. As we can
see, there are a set of platters that have a spindle passing through their centers.
[Figure 7.9: Internal structure of a hard disk – the spindle motor, platters, read/write heads mounted on an arm with an actuator, the drive electronics and the bus interface]
The spindle itself is controlled by a spindle motor that rotates the platters at a
constant angular velocity. There are disk heads on top of each of the recording
surfaces. These heads are connected to a common rotating arm. Each disk head
can read as well as write data. Reading data involves sensing whether there is
a change in the voltage levels or not (presence or absence of an induced EMF).
Writing data involves setting a magnetic field using a small electromagnet. This
aligns the magnet on the platter with the externally induced magnetic field. We
have a sizable amount of electronics to accurately sense the changes in the
magnetic field, perform error correction, and transmit the bytes that are read
back to the processor via a bus.
Let us now understand how a given sector is accessed. Every sector has
a physical address. Given the physical address, the disk controller knows the
platter on which it is located. A platter can have two recording surfaces: one
on the top and one on the bottom. The appropriate head needs to be activated,
and it needs to be positioned at the beginning of the corresponding sector and
track. This involves first positioning the head on the correct track, which will
happens by rotating the disk arm. The time required for this is known as the
seek time. Once the disk head is on the right track, it needs to wait for the sector
to come underneath it. Given the fact that the platter rotates at a constant
angular velocity, this duration can be computed quite accurately. This duration
is known as the rotational latency. Subsequently, the data is read, error checking
is done, and after appropriately framing the data, it is sent back to the CPU
via a bus. This is known as the transfer latency. The formula for the overall
disk access time is shown in Equation 7.1.

Disk access time = seek time + rotational latency + transfer time      (7.1)

• The seek time is the time required to position the head on the track
that contains the desired sector.
• The rotational latency is the time that the head needs to wait for the
beginning of the desired sector to come below it after it has been
positioned on the right track.
• The transfer time is the time it takes to transfer the sector to
the CPU. This time includes the time to perform error checking,
framing and sending the data over the bus.
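As a quick sanity check with purely hypothetical numbers: consider a disk with an average seek time of 4 ms that spins at 7200 RPM and transfers data at 200 MB/s. One full rotation takes 60/7200 s ≈ 8.33 ms, so the average rotational latency is about 4.17 ms. Transferring a 512-byte sector takes 512/(200 × 10^6) s ≈ 2.6 µs plus framing and bus overheads. The overall access time (Equation 7.1) is thus roughly 4 + 4.17 ≈ 8.2 ms and is completely dominated by the mechanical components.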
Given that in a hard disk, there are mechanical parts and also the head
needs to physically move, there is a high chance of wear and tear. Hence, disk
drives have limited reliability. They mostly tend to have mechanical failures.
To provide a degree of failure resilience, the disk can maintain a set of spare
sectors. Whenever there is a fault in a sector, which will basically translate to
an unrecoverable error, one of the spare sectors can be used to replace this “bad
sector”.
There are many optimizations possible here. We will discuss many of these
when we introduce file systems. The main idea here is to store a file in such a
way on a storage device that it can be transferred to memory very quickly. This
means that the file system designer has to have some idea of the physical layout
of the disk and the way in which physical addresses are assigned to logical
addresses. If some of this logic is known, then the seek time, as well as the
rotational latency can be reduced substantially. For instance, in a large file all
the data sectors can be placed one after the other on the same track. Then
they can be placed in corresponding tracks (same distance from the center)
in the rest of the recording surfaces such that the seek time is close to zero.
This will ensure that transitioning between recording surfaces will not involve
a movement of the head in the radial direction.
All the tracks that are vertically above each other have almost the same
distance from the center. We typically refer to a collection of such tracks using
the term cylinder. The key idea here is that we need to preserve locality and
thus ensure that all the bytes in a file can quickly be read or written one after
the other. Once a cylinder fills up, the head can move to the adjacent cylinder
(next concentric track), so on and so forth.
7.2.2 RAID
Hard drives are relatively flimsy and have reliability issues. This is primarily
because they rely on mechanical parts, which are subject to wear and tear. They
thus tend to fail. As a result, it is difficult to create large storage arrays that
comprise hard disks. We need to somehow make large storage arrays resilient to
disk failures. There is a need to have some built-in redundancy in the system.
The concept of RAID (Redundant Array of Inexpensive Disks) was proposed
to solve such problems. Here, the basic idea is to have additional disks that
store redundant data. In case a disk fails, other disks can be used to recover
the data. The secondary objective of RAID-based solutions is to also enhance
the bandwidth given that we have many disks that can be used in parallel. If
we consider the space of these two dual aims – reliability and performance – we
can design many RAID solutions that cater to different kinds of users. The user
can choose the best solution based on her requirements.
RAID 0
RAID 0 (shown in Figure 7.10) primarily enhances performance. Consecutive blocks
are striped across the disks: B1 is stored on Disk 1, B2 on Disk 2, B3 on Disk 1
again, and so on. Accesses to different blocks can thus proceed in parallel, which
increases the overall bandwidth. However, there is no redundancy – the failure of
any single disk leads to data loss.
[Figure 7.10: RAID 0 – blocks are striped across the two disks: B1, B3, B5, B7, B9 on Disk 1 and B2, B4, B6, B8, B10 on Disk 2]
RAID 1
On the other hand, RAID 1 (shown in Figure 7.11) enhances the reliability.
Here the same block is stored across the two disks. For example, block B1 is
stored on both the disks: Disk 1 and 2. If one of the disks fails, then the other
disk can be used to service all the reads and writes (without interruption). Later
on, if we decide to replace the failed disk then the other disk that is intact can
provide all the data to initialize the new disk.
[Figure 7.11: RAID 1 – every block (B1 to B5) is mirrored: each block is stored on both Disk 1 and Disk 2]
This strategy does indeed enhance the reliability by providing a spare disk.
However, the price that is incurred is that for every write operation, we actually
need to write the same copy of the block to both the disks. Reads are still fast
because we can choose one of the disks for reading. We especially choose the
one that is lightly loaded to service the read. This is sadly not possible in the
case of write operations.
RAID 2, 3 and 4
[Figure 7.12: RAID 2/3/4 – four data disks and a dedicated parity disk; parity block Pi (e.g., P1 = B1 ⊕ B2 ⊕ B3 ⊕ B4) protects the i-th stripe of data blocks]
We clearly have some issues with RAID 1 because it does not enhance the
bandwidth of write operations. In fact, in this case we need to write the same
data to multiple disks. Hence, a series of solutions have been proposed to
ameliorate this issue. They are named RAID 2, 3 and 4, respectively. All of
them belong to the same family of solutions (refer to Figure 7.12).
In the figure we see an array of five disks: four store regular data and one
stores parities. Recall that the parity of n bits is just their XOR. If one of the
bits is lost, we can use the parity to recover the lost bit. The same can be done
at the level of 512-byte blocks as well. If one block is lost due to a disk failure, it
can be recovered with the help of the parity block. As we can see in the figure,
the parity block P1 is equal to B1 ⊕ B2 ⊕ B3 ⊕ B4, where ⊕ stands for the
XOR operation. Assume that the disk with B2 fails. We can always compute
B2 as B1 ⊕ P 1 ⊕ B3 ⊕ B4.
Let us instead focus on some differences across the RAID levels: 2, 3 and 4.
RAID 2 stores data at the level of a single bit. This means that its block size is
just a single bit, and all the parities are computed at the bit level. This design
offers bit-level parallelism, where we can read different bit streams in parallel
and later on fuse them to recreate the data. Such a design is hardly useful,
unless we are looking at bit-level storage, which is very rare in practice.
RAID level 3 increases the block size to a single byte. This allows us to read
or write to different bytes in parallel. In this case, Disk i stores all the bytes at
locations 4n + i. Given a large file, we can read its constituent bytes in parallel,
and then interleave the byte streams to create the file in memory. However, this
reconstruction process is bound to be slow and tedious. Hence, this design is
neither very efficient nor very widely used.
Finally, let us consider RAID 4, where the block size is equal to a conven-
tional block size (512 bytes). This is typically the size of a sector in a hard disk
and thus reconstructing data at the level of blocks is much easier and much more
intuitive. Furthermore, it is also possible to read multiple files in parallel given
that their blocks are distributed across the disks. Such designs offer a high level
of parallelism and if the blocks are smartly distributed across the disks, then a
theoretical bandwidth improvement of 4× is possible in this case.
There is sadly a problem with these RAID designs. The issue is that there
is a single parity disk. Whenever we are reading something, we do not have
to compute the parity because we assume that if the disk is alive, then the
block that is read is correct. Of course, we are relying on block-level error
checking, and we are consequently assuming that it is sufficient to attest to
the correctness of the block's contents. Sadly, in this case writing data is much
more onerous. Let us first consider a naive solution.
We may be tempted to argue that to write to any block, it is necessary to
read the rest of the blocks from the other disks and compute the new value of
the parity. It turns out that there is no need to actually do this; we can instead
rely on an interesting property of the XOR function. Note the following:
P1 = B1 ⊕ B2 ⊕ B3 ⊕ B4
P1′ = P1 ⊕ B1′ ⊕ B1 = B1′ ⊕ B2 ⊕ B3 ⊕ B4      (7.2)

The new parity P1′ is thus equal to B1′ ⊕ B2 ⊕ B3 ⊕ B4. We thus have a neat
optimization here; it is not necessary to read the rest of the disks. Nevertheless,
there is still a problem. For every write operation, the parity disk has to be read,
and it has to be written to. This makes the parity disk a point of contention –
it will slow down the system because of requests queuing up. Moreover, it will
also see a lot of traffic, and thus it will wear out faster. This will cause many
reliability problems, and the parity disk will most likely fail first. Hence,
there is a need to distribute the parity blocks across all the disks. This is precisely
the problem that RAID 5 solves.
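Before moving on, the read-modify-write parity update of Equation 7.2 can be sketched as follows. This is a user-space illustration that operates on small in-memory byte arrays instead of real 512-byte disk blocks.

#include <stdio.h>

#define BLOCK_SIZE 8   /* tiny blocks for illustration; real blocks are 512 bytes */

/* New parity: P1' = P1 XOR B1 XOR B1'. Only the old data block, the new data
 * block and the old parity are needed -- the other data disks are not read. */
static void update_parity(unsigned char *parity,
                          const unsigned char *old_block,
                          const unsigned char *new_block)
{
    for (int i = 0; i < BLOCK_SIZE; i++)
        parity[i] ^= old_block[i] ^ new_block[i];
}

int main(void)
{
    unsigned char b1[BLOCK_SIZE]     = { 1, 2, 3, 4, 5, 6, 7, 8 };
    unsigned char b2[BLOCK_SIZE]     = { 9, 9, 9, 9, 9, 9, 9, 9 };
    unsigned char b1_new[BLOCK_SIZE] = { 8, 7, 6, 5, 4, 3, 2, 1 };
    unsigned char p1[BLOCK_SIZE];

    /* Initial parity of the stripe (here, just two data blocks) */
    for (int i = 0; i < BLOCK_SIZE; i++)
        p1[i] = b1[i] ^ b2[i];

    update_parity(p1, b1, b1_new);    /* write B1' and update the parity */

    /* Sanity check: the updated parity must equal B1' XOR B2 */
    int ok = 1;
    for (int i = 0; i < BLOCK_SIZE; i++)
        ok &= (p1[i] == (unsigned char)(b1_new[i] ^ b2[i]));
    printf("parity update %s\n", ok ? "correct" : "incorrect");
    return 0;
}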
RAID 5
Figure 7.13 shows a set of disks with distributed parity, where there is no single
disk dedicated to exclusively storing parity blocks. We observe that for the first
set of blocks, the parity block is on Disk 5. Then for the next set, the parity
block P 2 is stored in Disk 1, so on and so forth. Here the block size is typically
equal to the block size of RAID 4, which is normally the disk sector size, i.e.,
512 bytes. The advantage here is that there is no single disk that is a point
of contention. The design otherwise has the rest of the advantages of RAID 4,
which are basically the ability to support parallel read accesses and optimized
write accesses. The only disks that one needs to access while writing are as
follows: the disk that is being written to and the parity disk.
[Figure 7.13: RAID 5 – distributed parity; the parity block of each stripe is placed on a different disk (P1 on Disk 5, P2 on Disk 1, P3 on Disk 2, and so on)]
RAID 6
Let us now ask a more difficult question, “What if there are two disk failures?”
Having a single parity block will not solve the problem. We need at least two
parity blocks. The mathematics to recover the contents of the blocks is also
much more complex.
Without getting into the intricate mathematical details, it suffices to say
that we have two parity blocks for every set of blocks, and these blocks are dis-
tributed across all the disks such that there is no point of contention. Figure 7.14
pictorially describes the scheme.
7.2.3 SSDs
Let us next discuss another genre of storage devices that rely on semiconductor
technologies. The technology that is used here is known as flash. This technol-
ogy is used to create SSDs (solid state drives). Such storage technologies do
not use magnets to store bits, and they also do not have any mechanical parts.
Hence, they are both faster and often more reliable as well. Sadly, they have
their own share of failure mechanisms, and thus they are not as reliable as one
might think. Nevertheless, we can confidently say that they are immune to
mechanical failures.
[Figure 7.14: RAID 6 – each stripe has two parity blocks (PnA and PnB), distributed across the six disks so that no single disk becomes a point of contention]
Basic Operation
Let us understand at a high level how they store a bit. Figure 7.15 shows a novel
device that is known as a floating gate transistor. It looks like a normal NMOS
transistor with its dedicated source and drain terminals and a gate connected
to an external terminal (known as the control gate). Here, the interesting point
to note is that there are actually two gates stacked on top of each other. They
are separated by an insulating silicon dioxide layer.
Let us focus on the gate that is sandwiched between the control gate and the
transistor’s channel. It is known as the floating gate. If we apply a very strong
positive voltage, then electrons will get sucked into the floating gate because of
the strong positive potential and when the potential is removed, many of the
electrons will actually stay back. When they stay back in this manner, the cell
is said to be programmed. We assume that at this point it stores a logical 0. If
we wish to reset the cell, then there is a need to actually push the electrons back
into the transistor’s substrate and clear the floating gate. This will necessitate
the application of a strong negative voltage at the control gate terminal, which
will push the electrons back into the transistor’s body. In this case, the floating
gate transistor (the flash cell) is said to be reset. The cell stores a logical 1
in this state.
Let us now look at the process of reading the value stored in such a
memory cell. When the cell is programmed, its threshold voltage rises. It
becomes equal to VT+ , which is higher than the normal threshold voltage VT .
Hence, to read the value in the cell we set the gate voltage equal to a value that
is between VT and VT+ . If it is not programmed, then the voltage will be higher
than the threshold voltage and the cell will conduct current, otherwise it will
be in the cutoff state and will not conduct current. This is known as enabling
the cell (or the floating gate transistor).
[Figure 7.15: Floating gate transistor – (a) the structure: control gate, SiO2-insulated floating gate, source and drain; (b) the circuit symbol]
[Figure: flash cells attached to a shared bit line, with word lines WL1, WL2, ... used to select individual cells]
To read a given cell, the word lines of the other cells connected in series with it are set to a voltage
greater than VT+. They will thus become conducting. The voltage on the bit
line will be decided by the value stored in the transistor that is enabled. We
can thus infer the value that it stores.
[Figure: a NAND flash string – cells driven by word lines WL1 to WL8 connected in series between a ground-select transistor and a bit-line-select transistor]
P/E Cycles
Let us now discuss a very fascinating aspect of such flash-based devices. These
devices provide read-write access at the level of pages, not bytes – we can only
read or write a full page (512-4096 bytes) at a time. We cannot access data at
a smaller granularity. As we have seen, the storage of data within such devices
is reasonably complicated. We have fairly large NAND flash cells and reading
them requires some work. Hence, it is a much better idea to read a large number
of bytes in one go such that a lot of the overheads can be amortized. Enabling
these cells and the associated circuits has time overheads, which
necessitates page-level accesses. Hence, reading or writing small chunks of data,
let’s say a few bytes at a time, is not possible. We would like to emphasize here
that even though the term “page” is being used, it is totally different from a
page in virtual memory. They just happen to share the same name.
Let us now look at writes. In general, almost all such devices have a DRAM-
backed cache that accumulates/coalesces writes. A write is propagated to the
array of flash cells either periodically, when there is an eviction from the cache, or
when the device is being ejected. In all cases effecting a write is difficult mainly
because there is no way of directly writing to a flash cell that has already been
programmed. We need to first erase it or rather deprogram it. In fact, given
that we only perform page-level writes, the entire page has to be erased. Recall
that this process involves applying a very strong negative voltage to the control
gate to push the electrons in the floating gates back into the substrate.
Sadly, in practice, it is far easier to do this at the level of a group of pages,
because then we can afford to have a single strong driver circuit to push the
electrons out of all the floating gates in the group at once. Such a group of pages
is known as a block; it is the granularity at which erase operations are performed.
An erase of a block followed by the programming of its pages is referred to as a
program/erase (P/E) cycle.
Reliability Issues
Let us now discuss some reliability issues. Unfortunately, a flash device as of
today can only endure a finite number of P/E cycles per physical block. The
maximum number of P/E cycles is sadly not much – it is in the range of 50k to
150k cycles as of 2024. The thin oxide layer breaks down, and the floating
gate does not remain usable anymore. Hence, there is a need to ensure that
all the blocks wear out almost at the same rate or in other words, they endure
the same number of P/E cycles. Any flash device maintains a counter for each
block. Whenever there is a P/E cycle, this counter is incremented. The idea is
to ensure that all such counts are roughly similar across all the blocks.
Performance Considerations
Let us now take a high-level view and go over what we discussed. We introduced
a flash cell, which is a piece of nonvolatile memory that retains its values when
powered off, quite unlike conventional DRAM memory. Nonvolatile memories
essentially play the role of storage devices. They are clearly much faster than
hard disks, and they are slower than DRAM memory. However, this does not
come for free. There are concomitant performance and reliability problems that
require OS support as well as device-level features such as wear leveling and swapping blocks
to minimize read disturbance.
Modern SSD devices take care of a lot of this within the confines of the de-
vice itself. Nevertheless, operating system support is required, especially when
we have systems with large flash arrays. It is necessary to equally distribute
requests across the individual SSD memories. This requires novel data layout
and partitioning techniques. Furthermore, we wish to minimize the write am-
plification. This is the ratio of the number of physical writes to the number
of logical writes. Writing some data to flash memory may involve many P/E
cycles and block movements. All of them increase the write amplification. This
is why there is a need to minimize all such extraneous writes that are made to
the SSD drive.
Most modern SSD disk arrays incorporate many performance optimizations.
They do not immediately erase a block that has served as a temporary block and
is not required anymore. They simply mark it as invalid, and it is later erased
or rather garbage collected. This is done to increase performance. Moreover,
the OS can inform the SSD disk that a given block is not going to be used in
the future. It can then be marked as invalid and can be erased later. Depending
upon its P/E count, it can be either used to store a regular block, or it can even
act as a temporary block that is useful during a block swap operation. The
OS also plays a key role in creating snapshots of file systems stored on SSD
devices. These snapshots can be used as a backup solution. Later on, if there
is a system crash, then a valid image of the system can be recovered from the
stored snapshot.
Up till now, we have used SSD drives as storage devices (as hard disk re-
placements). However, they can be used as regular main memory as well. Of
course, they will be much slower. Nevertheless, they can be used for capacity
enhancement. There are two configurations in which SSD drives are used: ver-
tical and horizontal. The SSD drive can be used in the horizontal configuration
to just increase the size of the usable main memory. The OS needs to place
physical pages intelligently across the DRAM and SSD devices to ensure opti-
mal performance. The other configuration is the vertical configuration, where
the SSD drive is between the main memory and the hard disk. It acts like a
cache for the hard disk – a faster storage device that stores a subset of the data
stored on the hard disk. In this case also, the role of the OS is crucial.
The device driver needs to take all such characteristics of the block device into account. For example, hard
disks and SSDs are block devices. We have already seen that a typical block
size in a hard disk is 512 bytes, whereas in an SSD we read/write data at the
granularity of pages and erase data at the granularity of blocks.
On the other hand, there are character devices such as keyboards and mice.
They transfer data one or a few bytes at a time. Such devices typically don’t
have addressable locations and do not function as storage devices. Such devices
either read or produce a stream of characters. There could be the notion of
a position within a stream; however, that is not integral to the definition of
a character device. The device driver for a character device needs to be very
different from a block device driver. Data that is meant to be read or written
needs to be handled and managed very differently.
Linux follows the “everything is a file” model similar to Unix. This means
that every entity in the operating system is treated as a file regardless of whether
it is a file or not. For example, devices are treated as files. They are stored in
the /dev directory, and can be accessed as regular files. Let us elaborate. The
type of the file can be found out by running the command ls -l. An entry is of
the form -rw-r--r--. The first character is ‘-’, which means that it is a regular
file. It can indicate other types of files as well (refer to Table 7.3).
Before accessing a file, it is necessary to open it first. This lets the kernel
know that the process issuing the system call is accessing the file. Some data
structures are initialized to maintain the status of the file access. A file is
treated as an array of contiguous bytes. If the process has read 8 bytes, then its
current position is set to 8 (assume that we start counting from 0). The current
position is a part of the bookkeeping information that is maintained for each
process. Sometimes a file needs to be locked, especially when multiple processes
are concurrently accessing the same file. This information is also maintained in
the bookkeeping information maintained along with each open file. The state
associated with each open file needs to be cleaned up when the file is closed.
Given that every file is treated as a contiguous set of bytes, where the first
byte is located at position 0, it is necessary to maintain the current position (a
byte pointer). It is known as a file pointer. If we do not maintain the position,
then with every read or write system call, it will be necessary to provide the file
pointer (offset within the file). For example, if we would like to read 4 bytes
from the file pointer 100 onwards, then the value “100” needs to be provided as
an argument to the read or write system call. This is fine if we have a random file
access pattern. However, most files are not accessed in this manner. Typically,
they are accessed sequentially. Hence, it is a good idea to maintain an internal
file pointer that need not be visible to programmers. From the programmer’s
point of view, it is maintained implicitly.
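The difference between relying on the implicit file pointer and supplying an explicit offset can be seen with the read() and pread() system calls. The file name below is made up, and error handling is omitted for brevity.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/example.txt", O_RDONLY);  /* hypothetical file */
    if (fd < 0)
        return 1;

    char buf[4];

    /* Sequential access: the kernel maintains the file pointer for us.
     * Each read() continues from where the previous one stopped. */
    read(fd, buf, sizeof(buf));        /* bytes 0..3  */
    read(fd, buf, sizeof(buf));        /* bytes 4..7  */

    /* Random access: supply the offset (here 100) explicitly.
     * pread() does not change the implicit file pointer. */
    pread(fd, buf, sizeof(buf), 100);  /* bytes 100..103 */

    close(fd);
    return 0;
}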
Mapping a file's contents directly into the process's virtual address space allows the program to access the file with ordinary loads and stores, which is much faster.
This sounds like a good idea; however, it seems to be hard to implement.
How does the program know if a given file offset is present in memory (in a
mapped page), or is present in the underlying storage device? Furthermore,
the page that a file address is mapped to, might change over the course of
time. Thankfully, there are two easy solutions to this problem. Let us consider
memory-mapped I/O, where file addresses are directly mapped to virtual mem-
ory addresses. In its quintessential form, the TLB is supposed to identify that
these are actually I/O addresses, and redirect the request to the I/O (storage)
device that stores the file. This is something that we do not want in this case.
Instead, we can map the memory-mapped virtual addresses to the physical ad-
dresses of pages, which are stored in the page cache. In this case, the target
of memory-mapped I/O is another set of pages located in memory itself.
These are pages that are a part of the page cache. This optimization will not
change the programmer’s view and programs can run unmodified, albeit much
faster. Memory mapping files is of course not a very scalable solution and does
not work for large files.
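From the programmer's perspective, memory-mapped file I/O looks like the sketch below (hypothetical file name, minimal error handling). Every load from the mapped region is ultimately served from a page in the page cache.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/example.txt", O_RDONLY);   /* hypothetical file */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    /* Map the whole file into the virtual address space. Subsequent loads
     * from 'data' are served from pages in the page cache; no read() calls
     * are needed. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED)
        return 1;

    printf("first byte of the file: %c\n", data[0]);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}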
The other solution is when we use I/O-mapped I/O. This means that I/O is
performed on files using read and write system calls, and the respective requests
go to the I/O system. They are ultimately routed to the storage device. Of
course, the programmer does not work with the low-level details. She simply
invokes library calls and specifies the number of bytes that need to be read or
written, including their contents (in case of writes). The library calls curate
the inputs and make the appropriate system calls. After transferring data to
the right kernel buffers, the system call handling code invokes the device driver
routines that finally issue the I/O instructions. This is a long and slow process.
Modern kernels optimize this process. They hold off on issuing I/O instructions
and instead effect the reads and writes on pages in the page cache. This is
a much faster process and happens without the knowledge of the executing
program.
Let us look now at the data structures used to manage devices in Linux in
detail.
[Figure: devices are identified by a ⟨major, minor⟩ number pair; the major number maps a device to its device driver.]
Every device is assigned a major number and a minor number. Devices that are
managed by the same driver have the same major number, and thus share the same driver. However, individual
devices are assigned different minor numbers.
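The pair can be inspected and packed from user space as shown below; /dev/null (traditionally ⟨1, 3⟩ on Linux) and the pair ⟨8, 1⟩ are used purely as examples.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor(), makedev() */

int main(void)
{
    struct stat st;

    /* Character and block device files expose their <major, minor> pair
     * through the st_rdev field. */
    if (stat("/dev/null", &st) == 0)
        printf("/dev/null -> major %u, minor %u\n",
               major(st.st_rdev), minor(st.st_rdev));

    /* The pair can also be packed into a single dev_t value. */
    dev_t d = makedev(8, 1);
    printf("makedev(8, 1) -> major %u, minor %u\n", major(d), minor(d));
    return 0;
}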
For a long time, the adoption of Linux was somewhat subdued primarily because of
the limited device support. Over the years, the situation has changed, which is
why we can see a disproportionate fraction of driver code in the overall code
base. As Linux gets more popular, we will see more driver code entering the
codebase. Note that the set of included drivers is not exhaustive. There are
still a lot of devices whose drivers are not bundled with the operating system
distribution. The drivers have to be downloaded separately. Many times there
are licensing issues, and there is also a need to reduce the overall size of the OS
install package.
Let us ask an important question at this stage. Given that the Linux kernel is
a large monolithic piece of code, should we include the code of all the drivers also
in the kernel image? There are many reasons why this should not be done.
The first reason is that the image size will become very large. It may exhaust
the available memory space and little memory will be left for applications. The
second is that very few drivers may actually be used because it is not the case
that a single system will be connected to 200 different types of printers, even
though the code of the drivers of these printers needs to be bundled along with
the OS code. The reason for this is that when someone is connecting a printer,
the expectation is that things will immediately work and all drivers will get
auto-loaded.
In general, if it is a common device, there should be no need to go to the web
and download the corresponding driver. This would be a very inefficient process.
Hence, it is a good idea to bundle the driver along with the OS code. However,
bundling the code does not imply that the compiled version of it should be
present in the kernel image all the time. Very few devices are connected to a
machine at runtime. Only the images of the corresponding drivers should be
present in memory.
Modules
Recall that we had a very similar discussion in the context of libraries in Ap-
pendix B. We had argued that there is no necessity to include all the library
code in a process’s image. This is because very few library functions are used
in a single execution. Hence, we preferred dynamic loading of libraries and cre-
ated shared objects. It turns out that something very similar can be done here.
Instead of statically linking all the drivers, the recommended method is to cre-
ate a kernel module, which is nothing but a dynamically linked library/shared
object in the context of the kernel. All the device drivers should preferably be
loaded as modules. At run time they can be loaded on demand. This basically
means that the moment a device is connected, we find the driver code corre-
sponding to it. It is loaded to memory on-demand the same way that we load
a DLL. The advantages are obvious: efficiency and reduced memory footprint.
To load a module, the kernel provides the insmod utility that can be invoked
by the superuser – one who has administrative access. The kernel can also au-
tomatically do this action, especially when a new device is connected. There is
a dedicated utility called modprobe that is tasked with managing and loading
modules (including their dependencies).
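At its simplest, a module is just an init/exit function pair compiled against the kernel headers; a minimal (hypothetical) skeleton is shown below.

#include <linux/init.h>
#include <linux/module.h>

static int __init hello_init(void)
{
    pr_info("hello: module loaded\n");
    return 0;                     /* 0 => successful initialization */
}

static void __exit hello_exit(void)
{
    pr_info("hello: module unloaded\n");
}

module_init(hello_init);
module_exit(hello_exit);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal example module");

Assuming it is compiled to hello.ko, it can be loaded with insmod hello.ko (or with modprobe hello if it is installed in the module search path) and removed with rmmod hello.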
The role of a module-loading utility is specifically as follows:
1. Locate the compiled code and data of the module, and map its pages to
the kernel's address space.
2. Concomitantly, increase the kernel's runtime code size and memory footprint.
4. Add the symbols exported by the module to the global symbol table.
[Figure: the kernel symbol table maps the names of exported functions and global variables to their addresses; each loaded module contributes its own symbol table (Module 1 symbol table, Module 2 symbol table).]
[Figure: the key data structures for managing devices – a generic device (struct device) linked to a bus_type and a device driver, and a block device linked to a request_queue.]
At the core of the system are two types of objects, namely generic devices (struct device)
and block devices.
A device is a generic construct that can represent both character and block
devices. It points to a device driver and a bus. A bus is an abstraction for a
shared hardware interconnect that connects many kinds of devices. Examples of
such buses are USB buses and PCI Express (PCIe) buses. A bus has associated
data structures and in some cases even device drivers. Many times it is necessary
to query all the devices connected to the bus and find the device that needs to
be serviced. Hence, the generic device has two specific bindings: one with device
drivers and one with buses.
Next, let us come to the structure of a block device. It points to many other
important subsystems and data structures. First, it points to a file system
(discussed in detail in Section 7.6) – a mechanism to manage the full set of files
on a storage device. This includes reading and writing the files, managing the
metadata and listing them. Given that every block device stores blocks, it is
conceptually similar to a hard disk. We associate it with a gendisk structure,
which represents a generalized disk. We need to appreciate the historical sig-
nificance of this choice of the word “gendisk”. In the good old days, hard disks
were pretty much the only block devices around. However, later on many other
kinds of block devices such as SSD drives, scanners and printers came along.
Nevertheless, the gendisk structure still remained. It is a convenient way of
abstracting all of such block devices. Both a block device and the gendisk are
associated with a request queue. It is an array of requests that is associated
with a dedicated I/O scheduler. A struct request is a linked list of memory
regions that need to be accessed while servicing an I/O request.
struct device_driver {
    /* generic parameters */
    const char *name;            /* name of the driver */
    struct bus_type *bus;        /* bus on which its devices sit */
    struct module *owner;        /* module that implements the driver */
    /* ... other fields elided ... */

    /* function pointers */
    int (*probe)(struct device *dev);
    void (*sync_state)(struct device *dev);
    int (*remove)(struct device *dev);
    void (*shutdown)(struct device *dev);
    int (*suspend)(struct device *dev, pm_message_t state);
    int (*resume)(struct device *dev);
};
The generic structure of a device driver is shown in Listing 7.2. struct device_driver
stores the name of the driver, the type of bus that its devices are connected
to, and the module that implements the device driver. The module is referred to as the
owner; it is its job to run the code of the device driver.
The key task of the kernel is to match a device with its corresponding
driver. Every device driver maintains an identifier of type of_device_id. It
encompasses a name, a type and other compatibility information. This can be
matched with the name and the type of the device.
Next, we have a bunch of function pointers, which are the callback func-
tions. They are called by other subsystems of the kernel when there is a change in
the state. For example, when a device is inserted, the probe function is called.
When there is a need to synchronize the state of the device's configuration be-
tween the in-memory buffers and the device, the sync state function is called.
The remove, shutdown, suspend and resume calls retain their usual meanings.
The core philosophy here is that these functions are common to all kinds
of devices. It is the job of every device driver to provide implementations for
these functions. Creating such a structure with function pointers is a standard
design technique – it is similar to virtual functions in C++ and abstract functions
in Java.
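To make this pattern concrete, the following is a minimal user-space sketch of the function-pointer technique. It is purely illustrative: the structure and function names are made up and do not correspond to real kernel code.

#include <stdio.h>

/* the "base class": a table of function pointers (made-up names) */
struct toy_driver_ops {
    int  (*probe)(const char *name);
    void (*shutdown)(void);
};

static int my_probe(const char *name)
{
    printf("probing device %s\n", name);
    return 0;
}

static void my_shutdown(void)
{
    printf("shutting down\n");
}

/* the "derived class": a concrete driver fills in the pointers */
static struct toy_driver_ops my_driver = {
    .probe    = my_probe,
    .shutdown = my_shutdown,
};

int main(void)
{
    /* generic code invokes the driver only via the pointers (like a virtual call) */
    my_driver.probe("toy0");
    my_driver.shutdown();
    return 0;
}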
A Generic Device
The code of struct device is shown in Listing 7.3. Every device contains
a ⟨major, minor⟩ number pair (devt) and an unsigned 32-bit id.
Devices are arranged as a tree. Every device thus has a parent. It addi-
tionally has a pointer to the bus and its associated device driver. Note that a
device driver does not point to a device because it can be associated with many
devices. Hence, the device is given as an argument to the functions defined
in the device driver. However, every device needs to maintain a pointer to its
associated device driver because it is associated with only a single one.
Every block device has a physical location. There is a generic way of describ-
ing a physical location at which the block device is connected. It is specified
using struct device physical location. Note the challenges in designing
such a data structure. Linux is designed for all kinds of devices: wearables,
mobile phones, laptops, desktops and large servers. There needs to be a device-
independent way for specifying where a device is connected. The kernel defines
a location panel (id of the surface on the housing), which can take generic values
such as top, left, bottom, etc. A panel represents a generic region of a device.
On each panel, the horizontal and vertical positions are specified. These are
coarse-grained positions: (top, center, bottom) and (left, center, right). We
additionally store two bits. One bit indicates whether the device is connected
to a docking station and the second bit indicates whether the device is located
on the lid of the laptop.
Block devices often read and write large blocks of data in one go. Port-
mapped I/O and memory-mapped I/O often turn out to be quite slow and
unwieldy in such cases. DMA-based I/O is much faster in this case. Hence,
every block I/O device is associated with a DMA region. Further, it points to
a linked list of DMA pools. Each DMA pool points to a set of buffers that can
be used for DMA transfers. These are buffers in kernel memory and managed
by a slab cache (refer to Section 6.4.2).
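To give a flavor of how such pools are used, here is a rough kernel-style sketch. It is only illustrative (error handling is omitted), and it assumes that dev is the driver's struct device obtained earlier; dma_pool_create, dma_pool_alloc, dma_pool_free and dma_pool_destroy are the kernel's DMA pool helpers.

#include <linux/device.h>
#include <linux/dmapool.h>
#include <linux/gfp.h>

static void dma_pool_sketch(struct device *dev)
{
    struct dma_pool *pool;
    dma_addr_t handle;
    void *vaddr;

    /* a pool of 512-byte buffers, 64-byte aligned, no boundary restriction */
    pool = dma_pool_create("example_pool", dev, 512, 64, 0);

    /* vaddr is the kernel virtual address; handle is the DMA (bus) address
       that is programmed into the device */
    vaddr = dma_pool_alloc(pool, GFP_KERNEL, &handle);

    /* ... program the device with 'handle' and perform the transfer ... */

    dma_pool_free(pool, vaddr, handle);
    dma_pool_destroy(pool);
}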
The code of a block device is shown in Listing 7.4. It is like a derived class
where the base class is a device. Given that C does not allow inheritance, the
next best option is to add a pointer to the base class (device in this case) in the
definition of struct block device. Along with this pointer, we store the
⟨major, minor⟩ numbers in the device number (devt) field.
Every block device is divided into a set of sectors. However, it can be divided
into several smaller devices that are virtual. Consider a hard disk, which is
a block device. It can be divided into multiple partitions. For example, in
Windows they can be C:, D:, E:, etc. In Linux, popular partitions are swap,
/boot and the root directory '/'. Each such partition is a virtual disk. It
represents a contiguous range of sectors. Hence, we store the starting sector
number and the number of sectors.
For historical reasons, a block device is always associated with a generic
disk. This is because the most popular block devices in the early days were
hard disks. This decision has persisted even though there are many more types
of block devices these days such as SSD drives, NVM memories, USB storage
devices, SD cards and optical drives. Nevertheless, a block device structure has
a pointer to a struct gendisk.
struct gendisk {
    ...
    /* table of partitions */
    struct block_device * part0 ;
    struct xarray part_tbl ; /* partition table */
    ...
};
Each entry of the partition table points to a partition's own block device (struct
block device). Recall that we had associated a block device structure with
each partition.
The next important data structure is a pointer to a structure called block
device operations. It contains a set of function pointers that implement
specific functionalities. There are standard functions to open a device, release
it (close it), submit an I/O request, check its status, check pending events, set
the disk as read-only and free the memory associated with the disk.
Let us now discuss the request queue that is a part of the gendisk structure.
It contains all the requests that need to be serviced.
The code of struct request queue is shown in Listing 7.6. It stores a small
amount of current state information – the last request that has been serviced
(last merge).
The key structure is a queue of requests – an elevator queue. Let us explain
the significance of the elevator here. We need to understand how I/O requests
are scheduled in the context of storage devices, notably hard disks. We wish
to minimize the seek time (refer to Section 7.2.1). The model is that at any
point of time, a storage device will have multiple pending requests. They need
to be scheduled in such a way that the disk head movement per request is minimized.
One efficient algorithm is to schedule I/O requests the same way an elevator
schedules its stops. We will discuss more about this algorithm in the section on
I/O scheduling algorithms.
The two other important data structures that we need to store are I/O
request queues. The first class of queues are per-CPU software request queues.
They store pending requests for the I/O device. It is important to note that
these are waiting requests that have not been scheduled to execute on the storage
device yet. Once they are scheduled, they are sent to a per-device request queue
that sends requests directly to the underlying hardware. These per-CPU queues
are lockless queues, which are optimized for speed and efficiency. Given that
multiple CPUs are not accessing them at the same time, there is no need for
locks or other synchronization mechanisms.
[Figure: the request_queue with its per-CPU software (SW) queues and hardware (HW) dispatch queues. Each struct request points to a struct bio, which stores the physical address ranges and the function that is called to complete the request.]
Each request stores the total length of the data (data length), the starting sector number, the deadline and
a timeout value (if any). The fact that block I/O requests can have a deadline
associated with them is important. This means that they can be treated as soft
real-time tasks.
Listing 7.7: struct request
source : include/linux/blk − mq.h
struct request {
/* Back pointers */
struct request_queue * q ;
struct blk_mq_ctx * mq_ctx ;
struct blk_mq_hw_ctx * mq_hctx ;
struct block_device * part ;
/* Parameters */
unsigned int __data_len ;
sector_t sector ;
unsigned int deadline , timeout ;
};
There are two more fields of interest. The first is a function pointer (end io)
that is invoked to complete the request. This is device-specific and is imple-
mented by its driver code. The other is a generic data structure that has more
details about the I/O request (struct bio).
Its structure is shown in Listing 7.8. It has a pointer to the block device and
an array of memory address ranges (struct bio vec). Each entry is a 3-tuple:
physical page number, length of the data and starting offset. It points to a
memory region that either needs to be read or written to. A bio vec structure
is a list of many such entries. We can think of it as a sequence of memory
regions, where each single chunk is contiguous. The entire region represented
by bio vec however may not be contiguous. Moreover, it is possible to merge
multiple bio structs or bio vec vectors to create a larger I/O request. This
is often required because many storage devices such as SSDs, disks and NVMs
prefer long sequential accesses.
The disk head starts with the request that is the closest to the center (innermost)
and stops at the request that is the farthest from the center (outermost). This
minimizes back and forth movement of the disk head, and also ensures fairness.
After reaching the outermost request, the disk head then moves towards the
innermost track, servicing requests on the way. An elevator processes requests
in the same manner.
[Figure: Request A is an old request near the outermost track; a new request arrives in its vicinity while Request B is still waiting elsewhere on the disk.]
Figure 7.24: Example of the elevator algorithm, where fairness is being com-
promised. Fairness would require Request B to be scheduled before Request
A because it arrived earlier. If we start servicing requests on the reverse path
(outer to inner) then all the requests in the vicinity (nearby tracks) of A will
get serviced first. Note that requests in the vicinity of A got two back-to-back
chances: one when the head was moving towards the outermost track and one
when it reversed its direction.
There are many variants of this basic algorithm. We can quickly observe
that fairness is being slightly compromised here. Assume that the disk head is on
the track corresponding to the outermost request. At that point of time, a
new request arrives. It is possible for the disk head to immediately reverse its
direction and process the new request that has just arrived. This will happen
if, after the earlier request has been processed, the new request is deemed to be
the outermost pending request. This situation is shown in Figure 7.24.
It is possible to make the algorithm fairer by directly moving to the innermost
request after servicing the outermost request. In the reverse direction (outer
to inner), no requests are serviced. It is a direct movement of the head, which
is a relatively fast operation. These classes of algorithms are very simple and
are not used in modern operating systems.
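To illustrate the order of service, here is a small user-space sketch of the basic elevator (SCAN) idea. Pending requests are represented only by their track numbers, and the head is assumed to be sweeping outwards first; this is a toy model, not a real kernel scheduler.

#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
    return *(const int *) a - *(const int *) b;
}

/* service all pending requests elevator-style: sweep outwards from the
   current head position, then sweep back inwards */
static void elevator(int head, int tracks[], int n)
{
    qsort(tracks, n, sizeof(int), cmp);
    for (int i = 0; i < n; i++)              /* outward sweep */
        if (tracks[i] >= head)
            printf("service track %d\n", tracks[i]);
    for (int i = n - 1; i >= 0; i--)         /* reverse (inward) sweep */
        if (tracks[i] < head)
            printf("service track %d\n", tracks[i]);
}

int main(void)
{
    int tracks[] = { 95, 180, 34, 119, 11, 123, 62, 64 };
    elevator(50, tracks, 8);
    return 0;
}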
Linux uses three I/O scheduling algorithms: Deadline, BFQ and Kyber.
Deadline Scheduler
The Deadline scheduler stores requests in two queues. The first queue (sorted
queue) stores requests in the order of their block address, which is roughly the
same as the order of sectors. The reason for such a storage structure is to
minimize the seek time: requests are normally dispatched in sector order so that
the disk head moves as little as possible. The second queue is a FIFO queue that
orders requests by their arrival time, and every request is assigned a deadline.
If the request at the head of the FIFO queue has waited past its deadline, it is
dispatched next, irrespective of its position on the disk. This bounds the waiting
time of every request.
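The dispatch decision can be sketched as follows. This is conceptual code: peek_fifo_head and peek_next_in_sector_order are hypothetical helpers that stand in for the two queues, and the only point being made is the decision rule (sector order by default, deadline order when a request has waited too long).

#include <stddef.h>
#include <time.h>

struct dreq {
    unsigned long sector;     /* position on the device */
    time_t deadline;          /* arrival time + allowed waiting time */
};

/* hypothetical queue helpers (not a real kernel API) */
struct dreq *peek_fifo_head(void);              /* oldest pending request */
struct dreq *peek_next_in_sector_order(void);   /* next request in sector order */

struct dreq *dispatch_next(time_t now)
{
    struct dreq *oldest = peek_fifo_head();

    /* normally serve in sector order to minimize seek time, but if the
       oldest request has overshot its deadline, serve it first */
    if (oldest != NULL && oldest->deadline <= now)
        return oldest;
    return peek_next_in_sector_order();
}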
BFQ Scheduler
The BFQ (Budget Fair Queuing) scheduler is similar to the CFS scheduler for
processes (see Section 5.4.6). The same way that CFS apportions the processing
time between jobs, BFQ creates sector slices and gives every process the freedom
to access a certain number of sectors in the sector slice. The main focus here is
fairness across processes. Latency and throughput are secondary considerations.
Kyber Scheduler
This scheduler was introduced by Facebook (Meta). It is a simple scheduler
that creates two buckets: high-priority reads and low-priority writes. Each
type of request has a target latency. Kyber dynamically adjusts the number of
allowed in-flight requests such that all operations complete within their latency
thresholds.
General Principles
In general, I/O schedulers and libraries perform a combination of three opera-
tions: delay, merge and reorder. Sometimes it is desirable to delay requests a bit
such that a set of sufficient size can be created. It is easy to apply optimizations
on such a set with a sizeable number of requests. One of the common optimiza-
tions is to merge requests. For example, reads and writes to the same block
can be easily merged, and redundant requests can be eliminated. Accesses to
adjacent and contiguous memory regions can be combined. This will minimize
the seek time and rotational delay.
Furthermore, requests can be reordered. We have already seen examples of
reordering reads and writes. This is done to service reads quickly because they
are often on the critical path. We can also distinguish between synchronous
writes (we wait for it to complete) and asynchronous writes. The former should
have a higher priority because there is a process that is waiting for it to complete.
Other reasons for reordering could factor in the current position of the disk head
and deadlines.
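As a tiny illustration of the merge operation, the following user-space sketch combines two requests whose sector ranges are back-to-back (a "back merge"). Real I/O schedulers perform the same kind of check, along with front merges and the elimination of fully overlapping requests.

#include <stdio.h>

/* a pending request: the sector range [start, start + nsectors) */
struct io_req {
    unsigned long start;
    unsigned long nsectors;
};

/* merge request b into request a if b begins exactly where a ends;
   returns 1 on success */
static int try_back_merge(struct io_req *a, const struct io_req *b)
{
    if (a->start + a->nsectors == b->start) {
        a->nsectors += b->nsectors;
        return 1;
    }
    return 0;
}

int main(void)
{
    struct io_req a = { 1000, 8 }, b = { 1008, 16 };
    if (try_back_merge(&a, &b))
        printf("merged: start %lu, %lu sectors\n", a.start, a.nsectors);
    return 0;
}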
In this case, the structure that represents the driver is mspro block driver.
It contains pointers to the probe, initialize, remove, suspend and resume
functions.
The initialization function mspro block probe initializes the memory card.
It sends it instructions to initialize its state and prepare itself for subsequent
read/write operations. Next, it creates an entry in the sysfs file system, which
is a special file system that exposes attributes of kernel objects such as devices
to users. Files in the sysfs file system can be accessed by user-level applica-
tions to find the status of devices. In some cases, superusers can also write
to these files, which allows them to control the behavior of the corresponding
devices. Subsequently, the initialization function initializes the various block
device-related structures: gendisk, block device, request queue, etc.
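To give a flavor of how a driver exposes such an attribute, here is a rough kernel-style sketch. The attribute name card_status and the variable behind it are made up; DEVICE_ATTR_RO, device_create_file and sysfs_emit are standard kernel helpers, but a real driver would also remove the file and handle errors.

#include <linux/device.h>
#include <linux/sysfs.h>

static int card_status;    /* made-up piece of device state */

/* called when a user reads the sysfs file */
static ssize_t card_status_show(struct device *dev,
                                struct device_attribute *attr, char *buf)
{
    return sysfs_emit(buf, "%d\n", card_status);
}
static DEVICE_ATTR_RO(card_status);

/* typically called from the probe function; creates /sys/.../card_status */
static int expose_card_status(struct device *dev)
{
    return device_create_file(dev, &dev_attr_card_status);
}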
Typically, a device driver is written for a family of devices. For a specific
device in the family, either a generic (core) function can be used or a specific
function can be implemented. Let us consider one such specific function for the
Realtek USB memory card. It uses the basic code in the memstick directory but
defines its own function for reading/writing data. Let us explain the operation
of the function rtsx usb ms handle req.
It maintains a queue of outstanding requests. It uses the basic memstick code
to fetch the next request. There are three types of requests: read, write and bulk
transfer. For reading and writing, the code creates generic USB commands and
passes them on to a low-level USB driver. Its job is to send the raw commands
to the device. For a bulk transfer, the driver sets up a one-way pipe with
the low-level driver, which ensures that the data is directly transferred to a
memory buffer, which the USB device can access. The low-level commands can
be written to the USB command registers by the USB driver.
Point 7.4.1
The argument is a struct urb, which is a generic data structure that holds
information pertaining to USB requests and responses. It holds details about
the USB endpoint (device address), status of the transfer, pointer to a memory
buffer that holds the data and the type of the transfer. The details of the keys
that were pressed are present in this memory buffer. Specifically, the following
pieces of information are present for keyboards.
3. The character corresponding to the key that was pressed (internal code or
ASCII code).
4. Report whether Num Lock or Caps Lock have been processed.
Point 7.5.1
Unlike block devices that have direct connections to the motherboard’s
buses, character devices are connected to the motherboard via a network
of ports and controllers. The latency and throughput constraints are
more relaxed for such devices. The main challenge in managing such
devices is to effectively ferry the data across various buses, controllers,
ports, protocols and device drivers. A long chain of callback functions
needs to be created and data needs to be efficiently passed between the
device and its driver. Given that such devices are often hot-pluggable,
the driver and OS utilities need to be quite responsive.
Linux-like operating systems define the concept of an inode (see Figure 7.26). It
stores the metadata associated with a file, such as its ownership information,
size and permissions, and also has a pointer to the block mapping table. (The
file's name is not stored in the inode; it is stored in the directory entry that
points to the inode.) If the
user wishes to read the 1046th byte of a file, all that she needs to do is compute
the block number and pass the file’s inode to a generic function. The outcome
is the address on the storage device.
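For instance, assuming a hypothetical block size of 4096 bytes, computing the block number and the offset within the block requires just a division and a remainder:

#include <stdio.h>

int main(void)
{
    unsigned long byte_offset = 1046, block_size = 4096;   /* assumed block size */
    unsigned long block_no = byte_offset / block_size;     /* logical block number */
    unsigned long in_block = byte_offset % block_size;     /* offset within the block */
    printf("block %lu, offset %lu\n", block_no, in_block); /* prints: block 0, offset 1046 */
    return 0;
}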
Now a directory is also in a certain sense a regular file that stores data. It is
thus also represented by an inode. Since an inode acts like a metadata storage
unit in the world of files, it does not care what it is actually representing. It can
represent either regular files or directories or even devices. While representing
data (files and directories), it simply stores pointers to all the constituent blocks
without caring about their semantics. A block is treated as just a collection of
bytes. A directory’s structure is also simple. It stores a table that is indexed
by the name of the file/directory. The columns store the metadata information.
One of the most important fields is a pointer to the inode of the entry. This is
the elegance of the design. The inode is a generic structure that can point to
any type of file including network sockets, devices and inter-process pipes – it
does not matter.
Let us explain with an example. In Linux, the default file system’s base direc-
tory is /. Every file has a path. Consider the path /home/srsarangi/ab.txt.
Assume that an editor wants to open the file. It needs to access its data blocks
and thus needs a pointer to its inode. The open system call locates the inode
and provides a handle to it that can be used by user programs. Assume that
the location of the inode of the / (or root) directory is known. It is inode #2
in the ext4 file system. The kernel code reads the contents of the / directory
and locates the inode of the home subdirectory in the table of file names. This
process continues recursively until the inode of ab.txt is located. Once it is
identified, there is a need to remember this information. The inode is wrapped
in a file handle, which is returned to the process. For subsequent accesses such
as reading and writing to the file, all that the kernel needs is the file handle.
It can easily extract the inode and process the request. There is no need to
recursively traverse the tree of directories.
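The lookup loop can be sketched in C as follows. This is conceptual code: root_inode and lookup_in_dir are hypothetical helpers that stand for "fetch the inode of /" and "search a directory's table of ⟨name, inode⟩ entries", respectively.

#include <string.h>
#include <stddef.h>

struct inode;                                   /* opaque */
struct inode *root_inode(void);                 /* hypothetical: inode of '/' (inode #2 in ext4) */
struct inode *lookup_in_dir(struct inode *dir, const char *name);  /* hypothetical */

/* resolve an absolute path such as /home/srsarangi/ab.txt to an inode */
struct inode *path_to_inode(const char *path)
{
    char buf[256];
    struct inode *cur = root_inode();
    strncpy(buf, path, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (char *comp = strtok(buf, "/"); comp != NULL; comp = strtok(NULL, "/")) {
        cur = lookup_in_dir(cur, comp);         /* search the directory's table */
        if (cur == NULL)
            return NULL;                        /* component does not exist */
    }
    return cur;                                 /* inode of the final component */
}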
Let us now look at the file system in its entirety. It clearly needs to store all
the constituent inodes. Let us look at the rest of its components.
Recall that in hard disks and similar block devices, a single physical device
can be partitioned into multiple logical devices or logical disks. This is done
for effective management of the storage space, and also for security purposes –
we may want to keep all the operating system related files in one partition and
store all the user data in another partition. Some of these partitions may be
bootable. Bootable partitions typically store information related to booting the
kernel in the first sector (Sector 0), which is known as the boot block. The BIOS
can then load the kernel.
Most partitions just store a regular file system and are not bootable. For
example, D: and E: are partitions on Windows systems (refer to Figure 7.25).
On Linux, /usr and /home may be mounted on different partitions. In general, a
partition has only one file system. However, there are exceptions. For example,
swap on Linux (swap space) does not have a file system mounted on it. There
are file systems that span multiple partitions, and there are systems where
multiple file systems are mounted on the same partition. However, these are
very specialized systems. The metadata of most file systems is stored in Block
1, regardless of whether they are bootable or not. This block is known as the
superblock. It contains the following pieces of information: file system type
and size, attributes such as the block size or maximum file length, number of
inodes and blocks, timestamps and additional data. Some other important data
structures include inode tables, and bitmaps of free inodes and disk blocks. For
implementing such data structures, we can use bitmaps that can be accelerated
with van Emde Boas trees.
A new file system is mounted at a directory of an existing file system (the mount point), and it treats this directory as a root directory. The key question is how do we access the files stored
in the new file system? Any file or directory has a path that is of the form
/dir1/dir2/.../filename in Linux. In Windows, / is replaced with \. The
bottom line is that all files need to be accessible via a string of this form, which is
known as the path of the file. It is an absolute path because it starts from the
root directory. A path can also be relative, where the location is specified with
respect to the current directory. Here the parent directory is specified with the
special symbol “..”. The first thing that the library functions do is convert all
relative paths to absolute paths. Hence, the key question still remains. How do
we specify paths across file systems?
[Figure: a file system mounted at the mount point /home/srsarangi/doc. The file /books/osbook.pdf within the mounted file system is accessed using the combined path /home/srsarangi/doc/books/osbook.pdf.]
Once a path crosses a mount point, the kernel hands over the remainder of the path to the mounted file system. It will be tasked with retrieving the file /videos/foo.mpg
relative to its root. Things can get more interesting. A mounted file system
can mount another file system, so on and so forth. The algorithm for traversing
the file system remains the same. Recursive traversal involves first identifying
the file system from the file path, and then invoking its functions to locate the
appropriate inode.
Finally, the umount command can be used to unmount a file system. Its
files will not be accessible anymore.
[Figure: symbolic links versus hard links. A symbolic link is a separate file whose contents store the path of the target file, whereas a hard link is a directory entry that points to the same inode as the target file.]
A separate file is created with its own inode (file type ‘l’). Its contents
contain the path of the target. Hence, resolving such a link is straightforward.
The kernel reads the path contained in the symbolic link file, and then uses this
path to locate the target file and its inode.
Hard Links
Symbolic links have their limitations: for instance, they become dangling (broken) if the target file is deleted or moved. Hence, hard links were introduced. The same ln command can be used to create
hard links as follows (refer to Figure 7.28).
Listing 7.11: Creating a hard link
ln path_to_target path_to_link
A hard link is a directory entry that points to the same inode as the target
file. In this case, both the hard link and the directory file point to the same
inode. If one is modified, then the changes are reflected in the other. However,
deleting the target file does not lead to the deletion of the hard link. Hence,
the hard link still remains valid. In effect, a reference count is maintained
with each inode. The inode is deleted only when all the directory entries (the
original file name and all the hard links) that point to it are deleted. Another
interesting property is that if the target file
is moved (within the same file system) or renamed, the hard link still remains
valid. However, if the target file is deleted and a new file with the same name
is created in the same directory, the inode changes and the hard link does not
remain valid.
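The following user-space snippet illustrates this behavior, assuming that a file a.txt already exists in the current directory. The link system call is what the ln command uses internally, and st_nlink in struct stat is the inode's link (reference) count.

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct stat info;

    link("a.txt", "a_hard.txt");      /* same effect as: ln a.txt a_hard.txt */
    stat("a.txt", &info);
    printf("link count = %lu\n", (unsigned long) info.st_nlink);   /* now 2 */

    unlink("a.txt");                  /* delete the original name */
    stat("a_hard.txt", &info);        /* the data is still reachable via the link */
    printf("link count = %lu\n", (unsigned long) info.st_nlink);   /* back to 1 */
    return 0;
}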
There are nonetheless some limitations with hard links. They cannot be
used across file systems, and normally cannot link to directories. The latter could
create cycles in the directory tree because a child could then link to an ancestor directory.
Figure 7.29: File systems supported by the Linux virtual file system (VFS).
The figure shows a conceptual view of the virtual file system, where a single file system interface
unifies many different types of file systems. We wish to use a single interface to
access and work with all the files in the VFS regardless of how they are stored
or which underlying file system they belong to. Finally, given Linux’s histori-
cal ties to Unix, we would like the VFS’s interface to be similar to that of the
classical Unix file system (UFS). Given our observations, let us list down the
requirements of a virtual file system in Point 7.6.1.
Point 7.6.1
File systems that do not natively use inodes can create equivalent in-memory pseudostructures when the file system is mounted. Then they can be added to a software cache.
Any subsequent access will find the pseudostructure in the cache.
Next, we store the size of the file in terms of the number of blocks (i blocks)
and the exact size of the file in bytes (i size).
We shall study in Chapter 8 that permissions are very important in Linux
from a security perspective. Hence, it is important to store ownership and
permission information. The field i mode stores the type of the file. Linux
supports several file types namely a regular file, directory, character device,
block device, FIFO pipe, symbolic link and socket. Recall that everything-is-a-
file assumption. The file system treats all such diverse entities as files. Hence,
it becomes necessary to store their type as well. The field i uid shows the
id of the user who is the owner of the file. In Linux, every user belongs to
one or more groups. Furthermore, resources such as files are associated with
a group. This is indicated by the field i gid (group id). Group members get
some additional access rights as compared to users who are not a part of the
group. Some additional fields include access times, modification times and file
locking-related state.
The next field i op is crucial to implementing VFS. It is a pointer to an inode
operations structure that contains a list of function pointers. These function
pointers point to generic file operations such as open, close, read, write, flush
(move to kernel buffers), sync (move to disk), seek and mmap (memory map).
Note that each file system has its own custom implementations of such functions.
The function pointers point to the relevant function (defined in the codebase of
the underlying file system).
Given that the inode in VFS is meant to be a generic structure, we cannot
store more fields. Many of them may not be relevant to all file systems. For
example, we cannot store a mapping table because inodes may correspond to
devices or sockets that do not store blocks on storage devices. Hence, it is a
good idea to have a pointer to data that is used by the underlying file system.
The pointer i private is useful for this purpose. It is of type void *, which
means that it can point to any kind of data structure. Often file systems set it
to custom data structures. Many times they define other kinds of encapsulating
data structures that have a pointer to the VFS inode and file system-specific
custom data structures. i private can also point to a device that corresponds
to the file. It is truly generic in character.
Point 7.6.2
An inode is conceptually a two-part structure. The first part is a VFS
inode (shown in Listing 7.12), which stores generic information about
a file. The second part is a file system-specific inode that may store a
mapping structure, especially in the case of regular files and directories.
Directory Entry
Point 7.6.3
VFS maintains a cache of inodes and dentry structures. For frequently
visited directories, there is no need to make a call to the underlying file
system and traverse its directory tree. The dentry corresponding to the
directory can directly be retrieved from the cache (if it is present).
Address Space
Let us now discuss the page cache. This is a very useful data structure
especially for file-backed pages. I/O operations are slow; hence, it should not be
necessary to access the I/O devices all the time. It is a far wiser idea to maintain
an in-memory page cache that can service reads and writes quickly. Unfortunately,
this creates a consistency problem. If the system is powered off, then there is a
risk of updates getting lost. Thankfully, in modern systems, this behavior can
be controlled and regulated to a large extent. It is possible to specify policies.
For example, we can specify that when a file is closed, all of its cached data
needs to be written back immediately. The close operation will be deemed
to be successful only after an acknowledgement is received indicating that all
the modified data has been successfully written back. There are other methods
as well. Linux supports explicit sync (synchronization) calls, kernel daemons
that periodically sync data to the underlying disk, and write-back operations
triggered when the memory pressure increases.
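For example, an application that cannot afford to lose an update can force it out of the page cache with an explicit fsync call, as in the following user-space sketch (the file name is arbitrary):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "important data\n";
    int fd = open("log.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;
    write(fd, msg, strlen(msg));   /* the data first lands in the page cache */
    fsync(fd);                     /* force the cached data (and metadata) to the device */
    close(fd);
    return 0;
}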
struct address space is an important part of the page cache (refer to List-
ing 7.14). It stores a mapping (i pages) from an inode to its cached memory
pages (stored as a radix tree). The second map is a mapping i mmap from the
inode to a list of vma s (stored as a red-black tree). The need to maintain all the
virtual memory regions (vma s) that have cached pages arises from the fact that
there is a need to quickly check if a given virtual address is cached or not. It
additionally contains a list of pointers to functions that implement regular oper-
ations such as reading or writing pages (struct address space operations).
Finally, each address space stores some private data, which is used by the
functions that work on it.
Point 7.6.4
This is a standard pattern that we have been observing for a while now.
Whenever we want to define a high-level base class in C, there is a need
to create an auxiliary structure with function pointers. These pointers
are assigned to real functions by (conceptually) derived classes. In an
object-oriented language, there would have been no reason to do so. We
could have simply defined a virtual base class and then derived classes
could have overridden its functions. However, in the case of the kernel,
which is written in C, the same functionality needs to be created using
a dedicated structure that stores function pointers. The pointers are
assigned to different sets of functions based on the derived class. In this
case, the derived class is the actual file system. Ext4 will assign them to
functions that are specific to it, and other file systems such as exFat or
ReiserFS will do the same.
The role of struct vma s needs to be further clarified. A file can be mapped
to the address spaces of multiple processes. For each process, we will have
a separate vma region. Recall that a vma region is process-specific. The key
problem is to map a vma region to a contiguous region of a file. For example,
if the vma region’s start and end addresses are A and B (resp.), we need some
record of the fact that the starting address corresponds to the file offset P and
the ending address corresponds to file offset Q (note: B − A = Q − P ). Each
vma structure stores two fields that help us maintain this information. The
first is vm file (a pointer to the file) and the second is vm pgoff. It is the
offset within the file – it corresponds to the starting address of the vma region.
The page offset within the file can be calculated from the address X using the
following equation.
offset = (X − vm_start) / PAGE_SIZE + vm_pgoff        (7.3)
Here PAGE SIZE is 4 KB and vm start is the starting address of the vma.
Finally, note that we can reverse map file blocks using this data structure as
well.
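A small user-space illustration of Equation 7.3 is shown below; the vma's start address and its page offset within the file are hypothetical values.

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Equation 7.3: the page offset within the file for a virtual address X */
static unsigned long file_page_offset(unsigned long X, unsigned long vm_start,
                                      unsigned long vm_pgoff)
{
    return (X - vm_start) / PAGE_SIZE + vm_pgoff;
}

int main(void)
{
    unsigned long vm_start = 0x7f0000000000UL;   /* hypothetical vma start */
    unsigned long vm_pgoff = 16;                 /* vma begins at page 16 of the file */
    unsigned long X = vm_start + 3 * PAGE_SIZE;  /* an address inside the vma */
    printf("%lu\n", file_page_offset(X, vm_start, vm_pgoff));   /* prints 19 */
    return 0;
}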
[Figure: the classic block mapping structure of an inode, optimized for small files: 12 pointers to directly mapped blocks, followed by pointers to an indirect block, a double indirect block and a triple indirect block, which in turn point to data blocks.]
Trivia 7.6.1
The basic idea is similar to the concept of folios – long contiguous sequences of
pages in physical and virtual memory. In this case, we define an extent to be a
contiguous region of addresses on a storage device. Such a region can be fairly
large. Its size can vary from 4 KB to 128 MB. The advantage of large contiguous
chunks is that there is no need to repeatedly query a mapping structure for
addresses that lie within it. Furthermore, allocation and deallocation is easy.
A large region can be allocated in one go. The flip side is that we may end up
creating holes as was the case with the base-limit scheme in memory allocation
(see Section 6.1.1). In this case, holes don’t pose a big issue because extents
can be of variable sizes. We can always cover up holes with extents of different
sizes. However, the key idea is that we wish to allocate large chunks of data as
extents, and simultaneously try to reduce the number of extents. This reduces
the amount of metadata required to save information related to extents.
The organization of extents is shown in Figure 7.31. In this case, the struc-
ture of the ext4 inode is different. It can store up to four extents. Each extent
points to a contiguous region on the disk. However, if there are more than 4 ex-
tents, then there is a need to organize them as a tree (as shown in Figure 7.31).
The tree can at the most have 5 levels. Let us elaborate.
[Figure 7.31: the ext4 extent tree. The first 12 bytes of i_block[] store an ext4_extent_header. In an index node, the header is followed by ext4_extent_idx entries that point to the next level of the tree; in a leaf node, it is followed by ext4_extent entries that point to contiguous regions on the disk.]
There is no need to define a separate ext4 inode for the extent-based filesys-
tem. The ext4 inode defines 15 block pointers: 12 for direct block pointers,
1 for the single-indirect block, 1 for the double-indirect block and 1 for the
triple-indirect block. Each such pointer is 4 bytes long. Hence, the total storage
required in the ext4 inode structure is 60 bytes.
The great idea here is to repurpose these 60 bytes to store information related
to extents. There is no need to define a separate data structure. The first
12 bytes are used to store the extent header (struct ext4 extent header).
The structure is directly stored in these 12 bytes (not its pointer). An ext4
header stores important information about the extent tree: number of entries,
the depth of the tree, etc. If the depth is zero, then there is no extent tree.
We just use the remaining 48 (60-12) bytes to directly store extents (struct
ext4 extent). Here also the structures are directly stored, not their pointers.
Each ext4 extent requires 12 bytes. We can thus store four extents in this
case.
The code of an ext4 extent is shown in Listing 7.15. It maps a set of
contiguous logical blocks (within a file) to contiguous physical blocks (on the
disk). The structure stores the first logical block, the number of blocks and
the 48-bit address of the starting physical block. We store the 48 bits using
two fields: one 16-bit field and one 32-bit field. An extent basically maps a set
of contiguous logical blocks to the same number of contiguous physical blocks.
The size of an extent is naturally limited to 2^15 (32k) blocks. If each block is 4
KB, then an extent can map 32k × 4 KB = 128 MB.
Listing 7.15: struct ext4 extent
source : fs/ext4/ext4 extents.h
struct ext4_extent {
__le32 ee_block ; /* first logical block */
__le16 ee_len ; /* number of blocks */
__le16 ee_start_hi ; /* high 16 bits ( phy . block ) */
__le32 ee_start_lo ; /* low 32 bits ( phy . block ) */
};
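The mapping performed by a single extent can be illustrated with the following simplified user-space sketch. It mirrors the fields of Listing 7.15 using plain integer types and ignores the little-endian (__le) annotations.

#include <stdio.h>
#include <stdint.h>

/* simplified version of struct ext4_extent (host byte order) */
struct extent {
    uint32_t ee_block;      /* first logical block */
    uint16_t ee_len;        /* number of blocks */
    uint16_t ee_start_hi;   /* high 16 bits of the physical block */
    uint32_t ee_start_lo;   /* low 32 bits of the physical block */
};

/* translate a logical block (assumed to lie within the extent) to a physical block */
static uint64_t extent_map(const struct extent *ex, uint32_t lblk)
{
    uint64_t start = ((uint64_t) ex->ee_start_hi << 32) | ex->ee_start_lo;
    return start + (lblk - ex->ee_block);
}

int main(void)
{
    struct extent ex = { 100, 8, 0, 500000 };   /* logical blocks 100-107 -> physical 500000-500007 */
    printf("%llu\n", (unsigned long long) extent_map(&ex, 103));   /* prints 500003 */
    return 0;
}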
Now, consider the case when we need to store more than 4 extents. In this
case, there is a need to create an extent tree. Each internal node in the extent
tree is represented by the structure struct ext4 extent idx (extent index).
It stores the starting logical block number and a pointer to the physical block
number of the next level of the tree. The next level of the tree is a block
(typically 4 KB). Out of the 4096 bytes, 12 bytes are required for the extent
header and 4 bytes for storing some more metadata at the end of the block.
This leaves us with 4080 bytes, which can be used to store 340 12-byte data
structures. These could either be extents or extent index structures. We are
thus creating a 340-ary tree, which is massive. Now, note that we can at the
most have a 5-level tree. The maximum file size is thus extremely large. Many
file systems limit it to 16 TB. Let us compute the maximum size of the entire file
system. The total number of addressable physical blocks is 2^48. If each block is
4 KB, then the maximum file system size (known as volume size) is 2^60 bytes,
which is 1 EB (exabyte). We can thus quickly conclude that an extent-based
file system is far more scalable than an indirect block-based file system.
Directory Structure
As discussed earlier, it is the job of the ext4 file system to define the internal
structure of the directory entries. VFS simply stores structures to implement
the external interface.
Listing 7.16 shows the structure of a directory entry in the ext4 file system.
The name of the structure is ext4 dir entry 2. It stores the inode number,
length of the directory entry, length of the name of the file, the type of the file
and the name of the file. It basically establishes a connection between the file
name and the inode number. In this context, the most important operation is a
lookup operation. The input is the name of a file, and the output is a pointer to
the inode (or alternatively its unique number). This is a straightforward search
problem in the directory. We need to design an appropriate data structure for
storing the directory entries (e.g.: ext4 dir entry 2 in the case of ext4). Let
us start with looking at some naive solutions. Trivia 7.6.2 discusses the space
of possible solutions.
Trivia 7.6.2
• We can simply store the entries in an unsorted linear list. This
will require roughly n/2 time comparisons on an average, where n
is the total number of files stored in the directory. This is clearly
slow and not scalable.
• The next solution is a sorted list that requires O(log(n)) compar-
isons. This is a great data structure if files are not being added or
removed. However, if the contents of a directory change, then we
need to continuously re-sort the list, which is seldom feasible.
• A hash table has roughly O(1) search complexity. It does not
require continuous maintenance. However, it also has scalability
problems. There could be a high degree of aliasing (multiple keys
map to the same bucket). This will require constant hash table
resizing.
• Traditionally, red-black trees and B-trees have been used to solve
such problems. They scale well with the number of files in a direc-
tory.
[Figure 7.32: the file "abc" stored in the FAT file system. Each FAT table entry points to the next entry of the file's chain in the FAT table and to the corresponding cluster (location) on the storage device.]
The basic concept is quite simple. We have a long table of entries (the FAT
table). This is the primary data structure in the overall design. Each entry has
two pointers: a pointer to the next entry in the FAT table (can be null) and a
pointer to a cluster stored on the disk. A cluster is defined as a set of sectors
(on the disk). It is the smallest unit of storage in this file system. We can think
of a file as a linked list of entries in the FAT table, where each entry additionally
points to a cluster on the disk (or some storage device). Let us elaborate.
Regular Files
Consider a file “abc”. It is stored in the FAT file system (refer to Figure 7.32).
Let us assume that the size of the file is three clusters. We can number the
clusters 1, 2 and 3, respectively. The 1st cluster is the first cluster of the file as
shown in the figure. The file's first entry in the FAT table has
a pointer to this cluster. Note that this pointer is a disk address. Given that
this entry is a part of a linked list, it contains a pointer to the next entry (2nd
entry). This entry is designed similarly. It has a pointer to the second cluster
of the file. Along with it, it also points to the next node on the linked list (3rd
entry). Entry number 3 is the last element on the linked list. Its next pointer
is null. It contains a pointer to the third cluster.
The structure is thus quite simple and straightforward. The FAT table just
stores a lot of linked lists. Each linked list corresponds to a file. In this case a
file represents both a regular file and a directory. A directory is also represented
as a regular file, where the data blocks have a special format.
Almost everybody would agree that the FAT table distinguishes itself on the
basis of its simplicity. All that we need to do is divide the total storage space
into a set of clusters. We can maintain a bitmap for all the clusters, where the
bit corresponding to a cluster is 1 if the cluster is free, otherwise it is busy.
Any regular file or directory is a sequence of clusters and thus can easily be
represented by a linked list.
Even though the idea seems quite appealing, linked lists have their share of
problems. They do not allow random access. This means that given a logical
address of a file block, we cannot find its physical address in O(1) time. There
is a need to traverse the linked list, which requires O(N) time. Recall that
the ext4 file system allows us to find the physical address of a file block in
effectively constant time, regardless of whether it uses indirect blocks or extents.
This is something that we sacrifice with a FAT table. If we have pure sequential
accesses, then this limitation does not pose a major problem.
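The sequential nature of the lookup can be seen in the following toy user-space sketch. The FAT table contents are made up, and the file's chain is assumed to start at entry 0.

#include <stdio.h>

#define FAT_EOF -1   /* marker for the last entry of a file's chain */

/* a toy FAT table: fat_next[i] is the next FAT entry of the chain,
   fat_cluster[i] is the cluster (disk address) that entry i points to */
static int fat_next[8]    = { 3, FAT_EOF, -2, 1, -2, -2, -2, -2 };
static int fat_cluster[8] = { 10, 42, 0, 27, 0, 0, 0, 0 };

/* translate the n-th logical cluster of a file (whose first FAT entry is
   'first') into a cluster number; this is an O(n) linked-list walk */
static int logical_to_cluster(int first, int n)
{
    int e = first;
    while (n-- > 0) {
        e = fat_next[e];
        if (e == FAT_EOF)
            return -1;           /* past the end of the file */
    }
    return fat_cluster[e];
}

int main(void)
{
    printf("%d\n", logical_to_cluster(0, 2));   /* third cluster of the file -> 42 */
    return 0;
}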
Directories
Figure 7.33: Storing files and directories in the FAT file system
Both ext4 and exFAT treat a directory as a regular file to a large extent. It is
just a collection of blocks (clusters in the case of exFAT). The “data” associated
with a directory has a special format. As shown in Figure 7.33, a directory is
a table with several columns. The first column is the file name, which is a
unique identifier of the file. Modern file systems such as exFAT support long
file names. Sometimes comparing such large file names can be time-consuming.
In the interest of efficiency, it is a better idea to hash a file name to a 32 or 64-
bit number. Locating a file thus involves a simple 32 or 64-bit hash comparison,
which is an efficient solution.
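For illustration, the following user-space sketch hashes a file name to a 32-bit number using the well-known djb2 algorithm. The actual hash function used by exFAT is different; the point is only that comparing small fixed-width hashes is much cheaper than comparing long names.

#include <stdio.h>
#include <stdint.h>

/* the classic djb2 string hash */
static uint32_t name_hash(const char *name)
{
    uint32_t h = 5381;
    while (*name)
        h = h * 33 + (unsigned char) *name++;
    return h;
}

int main(void)
{
    /* comparing two 32-bit hashes is much cheaper than comparing long names */
    printf("%u %u\n", name_hash("osbook.pdf"), name_hash("holiday_photo_2024.jpg"));
    return 0;
}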
The next set of columns store the file’s attributes that include a file’s status
(read-only, hidden, etc.), file length and creation/modification times. The last
column is a pointer to the first entry in the FAT table. This part is crucial. It
ties a directory entry to the starting cluster of a file via the FAT table. The
directory entry does not point to the cluster directly. Instead, it points to the
first entry of the file in the FAT table. This entry has two pointers: one points
to the first cluster of the file and the other points to the next entry of the linked
list.
Due to the simplicity of such file systems, they have found wide use in
portable storage media and embedded devices.
Phase Action
Pre-write Discard the journal entry
Write Replay the journal entry
Cleanup Finish the cleanup process
Table 7.4: Actions that are taken when the system crashes in different
phases of a write operation
Assume that the system crashes in the pre-write phase. This can be detected
from its journal entry. The journal entry would be incomplete. We assume that
it is possible to find out whether a journal entry is fully written to the journal
or not. This is possible using a dedicated footer section at the end of the entry.
Additionally, we can have an error checking code to verify the integrity of the
entry. In case the entry is not fully written, it can simply be discarded.
If the journal entry is fully written, then the next stage commences where
a set of blocks on the storage device are written to. This is typically the most
time-consuming process. At the end of the write operation, the file system driver
updates the journal entry to indicate that the write operation is over. Now
assume that the system crashes before this update is made. After a restart, this
fact can easily be discovered. The journal entry will be completely written, but
there will be no record of the fact that the write operation has been fully completed.
The entire write operation can be re-done (replayed). Given the idempotence
of writes, there are no correctness issues.
Finally, assume that the write operation is fully done but before cleaning up
the journal, the system crashes. When the system restarts it can clearly observe
that the write operation has been completed, yet the journal entry is still there.
It is easy to finish the remaining bookkeeping and mark the entry for removal.
Either it can be removed immediately or it can be removed later by a dedicated
kernel thread.
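The recovery decision can be summarized by the following conceptual sketch. The structure and field names are hypothetical; the function simply encodes the three rows of Table 7.4.

/* hypothetical summary of a journal entry's state after a restart */
struct journal_entry {
    int fully_written;   /* footer present and checksum verified */
    int write_done;      /* data blocks confirmed to be on the storage device */
};

enum action { DISCARD, REPLAY, CLEANUP };

enum action recover(const struct journal_entry *e)
{
    if (!e->fully_written)
        return DISCARD;  /* crash in the pre-write phase */
    if (!e->write_done)
        return REPLAY;   /* crash during the write phase: redo it (writes are idempotent) */
    return CLEANUP;      /* crash after the write: just finish the bookkeeping */
}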
Example 7.6.1
Open the file "a.txt" and copy its contents, character by character, to the file "b.txt".
Answer:
#include <stdio.h>
#include <stdlib.h>
int main () {
    int c;                                  /* int, so that EOF can be represented */
    FILE *src_file, *dst_file;
    src_file = fopen("a.txt", "r");         /* read-only mode */
    if (src_file == NULL) {
        printf("Could not open a.txt\n");
        exit(1);
    }
    dst_file = fopen("b.txt", "w");         /* write mode */
    if (dst_file == NULL) {
        fclose(src_file);
        printf("Could not open b.txt\n");
        exit(1);
    }
    while ((c = fgetc(src_file)) != EOF)    /* copy character by character */
        fputc(c, dst_file);
    fclose(src_file);
    fclose(dst_file);
    return 0;
}
On similar lines, we open the file “b.txt” for writing. In this case, the mode
is “w”, which means that we wish to write to the file. The corresponding mode
for opening the source file (“a.txt”) was “r” because we opened it in read-only
mode. Subsequently, we keep reading the source file character by character and
keep writing them to the destination file. If the character read is equal to EOF
(end of file), then it means that the end of the file has been reached and there
are no more valid characters left. The C library call to read characters is fgetc
and the library call to write a character is fputc. It is important to note that
both these library calls take the FILE handle (structure) as an argument
for identifying the file that has been opened in the past. Here, it is important
to note that a file cannot be accessed without opening it first. This is because
opening a file creates some state in the kernel that is subsequently required while
accessing it. We are already aware of the changes that are made such as adding
a new entry to the systemwide open file table, per-process open file table, etc.
Finally, we close both the files using the fclose library calls. They clean up
the state in the kernel. They remove the corresponding entries from the per-
process file table. The entries from the systemwide table are removed only if
there is no other process that has simultaneously opened these files. Otherwise,
we retain the entries in the systemwide open file table.
Let us consider the next example (Example 7.6.2) that opens a file, maps it
to memory and counts the number of ’a’s in the file. We proceed similarly. We
open the file “a.txt”, and assign it to a file handle file. In this case, we need
to also retrieve the integer file descriptor because there are many calls that need
it. This is easily achieved using the fileno function.
Example 7.6.2
Open a file ”a.txt”, and count the number of ’a’s in the file.
Answer:
int main () {
    FILE *file = fopen("a.txt", "r");
    int fd = fileno(file);                  /* integer file descriptor */
    char *buf;
    struct stat info;
    int i, size, count = 0;
    fstat(fd, &info);  size = info.st_size; /* find the file size */
    buf = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);  /* map the file */
    for (i = 0; i < size; i++)
        if (buf[i] == 'a') count++;
    printf("count = %d\n", count);
}
7.6.10 Pipes
Let us now look at a special kind of file known as a pipe. A pipe functions as a
producer-consumer queue. Even though modern pipes have support for multiple
producers and consumers, a typical pipe has a process that writes data at one
end, and another process that reads data from the other end. There is built-in
synchronization. This is a fairly convenient method of transferring data across
processes. There are two kinds of pipes: named and anonymous. We shall look
at anonymous pipes first.
Anonymous Pipes
An anonymous pipe is a pair of file descriptors. One file descriptor is used to
write, and the other is used to read. This means that the writing process has one
file descriptor, which it uses to write to the pipe. The reading process has one
more file descriptor, which it uses to read. A pipe is a buffered channel, which
means that if the reader is inactive, the pipe buffers the data that has not been
read. Once the data is read, it is removed from the pipe. Example 7.6.3 shows
an example.
Example 7.6.3
Answer:
int main () {
    pid_t pid;
    int pipefd[2];
    char msg_sent[] = "I love my OS book";
    char msg_rcvd[30];
    pipe(pipefd);   /* pipefd[0]: read end, pipefd[1]: write end */
    pid = fork();   /* fork */
    if (pid > 0) {
        /* parent process: writer */
        close(pipefd[0]);
        write(pipefd[1], msg_sent, strlen(msg_sent) + 1);
        close(pipefd[1]);
    } else {
        /* child process: reader */
        close(pipefd[1]);
        read(pipefd[0], msg_rcvd, sizeof(msg_rcvd));
        printf("%s\n", msg_rcvd);
        close(pipefd[0]);
    }
}
As we can see in the example, the pipe library call (and system call) creates
a pair of file descriptors. It returns a 2-element array of file descriptors. 0 is the
read end, and 1 is the write end. In the example, the array of file descriptors is
passed to both the parent and the child process. Given that the parent needs
to write data, it closes the read end (pipefd[0]). Note that instead of using
fclose, we use close that takes a file descriptor as input. In general, the
library calls with a prefix of ‘f’ are at a high level and have lower flexibility. On
the other hand, calls such as open, close, read and write directly wrap the
corresponding system calls and are at a much lower level.
The parent process quickly closes the file descriptor that it does not need
(read end). It writes a string msg sent to the pipe. The child process is the
reader. It does something similar – it closes the write end. It reads the message
from the pipe, and then prints it.
Named Pipes
DELL@Desktop-home2 ~
$ mkfifo mypipe                                  (create a named pipe)

DELL@Desktop-home2 ~
$ file mypipe
mypipe: fifo (named pipe)

DELL@Desktop-home2 ~
$ ls -al mypipe
prw-rw-rw- 1 DELL None 0 Apr 23 09:38 mypipe     (note the 'p')

DELL@Desktop-home2 ~                             (a different shell)
$ echo "I love my OS course" > mypipe            (write to the pipe)

DELL@Desktop-home2 ~
$ tail -f mypipe                                 (wait till the pipe is written to)
I love my OS course

Figure 7.34: Creating and using a named pipe
Figure 7.34 shows a method for using named pipes. In this case the mkfifo
command is used to create a pipe file called mypipe. Its details can be listed
with the file command. The output shows that it is a named pipe, which is
akin to a producer-consumer FIFO queue. A directory listing shows the file to
be of type ‘p’. Given that the file mypipe is now a valid file in the file system, a
process running on a different shell can simply write to it. In this case, we are
writing the string “I love my OS course” to the pipe by redirecting the output
stream to the pipe. The ‘>’ symbol redirects the output to the pipe. The other
reading process can now read the message from the pipe by using the tail shell
command. We see the same message being printed.
Using such named pipes gives processes a convenient mechanism to pass
messages between each other. They do not have to create a new pipe all the
time. One end of the pipe can just be treated as a regular file that is being
written to. As we have seen, the '>' symbol redirects the standard output
stream to the pipe. Similarly, the other side, which is the read end, can be used
by any program to read any messages present in the pipe. Here also it is possible
to make the standard input come from a file (or a pipe) using the '<' symbol.
Exercises
Ex. 2 — Why are modern buses like USB designed as serial buses?
Ex. 4 — Give an example where RAID 3 (striping at the byte level) is the
preferred approach.
Ex. 5 — What is the advantage of a storage device that rotates with a con-
stant linear velocity?
** Ex. 6 — RAID 0 stripes data – stores odd numbered blocks in disk 0 and
even numbered blocks in disk 1. RAID 1 creates a mirror image of the data (disk
0 and disk 1 have the same contents). Consider RAID 10 (first mirror and then
stripe), and RAID 01 (first stripe and then mirror). Both the configurations
will have four hard disks divided into groups of two disks. Each group is called
a first-level RAID group. We are essentially making a second-level RAID group
out of two first-level RAID groups. Now, answer the following questions:
a)Does RAID 01 offer the same performance as RAID 10?
b)What about their reliability? Is it the same? You need to make an implicit
assumption here, which is that it is highly unlikely that both the disks
belonging to the same first-level RAID group will fail simultaneously.
Ex. 7 — The motor in hard disks rotates at a constant angular velocity. What
problems does this cause? How should they be solved?
Ex. 8 — We often use bit vectors to store the list of free blocks in file systems.
Can we optimize the bit vectors and reduce the amount of storage?
Ex. 9 — What is the difference between the contents of a directory, and the
contents of a file?
Ex. 10 — Describe the advantages and disadvantages of memory-mapped I/O
and port-mapped I/O.
Ex. 14 — How does memory-mapped I/O work in the case of hard disks? We
need to perform reads, writes and check the status of the disk. How does the
processor know that a given address is actually an I/O address, and how is this
communicated to software? Are these operations synchronous or asynchronous?
What is the advantage of this method over a design that uses regular I/O ports?
Explain your answers.
Ex. 16 — FAT file systems find it hard to support seek operations. How can
a FAT file system be modified to support such operations more efficiently?
Ex. 20 — Most flash devices have a small DRAM cache, which is used to
reduce the number of PE-cycles and the degree of read disturbance. Assume
that the DRAM cache is managed by software. Suggest a data structure that
can be created on the DRAM cache to manage flash reads and writes such that
we minimize the #PE-cycles and read disturbance.
Ex. 22 — Answer the following questions with respect to devices and device
drivers:
a)Why do we have both software and hardware request queues in struct request queue?
b)Why do device drivers deliberately delay requests?
c)Why should we just not remove (eject) a USB key?
d)What can be done to ensure that even if a user forcefully removes a USB
key, its FAT file system is not corrupted?
Ex. 25 — Suggest an algorithm for periodically draining the page cache (sync-
ing it with the underlying storage device). What happens if the sync frequency
is very high or very low?
** Ex. 27 — Design a file system for a system like Twitter/X. Assume that
each tweet (small piece of text) is stored as a small file. The file size is limited
to 256 bytes. Given a tweet, a user would like to take a look at the replies to
the tweet, which are themselves tweets. Furthermore, it is possible that a tweet
may be retweeted (posted again) many times. The “retweet” (new post) will
be visible to a user’s friends. Note that there are no circular dependences. It
is never the case that: (1) A tweets, (2) B sees it because B is A’s friend, (3)
B retweets the same message, and (4) A gets to see the retweet. Design a file
system that is suitable for this purpose.
Ex. 28 — Consider a large directory in the exFAT file system. Assume that
its contents span several blocks. How is the directory (represented as a file)
stored in the FAT table? What does each row in a directory’s data block look
like? How do we create a new file and allocate space to it in this filesystem?
For the last part, explain the data structures that we need to maintain. Justify
the design.
Chapter 8
Virtualization and Security
Exercises
Ex. 2 — Describe the trap-and-emulate method. How does it work for inter-
rupts, privileged instructions and system calls?
** Ex. 3 — Most proprietary software uses a license server to verify if the user
has sufficient credentials to run the software. Think of a “license server” as an
external server. The client sends its id, and IP address (cannot be spoofed) along
with some more information. After several rounds of communication, the server
sends a token that the client can use to run the application only once. The next
time we run the application, a fresh token is required. Design a cryptographic
protocol that is immune to changing the system time on the client machine,
replay attacks, and man-in-the-middle attacks. Assume that the binary of the
program cannot be changed.
** Ex. 5 — Let us design an operating system that supports record and re-
play. We first run the operating system in record mode, where it executes a host
of applications that interact with I/O devices, the hard disk, and the network.
A small module inside the operating system records all the events of interest.
Let us call this the record phase.
After the record phase terminates, later on, we can run a replay phase. In this
case, we shall run the operating system and all the constituent processes exactly
the same way as they were running in the record phase. The OS and all the
processes will show exactly the same behavior, and also produce exactly the
same outputs in the same order. To an outsider both the executions will be
indistinguishable. Such systems are typically used for debugging and testing,
where it is necessary to exactly reproduce the execution of an entire system.
Your answer should at least address the following points:
a)What do we do about the time? It is clear that we have to use some notion
of a logical time in the replay phase.
b)How do we deliver I/O messages from the network or hard disk, and inter-
rupts with exactly the same content, and exactly at the same times?
c)What about non-determinism in the memory system such as TLB misses,
and page faults?
d)How do we handle inherently non-deterministic instructions such as reading
the current time and generating a random number?
Ex. 7 — How does the VMM keep track of updates to the guest OS’s page
tables in shadow and nested paging?
Ex. 9 — If there is a context switch in the guest OS, how does the VMM
get to know the id (or something equivalent) of the new process (one that is
being swapped in)? Even if the VMM is not able to find the pid of the new
process being run by the guest OS, it should have some information available
with it such that it can locate the page table and other bookkeeping information
corresponding to the new process.
In this book, we have concerned ourselves only with the Linux kernel and that
too in the context of the x86-64 (64-bit) ISA. This section will thus provide a
brief introduction to this ISA. It is not meant to be a definitive reference. For a
deeper explanation, please refer to the book on basic computer architecture by
your author [Sarangi, 2021].
The x86-64 architecture is a logical successor of the 32-bit x86 architecture, which in turn succeeded the 16-bit and 8-bit versions. It is the default architecture of all Intel and AMD processors as of 2023. This CISC ISA became more complicated with the passage of time. From its early 8-bit origins, the development of these processors passed through several milestones. The 16-bit version arrived in 1978, and the 32-bit version arrived with the Intel 80386, which was released in 1985. Intel and AMD introduced the x86-64 ISA starting from 2003. The ISA has become increasingly complex over the years, and hundreds of new instructions have been added since then, particularly vector extensions (a single instruction can work on a full vector of data).
A.1 Registers
Figure A.1: The 16-bit registers ax, bx, cx and dx, each comprising a high byte (ah, bh, ch, dh) and a low byte (al, bl, cl, dl)
The original 8-bit ISA had four general-purpose registers: a, b, c and d. In the 16-bit avatar of the ISA, these registers were simply extended to 16 bits. Their names changed though; for instance, a became ax, b became bx, and so on. As shown in Figure A.1, the original 8-bit registers continued to be accessible. Each 16-bit register was split into a high and a low part. The lower 8 bits are addressable using the specifier al (low), and the upper 8 bits (bits 9-16) are addressable using the register ah (high).
A few more registers are present in the 16-bit ISA. There is a stack pointer sp (top of the stack), a frame pointer bp (beginning of the activation block of the current function), and two index registers (si and di) for performing computations in a loop via a single instruction. In the 32-bit variant, the prefix ‘e’ was added: ax became eax, and so on. Furthermore, in the 64-bit variant the prefix ‘e’ was replaced with the prefix ‘r’. Along with these, 8 new registers were added: r8 to r15. This is shown in Figure A.2. Note that even in the 64-bit variant of the ISA, known as x86-64, the 8, 16 and 32-bit registers are still accessible. It is just that these registers exist virtually (as parts of larger registers).
Figure A.2: The registers in the x86-64 ISA. Each of the 64-bit registers rax, rbx, rcx, rdx, rsp, rbp, rsi and rdi contains a 32-bit version (eax, ebx, ecx, edx, esp, ebp, esi and edi) and a 16-bit version (ax, bx, cx, dx, sp, bp, si and di). The additional registers r8 to r15 are 64 bits wide.
Note that unlike newer RISC ISAs, the program counter is not directly acces-
sible. It is known as the instruction pointer in the x86 ISA, which is not visible
to the programmer. Along with the program counter, there is also a flags
register that becomes rflags in x86-64. It stores all the ALU flags. For exam-
ple, it stores the result of compare instructions. Subsequent branch instructions
use the result of the last compare instruction for deciding the outcome of a
conditional branch instruction. Refer to Figure A.3.
Figure A.3: The instruction pointer: rip (64 bits), eip (32 bits) and ip (16 bits)
There are a few flag fields in the rflags register that are commonly used. Each flag corresponds to a bit position: if the bit is set to 1, the flag is set; otherwise it is unset (the flag is false). OF is the integer overflow flag, CF is the carry flag (generated in an addition), the ZF flag is set when the last comparison resulted in an equality, and the SF sign flag is set when the last operation that could set a flag resulted in a negative result. Note that a comparison operation is basically implemented as a subtraction operation. If the two operands are equal, then the result is zero and the zero flag is set; otherwise, if the first operand is less than the second operand, then the result is negative and the sign flag is set to 1.
Figure: The x87 floating-point register stack, comprising the registers st0 to st7
The basic mov operation moves the first operand to the second operand.
The first operand is the source and the second operand is the destination in this
format. Each instruction admits a suffix, which specifies the number of bits that
we want it to operate on. The ‘q’ modifier means that we wish to operate on 64
bits, whereas the ‘l’ modifier indicates that we wish to operate on 32-bit values.
In the instruction movq $3, %rax, we move the number 3 (prefixed with a ‘$’)
to the register rax. Note that all registers are prefixed with a percentage (‘%’)
symbol. Similarly, the next instruction movq $4, %rbx moves the number 4 to
the register rbx. The third instruction addq %rbx, %rax adds the contents of
register rbx to the contents of register rax, and stores the result in rax. Note
that in this case, the second operand %rax is both a source and a destination.
The final instruction stores the contents of rax (that was just computed) to
memory. In this case, the memory address is computed by adding the base
address that is stored in the stack pointer (%rsp) to the offset 8. The movq
instruction moves data between registers as well as between a register and a
memory location. It thus works as both a load and a store. Note that we
cannot transfer data from one memory location to another memory location. It
is basically not possible to have two memory operands in an instruction.
Let us look at the code for computing the factorial in Listing A.1. In this
case, we use the 32-bit version of the ISA. Note that it is perfectly legal to do
so in a 64-bit processor for power and performance reasons. In the code shown
in Listing A.1, eax stores the number that we are currently multiplying and
edx stores the product. The imull instruction multiplies the partial product in edx with the value in eax.
The last step in the backend of the compiler is code generation. The low-level
IR is converted to actual machine code. It is important for the compiler to know
the exact semantics of instructions on the target machine. Quite often, there are complex corner cases involving floating point flags and other rarely used instructions. They have their own set of idiosyncrasies. Needless to say, any compiler needs to be aware of them, and it needs to use the appropriate set of instructions such that the code executes as efficiently as possible, while guaranteeing 100% correctness. Furthermore, many compilers as of 2023 allow the user to specify the compilation priorities. For instance, some programmers may be looking at reducing the code size, and for them performance may not be that great a priority, whereas for other programmers performance may be the topmost priority. Almost all modern compilers are designed to handle such concerns and generate code accordingly.
Figure: Compiling the source file x.c into the object file x.o with the command gcc -c x.c
such cases where a type conversion hierarchy is used. But to do that we need
to insert code in the compiled program and thus knowing the signature of the
function is essential. Once the signature is provided, the original function could
be defined in some other C file, which per se is not an issue – we can compile
the C program seamlessly. To summarize, with the signature we can solve many
problems like automatic type conversion, correctly arranging the arguments and
properly typecasting the return value. During compilation, the address of the
function may not be known if it is defined in another C file. This is something
that the linker will need to resolve as we have discussed.
Let us further delve into the problem of specifying function signatures, which
will ensure that we can at least compile a single C program correctly and create
object files, which the linker can process.
A header file is a convenient mechanism for providing a bunch of signatures to a C file. For instance, there could be a set
of C files that provide cryptographic services. All of them could share a common
header file via which they export the signatures of the functions that they define
to other modules in a large software project. Other C files could include this
header file and call the relevant functions defined in it to obtain cryptographic
services. The header file thus facilitates a logical grouping of variable, function
and structure/class declarations. It is much easier for programmers to include
a single header file that provides a cohesive set of declarations as opposed to
manually adding declarations at the beginning of every C file.
Header files have other interesting uses as well. Sometimes, it is easier to simply go through a header file to figure out the set of functions that a group of C files provides to the rest of the world.
Barring a few exceptions, header files never contain function definitions or
any other form of executable code. Their role is not to have regular C state-
ments. This is the role of regular source code files and header files should be
reserved only for signatures that aid in the process of compilation. For the
curious reader, it is important to mention that the only exception to this rule
is C++ templates. A template is basically a class definition that takes another
class or structure as an argument and generates code based on the type of the
class that is passed to it at compile time.
Now, let us look at a set of examples to understand how header files are
meant to be used.
Listing B.1 shows the code for the header file factorial.h. First, we check if a preprocessor variable FACTORIAL_H is already defined. If it is already defined, it means that the header file has already been included. This can happen for a variety of reasons. It is possible that some other header file has included factorial.h, and that header file has been included in a C file. Given that the contents of factorial.h are already present in the C file, there is no need to include it again explicitly. This is ensured using preprocessor variables. In this case, if FACTORIAL_H has not been defined, then we define the function's signature: int factorial(int);. This basically says that the function takes a single integer as input and that the return value is an integer.
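For concreteness, a header along these lines could look as follows (a minimal sketch of the structure described above):

/* factorial.h: guard against multiple inclusion and export the signature */
#ifndef FACTORIAL_H
#define FACTORIAL_H

int factorial(int); /* takes one integer, returns an integer */

#endif /* FACTORIAL_H */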
Listing B.2 shows the code of the factorial.c file. Note the way in which
we are including the factorial.h file. It is being included by specifying its name
in between double quotes. This basically means that the header file should
be there in the same directory as the C file (factorial.c). We can also use the
traditional way of including a header file between the ’<’ and ’>’ characters.
In this case, the directory containing the header file should be there in the
include path. The “include path” is a set of directories in which the C compiler
searches for header files. The directories are searched in the order in which they appear in the include path. There is always an option of adding an additional directory to the include path by using the ‘-I’ flag in gcc. Any directory that follows the ‘-I’ flag is made a part of the include path, and the compiler searches that directory as well for the presence of the header
file. Now, when the compiler compiles factorial.c, it can create factorial.o (the
corresponding object file). This object file can now be used by other C files
whenever they want to use the factorial function. They know that it is defined
there. The signature is always available in factorial.h.
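A possible implementation along these lines is sketched below (an iterative version; the actual code in Listing B.2 may differ):

/* factorial.c: include the header from the same directory and define the function */
#include "factorial.h"

int factorial(int n)
{
    int result = 1;
    for (int i = 2; i <= n; i++)
        result *= i;
    return result;
}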
Let us now try to write the file that will use the factorial function. Let us name it prog.c. Its code is shown in Listing B.3.
#include <stdio.h>
#include "factorial.h"

int main() {
    printf("%d\n", factorial(3));
}
All that the programmer needs to do is include the factorial.h header file and
simply call the factorial function. The compiler knows how to generate the code
for prog.c and create the corresponding object file prog.o. Given that we have
two object files now – prog.o and factorial.o – we need to link them together and
create a single binary that can be executed. This is the job of the linker that we
shall see next. Before we look at the linker in detail, an important point that
needs to be understood here is that we are separating the signature from the
implementation. The signature was specified in factorial.h that allowed prog.c to
be compiled without knowing how exactly the factorial function is implemented.
The signature was enough information for the compiler to compile prog.c.
The other interesting benefit of having a header file is that the signature is decoupled from the implementation. The programmer can happily change the implementation as long as the signature remains the same. The rest of
the world will not be affected, and they can continue to use the same function
as if nothing has changed. This allows multiple teams of programmers to work
independently as long as they agree on the signatures of functions that their
respective modules export.
Programs also call functions defined in the standard C library; hence, there is a need to link these library object files as well. In this case, we think of the standard
library as a library of functions that make it easy for accessing system services
such as reading and writing to files or the terminal.
There are two ways of linking: static and dynamic. Static linking is a simple
approach where we just combine all the .o files and create a single executable.
This is an inefficient method as we shall quickly see. This is why dynamic
linking is used where all the .o files are not necessarily combined into a single
executable.
Figure B.3: Compiling the code in the factorial program and linking the components. The header file factorial.h, which contains the declaration of the factorial function, is included (via #include) by both factorial.c and prog.c.
The precise role of the linker is shown in Figure B.3. Each object file contains
two tables: the symbol table and the relocation table. The symbol table contains
a list of all the symbols – variables and functions – defined in the .o file. Each
entry contains the name of the symbol, sometimes its type and scope, and its
address. The relocation table contains a list of symbols whose address has not
been determined as yet. Let us now explain the linking process that uses these
tables extensively.
Each object file contains some text (program code), read-only constants and
global variables that may or may not be initialized. Along with that it references
variables and functions that are defined in other object files. All the symbols
that an object file exports to the world are defined in the symbol table, and all
the symbols that an object file needs from other object files are listed in the
relocation table. The linker thus operates in two passes.
Pass 1: It scans through all the object files and concatenates all the text sec-
tions (instructions), global variable, function definition and constant def-
inition sections. It also makes a list of all the symbols that have been
defined in the object files. This allows the linker to compute the final
sizes of all the sections: text, data (initialized global/static variables), bss
(uninitialized global/static variables) and constants. All the program code
and symbols can be concatenated and the final addresses of all variables
and functions are computed. The concatenated code is however incom-
plete. The addresses of all the relocated variables and functions (defined
in other object files) are set to zero.
Pass 2: In this stage, the addresses of all the relocated variables and functions
are set to their real values. We know the address of each variable at the
end of Pass 1. In the second pass, the linker replaces the zero-valued
addresses of relocated variables and functions with the actual addresses
computed as a part of the first pass.
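To make the two tables concrete, they can be imagined as arrays of records of roughly the following shape. The structure and field names below are hypothetical, simplified stand-ins; real ELF object files encode this information in the formats declared in <elf.h>.

/* Hypothetical, simplified view of the per-object-file tables */
struct symbol_entry {
    const char   *name;   /* e.g., "factorial" */
    int           type;   /* function, variable, constant, ... */
    int           scope;  /* local to the file or globally visible */
    unsigned long addr;   /* final address, known at the end of Pass 1 */
};

struct relocation_entry {
    const char   *name;   /* symbol defined in some other object file */
    unsigned long offset; /* location in the code/data that must be patched
                             with the symbol's real address in Pass 2 */
};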
invoked that release all the resources that the process owned such as open files
and network connections.
Surprisingly, all this overhead is not much. The dominant source of over-
heads here is the code of all the C libraries that is added to the executable.
This means that if we invoke the printf function, then the code of printf as
well as the set of all the library functions that printf invokes (and in turn they
invoke) are added to the executable. These overheads can be quite prohibitive.
Assume that a program has one hundred unique library calls, but in any practi-
cal execution only 25 unique library calls are made. The size overhead is 100/25
(4×). Sadly, at compile time, we don’t know which library calls will be made
and which ones will not be made. Hence, we conservatively assume that
every single library call that is made in any object file will actually be made,
and it is not dead code. Consequently, the code for all those library functions
(and their backward slice) needs to be included in the executable. Here, the
backward slice of a library function such as printf comprises the set S of li-
brary functions called by printf, as well as all the library functions invoked by
functions in S, so on and so forth. Because of this, we need to include a lot of
code in executables and hence they become very large. This can be visualized
in Figure B.4.
Along with the large size of executables, which in itself is problematic, we
lose a chance to reuse code pages that are required by multiple processes. For
instance, almost all processes use some of the library functions defined in the
standard C library, even if they are written in a different language. As a re-
sult, we would not like to replicate the code pages of library functions – this
would lead to a significant wastage of memory space. Hence, we would like to
share them across processes, saving a lot of runtime memory in the process.
To summarize, if we use such statically linked binaries where the entire code is
packaged within a single executable, such code reuse options are not available
to us. Hence, we need a better solution. This solution is known as dynamic
linking.
Figure B.4: Statically linking a simple program. The file test.c simply prints an integer. After compiling it with gcc -static test.c, the ldd command reports that a.out is not a dynamic executable, and du -h shows that a.out is roughly 892 KB. The binary is quite large because the code of the entire library is included in a.out.
Ideally, we would like to add code to an executable only if there is a high chance that the code will actually be used, and that too very frequently.
Furthermore, we would also not like to add code to an executable if there is
a high chance that it will be reused across many processes. If we follow these
simple rules, the size of the binary will remain reasonably small. However, the
program execution gets slightly complicated because now there will be many
functions whose code is not a part of the executable. As a result, invoking those
functions will involve a certain amount of complexity. Some of this is captured
in Figure B.5.
Figure B.5: Calling a dynamically linked function via a stub. The first time, the stub function locates the printf function in a library and copies it to the address space of the process. Subsequently, the function is called directly using its address.
In this case, where printf is dynamically linked, the address of the printf
symbol is not resolved at link time. Instead, the address of printf is set to a
dummy function known as a stub function. The first time that the stub function
is called, it locates the path of the library that contains the printf function,
then it copies the code of the printf function to a memory address that is
within the memory map of the process. Finally, it stores the address of the
first byte of the printf function in a dedicated table known as the jump table.
The next time the stub function is called, it directly accesses the address of the
printf function in the jump table. This basically means that the first access
to the printf function is slow. Henceforth, it is very fast.
The advantages of this scheme are obvious. We only load library functions
on-demand. This minimizes the size of the executable. Furthermore, we can
have one copy of the shared library code in physical memory and simply map
regions of the virtual address space of each process to the physical addresses
corresponding to the library code. This also minimizes the memory footprint
and allows as much runtime code reuse as possible. Of course, there is a very minor performance penalty. Whenever a library function is accessed for the first time, there is a need to first search for the library and then find the address of the function within it. Searching for a library proceeds in the same manner as searching for header files.
During the process of compilation, a small note is made about which function
is available in which library. Now if the executable is transferred to another
machine and run there or run on the same machine, at runtime the stub function
will call a function called dlopen. When invoked for the first time for a given
library function, its job is to locate the library. Similar to the way that we
search for a header file, there is a search order. We first search for the library
Figure B.6 shows the method to generate a shared object or shared library
in Linux. In this case, we want to generate a shared library that contains the
code for the factorial function. Hence, we first compile the factorial.c file to
generate the object file (factorial.o) using the ‘-c’ gcc option. Then we create a library out of the object file using the archive (ar) command. The extension of the archive is ’.a’. It is not a shared object/library yet. It is a static library that can only be statically linked.
The next part shows us how to generate a dynamic library. First, we need to
compile the factorial.c file in a way that is position independent – the starting
address does not matter. This allows us to place the code at any location in
the virtual address space of a process. All the addresses are relative to a base
address. In the next line, we generate a shared object from the factorial.o object
file using the ‘-shared’ flag of gcc. This generates libfactorial.so. Next,
we need to compile and link prog.c with the dynamic library that we just
created (libfactorial.so). This part is tricky. We need to do two separate
things.
Consider the command gcc -L. prog.c -lfactorial. We use the ‘-L’ flag
to indicate that the library will be found in the current directory. Then, we
specify the name of the C file, and finally we specify the library using the ‘-l’
flag. Note that there is no space in this case between ‘-l’ and factorial. The
compiler searches for libfactorial.so in the current directory because of the -L
and -l flags.
However, in this case, running the executable a.out is not very straight-
forward. We need to specify the location at which the factorial library will be
found, given that it is not placed in a standard location that the runtime (library loader) usually checks, such as /lib or /usr/lib. We thus add the current directory (the output of the pwd command) to the LD_LIBRARY_PATH environment variable. After that we can safely execute the dynamically linked executable –
it will know where to find the shared library (libfactorial.so).
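The same machinery can also be driven explicitly from user code. The sketch below assumes that libfactorial.so exports int factorial(int) and can be found via LD_LIBRARY_PATH; it loads the library with dlopen and looks up the symbol with dlsym.

/* A sketch of loading libfactorial.so at runtime (compile with gcc file.c -ldl;
   the -ldl flag is not needed on newer glibc versions). */
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    /* dlopen searches the standard library paths and LD_LIBRARY_PATH */
    void *handle = dlopen("libfactorial.so", RTLD_LAZY);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* Look up the address of the factorial symbol within the library */
    int (*fact)(int) = (int (*)(int)) dlsym(handle, "factorial");
    if (!fact) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return 1;
    }

    printf("factorial(5) = %d\n", fact(5));
    dlclose(handle);
    return 0;
}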
Readers are welcome to check the size of dynamically linked executables.
Recall the roughly 1 MB executable that we produced post static linking (see
Figure B.4); its size reduces to roughly 12 KB with dynamic linking!
Let us finish this round of discussion by describing the final structure of the executable. After static or dynamic linking, Linux produces a shared object file or an executable in the ELF format.
B.3 Loader
The loader is the component of the operating system whose job is to execute a
program. When we execute a program in a terminal window, a new process is
spawned that runs the code of the loader. The loader reads the executable file
from the file system and lays it out in main memory. It needs to parse the ELF
executable to realize this.
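As a small illustration of this parsing step, the following user-space sketch (not the actual loader) reads the ELF header of an executable and prints its entry point and the number of program headers.

/* A user-space sketch: read the ELF header of an executable and print its
   entry point and the number of program headers. */
#include <elf.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <elf-file>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }

    Elf64_Ehdr ehdr;
    if (fread(&ehdr, sizeof(ehdr), 1, f) != 1 ||
        memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0) {
        fprintf(stderr, "not a valid ELF file\n");
        fclose(f);
        return 1;
    }

    printf("entry point     : %#lx\n", (unsigned long) ehdr.e_entry);
    printf("program headers : %u\n", (unsigned) ehdr.e_phnum);
    fclose(f);
    return 0;
}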
The loader creates space for all the sections, loads the constants into memory and
allocates regions for the stack, heap and data/bss sections (static and global
variables). Additionally, it also copies all the instructions into memory. If
they are already present in the memory system, then instead of creating a new
copy, we can simply map the instructions to the virtual memory of the new
process. If there is a need for dynamic linking, then all the information regarding
dynamically linked symbols is stored in the relocation table and dynamic section
in the process’s memory image. It also initializes the jump tables.
Next, it initializes the execution environment such as setting the state of all
the environment variables, copying the command line arguments to variables
accessible to the process and setting up exception handlers. Sometimes for
security reasons, we wish to randomize the starting addresses of the stack and
heap such that it is hard for an attacker to guess runtime addresses. This can
be done by the loader. It can generate random values within a pre-specified
range and initialize base addresses in the program such as the starting value of
the stack pointer and the heap memory region.
The very last step is to issue a system call to erase the memory state of
the loader and start the process from the first address in its text section. The
process is now alive, and the program is considered to be loaded.
Appendix C
Data Structures
The struct list_head structure in some sense represents a linked list node; it has pointers to the next and previous entries. This is enough information to operate on the linked list. For example, we can add new entries as well as remove entries. The crucial question that we need to answer here is, “Where is the encapsulating object that needs to be linked together?” In general, we define a linked list in the context of an object: we interpret the linked list to be a list of objects. Here, however, we do not see an object with its fields; instead, we just see a generic linked list node with pointers to the next and previous nodes. It is true that this satisfies our demand for generality; however, it does not align with our intuitive notion of a linked list as we have studied it in a data structures course.
Listing C.1: The definition of a linked list
source : include/linux/types.h
struct list_head {
    struct list_head *next, *prev;
};
We will use two macros to answer this question as shown in Listing C.2.
Listing C.2: The list_entry and container_of macros
source : include/linux/list.h and
source : include/linux/container_of.h (resp.)
#define list_entry(ptr, type, member) container_of(ptr, type, member)
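The list_entry macro is just a wrapper around container_of. A simplified version of container_of, omitting the compile-time type checks that newer kernels add, looks roughly like this:

/* Simplified sketch of container_of. The kernel version in
 * include/linux/container_of.h additionally performs compile-time type
 * checks; offsetof is provided by <stddef.h> in user space. */
#define container_of(ptr, type, member) ({                      \
        void *__mptr = (void *)(ptr);                            \
        ((type *)(__mptr - offsetof(type, member))); })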
Let us start by describing the container_of macro. It takes three inputs: a pointer, a type and a member name. The first statement simply typecasts the pointer to void*. This is needed because we want to create a generic implementation, which is not dependent on any particular type of object. The offsetof macro provides the offset of the starting address of the member from the beginning of the structure. Consider the structures in Listing C.3. In the case of struct abc, the value of offsetof(struct abc, list) is 4. This is because we are assuming that the size of an integer is four bytes. This integer is stored in the first four bytes of struct abc. Hence, the offset of the list member is 4 here. On the same lines, we can argue that the offset of the member list in struct def is 8. This is because the size of an integer as well as that of a float is 4 bytes each. Hence, mptr − offsetof(type, member) provides the starting address of the encapsulating structure, which can be thought of as the linked list node. To summarize, the container_of macro returns the starting address of the linked list node, or in other words the encapsulating object, given a pointer to the list member within the object.
Listing C.3: Examples of structures
struct abc {
    int x;
    struct list_head list;
};

struct def {
    int x;
    float y;
    struct list_head list;
};
Listing C.4: Example of code that uses the list_entry macro
struct abc *current = ...;
struct abc *next = list_entry(current->list.next, struct abc, list);
Listing C.4 shows a code snippet that uses the list_entry macro, where struct abc is the linked list node. The current node that we are considering is called current. To find the next node (the one after current), which will again be of type struct abc, all that we need to do is invoke the list_entry macro. In this case, the pointer (ptr) is current->list.next. This is a pointer to the struct list_head object in the next node. From this pointer, we need to find the starting address of the encapsulating abc structure. The type is thus struct abc and the member is list. The list_entry macro will internally call offsetof, which will return an integer. This integer will be subtracted from the starting address of the struct list_head member in the next node. This will provide the pointer to the encapsulating object.
This is a very fast and generic mechanism for traversing linked lists in Linux that is independent of the type of the encapsulating object. We can even stretch this discussion to create a linked list that has different kinds of encapsulating objects. In theory, this is possible as long as we know the type of the encapsulating object for each struct list_head on the list.
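To see the whole mechanism end to end, here is a small, self-contained user-space illustration (this is not kernel code; my_list_entry is a hypothetical stand-in for the kernel's list_entry macro):

#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

struct abc {
    int x;
    struct list_head list;   /* the embedded linked list node */
};

#define my_list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

int main(void) {
    struct abc a = { .x = 1 }, b = { .x = 2 };
    a.list.next = &b.list;   /* link a -> b */
    b.list.prev = &a.list;

    /* Recover the encapsulating object of the node that follows a */
    struct abc *next = my_list_entry(a.list.next, struct abc, list);
    printf("%d\n", next->x);   /* prints 2 */
    return 0;
}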
Listing C.5: The definitions of struct hlist_head and struct hlist_node
source : include/linux/types.h
struct hlist_head {
    struct hlist_node *first;
};

struct hlist_node {
    struct hlist_node *next, **pprev;
};
Let us now describe singly-linked lists, which have a fair amount of value in kernel code. Here, the explicit aim is a one-way traversal of the linked list. An example would be a hashtable where we resolve collisions by chaining entries that hash to the same bucket. Linux uses the struct hlist_head structure that is shown in Listing C.5. It points to a node that is represented using struct hlist_node.
This is a very simple data structure. Specifically, it has a next pointer to another hlist_node. However, note that this information is not enough if we wish to delete the hlist_node from the linked list. We need a pointer to the previous entry. This is where a small optimization is possible, and a few instructions can be saved. In the field pprev, we store a pointer to the next member of the previous node of the linked list. The advantage of this is that we can directly set it to another value while deleting the current node. We cannot do anything else easily, which is the intention here.
Such data structures that are primarily designed to be singly-linked lists are often very efficient in terms of performance. Their encapsulating objects are accessed in exactly the same way as for the doubly-linked struct list_head.
2. The leaf nodes are special. They don’t contain any data. However, they
are always presumed to be black. They are also referred to as sentinel
nodes.
3. A red node never has a red child. Basically, red nodes are never adjacent.
4. Any path from a node that is the root of a subtree to any leaf node has
the same black depth. Here, the black depth of a leaf node is defined as the
number of black nodes that we cross while traversing from the root of the
subtree to the leaf node. In this case, we are including both the root and
the leaf node.
5. If a node has exactly one non-leaf child, then its color must be red.
The maximum depth of any leaf is at most twice the minimum depth of
a leaf.
This is quite easy to prove. As we have mentioned, the black depth of all
the leaves is the same. Furthermore, we have also mentioned that a red node
can never have a red child. Assume that in any path from the root to a leaf,
there are r red nodes and b black nodes. We know that b is a constant for all
paths from the root. Furthermore, every red node on such a path is immediately followed by a black node, because a red node never has a red child (note that all leaves or sentinel nodes are black). Hence, r ≤ b. The total depth of any leaf is r + b ≤ 2b. Given that every path contains exactly b black nodes, the minimum depth is at least b, and hence the maximum depth is at most twice the minimum depth.
This vital property ensures that all search operations always complete in
O(log(n)) time. Note that a search operation in an RB tree operates in exactly
the same manner as a regular binary search tree. Insert and delete operations
also complete in O(log(n)) time. They are however not very simple because we
need to ensure that the black depth of all the leaves always stays the same, and
a red parent never has a red child.
They require a sequence of recolorings and rotations. However, we can prove
that at the end all the properties hold and the overall height of the tree is always
O(log(n)).
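Since a search in a red-black tree ignores the colors and proceeds exactly like a search in an ordinary binary search tree, it can be sketched as follows (a minimal illustration; this is not the kernel's rbtree code):

#include <stddef.h>

/* Minimal sketch of a node; a real red-black tree node additionally
   stores a color bit (and typically a parent pointer). */
struct rb_node_sketch {
    long key;
    struct rb_node_sketch *left, *right;
};

/* Standard BST search: O(height), i.e., O(log n) since the tree is balanced */
struct rb_node_sketch *rb_search(struct rb_node_sketch *root, long key)
{
    while (root) {
        if (key == root->key)
            return root;
        root = (key < root->key) ? root->left : root->right;
    }
    return NULL;   /* not found */
}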
C.3 B-Tree
A B-tree is a self-balancing generalization of a binary search tree. In this case, a node can have more than two children, quite unlike a red-black tree.
This is a balanced tree and all of its operations are realizable in logarithmic time.
It is typically used in systems that store a lot of data and quickly accessing a
given datum or a contiguous subset of the data is essential. Hence, database
and file systems tend to use B-trees quite extensively.
Let us start with the main properties of a B-tree of order m.
3. It is important that the tree does not remain sparse. Hence, every internal
node needs to have at least ⌈m/2⌉ children (alternatively ⌈m/2⌉ − 1 keys).
4. If a node has k children, then it stores k−1 keys. These k−1 keys partition
the space of keys into k non-overlapping regions. Each child then stores
keys in the key space assigned to it. In this sense, an internal node’s key
acts as a key space separator.
Figure: An example B-tree of order 3. The root stores the keys 6 and 12; the internal nodes at the next level store the keys (2, 4), (8, 10) and (16, 25); the leaf nodes store the keys 1, 3, 5, 7, 9, 11, 14, 17 and 26 (one key each).
Along with the keys, we also store the values associated with them. These values could be stored
in the node itself or there could be pointers within a node to point to the values
corresponding to its keys. There are many ways of implementing this and the
storage of values is not central to the operation of a B-tree.
Now, if we consider the root node, we find that it stores two keys: 6 and 12.
The leftmost child is an internal node, which stores two keys: 2 and 4. They
point to leaf nodes that store a single key each. Note that as per our definition
(order=3), this is allowed. Now, let us consider the second child of the root
node. It needs to store keys that are strictly greater than 6 and strictly less
than 12. We again see a similar structure with an internal node that stores two
keys – 8 and 10 – and has three leaf nodes. Finally, the last child of the root
only stores keys that are greater than 12.
It is quite obvious that traversing a B-tree is similar to traversing a regular
BST (binary search tree). It takes O(log(n)) time. For added precision, we can
account for the time that it takes to find the pointer to the right subtree within
an internal node. Given the fact that we would have at most m subtrees, this
will take O(log(m)) time (use binary search).
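A search can be sketched as follows. The structure definition below is a simplified, hypothetical one: it omits the values and uses a linear scan within a node instead of the binary search mentioned above.

#include <stdbool.h>
#include <stddef.h>

#define MAX_KEYS 63   /* at most m - 1 keys per node for some order m */

struct btree_node {
    int  nkeys;                             /* number of keys currently stored */
    long keys[MAX_KEYS];                    /* keys in ascending order */
    struct btree_node *child[MAX_KEYS + 1]; /* k keys partition the key space
                                               into k + 1 subtrees */
    bool leaf;
};

/* Returns true if key is present in the subtree rooted at n */
bool btree_search(const struct btree_node *n, long key)
{
    while (n) {
        int i = 0;
        while (i < n->nkeys && key > n->keys[i])
            i++;
        if (i < n->nkeys && key == n->keys[i])
            return true;
        if (n->leaf)
            return false;
        n = n->child[i];   /* descend into the subtree for this key range */
    }
    return false;
}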
Deletion is the reverse process. In this case, we can remove the key as long
as the node still has ⌈m/2⌉ − 1 keys left in it. However, if this is not the case,
then a need will arise to merge two adjacent sibling nodes and move the key
separating the internal nodes from the parent to the merged node. This is pretty
much the reverse of what we did while adding a new key. Here again, a situation
will arise when this cannot be done, and we will be forced to reduce the height
of the tree.
C.3.3 B+ Tree
The B+ tree is a variant of the classical B-tree. In the case of a B-tree, internal nodes can store both keys and values; however, in the case of a B+ tree, internal nodes can only store keys. All the values (or pointers to them) are stored in the
leaf nodes. Furthermore, all the leaf nodes are connected to each other using
a linked list, which allows for very efficient range queries. It is also possible to
do a sequential search in the linked list and locate data with proximate keys
quickly.
Figure: An example Radix tree that stores the strings travel, tryst, truck, tread, tram, tractor, trust, trim, trick and try. All of them share the prefix “tr”, and the edges are labeled with multi-letter substrings (e.g., “vel”, “ctor”, “ck”, “ead”, “st” and “m”).
Each leaf node has a valid key. In other words, this would mean that the path from the root to the leaf node corresponds to a valid key.
The advantage of such a structure is that we can store a lot of keys very
efficiently and the time it takes to traverse it is proportional to the number
of letters within the key. Of course, this structure works well when the keys
share reasonably long prefixes. Otherwise, the tree structure will not form, and
we will simply have a lot of separate paths. Hence, whenever there is a fair
amount of overlap in the prefixes, a Radix tree should be used. It is important
to understand that the lookup time complexity is independent of the number of
keys – it is theoretically only dependent on the number of letters (digits) within
a key.
Insertion and deletion are easy. We need to first perform a lookup operation
and find the point at which the non-matching part of the current key needs to
be added. There will be a need to add a new node that branches out of an
existing node. Deletion follows the reverse process. We locate the key first,
delete the node that stores the suffix of the string that is unique to the key and
then possibly merge nodes.
There is a popular data structure known as a trie, which is a prefix tree
like a Radix tree with one important difference: in a trie, we proceed letter
by letter. This means that each edge corresponds to a single letter. Consider
a system with two keys “tractor” and “tram”. In this case, we will have the
root node, an edge corresponding to ‘t’, then an edge corresponding to ‘r’, an
edge corresponding to ‘a’, so on and so forth. There is no point in having a
node with a single child. We can compress this information to create a more
efficient data structure, which is precisely a Radix tree. In a Radix tree, we can
have multiletter edges. In this case, we can have an edge labeled “tra” (fuse all
single-child nodes).
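A minimal letter-by-letter trie over lowercase strings can be sketched as follows; compressing every chain of single-child nodes in this structure into a multi-letter edge yields a Radix tree.

#include <stdbool.h>
#include <stdlib.h>

/* A letter-by-letter trie over lowercase strings */
struct trie_node {
    struct trie_node *child[26];
    bool is_key;   /* true if the path from the root to this node is a valid key */
};

struct trie_node *trie_new_node(void)
{
    return calloc(1, sizeof(struct trie_node));
}

void trie_insert(struct trie_node *root, const char *s)
{
    for (; *s; s++) {
        int i = *s - 'a';
        if (!root->child[i])
            root->child[i] = trie_new_node();
        root = root->child[i];
    }
    root->is_key = true;
}

bool trie_lookup(const struct trie_node *root, const char *s)
{
    for (; *s && root; s++)
        root = root->child[*s - 'a'];
    return root && root->is_key;
}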
A Patricia trie is a variant of the Radix tree that operates on binary keys; here we do not dedicate a separate edge to each bit of the key. Instead, we have edges labeled with multiple bits such that the number of
internal nodes is minimized. Assume a system with only two keys that are not
equal. Regardless of the Hamming distance between the two keys, the Patricia
Trie will always have three nodes – a root and two children. The root node
will store the shared prefix, and the two children will contain the non-shared
suffix of the binary keys. Incidentally, Patricia stands for Practical Algorithm
To Retrieve Information Coded In Alphanumeric.
This kind of data structure is very useful for representing information stored
in a bit vector.
Let us elaborate. Assume a very long vector of bits. This is a reasonably
common data structure in the kernel particularly when we consider page alloca-
tion. Assume a system that has a million frames (physical pages) in the physical
address space, and we need to manage this information. We can represent this
with a bit vector that has a million 1-bit entries. If the value of the ith
entry is 1, then it means that the corresponding physical page is free, and the
value 0 means that the corresponding physical page has been allocated.
Now a common operation is to find the first physical page that has not been
allocated such that it can be allocated to a new process. In this case, we need
to find the location of the first 1 in the bit vector. On the same lines, we can have an analogous problem where the task is to find the first 0 in the bit vector.
Regardless of whether we are searching for a 0 or 1, we need a data structure
to locate such bits efficiently.
A naive algorithm is, of course, to start traversing the bit vector from the
lowest address onwards and terminate the search whenever a 0 or 1 is found. If
we reach the end of the bit vector and do not find the entry of interest, then
we can conclude that no such entry exists. Now, if there are n entries, then
this algorithm will take O(n) time, which is too much. We clearly need a much
faster algorithm, especially something that runs in O(log(n)) time.
This is where a van Emde Boas tree (vEB tree) is very useful. We show an
example in Figure C.3. We treat the single-bit cells of the bit vector as leaf
nodes. Adjacent cells have a parent node in the vEB tree. This means that if
we have n entries in the bit vector, then there are n/2 entries in the second last
level of the tree. This process continues towards the root similarly. We keep on
grouping adjacent internal nodes and create a parent for them until we reach
the root. We thus end up with a balanced binary tree if n is a power of 2. The
greatness of the vEB tree lies in the contents of the internal nodes. To explain
this, let us start with the root.
If the root node stores a 1, it means that at least a single location in the bit
vector stores a 1. This is a very convenient trick because we instantly know if
the bit vector contains all 0s, or it has at least one 1 value. Each of its children
is the root of a subtree (contiguous region in the bit vector). It stores exactly
the same information as the root. If the root of the subtree stores a 0, then it
means that all the bit vector locations corresponding to the subtree store a 0.
If it stores a 1, then it means that at least one location stores a 1.
Now let us consider the problem of locating the first 1 starting from the
lowest address (from the left in the figure). We first check the root. If it
contains a 1, then it means that there is at least a single 1 in the underlying
bit vector. We then proceed to look at the left child. If it contains a 1, then
it means that the first half of the bit vector contains a 1. Otherwise, we need
to look at the second child. This process continues recursively until we reach
the leaf nodes. We are ultimately guaranteed to find either a 1 or conclude that
there is no entry in the bit vector that is a 1.
This is a very fast process and runs in logarithmic time. Whenever we change
a value from 0 → 1 in the bit vector, we need to walk up the tree and convert all
0s to 1 on the path. However, when we change a value from 1 → 0, it is slightly
tricky. We need to traverse the tree towards the root, however we cannot blindly
convert 1s to 0s. Whenever we reach a node on the path from a leaf to the root,
we need to take a look at the contents of the other child and decide accordingly.
If the other child contains a 1, then the process terminates right there. This is
because the parent node is the root of a subtree that contains a 1 (via the other
child). If the other child contains a 0, then the parent’s value needs to be set to
0 as well and the process will continue towards the root.
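A simple array-based sketch of such a tree is shown below, assuming that the number of bits n is a power of two. Each internal node stores the logical OR of its two children, and the leftmost 1 is found by repeatedly descending into the leftmost child whose subtree contains a 1.

#include <stdlib.h>

/* tree[1] is the root; the children of node i are 2i and 2i+1; the leaves
   tree[n .. 2n-1] mirror the bit vector. n must be a power of two. */
struct bit_tree {
    size_t n;
    unsigned char *tree;   /* 2n entries; entry 0 is unused */
};

struct bit_tree *bit_tree_create(size_t n)
{
    struct bit_tree *t = malloc(sizeof(*t));
    t->n = n;
    t->tree = calloc(2 * n, 1);
    return t;
}

/* Set bit i to 1 and propagate the 1 up towards the root */
void bit_tree_set(struct bit_tree *t, size_t i)
{
    for (i += t->n; i >= 1; i /= 2)
        t->tree[i] = 1;
}

/* Clear bit i; an internal node becomes 0 only if both of its children are 0 */
void bit_tree_clear(struct bit_tree *t, size_t i)
{
    i += t->n;
    t->tree[i] = 0;
    for (i /= 2; i >= 1; i /= 2)
        t->tree[i] = t->tree[2 * i] | t->tree[2 * i + 1];
}

/* Return the index of the leftmost 1, or -1 if the vector contains no 1.
   Runs in O(log n): descend into the leftmost child whose subtree has a 1. */
long bit_tree_find_first_one(const struct bit_tree *t)
{
    if (!t->tree[1])
        return -1;
    size_t i = 1;
    while (i < t->n)
        i = t->tree[2 * i] ? 2 * i : 2 * i + 1;
    return (long)(i - t->n);
}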
Figure: A Bloom filter. A key is mapped to k different bit positions using k hash functions (H1, H2, H3 and H4 in the figure), and the corresponding bits in an array of m bits, all of which are initially 0, are set to 1.
To check whether a key is present in the set, we map it to k bit positions using the same k hash functions. Next, we inspect the bits at all the k bit positions. If all of them are
1, then the key may be present in the set. The reason we use the phrase “may
be” is because it is possible that half the bits were set because of a key x and the other half were set because of another key y. It is not possible to find out if this
is indeed the case. Hence, the answer that we get in this case is a probabilistic
“Yes”.
However, when one of the bits is 0, we can be sure that the associated key
is definitely not present. If it had been present, all the bits would have been 1
for sure. We shall find very interesting uses for such data structures.
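A minimal sketch of a Bloom filter is shown below. The array size, the number of hash functions and the hash family (a seeded FNV-style hash) are arbitrary choices made purely for illustration.

#include <stdbool.h>
#include <stdint.h>

#define M_BITS 1024   /* size of the bit array (m) */
#define K_HASH 4      /* number of hash functions (k) */

static uint8_t filter[M_BITS / 8];   /* all bits are initially 0 */

/* A seeded FNV-style hash; the k-th function just uses a different seed */
static uint32_t hash_k(const char *key, int k)
{
    uint32_t h = 2166136261u + (uint32_t)k * 0x9e3779b9u;
    while (*key) {
        h ^= (uint8_t)*key++;
        h *= 16777619u;
    }
    return h % M_BITS;
}

/* Add a key: set the k bits chosen by the k hash functions */
void bloom_add(const char *key)
{
    for (int k = 0; k < K_HASH; k++) {
        uint32_t bit = hash_k(key, k);
        filter[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Returns true if the key *may* be present (false positives are possible),
   and false if it is definitely absent */
bool bloom_query(const char *key)
{
    for (int k = 0; k < K_HASH; k++) {
        uint32_t bit = hash_k(key, k);
        if (!(filter[bit / 8] & (1u << (bit % 8))))
            return false;
    }
    return true;
}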
Note that such a data structure has numerous shortcomings. We cannot
delete entries. Naively setting the k bits associated with a key to 0 will not
work. It is possible that there are multiple keys that map to a subset of these
bits. All of them will get removed, which is something that we clearly do not
want. One option is to store a counter at each entry instead of a bit. When
a key is added to the set, we just increment all the associated counters. This
is fine as long as we do not have overflows. One of the important reasons for
opting for a Bloom filter is its compactness. This advantage will be lost if we
start storing large counters in each entry. Of course, removing a key is easy –
just decrement the associated k counters. Nevertheless, the overheads can be
sizeable and the benefits of compactness will be lost.
The other issue is that bits get flipped in only one direction, 0 to 1. They
never get flipped back because we do not do anything when an element is re-
moved. As a result, the Bloom filter becomes full of 1s with the passage of time.
There is thus a need to periodically reset it.
Bibliography
[Belady et al., 1969] Belady, L. A., Nelson, R. A., and Shedler, G. S. (1969).
An anomaly in space-time characteristics of certain programs running in a
paging machine. Communications of the ACM, 12(6):349–353.
[Corbet, 2010] Corbet, J. (2010). The case of the overly anonymous anon vma.
Online. Available at: https://lwn.net/Articles/383162/.
[Corbet, 2014] Corbet, J. (2014). Locking and pinning. Online. Available at:
https://lwn.net/Articles/600502/.
[Cormen et al., 2009] Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein,
C. (2009). Introduction to Algorithms. MIT Press, third edition.
[Fornai and Iványi, 2010a] Fornai, P. and Iványi, A. (2010a). Fifo anomaly is
unbounded. Acta Univ. Sapientiae, 2(1):80–89.
[Fornai and Iványi, 2010b] Fornai, P. and Iványi, A. (2010b). Fifo anomaly is
unbounded. arXiv preprint arXiv:1003.1336.
[Herlihy and Shavit, 2012] Herlihy, M. and Shavit, N. (2012). The Art of Mul-
tiprocessor Programming. Elsevier.