Osbook V0.50
Kernel-Oriented Approach
(Partially written.
Updates are released every week on Fridays.)
Send bug reports/suggestions to
[email protected]
Version 0.50
Smruti R. Sarangi
List of Trademarks
• Microsoft and Windows are registered trademarks of Microsoft Corporation.
Contents
1 Introduction
1.1 Types of Operating Systems
1.2 The Linux OS
1.2.1 Versions, Statistics and Conventions
3 Processes
3.1 The Notion of a Process
3.2 The Process Descriptor
3.2.1 struct task_struct
3.2.2 struct thread_info
3.2.3 Task States
3.2.4 Kernel Stack
3.2.5 Task Priorities
3.2.6 Computing Actual Task Priorities
3.2.7 sched_info
3.2.8 Memory Management
3.2.9 Storing Virtual Memory Regions
3.2.10 The Process Id
3.2.11 Processes and Namespaces
3.2.12 File System, I/O and Debugging Fields
3.2.13 The PTrace Mechanism
Chapter 1
Introduction
• Programs share hardware such as the CPU, the memory and storage devices. These devices have to be fairly allocated to different programs based on user-specified priorities. The job of the OS is to perform this fair resource allocation.
Figure 1.1: Diagram of the overall system (CPU, memory, hard disk and I/O devices)
Figure 1.2: Place of the OS in the overall system (programs P1, P2 and P3 run on the operating system, which manages the hardware: CPUs, memory, and I/O & storage devices)
• There are common resources in the system, which multiple programs may try to access concurrently. There is a need to regulate such concurrent accesses so that they remain disciplined and a resource is not used in parallel by multiple programs unless that was the original intention.
• Different devices have different methods and protocols for managing them. It is essential to speak their language and ensure that high-level commands are translated to device-level commands. This responsibility cannot be placed on normal programs. Hence, we need specialized programs within the OS (device drivers) whose sole job is to interact with devices.
Definition 1
An operating system (OS) works as a CPU manager, memory manager,
device manager and storage manager. Its job is to arbitrate accesses to these
resources and ensure that programs execute in a secure fashion and their
performance is maximized subject to power and temperature constraints.
In this case, the port size and the memory footprint of the OS need to be very small.
server operating system. People started taking it seriously and many academic
groups started moving away from UNIX to adopt Linux. Note that Linux was
reasonably similar to UNIX in terms of the interface and some high-level design
decisions. The year 2003 was a pivotal year for Linux, because this year, Linux
kernel version 2.6 was released. It had a lot of changes and was very differ-
ent from the previous kernel versions. After this, Linux started being taken
very seriously in both academic and enterprise circles. In a certain sense, it
had entered the big league. Many companies sprang up that started offering
Linux-based offerings, which included the kernel bundled with a set of pack-
ages (software programs) and also custom support. Red Hat®, SUSE® and Ubuntu® (Canonical®) were some of the major vendors that dominated the scene. As of writing this book, circa 2023, these continue to be major Linux vendors. Since 2003, a lot of other changes have also happened. Linux has found many new applications – it has made major inroads into the mobile and handheld market. The Android operating system, which as of 2023 dominates the mobile operating system space, is based on Linux. Many of the operating systems for smart devices and other wearable gadgets are also based on Android. In addition, Google®'s Chrome OS is also a Linux-derived variant, as are other operating systems for smart TVs such as LG®'s webOS and Samsung®'s Tizen.
As of today, Linux is not the only free open-source operating system. There are many others that are derived from classical UNIX, notably the Berkeley Software Distribution (BSD) variants. Some of the important ones are FreeBSD, OpenBSD and NetBSD. Similar to Linux, their code is also free to use and distribute. Of course, they follow a different licensing mechanism. However, they are also very good operating systems in their own right. They have their niche markets and they have a large developer community that actively adds features and ports them to new hardware. The paper by Singh et al. [] nicely compares the three operating systems in terms of performance for different workloads (circa 2015).
[Figure: Linux kernel version numbering. Example: 6.2.12. A version has the form x.y.z-rc<num>, where x is the major version number, z is the patch number, and the optional -rc<num> suffix denotes a release candidate (test release). Odd numbers have traditionally indicated development versions.]
result, the code base for these directories is quite large. The other subsystems
for the memory and security modules are comparatively much smaller.
Figure 1.5 shows the list of prominent directories in the Linux kernel. The
kernel directory contains all the core features of the Linux kernel. Some of the
most important subsystems are the scheduler, time manager, synchronization
manager and debugging subsystem. It is by far the most important subsystem of the core kernel. We will focus a lot on this subsystem.
We have already seen the arch directory. A related directory is the init
directory that contains all the booting code. Both these directories are hardware
dependent.
The mm, fs, block and io_uring directories contain important code for the memory subsystem, file system and I/O modules. These modules are closely related to the virtualization code in the virt directory, which allows the OS to run as a regular program on top of another OS. The virtualization subsystem is tightly coupled with the memory, file and I/O subsystems.
Finally, the largest directory is drivers, which contains drivers (specialized programs that talk to devices) for a host of I/O devices. Of course, there are multiple things to be considered here. We don't want to include the code of every single device on the planet in the code base of the kernel; it would become too large. At the same time, bundling drivers has an advantage: the kernel can seamlessly run on a variety of hardware. Hence, the developers of the kernel need to wisely choose the set of drivers that they include in the code base that is released and distributed. The devices should be very popular and the drivers should be deemed to be safe (devoid of security issues).
Chapter 2
Basics of Computer Architecture
[Figure: The memory hierarchy – core, caches, main memory]
memories inside the chip such as the L1, L2 and L3 caches are not visible to the
OS and for all practical purposes, the OS is oblivious of them. Some ISAs have
specialized instructions that can flush certain levels of the cache hierarchy either
fully or partially. Sometimes even user applications can use these instructions.
However, this is the only notable exception. Otherwise, we can safely assume that almost all software, including privileged software like the operating system, is unaware of the caches. Let us live with the assumption that the highest level of memory that an OS can see or access is the main memory.
A software program, including the OS, perceives the memory space as one large array of bytes. Any location in this space can be accessed and modified at will. Of course, later on when we discuss virtual memory,
we will have to change this abstraction. But, even then, large parts of this
abstraction will continue to hold.
In the x86 ISA, the expression mov %eax, 4(%esp) stores the value in the eax register into the memory location whose address is the value of the esp register plus 4. As we can see, registers are ubiquitous. They are used to access memory addresses and, as we shall see later, I/O addresses as well. Let us differentiate between CISC and RISC processors here. RISC processors tend to use registers much more than CISC processors, which allow a fair number of memory operands in their instructions.
In any case, registers are central to the operation of any program (be it RISC
or CISC), and needless to say the compiler needs to be aware of them.
2.1.3 Registers
General Purpose Registers
Let us look at the space of registers in some more detail. All the registers that
regular programs use are known as general purpose registers. These are visible to
all software including the compiler. Note that almost all the programs that are
compiled today will use registers and the author is not aware of any compilation
model or any architectural model that does not rely on registers.
Privileged Registers
A core also has a set of registers known as privileged registers, which only the
OS can see. In Chapter 8, we will also look at hypervisors or virtual machine
managers (VMMs) that also run with OS privileges. All such software is known as system software or privileged-mode software. It is given special treatment by the CPU and can access privileged registers.
For instance, an ALU has a flags register that stores its state, especially the outcomes of instructions that have executed in the past, such as comparison instructions. Often these flags registers are not visible to regular application-level software. However, they are visible to the OS and anything else that runs with OS privileges, such as VMMs. It is necessary to access these registers to enable multitasking: running multiple programs on a core one after the other.
We also have control registers that can enable or disable specific hardware features such as the fan and the LED lights on the chassis, or even turn off the system itself. We do not want all of these capabilities to be visible to regular programs because then a single application could create havoc. Because of this, we entrust only a specific set of programs (the OS and the VMM) with access to these registers.
There are debug registers that are meant to debug hardware and system
software. Given the fact that they are privy to more information and can be
used to extract information out of running programs, we do not allow regular
programs to access these registers. Otherwise, there will be serious security
violations. However, from a system designer’s point of view or from the OS’s
point of view these are very important. This is because they give us an insight
into how the system is operating before and after an error is detected – this can
potentially allow us to find the root cause of bugs.
Finally, we have I/O registers that are used to communicate with externally
placed I/O devices such as the monitor, printer, network card, etc. Here again,
we need privileged access. Otherwise, we can have serious security violations
and different applications may try to monopolize an I/O resource and not allow
other applications to access them. Hence, the OS needs to act as a broker, and this is possible only if access to the device is restricted.
Given the fact that we have discussed so much about privileged registers, we
should now see how the notion of privileges is implemented and how we ensure
that only the OS and related system software such as the VMM can have access
to privileged resources such as the privileged registers.
[Figure: Privilege rings – application code runs in Ring 3 and the OS kernel runs in Ring 0]
We will ask an important question here and answer it when we shall discuss
virtual machines in Chapter 8. What happens when application code or code at
a lower privilege level (higher ring) accesses instructions that should be executed
by code at a higher privilege level (lower ring)? In general, we would expect that
there will be an exception. Then the appropriate exception handler will take
over and take appropriate action. If this is the case, we shall see in Chapter 8
that writing a virtual machine is reasonably easy. However, there are a lot of
instructions in the instruction sets of modern processors that do not show this
behavior. Their behavior is far more confusing and pernicious. They yield dif-
ferent results when executed in different modes without generating exceptions.
We shall see that handling such instructions is quite difficult and that is why
the design of virtual machines is actually quite complicated. The main reason for this is that when these instructions were originally defined while designing the ISA, virtual machines were not around, and the designers could not foresee them. As a result, they thought that having such polymorphic instructions (instructions that change their behavior based on the ring level) was a good idea. When virtual machines started gaining prevalence, this turned out to be a huge problem.
System Call If an application needs some service from the OS such as creating
a file or sending a network packet, then it cannot use the conventional
mechanism, which is to make a function call. OS functions cannot directly
be invoked by the application. Hence, there is a need to generate a dummy interrupt such that the same set of actions takes place as when an external interrupt is received. In this case, a specialized system call handler takes over and satisfies the request made by the application.
Signal A system call is a message that is sent from the application to the OS.
A signal is the reverse. It is a message that is sent from the OS to the
application. An example of this would be a key press. In this case, an
interrupt is generated, which is processed by the OS. The OS reads the
key that was pressed, and then figures out the process that is running
in the foreground. The value of this key needs to be communicated to
this process. The signal mechanism is the method that is used. In this
case, a function registered by the process with the OS to handle a “key
press” event is invoked. The running application process then gets to know
that a certain key was pressed and depending upon its logic, appropriate
action is taken. A signal is basically a callback function that an application
registers with the OS. When an event of interest happens (pertaining to
that signal), the OS calls the callback function in the application context.
This callback function is known as the signal handler.
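To make this concrete, the short user-space sketch below registers a handler for the SIGINT signal (delivered, for example, when the user presses Ctrl+C) using the standard POSIX sigaction interface. The handler name and the printed message are our own placeholders; the point is simply that the kernel invokes the registered callback in the application's context when the event occurs.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Signal handler (callback) invoked by the OS when SIGINT is delivered,
   e.g., when the user presses Ctrl+C in the terminal. */
static void on_sigint(int signum) {
    /* Only async-signal-safe functions should be called here;
       write() is safe, printf() is not. */
    const char msg[] = "Received SIGINT (Ctrl+C key press)\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_handler = on_sigint;     /* register our callback */
    sigemptyset(&sa.sa_mask);      /* do not block other signals while handling */
    sa.sa_flags = SA_RESTART;      /* restart interrupted system calls */

    if (sigaction(SIGINT, &sa, NULL) == -1) {
        perror("sigaction");
        return 1;
    }

    for (;;)
        pause();                   /* sleep until a signal arrives */
}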
As we can see, communicating with the OS does require some novel and un-
conventional mechanisms. Traditional methods of communication that include
writing to shared memory or invoking functions are not used because the OS
runs in a separate address space and also switching to the OS is an onerous
activity. It also involves a change in the privilege level and a fair amount of
bookkeeping is required at both the hardware and software levels, as we shall
see in subsequent chapters.
As we can see, all that we need to do is load the number of the system call in the rax register. The syscall instruction subsequently
does the rest. We generate a dummy interrupt, store some data corresponding
to the state of the executing program (for more details, refer to [Sarangi, 2021])
and load the appropriate system call handler. An older approach is to directly
generate an interrupt itself using the instruction int 0x80. Here, the code 0x80
stands for a system call. However, as of today, this method is not used for x86
processors.
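As an illustration, the user-space sketch below issues system calls through glibc's generic syscall() wrapper, which on x86-64 ultimately places the system call number in rax and executes the syscall instruction. The choice of SYS_write and SYS_getpid here is just for demonstration.

#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_write, SYS_getpid */
#include <unistd.h>
#include <stdio.h>

int main(void) {
    /* Invoke the write system call directly via the generic syscall()
       wrapper instead of the usual library function. */
    const char msg[] = "hello from a raw system call\n";
    syscall(SYS_write, STDOUT_FILENO, msg, sizeof(msg) - 1);

    /* The same mechanism works for any system call, e.g., getpid. */
    long pid = syscall(SYS_getpid);
    printf("pid = %ld\n", pid);
    return 0;
}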
The state of the running program is known as its context. Whenever we have an interrupt, exception or a system call, there is a need to store the context,
jump to the respective handler, finish some additional work in the kernel (if
there is any), restore the context and start the original program at exactly the
same point. The caveat is that all of this needs to happen without the explicit
knowledge of the program that was interrupted. Its execution should be identical
to a situation where it was not interrupted by an external event. Of course, if
the execution has led to an exception or system call, then the corresponding
event/request will be handled. In any case, we need to return back to exactly
the same point at which the context was switched.
[Figure 2.3: The context of a running program – the general purpose registers, the flags and special registers, the PC and the memory state]
Figure 2.3 shows an overview of the process of storing the context of a running program. The state of the running program comprises the contents of the
general purpose registers, contents of the flags and special purpose registers,
the memory and the PC (program counter). Towards the end of this chapter,
we shall see that the virtual memory mechanism stores the memory state very
effectively. Hence, we need not bother about storing and restoring the mem-
ory state because there is already a mechanism namely virtual memory that
takes care of it completely. Insofar as the remaining three elements are concerned, we can think of all of them as the volatile state of the program that is erased when there is a context switch. As a result, a hardware mechanism is
needed to read all of them and store them in memory locations that are known
a priori. We shall see that there are many ways of doing this and there are
specialized/privileged instructions that are used.
For more details about what exactly the hardware needs to do, readers can
refer to the computer architecture text by your author [Sarangi, 2021]. In the
example pipeline in the reference, the reader will appreciate the need for having
specialized hardware instructions for automatically storing the PC, the flags and
special registers, and possibly the stack pointer in either privileged registers or
a dedicated memory region. Regardless of the mechanism, we have a known
location where the volatile state of the program is stored and it can later on be
retrieved by the interrupt handler. Note that for the sake of readability, we will
use the term interrupt handler to refer to a traditional interrupt handler as well
as exception handlers and system call handlers wherever this is clear from the
context.
Subsequently, the first task of the interrupt handler is to retrieve the program
state of the executing program – either from specialized registers or a dedicated
memory area. Note that these temporary locations may not store the entire
state of the program, for instance they may not store the values of all the
general purpose registers. The interrupt handler will thus have to do more work and retrieve the full program state. In any case, the role of the interrupt
handler is to collect the full state of the executing program and ultimately store
it somewhere in memory, from where it can easily be retrieved later.
Restoring the context of a program is quite straightforward. We need to
follow the reverse sequence of steps.
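The sketch below shows, in simplified form, what the saved context of an interrupted x86-64 program might look like. The layout and field names are illustrative only (the structure that plays this role in the Linux kernel is struct pt_regs); the intent is to show that the volatile state is just a handful of register values that can be written to and later read back from memory.

#include <stdint.h>

/* A simplified sketch of the saved context of an interrupted program on
   x86-64. The exact layout below is illustrative, not the kernel's. */
struct saved_context {
    /* general purpose registers */
    uint64_t rax, rbx, rcx, rdx;
    uint64_t rsi, rdi, rbp, rsp;
    uint64_t r8, r9, r10, r11, r12, r13, r14, r15;

    uint64_t rip;      /* program counter of the interrupted instruction */
    uint64_t rflags;   /* flags register (condition codes, interrupt flag, ...) */
    uint64_t cs, ss;   /* code and stack segment selectors */
};

/* Restoring the context amounts to writing these values back into the
   hardware registers (using privileged instructions) and resuming
   execution at the saved rip. */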
The life cycle of a process can thus be visualized as shown in Figure 2.4. The
application program executes, it is interrupted for a certain duration during which the OS takes over, and then the application program is resumed at the point at which
it was interrupted. Here, the word “interrupted” needs to be understood in a
very general sense. It could be a hardware interrupt, a software interrupt like a
system call or an exception.
Figure 2.4: The life cycle of a process (active and interrupted phases) – execution phases of the process alternate with phases where the OS and other processes execute, separated by context switches
Timer Interrupts
Question 1
Assume we have a situation, where we have a single-core machine and the
program that is running on the core is purely computational in nature. It
does not make any system calls and it also does not lead to any exceptions.
Furthermore, assume that there is no hardware or I/O activity and therefore
no interrupts are generated. In such a situation, the process that is running
on the core can potentially run forever unless it terminates on its own. Does
that mean that the entire system will remain unresponsive till this process
terminates? We will have a similar problem on a multicore machine where
there are k cores and k regular processes on them, where no events of interest
are generated.
This is a very fundamental question in this field. We cannot always rely on system calls, exceptions and interrupts (events of interest) to bring in the operating system. This is because, as we have shown in Question 1, it is indeed
possible that we have a running program that does not generate any events of
interest. In such a situation, when the OS is not running, an answer that the
OS will somehow swap out the current process and load another process in its
place is not correct. A core can run only one process at a time, and if it is
running a regular application process, it is not running the OS. If the OS is not
running on any core, it cannot possibly act.
[Figure: A timer chip periodically sends timer interrupts to the cores of the CPU]
This is where the timer interrupt comes in: it arrives even when there is no other event of interest. All platforms that
support an operating system need to have a timer chip. It is arguably the
most integral part of the machine that supports an operating system. The key
insight is that this is needed for ensuring that the system is responsive and it
periodically executes the OS code. The operating system kernel has full control
over the processes that run on cores, the memory, storage devices and I/O
systems. Hence, it needs to run periodically such that it can effectively manage
the system and provide a good quality of experience to users.
We divide time into jiffies, where there is a timer interrupt at the end of every jiffy. The number of jiffies (the jiffy count) is incremented by one when a timer interrupt is received. The duration of a jiffy has been reducing over the
course of time. It used to be 10 ms in the Linux kernel around a decade ago
and as of 2023, it is 1 ms. It can be controlled by the compile time parameter
HZ. If HZ=1000, it means that the duration of a jiffy is 1 ms. We do not want
a jiffy to be too long, otherwise the system will take a fair amount of time to
respond. Simultaneously, we also do not want it to be too short, otherwise a lot
of time will be spent in servicing timer interrupts.
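The arithmetic is straightforward, as the toy user-space program below shows. Here HZ is defined by us purely for illustration; in the kernel it is a compile-time configuration parameter, and helpers with a similar role (such as jiffies_to_msecs()) already exist.

#include <stdio.h>

#define HZ 1000   /* 1000 timer interrupts per second => 1 ms per jiffy */

/* Convert a number of jiffies to milliseconds. */
static unsigned long jiffies_to_ms(unsigned long j) {
    return j * 1000UL / HZ;
}

/* Convert milliseconds to the corresponding number of jiffies. */
static unsigned long ms_to_jiffies(unsigned long ms) {
    return ms * HZ / 1000UL;
}

int main(void) {
    printf("duration of one jiffy = %lu ms\n", jiffies_to_ms(1));
    printf("250 ms corresponds to %lu jiffies\n", ms_to_jiffies(250));
    return 0;
}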
Inter-processor interrupts
As we have seen, the OS gets invoked on one core and now its job is to take
control of the system and basically manage everything including running pro-
cesses, waiting processes, cores, devices and memory. Often there is a need to ascertain whether a process has been running for a long time and whether it needs to be swapped out. If there is a need to swap it out, then the OS
always chooses the most eligible process (using its scheduler) and runs it on a
core.
If the new process runs on the core on which the OS is executing, then it is
simple. All that needs to be done is that the OS needs to load the context of
the process that it wants to run. However, if a process on some other core needs to be swapped out and replaced with the chosen process, then the procedure is more elaborate. It is necessary to send an interrupt to that core such that the OS starts running on it. There is a mechanism to do so – it is called an inter-processor interrupt (IPI). Almost all processors today, particularly all multicore processors, have a facility to send an IPI to any core with support from the hardware's interrupt controller. Relevant kernel routines frequently use such APIs to run the OS on a given core. The OS may choose to do more management and bookkeeping activities or quickly find the next process to run and run it on that core.
will entertain many complex corner cases and managing memory will be very
difficult. We are in search of simple abstractions.
We would like to be able to assume that the address space is 2^32 or 2^64 bytes and still manage to run on physical systems with far less memory. We thus have a compatibility problem here, where we want our program to assume that addresses are n bits wide (typically 32 or 64 bits), yet run on machines with all memory sizes (typically much lower than the theoretical maximum).
Definition 2 Processes assume that they can access any byte in large memory regions of size 2^32 or 2^64 bytes at will (for 32-bit and 64-bit systems,
respectively). Even if processes are actually accessing very little data, there
is a need to create a mechanism to run them on physical machines with far
lower memory (let’s say a few GBs). The fact is that the addresses they
assume are not compatible with physical addresses (on real machines). This
is the compatibility problem.
Definition 3 Unless adequate steps are taken, it is possible for two pro-
cesses to access overlapping regions of memory and also it is possible to get
unauthorized access to other processes’ data by simply reading values that
they write to memory. This is known as the overlap problem.
What we can see is that the memory map is partitioned into distinct zones.
The memory map starts from address zero. Then after a fixed offset, the text
section starts, which contains all the program’s instructions. The processor
starts executing the first instruction at the beginning of the text section and
then starts fetching subsequent instructions as per the logic of the program.
Once the text section ends, the data section begins. It stores initialized data
that comprises global and static variables that are typically defined outside the
scope of functions. After this, we have the bss (block starting symbol) section
that stores the same kind of variables, however they are uninitialized. Note
that each of these sections in the memory map is basically a range of memory
addresses and this range varies from process to process. It is possible that one
process has a very small data section and another process has a very large data
section – it all depends upon how the program is written.
Then we have the heap and the stack. The heap is a memory region that
stores dynamically allocated variables and data structures, which are typically
allocated using the malloc call in C and the new call in C++ and Java. Tra-
ditionally, the heap section has grown upwards (towards increasing addresses).
As and when we allocate new data, the heap size increases. It is also possible
for the heap size to decrease as we free or dynamically delete allocated data
structures. Then there is a massive hole, which basically means that there is
a very large memory region that doesn’t store anything. Particularly, in 64-bit
machines, this region is indeed extremely large.
Next, at a very high memory location (0xC0000000 in 32-bit Linux), the
stack starts. The stack typically grows downwards (grows towards decreasing
addresses). Given the fact that there is a huge gap between the end of the heap
and the top of stack, both of them can grow to be very large. If we consider the
value 0xC0000000, it is actually 3 GB. This basically means that on a 32-bit
system, an application is given 3 GB of memory at most. This is why the stack
section starts at this point. Of course, one can argue that if the size of the stack,
heap and other sections combined exceeds 3 GB, we shall run out of space. This
indeed can happen and that is why we typically use a 64-bit machine where the
likelihood of this happening is very low because our programs are not that large
at the moment.
The last unanswered question is what happens to the one GB that is remaining (recall that 2^32 bytes = 4 GB)? This is a region that is typically assigned
to the operating system kernel for storing all of its runtime state. As we shall
see in later chapters, there is a need to split the address space between user
applications and the kernel.
Now, the interesting thing is that all processes share the same structure of the memory map. This means that the chances of them destructively interfering with each other are even higher because most variables will have similar addresses:
they will be stored in roughly the same region of the memory map. Even if two
processes are absolutely innocuous (harmless), they may still end up corrupting
each other’s state, which is definitely not allowed. As a result, ensuring a degree
of separation is essential. Another point that needs to be mentioned with regards
to the kernel memory is that it is an invariant across process memory maps. It
is something that pretty much remains constant and in the case of a 32-bit
system, occupies the top one GB of the memory map of every process. In a certain sense, processes assume that their range of operation is the first 3 GB and that the top one GB is beyond their jurisdiction.
The advantage of having a fixed memory map structure is that it is very easy
to generate code, binaries can also have a fixed format that is correlated with
the memory map and operating systems know how to layout code and data in
memory. Regardless of the elegance, simplicity and standardization, we need to
solve the overlap problem. Having a standard memory map structure makes this
problem worse because now regardless of the process, the variables are stored
in roughly the same set of addresses. Therefore, the chances of destructive
interference become very high. Additionally, this problem creates a security
nightmare.
Let us look at a simple implementation of this idea. Assume that we have two
registers associated with each process: base and limit. The base register stores
the first address that is assigned to a process and the limit register stores the last
address. Between base and limit, the process can access every memory address.
In this case, we are constraining the addresses that a process can access and via
this we are ensuring that no overlap is possible. We observe that the value of
the base register need not be known to the programmer or the compiler. All that either of them has to specify is the difference between limit and base (the maximum number of bytes a process can access).
The first step is to find a free memory region when a process is loaded. Its
size needs to be more than the maximum size specified by the process. The
starting address of this region is set as the contents of the base register. An address computed by the CPU is basically an offset, which is added to the contents of the base register. The moment the CPU sends an address to the memory system, the base register of the currently running process is added to it. Note that the contents of the base register thus vary depending upon the process. In
this system, if the process accesses an address that is beyond the limit register,
then a fault is generated. A graphical description of the system is shown in
Figure 6.1.
[Figure 6.1: Memory regions allocated to different processes, with holes (unallocated regions) in between]
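The following toy program models this base-limit scheme in software. The structure and function names are ours; real hardware performs the addition and the limit check on every memory access.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* A toy model of base-limit translation. Each process has a base and a
   limit register; the hardware adds the base to every offset generated
   by the CPU and faults if the offset exceeds the limit. */
struct process_regs {
    uint64_t base;    /* first physical address assigned to the process */
    uint64_t limit;   /* maximum offset the process may access          */
};

/* Returns the physical address, or terminates with a "fault" if the
   access is out of bounds. */
static uint64_t translate(const struct process_regs *p, uint64_t offset) {
    if (offset > p->limit) {
        fprintf(stderr, "fault: offset 0x%llx beyond limit 0x%llx\n",
                (unsigned long long)offset, (unsigned long long)p->limit);
        exit(EXIT_FAILURE);
    }
    return p->base + offset;
}

int main(void) {
    struct process_regs p1 = { .base = 0x100000, .limit = 0xFFFF };
    printf("offset 0x10 -> physical 0x%llx\n",
           (unsigned long long)translate(&p1, 0x10));
    translate(&p1, 0x20000);   /* out of bounds: triggers a fault */
    return 0;
}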
We can clearly see that there are many processes and they have their mem-
ory regions clearly demarcated. Therefore, there is no chance of an overlap.
This idea does seem encouraging but this is not going to work in practice for
a combination of several reasons. The biggest problem is that neither the programmer nor the compiler knows for sure how much memory a program requires at run time. This is because for large programs, the user inputs are not known
and thus the total memory footprint is not predictable. Even if it is predictable,
we will have to budget for a very large footprint (conservative maximum). In
most cases, this conservative estimate is going to be much larger than the mem-
ory footprints we may see in practice. We may thus end up wasting a lot of
memory. Hence, in the memory region that is allocated to a process between
the base and limit registers, there is a possibility of a lot of memory getting
wasted. This is known as internal fragmentation.
Let us again take a deeper look at Figure 6.1. We see that there are holes or
unallocated memory regions between allocated memory regions. Whenever we
want to allocate memory for a new process, we need to find a hole that is larger
than what we need and then split it into an allocated region and a smaller hole.
Very soon we will have a large number of these holes in the memory space, which
cannot be used for allocating memory to any other process. It may be the case
that we have enough memory available but it is just that it is partitioned among
so many processes that we do not have a contiguous region that is large enough.
This situation where a lot of memory is wasted in such holes is known as external
fragmentation. Of course, there are many ways of solving this problem. Some may argue that we can periodically compact the memory space by reading data and transferring it to a new region, updating the base and limit registers of each process. In this case, we can essentially merge the holes into one large hole and create enough free space. Of course, the problem is that a lot of reads
and writes will be involved in this process and during that time the process
needs to remain mostly stalled.
Another problem is that the prediction of the maximum memory usage may
be wrong. A process may try to access memory that is beyond the limit register.
As we have argued, in this case a fault is generated. However, this can be avoided
if we allocate another memory region and link the second memory region to the
first (using a linked list like structure). The algorithm now is that we first
access the memory region that is allocated to the process, and if the offset is beyond the limit register, then we access a second memory region. The second memory region will also have its own base and limit registers. We can
extend this idea and create a linked list of such memory regions. We can also
save time by having a lookup table. It will not be necessary to traverse linked
lists. Given an address, we can quickly figure out in which memory region it
lies. Many of the early approaches focused on such techniques and they grew to become very complex, but soon the community realized that this is not a scalable solution and it is definitely not elegant.
A few ideas emerge from this discussion. Given a virtual address, there
should be some sort of a table that we can lookup and find the physical address
that it maps to. Clearly, one virtual address will always be mapped to one
physical address. This is a common sense requirement. However, if we can
also ensure that every physical address maps to one virtual address, or in other
words there is a strict one-to-one mapping, then we observe that no overlaps between processes are possible. Regardless of how hard a process tries, it will
not be able to access or overwrite the data that belongs to any other process
in memory. In this case we are using the term data in the general sense – it
encompasses both code and data. Recall that in the memory system, code is
actually stored as data.
to the view of the real system. On a real system, we will see all of these problems because we have real-world constraints, namely other concurrently running programs and a small physical address space. We would thus like to formally
introduce the term “Virtual Memory” here, which is simply defined as an ab-
straction of the memory space that the user perceives. The user in this case
could be the programmer, compiler, running process or even the CPU. The
virtual memory abstraction also incorporates a method to translate virtual ad-
dresses into actual physical addresses such that all three of our problems are
solved.
The crux of the entire definition of virtual memory (see Definition 6) is that
we have a mapping table that maps each virtual address (that is used by the
program) to a physical address. If the mapping satisfies some conditions, then we can solve all three problems. So the main technical challenge in front
of us is to properly and efficiently create the mapping table to implement an
address translation system.
actually fetch 64 bytes because that is the block size. This ensures that when
we access data that is nearby, it is already available within the same block.
Something similar needs to be done here as well. We clearly cannot maintain
mapping information at the byte level – we will have to maintain a lot of infor-
mation and this is not a scalable solution. We thus need to create blocks of data
for the purpose of mapping. In this space, it has been observed that a block of
4 KB typically suits the needs of most systems very well. This block of 4 KB
is known as a page in the virtual memory space and as a frame or a physical
page in the physical memory space. Consequently, the mapping problem aims to map a page (in virtual memory) to a frame (in physical memory).
Figure 2.9: Conceptual overview of the virtual memory based page mapping system (each process's virtual pages are mapped to physical frames)
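A naive realization of this mapping is a flat table with one entry per virtual page, as in the user-space toy below (the tiny 20-bit address space and all names are our own choices). The discussion that follows explains why such a flat table does not scale to 48-bit or 64-bit address spaces.

#include <stdint.h>
#include <stdio.h>

/* A toy, single-level mapping table for a 20-bit virtual address space
   with 4 KB pages: 2^20 / 2^12 = 256 pages. */
#define PAGE_SHIFT   12
#define PAGE_SIZE    (1u << PAGE_SHIFT)
#define NUM_PAGES    256

static struct { uint64_t frame; int valid; } page_table[NUM_PAGES];

/* Translate a virtual address to a physical address.
   Returns 0 on success and -1 if no valid mapping exists. */
static int translate(uint32_t vaddr, uint64_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;        /* virtual page number  */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);    /* byte within the page */
    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return -1;                                /* page fault / illegal */
    *paddr = (page_table[vpn].frame << PAGE_SHIFT) | offset;
    return 0;
}

int main(void) {
    page_table[3].frame = 0xABCD;                 /* map virtual page 3   */
    page_table[3].valid = 1;
    uint64_t pa;
    if (translate(0x3FF0, &pa) == 0)
        printf("virtual 0x3FF0 -> physical 0x%llx\n", (unsigned long long)pa);
    return 0;
}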
space is actually not used. In fact, it is quite sparse particularly between the
stack and the heap, which can actually be quite large. This problem is still
manageable for 32-bit memory systems, especially if we don’t have a lot of con-
currently running processes. However, if we consider a 64-bit memory system,
then the page table storage overhead is prohibitively large and clearly this idea
will not work. Hence, we need a far more efficient way of storing our mappings. We need to take a careful look at the memory map of a process and understand the structure of the sparsity to design a better page table (refer to Section 2.2.1).
Begin by noting that a very small part of the virtual address space is actually populated. The beginning of the virtual address space is populated with the text, data, bss and heap sections. Then there is a massive gap. Finally, the stack
is situated at the highest end of the allowed virtual memory addresses. There
is nothing in between. Later on we will see that other memory regions such as
memory mapped files can occupy a part of this region. But still we will have
large gaps and thus there will be a significant amount of sparsity. This insight
can be used to design a multilevel page table, which can leverage this pattern.
[Figure: A 48-bit virtual address is split into four 9-bit indices (bits 48-40, 39-31, 30-22 and 21-13) and a 12-bit intra-page offset. The top 16 bits of the VA are assumed to be zero. The CR3 register points to the Level 1 table, and the walk through Levels 1-4 produces a 52-bit frame address.]
memory. Hence, most practical systems as of 2023, use a 48-bit virtual address.
That is sufficient. The top 16 (MSB) bits are assumed to be zero. We can
always break this assumption and have more levels in a multilevel page table.
This is seldom required. Let us thus proceed assuming a 48-bit virtual address.
We however assume a full 64-bit physical address in our examples. Note that
the physical address can be as wide as possible because we are just storing a
few additional bits per entry – we are not adding new levels in the page table.
Given that 12 bits are needed to address a byte in a 4 KB page, we are left with
52 bits. Hence, a physical frame number is specified using 52 bits. Figure 2.11
shows the memory map of a process assuming that the lower 48 bits of a memory
address are used to specify the virtual memory address.
[Figure 2.11: Memory map of a process with 48-bit virtual addresses; the stack is located near the top of the address space, close to address 2^48 - 1]
In our 48-bit virtual address, we use the bottom 12 bits to specify the address
of the byte within the 4 KB page. Recall that 2^12 bytes = 4 KB. We are left
with 36 bits. We partition them into four blocks of 9 bits each. If we count from
1, then these are bit positions 40-48, 31-39, 22-30 and 13-21. Let us consider the
topmost level, i.e., the top 9 bits (bits 40-48). We expect the least amount of
randomness in these bits. The reason is obvious. In any system with temporal
and spatial locality, we expect most addresses to be close by. They may vary
in their lower bits, however, in all likelihood their more significant bits will be
the same. To cross-check, count from 0 to 999 in the decimal number system. How frequently does the unit's digit change? It changes with every number.
The ten’s digit on the other hand changes more infrequently. It changes after
every 10 numbers, and the hundred’s digit changes even more infrequently. It
changes once for every 100 numbers. By the same logic, when we consider binary
addresses we expect the more significant bits to change far less often than less
significant bits.
Now, coming to the top-level bits again (bits 40-48), we observe that 9 bits can be used to access 2^9 (= 512) entries. Let us create a Level 1 page table with 512 entries that is indexed by these 9 bits.
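The bit manipulation itself is simple, as the small sketch below shows. The example address and the variable names are ours; the bit positions follow the convention above (counting from 1).

#include <stdint.h>
#include <stdio.h>

/* Split a 48-bit virtual address into four 9-bit page table indices and a
   12-bit page offset, mirroring the decomposition described above. */
int main(void) {
    uint64_t vaddr = 0x00007F1234567ABCULL;     /* an example 48-bit address */

    unsigned offset = vaddr & 0xFFF;            /* bits 1-12  (intra-page)   */
    unsigned l4     = (vaddr >> 12) & 0x1FF;    /* bits 13-21 (Level 4 index)*/
    unsigned l3     = (vaddr >> 21) & 0x1FF;    /* bits 22-30 (Level 3 index)*/
    unsigned l2     = (vaddr >> 30) & 0x1FF;    /* bits 31-39 (Level 2 index)*/
    unsigned l1     = (vaddr >> 39) & 0x1FF;    /* bits 40-48 (Level 1 index)*/

    printf("L1=%u L2=%u L3=%u L4=%u offset=0x%x\n", l1, l2, l3, l4, offset);
    return 0;
}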
Let us now solve the size problem. The problem is to run a program with a
large memory footprint on a machine with inadequate physical memory. The
solution is quite simple. We reserve a region of memory on a storage device such
as the hard disk or a flash drive, or even the hard disk of a machine accessible
over the network. This reserved region is known as the swap space.
Whenever we need more space than what physical memory can provide, we
take up space in the swap space. Frames can be resident either on physical
memory or in the swap space. However, for them to be usable, they need to be
brought into main memory.
Let us now go over the process. The processor computes the virtual address
based on the program logic. This address is translated to a physical address
using the TLB. If a valid translation exists, then the physical address is sent to
the memory system: instruction cache or L1 data cache. The access traverses
the memory system until it reaches the main memory, which is guaranteed to
have the data. However, in the rare case when an entry is not there in the TLB,
we record a TLB miss. There is a need to access the page table, which is a slow
process.
If the page table has a valid translation (frame in main memory), then there
is a need to first bring this translation into the TLB. Note that most modern
processors cannot use the translation directly. They need to add it to the TLB
first, and then reissue the memory instruction. The second time around, the
translation will be found in the TLB. Of course, if a new translation is added,
a need may arise to evict an earlier entry. An LRU scheme can be followed to
realize this.
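The overall flow can be summarized by the simplified sketch below. The TLB is modeled as a small array with round-robin replacement rather than true LRU, and page_table_lookup() is a stand-in for the multilevel page table walk; both simplifications are ours.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 16

struct tlb_entry { uint64_t vpn, pfn; bool valid; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Stand-in for the multilevel page table walk: identity-map the first
   1024 pages purely for illustration. */
static bool page_table_lookup(uint64_t vpn, uint64_t *pfn) {
    if (vpn < 1024) { *pfn = vpn; return true; }
    return false;
}

static bool tlb_lookup(uint64_t vpn, uint64_t *pfn) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) { *pfn = tlb[i].pfn; return true; }
    return false;
}

static void tlb_insert(uint64_t vpn, uint64_t pfn) {
    static int victim = 0;                       /* round-robin eviction */
    tlb[victim] = (struct tlb_entry){ vpn, pfn, true };
    victim = (victim + 1) % TLB_ENTRIES;
}

/* Translate a virtual page number to a physical frame number:
   check the TLB, on a miss walk the page table, install the
   translation and retry. */
static bool translate_vpn(uint64_t vpn, uint64_t *pfn) {
    if (tlb_lookup(vpn, pfn))
        return true;                             /* TLB hit: fast path       */
    if (!page_table_lookup(vpn, pfn))
        return false;                            /* page fault / illegal addr */
    tlb_insert(vpn, *pfn);                       /* install the translation  */
    return tlb_lookup(vpn, pfn);                 /* reissue the access        */
}

int main(void) {
    uint64_t pfn;
    if (translate_vpn(7, &pfn))
        printf("VPN 7 -> PFN %llu\n", (unsigned long long)pfn);
    return 0;
}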
If a translation is not found in the page table, then we can have several
situations. The first is that the address is illegal. Then, of course, an exception needs to be raised. However, it is possible that the entry indicates that the
frame is not in memory. We can have a single bit for this, where 1 may indicate
that the frame is in main memory and 0 may indicate that the frame is in
the swap space of the hard disk. However, it is possible that we don’t have
a single swap space but a bunch of swap spaces or the swap space is on some
other device such as the USB drive or on a machine that is accessible over the
network. Hence, a page table entry can additionally store the location of the
frame and the device that contains it. The device itself can have a complex
description that could be a combination of an IP address and a device id. All
of this information is stored in the page table entry. However, it is not there in
the TLB entry because in this case we assume that the frame is there in main
memory. Let us look at some of the other fields that are there in a page table
entry notably the permission bits.
The first and foremost is security. We can, for instance, prohibit any data access from touching a code page. This is easy to do. We have seen that in the memory map the lower addresses are code addresses, and once the code section ends, the data region begins. The data sections store read-only constants and values stored on the heap. Any data address is a positive offset from the location stored in the data segment register. This means that a data page address will always be greater than any code page address, and thus it is not possible for a data access to modify the regions of memory that store instructions. Most malware tries to access the code section and change instructions so that it can hijack the program and make it do what the attacker wants. Segmentation is an easy way of preventing such attacks. In
many other attacks, it is assumed that the addresses of variables in the virtual
address space are known. For example, the set of attacks that try to modify
return addresses stored on the stack need to know the memory address at which
the return address is stored. Using the stack segment register it is possible to
obfuscate these addresses and confuse the attacker. In every run, the operating
system can randomly set the contents of the stack segment register. This is
known as stack obfuscation. The attacker will thus not be able to guess what is
stored in a given address on the stack – it will change in every run. Note that
program correctness will not be hampered because a program is compiled in a
manner where it is assumed that all addresses will be computed as offsets from
the contents of their respective segment registers.
There are some other ingenious uses as well. It is possible to define small
memory regions that are private to each core (per-core regions). The idea here
is that it is often necessary to store some information in a memory region that
is private to each core and should not be accessed by processes running on other
cores, notably kernel processes. These regions are accessed by kernel threads
mainly for the purposes of memory management and scheduling. An efficient
way of implementing this is by associating an unused segment register with such
a region. All accesses to this region can then use this segment register as the
base address. The addresses (read offsets) can be simple and intuitive such as
0, 4, 8, 12, etc. In practice, these offsets will be added to the contents of the
segment register and the result will be a full 64-bit virtual address.
[Figure: The x86 segment registers – cs, ds, ss, es, fs and gs]
time other than some rare instances when a process is loaded either for the first
time or after a context switch.
Similar to a TLB miss, if there is a miss in the SDC, then there is a need to
search in a larger structure for a given segment register belonging to a process.
In older days there used to be an LDT (local descriptor table) and a global
descriptor table (GDT). We could think of the LDT as the L1 level and the
GDT as the L2 level. However, nowadays the LDT is mostly not used. If there
is a miss in the SDC, then a dedicated piece of hardware searches for the value in
the GDT, which is a hardware structure. Of course, it also has a finite capacity,
and if there is a miss there then an interrupt is raised. The operating system
needs to populate the GDT with the correct value. It maintains all the segment
register related information in a dedicated data structure.
[Figure 2.13: Segmented addressing – the segment register selects an entry in the SDC (backed by the GDT), and the base address stored there is added to generate the virtual memory address]
As seen in Figure 2.13, the base address stored in the relevant segment reg-
ister is added to the virtual address. This address further undergoes translation
to a physical address before it can be sent to the physical memory system.
convenient, safe and comes with some performance guarantees. Along with
software support in the OS, we shall see that we also need to add a fair amount
of hardware on the motherboard to ensure that the I/O devices are properly
interfaced. These additional chips comprise the chipset. The motherboard is
the printed circuit board that houses the CPUs, memory chips, the chipset and
the I/O interface chips and ports.
2.3.1 Overview
[Figure 2.14: A traditional chipset design – the CPU is connected to the Northbridge chip (which talks to the GPU and the PCI slots), and the Northbridge is connected to the Southbridge chip (which talks to the keyboard, mouse, USB ports and other I/O chips)]
Any processor chip has hundreds of pins. Complex designs have roughly 1000+ pins. Most of them are there to supply current to the chip: power and ground pins. The reason that we need so many pins is that modern processors draw a lot of current. A pin has a limited current delivery capacity.
However, a few hundred bits are typically left for communication with external
entities such as the memory chips, off-chip GPUs and I/O devices.
Memory chips have their dedicated memory controllers on-chip. These mem-
ory controllers are aware of the number of memory chips that are connected and
how to interact with them. This happens at the hardware level and the OS is
blissfully unaware of what goes on here. Depending on the motherboard, there
could be a dedicated connection to an off-chip GPU. An ultra-fast and high
bandwidth connection is required to a GPU that is housed separately on the
motherboard. Such buses (sets of copper wires) have their own controllers that
are typically on-chip.
Figure 2.14 shows a traditional design where the dedicated circuitry for com-
municating with the main memory modules and the GPU are combined together
into a Northbridge chip. The Northbridge chip traditionally used to be resident on the motherboard (outside the processor). However, in most modern processors today, the Northbridge logic has moved into the processor chip itself. It is much faster for
the cores and caches to communicate with an on-chip component. Given that
both the main memory and GPU have very high bandwidth requirements, this
design decision makes sense. Alternative designs are also possible where the
Northbridge logic is split into two and is placed at different ends of the chip.
One part communicates with the GPU and the other part communicates with
the memory modules.
To communicate with other, slower I/O devices such as the keyboard, mouse and hard disk, a dedicated controller chip called the Southbridge chip is used. In most modern designs, this chip is outside the processor – it is placed on the motherboard. The simplest way is to have a Northbridge-Southbridge connection. However, this is not mandatory. There could be a separate connection to the Southbridge chip, and in high-performance implementations, we can have the Southbridge logic inside the CPU chip. Let us however stick to the simplistic design shown in Figure 2.14.
The Southbridge chip is further connected to dedicated chips in the chipset
whose job is to route messages to the large number of I/O devices that are
present in a typical system. In fact, we can have a tree of such chips, where
messages are progressively routed to the I/O devices through the different levels
of the tree. For example, the Southbridge chip may send a message to the PCI-X chip, which subsequently sends the message down the PCI-X buses to the target I/O device. The Southbridge chip may also choose to send a message
to the USB ports, and a dedicated controller may then route the message to the
specific USB port that the message is meant to be sent to.
The question that we need to answer is how do we programmatically interact
with these I/O ports? It should be possible for assembly programs to read and
write from I/O ports easily. There are several methods in modern processors.
There is a tradeoff between the ease of programming, latency and bandwidth.
operation.
If we dive in further, we observe that an in instruction is a message that is sent to the chip on the motherboard that is directly connected to the I/O device. The job of this chip is to further interpret the instruction and send device-level commands to the device. It is the chip on the motherboard that knows which device the message needs to be sent to. The OS need not concern itself with such low-level details.
For example, a small chip on the motherboard knows how to interact with USB
devices. It handles all the I/O. It just exposes a set of I/O ports to the CPU
that are accessible via the in/out ports. Similar is the case for out instructions,
where the device drivers simply write data to I/O ports. The corresponding
chip on the motherboard knows how to translate this to device-level commands.
Using I/O ports is the oldest method to realize I/O operations and has
been around for the last fifty years. It is however a very slow method and the
amount of data that can be transferred is very little. Also, for transferring a
small amount of data (1-4 bytes), there is a need to issue a new I/O instruction.
This method is alright for control messages but not for data messages in high-bandwidth devices like network cards. There is a need for a faster method.
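On x86 Linux, user programs can experiment with port-based I/O through the in/out wrappers exposed by <sys/io.h>, provided they first obtain permission with ioperm() and run with root privileges. The sketch below reads the seconds register of the legacy CMOS real-time clock through ports 0x70/0x71; it is meant purely as an illustration of the port interface, not as a recommended way to read the time.

/* Must be compiled for x86 Linux and run with root privileges. */
#include <stdio.h>
#include <sys/io.h>

int main(void) {
    /* Request access to I/O ports 0x70 and 0x71 (the legacy CMOS/RTC ports). */
    if (ioperm(0x70, 2, 1) != 0) {
        perror("ioperm (are you root?)");
        return 1;
    }
    outb(0x00, 0x70);                  /* select CMOS register 0: RTC seconds */
    unsigned char seconds = inb(0x71); /* read the selected register          */
    printf("RTC seconds register = 0x%02x (BCD)\n", seconds);
    return 0;
}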
[Figure 2.15: Memory-mapped I/O – a region of the virtual address space is mapped to I/O device ports]
The faster method is to directly map regions of the virtual address space to an I/O device. Insofar as the OS is concerned, it makes regular reads
and writes. The TLB however stores an additional bit indicating that the page
is an I/O page. The hardware automatically translates memory requests to I/O
requests. There are several advantages of this scheme (refer to Figure 2.15).
The first is that we can send a large amount of data in one go. The x86
architecture has instructions that allow the programmer to move hundreds of
bytes between addresses in one go. These instructions can be used to trans-
fer hundreds of bytes or a few kilobytes to/from I/O space. The hardware
can then use fast mechanisms to ensure that this happens as soon as possible.
This would mean reading or writing a large amount of data from memory and
communicating with I/O devices.
On the processor's side, we can clearly see the advantage. All that we need is a few instructions to transfer a large amount of data. This reduces the instruction processing overhead on the CPU and keeps the program simple. I/O devices and chips in the chipset have also evolved to support
memory-mapped I/O. Along with their traditional port-based interface, they
are also incorporating small memories that are accessible to other chips in the
chipset. The data that is the process of being transferred to/from I/O devices
can be temporarily buffered in these small memories.
A combination of these technologies makes memory-mapped I/O very efficient. Hence, it is very popular as of 2023. In many reference manuals, it is conveniently referred to by its acronym, MMIO.
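To get a feel for memory-mapped I/O from user space, one can mmap() a device's register region on Linux, for example a PCI BAR exposed under sysfs as resource0. The device path below is a placeholder and the register offsets are made up; real device drivers perform the equivalent mapping inside the kernel (e.g., with ioremap()).

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* Placeholder path: a PCI BAR exposed by sysfs on Linux. */
    const char *res = "/sys/bus/pci/devices/0000:00:02.0/resource0";
    int fd = open(res, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* Map one 4 KB page of device registers into our virtual address space. */
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    uint32_t status = regs[0];         /* a plain load becomes an I/O read   */
    regs[1] = 0x1;                     /* a plain store becomes an I/O write */
    printf("device register 0 = 0x%x\n", status);

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}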
[Figure 2.16: DMA-based I/O – the CPU programs the DMA chip, which performs the transfer and then interrupts the CPU when the transfer is done]
Even though memory-mapped I/O is much more efficient than the older
method that relied on primitive instructions and basic I/O ports, it turns out
that we can do far better. Even in the case of memory-mapped I/O, the proces-
sor needs to wait for the load-store instruction that is doing the I/O to finish.
Given that I/O operations take a lot of time, the entire pipeline will fill up and
the processor will remain stalled until the outstanding I/O operations complete.
Of course, one simple solution is that we do the memory mapped I/O operations
in smaller chunks; however, some part of this problem will still remain. We can
also remove write operations from the critical path and assume that they are
done asynchronously. Still the problem of slow reads will be there.
Our main objective here is that we would like to do other work while I/O
operations are in progress. We can extend the idea of asynchronous writes to
also have asynchronous reads. In this model, the processor does not wait for
the read or write operation to complete. The key idea is shown in Figure 2.16,
where there is a separate DMA (direct memory access) chip that effects the
transfers between the I/O device and memory. The CPU basically outsources
the I/O operation to the DMA chip. The chip is provided the addresses in
memory as well as the addresses on the I/O device along with the direction of
data transfer. Subsequently, the DMA chip initiates the process of data transfer.
In the meanwhile, the CPU can continue executing programs without stalling.
Once the DMA operation completes, it is necessary to let the OS know about
it.
Hence, the DMA chip issues an interrupt, the OS comes into play and then
it realizes that the DMA operation has completed. Since user programs cannot
directly issue DMA requests, they instead just make system calls and let the
OS know about their intent to access an I/O device. This interface can be kept
simple primarily because it is only the OS’s device drivers that interact with
the DMA chip. When the interrupt arrives, the OS knows what to do with it
and how to signal the device drivers that the I/O operation is done and they
can either read the data that has been fetched from an I/O device or assume
that the write has completed. In many cases, it is important to let the user
program also know that the I/O operation has completed. For example, when
the printer successfully finishes printing a page the icon changes from “printing
in progress” to “printing complete”.
To summarize, in this section we have seen three different approaches for
interacting with I/O devices. The first approach is also the oldest approach
where we use old-fashioned I/O ports. This is a simple approach, especially when we are performing extremely low-level accesses and are not reading or writing a lot of data. I/O ports are still used today, mainly for interacting with the BIOS (boot firmware), simple devices like LEDs, and embedded systems. For everything else, this method has mostly been replaced by memory-mapped
I/O (MMIO). MMIO is easy for programmers and it leverages the strength
of the virtual memory system to provide a very convenient and elegant interface
for device drivers. Also, another advantage is that it is possible to implement a
zero-copy mechanism where if some data is read from an I/O device, it is very
easy to transfer it to a user program. The device driver can simply change the
mapping of the pages and map them to the user program after the I/O device
has populated the pages. Consequently, there is no necessity to read data from
an I/O device into pages that are accessible only to the OS and then copy all
the data once again into user pages. Such double copying is inefficient; a key strength of memory-mapped I/O is that it is not required.
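As a rough illustration of the difference between the two styles, the snippet below (a sketch only; the port number, physical base address and register offset are made-up placeholders) uses the Linux kernel's port-I/O and MMIO helper functions:

#include <linux/io.h>      /* ioremap, iounmap, readl */
#include <asm/io.h>        /* inb */

#define DEV_PORT       0x64          /* hypothetical I/O port */
#define DEV_MMIO_BASE  0xfed00000UL  /* hypothetical MMIO physical base address */

static u32 read_device_status(void)
{
    /* Old style: a dedicated port-I/O instruction (in). */
    u8 port_status = inb(DEV_PORT);

    /* MMIO style: map the device registers into the kernel's virtual
     * address space and access them with ordinary loads. */
    void __iomem *regs = ioremap(DEV_MMIO_BASE, 0x1000);
    u32 mmio_status = readl(regs + 0x04);  /* hypothetical status register at offset 4 */
    iounmap(regs);

    (void)port_status;  /* only the MMIO value is used in this toy example */
    return mmio_status;
}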
Subsequently, we looked at a method which provides much more bandwidth
and also does not stall the CPU. This is known as DMA (direct memory access).
Here, the entire role of interacting with I/O devices is outsourced to an off-chip
DMA device; it simply interrupts the CPU once the I/O operation completes.
After that the device driver can take appropriate action, which also includes
letting the user program know that its I/O operation has been completed.
Chapter 3
Processes
destruction. Specifically, we will look at the fork and exec system calls. Using
the fork system call, we can clone an existing process. Then, we can use
the exec family of calls to superimpose the image of a new process on top of
the currently running process. This is the standard mechanism by which new
processes are created in Linux.
Finally, we will discuss the context switch mechanism in a fair amount of
detail. We shall first introduce the types of context switches and the state
that the kernel needs to maintain to suspend a running process and resume it
later. We shall then understand that suspension and resumption is different
for different kinds of processes. For instance, if we are running an interrupt
handler, then certain rules apply, whereas if we are running a regular program, some other rules apply.
Field                                   Description
struct thread_info thread_info          Low-level information
uint state                              Process state
void *stack                             Kernel stack
int prio, static_prio, normal_prio      Priorities
struct sched_info sched_info            Scheduling information
struct mm_struct *mm, *active_mm        Pointers to memory information
pid_t pid                               Process id
struct task_struct *parent              Parent process
struct list_head children, sibling      Child and sibling processes
Other fields                            File system, I/O, synchronization and debugging fields
thread info used to be the heart of the task struct structure in older kernels.
However, it is on its way out now. It is a quintessential example of a low-level
data structure. We need to understand that high-level data structures such as
linked lists and queues are defined at the software level and their connections
with the real hardware are at best tenuous. They are usually not concerned with
the details of the machine, the memory layout or other constraints imposed
by the memory system. For instance, we typically do not think of word or
variable level alignment in cache lines, etc. Of course, highly optimized libraries
care about them but normal programmers typically do not concern themselves
with hardware-level details. However, while implementing an operating system,
it becomes very essential to align the fields of the data structure with actual
memory words such that they can be accessed very efficiently and even their
physical location in memory can be leveraged. For example, if it is known that
a given data structure always starts at a 4 KB page boundary, then it becomes
very easy to calculate the addresses of the rest of the fields or solve the inverse
problem – find the starting point of the data structure in memory given the
address of one of its fields. The thread info structure is a classic example of
this.
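As a small illustration of this inverse problem, the following sketch (with a hypothetical structure) recovers the start of a structure from the address of any of its fields, assuming the structure is always placed at a 4 KB boundary:

#include <stdint.h>

#define REGION_SIZE 4096UL   /* the assumed alignment: 4 KB */

struct page_local_info {     /* hypothetical structure allocated at a 4 KB boundary */
    uint64_t id;
    uint64_t flags;
};

/* Clear the low 12 bits of the field's address to get back to the boundary. */
static inline struct page_local_info *owner_of(const void *field_addr)
{
    return (struct page_local_info *)((uintptr_t)field_addr & ~(REGION_SIZE - 1));
}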
Before looking at the structure of thread info, let us describe the broad
philosophy surrounding it. The Linux kernel is designed to run on a large
number of machines that have very different instruction set architectures – in fact, some may be 32-bit architectures and some may be 64-bit architectures. Linux can even run on very small 16-bit machines. We thus want most of the kernel code to be independent of the machine type; otherwise, it will be very difficult to write the code. Hence, there is an arch directory in the kernel that
stores all the machine-specific code. The job of the code in this directory is to
provide an abstract interface to the rest of the kernel code, which is not machine
dependent. For instance, we cannot assume that an integer is four bytes on every
platform or a long integer is eight bytes on every platform. These things are
quite important for implementing an operating system because many a time we need precise control over the sizes and memory layouts of kernel data structures.
struct thread_info {
        ...
        /* current CPU */
        u32 cpu ;
};
This structure basically stores the current state of the thread, the state of
the executing system call and synchronization-related information. Along with
that, it stores another vital piece of information, which is the number of the
CPU on which the thread is running or is scheduled to run at a later point
in time. We shall see in later sections that finding the id of the current CPU
(and the state associated with it) is a very frequent operation and thus there is a
pressing need to realize it as efficiently as possible. In this context, thread info
provides a somewhat sub-optimal implementation. There are faster mechanisms
of doing this, which we shall discuss in later sections. Note that the reader needs to figure out from the context whether we are referring to a thread or a process. In most cases, it does not matter because
a thread is treated as a process. However, given that we allow multiple threads
or a thread group to also be referred to as a process (albeit, in limited contexts),
the term thread will more often be used because it is more accurate. It basically
refers to a single program executing as opposed to multiple related programs
(threads) executing.
[Figure: Task state transitions. A newly created task enters TASK_RUNNING (ready but not running); when the scheduler asks it to execute, it moves to TASK_RUNNING (currently running); the SIGSTOP signal moves it to TASK_STOPPED (stopped); when its execution finishes or it is terminated, it moves to TASK_ZOMBIE.]
Here is the fun part in Linux. A task that is currently running and a task that is ready to run have the same state: TASK RUNNING. There are historical reasons for this, as well as simple common-sense reasons related to efficiency. We are basically saying that a task that is ready to run and one that is running have the same state and thus, in a certain sense, are indistinguishable.
This little trick allows us to use the same queue for maintaining all such tasks
that are ready to run or are currently running. We shall see later that this
simplifies many design decisions. Furthermore, if there is a context switch,
then there is no need to change the status of the task that was swapped out.
Of course, someone may argue that using the same state (TASK RUNNING)
introduces ambiguity. To a certain extent it is true, but it does simplify a lot of
things and does not appear to be a big hindrance in practice.
Now it is possible that a running task may keep on running for a long time
and the scheduler may decide that it is time to swap it out so that other tasks
get a chance. In this case, the task is said to be “preempted”. This means that
it is forcibly displaced from a core (swapped out). However, it is still ready to
run, hence its state remains TASK RUNNING. Its place is taken by another task
– this process thus continues.
Let us look at a few other interactions. A task may be paused using the
SIGSTOP signal. Specifically, the kill system call can be used to send the stop
signal to a task. We can also issue the following command on the command line:
kill -STOP pid. Another approach is to send the SIGTSTP signal by pressing
Ctrl-z on the terminal. The only difference here is that this signal can be
ignored. Sometimes there is a need for doing this, especially if we want to run
the task at a later point of time when sufficient CPU and memory resources
are available. In this case, we can just pause the task. Note that SIGSTOP is a
special type of signal that cannot simply be discarded or caught by the process
that corresponds to this task. In this case, this is more of a message to the
kernel to actually pause the task. It has a very high priority. At a later point
of time, the task can be resumed using the SIGCONT signal. Needless to say,
the task resumes at the same point at which it was paused. The correctness of
the process is not affected unless it relies on some aspect of the environment
that possibly got changed while it was in a paused state. The fg command line
utility can also be used to resume such a suspended task.
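The same effect can be achieved programmatically. The following user-space sketch (error handling omitted) pauses another process and resumes it a little later:

#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

static void pause_and_resume(pid_t pid)
{
    kill(pid, SIGSTOP);   /* pause the task; this signal cannot be caught or ignored */
    sleep(5);             /* ... the system does something else in the meantime ... */
    kill(pid, SIGCONT);   /* resume the task exactly where it was paused */
}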
Let us now come to the two waiting states: INTERRUPTIBLE and UNINTERRUPTIBLE. A task enters one of these states when it requests some service, like reading from an I/O device, that is expected to take a lot of time. In the first state, INTERRUPTIBLE,
the task can still be resumed to act on a message sent by the OS, which we refer
to as a signal. For instance, it is possible for other tasks to send the interrupted
process a message (via the OS) and in response it can invoke a signal handler.
Recall that a signal handler is a specific function defined in the program that
is conceptually similar to an interrupt handler, however, the only difference is
that it is implemented in user space. In comparison, in the UNINTERRUPTIBLE
state, the task does not respond to signals.
Zombie Tasks
The process of wrapping up a task is quite elaborate in Linux. Recall that the
kernel has no way of knowing on its own when a task has completed. It is thus necessary to explicitly inform it by making the exit system call. However, a task’s state
is not cleaned up at this stage. Instead, the task’s parent is informed using the
SIGCHLD signal. The parent then needs to call the system call wait to read the
exit status of the child. It is important to understand that every time the exit
system call is called, the exit status is passed as an argument. Typically, the
value zero indicates that the task completed successfully. On the other hand,
a non-zero status indicates that there was an error. The status in this case
represents the error code.
Here again, there is a convention. The exit status ‘1’ indicates that there was an error; however, it does not provide any additional details. We can refer to
this situation as a non-specific error. Given that we have a structured hierarchy
of tasks with parent-child relationships, Linux explicitly wants every parent to
read the exit status of all its children. Until a parent task has read the exit
status of the child, the child remains a zombie task – neither dead nor alive.
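The following minimal user-space sketch shows this protocol: the child terminates with an exit status, and the parent reaps it with waitpid and reads the status, at which point the child ceases to be a zombie.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0)
        exit(1);                         /* child: terminate with a non-zero (error) status */

    int status;
    waitpid(pid, &status, 0);            /* parent: reap the child */
    if (WIFEXITED(status))
        printf("child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}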
[Figure 3.2: The kernel stack in older kernels. The thread_info structure is placed at the lowest address of the stack; the current pointer points to it, and thread_info in turn contains a pointer to the task_struct.]
For a long time, the kernel stack was limited to two pages, i.e., 8 KB. It
contained useful data about the running thread. These are basically per-thread
stacks. In addition, the kernel maintains a few other stacks, which are CPU
specific. The CPU-specific stacks are used to run interrupt handlers, for in-
stance. Sometimes we have very high-priority interrupts, and some interrupts cannot be ignored (masked) at all. The latter kind of interrupts are known as NMIs (non-maskable interrupts). This basically means that if a higher-priority interrupt arrives while we are executing an interrupt handler, we need to do a context switch and run the interrupt handler for the higher-priority interrupt.
There is thus a need to switch to a new interrupt stack. Hence, each CPU has
an interrupt stack table with seven entries. This means that Linux can han-
dle deeply nested interrupts (up to 7 levels). This is conceptually similar to the
regular context switch process for user-level tasks.
Figure 3.2 shows the structure of the kernel stack in older kernels. The
thread info structure was kept at the lowest address and there was a dedicated
current pointer that pointed to the thread info structure. This is a very
quick method of retrieving the thread info associated with the current task.
In fact from any stack address, we can quickly compute the address stored in the
current pointer using the fact that the starting address of thread info needs
to be a multiple of 8 KB. Simple bitwise operations on the address can be used
to find this value (left as an exercise for the reader). Once we get the address
of the thread info structure, we can get the pointer to the task struct.
The kernel stack as of today looks more or less the same. It is still limited to 8
KB in size. However, the trick of placing thread info at the lowest address and
using that to reference the corresponding task struct is not needed anymore.
We can use a better method that relies on segmentation. This is one of the
rare instances in which x86 segmentation proves to be extremely beneficial. It
provides a handy reference point in memory for storing specific data that is
highly useful and has the potential for being frequently used. Furthermore, to
use segmentation, we do not need any extra instructions (see Appendix A). The
segment information can be embedded in the memory address itself. Hence, this
part comes for free and the Linux kernel designers leverage this to the hilt.
Refer to the code in Listing 3.2. It defines a macro current that returns a
pointer to the current task struct via a chain of macros and inline functions (which behave like macros from a performance point of view). The most important
thing is that this pointer to the current task struct, which is accessible using the
current variable, is actually stored in the gs segment [Lameter and Kumar,
2014]. This serves as a dedicated region, which is specific to a CPU. We can
treat it as a per-CPU storage area that stores a lot of information that is relevant
to that particular CPU only.
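A simplified sketch of this mechanism (modeled on arch/x86/include/asm/current.h in older kernels; not the exact code of Listing 3.2) is shown below. The pointer to the current task struct lives in a per-CPU variable, and reading it compiles down to a single gs-relative mov instruction:

DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
    /* A single load of the form: mov %gs:current_task, %rax */
    return this_cpu_read_stable(current_task);
}

#define current get_current()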
Note that here we are using the term “CPU” as a synonym for a “core”.
This is Linux’s terminology. We can store a lot of important information in a dedicated per-CPU/per-core area, notably the current (task) variable, which
is needed very often. It is clearly a global variable insofar as the kernel code
running on the CPU is concerned. We thus want to access it with as few memory
accesses as possible. In our current solution with segmentation, we are reading
the variable with just a single instruction. This was made possible because of
the segmentation mechanism in x86. An astute reader can clearly make out that
this mechanism is more efficient than the earlier method that used a redirection
via the thread info structure. The slower redirection based mechanism is still
used in architectures that do not have support for segmentation.
There are many things to be learned here. The first is that for something as
important as the current task, which is accessed very frequently, and is often on
the critical path, there is a need to devise a very efficient mechanism. Further-
more, we also need to note the diligence of the kernel developers in this regard
and appreciate how much they have worked to make each and every mechanism
as efficient as possible – save memory accesses wherever and whenever possible.
In this case, several conventional solutions are clearly not feasible such as storing
the current task pointer in CPU registers, a privileged/model-specific regis-
ter (may not be portable), or even a known memory address. The issue with
storing this pointer at a known memory address is that it significantly limits
our flexibility in using the virtual address space and there are portability issues
across architectures. As a result, the developers chose the segmentation-based
method for x86 hardware.
There is a small technicality here. We need to note that different CPUs
(cores on a machine) will have different per-CPU regions. This, in practice,
can be realized very easily with this scheme because different CPUs have dif-
ferent segment registers. We also need to ensure that these per-CPU regions
are aligned to cache line boundaries. This means that a cache line is uniquely allocated to a per-CPU region; no cache line stores data corresponding to a per-CPU region along with other data. If this were not the case, we would have a lot of false sharing misses across the CPU cores, which would prove to be very detrimental
to the overall performance. Recall that false sharing misses are an artifact of
cache coherence. A cache line may end up continually bouncing between cores
if they are interested in accessing different non-overlapping chunks of that same
cache line.
Linux uses 140 task priorities. The priority range as shown in Table 3.2 is
from 0 to 139. The priorities 0-99 are for real-time tasks. These tasks are for
mission-critical operations, where deadline misses are often not allowed. The
scheduler needs to execute them as soon as possible.
The reason we have 100 different priorities for such real-time processes is
because we can have real-time tasks that have different degrees of importance.
We can have some that have relatively “soft” requirements, in the sense that
it is fine if they are occasionally delayed. Whereas, we may have some tasks
where no delay is tolerable. The way we interpret the priority range 0-99 is as
follows. In this space, 0 corresponds to the lowest-priority real-time task, and the task with priority 99 has the highest priority in the overall system.
Some kernel threads run with real-time priorities, especially if they are in-
volved in important bookkeeping activities or interact with sensitive hardware
devices. Their priorities are typically in the range of 40 to 60. In general, it is not advisable to have a lot of real-time tasks with very high priorities (more
than 60) because the system tends to become quite unstable. This is because the
CPU time is completely monopolized by these real-time tasks, resulting in the
rest of the tasks, including many OS tasks, not getting enough time to execute.
Hence, a lot of important kernel activities get delayed.
Now for regular user-level tasks, we interpret their priority slightly differ-
ently. In this case, the higher the priority number, the lower the actual priority. This basically means that in the entire system, the task with priority 139 has the lowest priority. On the other hand, the task with priority 100 has the highest
priority among all regular user-level tasks. It still does not have a real-time
priority but among non-real-time tasks it has the highest priority. The impor-
tant point to understand is that the way that we understand these numbers
is quite different for real-time and non-real-time tasks. We interpret them in
diametrically opposite manners in both the cases.
There are two concepts here. The first is the number that we assign in the
range 0-139, and the second is the way that we interpret the number as a task
priority. It is clear from the preceding discussion that the number is interpreted
differently for regular and real-time tasks. However, if we consider the kernel, it
needs to resolve the ambiguity and use a single number to represent the priority of a task. We would like to have some degree of monotonicity: either a lower value should always correspond to a higher priority or the reverse, but we never want a combination of the two in the actual kernel code.
This is exactly what is done in the code snippet that is shown in Listing 3.3. We
need to note that there are historical reasons for interpreting user and real-time
priority numbers at the application level differently, but in the kernel code this
ambiguity needs to be removed. We need to ensure monotonicity.
In line with this philosophy, let us consider the first else if condition that
corresponds to real-time tasks. In this case, the value of MAX RT PRIO is 100.
Hence, the range [0-99] gets translated to [99-0]. This basically means that the lower the value of prio, the greater the priority. We would want user-level priorities to be interpreted in a similar manner. Hence, let us proceed to the body of the else statement. Here, the macro NICE TO PRIO is used. Before expanding this macro, it is important to understand the notion of being nice in Linux.
The default user-level priority associated with a regular task is 120. Given
a choice, every user would like to raise the priority of her task to be as high
as possible. After all, everybody wants their task to finish quickly. Hence,
the designers of Linux decided (rightfully so) to not give users the ability to
arbitrarily raise the priorities of their tasks. Instead, they allowed users to do
the reverse, which was to reduce the priority of their tasks. It is a way to be
nice to others. There are many instances where it is advisable to do so. For
instance, there are many tasks that do routine bookkeeping activities. They
are not very critical to the operation of the entire system. In this case, it is
a good idea for users to be courteous and let the operating system know that
their task is not very important. The scheduler can thus give more priority to
other tasks. There is a formal method of doing this, which is known as the
nice mechanism. As the name suggests, the user can increase the priority value
from 120 to any number in the range 121-139 by specifying a nice value. The
nice value in this case is a positive number, which is added to the number 120.
The final value represents the priority of the process. The macro NICE TO PRIO
effects the addition – it adds the nice value to 120.
There is a mechanism to also have a negative nice value. This mechanism
is limited to the superuser, who is also known as the root user in Linux-based
systems. The superuser is a special kind of a user who has more privileges and
is supposed to play the role of a system administrator. She still does not have
kernel privileges but she can specify a negative nice value that is between -1 and -20. She still cannot raise the priority of a regular user-level process to that of a real-time process, but she can arbitrarily alter the priority of all
user-level processes. We are underscoring the fact that regular users who are
not superusers cannot access this facility. Their nice values are strictly positive
and are in the range [1-19].
Now we can fully make sense of the code shown in Listing 3.3. We have
converted the user or real-time priority to a single number prio. The lower it is, the greater the actual priority. This number is henceforth used throughout the
kernel code to represent actual task priorities. We will see that when we discuss
schedulers, the process priorities will be very important and shall play a vital
role in making scheduling decisions.
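The following sketch captures the conversion logic in a self-contained form (it is modeled on, but is not identical to, the kernel code in Listing 3.3):

#define MAX_RT_PRIO          100
#define DEFAULT_PRIO         120
#define NICE_TO_PRIO(nice)   ((nice) + DEFAULT_PRIO)

/* Fold the user-visible numbers into a single monotonic value:
 * the lower the returned prio, the higher the actual priority. */
static int compute_prio(int is_realtime, int rt_priority, int nice)
{
    if (is_realtime)
        return MAX_RT_PRIO - 1 - rt_priority;  /* rt_priority 0..99 maps to prio 99..0 */
    return NICE_TO_PRIO(nice);                 /* nice -20..19 maps to prio 100..139 */
}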
The structure sched info shown in Listing 3.4 contains some meta-information
about the overall scheduling process. The variable pcount denotes the number
of times this task has run on the CPU. run delay is the time spent waiting in
the run queue. Note that the run queue is a structure that stores all the tasks
whose status is TASK RUNNING. As we have discussed earlier, this includes
tasks that are currently running on CPUs as well as tasks that are ready to
run. Then we have a bunch of timestamps. The most important timestamps
are last arrival and last queued that store when a task last ran on a CPU and when it was last queued to run. In general, the unit of time within a CPU is
either in milliseconds or in jiffies (refer to Section 2.1.4).
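For reference, a simplified sketch of this structure (based on include/linux/sched.h; only the fields discussed above are shown) is as follows:

struct sched_info {
    unsigned long      pcount;        /* number of times the task has run on a CPU */
    unsigned long long run_delay;     /* time spent waiting in the run queue */
    /* Timestamps: */
    unsigned long long last_arrival;  /* when the task last ran on a CPU */
    unsigned long long last_queued;   /* when the task was last queued to run */
};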
Key components
Listing 3.5 shows the code of vm area struct that represents a contiguous
virtual memory region. As we can see from the code, it maintains the details
of each virtual memory (VM) region including its start and end addresses. It
also contains a pointer to the original mm struct. Let us now introduce the two
kinds of memory regions in Linux: anonymous and file-backed.
Anonymous Memory Region These are memory regions that are not mirrored or copied from (i.e., backed by) a file, such as the stack and the heap.
These memory regions are created during the execution of the process
and store dynamically allocated data. Hence, these are referred to as
anonymous memory regions. They have a dynamic existence and do not
have a file-backed copy and are not linked to specific sections in a binary
or object file.
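To make the discussion concrete, a heavily simplified sketch of the key fields of vm area struct is shown below (based on include/linux/mm_types.h; only a few representative fields are included):

struct vm_area_struct {
    unsigned long     vm_start;   /* first address of the region */
    unsigned long     vm_end;     /* first address beyond the end of the region */
    struct mm_struct *vm_mm;      /* the address space that the region belongs to */
    unsigned long     vm_flags;   /* permissions and properties of the region */
    struct file      *vm_file;    /* backing file; NULL for anonymous regions */
};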
Let us answer the first question here. We shall defer answering the rest of
the questions until we have explained some additional concepts. For mapping a
pid to a struct pid, we use a Radix tree (see Appendix C).
A natural question that will arise here is why not use a hash table? The
kernel developers conducted a lot of experiments and tried a large number of
data structures. They found that most of the time, processes share prefixes in
terms of their more significant digits. This is because, most of the time, the
processes that are active roughly have a similar pid range. As a result, if let’s say
we have looked up one process’s entry, the relevant part of the data structure is
still present in the processor’s caches. It can be used to quickly realize a lookup,
given that the subsequent pids are expected to share prefix digits. Hence, in
practice, such Radix trees were found to be faster than hash tables.
We need to bear in mind that a container, at any point of time, can be stopped, suspended and migrated to another machine. On the other machine, the container can be restarted. We want processes to continue to have the same process ids even on the new machine. This will ensure that processes execute oblivious to the fact that they have actually been migrated. Furthermore, if they would like
to communicate with each other, it is much more convenient if they retain the
same pids. This is where the notion of a namespace comes in handy.
In this case, a pid is defined only within the context of a namespace. Hence,
when we migrate a namespace along with its container, and then the container
is restarted on a remote machine, it is tantamount to reinstating the namespace
and all the processes within it. Given that we wish to operate in a closed
sandbox, they can continue to use the same pids and the operating system on
the target machine will respect them because they are running within their
separate namespace.
A namespace itself can be embedded in a hierarchy of namespaces. This is
done for the ease of management. It is possible for the system administrator to
provide only a certain set of resources to the parent namespace. Then the parent
namespace needs to appropriately partition them among its child namespaces.
This allows for fine-grained resource management and tracking.
struct pid_namespace {
        /* A Radix tree to store allocated pid structures */
        struct idr idr ;
        struct kmem_cache * pid_cachep ; /* pool of struct pid objects */
        unsigned int level ; /* depth in the namespace hierarchy */
        struct pid_namespace * parent ; /* parent namespace */
};
The code of struct pid namespace is shown in Listing 3.6. The most
important structure that we need to consider is idr (IDR tree). This is an
annotated Radix tree (of type struct idr) and is indexed by the pid. The
reason that there is such a sophisticated data structure here is because, in
principle, a namespace could contain a very large number of processes. Hence,
there is a need for a very fast data structure for storing and indexing them.
We need to understand that often there is a need to store additional data
associated with a process. It is stored in a dedicated structure called struct pid. The idr tree returns the pid structure for a given pid number. We need
to note that a little bit of confusion is possible here given that both are referred
to using the same term “pid”.
Next, we have a kernel object cache (kmem cache) or pool called pid cachep.
It is important to understand what a pool is. Typically, malloc and free calls for allocating and deallocating memory in C take a lot of time. There is also a need for maintaining a complex heap memory manager, which needs to find a hole of a suitable size for allocating a new data structure. It is a much better idea to have a set of pre-allocated objects of the same type in a 1D array called a
pool. It is a generic concept and is used in a large number of software systems
including the kernel. Here, allocating a new object is as simple as fetching it
from the pool and deallocating it is also simple – we need to return it back to
the pool. These are very fast calls and do not involve the action of the heap
memory manager, which is far slower. Furthermore, it is very easy to track
memory leaks. If we forget to return objects back to the pool, then in due
course of time the pool will become empty. We can then throw an exception,
and let the programmer know that this is an unforeseen condition and is most
likely caused by a memory leak. The programmer must have forgotten to return
objects back to the pool.
To initialize the pool, the programmer should have some idea about the
maximum number of instances of objects that may be active at any given point
of time. After adding a safety margin, the programmer needs to initialize the
pool and then use it accordingly. In general, it is not expected that the pool
will become empty because as discussed earlier it will lead to memory leaks.
However, there could be legitimate reasons for this to happen such as a wrong
initial estimate. In such cases, one of the options is to automatically enlarge the pool size up to a certain limit. Note that a pool can store only one kind of object. In almost all cases, it cannot contain two different types of objects.
Sometimes exceptions to this rule are made if the objects are of the same size.
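The kernel's pools are realized using the kmem cache interface. A short sketch of its use is shown below (the object type foo and the cache name are made-up examples):

#include <linux/slab.h>

struct foo {
    int x;
};

static struct kmem_cache *foo_cachep;

static int foo_pool_init(void)
{
    foo_cachep = kmem_cache_create("foo_cache", sizeof(struct foo),
                                   0, SLAB_HWCACHE_ALIGN, NULL);
    return foo_cachep ? 0 : -ENOMEM;
}

static struct foo *foo_alloc(void)
{
    return kmem_cache_alloc(foo_cachep, GFP_KERNEL);   /* fetch an object from the pool */
}

static void foo_free(struct foo *f)
{
    kmem_cache_free(foo_cachep, f);                    /* return the object to the pool */
}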
Next, we store the level field that indicates the level of the namespace.
Recall that namespaces are stored in a hierarchical fashion. This is why every namespace has a parent field.
Listing 3.7: The struct pid
source : include/linux/pid.h
struct upid {
int nr ; /* pid number */
struct pid_namespace * ns ; /* namespace pointer */
};
struct pid
{
        refcount_t count ;
        unsigned int level ;
        struct hlist_head tasks [ PIDTYPE_MAX ]; /* lists of tasks using this pid */
        struct upid numbers []; /* one ( pid number , namespace ) tuple per level */
};
Let us now look at the code of struct pid in Listing 3.7. As discussed
earlier, often there is a need to store additional information regarding a process,
which may possibly be used after the pid has been reused and the process has
terminated. The count field refers to the number of resources that are using the
process. Ideally, it should be 0 when the process is freed. Also, every process
has a default level, which is captured by the level field.
A struct pid can also be used to identify a group of tasks. This is why the tasks field, which holds lists of tasks, is maintained. The last field numbers is very
interesting. It is an array of struct upid data structures (also defined in
Listing 3.7). Each struct upid is a tuple of the pid number and a pointer
to the namespace. Recall that we had said that a pid number makes sense only within a given namespace. In other namespaces, the same process (identified with
struct pid) can have a different process id number (pid value). Hence, there
is a need to store such ⟨pid,namespace⟩ tuples.
Let us explain with an example shown in Figure 3.4. There are four pids: 1,
3, 4 and 7. Their binary representations are 001, 011, 100 and 111, respectively.
Given that we start from the most significant bit, we can create a Radix tree as
shown in Figure 3.4.
The internal nodes are shown as ovals and the leaf nodes are rectangles.
Each leaf node is uniquely identified by the path from the root leading to it (the
pid number). Each leaf node additionally points to the struct pid associated
with the process. Along with that, each leaf node implicitly points to a location in the bit vector. If a leaf node corresponds to pid 3 (011), then we can say that
it maps to the 3rd bit in the bit vector.
Figure 3.4: Example of an IDR tree (the leaves map to positions 0-7 of a bit vector)
Let us now interpret the IDR tree differently. Each leaf node corresponds to
a range of pid numbers. For instance, we can associate pid 1 with the indices
0 and 1 in the bit vector. Similarly, for pid 3, we can associate all the indices
after the previous pid 1. These will be indices 2 and 3. Similarly, for pid 4, we
can associate index 4 and for pid 7, we can associate 5, 6 and 7.
This is just an example; in general, we can associate with each leaf all the indices up to the next pid (in the ascending/preorder sequence of leaves). The exact convention that is used
does not matter. We are basically creating non-overlapping partitions of the bit
vector that are in sorted order.
We interpret the bit vector as follows. If a bit is equal to FREE (logical 1 in
the case of x86), then it means that the corresponding pid number is free, and
vice versa. Each internal node stores similar information as a van Emde Boas
tree (vEB tree). The aim is to find the index of the leftmost (lowest) entry in
the bit vector that stores FREE. Note that in this case a node can have more
than 2 children. For each child we store a bit indicating if it has a free entry in
the subtree rooted at it or not. Let us say that there are k children. The parent
stores a k-bit vector, where we start searching it from index 0 to k − 1. We
find the earliest entry whose status is FREE. The algorithm recursively proceeds
using the subtree rooted at the chosen child. At the end, the algorithm reaches
the leaf node.
Each leaf node corresponds to a contiguous region of bits in the bit vector.
This region can be very large especially if a lot of contiguous pids are deallocated.
Now, we need to note that there is no additional vEB tree that additionally
indexes this region. Other than the Radix tree nodes, we don’t have additional
nodes. Hence, we need a fast method to find the first location that contains a
FREE entry in a large contiguous chunk of bits in the bit vector.
Clearly, using a naive method that scans every bit sequentially will take a
lot of time. However, some smart solutions are possible here. We can start with
dividing the set of contiguous bits into chunks of 32 or 64 bits (depending on
the architecture). Let us assume that each chunk is 64 bits. We can typecast
this chunk to an unsigned long integer and compare it with 0. If the comparison
succeeds, then it means that all the bits are 0 and there is no 1 (FREE). If it is
non-zero, then it means that the 64-bit chunk has at least one 1. Fortunately,
x86 machines have an instruction called bsf (bit scan forward) that returns
the position of the first (least significant) 1. This is a very fast hardware in-
struction that executes in 1-2 cycles. The kernel uses this instruction to almost
instantaneously find the location of the first 1 bit (FREE bit).
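A sketch of this scan is shown below; it uses a GCC builtin that compiles down to the bsf (or tzcnt) instruction on x86:

#include <stdint.h>

/* Return the index of the first FREE (1) bit, or -1 if there is none. */
static long find_first_free(const uint64_t *words, long nwords)
{
    for (long i = 0; i < nwords; i++) {
        if (words[i] != 0)   /* at least one FREE bit in this 64-bit chunk */
            return i * 64 + __builtin_ctzll(words[i]);   /* position of the least significant 1 */
    }
    return -1;
}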
As soon as a FREE bit is found in the bit vector, it is set to 0, and the
corresponding pid number is deemed to be allocated. This is equivalent to
converting a 1 to a 0 in a vEB tree (see Appendix C). There is a need to traverse
the path from the leaf to the root and change the status of nodes accordingly.
Similarly, when a pid is deallocated, we convert the entry from 0 to 1 (FREE)
and appropriate changes are made to the nodes in the path from the leaf to the
root.
/* I/O device context */
struct io_context * io_context ;
An example of using the fork system call is shown in Listing 3.9. Here, the fork library call, which encapsulates the fork system call, is used. The fork library call returns a process id (pid in the code) after creating the child process.
Note that inside the code of the forking procedure, a new process is created,
which is a child of the parent process that made the fork call. It is a perfect
copy of the parent process. It inherits the parent’s code as well as its memory
state. In this case, inheriting means that all the memory regions and the state
is fully copied to the child. For example, if a variable x is defined to be 7 in the
code prior to executing the fork call, then after the call is over and the child
is created, both of them can read x. They will see its value to be 7. However,
there is a point to note here. The variable x is different for both the processes
– it has different physical addresses even though its starting value is the same
(i.e., 7). This means that if the parent changes x to 19, the child will still read
it to be 7 (because it possesses a copy of x and not x itself). Basically, the child
gets a copy of the value of x, not a reference to it. Even though the name x is
the same across the two processes, the variables themselves are different.
Now that we have clarified the meaning of copying the entire memory space,
let us look at the return value. Both the child and the parent will return from
the fork call. A natural question that can be asked here is that prior to executing
the fork call, there was no child, how can it return from the fork call? This is
the fun and tricky part. When the child is created deep in the kernel’s process
cloning logic, a full task along with all of its accompanying data structures is
created. The memory space is fully copied including the register state and the
value of the return address. The state of the task is also fully copied. Since all
the addresses are virtual, creating a copy does not hamper correctness. Insofar
as the child process is concerned, all the addresses that it needs are a part of
its address space. It is, at this point of time, indistinguishable from the parent.
The same way that the parent will eventually return from the fork call, the child
also will. The child will get the return address from either the register or the
stack (depending upon the architecture). This return address, which is virtual,
will be in its own address space. Given that the code is fully copied the child
will place the return value in the variable pid and start executing Line 8 in
Listing 3.9. The notion of forking a new process is shown in Figure 3.5.
Herein lies the brilliance of this mechanism – the parent and child are re-
turned different values.
The child is returned 0 and the parent is returned the pid of the child.
[Figure 3.5: Forking a new process. The original process calls fork() and a copy (the child process) is created; the parent is returned the child’s pid and the child is returned 0.]
This part is crucial because it helps the rest of the code differentiate between
the parent and the child. A process knows whether it is the parent process or the child process from the return value: 0 for the child and the child’s pid for the parent. After this happens, the child and parent go their separate ways. Based on the return value of the fork call, the if statement is used to differentiate between the child and the parent. Both can execute arbitrary code beyond this point and their behavior can completely diverge. In fact, we shall see that the
child can completely replace its memory map and execute some other binary.
However, before we go that far let us look at how the address space of one
process is completely copied. This is known as the copy-on-write mechanism.
Copy on Write
Figure 3.6(a) shows the copy-on-write mechanism. In this case, we simply copy
the page tables. The child inherits a verbatim copy of the parent’s page table
even though it has a different virtual address space. This mechanism ensures
that the same virtual address in both the child and parent’s virtual address space
points to the same physical address. This basically means that no memory is
wasted in the copying process and the size of the memory footprint remains
exactly the same. Note that copying the page table implies copying the entire
memory space including the text, data, bss, stack and heap. Other than the
return value of the fork call, nothing else differentiates the child and parent.
They can seamlessly read any address in their virtual address space and they
will get the same value. However, note that this is an implementation hack.
Conceptually, we do not have any shared variables. As we have discussed earlier,
if a variable x is defined before the fork call, after the call it actually becomes
two variables: x in the parent’s address space and x in the child’s address space.
It is true that to save space as well as for performance reasons, the same piece
of physical memory real estate is being used, however conceptually, these are
different variables. This becomes very clear when we consider write operations.
[Figure 3.6: (a) The parent and child sharing the same physical page; (b) after a write, the parent and child point to different physical copies of the page.]
This part is shown in Figure 3.6(b): upon a write, a new physical copy of the frame containing the variable that was written is created. The child and parent now have different mappings in their page tables; their page tables are updated to reflect this change. For the same virtual address, they now point to different physical addresses. Assume that the child initiated the write. Then, after a new physical copy of the page is created and the respective mapping is installed, the write is performed. In this case, the child writes to its own copy of the page. This write is not visible to the parent.
As the name suggests, this is a copy-on-write mechanism where the child and parent continue to use the same physical page (frame) as long as there is no write. This saves space and also allows us to fork a new process very quickly.
However, the moment there is a write, there is a need to create a new physical
copy of the page and realize the write on the respective copy (copy on write).
This does increase the performance overhead when it comes to the first write operation after a fork call; however, a lot of this overhead gets amortized and is
often not visible.
There are several reasons for this. The first is that the parent or child may
not subsequently write to a large part of the memory space. In that case, the
copy-on-write mechanism will never kick in. The child may overwrite its memory
image with that of another binary and continue to execute it. In this case also,
there will be no need for a copy-on-write. Furthermore, lazily creating copies
of pages, as and when there is a demand, distributes the overhead over a long period of time. Most applications can absorb this very easily. Hence, the fork
mechanism has withstood the test of time.
Details
Let us now delve into some of the details of the forking mechanism. We would
like to draw your attention to the file in the kernel that lists all the supported
system calls: include/linux/syscalls.h. It has a long list of system calls. How-
ever, the system calls of our interest are clone and vfork. The clone system call
is the preferred mechanism to create a new process or thread in a thread group.
It is extremely flexible and takes a wide variety of arguments. However, the
vfork call is optimized for the case when the child process immediately makes
an exec call. In this case, there is no need to fully initialize the child and copy
the page tables of the parent. Finally, note that in a multi-threaded process
(thread group), only the calling thread is forked.
Inside the kernel, all of them ultimately end up calling the copy process
function in kernel/fork.c. The signature of the function is as follows:
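Roughly, it looks like the following (a sketch based on recent kernels; the ellipsis stands for the additional arguments):

static struct task_struct *copy_process(struct pid *pid, int trace,
                                        int node, ...);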
Here, the ellipses . . . indicate that there are more arguments, which we are
not specifying for the sake of readability. The main tasks that are involved in
The exec family of system calls is used to achieve this. In Listing 3.10, an example is shown where the child process runs the execv library call. Its
arguments are a null-terminated string representing the path of the executable
and an array of arguments. The first argument by default is the file name
– “pwd” in this case. The next few arguments should be the command-line
arguments to the executable and the last argument needs to be NULL. There
are a bunch of library calls in the exec family. All of them wrap the exec system
call.
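A minimal sketch of this pattern (not the exact code of Listing 3.10) is shown below; the child replaces its image with the pwd binary:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    if (fork() == 0) {
        char *args[] = { "pwd", NULL };   /* args[0] is the file name; the list is NULL-terminated */
        execv("/bin/pwd", args);          /* on success, execv never returns */
        perror("execv");                  /* reached only if the exec failed */
    }
    return 0;
}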
There are many steps involved in this process. The first action is to clean
up the memory space (process map) of a process and reinitialize all the data
structures. We need to then load the starting state of the new binary in the
process’s memory map. This includes the contents of the text, data and bss
sections. Then there is a need to initialize the stack and heap sections as well as the starting value of the stack pointer. In general, connections to
external resources such as files and network addresses are preserved in an exec
call. Hence, there is no need to modify, clean up or reinitialize them. We can
now start executing the process with an updated memory map from the start
of the text section. This is like regular program execution. The fact that we are
starting from a forked process is conveniently forgotten.
1. All the general-purpose registers including the stack pointer and return
address register (if the architecture has one)
3. Segment registers
There are many other minor components of the hardware context in a large
and complex processor like an x86-64 machine. However, we have listed the
main components for the sake of readability. This context needs to be stored
and later restored.
It is important to discuss the two virtual memory related structures here
namely the TLB and page table. The TLB stores the most frequently (or
recently) used virtual-to-physical mappings. There is a need to flush the TLB
when the process changes, because the new process will have a new virtual
memory map. We do not want it to use the mappings of the previous process.
They will be incorrect and this will also be a serious security hazard because now
the new process can access the memory space of the older process. Hence, once
a process is swapped out, at least no other user-level process should have access
to its TLB contents. An easy solution is to flush the TLB upon a context switch.
However, as we shall see later, there is a more optimized solution, which allows
us to append the pid number to each TLB entry. This does not require the
system to flush the TLB upon a context switch. Instead, every process is made
to use only its own mappings. Because of the pid information that is present,
a process cannot access the mappings of any other process. This mechanism
mode. These are some of the minimal changes that need to be made. This method clearly enhances performance and is very lightweight in character.
Also, there is no need to flush the TLB or change the page table. If we
agree to use a different set of virtual addresses in kernel mode, then there is no
problem. This is because the kernel code that will be executed will only use
virtual addresses that are in the kernel space and have no overlap with user-level
virtual addresses. This kernel code is loaded separately when the soft switch
happens – it is not a part of the original binary. Hence, there is no need to
flush any TLB or freshly load any page table. For example, in 32-bit Linux, the
virtual address space was limited to 4 GB. User processes could only use the
lower 3 GB, and the kernel threads could only use the upper 1 GB. Of course,
there are special mechanisms for a kernel thread to read data from the user
space. However, the reverse is not possible because that would be a serious
security lapse. There is a similar split in 64-bit kernels as well – separate user
and kernel virtual address spaces.
the fact that they are independent of any user thread. Hence, the same trick
of reusing the user thread and making it a kernel thread is not advisable here.
There may not be any relationship between the user thread that was interrupted
and the interrupt handler. Hence, in all likelihood, the interrupt handler will
require a dedicated kernel thread to run. If you may recall, we had discussed the
idea of an interrupt stack in Section 3.2.4, where we had mentioned that each
CPU maintains a stack of interrupt threads. Whenever an interrupt arrives, we
pop the stack and assign the task of running the interrupt handler to the thread
that was just popped.
An advantage of doing this is that such threads would already have been
initialized to have a high priority. Furthermore, interrupt handling threads have other restrictions, such as the fact that they cannot hold locks or use any kind of blocking synchronization. All of these rules and restrictions can be built into them at
the time of creation. Interrupt handling in Linux follows the classical top half
and bottom half paradigm. Here, the interrupt handler, which is known as the
top half, does basic interrupt processing. However, it may be the case that a
lot more work is required. This work is deferred to a later time and is assigned
to a lower-priority thread, which is classically referred to as the bottom half.
Of course, this mechanism has become more sophisticated now; however, the
basic idea is still the same: do the high-priority work immediately and defer
the low-priority work to a later point in time. The bottom half thread does not
have the same restrictions that the top half thread has. It thus has access to a
wider array of features. Here also, the interrupt handler’s (top half’s) code and
variables are in a different part of the virtual address space (not in a region that
is accessible to any user process). Hence, there is no need to flush the TLB or
reload any page table. This speeds up the context switch process.
1. The hardware stores the program counter (the rip register) in the register rcx and stores the flags register rflags in r11. Clearly, prior to making
a system call it is assumed that the two general purpose registers rcx and
r11 do not contain any useful data.
4. Almost all x86 and x86-64 processors define a special segment per CPU
known as the Task State Segment or TSS. The size of the TSS segment is
small but it is used to store important information regarding the context
switch process. It was previously used to store the entire context of the
task. However, these days it is used to store a part of the overall hardware
context of a running task. On x86-64, a system call handler stores the stack
pointer (rsp) on it. There is sadly no other choice. We cannot use the
kernel stack because for that we need to update the stack pointer – the old
value will get lost. We also cannot use a general purpose register. Hence,
a separate memory region is necessary to act as a temporary region.
5. Finally, the stack pointer of the current process can be set to point to the kernel stack.
6. We can now push the rest of the state to the kernel stack. This will include
the following:
Additional Context
Along with the conventional hardware context, there are additional parts of the hardware context that need to be stored and restored. Because the size of
the kernel stack is limited, it is not possible to store a lot of information there.
Hence, a dedicated structure called a thread struct is defined to store all extra
and miscellaneous information. It is defined in arch/x86/include/asm/processor.h.
Every thread defines a TLS region (thread local storage). It stores variables
specific to a thread. The thread struct stores a list of such TLS regions
(starting address and size of each), the stack pointer (optionally), the segment
registers (ds,es,fs and gs), I/O permissions and the state of the floating-point
unit.
Trivia 1 One will often find statements of the following form in the kernel code:
if ( likely ( < some condition >) ) {...}
if ( unlikely ( < some condition >) ) {...}
These are hints to the branch predictor of the CPU. The term likely
means that the branch is most likely to be taken, and the term unlikely means that the branch is most likely not to be taken. These hints increase
the branch predictor accuracy, which is vital to good performance.
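These macros are thin wrappers around a compiler builtin; roughly, they are defined as follows:

#define likely(x)    __builtin_expect(!!(x), 1)
#define unlikely(x)  __builtin_expect(!!(x), 0)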
We are using the value of the task struct* pointer as a source of ran-
domness. Many such random sources are combined in the kernel to create
a good random number generating source that can be used to generate cryp-
tographic keys.
Chapter 4
In this chapter, we will study the details of system calls, interrupts, exceptions
and signals. The first three are the only methods to invoke the OS. Normally,
the OS code lies dormant. It comes into action only after three events: system
calls, interrupts and exceptions. In common parlance all three of these events
are often referred to as interrupts even though sometimes distinctions are made
such as using the terms “hardware interrupts” and “software interrupts”. It is
important to note that in the general interrupt-processing mechanism, an inter-
rupt can be generated by either external hardware or internally by the program
itself. Internal interrupts comprise software interrupts to effect system calls and exceptions.1 For all these mechanisms, including the older way of making system calls, i.e., invoking the instruction int 0x80 that simply generates a dummy interrupt with interrupt code 0x80, the generic interrupt-processing mechanism is used. The processor even treats exceptions as a special kind of interrupt.
All interrupts have their own interrupt codes; they are also known as interrupt
vectors. An interrupt vector is used to index interrupt handler tables. Let us
elaborate.
Figure 4.1 shows the structure of the Interrupt Descriptor Table (IDT) that
is pointed to by the idtr register. As we can see, regardless of the source of the
interrupt, ultimately an integer code called an interrupt vector gets associated
with it. It is the job of the hardware to assign the correct interrupt vector to
an interrupting event. Once this is done, a hardware circuit is ready to access
the IDT using the interrupt vector as the index.
Accessing the IDT is a simple process. A small module in hardware simply
finds the starting address of the IDT by reading the contents of the idtr register
and then accesses the relevant entry using the interrupt vector. The output is
the address of the interrupt handler, whose code is subsequently loaded. The
handler finishes the rest of the context switch process and begins to execute the
code to process the interrupt. Let us now understand the details of the different
types of handlers.
1 Note that the new syscall instruction for 64-bit processors does not use this mechanism. Instead, it directly transitions execution to the system call entry point after a ring level change and some basic context saving.
[Figure 4.1: The Interrupt Descriptor Table (IDT). Exceptions, system calls (via int 0x80) and interrupts from hardware devices (identified by their IRQs) are all mapped to an interrupt vector, which indexes the IDT (pointed to by the idtr register) to yield the address of the corresponding handler.]
#include <stdio.h>

int main () {
    printf ( "Hello World\n" ) ;
    return 0;
}
Let us now understand this process in some detail. The signature of the
printf function is as follows: int printf(const char* format, ...). The
format string is of the form “The result is %d, %s”. It is succeeded by a sequence of arguments, which replace the format specifiers (like “%d” and “%s”) in the format string. The ellipses . . . indicate a variable number of arguments.
A sequence of functions is called in the glibc code. The sequence is as
→ PUT. Gradually the signature changes – it becomes more and more generic.
This ensures that other calls like fprintf that write to a file are all covered by
the same function as special cases. Note that Linux treats every device as a file
including the terminal. The terminal is a special kind of file, which is referred to
as stdout. The function vfprintf accepts a generic file as an argument, which
it can write to. This generic file can be a regular file in the file system or the
terminal (stdout). The signature of vfprintf is as follows:
int vfprintf ( FILE *s , const CHAR_T * format , va_list ap ,
unsigned int mode_flags ) ;
Note the generic file argument FILE *s, the format string, the list of ar-
guments and the flags that specify the nature of the I/O operation. Every
subsequent call generalizes the function further. Ultimately, the control reaches
the new do write in the glibc code (fileops.c). It makes the write system
call, which finally transfers control to the OS. At this point, it is important to
digress and make a quick point about the generic principles underlying library
design.
At this point there is a ring level switch and interrupts are switched off. The
reason to turn off interrupts is to ensure that the context is saved correctly. If there is an interrupt in the middle of saving the context, there is a possibility that an error will be induced. Hence, the context-saving process must not be interrupted midway. Saving the context is a short process and masking interrupts
during this process will not create a lot of performance issues in handling even
very critical tasks. Interrupts can be enabled as soon as the context is saved.
Linux has a standard system call format. Table 4.1 shows which register stores
which type of argument. For instance, rax stores the system call number. Six
more arguments can be supplied via registers, as shown in Table 4.1. If there
are more arguments, then they need to be transferred via the user stack. The
kernel can read user memory and thus it can easily retrieve these arguments.
However, passing arguments using the stack is not the preferred method; it is
much slower than passing values via registers.
Note that a system call is a planned activity as opposed to an interrupt.
Hence, we can free up some registers such as rcx and r11 by spilling their
contents. Recall that the PC and the flags are automatically stored in these
registers once a system call is made. The system call handler subsequently
stores the contents of these registers on the kernel stack.
Attribute              Register
System call number     rax
Arg. 1                 rdi
Arg. 2                 rsi
Arg. 3                 rdx
Arg. 4                 r10
Arg. 5                 r8
Arg. 6                 r9
Table 4.1: Registers used to pass system call arguments (x86-64)
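As an illustration of this convention, the following user-space sketch invokes the write system call directly using the syscall instruction (assuming x86-64 Linux and GCC-style inline assembly; the wrapper name raw_write is hypothetical).

#include <stddef.h>

/* Invoke write(fd, buf, count) directly. rax holds the system call
   number (1 = write on x86-64); the first three arguments go in rdi,
   rsi and rdx, and arguments 4-6 would go in r10, r8 and r9. The
   syscall instruction clobbers rcx and r11 (saved PC and flags). */
static long raw_write(int fd, const void *buf, size_t count)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(1L), "D"((long)fd), "S"(buf), "d"(count)
                      : "rcx", "r11", "memory");
    return ret;
}

int main(void)
{
    raw_write(1, "hello\n", 6);   /* file descriptor 1 is stdout */
    return 0;
}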
Let us now discuss the do syscall 64 function in more detail. After basic context
saving, interrupts are enabled, and then the function accesses a system call table
as shown in Table 4.2. Given a system call number, the table lists the pointer to
the function that handles that specific type of system call. This function is then
called. For instance, the write system call ultimately gets handled by the
ksys write function, where all the arguments are processed and the real
work is done.
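Conceptually, the dispatch is just an indexed call through a table of function pointers. The following user-space sketch (with made-up handler names standing in for functions like ksys write) mimics what such a lookup looks like.

#include <stdio.h>

typedef long (*syscall_fn)(long, long, long);

/* Hypothetical stand-ins for the real handlers (ksys_read, ksys_write, ...). */
static long fake_sys_read(long fd, long buf, long count)  { return 0; }
static long fake_sys_write(long fd, long buf, long count) { return count; }

/* A miniature "system call table" indexed by the call number. */
static syscall_fn sys_call_table[] = {
    [0] = fake_sys_read,   /* __NR_read on x86-64  */
    [1] = fake_sys_write,  /* __NR_write on x86-64 */
};

static long do_syscall(long nr, long a1, long a2, long a3)
{
    long nr_calls = sizeof(sys_call_table) / sizeof(sys_call_table[0]);
    if (nr < 0 || nr >= nr_calls)
        return -1;                     /* the kernel would return -ENOSYS */
    return sys_call_table[nr](a1, a2, a3);
}

int main(void)
{
    printf("write returned %ld\n", do_syscall(1, 1, 0, 6));
    return 0;
}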
The TIF NEED RESCHED flag is set by the scheduler when it feels that the current
task has executed for a long time and needs to give way to other processes, or
when there are other higher-priority processes waiting. Sometimes threads
explicitly request to be preempted so that other threads get a chance to execute;
another thread may need to produce a value that is useful to the current thread.
In this case also, the thread that wishes to yield the CPU gets this flag set.
If this flag is set, then the scheduler needs to run and find the most worthy
process to run next. The scheduler uses very complex algorithms to decide this.
It treats the TIF NEED RESCHED flag as a coarse-grained hint; nevertheless, it
makes its own independent decision. It may decide to continue with the same
task or it may decide to start a new task on the same core. This is purely its
prerogative.
The context-restore mechanism follows the reverse sequence vis-a-vis the
context-switch process. Note that there are some entities, such as segment
registers, that normally need not be saved but do have to be restored. The
reason is that they are not transient (ephemeral) in character: we do not expect
them to change often, especially due to the user task. Once they are set, they
typically continue to retain their values till the end of the execution. Their
values can be stored in the task struct. At the time of restoring a context, if
the task changes, we can read the respective values from the task struct and
set the segment registers accordingly.
Finally, the kernel executes the sysret instruction, which sets the value of the
PC and completes the transfer of control back to the user process. It also
changes the ring level, or in other words, effects a mode switch (from kernel
mode to user mode).
[Figure: Each CPU has its own local APIC (LAPIC); the I/O APIC and the I/O controller are shared among the CPUs.]
4.2.1 APICs
Figure 4.4 represents the flow of actions. We need to distinguish between two
terms: interrupt request and interrupt number/vector. The interrupt number or
interrupt vector is a unique identifier of the interrupt and is used to identify the
interrupt service routine that needs to run whenever the interrupt is generated.
The IDT is indexed by this number.
The interrupt request (IRQ), on the other hand, is a hardware signal that is
sent to the CPU indicating that a certain hardware device needs to be serviced.
There are different IRQ lines (see Figure 4.4). For example, one line may be
for the keyboard, another one for the mouse, and so on. In older systems, the
number/index of the IRQ line was the same as the interrupt vector.
However, with the advent of programmable interrupt controllers (read APICs),
this has been made more flexible. The mapping can be changed dynamically.
For example, we can program a new device such as a USB device to act as a
mouse: it will generate exactly the same interrupt vector. In this way,
it is possible to obfuscate a device and make it present itself as a different or
somewhat altered device to software.
Here again there is a small distinction between the LAPIC and the I/O APIC.
The LAPIC directly generates interrupt vectors and sends them to the CPU.
The flow of actions (for the local APIC) is shown in Figure 4.4. ❶ The first
step is to check if interrupts are enabled or disabled. Recall that there are often
sensitive sections in the execution of the kernel where it is wise to disable
interrupts so that no correctness problems are introduced. Interrupts are
typically not lost: they are queued in the hardware queue of the respective APIC
and processed in priority order when interrupts are enabled again. Of course,
there is a possibility of overflows. This is a rare situation but it can happen;
in this case interrupts will be lost. Note that
[Figure 4.4: IRQ lines feed the APIC, which sends an interrupt vector to the CPU. An IRQ (interrupt request) is the kernel's identifier for a hardware interrupt source, whereas an interrupt vector (INT) identifies any kind of interrupting event: interrupt, system call, exception, fault, etc. The LAPIC sends an interrupt vector (not an IRQ) to the CPU.]
disabling and masking interrupts are two different concepts. Disabling is a
sledgehammer-like operation where all interrupts are temporarily turned off.
Masking, on the other hand, is a more fine-grained action where only certain
interrupts are disabled in the APIC. Akin to disabling, the masked interrupts
are queued in the APIC and presented to the CPU at a later point in time, when
they are unmasked.
❷ Let us assume that interrupts are not disabled. Then the APIC chooses
the highest priority interrupt and finds the interrupt vector for it. It also needs
the corresponding data from the device. ❸ It buffers the interrupt vector and
data, and then checks if the interrupt is masked or not. ❹ If it is masked, then
it is added to a queue as discussed, otherwise it is delivered to the CPU. ❺ The
CPU needs to acknowledge that it has successfully received the interrupt and
only then does the APIC remove the interrupt from its internal queues. Let
us now understand the roles of the different interrupt controllers in some more
detail.
I/O APIC
In the full system, there is only one I/O APIC chip. It is typically not a part
of the CPU, instead it is a chip on the motherboard. It mainly contains a
redirection table. Its role is to receive interrupt requests from different devices,
process them and dispatch the interrupts to different LAPICs. It is essentially
an interrupt router. Most I/O APICs typically have 24 interrupt request lines.
Typically, each device is assigned its IRQ number – the lower the number higher
is the priority. A noteworthy mention is the timer interrupt, whose IRQ number
is typically 0.
Consider an example: a kernel thread running on CPU 5 wishes to replace the
task running on CPU 1. The kernel thread only has control over the current
CPU, which is CPU 5. It does not have any control over what is happening on
CPU 1. The IPI mechanism, which is a hardware mechanism, is designed
precisely to facilitate this. CPU 5, at the behest of the kernel thread running on
it, can instruct its LAPIC to send an IPI to the LAPIC of CPU 1. The IPI will
be delivered to CPU 1, which will then run a kernel thread of its own. After
doing the necessary bookkeeping steps, this kernel thread will realize that it was
brought in because the kernel thread on CPU 5 wanted to replace the task
running on CPU 1 with some other task. In this manner, one kernel thread can
exercise its control over all CPUs; it does, however, need the IPI mechanism to
achieve this. Often, the timer chip is a part of the LAPIC. Depending upon the
needs of the kernel, its interrupt frequency can be configured or even changed
dynamically. We have already described the flow of actions in Figure 4.4.
Distribution of Interrupts
The next question that we need to address is how interrupts are distributed
among the LAPICs. There are regular I/O interrupts, timer interrupts and
IPIs. We can either have a static distribution or a dynamic distribution. In
the static distribution, one specific core or a set of cores is assigned the role
of processing a given interrupt. Of course, there is no flexibility when it comes
to IPIs. Even in the case of timer interrupts, it is typically the case that each
LAPIC generates periodic timer interrupts to interrupt its local core. However,
this is not absolutely necessary and some flexibility is provided. For instance,
instead of generating periodic interrupts, a LAPIC can be programmed to
generate an interrupt at a specific point in time. In that case, it is a one-shot
interrupt and periodic interrupts are not generated. This behavior can change
dynamically because LAPICs are programmable.
In the dynamic scheme, it is possible to send the interrupt to the core that is
running the lowest-priority task. This again requires hardware support.
Every core on an Intel machine has a task priority register, where the
kernel writes the priority of the task that is currently executing on it. This
information is used by the I/O APIC to deliver the interrupt to the core that
is running the lowest-priority task. This is a very efficient scheme because it
allows higher-priority processes to run unhindered. If there are idle cores, then
the situation is even better: they can be used to process all the I/O interrupts
and sometimes even timer interrupts (if they can be rerouted to a different core).
4.2.2 IRQs
The file /proc/interrupts contains the details of all the IRQs and how they
are being processed (refer to Figure 4.3). Note that the contents shown are
specific to the author's machine, and that too as of 2023.
The first column is the IRQ number. As we can see, the timer interrupt is IRQ#
0. The next four columns show the count of timer interrupts received at each
CPU. Note that these counts are small. This is because any modern machine has
a variety of timers: it has the low-resolution LAPIC timer, but in this case a
higher-resolution timer was used instead. Modern kernels prefer high-resolution
timers because they can dynamically configure the interrupt interval based on
the processes that are executing in the system. This interrupt is originally
processed by the I/O APIC. The term “2-edge” means that this is an
edge-triggered interrupt on IRQ line 2. Edge-triggered interrupts are activated
when there is a signal transition (0 → 1 or 1 → 0). The handler is the generic
function associated with the timer interrupt.
The “fasteoi” interrupts are level-triggered. Instead of being based on an
edge (a signal transition), they depend upon the level of the signal in the in-
terrupt request line. “eoi” stands for “End of Interrupt”. The line remains
asserted until the interrupt is acknowledged.
For every request that comes from an IRQ, an interrupt vector is generated.
Table 4.4 shows the range of interrupt vectors. NMIs (non-maskable interrupts)
and exceptions fall in the range 0-19. The interrupt numbers 20-31 are reserved
by Intel for later use. The range 32-127 corresponds to interrupts generated by
external sources (typically I/O devices). We are all familiar with interrupt
number 128 (0x80 in hex), which is a software-generated interrupt corresponding
to a system call. Most modern machines have stopped using this mechanism
because they now have a faster method based on the syscall instruction. Vector
239 is the local APIC (LAPIC) timer interrupt. As we have argued, many IRQs
can generate this interrupt vector because there are many timers in modern
systems with different resolutions. Lastly, the range 251-253 corresponds to
inter-processor interrupts (IPIs). A disclaimer is due here: this is the interrupt
vector range on the author's Intel i7-based system as of 2023, and it may well
change in the future. Hence, the reader is requested to treat this data as an
example.
Table 4.5 summarizes our discussion quite nicely. It shows the IRQ number,
the interrupt vector and the hardware device. We see that IRQ 0 for the default
timer corresponds to interrupt vector 32. The keyboard, system clock, network
interface and USB ports have their own IRQ numbers and corresponding
interrupt vector numbers. One advantage of separating the two concepts – IRQ
and interrupt vector – is clear from the case of timers. We can have a wide
variety of timers with different resolutions; however, they can all be mapped to
the same interrupt vector. This ensures that whenever an interrupt arrives from
any one of them, the timer interrupt handler can be invoked. The kernel can
dynamically decide which timer to use depending on the requirements and the
load on the system.
Given that hardware IRQs are limited in number, it is possible that we may have
more devices than the number of IRQs. In this case, several devices have to
share the same IRQ number. We can do our best to dynamically manage the
IRQs, such as deallocating an IRQ when a device is not in use or dynamically
allocating an IRQ when a device is accessed for the first time. In spite of that,
we still may not have enough IRQs. Hence, there is a need to share an IRQ
between multiple devices. Whenever an interrupt is received on an IRQ, we need
to check which device generated it by running all the handlers corresponding to
the connected devices that share that IRQ. These handlers will query the
individual devices or inspect the data and find out. Ultimately, we will find the
device that is responsible for the interrupt. This is a slow but unavoidable task.
Listing 4.2 shows the important fields in struct irq desc. It is the nodal
data structure for all IRQ-related data. It stores all the information regarding
the hardware device, the interrupt vector, CPU affinities (which CPUs process
it), a pointer to the handler, special flags and so on.
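As a rough mental model, a heavily simplified sketch of the descriptor is shown below; the field names follow the mainline kernel, but the types are abridged and most fields are omitted.

/* A simplified sketch of struct irq_desc (the real definition lives in
   include/linux/irqdesc.h). */
struct irqaction;                 /* per-device handler; forms a linked list */

struct irq_desc_sketch {
    unsigned int      irq;        /* the IRQ number                           */
    void            (*handle_irq)(struct irq_desc_sketch *desc); /* flow handler */
    struct irqaction *action;     /* chain of device handlers sharing the IRQ */
    const char       *name;       /* e.g., "edge" or "fasteoi"                */
    /* ... CPU affinity masks, per-CPU statistics, locks, flags, etc. ...     */
};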
Similar to process namespaces, IRQs are also subdivided into domains.
This is especially necessary given that modern processors have a lot of devices
and interrupt controllers. We can have a lot of IRQs, but at the end of the day,
the processor will use the interrupt vector (a simple number between 0 and 255).
It still needs to retain its meaning and be unique.
A solution similar to hierarchical namespaces is as follows: assign each in-
terrupt controller a domain. Within a domain, the IRQ numbers are unique.
Recall that we followed a similar logic in process namespaces – within a names-
pace pid numbers are unique. The IRQ number (like a pid) is in a certain sense
getting virtualized. Similar to a namespace’s IDR tree whose job was to map pid
numbers to struct pid data structures, we need a similar mapping structure
here per domain. It needs to map IRQ numbers to irq desc data structures.
This is known as reverse mapping (in this specific context). Such a mapping
mechanism allows us to quickly retrieve an irq desc data structure given an
IRQ number. Before that we need to add the interrupt controller to an IRQ
domain. Typically, the irq domain add function is used to realize this. This is
similar to adding a process to a namespace first before starting any operation
on the process.
In the case of an IRQ domain, we have a more nuanced solution. If there are
fewer than 256 IRQs, then the kernel uses a simple linear list; otherwise, it uses
a radix tree. This gives us the best of both worlds.
The domains are organized hierarchically. We have an I/O APIC domain
whose parent is a larger domain known as the interrupt remapping domain – its
job is to virtualize the multiple I/O APICs. This domain forwards the interrupt
to the controllers in the LAPIC domain that further virtualize the IRQs, map
them to interrupt vectors and present them to the cores.
An astute reader will quickly notice the difference between hierarchical
namespaces and hierarchical IRQ domains. In the former, the aim is to make a
child process a member of the parent namespace such that it can access
resources that the parent owns. However, in the case of IRQ domains, interrupts
flow from the child to the parent. There is some degree of virtualization and
remapping at every stage. For example, one of the domains in the middle could
send all keyboard interrupts to only one VM (virtual machine) running on the
system. This is because the rest of the VMs may not be allowed to read input
from the keyboard. Such policies can be enforced with IRQ domains.
The IDT maps the interrupt vector to the address of the handler.
The initial IDT is set up by the BIOS. During the process of the kernel
booting up, it is sometimes necessary to process user inputs or other important
system events like a voltage or thermal emergency. Also, in many cases, prior
to the OS booting up, the boot loader shows up on the screen and asks the user
which kernel she would like to boot. For all of this, we need a bare-bones IDT
that is already set up. However, once the kernel boots, it needs to reinitialize or
overwrite it. For every single device and exception-generating situation, entries
need to be made. These will be custom entries and only the kernel can make
them, because the BIOS would simply not be aware of them – they are very
kernel-specific. Furthermore, the interrupt handlers will be in the kernel's
address space and thus only the kernel will be aware of their locations.
In general, interrupt handlers are not kept in a memory region that can be
relocated or swapped out; the pages are locked and pinned in physical memory
(see Section 3.2.8).
The kernel maps the IDT to the idt table data structure. Each entry of this
table is indexed by the interrupt vector. Each entry points to the corresponding
interrupt handler. It basically contains two pieces of information: the value
of the code segment register and an offset within the code segment. This is
sufficient to load the interrupt handler. Even though this data structure is
set up by the kernel, it is actually looked up in hardware. There is a simple
mechanism to enable this. There is a special register called the IDTR register.
Similar to the CR3 register for the page table, it stores the base address of
the IDT. Thus the processor knows where to find the IDT in physical memory.
The rest of the lookup can be done in hardware and interrupt handlers can be
automatically loaded by a hardware circuit. The OS need not be involved in
this process. Its job is to basically set up the table and let the hardware do the
rest.
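For concreteness, each entry in the 64-bit IDT is a 16-byte gate descriptor that splits the handler's offset into three pieces and records the code segment selector; a sketch of the layout (assuming x86-64) is shown below, together with the small descriptor that the lidt instruction loads into the idtr register.

#include <stdint.h>

/* Layout of one 64-bit IDT entry (an interrupt gate), 16 bytes in total. */
struct idt_entry64 {
    uint16_t offset_low;   /* handler offset, bits 0-15               */
    uint16_t selector;     /* kernel code segment selector            */
    uint8_t  ist;          /* bits 0-2: interrupt stack table index   */
    uint8_t  type_attr;    /* gate type, privilege level, present bit */
    uint16_t offset_mid;   /* handler offset, bits 16-31              */
    uint32_t offset_high;  /* handler offset, bits 32-63              */
    uint32_t reserved;
} __attribute__((packed));

/* The idtr register holds the base address and size of the IDT. */
struct idt_register {
    uint16_t limit;        /* size of the IDT minus one               */
    uint64_t base;         /* linear address of the idt_table         */
} __attribute__((packed));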
Next, it makes a call to init IRQ to set up the per-CPU interrupt stacks
(Section 3.2.4) and the basic IDT. Once that is done, the LAPICs and the I/O
APIC can be set up along with all the connected devices. The apic bsp setup
function realizes this task. All the platform-specific initialization functions for
x86 machines are defined in a structure .irqs that contains a list of function
pointers, as shown in Listing 4.3. The function apic intr mode init specifically
initializes the APICs on x86 machines.
TPR Task priority register. This stores the priority of the current task. When
we are dynamically assigning interrupts to cores, the priority stored in this
register comes in handy.
Setting up an I/O APIC is somewhat different (refer to the code in
arch/x86/kernel/apic/io apic.c). For every single pin in the I/O APIC that is
connected to a hardware device, we need to probe the device and set up an IRQ
data structure for it. Next, for each I/O APIC in the system, there is a need to
create an IRQ domain for it and add all the constituent hardware IRQs to the
domain. Note that a large multi-socket system may have many I/O APICs.
The entry point to the IDT is shown in Listing 4.4. The vector irq array
is a table that uses the interrupt vector (vector) as an index to fetch the
corresponding irq desc data structure. This array is stored in the per-CPU
region, hence the this cpu read macro is used to access it. Once we obtain the
irq desc data structure, we can process the interrupt by calling the handle irq
function. Recall that the interrupt descriptor stores a pointer to the function
that is meant to handle the interrupt. The regs argument contains the values of
all the CPU registers; it was populated while saving the context of the running
process that was interrupted. Let us now look at an interrupt handler, referred
to as an IRQ handler in the parlance of the Linux kernel. The specific interrupt
handlers are called from the handle irq function.
Recall that an IRQ can be shared across devices. When an IRQ line is asserted,
we do not know which device has raised the interrupt. It is important to call
all the handlers associated with the IRQ one after the other, until one of them
associates the interrupt with its corresponding device. That interrupt handler
can then proceed to handle the interrupt. The rest of the interrupt handlers
associated with the IRQ will return NONE, whereas the handler that handles it
returns either HANDLED or WAKE THREAD.
The structure irq desc has a linked list comprising struct irqaction*
elements. It is necessary to walk this list and find the handler that is associated
with the device that raised the interrupt.
Note that all of these handlers (of type irq handler t) are function point-
ers. They can either be generic interrupt handlers defined in the kernel or
device-specific handlers defined in the device driver code (the drivers direc-
tory). Whenever a device is connected or at boot time, the kernel locates the
device drivers for every such device. The device driver registers a list of func-
tions with the kernel. Whenever an interrupt arrives on a given IRQ line, it is
necessary to invoke the interrupt handlers of all the devices that share that IRQ
line. Basically, we traverse a list of function pointers and invoke one after the
other until the interrupt is successfully handled.
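A user-space sketch of this “try every handler on the shared line” loop is shown below. The return codes mirror the kernel's IRQ NONE/IRQ HANDLED convention; the device names and the handlers themselves are invented.

#include <stdio.h>

enum irqreturn { IRQ_NONE = 0, IRQ_HANDLED = 1 };

typedef enum irqreturn (*irq_handler_fn)(int irq, void *dev);

/* One node per device that shares the IRQ line (mimics struct irqaction). */
struct action {
    irq_handler_fn  handler;
    void           *dev;       /* device-specific cookie */
    struct action  *next;
};

/* Walk the chain until some handler claims the interrupt. */
static void handle_shared_irq(int irq, struct action *chain)
{
    for (struct action *a = chain; a != NULL; a = a->next) {
        if (a->handler(irq, a->dev) == IRQ_HANDLED)
            return;            /* the owning device has been serviced */
    }
    printf("spurious interrupt on IRQ %d\n", irq);
}

static enum irqreturn mouse_handler(int irq, void *dev)
{
    (void)irq; (void)dev;
    return IRQ_NONE;           /* "not my device" */
}

static enum irqreturn keyboard_handler(int irq, void *dev)
{
    (void)irq;
    printf("keyboard (%s) handled the interrupt\n", (const char *)dev);
    return IRQ_HANDLED;
}

int main(void)
{
    struct action kbd   = { keyboard_handler, "kbd0",   NULL };
    struct action mouse = { mouse_handler,    "mouse0", &kbd };
    handle_shared_irq(1, &mouse);
    return 0;
}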
The interrupt handler that does the basic interrupt processing is conventionally
known as the top half. Its primary job is to acknowledge the receipt of
the interrupt to the APIC and exchange urgent data with the device. Note
that such interrupt handlers are not allowed to make blocking calls or use locks.
Given that they run at a very high priority, we do not want them to wait for a
long time to acquire a lock or potentially get into a deadlock situation. Hence,
they are not allowed to use any form of blocking synchronization. Let us
elaborate.
Top-half interrupt handlers run in a specialized interrupt context. In the
interrupt context, blocking calls such as lock acquisition are not allowed, pre-
emption is disabled, there are limitations on the stack size (similar to other
kernel threads) and access to user-space memory is not allowed. These are
clearly attributes of ultra high priority threads that you want to run and finish
quickly.
If the interrupt-processing work is very limited, then the basic top-half
interrupt handler is good enough. Otherwise, it needs to schedule a bottom
half for deferred interrupt processing: we schedule the work for a later
point in time. The bottom-half thread does not have the same restrictions as
the top half. It can acquire locks, perform complex synchronization and
take a long time to complete. Also, interrupts are enabled while a bottom-half
thread is running. Given that such threads have a low priority, they can
execute for a long time without making the system unstable.
4.2.6 Exceptions
The Intel processor on your author’s machine defines 24 types of exceptions.
These are treated exactly the same way as interrupts and similar to an interrupt
vector, an exception number is generated.
Even though interrupts and exceptions are conceptually different, they are
still handled by the same mechanism, i.e., the IDT. Hence, from the stand-
point of interrupt handling, they are the same (they index the IDT in the same
manner), however, later on within the kernel their processing is very different.
Table 4.6 shows a list of some of the most common exceptions supported by
Intel x86 processors and the latest version of the Linux kernel.
Many of the exceptions are self-explanatory. However, some need additional
explanation as well as justification. Let us consider the “Breakpoint” exception.
Exception Handling
Let us now look at exception handling (also known as trap handling). For every
exception, we define a macro of the form (refer to Listing 4.5) –
We are declaring a macro for division errors. It is named exc divide error.
It is defined in Listing 4.6. The generic function do error trap handles all the
traps (to begin with). Along with details of the trap, it takes all the CPU
registers as an input.
Listing 4.6: Definition of a trap handler
source: arch/x86/kernel/traps.c
DEFINE_IDTENTRY(exc_divide_error)
{
    do_error_trap(regs, 0, "divide error", X86_TRAP_DE,
                  SIGFPE, FPE_INTDIV, error_get_trap_addr(regs));
}
There are several things that an exception handler can do. The various
options are shown in Figure 4.5.
The first option is clearly the most innocuous one, which is to simply send a
signal to the process and not take any other kernel-level action. This can, for
instance, happen in the case of debugging, where the processor generates an
exception upon the detection of a debug event. The OS is then informed, and it
needs to send a signal to the debugging process. This is exactly how breakpoints
and watchpoints work.
The second option is not an exclusive option – it can be clubbed with the
other options. The exception handler can additionally print messages to the
kernel logs using the built-in printk function. This is a kernel-specific print
function that writes to the logs. These logs can be viewed using the dmesg
command and are typically also found in the /var/log/messages file. Many
times, understanding the reasons behind an exception is very important,
particularly when kernel code is being debugged.
The third option is meant for a genuinely exceptional case: a double fault,
i.e., an exception within an exception handler. This is never supposed to
happen unless there is a serious bug in the kernel code. In this case, the
recommended course of action is to halt the system and restart the kernel.
This event is also known as a kernel panic (source: kernel/panic.c).
The fourth option is very useful. For example, assume that a program has
been compiled for a later version of a processor that provides a certain instruc-
tion that an earlier version does not. For instance, processor version 10 in the
processor family provides the cosine instruction, which version 9 does not. In
this case, it is possible to create a very easy patch in software such that code
that uses this instruction can still seamlessly run on a version 9 processor.
The idea is as follows. We allow the original code to run. When the CPU
encounters an unknown instruction (in this case the cosine instruction), it
will generate an illegal-instruction exception. The kernel's exception handler
can then analyze the nature of the exception and figure out that the instruction
was actually trying to compute a cosine. However, that instruction is not a part
of the ISA of the current processor. In this case, it is possible to use other
existing instructions to compute the cosine of the argument and populate the
destination register with
the result. The running program can be restarted at the exact point at which it
trapped. The destination register will have the correct result. It will not even
perceive the fact that it was running on a CPU that did not support the cosine
instruction. Hence, from the point of view of correctness, there is no issue.
Of course, there is a performance penalty – this is a much slower solution
as compared to having a dedicated instruction. However, the code now becomes
completely portable. Had we not implemented this patching mechanism
via exceptions, the entire program would have been rendered useless. A small
performance penalty is a very small price to pay in this case.
The last option is known as the notify die mechanism, which implements
the classic observer pattern in software engineering.
of code to fix any errors that may have occurred. Another interested listener
can log the event. These two listeners are clearly doing different things, which
was the original intention. We can clearly add more listeners to the chain and
do many other things.
The return values of the different handlers are quite relevant and important
here. This process is similar in character to the irqaction mechanism, where
we invoke all the interrupt handlers that share an IRQ line in sequence, and the
return value indicates whether the interrupt was successfully handled or not.
In that case, we would like the interrupt to be handled only once. However,
in the case of an exception, multiple handlers can be invoked and they can
perform different kinds of processing; they need not enjoy a sense of exclusivity
(as in the case of interrupts). Let us elaborate on this point by looking at the
return values of exception handlers that use the notify die mechanism (shown
in Table 4.7). We can either continue traversing the chain of listeners/observers
after processing an event or stop calling any more functions. All the options
have been provided.
Value          Meaning
NOTIFY DONE    Do not care about this event. However, other functions in the chain can be invoked.
NOTIFY OK      Event successfully handled. Other functions in the chain can be invoked.
NOTIFY STOP    Do not call any more functions.
NOTIFY BAD     Something went wrong. Stop calling any more functions.
Table 4.7: Status values returned by exception handlers that have subscribed to
the notify die mechanism. source: include/linux/notifier.h
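To make the return-value semantics concrete, here is a small user-space sketch of an observer chain in the spirit of notify die. The constants mirror Table 4.7; the listener functions are invented.

#include <stdio.h>

enum { NOTIFY_DONE, NOTIFY_OK, NOTIFY_STOP, NOTIFY_BAD };

typedef int (*listener_fn)(int event);

/* Two invented listeners: one patches things up, the other just logs. */
static int fixup_listener(int event) { printf("fixup: event %d\n", event); return NOTIFY_OK;   }
static int log_listener(int event)   { printf("log: event %d\n", event);   return NOTIFY_DONE; }

/* Call every listener in the chain until one asks us to stop. */
static void notify_chain(listener_fn *chain, int n, int event)
{
    for (int i = 0; i < n; i++) {
        int ret = chain[i](event);
        if (ret == NOTIFY_STOP || ret == NOTIFY_BAD)
            break;             /* stop traversing the chain */
    }
}

int main(void)
{
    listener_fn chain[] = { fixup_listener, log_listener };
    notify_chain(chain, 2, 0 /* e.g., a divide-error event */);
    return 0;
}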
4.3.1 Softirqs
A regular interrupt’s top-half handler is known as a hard IRQ. It is bound by a
large number of rules and constraints regarding what it can and cannot do. A
softirq on the other hand is a bottom half handler. There are two ways that it
can be invoked (refer to Figure 4.6).
[Figure 4.6: A softirq request can be raised either by a hard IRQ (the top half) or by system-management kernel threads; in both cases the do softirq function eventually processes the pending requests.]
The first method (on the left) starts with a regular I/O interrupt (hard IRQ).
After basic interrupt processing, a softirq request is raised. This means that a
work parcel is created that needs to be executed later using a softirq thread.
It is important to call the function local bh enable after this such that the
processing of bottom-half threads like softirq threads is enabled.
Then at a later point of time the function do softirq is invoked whose job
is to check all the deferred work items and execute them one after the other
using specialized high-priority threads.
There is another mechanism for doing this type of work (the right path in the
figure). It is not necessary that top-half interrupt handlers raise softirq requests;
they can also be raised by regular kernel threads that want to defer some work
for later processing. It is important to note that there may be more urgent
needs in the system that must be attended to immediately, and thus some kernel
work needs to be deferred. Hence, a deferred work item can be created and
stored as a softirq request.
A dedicated kernel thread called ksoftirqd runs periodically and checks for
pending softirq requests. Such threads are called daemons: dedicated kernel
threads that typically run periodically and check/process pending requests.
ksoftirqd periodically follows the same execution path and calls the function
do softirq, where it picks an item from a softirq queue and executes a function
to process it.
The net summary is that softirqs are a generic mechanism that can be used
both by top-half interrupt handlers and by specialized kernel threads; both
can insert softirq requests into dedicated queues, which are processed later when
CPU time is available.
Raising a softirq
Many kinds of interrupt handlers can raise softirq requests. They all invoke the
raise softirq function whenever they need to add a softirq request. Instead
of using a software queue, there is a faster method to record this information:
store a word in memory in the per-CPU region. Each bit of this word corresponds
to a specific type of softirq. If a bit is set, then it means that a softirq request of
the corresponding type is pending at that CPU.
Here are examples of some types of softirqs (defined in include/linux/interrupt.h):
HI SOFTIRQ, TIMER SOFTIRQ, NET TX SOFTIRQ, BLOCK SOFTIRQ, SCHED SOFTIRQ
and HRTIMER SOFTIRQ. As the names suggest, different kinds of softirqs are
defined for different kinds of interrupts. Of course, the list is limited and so is
the flexibility. However, the softirq mechanism was never meant to be very
generic in the first place; it was always meant to offload deferred work for a few
well-defined classes of interrupts and kernel tasks. It is not meant to be used
by device drivers.
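The pending-word idea can be sketched as follows. This is a user-space analogy: the bit positions and the processing loop are simplified stand-ins for the kernel's per-CPU pending word and the do softirq function.

#include <stdio.h>

/* A few bit positions, in the spirit of include/linux/interrupt.h. */
enum { HI_SOFTIRQ, TIMER_SOFTIRQ, NET_TX_SOFTIRQ, NR_SOFTIRQS };

static unsigned long softirq_pending;    /* per-CPU in the real kernel */

static void raise_softirq(int nr)
{
    softirq_pending |= 1UL << nr;        /* just set a bit: very cheap */
}

static void do_softirq(void)
{
    for (int nr = 0; nr < NR_SOFTIRQS; nr++) {
        if (softirq_pending & (1UL << nr)) {
            softirq_pending &= ~(1UL << nr);   /* clear before processing */
            printf("processing softirq %d\n", nr);
        }
    }
}

int main(void)
{
    raise_softirq(TIMER_SOFTIRQ);
    raise_softirq(NET_TX_SOFTIRQ);
    do_softirq();
    return 0;
}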
Broad Overview
[Figure 4.7: A work queue contains a set of wrappers, each of which wraps a worker pool; a worker pool comprises worker threads, a linked list of active work items (work structs) and a set of inactive items.]
Let us provide a brief overview of how a work queue works (refer to Fig-
ure 4.7).
A work queue is typically associated with a certain class of tasks such as high-
priority tasks, batch jobs, bottom halves, etc. This is not a strict requirement,
however, in terms of software engineering, this is a sensible decision.
Each work queue contains a bunch of worker pool wrappers, each of which wraps
a worker pool. Let us first understand what a worker pool is, and then we will
discuss the need to wrap it (i.e., create additional code to manage it). A worker
pool has three components: a set of inactive work items, a group of threads that
process the work in the pool, and a linked list of work items that need to be
processed (executed).
The main role of the worker pool is basically to store a list of work items
that need to be completed at some point of time in the future. Consequently,
it has a set of ready threads to perform this work and to also guarantee some
degree of timely completion. This is why it maintains a set of threads that can
immediately be given a work item to process. A work item contains a function
pointer and the arguments of the function. A thread executes the function with
the arguments that are stored in the work item (referred to as a work struct
in the kernel code).
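A bare-bones user-space sketch of this idea is shown below: a single worker thread pulls work items (function pointer plus argument) off a shared list. The names are illustrative, and the kernel's locking and wakeup machinery is replaced by a mutex and a condition variable.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct work_item {
    void (*func)(void *data);            /* function pointer */
    void *data;                          /* its argument     */
    struct work_item *next;
};

static struct work_item *head;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void queue_work(void (*func)(void *), void *data)
{
    struct work_item *w = malloc(sizeof(*w));
    w->func = func;
    w->data = data;
    pthread_mutex_lock(&lock);
    w->next = head;                      /* push onto the list   */
    head = w;
    pthread_cond_signal(&cond);          /* wake a worker thread */
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == NULL)
            pthread_cond_wait(&cond, &lock);
        struct work_item *w = head;
        head = w->next;
        pthread_mutex_unlock(&lock);
        w->func(w->data);                /* run the deferred work */
        free(w);
    }
    return NULL;
}

static void say(void *data) { printf("work item: %s\n", (char *)data); }

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    queue_work(say, "hello");
    queue_work(say, "world");
    sleep(1);                            /* let the worker drain the list */
    return 0;
}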
It may appear that all that we need for creating such a worker pool is a
bunch of threads and a linked list of work items. However, there is a little bit
of additional complexity here. It is possible that a given worker pool may be
overwhelmed with work. For instance, we typically associate a worker pool with
a CPU or a group of CPUs. It is possible that a lot of work is being added to
it and thus the linked list of work items ends up becoming very long. Hence,
there is a need to limit the size of the work that is assigned to a worker pool.
We do not want to traverse long linked lists.
An ingenious solution to limit the size of the linked list is as follows. We
tag some work items as active and put them in the linked list of work items
and tag the rest of the work items as inactive. The latter are stored in another
data structure, which is specialized for storing inactive work items (meant to
be processed much later). The advantage that we derive here is that for the
regular operation of the worker pool, we deal with smaller data structures.
Given that now there is an explicit size limitation, whenever there is an
overflow in terms of adding additional work items, we can safely store them in
the set of inactive items. When we have processed a sizeable number of active
items, we can bring in work items from the inactive list into the active list. It is
the role of the wrapper of a worker pool to perform this activity. Hence, there
is a need to wrap it.
The worker pool along with its wrapper can be thought of as one cohesive
unit. Now, we may need many such wrapped worker pools because in a large
system we shall have a lot of CPUs, and we may want to associate a worker
pool with each CPU or a group of CPUs. This is an elegant way of partitioning
the work and also doing some load-balancing.
Let us now look at the kernel code that is involved in implementing a work
queue.
[Figure: The work queue implementation lives in kernel/workqueue.c and kernel/workqueue internal.h; the main data structures are workqueue struct, pool workqueue and worker pool.]
Let us start with a basic work item – the work struct data structure. The fields
are shown in Listing 4.8. There is not much to it. The member struct list head
entry indicates that this is a part of a linked list of work structs; this is per se
not a field that indicates the details of the operation that needs to be performed.
The only two operational fields of importance are data (the data to be processed)
and the function pointer (func). The data field can also be a pointer to an object
that contains all the information about the arguments. A work struct represents
the basic unit of work.
The advantage of work queues is that they are usable by third-party code and
device drivers as well. This is quite unlike threaded IRQs and softirqs, which are
not usable by device drivers. Any entity can create a struct work struct and
insert it into a work queue. The work is executed later, when the kernel has the
bandwidth to execute it.
Let us now take a deeper look at a worker pool (represented by struct
worker pool). Its job is to maintain a pool of worker threads that process work
queue items. Whenever a thread is required it can be retrieved from the pool.
There is no need to continually allocate new worker threads. We maintain a
pool of threads that are pre-initialized. Recall that we discussed the notion of
a pool in Section 3.2.11 and talked about its advantages in terms of eliminating
the time to allocate and deallocate objects.
Every worker pool has an associated CPU that it has an affinity with, a list of
work struct s, a list of worker threads, and a mapping between a work struct
and the worker thread that it is assigned to. Whenever a new work item arrives,
a corresponding work struct is allocated and a worker thread is assigned to it.
The relationship between the pool of workers, the work queue and the worker
pool is shown in Figure 4.9. The apex data structure that represents the entire
work queue is the struct workqueue struct. It has a member, which is a
wrapper class called struct pool workqueue. It wraps the worker pool.
Let us explain the notion of a wrapper class. This wrapper class wraps the
worker pool (struct worker pool). This means that it intercepts every call to
the worker pool, checks it and appropriately modifies it. Its job is to restrict
the size of the pool and limit the amount of work it does at any point of time.
It is not desirable to overload a worker pool with a lot of work – it will perform
inefficiently. This also means that either the kernel itself is doing a lot of work
or it is not distributing the work efficiently among the CPUs (different worker
pools).
The standard approach used by this wrapper class is to maintain two lists:
active and inactive. If there is more work than what the queue can handle, we
put the additional work items in the inactive list. When the size of the pool
reduces, items can be moved from the inactive list to the list of active work
items.
In addition, each CPU has two work queues: one for low-priority tasks and
one for high-priority tasks.
Listing 4.10 shows the code of a signal handler. Here, the handler function is
handler, which takes a single argument as input: the number of the signal. The
function then executes like any other function: it can make library calls and also
call other functions. In this specific version of the handler, we make an exit
call, which kills the thread that is executing the signal handler. However, this is
not strictly necessary.
Let us assume that we did not make the call to the exit library function;
then one of the following could have happened. If the signal blocked other
signals or interrupts, their respective handlers would be executed. If the signal
was associated with process or thread termination, the respective thread
(or thread group) would be terminated (e.g., SIGSEGV and SIGABRT) upon
returning from the handler. If the thread was not meant to be terminated,
then it resumes executing from the point at which it was paused and the signal
handler's execution began. From the thread's point of view, this is like a regular
context switch.
Now let us look at the rest of the code. Refer to the main function. We need
to register the signal handler. This is done in Line 10. After that, we fork the
process. It is important to bear in mind that signal handling information is also
copied. In this case, for the child process its signal handler will be the copy of
the handler function in its address space. The child process prints that it is
the child and then goes into an infinite while loop.
The parent process, on the other hand, has more work to do. First, it waits
for the child to get fully initialized – there is no point in sending a signal to a
process that has not been fully initialized, because it would ignore the signal.
It thus sleeps for 2 seconds, which is deemed to be enough. It then sends a signal
to the child using the kill library call, which in turn makes the kill system call;
this system call is used to send signals to processes. In this case, it sends the
SIGUSR1 signal. SIGUSR1 has no particular significance otherwise – it is meant
to be defined by user programs for their internal use.
When the parent process sends the signal, the child at that point of time is
stuck in an infinite loop. It subsequently wakes up and runs the signal handler.
The logic of the signal handler is quite clear – it prints the fact that it is the
child along with its process id and then makes the exit call. The parent in turn
waits for the child to exit and then it collects the pid of the child process along
with its exit status. The WEXITSTATUS macro can be used to parse the exit value
(extract its lower 8 bits).
The output of the program clearly indicates that the child was initially stuck in
an infinite loop, that the parent then sent the signal and waited for the child to
exit, and that the child process finally exited from within its signal handler.
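A sketch along these lines (not a verbatim copy of Listing 4.10, so the line numbers will differ) is shown below.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void handler(int signum)
{
    printf("child %d got signal %d\n", getpid(), signum);
    exit(42);                       /* terminate the child from within the handler */
}

int main(void)
{
    signal(SIGUSR1, handler);       /* register the handler; it is inherited on fork */
    pid_t pid = fork();
    if (pid == 0) {                 /* child */
        printf("child %d waiting for a signal\n", getpid());
        while (1)
            ;                       /* spin until the signal arrives */
    }
    sleep(2);                       /* give the child time to initialize */
    kill(pid, SIGUSR1);             /* send SIGUSR1 to the child */
    int status;
    pid_t done = waitpid(pid, &status, 0);
    printf("child %d exited with status %d\n", done, WEXITSTATUS(status));
    return 0;
}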
Please refer to Table 4.8, which shows some of the most common signals used
in the Linux operating system. Many of them can be handled and blocked.
However, there are some, like SIGSTOP and SIGKILL, that are not delivered to
the process at all; the kernel directly stops or kills the process, respectively.
In this sense, the SIGKILL signal is meant for all the threads of a multithreaded
process. But, as we can see, in general a signal is meant to be handled by only
a single thread in a thread group. There are different ways of sending a signal
to a thread group. One of the simplest approaches is the kill system call, which
can send any given signal to a thread group; one of the threads handles the
signal. There are many versions of this system call. For example, the tkill call
can send a signal to a specific thread within a process, whereas the tgkill call
takes care of a corner case. It is possible that the thread id specified in the
tkill call is recycled: the thread completes and then a new thread is spawned
with the same id. This can lead to a signal being sent to the wrong thread. To
guard against this, the tgkill call takes an additional argument, the thread
group id. It is unlikely that both ids will be recycled and still remain the same.
Regardless of the method that is used, it is very clear that signals are sent
to a thread group; they are not meant to be sent to a particular thread unless
the tgkill call is used. Sometimes, there is an arithmetic exception in a thread
and thus there is a need to call the specific handler for that thread only. In this
case, it is neither possible nor advisable to invoke the handler in the context of
another thread in the same thread group.
Furthermore, signals can be blocked as well as ignored. When a signal is
blocked and such a signal arrives, it is queued. All such queued/pending signals
are handled once they are unblocked. Here also there is a caveat: no two signals
of the same type can be pending for a process at the same time. Also, when a
signal handler executes, it blocks the corresponding signal.
There are several ways in which a signal can be handled.
The first option is to ignore the signal – the signal is not important and no
handler is registered for it. In this case, the signal can be happily ignored. On
the other hand, if the signal is important and can lead to process termination,
then the action that needs to be taken is to kill the process. Examples of such
signals are SIGKILL and SIGINT (refer to Table 4.8). There can also be a case
where the process is terminated but an additional file, called the core dump, is
created. It can be used by a debugger to inspect the state of the process at the
point at which it was paused or stopped because of the receipt of the signal. For
instance, we can find the values of all the local variables, the stack's contents
and the memory contents.
We have already seen the process stop and resume signals earlier. Here, the
stop action is associated with suspending a process indefinitely until the resum-
ing action is initiated. The former corresponds to the SIGSTOP and SIGTSTP
signals, whereas the latter corresponds to the SIGCONT signal. It is impor-
tant to understand that like SIGKILL, these signals are not sent to the process.
They are instead intercepted by the kernel and the process is either terminated
or stopped/resumed. SIGKILL and SIGSTOP in particular cannot be ignored,
handled or blocked.
Finally, the last method is to handle the signal by registering a handler. Note
that in many cases this may not be possible, especially if the signal arose because
of an exception: the same exception-causing instruction will execute after the
handler returns and cause an exception again. In such cases, terminating the
process, or stopping that thread and spawning a new thread, are good options.
In some cases, if the circumstances behind an exception can be changed, then
the signal handler can do so, for example by remapping a memory page or
changing the value of a variable. Making such changes in a signal handler is
quite risky and is only meant for black-belt programmers.
Let us now look at the relevant kernel code. The apex data structure in signal
handling is signal struct (refer to Listing 4.11). The information about the
signal handler is kept in struct sighand struct. The two important fields
that store the set of blocked/masked signals are blocked and real blocked.
They are of the type sigset t, which is nothing but a bit vector: one bit
for each signal. It is possible that a lot of signals have been blocked by the
process because it is simply not interested in them; all of these signals are
stored in the variable real blocked. Now, during the execution of any signal
handler, typically more signals are blocked, including the signal that is being
handled. There is a need to add all of these additional signals to the set
real blocked. With these additional signals, the expanded set of signals is
called blocked.
Note the following:
In this case we set the blocked signal set as a superset of the set real blocked.
These are all the signals that we do not want to handle while a signal handler
is executing. After the handler finishes executing, the kernel sets blocked =
real blocked.
struct sigpending stores the list of pending/queued signals that have not
been handled by the process yet. We will discuss its intricacies later.
Finally, consider the last field, which is quite interesting. For a signal handler,
we may want to use the same stack as the thread that was interrupted, or a
different one. If we are using the same stack, then there is no problem;
otherwise, we can use a different stack in the thread's address space, in which
case its starting address and size need to be specified in this field. If we use
such an alternate stack, which is different from the real stack that the thread
was using, no correctness problem is created: the original thread is stopped in
any case, and thus the stack that is used does not matter.
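The alternate stack is set up with the sigaltstack system call and requested on a per-handler basis with the SA ONSTACK flag. A minimal sketch (assuming a Linux/POSIX environment) follows.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void on_signal(int signum)
{
    /* This handler runs on the alternate stack. */
    printf("handling signal %d on the alternate stack\n", signum);
}

int main(void)
{
    /* Carve out a separate stack for signal handlers. */
    stack_t ss;
    ss.ss_sp    = malloc(SIGSTKSZ);
    ss.ss_size  = SIGSTKSZ;
    ss.ss_flags = 0;
    sigaltstack(&ss, NULL);

    struct sigaction sa;
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_ONSTACK;       /* ask for the alternate stack */
    sigaction(SIGUSR1, &sa, NULL);

    raise(SIGUSR1);
    return 0;
}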
Listing 4.12 shows the important fields in the main signal-related structure
signal struct. It mainly contains process-related information such as the
number of active threads in the thread group, a linked list of all the threads (in
the thread group), a list of the constituent threads that are waiting on the wait
system call, the last thread that processed a signal and the list of pending
signals (shared across all the threads in a thread group).
Let us now look at the next data structure – the signal handler.
Listing 4.13 shows the wrapper of signal handlers of the entire multi-threaded
process. It actually contains a lot of information in these few fields. Note that
this structure is shared by all the threads in the thread group.
The first field count maintains the number of task struct s that use this
handler. The next field signalfd wqh is a queue of waiting processes. At this
stage, it is fundamental to understand that there are two ways of sending a
signal to a process. We have already seen the first approach, which involves
calling the signal handler directly. This is a straightforward approach and uses
the traditional paradigm of using callback functions, where a callback function
is a function pointer that is registered with the caller. In this case, the caller or
the invoker is the signal handling subsystem of the OS.
It turns out that there is a second mechanism, which is not used that widely.
As compared to the default mechanism, which is asynchronous (signal handlers
can run at any time), this is a synchronous mechanism. In this case, signal
handling is a planned process; it is not the case that signals can arrive at any
point of time and then need to be handled immediately. This notion is captured
in the field signalfd wqh. The idea is that the process registers a file descriptor
with the OS – we refer to this as the signalfd file. Whenever a signal needs to
be sent to the process, the OS writes the details of the signal to the signalfd
file. Processes, in this case, typically wait for signals to arrive; hence, they need
to be woken up. At their leisure, they can check the contents of the file and
process the signals accordingly.
Now, it is possible that multiple processes are waiting for something to be
written to the signalfd file. Hence, there is a need to create a queue of waiting
processes. This wait queue is the signalfd wqh field.
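A minimal sketch of this synchronous, signalfd-based style of signal handling (Linux-specific) is shown below.

#include <signal.h>
#include <stdio.h>
#include <sys/signalfd.h>
#include <unistd.h>

int main(void)
{
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGUSR1);

    /* Block normal (asynchronous) delivery so that the signal is queued
       and can instead be read from the file descriptor. */
    sigprocmask(SIG_BLOCK, &mask, NULL);

    int fd = signalfd(-1, &mask, 0);   /* create the signalfd file */

    kill(getpid(), SIGUSR1);           /* send ourselves a signal  */

    struct signalfd_siginfo si;
    read(fd, &si, sizeof(si));         /* synchronously "receive" the signal */
    printf("read signal %u from the signalfd\n", si.ssi_signo);

    close(fd);
    return 0;
}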
However, the more common method of handling signals is the regular
asynchronous mechanism. All that we need to store here is an array of 64
( NSIG) signal handlers – 64 is the maximum number of signals that Linux
on x86 supports. Each signal handler is wrapped using the k sigaction
structure. On most architectures, this simply wraps the sigaction structure,
which we shall describe next.
struct sigaction
The important fields of struct sigaction are shown in Listing 4.14. The
fields are reasonably self-explanatory. sa handler is the function pointer to the
handler in the thread's user-space memory. flags represents the parameters
that the kernel uses to handle the signal, such as whether a separate stack needs
to be used or not. Finally, we have the set of masked signals.
struct sigpending
The final data structure that we need to define is the list of pending signals
(struct sigpending). This data structure is reasonably complicated and we
will very soon understand why. It uses some of the tricky features of Linux
linked lists, which we have very nicely steered away from up till now.
struct sigpending {
    struct list_head list;
    sigset_t signal;
};

struct sigqueue {
    struct list_head list;   /* pointing to its current position in the
                                queue of sigqueues */
    kernel_siginfo_t info;   /* signal number, signal source, etc. */
};

[Figure 4.10: A sigpending head followed by a linked chain of sigqueue entries.]
Refer to Figure 4.10. The structure sigpending wraps a linked list that
contains all the pending signals. The name of the list is as simple as it can be:
list. The other field of interest is signal, which is simply a bit vector whose
ith bit is set if the ith signal is pending for the process. Note that this is why
there is a requirement that two signals of the same type can never be pending
for the same process at the same time.
Each entry of the linked list is of type struct sigqueue. Note that we
discussed in Appendix C that in Linux different kinds of nodes can be part
of a linked list. Hence, in this case we have the head of the linked list as a
structure of type sigpending, whereas all the entries are of type sigqueue. As
non-intuitive as this may seem, this is indeed possible in Linux’s linked lists.
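The trick is that the list node (struct list head) is embedded inside each structure, and the enclosing structure is recovered with pointer arithmetic – the container of idiom. A stripped-down user-space sketch (the structure names are invented; the kernel's list add and list entry helpers are replaced by hand-written code) is shown below.

#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

/* Recover the enclosing structure from a pointer to its embedded node. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct pending_head {                  /* plays the role of sigpending */
    struct list_head list;
};

struct queued_signal {                 /* plays the role of sigqueue */
    struct list_head list;
    int signo;
};

int main(void)
{
    struct pending_head  head;
    struct queued_signal q = { .signo = 10 /* SIGUSR1 on x86 */ };

    /* Link one element after the head (the kernel would use list_add). */
    head.list.next = &q.list;    q.list.prev = &head.list;
    q.list.next    = &head.list; head.list.prev = &q.list;

    struct queued_signal *entry =
        container_of(head.list.next, struct queued_signal, list);
    printf("pending signal number: %d\n", entry->signo);
    return 0;
}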
struct ucontext {
    unsigned long uc_flags;
    stack_t uc_stack;               /* user's stack pointer */
    struct sigcontext uc_mcontext;  /* snapshot of all the registers
                                       and the process's state */
};
struct rt sigframe keeps all the information required to store the context
of the thread that was signalled. The context per se is stored in the structure
struct ucontext. Along with some signal handling flags, it stores two vital
pieces of information: the pointer to the user thread's stack and a snapshot
of all the user thread's registers and its state. The stack pointer can be in
the same region of memory as the user thread’s stack or in a separate memory
region. Recall that it is possible to specify a separate address for storing the
signal handler’s stack.
The next argument is the signal information that contains the details of the
signal: its number, the relevant error code and the details of the source of the
signal.
The last argument is the most interesting. The question is where should the
signal handler return to? It cannot return to the point at which the original
thread stopped executing. This is because its context has not been restored yet.
Synchronization and Scheduling
In this chapter we will discuss two of the most important concepts in operating
systems, namely synchronization and scheduling. The first deals with managing
resources that are common to a bunch of processes or threads (shared between
them). It is possible that there will be competition amongst the threads or
processes to acquire the resource; this is also known as a race condition. Such
data races can lead to errors. As a result, only one of the processes should
access the shared resource at any point of time.
Once all such synchronizing conditions have been worked out, it is the role
of the operating system to ensure that all the computing resources namely the
cores and accelerators are optimally used. There should be no idleness or exces-
sive context switching. Therefore, it is important to design a proper scheduling
algorithm such that tasks can be efficiently mapped to the available compu-
tational resources. We shall see that there are a wide variety of scheduling
algorithms, constraints and possible scheduling goals. Given that there are such
a wide variety of practical use cases, situations and circumstances, there is no
one single universal scheduling algorithm that outperforms all the others. In
fact, we shall see that for different situations, different scheduling algorithms
perform very differently.
5.1 Synchronization
5.1.1 Introduction to Data Races
Consider the case of a multicore CPU. We want to do a very simple operation,
which is just to increment the value of the count variable that is stored in
memory. It is a regular variable and incrementing it should be easy. Listing 5.1
shows that it translates to three assembly-level instructions. We are showing C-
like code without the semicolon for the sake of enhancing readability. Note that
each line corresponds to one line of assembly code (or one machine instruction)
in this code snippet. count is a global variable that can be shared across
threads. t1 corresponds to a register (private to each thread and core). The
first instruction loads the variable count to a register, the second line increments
the value in the register and the third line stores the incremented value in the
memory location corresponding to count.
This code is very simple, but when multiple threads execute it concurrently,
several correctness problems can arise.
Consider the scenario shown in Figure 5.1. Note again that we first load
the value into a register, then we increment the contents of the register and
finally save the contents of the register in the memory address corresponding to
the variable count. This makes a total of three instructions that are not executed
atomically – they execute at three different instants of time. Here there is a possibility
of multiple threads trying to execute the same code snippet at the same point
of time and also update count concurrently. This situation is called a data race
(a more precise and detailed definition follows later).
count = 0

Thread 1          Thread 2
t1 = count        t2 = count
t1 = t1 + 1       t2 = t2 + 1
count = t1        count = t2

Figure 5.1: Incrementing the count variable in parallel (two threads). The
threads run on two different cores. t1 and t2 are thread-specific variables
mapped to registers.
Before we proceed towards that and elaborate on how and why a data race
can be a problem, we need to list a couple of assumptions.
❶ The first assumption is that each basic statement in Listing 5.1 corre-
sponds to one line of assembly code, which is assumed to execute atomically.
This means that it appears to execute at a single instant of time.
❷ The second assumption here is that the delay between two instructions
can be indefinitely long (arbitrarily large). This could be because of hardware-
level delays or could be because there is a context switch and then the context
is restored after a long time. We cannot thus assume anything about the timing
of the instructions, especially the timing between consecutive instructions given
that there could be indefinite delays for the aforementioned reasons.
Now given these assumptions, let us look at the example shown in Figure 5.1
and one possible execution in Figure 5.2. Note that a parallel program can have
many possible executions. We are showing one of them, which is particularly
count = 0

Thread 1                    Thread 2
t1 = count                  t2 = count
t1 = t1 + 1                 t2 = t2 + 1
count = t1  (count = 1)
                            count = t2  (count = 1)
Final value: 1 (should be 2)

Figure 5.2: An execution that leads to the wrong value of the count variable
problematic. We see that the two threads read the value of the variable count
at exactly the same point of time, without any synchronization or coordination
between them, into their respective registers (temporary variables t1 and t2).
They then increment their respective registers and store the incremented values
in the memory address corresponding to count. Since we are executing the
statement count++
twice, we expect that the final value of count should be equal to 2 (recall that
it is initialized to 0).
Here we see that the final value of count is equal to 1, which is clearly
incorrect. This is basically because there was a competition or a data race between
the threads, and the value of count could not be incremented correctly. This
allowed both the threads to compete or race, which did not turn out to be a
good idea in hindsight. What we instead should have done is allowed one thread
to complete the entire sequence of operations first, and then allowed the other
thread to begin executing the sequence of instructions to increment the variable
count.
The main issue here is that of competition or the overlapping execution, and
thus there is a need for a locking mechanism. A lock needs to be acquired before
we enter such a sequence of code, which is also referred to as a critical section.
The idea is that we first acquire a lock, which basically means only one thread
can proceed past the lock. Or in other words, if multiple threads are trying to
acquire the lock at the same time, then only one of them is successful. After that,
the successful thread proceeds to execute the instructions in the critical section,
which in this case increment the value of the variable count. Finally, there
is a need to release the lock or unlock it. Once this is done, one of the other
threads that has been waiting to acquire the lock can again compete for it and
if it wins the lock acquiring race, it is deemed to acquire the lock. It can then
begin to execute the critical section. This is how traditional code works using
locks and this mechanism is extremely popular and effective – in fact, this is
the de facto standard. All shared variables such as count should always be
accessed using such a lock-unlock mechanism. This mechanism avoids
such competing situations because locks play the role of access synchronizers
(see Figure 5.3).
Figure 5.4 shows the execution of the code snippet count++ by two threads.
Note the critical sections, the use of the lock and unlock calls. Given that the
critical section is protected with locks, there are no data races here. The final
value is correct: count = 2.
Thread 1            Thread 2
lock()
t1 = count
t1 = t1 + 1
count = t1
unlock()
                    lock()
                    t2 = count
                    t2 = t2 + 1
                    count = t2
                    unlock()

Figure 5.4: Two threads incrementing count by wrapping the critical section
within a lock-unlock call pair
means that the lock is already acquired or in other words it is busy. Once a
thread finds that the value has changed back to 0 (free), it tries to set it to 1
(test-and-set phase). In this case, it is inevitable that there will be a competition
or a race among the threads to acquire the lock (set the value in A to 1). Regular
reads or writes cannot be used to implement such locks.
It is important to use an atomic synchronizing instruction that almost all
the processors provide as of today. For instance, we can use the test-and-set
instruction that is available on most hardware. This instruction checks the value
of the variable stored in memory and if it is 0, it atomically sets it to 1 (appears
to happen instantaneously). If it is able to do so successfully (0 → 1), it returns
a 1, else it returns 0. This basically means that if two threads are trying to set
the value of a lock variable from 0 to 1, only one of them will be successful. The
hardware guarantees this.
The test-and-set instruction returns 1 if it is successful, and it returns 0 if
it fails (cannot set 0 → 1). Clearly, we can extend the argument and observe that
if there are n threads that all want to convert the value of the lock variable from
0 to 1, then only one of them will succeed. The thread that was successful
is deemed to have acquired the lock. The rest of the threads, which were
unsuccessful, need to keep trying (iterating). This process is also known
as busy waiting. Such a lock, which involves busy waiting, is also called a spin
lock.
It is important to note that we are relying on a hardware instruction that
atomically sets the value in a memory location to another value and indicates
whether it was successful in doing so or not. There is a lot of theory around this
and there are also a lot of hardware primitives that play the role of atomic oper-
ations. Many of them fall in the class of read-modify-write (RMW) operations.
They read the value stored at a memory location, sometimes test if it satisfies
a certain property or not, and then they modify the contents of the memory
location accordingly. These RMW operations are typically used in implement-
ing locks. The standard method is to keep checking whether the lock variable
is free or not. The moment the lock is found to be free, threads compete to
acquire the lock using atomic instructions. Atomic instructions guarantee that
only one instruction is successful at a time. Once a thread acquires the lock, it
can proceed to safely access the critical section.
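As an illustration of this pattern (and not the kernel's actual lock code), the following sketch builds a simple test-and-test-and-set spin lock using C11 atomics. Here atomic_exchange plays the role of the test-and-set instruction described above; the lock variable holds 0 when the lock is free and 1 when it is busy, and it must be initialized to 0 before use.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = busy */

void spin_lock(spinlock_t *l)
{
    while (true) {
        /* Test phase: spin on ordinary reads while the lock looks busy */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) == 1)
            ;  /* busy wait */
        /* Test-and-set phase: atomically try to change 0 -> 1.
           The exchange returns the old value; 0 means we won the race. */
        if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
            return;  /* lock acquired */
    }
}

void spin_unlock(spinlock_t *l)
{
    /* Release ordering makes the critical section's writes visible
       before the lock is seen as free */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}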
After executing the critical section, unlocking is quite simple. All that needs
to be done is to set the value of the lock's memory location back to 0
(free). However, bear in mind that if one takes a computer architecture course,
one will realize that this is not that simple. This is because all the memory
operations that have been performed in the critical section should be visible to
all the threads running on other cores once the lock has been unlocked. This
normally does not happen as architectures and compilers tend to reorder in-
structions. Also, it is possible that the instructions in the critical section are
visible to other threads before a lock is fully acquired unless additional precau-
tions are taken. This is again an unintended consequence of reordering that
is done by compilers and machines for performance reasons. Such reordering
needs to be kept in check.
This is why most atomic instructions either additionally act as fence instruc-
tions or a separate fence instruction is added by the library code to lock/unlock
functions.
This can be seen in Figure 5.4, where we show how two threads execute two
instances of the count++ operation. The first increment of the count variable is
followed by an unlock operation. There is a happens-before relationship
between this unlock operation and the subsequent lock operation, which in turn
leads to the second update of the count variable. Therefore, we can say that there is a
happens-before relationship between the first and second updates to the count
variable. Note that this relationship is a property of a given execution. In a
different execution, a different happens-before relationship may be visible. A
happens-before relationship by definition is a transitive relationship.
The moment we do not have such happens-before relationships between ac-
cesses, they are deemed to be concurrent. Note that in our example, such
happens-before relationships are being enforced by the lock/unlock operations
and their inherent fences. The happens-before order is: updates in the critical section
→ unlock operation → lock operation → reads/writes in the second critical section
(and so on). Encapsulating critical sections within lock-unlock pairs
creates such happens-before relationships. Otherwise, we have data races.
Such data races are clearly undesirable as we saw in the case of count++.
Hence, concurrent and conflicting accesses to the same shared variable should
be avoided. With data races, it is possible that we may have hard-to-detect
bugs in the program. Data races also have a much deeper significance in terms
of the correctness of the execution of parallel programs. At this point we are not
in a position to appreciate all of this. All that can be said is that data-race-
free programs have a lot of nice and useful properties, which are very important
in ensuring the correctness of parallel programs. Hence, data races should be
avoided for a wide variety of reasons. Refer to the book by your author on
Advanced Computer Architecture [Sarangi, 2023] for a detailed explanation of
data races, and their implications and advantages.
Important Point 2 An astute reader may argue that there have to be data
races in the code to acquire the lock itself. However, those happen in a very
controlled manner and they don’t pose a correctness problem. This part of
the code is heavily verified and is provably correct. The same cannot be said
about data races in regular programs.
Properly-Labeled Programs
Figure 5.6: A figure showing a situation with two critical sections. The first
is protected by lock X and the second is protected by lock Y . Address C is
common to both the critical sections. There may be a data race on address C.
5.1.4 Deadlocks
Using locks sadly does not come for free; they can lead to a situation known
as a deadlock. A deadlock is defined as a situation where one thread is waiting
on another thread, that thread is waiting on yet another thread, and so on
– we have a circular or cyclic wait. This basically means that in a deadlocked
situation, no thread can make any progress. In Figure 5.7, we see
such a situation with locks.
Thread 1 Thread 2
Lock X Lock Y
Lock Y Lock X
It shows that one thread holds lock X and it tries to acquire lock Y . On the
other hand, the second thread holds lock Y and tries to acquire lock X. There
is a clear deadlock situation here. It is not possible for any thread to make
progress because they are waiting on each other. This is happening because we
are using locks and a thread cannot make any progress unless it acquires the
lock that it is waiting for. Code with locks may thus lead to such deadlocks,
which are characterized by circular waits. Let us elaborate.
There are four conditions for a deadlock to happen. Hence, if a deadlock
is to be avoided or prevented, at least one of these conditions needs to be
eliminated. The conditions are as follows:
1. Hold-and-wait: In this case, a thread holds on to a set of locks and waits
to acquire another lock. We can clearly see this happening in Figure 5.7,
where we are holding on to a lock and trying to grab one more lock.
2. No preemption: It basically means that a lock cannot be forcibly taken
away from a thread after it has acquired it. This follows from the literal
meaning of the word "preemption".
3. Mutual exclusion: A resource (a lock in this case) can be held by at most
one thread at any point of time; it cannot be shared.
4. Circular wait: As we can see in Figure 5.7, all the threads are waiting
on each other and there is a circular or cyclic wait. A cyclic wait ensures
that no thread can make any progress.
right ones) and starting to eat is the same as entering the critical section. This
means that both the forks have been acquired.
It is very easy to see that a deadlock situation can form here. For instance,
every philosopher can pick up his left fork first. All of the philosophers can
pick up their respective left forks at the same time and keep waiting for their
right forks to be put on the table. These have sadly been picked up from the
table by their respective neighbors. Clearly a circular wait has been created.
Let us look at the rest of the deadlock conditions, which are no-preemption
and hold-and-wait, respectively. Clearly, mutual exclusion will always have to
hold because a fork cannot be shared between neighbors at the same moment
of time.
Preemption – forcibly taking away a fork from a neighbor – seems to be
difficult because the neighbor can also do the same. Designing a protocol
around this idea seems to be difficult. Let us try to relax hold-and-wait. A
philosopher may give up after a certain point of time and put the fork that he
has acquired back on the table. Again creating a protocol around this appears
to be difficult because it is very easy to get into a livelock.
Hence, the simplest way of dealing with this situation is to try to avoid the
circular wait condition. In this case, we would like to introduce the notion of
asymmetry, where we can change the rules for just one of the philosophers. Let
us say that the default algorithm is that a philosopher picks the left fork first
and then the right one. We change the rule for one of the philosophers: he
acquires his right fork first and then the left one.
It is possible to show that a circular wait cannot form. Let us number the
philosophers from 1 to n. Assume that the nth philosopher is the one that has
the special privilege of picking up the forks in the reverse order (first right and
then left). In this case, we need to show that a cyclic wait can never form.
Assume that a cyclic wait has formed. It means that a philosopher (other than
the last one) has picked up the left fork and is waiting for the right fork to be
put on the table. This is the case for philosophers 1 to n − 1. Consider what
is happening between philosophers n − 1 and n. The (n − 1)th philosopher
picks its left fork and waits for the right one. The fact that it is waiting basically
means that the nth philosopher has picked it up. This is his left fork. It means
that he has also picked up his right fork because he picks up the forks in the
reverse order. He first picks up his right fork and then his left one. This basically
means that the nth philosopher has acquired both the forks and is thus eating
his food. He is not waiting. We therefore do not have a deadlock situation over
here.
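A minimal sketch of this asymmetric solution using pthread mutexes is shown below. The numbering here is illustrative: philosopher ids run from 0 to N-1, fork i sits between philosophers i and i+1, and the last philosopher reverses the order of acquisition. Each fork_mtx[i] is assumed to be initialized with pthread_mutex_init before use.

#include <pthread.h>

#define N 5                        /* number of philosophers (assumption) */
pthread_mutex_t fork_mtx[N];       /* one mutex per fork */

void pick_forks(int id)
{
    int left  = id;
    int right = (id + 1) % N;

    if (id != N - 1) {
        /* Default rule: left fork first, then the right one */
        pthread_mutex_lock(&fork_mtx[left]);
        pthread_mutex_lock(&fork_mtx[right]);
    } else {
        /* The special philosopher reverses the order: right first, then left.
           This breaks the circular wait. */
        pthread_mutex_lock(&fork_mtx[right]);
        pthread_mutex_lock(&fork_mtx[left]);
    }
}

void put_forks(int id)
{
    /* The release order does not matter */
    pthread_mutex_unlock(&fork_mtx[id]);
    pthread_mutex_unlock(&fork_mtx[(id + 1) % N]);
}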
In the first phase, we simply acquire all the locks in ascending order of their
addresses. In the second phase, we release all the locks. Here the assumption is
that all the locks that will be acquired are known in advance. In reality, this is
not a very serious limitation because in a large number of practical use cases,
this information is often known.
The advantage here is that we will not have deadlocks. This is because a
circular wait cannot happen. There is a fundamental asymmetry in the way that
we are acquiring locks in the sense that we are acquiring them in an ascending
order of addresses.
Let us prove deadlock prevention by contradiction. Assume that there is a
circular wait. Let us annotate each edge uv in this circular loop with the lock
address A – Process Pu wants to acquire lock A that is currently held by Pv .
As we traverse this list of locks (in the circular wait cycle), the addresses will
continue to increase because a process always waits on a lock whose address is
larger than the address of any lock that it currently holds. Continuing on these
lines, we observe that in a circular wait, the lock addresses keep increasing.
Given that there is a circular wait, there will be a process Px that is waiting for
lock A that is held by Py (Px → Py ). Given the circular wait, assume that Px
holds lock A′ , which Pz is waiting to acquire (Pz → Px ). We have a circular wait
of the form Px → Py → . . . → Pz → Px . Now, lock addresses need to increase as
we traverse the circular waiting loop. This is because a process always covets a
lock whose address is higher than the addresses of all the locks that it currently
holds (due to the two-phase locking protocol). We thus have A′ > A. Now, Px
holds A′ and it waits for A. This means that A > A′ . Both cannot be true. We
thus have a contradiction. Hence, a circular wait is not possible.
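The scheme can be sketched as follows with pthread mutexes (the helper names are illustrative). The helper simply sorts the two locks by address before acquiring them, which is exactly the ascending-address acquisition order described above.

#include <pthread.h>
#include <stdint.h>

/* Phase 1: acquire both locks in ascending order of their addresses */
void acquire_both(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if ((uintptr_t) a > (uintptr_t) b) {
        pthread_mutex_t *tmp = a; a = b; b = tmp;   /* sort by address */
    }
    pthread_mutex_lock(a);    /* lower address first */
    pthread_mutex_lock(b);    /* then the higher address */
}

/* Phase 2: release both locks */
void release_both(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}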
The other approach is deadlock avoidance. This is more like taking a
medicine for a disease. In this case, before acquiring a lock, we check if a
deadlock will happen or not, and if there is a possibility of a deadlock, then we
do not acquire the lock. We throw an exception such that the user process that
initiated the lock acquisition process can catch it and take appropriate action.
The last approach is called deadlock recovery. Here, we run the system
optimistically. We have a deadlock detector that runs as a separate thread.
Whenever we detect sustained inactivity in the system, the deadlock detector
looks at all the shared resources and tries to find cycles. A cycle may indicate a
deadlock (subject to the other three conditions). If such a deadlock is detected,
there is a need to break it. Often, sledgehammer-like approaches are used. This
means either killing a process or forcefully taking the locks away from it.
In the main function, two pthreads are created. The arguments to the
pthread create function are a pointer to the pthread structure, a pointer to
a pthread attribute structure that shall control its behavior (NULL in this ex-
ample), the function pointer that needs to be executed and a pointer to its sole
argument. If the function takes multiple arguments, then we need to put all of
them in a structure and pass a pointer to that structure.
The return value of the func function is quite interesting. It is a void *,
which is a generic pointer. In our example, it is a pointer to an integer that is
equal to 2 times the thread id. When a pthread function (like func) returns, akin
to a signal handler, it returns to the address of a special routine. Specifically,
it does the job of cleaning up the state and tearing down the thread. Once the
thread finishes, the parent thread that spawned it can wait for it to finish using
the pthread join call.
This is similar to the wait call invoked by a parent process, when it waits
for a child to terminate in the regular fork-exec model. In the case of a regular
process, we collect the exit code of the child process. However, in the case
of pthreads, the pthread join call takes two arguments: the pthread, and the
address of a pointer variable (&result). The value filled in the address is exactly
the pointer that the pthread function returns. We can proceed to dereference
the pointer and extract the value that the function wanted to return.
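A self-contained sketch along these lines is shown below. It is not the exact listing referred to in the text; the names func and result simply follow the description above, and each thread returns a pointer to an integer equal to two times its id.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Thread function: returns a pointer to (2 * thread id) */
void *func(void *arg)
{
    int id = *(int *) arg;
    int *ret = malloc(sizeof(int));
    *ret = 2 * id;
    return ret;                 /* collected later by pthread_join */
}

int main(void)
{
    pthread_t threads[2];
    int ids[2] = {1, 2};

    for (int i = 0; i < 2; i++)
        pthread_create(&threads[i], NULL, func, &ids[i]);

    for (int i = 0; i < 2; i++) {
        void *result;
        pthread_join(threads[i], &result);   /* wait for the thread to finish */
        printf("thread %d returned %d\n", ids[i], *(int *) result);
        free(result);
    }
    return 0;
}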
Given that we have now created a mechanism to create pthread functions
that can be made to run in parallel, let us implement a few concurrent algo-
rithms. Let us try to increment a count.
Consider the code in Listing 5.3. A lock in pthreads is of type pthread mutex t.
It needs to be initialized using the pthread mutex init call. The first argument
is a pointer to the pthread mutex (lock), and the second argument is a pointer
to a pthread attributes structure. If it is NULL, then it means that the lock
will exhibit its default behavior.
The lock and unlock functions are indeed quite simple here. We can just use
the calls pthread mutex lock and pthread mutex unlock, respectively. All the
code between them comprises the critical section.
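A minimal sketch of this pattern (not the exact code of Listing 5.3; the names count_lock and increment are illustrative) is shown below.

#include <pthread.h>

int count = 0;                    /* shared variable */
pthread_mutex_t count_lock;       /* protects count */

void init(void)
{
    /* NULL attributes: the mutex exhibits its default behavior */
    pthread_mutex_init(&count_lock, NULL);
}

void increment(void)
{
    pthread_mutex_lock(&count_lock);     /* enter the critical section */
    count++;
    pthread_mutex_unlock(&count_lock);   /* leave the critical section */
}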
that needs to be achieved is complex. Let us look at another example that uses
another atomic primitive – the compare-and-swap instruction.
Let us now use the CAS method to increment count (code shown in List-
ing 5.6).
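The listing itself is not reproduced here; the following illustrative sketch conveys the same idea using the C11 atomic_compare_exchange_strong call. The increment is retried until the CAS succeeds, which is why (as discussed later) this style of code is lock free but not wait free.

#include <stdatomic.h>

atomic_int count = 0;     /* shared counter */

void cas_increment(void)
{
    int oldval, newval;
    do {
        oldval = atomic_load(&count);
        newval = oldval + 1;
        /* CAS: if count still equals oldval, replace it with newval;
           otherwise some other thread got in first and we retry */
    } while (!atomic_compare_exchange_strong(&count, &oldval, newval));
}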
ascending order of physical time, then we can arrange all the methods sequen-
tially. If we think about it, this is a way of mapping a parallel execution to
a sequential execution, as we can see in Figure 5.9. This mapped sequential
execution is of great value because the human mind finds it very easy to reason
about sequential executions, whereas it is very difficult to make sense of parallel
executions.
Let us say that the sequential execution (shown at the bottom of the figure)
is equivalent to the parallel execution. If this sequential execution satisfies the
semantics of the algorithm, then it is said to be legal. For example, in Figure 5.9,
we show a set of enqueue and dequeue operations that are issued by multiple
threads. The parallel execution is hard to reason about (prove or disprove
correctness, either way); however, the equivalent sequential execution can easily
be checked to see if it follows the semantics of a queue – it needs to show FIFO
behavior. Atomicity and the notion of a point of completion allow us to check
a parallel algorithm for correctness. But, we are not fully there yet. We need a
few more definitions and concepts in place.
Hypothetical sequential order: {1} {3,1} {3} {} {2} {} {4} {5,4} {5}
Figure 5.9: A parallel execution and its equivalent sequential execution. Every
event has a distinct start time and end time. In this figure, we assume that we
know the completion time. We arrange all the events in ascending order of their
completion times in a hypothetical sequential order at the bottom. Each point
in the sequential order shows the contents of the queue after the respective
operation has completed. Note that the terminology enq: 3 means that we
enqueue 3, and similarly deq: 4 means that we dequeue 4.
The key question that needs to be answered is where this point of completion
lies vis-à-vis the start and end points. If it always lies between them, then
we can always claim that before a method call ends, it is deemed to have fully
completed – its changes to the global state are visible to all the threads. This
is a very strong correctness criterion for a parallel execution. We are, of course,
assuming that the equivalent sequential execution is legal. This correctness
criterion is known as linearizability.
Linearizability
Linearizability is the de facto criterion used to prove the correctness of con-
current data structures that are of a non-blocking nature. If all the executions
that a concurrent data structure allows are linearizable, then the data structure
itself is said to be linearizable.
Now, let us address the last conundrum. Even if the completion times are not
known, which is often the case, as long as we can show that distinct completion
points appear to exist for each method (between its start and end), the execution
is deemed to be linearizable. Mere existence of completion points is what needs
to be shown. Whether the method actually completes at that point or not is
not important. This is why we keep using the word “appears” throughout the
definitions.
Figure 5.10: A write leaves the CPU, waits in the write buffer (possibly for a long delay), and only later reaches the L1 cache
Now consider the other case when the point of completion may be after the
end of a method. For obvious reasons, it cannot be before the start point of
a method. An example of such an execution, which is clearly atomic but not
linearizable, is a simple write operation in multicore processors (see Figure 5.10).
The write method returns when the processor has completed the write operation
and has written it to its write buffer. This is also when the write operation is
removed from the pipeline. However, that does not mean that the write operation has
completed. It completes when it is visible to all the threads, which can happen
much later – when the write operation leaves the write buffer and is written to
a shared cache. This is thus a case when the completion time is beyond the end
time of the method. The word “beyond” is being used in the sense that it is
“after” the end time in terms of the real physical time.
We now enter a world of possibilities. Let us once again consider the simple
read and write operations that are issued by cores in a multicore system. The
moment we consider non-linearizable executions, the completion time becomes
very important. The reason for preferring non-linearizable executions is to
enable a host of performance-enhancing optimizations in the compiler, processor
and the memory system. These optimizations involve delaying and reordering
instructions. As a result, the completion time can be well beyond the end time.
The more relaxed we are in such matters, the higher the performance.
The question that naturally arises is how do we guarantee the correctness of
algorithms? In the case of linearizability, it was easy to prove correctness. We
just had to show that for each method a point of completion exists, and if we ar-
range these points in an ascending order of completion times, then the sequence
is legal – it satisfies the semantics of the concurrent system. For complex concur-
rent data structures like stacks and queues, linearizability is preferred; however,
for simpler operations like reads and writes at the hardware level, many other
models are used. These models, which precisely define which parallel executions
comprising just reads and writes are legal, are known as memory models
or memory consistency models. Every multicore processor as of today defines a
memory model. It needs to be respected by the compiler and library writers.
If we just confine ourselves to reads, writes and basic atomic operations like
test-and-set or compare-and-swap, then we need to decide if a given parallel
execution adheres to a given memory model or not. Answering this question
is beyond the scope of this book. The textbook on Next-Generation Computer
Architecture [Sarangi, 2023] by your author is the right point to start.
Let us consider another memory model called sequential consistency (SC),
which again stands out in the space of memory models. It is perceived to be
quite slow in practice and thus not used. However, it is used as a gold standard
for correctness.
Sequential Consistency
Thread 1 Thread 2
x=1 y=1
t1 = y t2 = x
Note that if we run this code many times on a multicore machine, we shall
see different outcomes. It is possible that Thread 1 executes first and completes
both of its instructions and then Thread 2 is scheduled on another core, or vice
versa, or their execution is interleaved. Regardless of the scheduling policy, we
will never observe the outcome t1 = t2 = 0 if the memory model is SC or lin-
earizability. The reason is straightforward. All SC and linearizable executions
respect the per-thread order of instructions. In this case, the first instruction
to complete will either be x = 1 or y = 1. Hence, at least one of t1 or t2 must
be non-zero.
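The same experiment can be expressed with C11 atomics, which default to sequentially consistent ordering. Under SC, and hence in this sketch, the printed outcome can never be t1 = 0 and t2 = 0 simultaneously. The code below is an illustrative sketch.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int x = 0, y = 0;
int t1, t2;

void *thread1(void *arg)
{
    (void) arg;
    atomic_store(&x, 1);      /* x = 1 (seq_cst by default) */
    t1 = atomic_load(&y);     /* t1 = y */
    return NULL;
}

void *thread2(void *arg)
{
    (void) arg;
    atomic_store(&y, 1);      /* y = 1 */
    t2 = atomic_load(&x);     /* t2 = x */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, thread1, NULL);
    pthread_create(&b, NULL, thread2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Possible outcomes: (0,1), (1,0) or (1,1), but never (0,0) */
    printf("t1 = %d, t2 = %d\n", t1, t2);
    return 0;
}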
kinds of concurrent algorithms. Let us start with the simplest and the most
relaxed progress guarantee.
Obstruction Freedom
It is called obstruction freedom, which basically says that in an n-thread system,
if we set any set of (n − 1) threads to sleep, then the only thread that is active
will be able to complete its execution in a bounded number of internal steps.
This means that we cannot use locks because if the thread that has acquired
the lock gets swapped out or goes to sleep, no other thread can complete the
operation.
Wait Freedom
Now, let us look at another progress guarantee, which is at the other end of
the spectrum. It is known as wait freedom. In this case, we avoid all forms
of starvation. Every thread completes the operation within a bounded number
of internal steps. So in this case, starvation is not possible. The code shown
in Listing 5.4 is an example of a wait-free algorithm because regardless of the
number of threads and the amount of contention, it completes within a bounded
number of internal steps. However, the code shown in Listing 5.6 is not a wait-
free algorithm. This is because there is no guarantee that the compare and swap
will be successful in a bounded number of attempts. Thus we cannot guarantee
wait freedom. However, this code is obstruction free because if any set of (n − 1)
threads go to sleep, then the only thread that is active will succeed in the CAS
operation and ultimately complete the overall operation in a bounded number
of steps.
Lock Freedom
Given that we have now defined what an obstruction-free and a wait-free al-
gorithm is, we can now tackle the definition of lock freedom, which is slightly
more complicated. In this case, let us count the cumulative number of steps
that all the n threads in the system execute. We have already mentioned that
there is no correlation between the time it takes to complete an internal step
across the n threads. While that remains true, we can still take a system and count
the cumulative number of internal steps taken by all the threads together. Lock
freedom basically says that if this cumulative number is above a certain thresh-
old or a bound, then we can say for sure that at least one of the operations has
completed successfully. Note that in this case, we are saying that at least one
thread will make progress and there can be no deadlocks.
All the threads also cannot get stuck in a livelock. However, there can be
starvation because we are taking a system-wide view and not a thread-specific
view here. As long as one thread makes progress by completing operations, we
do not care about the rest of the threads. This was not the case in wait-free
algorithms. The code shown in Listing 5.6 is lock free, but it is not wait free.
The reason is that the compare and exchange has to be successful for at least one
of the threads and that thread will successfully move on to complete the entire
count increment operation. The rest of the threads will fail in that iteration.
However, that is not of a great concern here because at least one thread achieves
success.
It is important to note that every program that is wait free is also lock free.
This follows from the definition of lock freedom and wait freedom, respectively.
If we are saying that in less than k internal steps, every thread is guaranteed
to complete its operation, then in nk system-wide steps, at least one thread is
guaranteed to complete its operation. By the pigeon hole principle, at least one
thread must have taken k steps and completed its operation. Thus wait freedom
implies lock freedom.
Similarly, every program that is lock free is also obstruction free, which again
follows very easily from the definitions. This is the case because we are saying
that if the system as a whole takes a certain number of steps (let’s say k ′ ), then
at least one thread successfully completes its operation. Now, if n − 1 threads
in the system are quiescent, then only one thread is taking steps and within k ′
steps it has to complete its operation. Hence, the algorithm is obstruction free.
Wait free ⊂ Lock free ⊂ Obstruction free
Figure 5.11: Venn diagram showing the relationship between different progress
guarantees
However, the converse is not true in the sense that it is possible to find a
lock-free algorithm that is not wait free and an obstruction free algorithm that is
not lock free. This can be visualized in a Venn diagram as shown in Figure 5.11.
All of these algorithms cannot use locks. They are thus broadly known as non-
blocking algorithms even though they provide very different kinds of progress
guarantees.
An astute reader may ask why not use wait-free algorithms every time be-
cause after all there are theoretical results that say that any algorithm can be
converted to a parallel wait-free variant, which is also provably correct. This
part is correct; however, wait-free algorithms tend to be very slow and are also
very difficult to write and verify. Hence, in most practical cases, a lock-free
implementation is much faster and is far easier to code and verify. In general,
obstruction freedom is too weak as a progress guarantee. Thus it is hard to find
a practical system that uses an obstruction-free algorithm. In most practical
systems, lock-free algorithms are used, which optimally trade off performance,
correctness and complexity.
There is a fine point here. Many authors replace the bounded property
in the definitions with finite. The latter property is more theoretical and often
does not gel well with practical implementations. Hence, we have decided not
to use it in this book. We will continue with bounded steps, where the bound
can be known in advance.
5.1.8 Semaphores
Let us now consider another synchronization primitive called a semaphore. We
can think of it as a generalization of a lock. It is a more flexible variant of a
lock, which admits more than two states. Recall that a lock has just two states:
locked and unlocked.
void get_write_lock() {
    LOCK(__rwlock);
}

void release_write_lock() {
    UNLOCK(__rwlock);
}

void get_read_lock() {
    LOCK(__rdlock);
    if (readers == 0) LOCK(__rwlock);
    readers++;
    UNLOCK(__rdlock);
}

void release_read_lock() {
    LOCK(__rdlock);
    readers--;
    if (readers == 0)
        UNLOCK(__rwlock);
    UNLOCK(__rdlock);
}
The code for the locks is shown in Listing 5.10. We are assuming two macros
LOCK and UNLOCK. They take a lock (mutex) as their argument, and invoke the
methods lock and unlock, respectively. We use two locks: __rwlock (for both
readers and writers) and __rdlock (only for readers). The "__" prefix signifies
that these are internal locks within the reader-writer lock. These locks are
meant for implementing the logic of the reader-writer lock, which provides two
key functionalities: get or release a read lock (allow a process to only read),
and get or release a write lock (allow a process to read/write). Even though the
names appear similar, the internal locks are very different from the functionality
that the composite reader-writer lock provides, which is providing a read lock
(multiple readers) and a write lock (single writer only).
Let’s first look at the code of a writer. There are two methods that it
can invoke: get write lock and release write lock. In this case, we need
a global lock that needs to stop both reads as well as writes from proceeding.
This is why in the function get write lock, we wait on the lock rwlock.
The read lock, on the other hand, is slightly more complicated. Refer to
the function get read lock in Listing 5.10. We use another mutex lock called
rdlock. A reader waits to acquire it. The idea is to maintain a count of the
number of readers. Since there are concurrent updates to the readers variable
involved, it needs to be protected by the rdlock mutex. After acquiring
rdlock, it is possible that the lock acquiring process may find that a writer is
active. We need to explicitly check for this by checking if the number of readers,
readers, is equal to 0 or not. If it is equal to 0, then it means that other readers
are not active – a writer could be active. Otherwise, it means that other readers
are active, and a writer cannot be active.
If readers = 0 we need to acquire rwlock to stop writers. The rest of the
method is reasonably straightforward. We increment the number of readers and
finally release rdlock such that other readers can proceed.
Releasing the read lock is also simple. We subtract 1 from the number of
readers after acquiring rdlock. Now, if the number of readers becomes equal
to 0, then there is no reason to hold the global rwlock. It needs to be released
such that writers can potentially get a chance to complete their operation.
A discerning reader at this point of time will clearly see that if readers are
active, then new readers can keep coming in and the waiting write operation
will never get a chance. This means that there is a possibility of starvation.
Because readers may never reach 0, rwlock will never be released by the
reader holding it. The locks themselves could be fair, but overall we cannot
guarantee fairness for writes. Hence, this version of the reader-writer lock’s
design needs improvement. Starvation-freedom is needed, especially for write
operations. Various solutions to this problem are proposed in the reference
[Herlihy and Shavit, 2012].
We can split the array into n chunks, where n is the number of threads and
assign the ith chunk to the ith thread (map phase). The thread can then add
all the elements in its respective chunk, and then send the computed partial
sum to a pre-designated parent thread. The parent thread needs to wait for all
the threads to finish so that it can collect all the partial sums and add them to
produce the final result (reduce phase). This is a rendezvous point insofar as all
the threads are concerned because all of them need to reach this point before
they can proceed to do other work. Such a point arises very commonly in a lot
of scientific kernels that involve linear algebra.
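A sketch of this map-reduce style sum using a POSIX barrier (pthread_barrier_t) as the rendezvous point is shown below. The array contents, array length and thread count are illustrative assumptions, and thread 0 plays the role of the pre-designated parent that performs the reduce step.

#include <pthread.h>
#include <stdio.h>

#define N 4                       /* number of worker threads (assumption) */
#define LEN 1000                  /* array length (assumption) */

int data[LEN];
long partial[N];                  /* one partial sum per thread */
pthread_barrier_t barrier;        /* the rendezvous point */

void *worker(void *arg)
{
    int id = *(int *) arg;
    int chunk = LEN / N;

    /* Map phase: each thread sums its own chunk */
    long sum = 0;
    for (int i = id * chunk; i < (id + 1) * chunk; i++)
        sum += data[i];
    partial[id] = sum;

    /* Rendezvous: wait until every thread has produced its partial sum */
    pthread_barrier_wait(&barrier);

    /* Reduce phase: a pre-designated thread (id 0) adds the partial sums */
    if (id == 0) {
        long total = 0;
        for (int i = 0; i < N; i++)
            total += partial[i];
        printf("total = %ld\n", total);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[N];
    int ids[N];
    for (int i = 0; i < LEN; i++) data[i] = 1;     /* sample data */
    pthread_barrier_init(&barrier, NULL, N);
    for (int i = 0; i < N; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}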
5.2 Queues
Let us now see how to use all the synchronization primitives introduced in
Section 5.1.
One of the most important data structures in a complex software system like
an OS is a queue. All practical queues have a bounded size. Hence, we shall not
differentiate between a queue and a queue with a maximum or bounded size.
Typically, to communicate messages between different subsystems, queues are
used as opposed to direct function calls or writing entries to an array. Queues
provide the FIFO property, which also enforces an implicit notion of priority.
There is thus a lot of benefit in implementing a concurrent queue using the
aforementioned synchronization primitives.
Such concurrent queues admit multiple enqueuers and dequeuers that exe-
cute their operations in parallel. There are several options here. We can opt
for a lock-free linearizable implementation, or use a version with locks. A lot of
modern lock implementations are fairly fast and scalable (performance does not
degrade when the number of threads increases). Let us look at different flavors
of queues in this section.
Figure 5.14: A bounded queue, with producers on one side and consumers on the other
void nap () {
struct timespec rem ;
int ms = rand () % 100;
struct timespec req = {0 , ms * 1000 * 1000};
nanosleep (& req , & rem ) ;
}
return 0; /* success */
}
int deq () {
int cur_head = atomic_load (& head ) ;
int cur_tail = atomic_load (& tail ) ;
int new_head = INC ( cur_head ) ;
The main function creates two threads. The odd-numbered thread enqueues
by calling enqfunc, and the even-numbered thread dequeues by calling deqfunc.
These functions invoke the enq and deq functions, respectively, NUM times. Be-
tween iterations, the threads take a nap for a random duration.
The exact proof of wait freedom can be found in textbooks on this topic
such as the book by Herlihy and Shavit [Herlihy and Shavit, 2012]. Given that
there are no loops, we don’t have a possibility of looping endlessly. Hence, the
enqueue and dequeue operations will complete in bounded time. The proof of
linearizability and correctness needs more understanding and thus is beyond the
scope of this book.
Note the use of atomics. They are a staple of modern versions of program-
ming languages such as C++20 and other recent languages. Along with atomic
load and store operations, the library provides many more functions such as
atomic fetch add, atomic flag test and set and atomic compare exchange strong.
Depending upon the architecture and the function arguments, their implemen-
tations come with different memory ordering guarantees (embed different kinds
of fences).
int deq () {
int val ;
do {
LOCK ( qlock ) ;
if ( tail == head ) val = -1;
else {
val = queue [ head ];
head = INC ( head ) ;
}
UNLOCK ( qlock ) ;
int main () {
sem_init (& qlock , 0 , 1) ;
...
sem_destroy (& qlock ) ;
}
Listing 5.14: A queue with semaphores that does not involve busy waiting
# define WAIT ( x ) ( sem_wait (& x ) )
# define POST ( x ) ( sem_post (& x ) )
sem_t qlock , empty , full ;
POST ( qlock ) ;
POST ( full ) ;
return 0; /* success */
}
int deq () {
WAIT ( full ) ;
WAIT ( qlock ) ;
POST ( qlock ) ;
POST ( empty ) ;
return val ;
}
int main () {
sem_init (& qlock , 0 , 1) ;
sem_init (& empty , 0 , BUFSIZE ) ;
sem_init (& full , 0 , 0) ;
...
sem_destroy (& qlock ) ;
sem_destroy (& empty ) ;
sem_destroy (& full ) ;
}
We use three semaphores here. We still use qlock, which is needed to protect
the shared variables. We use the semaphore empty that is initialized to BUFSIZE
(maximum size of the queue) and the full semaphore that is initialized to 0.
These will be used for waking up threads that are waiting. We define the WAIT
and POST macros that wrap sem wait and sem post, respectively.
Consider the enq function. We first wait on the empty semaphore. There
need to be free entries available. Initially, we have BUFSIZE free entries. Every
time a thread successfully waits on the semaphore, it decrements the number of
free entries by 1. Once the count reaches 0, subsequent threads block. The
enqueuing thread then enters
the critical section that is protected by the binary semaphore qlock. There is
no need to perform any check on whether the queue is full or not. We know
that it is not full because the thread successfully acquired the empty semaphore.
This means that at least one free entry is available in the array. After releasing
qlock, we signal the full semaphore. This indicates that an entry has been
added to the queue.
Let us now look at the deq function. It follows the reverse logic. We start
out by waiting on the full semaphore. There needs to be at least one entry
in the queue. Once this semaphore has been acquired, we are sure that there
is at least one entry in the queue and it will remain there until it is dequeued
(property of the semaphore). The critical section again need not have any
checks regarding whether the queue is empty or not. It is protected by the
qlock binary semaphore. Finally, we complete the function by signaling the
empty semaphore. The reason for this is that we are removing an entry from
the queue, or creating one additional free entry. Waiting enqueuers will get
signaled.
Note that there is no busy waiting. Threads either immediately acquire the
semaphore if the count is non-zero or are swapped out. They are put in a wait
queue inside the kernel. They thus do not monopolize CPU resources and more
useful work is done. We are also utilizing the natural strength of semaphores.
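Putting the fragments shown above together, a self-contained sketch of this design could look as follows. The circular-buffer details (queue, head, tail, INC and BUFSIZE) are assumptions consistent with the earlier listings, and the creation of the producer and consumer threads is elided.

#include <semaphore.h>

#define BUFSIZE 100
#define INC(x) (((x) + 1) % BUFSIZE)      /* circular increment (assumption) */
#define WAIT(x) (sem_wait(&x))
#define POST(x) (sem_post(&x))

int queue[BUFSIZE];
int head = 0, tail = 0;                   /* dequeue from head, enqueue at tail */
sem_t qlock, empty, full;

int enq(int val)
{
    WAIT(empty);             /* wait for a free entry */
    WAIT(qlock);             /* enter the critical section */
    queue[tail] = val;
    tail = INC(tail);
    POST(qlock);             /* leave the critical section */
    POST(full);              /* one more occupied entry */
    return 0;                /* success */
}

int deq(void)
{
    int val;
    WAIT(full);              /* wait for at least one occupied entry */
    WAIT(qlock);
    val = queue[head];
    head = INC(head);
    POST(qlock);
    POST(empty);             /* one more free entry */
    return val;
}

int main(void)
{
    sem_init(&qlock, 0, 1);              /* binary semaphore protecting the queue */
    sem_init(&empty, 0, BUFSIZE);        /* number of free entries */
    sem_init(&full, 0, 0);               /* number of occupied entries */
    /* ... create producer and consumer threads here ... */
    sem_destroy(&qlock);
    sem_destroy(&empty);
    sem_destroy(&full);
    return 0;
}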
int peak () {
/* This is a read function */
get_read_lock () ;
return val ;
}
int enq ( int val ) {
WAIT ( empty ) ;
POST ( full ) ;
return 0; /* success */
}
int deq () {
int val ;
WAIT ( full ) ;
POST ( empty ) ;
return val ;
}
The code of the enq and deq functions remains more or less the same. We
wait and signal the same set of semaphores: empty and full. The only difference
is that we do not acquire a generic lock, but we acquire the write lock using the
get write lock function.
It is just that we are using a different set of locks for the peak function and
the enq/deq functions. We allow multiple readers to work in parallel.
#define preempt_disable() \
do { \
    preempt_count_inc(); \
    barrier(); \
} while (0)

#define preempt_enable() \
do { \
    barrier(); \
    if (unlikely(preempt_count_dec_and_test())) \
        __preempt_schedule(); \
} while (0)
The core idea is a preemption count variable. If the count is non-zero, then
it means that preemption is not allowed. Whereas if the count is 0, it means
that preemption is allowed. If we want to disable preemption, all that we have
to do is increment the count and also insert a fence operation, which is also
known as a memory barrier. The reason for the barrier is to ensure that the code
in the critical section is not reordered and moved before the point at which
preemption is disabled. Note that this is not the same barrier that we discussed
in the section on barriers and phasers (Section 5.1.11); they just happen to share
the same name. Those are synchronization operations, whereas the memory
barrier is akin to a fence,
which basically disables memory reordering. The preemption count is stored in
a per-CPU region of memory (accessible via a segment register). Accessing it
is a very fast operation and requires very few instructions.
The code for enabling preemption is shown in Listing 5.16. In this case, we
do more or less the reverse. We have a fence operation to ensure that all the
pending memory operations (executed in the critical section) completely finish
and are visible to all the threads. After that, we decrement the count using an
atomic operation. If the count reaches zero, it means that now preemption is
allowed, so we call the schedule function. It finds a process to run on the core.
An astute reader will make out that this is like a semaphore, where if preemption
is disabled n times, it needs to be enabled n times for the task running on the
core to become preemptible.
The code for a spinlock is shown in Listing 5.17. We see that the spinlock
structure encapsulates an arch spinlock t lock and a dependency map (struct
lockdep map). The raw lock member is the actual spinlock. The dependency
map is used to check for deadlocks (we will discuss that later).
Let us understand the design of the spinlock. Its code is shown in List-
ing 5.18. It is a classic ticket lock that has two components: a ticket, which
acts like a coupon, and the id of the next ticket to be served (next). Every time
a thread tries to acquire the lock, it gets a new ticket. It is deemed to have
acquired the lock when ticket == next.
Consider a typical bank where we go to meet a teller. We first get a coupon,
which in this case is the ticket. Then we wait for our coupon number to be
displayed. Once that happens, we can go to the counter at which a teller is
waiting for us. The idea here is quite similar. If you think about it, you will
conclude that this lock guarantees fairness. Starvation is not possible. The
way that this lock is designed in practice is quite interesting. Instead of using
multiple fields, a single 32-bit unsigned integer is used to store both the ticket
and the next field. We divide the 32-bit unsigned integer into two smaller
unsigned integers that are 16 bits wide. The upper 16 bits store the ticket id.
The lower 16 bits store the value of the next field.
When a thread arrives, it tries to get a ticket. This is achieved by adding
2^16 (1 << 16) to the lock variable. This basically increments the ticket stored in
the upper 16 bits by 1. The atomic fetch and add instruction is used to achieve
this. This instruction has a built-in memory barrier as well (more about this
later). Now, the original ticket can be extracted quite easily by right shifting
the value returned by the fetch and add instruction by 16 positions.
The next task is to extract the lower 16 bits (next field). This is the number
of the ticket that is the holder of the lock, which basically means that if the
current ticket is equal to the lower 16 bits, then we can go ahead and execute
the critical section. This is easy to do using a simple typecast operation. Here,
the type u16 refers to a 16-bit unsigned integer. Simply typecasting val to the
u16 type retrieves the lower 16 bits as an unsigned integer. This is all that
we need to do. Then, we need to compare this value with the thread’s ticket,
which is also a 16-bit unsigned integer. If both are equal, then the spinlock has
effectively been acquired and the method can return.
Now, assume that they are not equal. Then there is a need to wait or rather
there is a need to busy wait. This is where we call the macro atomic cond read acquire,
which requires two arguments: the lock value and the condition that needs
to be true. This condition checks whether the obtained ticket is equal to
the next field in the lock variable. This macro ends up calling the macro
smp cond load relaxed, whose code is shown next.
The kernel code for the macro is shown in Listing 5.19. In this case, the
inputs are a pointer to the lock variable and an expression that needs to evaluate
to true. Then we have an infinite loop where we dereference the pointer and
fetch the current value of the lock. Next, we evaluate the conditional expression
(ticket == (u16)VAL). If the conditional expression evaluates to true, then it
means that the lock has been implicitly acquired. We can then break from the
infinite loop and resume the rest of the execution. Note that we cannot return
from a macro because a macro is just a piece of code that is copy-pasted by the
preprocessor with appropriate argument substitutions.
In case the conditional expression evaluates to false, then of course, there
is a need to keep iterating. But along with that, we would not like to contend
for the lock all the time. This would lead to a lot of cache line bouncing across
cores, which is detrimental to performance. We are unnecessarily increasing the
memory and on-chip network traffic. It is a better idea to wait for some time
and try again. This is where the function cpu relax is used. It makes the
thread back off for some time.
Given that fairness is guaranteed, we will ultimately exit the infinite loop,
and we will come back to the main body of the arch spinlock function. In this
case, there is a need to introduce a memory barrier. Note that this is a generic
pattern: whenever we acquire a lock, there is a need to insert a
memory barrier after it. This ensures that prior to entering the critical section all
the reads and writes are fully completed and are visible to all the threads in the
SMP system. Moreover, no instruction in the critical section can complete before
the memory barrier has completed its operation. This ensures that changes
made in the critical section get reflected only after the lock has been acquired.
Let us now come to the unlock function. This is shown in Listing 5.20. It
is quite straightforward. The first task is to find the address of the next field.
This needs to be incremented to let the new owner of the lock know that it
can now proceed. There is a little bit of a complication here. We need to see
if the machine is big endian or little endian. If it is a big endian machine,
which basically means that the lower 16 bits are actually stored in the higher
addresses, then a small correction to the address needs to be made. This logic
is embedded in the IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) macro. In any case, at the
end of this statement, the address of the next field is stored in the ptr variable. Next, we
get the value of the ticket from the lock variable, increment it by 1, and store it
in the address pointed to by ptr, which is nothing but the address of the next
field. Now if there is a thread whose ticket number is equal to the contents
of the next field, then it knows that it is the new owner of the lock. It can
proceed with completing the process of lock acquisition and start executing the
critical section. At the very end of the unlock function, we need to execute a
memory barrier known as an smp store release, which basically ensures that all
the writes made in the critical section are visible to the rest of the threads after
the lock has been released. This completes the unlock process.
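The following user-level sketch mimics this design with C11 atomics. For clarity, it keeps the ticket and next fields as two separate 16-bit atomic variables instead of packing them into a single 32-bit word as the kernel does; the fetch-and-add on the ticket field corresponds to the kernel's addition of 1 << 16, and the release store on next corresponds to smp_store_release. Both fields are assumed to be initialized to 0.

#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_ushort ticket;   /* the upper half in the kernel's packed layout */
    atomic_ushort next;     /* the lower half: the ticket currently being served */
} ticket_lock_t;

void ticket_lock(ticket_lock_t *l)
{
    /* Grab a ticket: the old value is returned atomically */
    uint16_t my = atomic_fetch_add_explicit(&l->ticket, 1,
                                            memory_order_relaxed);

    /* Spin until the next field equals our ticket */
    while (atomic_load_explicit(&l->next, memory_order_acquire) != my)
        ;  /* the kernel would call cpu_relax() here to back off */
}

void ticket_unlock(ticket_lock_t *l)
{
    /* Only the lock holder updates next, so a plain load followed by a
       release store is sufficient; the release ordering makes the critical
       section's writes visible before the lock is handed over */
    uint16_t nxt = atomic_load_explicit(&l->next, memory_order_relaxed) + 1;
    atomic_store_explicit(&l->next, nxt, memory_order_release);
}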
fall back to the regular slow path code (arch spin lock).
Bear in mind that this is a generic mechanism and it can be used for many other
kinds of concurrent objects as well. The fast path captures the scenario in
which there is less contention and the slow path captures scenarios where the
contention is moderate to high.
#ifdef CONFIG_DEBUG_LOCK_ALLOC
    struct lockdep_map dep_map;
#endif
};
The code of the kernel mutex is shown in Listing 5.22. Along with a spinlock
(wait lock), it contains a pointer to the owner of the mutex and a waiting list
of threads. Additionally, to prevent deadlocks it also has a pointer to a lock
dependency map. However, this field is optional – it depends on the compilation
parameters. Let us elaborate.
The owner field is a pointer to the task struct of the owner. An as-
tute reader may wonder why it is an atomic long t and not a task struct
*. Herein, lies a small and neat trick. We wish to provide a fast-path mech-
anism to acquire the lock. We would like the owner field to contain the value
of the task struct pointer of the lock-holding thread, if the lock is currently
acquired and held by a thread. Otherwise, its value should be 0. This neat
trick will allow us to do a compare and exchange on the owner field in the hope
of acquiring the lock quickly. We try the fast path only once. To acquire the
lock, we compare the value stored in owner with 0. If they are equal,
then we store a pointer to the currently running thread’s task struct in
its place.
Otherwise, we enter the slow path. In this case, the threads waiting to
acquire the lock are stored in wait list, which is protected by the spinlock
wait lock. This means that before enqueueing the current thread in wait list,
we need to acquire the spinlock wait lock first.
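The fast path can be sketched as follows with a C11 compare-and-exchange on an owner word. This is an illustration of the idea behind the fast path, not the kernel's actual code; the structure and function names are illustrative.

#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

/* The owner word is 0 when the mutex is free, and holds a pointer to the
   owning task (here, just an opaque pointer) when it is taken. */
struct my_mutex { atomic_uintptr_t owner; };

bool mutex_trylock_fast(struct my_mutex *m, void *current_task)
{
    uintptr_t expected = 0;   /* free */
    /* One compare-and-exchange: 0 -> pointer to the current task.
       If it fails, the caller falls back to the slow path. */
    return atomic_compare_exchange_strong_explicit(
        &m->owner, &expected, (uintptr_t) current_task,
        memory_order_acquire, memory_order_relaxed);
}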
Listing 5.23: The mutex lock operation
source : kernel/locking/mutex.c
void mutex_lock ( struct mutex * lock )
{
might_sleep () ; /* prints a stacktrace if called in an
atomic context ( sleeping not allowed ) */
    if (!__mutex_trylock_fast(lock))   /* cmpxchg on owner */
        __mutex_lock_slowpath(lock);
}
Listing 5.23 shows the code of the lock function (mutex lock) in some more
detail. Its only argument is a pointer to the mutex. First, there is a need to
check if this call is being made in the right context or not. For example, the
kernel defines an atomic context in which the code cannot be preempted. In this
context, sleeping is not allowed. Hence, if the mutex lock call has been made
in this context, it is important to flag this event as an error and also print the
stack trace (the function call path leading to the current function).
Assume that the check passes and we are not in the atomic context, then
we first make an attempt to acquire the mutex via the fast path. If we are not
successful, then we try to acquire the mutex via the slow path using the function
mutex lock slowpath.
In the slow path, we first try to acquire the spinlock, and if that is not
possible then the process goes to sleep. In general, the task is locked in the
UNINTERRUPTIBLE state. This is because we don’t want to wake it up to
process signals. When the lock is released, it wakes up all such sleeping processes
such they can contend for the lock. The process that is successful in acquiring
the spinlock wait lock adds itself to wait list and goes to sleep. This is done
by setting its state (in general) to UNINTERRUPTIBLE.
Note that this is a kernel thread. Going to sleep does not mean going
to sleep immediately. It just means setting the status of the task to either
INTERRUPTIBLE or UNINTERRUPTIBLE. The task still runs. It needs to
subsequently invoke the scheduler such that it can find the most eligible task to
run on the core. Given that the status of the current task is set to a sleep state, the
scheduler will not choose it for execution.
The unlock process pretty much does the reverse. We first check if there are
waiting tasks in the wait list. If there are no waiting tasks, then the owner
field can directly be set to 0, and we can return. However, if there are waiting
tasks, then there is a need to do much more processing. We first have to acquire
the spinlock associated with the wait list (list of waiting processes). Then, we
remove the first entry and extract the task, next, from it. The task next needs
to be woken up in the near future such that it can access the critical section.
However, we are not done yet. We need to set the owner field to next such that
incoming threads know that the lock is acquired by some thread and is not free.
Finally, we release the spinlock and hand over the id of the woken up task next
to the scheduler.
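The unlock path just described can be sketched as follows. This is a simplified illustration, not the actual kernel code (which additionally handles handoff flags and wake queues); assume the toy mutex from the earlier sketch is extended with a wait_lock spinlock and a singly linked wait_list of toy_waiter records, and that spin_lock, spin_unlock and wake_up are hypothetical stand-ins.

/* A waiter record, enqueued on the mutex's wait list (illustrative). */
struct toy_waiter {
    uintptr_t          task;   /* id of the sleeping thread */
    struct toy_waiter *next;
};

/* Sketch of the unlock path described above (hypothetical helpers). */
void toy_mutex_unlock(struct toy_mutex *lock)
{
    struct toy_waiter *w;

    spin_lock(&lock->wait_lock);           /* protects the wait list */
    w = lock->wait_list;
    if (w == NULL) {
        atomic_store(&lock->owner, 0);     /* no waiters: mark the mutex free */
        spin_unlock(&lock->wait_lock);
        return;
    }
    lock->wait_list = w->next;             /* remove the first waiter */
    atomic_store(&lock->owner, w->task);   /* lock stays "taken" for newcomers */
    spin_unlock(&lock->wait_lock);

    wake_up(w->task);                      /* hand the woken task to the scheduler */
}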
node (containing our task), and modify the tail pointer to point to the new
node. Both of these operations need to execute atomically – it needs to appear
that both of them executed at a single point of time, instantaneously. The MCS
lock is a classic lock and almost all texts on concurrent systems discuss
its design in great detail. Hence, we shall not delve further into it (see [Herlihy
and Shavit, 2012]). It suffices to say that it uses careful lock-free programming:
threads do not busy wait on a single shared location; instead, a thread only
busy waits on a Boolean variable declared within its node structure. When its
predecessor in the list releases the lock, it sets this variable to false, and the
current thread can then acquire the lock. This eliminates cache line bouncing
to a very large extent.
There are a few more variants like the osq lock (variant of the MCS lock)
and the qrwlock (a reader-writer lock that gives priority to readers).
The kernel code has its version of semaphores (see Listing 5.24). It has a spin
lock (lock), which protects the semaphore variable count. Akin to user-level
semaphores, the kernel semaphore supports two methods that correspond to
wait and post, respectively. They are known as down (wait) and up (post/signal).
The kernel semaphore functions in exactly the same manner as the user-level
semaphore. After acquiring the lock, the count variable is either incremented
or decremented. However, if the count variable is already zero, then it is not
possible to decrement it and the current task needs to wait. This is the point
at which it is added to the list of waiting processes (wait list) and the task
state is set to UNINTERRUPTIBLE. Similar to the case of unlocking a spin lock,
here also, if the count becomes non-zero from zero, we pick a process from the
wait list and set its task state to RUNNING. Given that all of this is happening
inside the kernel, setting the task state is very easy. All of this is much harder at the
user level for obvious reasons: we need a system call for everything. However,
in the kernel, we do not have those restrictions and thus these mechanisms are
much faster.
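The structure of the two operations is sketched below. This is a lightly abridged rendering of down() and up() from kernel/locking/semaphore.c; the helpers __down() and __up() (not shown) respectively enqueue the caller on wait_list and put it to sleep, and wake up the first waiter.

/* Lightly abridged sketch of the kernel semaphore operations. */
void down(struct semaphore *sem)
{
    unsigned long flags;

    might_sleep();
    raw_spin_lock_irqsave(&sem->lock, flags);
    if (likely(sem->count > 0))
        sem->count--;          /* a copy of the resource is available: take it */
    else
        __down(sem);           /* enqueue on wait_list and sleep */
    raw_spin_unlock_irqrestore(&sem->lock, flags);
}

void up(struct semaphore *sem)
{
    unsigned long flags;

    raw_spin_lock_irqsave(&sem->lock, flags);
    if (likely(list_empty(&sem->wait_list)))
        sem->count++;          /* nobody is waiting: just bump the count */
    else
        __up(sem);             /* wake up the first waiter */
    raw_spin_unlock_irqrestore(&sem->lock, flags);
}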
limit the number of locks that a thread is allowed to concurrently acquire such
that the overall complexity of the kernel is limited. Next, there is a need to
validate the possible lock acquisition. All kinds of lock acquisitions need to be
validated: spinlocks, mutexes and reader-writer locks. The main aim is to avoid
potential deadlock-causing situations.
We define four kinds of states: softirq-safe, softirq-unsafe, hardirq-safe
and hardirq-unsafe. A softirq-safe state means that the lock was acquired
while executing a softirq. At that point, interrupts (irqs) would have been
disabled. However, it is also possible to acquire a lock with interrupts turned
on. In this case, the state will be softirq-unsafe. Here, the thread can be
preempted by a softirq handler.
In any unsafe state, it is possible that the thread gets preempted and the
interrupt handler runs. This interrupt handler may try to acquire the lock. Note
that any softirq-unsafe state is hardirq-unsafe as well. This is because hard
irq interrupt handlers have a higher priority as compared to softirq handlers.
We define hardirq-safe and hardirq-unsafe analogously. These states will be
used to flag potential deadlock-causing situations.
We next validate the chain of lock acquire calls that have been made. Check
for trivial deadlocks first (fairly common in practice): A → B → A. Such
trivial deadlocks are also known as lock inversions. Let us now use the states.
No path can contain a hardirq-unsafe lock followed by a hardirq-safe lock. Otherwise,
an interrupt handler that acquires the latter lock could preempt the critical section associated with
the former lock. This may lead to a lock inversion deadlock.
Let us now look at the general case in which we search for cyclic (circular)
waits. We need to create a global graph where each lock instance is a node, and
if the process holding lock A waits to acquire lock B, then there is an arrow
from A to B. If we have V nodes and E edges, then the time complexity is
O(V + E). This is quite slow. Note that we need to check for cycles before
acquiring every lock.
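For reference, a standard O(V + E) cycle check on such a lock graph can be written as a three-color depth-first search. The sketch below is a generic, stand-alone illustration in plain C (adjacency lists); it is not the kernel's lockdep implementation.

#include <stdbool.h>

#define MAXN 128

/* Adjacency list of the lock graph: edge A -> B means "while holding A,
 * a thread waited for B". */
static int nedges[MAXN];
static int adj[MAXN][MAXN];
static int color[MAXN];      /* 0 = unvisited, 1 = on the DFS stack, 2 = done */

static bool dfs_has_cycle(int u)
{
    color[u] = 1;                       /* u is on the current DFS path */
    for (int i = 0; i < nedges[u]; i++) {
        int v = adj[u][i];
        if (color[v] == 1)              /* back edge: cycle found */
            return true;
        if (color[v] == 0 && dfs_has_cycle(v))
            return true;
    }
    color[u] = 2;                       /* fully explored */
    return false;
}

/* Returns true if the lock graph with 'n' nodes contains a cycle. */
bool lock_graph_has_cycle(int n)
{
    for (int u = 0; u < n; u++)
        color[u] = 0;
    for (int u = 0; u < n; u++)
        if (color[u] == 0 && dfs_has_cycle(u))
            return true;
    return false;
}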
Let us use a simple caching-based technique. Consider a chain of lock acqui-
sitions, where the lock acquire calls can possibly be made by different threads.
Given that the same kind of code sequences tend to repeat in the kernel code,
we can cache a full sequence of lock acquisition calls. If the entire sequence is
devoid of cycles, then we can deem the corresponding execution to be deadlock
free. Hence, the brilliant idea here is as follows.
Figure 5.15: A hash table that stores an entry for every chain of lock acquisitions
Instead of checking for a deadlock on every lock acquire, we check for dead-
locks far more infrequently. We consider a long sequence (chain) of locks and
hash all of them. A hash table stores the “deadlock status” associated with
such chains (see Figure 5.15). It is indexed with the hash of the chain. If the
chain has been associated with a cyclic wait (deadlock) in the past, then the
current acquisition is immediately flagged; if the chain has already been validated
as cycle free, the expensive check is skipped altogether. Only a chain that has never
been seen before triggers a full cycle check, whose verdict is then cached.
5.4 Scheduling
Scheduling is one of the most important activities performed by an OS. It is
a major determinant of the overall system’s responsiveness and performance.
Figure 5.16: Example of a set of jobs that are waiting to be scheduled (J1, J2, J3 and J4 with execution times 3, 2, 4 and 1, respectively)
Another important metric is the mean completion time: the average, over all jobs, of the gap between a job's arrival and the time at which the job fully completes. This determines the responsive-
ness of the system. It is possible for a system to minimize the makespan yet
unnecessarily delay a lot of jobs, which shall lead to an adverse mean completion
time value.
Figure 5.17: Mean completion time $\mu = \frac{1}{n}\sum_i t_i$, where $t_i$ is the completion time of job $J_i$
We can thus observe that the problem of scheduling is a very fertile ground
for proposing and solving optimization problems. We can have a lot of con-
straints, settings and objective functions.
To summarize, we have said that in any scheduling problem, we have a list
of jobs. Each job has an arrival time, which may either be equal to 0 or some
other time instant. Next, we typically assume that we know how long a job shall
take to execute. Then in terms of constraints, we can either have preemptible
jobs or we can have non-preemptible jobs. The latter means that the entire
job needs to execute in one go without any other intervening jobs. Given these
constraints, there are a couple of objective functions that we can minimize. One
would be to minimize the makespan, which is basically the time from the start
of scheduling till the time it takes for the last job to finish execution. Another
objective function is the average completion time, where the completion time is
again defined as the time at which a job completes minus the time at which it
arrived (measure of the responsiveness).
For scheduling such a set of jobs, we have a lot of choice. We can use many
simple algorithms, which in some cases, can also be proven to be optimal. Let us
start with the random algorithm. It randomly picks a job and schedules it on a
free core. There is a lot of work that analyzes the performance of such algorithms
and many times such random choice-based algorithms perform quite well. In
the space of deterministic algorithms, the shortest job first (SJF) algorithm is
preferred. It schedules all the jobs in ascending order of their execution times.
It is a non-preemptible algorithm. We can prove that it minimizes the average
completion time.
KSW Model
Let us now introduce a more formal way of thinking and introduce the Karger-
Stein-Wein (KSW) model [Karger et al., 1999]. It provides an abstract or generic
framework for all scheduling problems. It essentially divides the space of prob-
lems into large classes and finds commonalities in between problems that belong
to the same class. It requires three parameters: α, β and γ.
The first parameter α determines the machine environment. It specifies the
number of cores, the number of jobs, and the execution time of each job. The second parameter
β specifies the constraints. For example, it specifies whether preemption is al-
lowed or not, whether the arrival times are all the same or are different, whether
the jobs have dependencies between them or whether there are job deadlines.
A dependency between a pair of jobs can exist in the sense that we can specify
that job J1 needs to complete before J2 . Note that in real-time systems, jobs
come with deadlines, which basically means that jobs have to finish before a
certain time. A deadline is thus one more type of constraint.
Finally, the last parameter is γ, which is the optimality criterion. We have
already discussed the average mean completion time and makespan criteria.
We can also define a weighted completion time – a weighted mean of completion
times. Here a weight in a certain sense represents a job’s priority. Note that
the mean completion time metric is a special case of the weighted completion
time metric – all the weights are equal to 1. Let the completion time of job i
be Ci . The cumulative completion time is equivalent to the mean completion
time in this case because the number of jobs behaves like a constant. We can
represent this criterion as ΣCi . The makespan is represented as Cmax (maximum
completion time of all jobs).
We can consequently have a lot of scheduling algorithms for every scheduling
problem, which can be represented using the 3-tuple α | β | γ as per the KSW
formulation.
We will describe two kinds of algorithms in this book. We will discuss the
most popular algorithms that are quite simple, and are also provably optimal
in some scenarios. We will also introduce a bunch of settings where finding the
optimal schedule is an NP-complete problem [Cormen et al., 2009].
Let us define the problem 1 || ΣCj in the KSW model. We are assuming
that there is a single core. The objective function is to minimize the sum of
completion times (Cj ). Note that minimizing the sum of completion times is
equivalent to minimizing the mean completion time because the number of tasks
is known a priori and is a constant.
The claim is that the SJF (shortest job first) algorithm is optimal in this
case (example shown in Figure 5.18). Let us outline a standard approach for
proving that a scheduling algorithm is optimal with respect to the criterion that
Figure 5.18: Shortest job first scheduling – the jobs run in the order J4, J2, J1, J3 (execution times 1, 2, 3 and 4)
is defined in the KSW problem. Here we are minimizing the mean completion
time.
Let the SJF algorithm be algorithm A. Assume that a different algorithm A′ , which
does not follow the SJF order, is optimal. Then there must be a pair of jobs j and k in A′ 's
schedule such that j immediately precedes k and the processing time (execution time) of j
exceeds that of k, i.e., pj > pk . Note
that such a pair of jobs will not be found in A's schedule. Assume job j starts
at time t. Let us exchange jobs j and k with the rest of the schedule remaining
the same. Let this new schedule be produced by another algorithm A′′ .
Let us evaluate the contribution to the cumulative completion time by jobs j
and k in algorithm A′ . It is (t+pj )+(t+pj +pk ). Let us evaluate the contribution
of these two jobs in the schedule produced by A′′ . It is (t + pk ) + (t + pj + pk ).
Given that pj > pk , the schedule produced by algorithm A′ has a strictly higher cumulative
completion time than the one produced by A′′ . This contradicts our assumption that A′ is
optimal, because A′′ is strictly better. Hence A′ , or any algorithm that violates the SJF order,
cannot be optimal. Thus, algorithm A (SJF) is optimal.
Weighted Jobs
Let us now define the problem where weights are associated with jobs. It will
be 1 || Σwj Cj in the KSW formulation. If ∀j, wj = 1, we have the classical
unweighted formulation for which SJF is optimal.
For the weighted version, let us schedule jobs in a descending order of
(wj /pj ). Clearly, if all wj = 1, this algorithm is the same as SJF. We can
use the same exchange-based argument to prove that using (wj /pj ) as the job
priority yields an optimal schedule.
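As a small illustration, the stand-alone C snippet below orders a job set by the ratio wj /pj (descending); with all weights equal to 1 it degenerates to plain SJF. The job values used in main() are made up for the example.

#include <stdio.h>
#include <stdlib.h>

struct job { int id; double p; double w; };   /* processing time, weight */

/* Sort in descending order of w/p (higher ratio runs first). */
static int cmp_ratio(const void *a, const void *b)
{
    const struct job *x = a, *y = b;
    double rx = x->w / x->p, ry = y->w / y->p;
    return (rx < ry) - (rx > ry);
}

int main(void)
{
    struct job jobs[] = { {1, 3, 1}, {2, 2, 1}, {3, 4, 2}, {4, 1, 1} };
    int n = sizeof(jobs) / sizeof(jobs[0]);
    double t = 0, weighted_sum = 0;

    qsort(jobs, n, sizeof(jobs[0]), cmp_ratio);
    for (int i = 0; i < n; i++) {
        t += jobs[i].p;                       /* completion time C_j */
        weighted_sum += jobs[i].w * t;        /* accumulate w_j * C_j */
        printf("run J%d, completes at %.1f\n", jobs[i].id, t);
    }
    printf("Sum of w_j * C_j = %.1f\n", weighted_sum);
    return 0;
}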
EDF Algorithm
Let us next look at the EDF (Earliest Deadline First) algorithm. It is one of
the most popular algorithms in real-time systems. Here, each job is associated
with a distinct non-zero arrival time and deadline. Let us define the lateness as
⟨completion time⟩ − ⟨deadline⟩. Let us define the problem as follows: 1 | ri , pmtn | Lmax ,
i.e., minimize the maximum lateness (Lmax ). This means that we would like to ensure that jobs complete as soon as
possible, with respect to their deadline. Note that in this case, we care about
the maximum value of the lateness, not the mean value. This means that we
don’t want any single job to be delayed significantly.
The algorithm schedules the job whose deadline is the earliest. Assume that
a job is executing and a new job arrives that has an earlier deadline. Then the
currently running job is swapped out, and the new job that now has the earliest
deadline executes.
If the set of jobs is schedulable, which means that it is possible to find a
schedule such that no job misses its deadline, then the EDF algorithm will pro-
duce such a schedule. If they are not schedulable, then the EDF algorithm
will broadly minimize the time by which jobs miss their deadline (minimize
Lmax ).
The proof is on similar lines and uses exchange-based arguments (refer to
[Mall, 2009]).
SRTF Algorithm
Let us continue our journey and consider another problem: 1 | ri , pmtn | ΣCi .
Consider a single core machine where the jobs arrive at different times and
preemption is allowed. We aim to minimize the mean/cumulative completion
time.
In this case, the optimal algorithm is shortest remaining time first
(SRTF). For each job, we keep a tab on the time that is left for it to finish
execution. We sort this list in an ascending order and choose the job that has
the shortest amount of time left. If a new job arrives, we compute its remaining
time and if that number happens to be the lowest, then we preempt the currently
running job and execute the newly arrived job.
We can prove that this algorithm minimizes the mean (cumulative) comple-
tion time using a similar exchange-based argument.
• 1 | ri | ΣCi : In this case, preemption is not allowed and jobs can arrive at
any point of time. There is much less flexibility in this problem setting.
This problem is provably NP-complete.
We thus observe that making a small change to the problem renders it NP-
complete. This is how sensitive these scheduling problems are.
Practical Considerations
All the scheduling problems that we have seen assume that the job execution
(processing) time is known. This may be the case in really well-characterized
and constrained environments. However, in most practical settings, this is not
known.
Figure 5.19 shows a typical scenario. Any task typically cycles between two
bursts of activity: a CPU-bound burst and an I/O burst. The task typically
does a fair amount of CPU-based computation, and then makes a system call.
This initiates a burst where the task waits for some I/O operation to complete.
We enter an I/O bound phase in which the task typically does not actively
execute. We can, in principle, treat each CPU-bound burst as a separate job.
Each task thus yields a sequence of jobs that have their distinct arrival times.
The problem reduces to predicting the length of the next CPU burst.
We can use classical time-series methods to predict the length of the CPU
burst. We predict the length of the nth burst tn as a function of tn−1 , tn−2 . . . tn−k .
For example, tn could be described by the following equation:
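One common choice, shown here purely as a representative example, is exponential averaging, which folds the entire history into a single recursively maintained estimate (the smoothing parameter α is a tunable constant):

\[
\hat{t}_n \;=\; \alpha\, t_{n-1} \;+\; (1-\alpha)\, \hat{t}_{n-1}, \qquad 0 \le \alpha \le 1
\]

Here $t_{n-1}$ is the measured length of the previous burst and $\hat{t}_{n-1}$ is its prediction. Expanding the recurrence expresses $\hat{t}_n$ as a weighted sum of $t_{n-1}, t_{n-2}, \ldots$ with geometrically decaying weights.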
Such approaches rooted in time series analysis often tend to work and yield
good results because the lengths of the CPU bursts have a degree of temporal
correlation. The recent past is a good predictor of the immediate future. Using
these predictions, the algorithms listed in the previous sections like EDF, SJF
and SRTF can be used. At least some degree of near-optimality can be achieved.
Let us consider the case when we have a poor prediction accuracy. We need
to then rely on simple, classical and intuitive methods.
Conventional Algorithms
We can always make a random choice; however, that is clearly not desirable
here. A much fairer option is a simple FIFO algorithm. To im-
plement it, we just need a queue of jobs. It guarantees the highest priority to
the job that arrived the earliest. A problem with this approach is the “convoy
effect”: a long-running job can delay a lot of smaller jobs. They get unnec-
essarily delayed. Had we scheduled them first, the average completion
time would have been much lower.
[Figure: queue-based scheduling – queues at Levels 1, 2 and 3 trade off priority against fairness, and a dispatcher picks tasks from them]
Let us now come to the issue of multicore scheduling. The big picture is
shown in Figure 5.22. We have a global queue of tasks that typically contains
newly created tasks or tasks that need to be migrated. A dispatcher module
sends the tasks to different per-CPU task queues. Theoretically, it is possible
to have different scheduling algorithms for different CPUs. However, this is not
a common pattern. Let us again look at the space of problems in the multicore
domain.
Many of these problems are similar to the knapsack, partition or bin packing problems (see [Cormen
et al., 2009]), which are quintessential NP-complete problems. A problem in
NP (nondeterministic polynomial time) can be verified in polynomial time if a
solution is presented. These are in general decision problems that have yes/no
answers. Now, the set of NP-complete problems are the hardest problems in
NP. This means that if we can solve them in polynomial time, then we have
polynomial time solutions for all the problems in NP.
Let us consider a simple version of the multicore scheduling problem: P |
pmtn | Cmax . Here, we have P processors and preemption is enabled. The
solution is to simply and evenly divide the work between the cores and schedule
the jobs. Given that every job is splittable arbitrarily, scheduling in this manner
becomes quite simple.
However, the problem P || Cmax is NP-complete. Jobs cannot be split and
this is the source of all our difficulties.
To prove the NP-completeness of such problems, we need to map every
instance of known NP-complete problems to an instance of such scheduling
problems. The NP-complete problems that are typically chosen are the bin
packing and partition problems. Mapping a problem’s instance I to another
problem’s instance J means that if we can compute a solution for J, we can
adapt it to become a solution for I.
Bin Packing Problem: We have a finite number of bins, where each bin
has a fixed capacity S. There are n items. The size of the ith item is si .
We need to pack the items in bins without exceeding any bin’s capacity.
The objective is to minimize the number of bins used and find the corresponding
mapping of items to bins.
List Scheduling
Let us consider one of the most popular non-preemptive scheduling algorithms
in this space known as list scheduling. We maintain a list of ready jobs. They
are sorted in descending order according to some priority scheme. The priority
here could be the user’s job priority or could be some combination of the arrival
time, deadline, and the time that the job has waited for execution. When a
CPU becomes free, it fetches the highest priority task from the list. If it
is not possible to execute that job, then the CPU walks down the list and finds
a job to execute. The only condition here is that we cannot return without a
job if the list is non-empty. Moreover, all the machines are considered to be
identical in terms of computational power.
Let us take a deeper look at the different kinds of priorities that we can use.
We can order the jobs in descending order of arrival time or job processing time.
We can also consider dependencies between jobs. In this case, it is important to
find the longest path in the graph (jobs are nodes and dependency relationships
are edges). The longest path is known as the critical path. The critical path often
determines the overall makespan of the schedule assuming we have adequate
compute resources. This is why in almost all scheduling problems, a lot of
emphasis is placed on the critical path. We always prefer scheduling jobs on
the critical path as opposed to jobs off the critical path. We can also consider
attributes associated with nodes in this graph. For example, we can set the
priority to be the out-degree (number of outgoing edges). If a job has a high
out-degree, then it means that a lot of other jobs are dependent on it. Hence,
if this job is scheduled, many other jobs will benefit – they will have one
less pending dependency.
Proof: Let there be n jobs and m CPUs. Let the execution times of the jobs
be p1 . . . pn . Let job k (execution time pk ) complete last. Assume it started
at time t. Then Cmax = t + pk .
Given that there is no idleness in list scheduling, we can conclude that till t
all the CPUs were 100% busy. This means that if we add all the work done by
all the CPUs till point t, it will be mt. This comprises the execution times of
a subset of jobs that does not include job k (the one that completes last). We
thus arrive at the following inequality.
\begin{align}
\sum_{i \neq k} p_i &\geq mt \nonumber\\
\Rightarrow \; \sum_{i} p_i - p_k &\geq mt \nonumber\\
\Rightarrow \; t &\leq \frac{1}{m}\sum_{i} p_i - \frac{p_k}{m} \tag{5.2}\\
\Rightarrow \; t + p_k = C_{max} &\leq \frac{\sum_{i} p_i}{m} - \frac{p_k}{m} + p_k \nonumber\\
\Rightarrow \; C_{max} &\leq \frac{\sum_{i} p_i}{m} + p_k\left(1 - \frac{1}{m}\right)\nonumber
\end{align}

Now, $C^* \geq p_k$ and $C^* \geq \frac{1}{m}\sum_i p_i$: the optimal makespan can neither be smaller than the length of any single job nor smaller than the average load per CPU. These follow from the fact that jobs cannot be split across CPUs (no preemption) and we wait for all the jobs to complete. We thus have,
\begin{align}
C_{max} &\leq \frac{\sum_i p_i}{m} + p_k\left(1 - \frac{1}{m}\right) \nonumber\\
&\leq C^* + C^*\left(1 - \frac{1}{m}\right) \tag{5.3}\\
\Rightarrow \; \frac{C_{max}}{C^*} &\leq 2 - \frac{1}{m} \nonumber
\end{align}
[Figure: three processes P1, P2 and P3 and three resources A, B and C]
Let us look at the data structures used in the Banker’s algorithm (see Ta-
ble 5.1). There are n processes and m types of resources. The array avlbl
stores the number of available (free) copies of each resource: avlbl[i] is the count for resource i.
In Algorithm 1, we first initialize the cur cnt array and set it equal to avlbl
(count of free resources). At the beginning, the request of no process is assumed
to be satisfied (allotted). Hence, we set the value of all the entries in the array
done to false.
Next, we need to find a process with id i such that it is not done yet (done[i]
== false) and its requirements stored in the need[i] array are elementwise less
than cur cnt. Let us define some terminology here before proceeding forward.
need[][] is a 2-D array. need[i] is a 1-D array that captures the resource
requirements for process i – it is the ith row in need[n][m] (row-column format).
For two 1-D arrays A and B of the same size, the expression A ≺ B means that
∀i, A[i] ≤ B[i] and ∃j, A[j] < B[j]. This means that each element of A is less
than or equal to the corresponding element of B. Furthermore, there is at least
one entry in A that is strictly less than the corresponding entry in B. If both
the arrays are elementwise identical, we write A = B. Now, if either of the cases
is true – A ≺ B or A = B – we write A ⪯ B.
Let us now come back to need[i] ⪯ cur cnt. It means that the remaining
(worst-case) requirement of process i is at most the currently available count of resources
(for all entries) – the request of process i can be satisfied.
If no such process is found, we jump to the last step. It is the safety check
step. However, if we are able to find such a process with id i, then we assume
that it will be able to execute. It will subsequently return all the resources that
it currently holds (acq[i]) back to the free pool of resources (cur cnt). Given
that we were able to satisfy the request for process i, we set done[i] equal to
true. We continue repeating this process till we can satisfy as many requests
of processes as we can.
Let us now come to the last step, where we perform the safety check. If
the requests of all the processes are satisfied, all the entries in the done array
will be equal to true. It means that we are in a safe state – all the requests
of processes can be satisfied. In other words, all the requests that are currently
pending can be safely accommodated. Otherwise we are in an unsafe state. It
basically means that we have more requirements as compared to the number of
free resources. This situation indicates a potential deadlock.
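The safety-check step can be rendered as the following stand-alone C function. It is a sketch of the textbook algorithm using the array names introduced above (avlbl, acq, need, done); it is not tied to any particular kernel code.

#include <stdbool.h>

#define NPROC 8      /* n processes */
#define NRES  4      /* m resource types */

/* Returns true if the state described by avlbl, acq and need is safe. */
bool is_safe_state(const int avlbl[NRES],
                   const int acq[NPROC][NRES],
                   const int need[NPROC][NRES])
{
    int  cur_cnt[NRES];
    bool done[NPROC] = { false };

    for (int j = 0; j < NRES; j++)
        cur_cnt[j] = avlbl[j];               /* initialize with free resources */

    bool progress = true;
    while (progress) {
        progress = false;
        for (int i = 0; i < NPROC; i++) {
            if (done[i])
                continue;
            bool can_run = true;             /* need[i] <= cur_cnt elementwise? */
            for (int j = 0; j < NRES; j++)
                if (need[i][j] > cur_cnt[j])
                    can_run = false;
            if (!can_run)
                continue;
            /* Assume process i runs to completion and returns its resources. */
            for (int j = 0; j < NRES; j++)
                cur_cnt[j] += acq[i][j];
            done[i] = true;
            progress = true;
        }
    }

    for (int i = 0; i < NPROC; i++)          /* safe iff everyone could finish */
        if (!done[i])
            return false;
    return true;
}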
Let us now look at the resource request algorithm (Algorithm 2). We start
out with introducing a new array called req, which holds process i’s require-
ments. For example, if req[j] is equal to k, it means that process i needs k
copies of resource j.
Let us now move to the check phase. Consider the case where need[i] ≺ req,
which basically means that every entry of req is greater than or equal to the
corresponding entry of need[i], and at least one entry is strictly greater than
the corresponding entry in need[i]. In this case, the process is clearly requesting more
than what it declared a priori (stored in the need[i] array). Such
requests cannot be satisfied. We need to return false. On the other hand, if
avlbl ≺ req, then it means that we need to wait for resource availability, which
may happen in the future. In this case, we are clearly not exceeding pre-declared
thresholds, as we were doing in the former case.
Next, let us make a dummy allocation once enough resources become avail-
able (allocate). The first step is to subtract req from avlbl. This basically
means that we satisfy the request for process i. The resources that it requires
are not free any more. Then we add req to acq[i], which basically means that
the said resources have been acquired. We then proceed to subtract req from
need[i]. This is because at all points of time, max=acq + need.
After this dummy allocation, we check if the state is safe or not by invoking
Algorithm 1. If the state is not safe, then it means that the current resource
allocation request should not be allowed – it may lead to a deadlock.
A similar iterative procedure is used for deadlock detection, where reqs[i] holds the
currently pending request of process i. Let us now understand the expression reqs[i] ⪯ cur cnt. This basically
means that for some process i, we can satisfy its request at that point of time.
We subsequently move to update, where we assume that i’s request has been
satisfied. Therefore, similar to the safety checking algorithm, we return the
resources that i had held. We thus add acq[i] to cur cnt. This process is done
now (done[i] ← true). We go back to the find procedure and keep iterating till
we can satisfy the requests of as many processes as possible. When this is not
possible any more, we jump to deadlock check.
Now, if done[i] == true for all processes, then it means that we were able
to satisfy the requests of all processes. There cannot be a deadlock. However,
if this is not the case, then it means that there is a dependency between pro-
cesses because of the resources that they are holding. This indicates a potential
deadlock situation.
There are several ways of avoiding a deadlock. The first is that before every
resource/lock acquisition we check the request using Algorithm 2. We do not
acquire the resource if we are entering an unsafe state. If the algorithm is
more optimistic and we have entered an unsafe state already, then we perform
a deadlock check, especially when the system does not appear to make any
progress. We kill one of the processes involved in a deadlock and release its
resources. We can choose one of the processes that has been waiting for a long
time or has a very low priority.
/* The core loop of the schedule () function ( kernel / sched / core . c ) */
do {
    preempt_disable () ;
    __schedule ( SM_NONE ) ;
    sched_preempt_enable_no_resched () ;
} while ( need_resched () ) ;
There are several ways in which the schedule function can be called. If
a task makes a blocking call to a mutex or semaphore, then there is a pos-
sibility that it may not acquire the mutex/semaphore. In this case, the task
needs to be put to sleep. The state will be set to either INTERRUPTIBLE or
UNINTERRUPTIBLE. Since the current task is going to sleep, there is a need
to invoke the schedule function such that another task can execute.
The second case is when a process returns after processing an interrupt or
system call. The kernel checks the TIF NEED RESCHED flag. If it is set to true,
then it means that there are waiting tasks and there is a need to schedule them.
On similar lines, if there is a timer interrupt, there may be a need to swap the
current task out and bring a new task in (preemption). Again we need to call
the schedule function to pick a new task to execute on the current core.
Every CPU has a runqueue where tasks are added. This is the main data
structure that manages all the tasks that are supposed to run on a CPU. The
apex data structure here is the runqueue (struct rq) (see kernel/sched/sched.h).
Linux defines different kinds of schedulers (refer to Table 5.2). Each sched-
uler uses a different algorithm to pick the next task that needs to run on a
core. The internal schedule function is a wrapper function on the individual
scheduler-specific function. There are many types of runqueues – one for each
type of scheduler.
Scheduling Classes
Let us introduce the notion of scheduling classes. A scheduling class represents
a class of jobs that need to be scheduled by a specific type of scheduler. Linux
defines a hierarchy of scheduling classes. This means that if there is a pending
job in a higher scheduling class, then we schedule it first before scheduling a job
in a lower scheduling class.
The classes are as follows in descending order of priority.
Stop Task This is the highest priority task. It stops everything and executes.
DL This is the deadline scheduling class that is used for real-time tasks. Every
task is associated with a deadline. Typically, audio and video encoders
create tasks in this class. This is because they need to finish their work
in a bounded amount of time. For 60-Hz video, the deadline is 16.66 ms.
RT These are regular real-time threads that are typically used for processing
interrupts (top or bottom halves), for example softIRQs.
Fair This is the default scheduler that the current version of the kernel uses
(v6.2). It ensures a degree of fairness among tasks where even the lowest
priority task gets some CPU time.
Idle This scheduler runs the idle process, which means it basically accounts for
the time in which the CPU is not executing anything – it is idle.
In Listing 5.26, we observe that most of the functions have the same broad
pattern. The key argument is the runqueue struct rq that is associated with
each CPU. It contains all the task structs scheduled to run on a given CPU. In
any scheduling operation, it is mandatory to provide a pointer to the runqueue
such that the scheduler can find a task among all the tasks in the runqueue to
execute on the core. We can then perform several operations on it such as en-
queueing or dequeueing a task: enqueue task and dequeue task, respectively.
The most important functions in any scheduler are the functions pick task
and pick next task – they select the next task to execute. These functions are
scheduler specific. Each type of scheduler maintains its own data structures and
has its own internal notion of priorities and fairness. Based on the scheduler’s
task selection algorithm an appropriate choice is made. The pick task func-
tion is the fastpath that finds the highest priority task (all tasks are assumed
to be separate), whereas the pick next task function is on the slowpath. The
slowpath incorporates some additional functionality, which can be explained as
follows. Linux has the notion of control groups (cgroups). These are groups
of processes that share scheduling resources. Linux ensures fairness across pro-
cesses and cgroups. In addition, it ensures fairness between processes in a
cgroup. cgroups further can be grouped into hierarchies. The pick next task
function ensures fairness while also considering cgroup information.
Let us consider a few more important functions. migrate task rq mi-
grates the task to another CPU – it performs the crucial job of load balancing.
update curr performs some bookkeeping for the current task – it updates its
runtime statistics. There are many other functions in this class such as func-
tions to yield the CPU, check for preemptibility, set CPU affinities and change
priorities.
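An abridged view of this callback table is sketched below. The subset of fields and their order are illustrative; the full struct sched class in kernel/sched/sched.h has many more callbacks, and the exact set varies across kernel versions.

/* Abridged, illustrative sketch of struct sched_class (kernel/sched/sched.h). */
struct sched_class {
    /* add or remove a task from this scheduler's runqueue */
    void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);

    /* fast path and slow path (cgroup-aware) task selection */
    struct task_struct *(*pick_task)(struct rq *rq);
    struct task_struct *(*pick_next_task)(struct rq *rq);

    /* load balancing: move a task to another CPU's runqueue */
    void (*migrate_task_rq)(struct task_struct *p, int new_cpu);

    /* bookkeeping: update runtime statistics of the current task */
    void (*update_curr)(struct rq *rq);
};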
These scheduling classes are defined in the kernel/sched directory. Each
scheduling class has an associated scheduler, which is defined in a separate C
file (see Table 5.2).
Table 5.2: Scheduling classes and their implementation files (in kernel/sched)

Scheduler                               File
Stop task scheduler                     stop_task.c
Deadline scheduler                      deadline.c
Real-time scheduler                     rt.c
Completely fair scheduler (CFS)         fair.c
Idle                                    idle.c
The runqueue
Let us now take a deeper look at a runqueue (struct rq) in Listing 5.27. The
entire runqueue is protected by a single spinlock lock. It is used to lock all key
operations on the runqueue. Such a global lock that protects all the operations
on a data structure is known as a monitor lock.
The next few fields are basic CPU statistics. The field nr running is the
number of runnable processes in the runqueue. nr switches is the number of
process switches on the CPU and the field cpu is the CPU number.
The runqueue is actually a container of individual scheduler-specific run-
queues. It contains three fields that point to runqueues of different schedulers:
cfs, rt and dl. They correspond to the runqueues for the CFS, real-time and
deadline schedulers, respectively. We assume that in any system, at the min-
imum we will have three kinds of tasks: regular (handled by CFS), real-time
tasks and tasks that have a deadline associated with them. These scheduler
types are hardwired into the logic of the runqueue.
It holds pointers to the current task (curr), the idle task (idle) and the
mm struct of the previously running task (prev mm). In the case of core scheduling, the task that is
chosen to execute next is stored in struct task struct *core pick.
Scheduling-related Statistics
    /* Preferred CPU */
    struct rb_node core_node ;

    /* statistics */
    u64 exec_start ;
    u64 sum_exec_runtime ;
    u64 vruntime ;
    u64 prev_sum_exec_runtime ;
    u64 nr_migrations ;

    struct sched_avg avg ;

    /* runqueue */
    struct cfs_rq * cfs_rq ;
};
Notion of vruntimes
Equation 5.4 shows the relation between the actual runtime and the vruntime.
The increment in the vruntime, δvruntime , is a scaled version of the actual runtime.
Let δ be equal to the time interval between the current
time and the time at which the current task started executing. If δvruntime is equal
to δ, then it means that we are not using a scaling factor. The scaling factor
is equal to the weight associated with a nice value of 0 divided by the weight
associated with the task's real nice value. We clearly expect the ratio to be less than
1 for high-priority tasks and greater than 1 for low-priority tasks.
\[
\delta_{vruntime} = \delta \times \frac{weight(nice = 0)}{weight(nice)} \tag{5.4}
\]
Listing 5.30 shows the mapping between nice values and weights. The weight
is 1024 for the nice value 0, which is the default. For every increase in the
nice value by 1, the weight reduces by a factor of roughly 1.25. For example, if the nice value is 5,
the weight is 335 and δvruntime = 3.05δ. Clearly, we have an exponential decrease
in the weight as we increase the nice value. For a nice value of n, the weight is
roughly 1024 × (1.25)−n . The highest priority user task (nice value −20) has a weight equal to
88761 (86.7× the default). This means that it gets significantly more runtime as compared
to a task that has the default priority.
Let us use the three mnemonics SP , N and G for the sake of readability.
Please refer to the code snippet shown in Listing 5.31. If the number of runnable
tasks are more than N (limit of the number of runnable tasks that can be con-
sidered in a scheduling period (SP )), then it means that the system is swamped
with tasks. We clearly have more tasks than what we can run. This is a crisis
situation and we are looking at a rather unlikely situation. The only option in
this case is to increase the scheduling period by multiplying nr running with G
(minimum task execution time).
Let us consider the else part, which is the more likely case. In this case, we
set the scheduling period as SP .
Listing 5.31: Implementation of scheduling quanta in CFS
source : kernel/sched/fair.c
u64 __sched_period ( unsigned long nr_running )
{
    if ( unlikely ( nr_running > sched_nr_latency ) )
        return nr_running * sysctl_sched_min_granularity ;
    else
        return sysctl_sched_latency ;
}
Once the scheduling period has been set, we set the scheduling slice for
each task as shown in Equation 5.5 (assuming we have the normal case where
nr running ≤ N).
\[
slice_i = SP \times \frac{weight(task_i)}{\sum_j weight(task_j)} \tag{5.5}
\]
We basically partition the scheduling period based on the weights of the con-
stituent tasks. Clearly, high-priority tasks get larger scheduling slices. However,
if we have the unlikely case where nr running > N, then each slice is equal to
G.
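As a small worked example (with made-up numbers), suppose SP = 24 ms and three runnable tasks have nice values 0, 0 and 5, i.e., weights 1024, 1024 and 335. Then:

\[
slice_1 = slice_2 = 24\,\text{ms} \times \frac{1024}{1024 + 1024 + 335} \approx 10.3\,\text{ms}, \qquad
slice_3 = 24\,\text{ms} \times \frac{335}{2383} \approx 3.4\,\text{ms}
\]

The three slices add up to the scheduling period, and the lower-priority task gets a proportionally smaller share.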
The scheduling algorithm works as follows. We find the task with the least
vruntime in the red-black tree. We allow it to run until it exhausts its scheduling
slice. This logic is shown in Listing 5.32. Here, if the CFS queue is non-
empty, we compute the time for which a task has already executed (ran). If
slice > ran, then we execute the task for slice − ran time units by setting
the timer accordingly, otherwise we reschedule the current task.
Listing 5.32: hrtick start fair
source : kernel/sched/fair.c
if ( rq - > cfs . h_nr_running > 1) {
    u64 slice = sched_slice ( cfs_rq , se ) ;
    u64 ran = se - > sum_exec_runtime - se - > prev_sum_exec_runtime ;
    s64 delta = slice - ran ;

    if ( delta < 0) {
        if ( task_current ( rq , p ) )
            resched_curr ( rq ) ;
        return ;
    }
    hrtick_start ( rq , delta ) ;
}
Clearly, once a task has exhausted its slice, its vruntime has increased and
its position needs to be adjusted in the RB tree. In any case, every time we
need to schedule a task, we find the task with the least vruntime in the RB tree
and check if it has exhausted its time slice or not. If it has, then we mark it as
a candidate for rescheduling (if there is spare time left in the current scheduling
period) and move to the next task in the RB tree with the second-smallest
vruntime. If that also has exhausted its scheduling slice or is not ready for some
reason, then we move to the third-smallest, and so on.
Once all tasks are done, we try to execute tasks that are rescheduled, and
then start the next scheduling period.
There are two tricky cases: tasks waking up after a long sleep and tasks getting migrated. They will start with a zero
vruntime and shall continue to have the minimum vruntime for a long time.
This has to be prevented – it is unfair for existing tasks. Also, when tasks move
from a heavily-loaded CPU to a lightly-loaded CPU, they should not have an
unfair advantage there. The following safeguards are in place.
3. If an old task is being restored or a new task is being added, then set
se->vruntime += cfs_rq->min_vruntime. This ensures that some
degree of a level playing field is being maintained.
4. This ensures that other existing tasks have a fair chance of getting sched-
uled.
5. Always ensure that all vruntimes monotonically increase (in the cfs rq
and sched entity structures).
\[
loadavg = u_0 + u_1 y + u_2 y^2 + u_3 y^3 + \ldots \tag{5.6}
\]
This is a time-series sum with a decay term y. The decay rate is quite
slow: $y^{32} = 0.5$, or in other words, $y = 2^{-1/32}$. This is known as per-entity load
tracking (PELT, kernel/sched/pelt.c), where the number of intervals for which
we compute the load average is a configurable parameter.
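The decayed sum can be computed incrementally, as the stand-alone sketch below illustrates. The fixed-point arithmetic and exact constants of the real PELT code in kernel/sched/pelt.c are simplified here into floating point, and the sample values are made up.

#include <stdio.h>

/* Decay factor: y^32 = 0.5, i.e., the contribution of an interval halves
 * every 32 intervals. */
#define PELT_Y 0.97857206  /* approximately 2^(-1/32) */

/* Fold a new utilization sample u_0 into the running, decayed load:
 * load = u_0 + u_1*y + u_2*y^2 + ...  (newest sample has weight 1). */
static double pelt_update(double load, double u0)
{
    return u0 + PELT_Y * load;
}

int main(void)
{
    double load = 0.0;
    /* Hypothetical per-interval utilization samples (1.0 = fully busy). */
    double samples[] = { 1.0, 1.0, 0.5, 0.0, 0.25, 1.0 };

    for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        load = pelt_update(load, samples[i]);
        printf("interval %u: load = %.3f\n", i, load);
    }
    return 0;
}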
© Smruti R. Sarangi 190
Real-time Scheduler
The real-time scheduler has one queue for every real-time priority. In addition,
we have a bit vector– one bit for each real-time priority. The scheduler finds
the highest-priority non-empty queue. It starts picking tasks from that queue.
If there is a single task, then that task executes. This scheduling is clearly not
fair: there is no notion of fairness across real-time priorities.
However, for tasks having the same real-time priority, there are two op-
tions: FIFO and round-robin (RR). In the real-time FIFO option, we break ties
between two equal-priority tasks based on when they arrived (first-in first-out
order). In the round-robin (RR) algorithm, we check if a task has exceeded its
allocated time slice. If it has, we put it at the end of the queue (associated
with the real-time priority). We find the next task in this queue and mark it
for execution.
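The queue-selection step can be sketched as follows in stand-alone C. The linear scan stands in for the kernel's find-first-bit operation on the real-time priority bitmap, and the structure names here are purely illustrative.

#include <stddef.h>

#define RT_PRIOS 100                      /* real-time priorities 0..99 */

struct rt_task { int id; struct rt_task *next; };

struct toy_rt_rq {
    unsigned long long bitmap[(RT_PRIOS + 63) / 64];  /* 1 bit per priority */
    struct rt_task    *queue[RT_PRIOS];               /* FIFO/RR list per priority */
};

/* Return the head of the first (highest-priority) non-empty queue.
 * In this sketch, a lower index means a higher priority. */
struct rt_task *pick_next_rt_task(struct toy_rt_rq *rq)
{
    for (int prio = 0; prio < RT_PRIOS; prio++) {
        if (rq->bitmap[prio / 64] & (1ULL << (prio % 64)))
            return rq->queue[prio];       /* head of that priority's queue */
    }
    return NULL;                          /* no runnable real-time task */
}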
Chapter 6
holes. Let us say that a process requires 100 KB and the size of a hole is 150
KB, then we are leaving 50 KB free. We basically create a new hole that is 50
KB long. This phenomenon of having holes between regions and not using that
space is known as external fragmentation. On the other hand, leaving space
empty within a page in a regular virtual memory system is known as internal
fragmentation.
[Figure: allocated regions and holes in memory; each allocated region is delimited by a base and a limit]
The next question that we need to answer is the following: if we are starting a new
process and we know exactly the maximum amount of memory that it
requires, then which hole do we select for allocating its memory? Clearly, the size
of the hole needs to be at least the amount of requested memory. However,
there could be multiple such holes and we need to choose one of them. Our
choice really matters because it determines the efficiency of the entire allocation process.
It is very well possible that later on we may not be able to satisfy requests
primarily because we will not have holes of adequate size left. Hence, designing
a proper heuristic in this space is important particularly in anticipation of the
future. There are several heuristics in this space. Let us say that we need R
bytes.
Best Fit Choose the smallest hole that is just about larger than R.
Worst Fit Choose the largest hole.
Next Fit Start searching from the last allocation that was made and move
towards higher addresses (with wraparounds).
First Fit Choose the first available hole that is at least R bytes in size (a code sketch of these heuristics follows this list).
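The first-fit and best-fit policies, for example, can be sketched as follows in stand-alone C (the hole array and its fields are illustrative):

#include <stddef.h>

struct hole { size_t base; size_t size; };

/* First fit: return the index of the first hole with size >= R, or -1. */
int first_fit(const struct hole *holes, int n, size_t R)
{
    for (int i = 0; i < n; i++)
        if (holes[i].size >= R)
            return i;
    return -1;
}

/* Best fit: return the index of the smallest hole with size >= R, or -1. */
int best_fit(const struct hole *holes, int n, size_t R)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (holes[i].size < R)
            continue;
        if (best == -1 || holes[i].size < holes[best].size)
            best = i;
    }
    return best;
}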
The stack distance typically has a distribution that is similar to the one
shown in Figure 6.3. Note that we have deliberately not shown the units of the
x and y axes because the aim was to just show the shape of the curve and not the exact values.
Figure 6.3: Representative plot of the stack distance distribution (probability on the y-axis, stack distance on the x-axis)
A heavy-tailed distribution is typically used to model the stack distance curve because it captures the fact that
very low stack distances are rare, then there is a strong peak and finally there
is a heavy tail. This is easy to interpret and also easy to use as a theoretical
tool. Furthermore, we can use it to perform some straightforward mathematical
analyses as well as realize practical algorithms that rely on some form of
caching or some other mechanism to leverage temporal locality.
Stack-based Algorithms
WS-Clock Algorithm
Let us now implement an approximation of the LRU policy. A simple im-
plementation is known as the WS-Clock page replacement algorithm,
which is shown in Figure 6.4. Here WS stands for “working set”, which we shall
discuss later in Section 6.1.3.
Every physical page in memory is associated with an access bit, which is set
to either 0 or 1 and is stored along with the corresponding page table entry. A
pointer like the minute hand of a clock points to a physical page; it is meant
to move through all the physical pages one after the other (in the list of pages)
until it wraps around.
If the access bit of the page pointed to by the pointer is equal to 1, then it is
set to 0 when the pointer traverses it. There is no need to periodically scan all
the pages and set their access bits to 0. This will take a lot of time. Instead, in
this algorithm, once there is a need for replacement, we check the access bit and
if it is set to 1, we reset it to 0. However, if the access bit is equal to 0, then we
select that page for replacement. For the time being, the scan stops at that
point. The next time a replacement is needed, the pointer resumes from the same point and keeps
traversing the list of pages, wrapping around at the end.
This algorithm can approximately find the pages that are not recently used
and select one of them for eviction. It turns out that we can do better if we
differentiate between unmodified and modified pages in systems where the swap
space is inclusive – every page in memory has a copy in the swap space, which
could possibly be stale. The swap space in this case acts as a lower-level cache.
Each page is now associated with a pair of bits: ⟨access bit, modified bit⟩.
If both the bits are equal to 0, then they remain so and we go ahead and
select that page as a candidate for replacement. On the other hand, if they are
equal to ⟨0, 1⟩, which means that the page has been modified and after that its
access bit has been set to 0, then we perform a write-back and move forward.
The final state in this case is set to ⟨0, 0⟩ because the data is not deemed to be
modified anymore since it has been written back to the swap space. Note that every modified
page in this case has to be written back to the swap space whereas unmodified
pages can be seamlessly evicted given that the swap space has a copy. As a
result we prioritize unmodified pages for eviction.
Next, let us consider the combination ⟨1, 0⟩. Here, the access bit is 1, so we
set it to 0. The resulting combination of bits is now ⟨0, 0⟩; we move forward. We
are basically giving the page a second chance in this case as well because it was
accessed in the recent past.
Finally, if the combination of these 2 bits is ⟨1, 1⟩, then we perform the
write-back, and reset the new state to ⟨1, 0⟩. This means that this is clearly a
frequently used frame that gets written to and thus it should not be evicted or
downgraded (access bit set to 0).
All in all, this is a simple algorithm that takes the differing overheads of
reads and writes into account. For modified pages, it gives the page a second chance in a
certain sense.
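The scan described above can be sketched as a stand-alone C routine. The frame array, the bit names and the write_back() helper are illustrative; in the kernel, the corresponding state lives in the page table entries and in struct page.

#include <stdbool.h>

struct frame {
    bool accessed;     /* set by the hardware on a reference */
    bool dirty;        /* set by the hardware on a write */
    int  page;         /* which virtual page is mapped here */
};

/* Hypothetical helper: write a dirty frame back to the swap space. */
void write_back(struct frame *f);

/* One clock sweep: returns the index of the victim frame.
 * 'hand' is the clock pointer and persists across calls. */
int ws_clock_pick_victim(struct frame *frames, int nframes, int *hand)
{
    for (;;) {
        struct frame *f = &frames[*hand];

        if (!f->accessed && !f->dirty)
            return *hand;              /* <0,0>: evict this frame */

        if (!f->accessed && f->dirty) {
            write_back(f);             /* <0,1>: clean it, then move on */
            f->dirty = false;
        } else if (f->accessed && !f->dirty) {
            f->accessed = false;       /* <1,0>: give it a second chance */
        } else {
            write_back(f);             /* <1,1>: clean it, keep the access bit */
            f->dirty = false;
        }
        *hand = (*hand + 1) % nframes; /* advance the clock hand */
    }
}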
We need to understand that such LRU approximating algorithms are quite
heavy. They introduce artificial page access faults. Of course, they are not
as onerous as full-blown page faults because they do not fetch data from the
underlying storage device that takes millions of cycles. Here, we only need to
perform some bookkeeping and change the page access permissions. This is
much faster than fetching the entire page from the hard disk or NVM drive.
It is also known as a soft page fault. They however still lead to an exception
and require time to service. There is some degree of complexity involved in this
mechanism. But at least we are able to approximate LRU to some extent.
FIFO Algorithm
The queue-based FIFO (first-in first-out) algorithm is one of the most popular
algorithms in this space. It is quite easy to implement because it does not
require any tracking of last-usage times or access bits. All that we need is a
simple queue in memory that stores all the physical pages in the order in
which they were brought into memory. The page that was brought in the earliest
is the replacement candidate. There is almost no run time overhead in maintaining or
updating this information: we do not spend any time in setting and resetting
access bits or in servicing page access faults. Note that this algorithm is not
stack based and it does not follow the stack property. This is not a good thing,
as we shall see shortly.
Even though this algorithm is simple, it suffers from a very interesting
anomaly known as Belady's Anomaly [Belady et al., 1969]. Let us un-
derstand it better by looking at the two examples shown in Figures 6.5 and 6.6.
In Figure 6.5, we show an access sequence of physical page ids (shown in square
boxes). The memory can fit only four frames. If there is a page fault, we mark
the entry with a cross otherwise we mark the box corresponding to the access
with a tick. The numbers at the bottom represent the contents of the FIFO
queue after considering the current access. After each access, the FIFO queue
is updated.
If the memory is full, then one of the physical pages (frames) in memory
needs to be removed. It is the page that is at the head of the FIFO queue –
the earliest page that was brought into memory. The reader should take some
time and understand how this algorithm works and mentally simulate it. She
needs to understand and appreciate how the FIFO information is maintained
and why this algorithm is not stack based.
Access sequence:   1  2  3  4  1  2  5  1  2  3  4  5
Fault (X)/hit (✓): X  X  X  X  ✓  ✓  X  X  X  X  X  X
FIFO queue         1  2  3  4  4  4  5  1  2  3  4  5
(newest on top):      1  2  3  3  3  4  5  1  2  3  4
                         1  2  2  2  3  4  5  1  2  3
                            1  1  1  2  3  4  5  1  2
Figure 6.5: FIFO page replacement with 4 frames (10 page faults)
Access sequence:   1  2  3  4  1  2  5  1  2  3  4  5
Fault (X)/hit (✓): X  X  X  X  X  X  X  ✓  ✓  X  X  ✓
FIFO queue         1  2  3  4  1  2  5  5  5  3  4  4
(newest on top):      1  2  3  4  1  2  2  2  5  3  3
                         1  2  3  4  1  1  1  2  5  5
Figure 6.6: FIFO page replacement with 3 frames (9 page faults)
In this particular example shown in Figure 6.5, we see that we have a total
of 10 page faults. Surprisingly, if we reduce the number of physical frames in
memory to 3 (see Figure 6.6), we have a very counter-intuitive result. We would
ideally expect the number of page faults to increase because the memory size is
smaller. However, we observe an anomalous result. We have 9 page faults (one
page fault less than the larger memory with 4 frames) !!!
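The two figures can be reproduced with a short stand-alone simulation. The snippet below replays the access sequence against a FIFO queue of a given size and counts faults; running it prints 10 faults for 4 frames and 9 faults for 3 frames.

#include <stdbool.h>
#include <stdio.h>

/* Simulate FIFO page replacement and return the number of page faults. */
static int fifo_faults(const int *seq, int len, int nframes)
{
    int frames[16];            /* circular FIFO of resident pages */
    int head = 0, count = 0, faults = 0;

    for (int i = 0; i < len; i++) {
        bool hit = false;
        for (int j = 0; j < count; j++)
            if (frames[(head + j) % nframes] == seq[i])
                hit = true;
        if (hit)
            continue;
        faults++;
        if (count < nframes) {
            frames[(head + count) % nframes] = seq[i];  /* use a free frame */
            count++;
        } else {
            frames[head] = seq[i];      /* evict the oldest page */
            head = (head + 1) % nframes;
        }
    }
    return faults;
}

int main(void)
{
    int seq[] = { 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 };
    int len = sizeof(seq) / sizeof(seq[0]);

    printf("4 frames: %d faults\n", fifo_faults(seq, len, 4));  /* 10 */
    printf("3 frames: %d faults\n", fifo_faults(seq, len, 3));  /* 9  */
    return 0;
}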
The reader needs to go through this example in great detail. She needs to
understand the reasons behind this anomaly. These anomalies are only seen in
algorithms that are not stack-based. Recall that in a stack-based algorithm,
we have the stack property – at all points of time the set of pages in a larger
memory are a superset of the pages that we would have in a smaller memory.
Hence, we cannot observe such an anomaly. Now, we may be tempted to believe
that this anomaly is actually limited to small discrepancies. This means that if
we reduce the size of the memory, maybe the size of the anomaly is quite small
(limited to a very few pages).
However, this presumption is sadly not true. It was shown in a classic paper
by Fornai et al. [Fornai and Iványi, 2010a, Fornai and Iványi, 2010b] that a
sequence always exists that can make the discrepancy arbitrarily large. In other
words, it is unbounded. This is why Belady's anomaly renders many of
these non-stack-based algorithms ineffective. They perform very badly in
the worst case. One may argue that such “bad” cases are pathological and rare.
But in reality, such bad cases do occur to a limited extent. This significantly
reduces the performance of the system because page faults are associated with
massive overheads.
Figure 6.7: Page fault rate versus the number of pages in memory (the working set size)
Thrashing
Consider a system with a lot of processes. If the space that is allocated to a
process is less than the size of its working set, then the process will suffer from
a high page fault rate. Most of its time will be spent in fetching its working set
and servicing page faults. The CPU performance counters will sadly indicate
that there is a low CPU utilization. The CPU utilization will be low primarily
because most of the time is going in I/O: servicing page faults. However, the
kernel’s load calculator will observe that the CPU load is low. Recall that we
had computed the CPU load in Section 5.4.6 (Equation 5.6) using a similar
logic.
Given that the load average is below a certain threshold, the kernel will try
to spawn more processes to increase the average CPU utilization. This will
actually exacerbate the problem and make it even worse. Now the memory that
is available to a given process will further reduce.
Alternatively, the kernel may try to migrate processes to the current CPU
that is showing reduced activity. Of course, here we are assuming a non-uniform
memory access machine (NUMA machine), where a part of the physical memory
is “close” to the given CPU. This proximate memory will now be shared between
many more processes.
In both cases, we are increasing the pressure on memory. Processes will spend
most of their time in fetching their working sets into memory – the system will
thus become quite slow and unresponsive. This process can continue and become
a vicious cycle. In the extreme case, this will lead to a system crash because
key kernel threads will not be able to finish their work on time.
This phenomenon is known as thrashing. Almost all modern operating sys-
tems have a lot of counters and methods to detect thrashing. The only practical remedy is to reduce the degree of multiprogramming: suspend or swap out a few processes so that the remaining ones can fit their working sets in memory.
[Figure: the virtual memory map – a 64 PB user space region with holes in the remaining address space]
Traditionally, kernel modules have had more or less unfettered access to the kernel's data structures; however, of
late this is changing.
Modules are typically used to implement device drivers, file systems, and
cryptographic protocols/mechanisms. They help keep the core kernel code
small, modular and clean. Of course, security is a big concern while load-
ing kernel modules and thus module-specific safeguards are increasingly getting
more sophisticated – they ensure that modules have limited access to only the
functionalities that they need. With novel module signing methods, we can
ensure that only trusted modules are loaded. 1520 MB is a representative figure
for the size reserved for storing module-related code and data in kernel v6.2.
Note that this is not a standardized number; it can vary across Linux versions
and is also configurable.
struct mm_struct {
    ...
    pgd_t *pgd;   /* Pointer to the page table (type: u64 entries).
                     The CR3 register is set to this value. */
    ...
};
Figure 6.9: The high-level organization of the page table (57-bit address)
Figure 6.9 shows the mm struct structure that we have seen before. It specif-
ically highlights a single field, which stores the page table (pgd t *pgd). The
page table is also known as the page directory in Linux. There are two virtual
memory address sizes that are commonly supported: 48 bits and 57 bits. We
have chosen to describe the 57-bit address in Figure 6.9. We observe that there
are five levels in a page table. The highest level of the page table is known as the
page directory (PGD). Its starting addresses is stored in the CR3 MSR (model
specific register). CR3 stores the starting address of the page table (highest
level) and is specific to a given process. This means that when the process
changes, the contents of the CR3 register also need to change. It needs to point
to the page table of the new process. There is a need to also flush the TLB. This
is very expensive. Hence, various kinds of opimizations have been proposed.
We shall quickly see that the contents of the CR3 register do not change
when we make a process-to-kernel transition or in some cases in a kernel-to-
process transition as well. Here the term process refers to a user process. The
main reason for this is that changing the virtual memory context is associated
with a lot of performance overheads and thus there is a need to minimize such
events as much as possible.
The page directory is indexed using the top 9 bits of the virtual address
(bits 49-57). Then we have four more levels. For each level, the next 9 bits
(towards the LSB) are used to address the corresponding table. The reason
that we have a five-level page table here is because we have 57 virtual address
bits and thus there is a need to have more page table levels. Our aim is to
reduce the memory footprint of page tables as much as possible and properly
leverage the sparsity in the virtual address space. The details of all of these
tables are shown in Table 6.2. We observe that the last level entry is the page
table entry, which contains the mapping between the virtual page number and
the page frame number (or the number of the physical page) along with some
page protection information.
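For reference, the five 9-bit indices and the 12-bit page offset can be extracted as shown below. The shift values correspond to the x86-64 5-level layout (PAGE_SHIFT = 12, PMD_SHIFT = 21, PUD_SHIFT = 30, P4D_SHIFT = 39, PGDIR_SHIFT = 48); the helper itself is just an illustrative stand-alone sketch.

#include <stdint.h>
#include <stdio.h>

#define IDX(va, shift)  (((va) >> (shift)) & 0x1FFULL)   /* one 9-bit index */

/* Decompose a 57-bit virtual address into its page-table indices. */
static void decompose(uint64_t va)
{
    printf("PGD index: %llu\n", (unsigned long long)IDX(va, 48));
    printf("P4D index: %llu\n", (unsigned long long)IDX(va, 39));
    printf("PUD index: %llu\n", (unsigned long long)IDX(va, 30));
    printf("PMD index: %llu\n", (unsigned long long)IDX(va, 21));
    printf("PTE index: %llu\n", (unsigned long long)IDX(va, 12));
    printf("Offset   : %llu\n", (unsigned long long)(va & 0xFFFULL));
}

int main(void)
{
    decompose(0x0123456789ABCDEFULL & ((1ULL << 57) - 1));  /* sample address */
    return 0;
}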
Listing 6.1: The follow pte function (assume the entry exists)
source : mm/memory.c
int follow_pte ( struct mm_struct * mm , unsigned long address ,
                 pte_t ** ptepp , spinlock_t ** ptlp ) {
    pgd_t * pgd ;
    p4d_t * p4d ;
    pud_t * pud ;
    pmd_t * pmd ;
    pte_t * ptep ;

    /* walk the five levels ; error checks elided */
    pgd = pgd_offset ( mm , address ) ;
    p4d = p4d_offset ( pgd , address ) ;
    pud = pud_offset ( p4d , address ) ;
    pmd = pmd_offset ( pud , address ) ;

    /* map the PTE and lock its page table page ( the lock is returned via ptlp ) */
    ptep = pte_offset_map_lock ( mm , pmd , address , ptlp ) ;

    * ptepp = ptep ;
    return 0;
}
Listing 6.1 shows the code for traversing the page table (follow pte func-
tion) assuming that an entry exists. We first walk the top-level page directory,
and find a pointer to the next level table. Next, we traverse this table, find a
pointer to the next level, so on and so forth. Finally, we find the pointer to the
page table entry. However, in this case, we also pass a pointer to a spinlock.
It is locked prior to returning a pointer to the page table entry. This allows us
to make changes to the page table entry. It needs to be subsequently unlocked
after it has been used/modified by another function.
Let us now look slightly deeper into the code that looks up a table in the
5-level page table. A representative example, which traverses from the PUD
level to the PMD level, is shown in Listing 6.2. Recall that the PUD table
contains entries that point to PMD tables. We find the index of the PMD entry
using the function pmd index and add it to the base address of the PMD table,
which is obtained from the PUD entry. This gives us a pointer to the PMD
entry. Recall that each entry of the PUD table contains a pointer to a PMD
table (pmd t *). Let us elaborate.
Listing 6.2: Accessing the page table at the PMD level
/* include / linux / pgtable . h */
pmd_t * pmd_offset ( pud_t * pud , unsigned long address ) {
return pud_pgtable (* pud ) + pmd_index ( address ) ;
}
First consider the pmd index inline function that takes the virtual address
as input. We need to extract bits 22-30, which index the PMD table. This is
achieved by shifting the address to the right by 21 positions and then extracting
the bottom 9 bits (using a bitwise AND operation). The function returns the
entry number in the PMD table. This is multiplied with the size of a PMD
entry and then added to the base address of the PMD table, which is obtained
using the pud pgtable function. Note that the multiplication is implicit primarily
because the return type of the pud pgtable function is pmd t *.
Let us now look at the pud pgtable function. It relies on the va inline
function that takes a physical address as input and returns the virtual address.
The reverse is done by the pa inline function (or macro). In va(x), we simply
add the argument x to an address called PAGE OFFSET. This is not the offset
within a page as the name may possibly suggest. It is an offset into a memory
region where the page table entries are stored. These entries are stored in the
direct-mapped region of kernel memory. The PAGE OFFSET variable points to the
starting point of this region or some point within this region (depending upon
the architecture). Note the linear conversion between a physical and virtual
address.
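As a rough sketch (the exact definitions are architecture specific, and the PAGE OFFSET value below is only a representative x86-64 constant), the two helpers implement a simple linear mapping into the direct-mapped region:

/* A minimal sketch of the __va/__pa helpers, assuming that the direct-mapped
 * region starts at PAGE_OFFSET (the value below is merely representative). */
#define PAGE_OFFSET 0xffff888000000000UL

#define __va(phys)  ((void *)((unsigned long)(phys) + PAGE_OFFSET))
#define __pa(virt)  ((unsigned long)(virt) - PAGE_OFFSET)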
The inline pud pgtable function invokes the va function with an argument
that is constructed as follows. The pud val function returns the raw bits of the
PUD entry, which contain the physical address of the PMD table. We compute
a bitwise AND between this value and a constant that has all 1s between bit
positions 13 and 52 (rest 0s). The reason is that the maximum physical address
size is assumed to be 2^52 bytes in Linux. Furthermore, we are aligning the
address with a page boundary; hence, the first 12 bits (offset within the page)
are set to 0. This gives us the physical address of the PMD table, which is
assumed to be aligned with a page boundary.
This physical address is then converted to a virtual address using the va
function. We then add the PMD index to it and find the virtual address of the
PMD entry.
struct page
struct page is defined in include/linux/mm types.h. It is a fairly complex data
structure that extensively relies on unions. Recall that a union in C is a data
type that can store multiple types of data in the same memory location. It is a
good data type to use if we want it to encapsulate many types of data, where
only one type is used at a time.
The page structure begins with a set of flags that indicate the status of the
page. They indicate whether the page is locked, modified, in the process of
being written back, active, already referenced or reserved for special purposes.
Then there is a union whose size can vary from 20 to 40 bytes depending upon
the configuration. We can store a bunch of things such as a pointer to the
address space (in the case of I/O devices), a pointer to a pool of pages, or a
page map (to map DMA pages or pages linked to an I/O device). Then we have
a reference count, which indicates the number of entities that are currently holding
a reference to the page. This includes regular processes, kernel components and
even external devices such as DMA controllers.
We need to ensure that before a page is recycled (returned back to the pool
of pages), its reference count is equal to zero. It is important to note that the
page structure is used ubiquitously and for a large number of purposes; hence
it needs to have a very flexible structure. This is where using a union with a
large number of options for storing diverse types of data turns out to be very
useful.
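To make the discussion concrete, here is a highly simplified sketch of the idea (the real struct page in include/linux/mm types.h has many more union members and carefully packed fields; the names below are illustrative):

/* A simplified, illustrative sketch of how struct page uses a union so that
 * the same descriptor can serve very different users of a physical frame. */
struct page_sketch {
    unsigned long flags;          /* locked, dirty, under writeback, ... */
    union {
        struct {                  /* page cache / anonymous memory */
            void *mapping;        /* owning address space (or anon info) */
            unsigned long index;  /* offset within that mapping */
        };
        struct {                  /* slab allocator bookkeeping */
            void *slab_cache;     /* cache this page belongs to */
            void *freelist;       /* first free object in the page */
        };
        void *dma_pool;           /* pages handed to DMA / device pools */
    };
    int refcount;                 /* entities currently holding a reference */
};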
Folios
Let us now discuss folios [Corbet, 2022, Corbet, 2021]. A folio is a compound
or aggregate page that comprises two or more contiguous pages. The reason
that folios were introduced is because memories are very large as of today and
it is very difficult to handle the millions of pages that they contain. The sheer
translation overhead and the overhead of maintaining page-related metadata
are quite prohibitive. Hence, a need was felt to group consecutive pages into
larger units called folios. Specifically, a folio points to the first page in a group
of pages (compound page). Additionally, it stores the number of pages that are
a part of it.
The earliest avatars of folios were meant to be a contiguous set of virtual
pages, where the folio per se is identified by a pointer to the head page (first
page). It is a single entity insofar as the rest of the kernel code is concerned. This
in itself is a very useful concept because in a sense we are grouping contiguous
virtual memory pages based on some notion of application-level similarity.
Now if the first page of the folio is accessed, then in all likelihood the rest
of the pages will also be accessed very soon. Hence, it makes a lot of sense to
prefetch these pages to memory in anticipation of being used in the near future.
However, over the years the thinking has somewhat changed, even though folios
are still in the process of being fully integrated into the kernel. Most current
interpretations try to achieve contiguity in the physical address space as well.
This has a lot of advantages with respect to I/O, DMA accesses and reduced
translation overheads. Let us discuss another angle.
Almost all server-class machines as of today have support for huge pages,
which have sizes ranging from 2 MB to 1 GB. They reduce the pressure on the
TLB and page tables, and also increase the TLB hit rate. We maintain a single
entry for the entire huge page. Consider a 1 GB huge page. It can store 2^18
4 KB pages. If we store a single mapping for it, then we substantially reduce
the number of entries that we need to have in the TLB and the page table.
Of course, this requires hardware support and may sometimes be perceived to
be wasteful in terms of memory. However, in today's day and age we have a
lot of physical memory. For many applications this is a very useful facility,
and the entire 1 GB region can be represented by a set of folios – this simplifies
its management significantly.
Furthermore, I/O and DMA devices do not use address translation. They
need to access physical memory directly and thus they benefit by having a
large amount of physical memory allocated to them. It becomes very easy to
transfer a huge amount of data directly to/from physical memory if they have
a large contiguous allocation. Additionally, from the point of view of software
it also becomes much easier to interface with I/O devices and DMA controllers
because this entire memory region can be mapped to a folio. The concept
of a folio along with a concomitant hardware mechanism such as huge pages
enables us to perform such optimizations quite easily. We thus see the folio as
a multifaceted mechanism that enables prefetching and efficient management of
I/O and DMA device spaces.
Given that a folio is perceived to be a single entity, all usage and replacement-
related information (LRU stats) are maintained at the folio level. It basically
acts like a single page. It has its own permission bits as well as copy-on-write
status. Whenever a process is forked, the entire folio acts as a single unit like
a page and is copied in totality when there is a write to any constituent page.
LRU information and references are also tracked at the folio level.
Mapping the struct page to the Page Frame Number (and vice versa)
Let us now discuss how to map a page or folio structure to a page frame number
(pfn). There are several simple mapping mechanisms. Listing 6.3 shows the code
for extracting the pfn from a page table entry (pte pfn macro). We simply right
shift the address by 12 positions (PAGE SHIFT).
Listing 6.3: Converting the page frame number to the struct page and vice
versa
source : include/asm-generic/memory model.h
#define pte_pfn(x)       phys_to_pfn(x.pte)
#define phys_to_pfn(p)   ((p) >> PAGE_SHIFT)
The next macro pfn to page has several variants. A simpler avatar of
this macro simply assumes a linear array of page structures. There are n such
structures, where n is the number of frames in memory. The code in Listing 6.3
shows a more complex variant where we divide this array into a bunch of sec-
tions. We figure out the section number from the pfn (page frame number), and
every section has a section-specific array. We find the base address of this array
and add the page frame number to it to find the starting address of the corre-
sponding struct page. The need for having sections will be discussed when we
introduce zones in physical memory (in Section 6.2.5).
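For intuition, the simpler flat-memory variant looks roughly like the following sketch (names follow the kernel's FLATMEM model, in which mem map is a single global array of page structures):

/* A minimal sketch of the flat (non-sectioned) variant: one global array of
 * struct page, with one entry per physical frame. */
extern struct page *mem_map;                 /* linear array of struct page */

#define pfn_to_page(pfn)   (mem_map + (pfn))
#define page_to_pfn(page)  ((unsigned long)((page) - mem_map))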
ASIDs
Intel x86 processors have the notion of the processor context ID (PCID), which
in software parlance is also known as the address space ID (ASID). We can
take some of the important user-level processes that are running on a CPU
and assign them a PCID each. Then their corresponding TLB entries will
be tagged/annotated with the PCID. Furthermore, every memory access will
now be annotated with the PCID (conceptually). Only those TLB entries will
be considered that match the given PCID. Intel CPUs typically provide 2^12
(= 4096) PCIDs. One of them is reserved, hence practically 4095 PCIDs can be
supported. There is no separate register for the PCID. Instead, the lower 12
bits of the CR3 register are used to store the current PCID.
Now let us come to the Linux kernel. It supports the generic notion of ASIDs
(address space IDs), which are meant to be architecture independent. Note that
it is possible that an architecture does not even provide ASIDs.
In the specific case of Intel x86-64 architectures, an ASID is the same as a
PCID. This is how we align a software concept (ASID) with a hardware concept
(PCID). Given that the Linux kernel needs to run on a variety of machines and
all of them may not have support for so many PCIDs, it needs to be slightly
more conservative and it needs to find a common denominator across all the
architectures that it is meant to run on. For the current kernel (v6.2), the
developers decided to support only 6 ASIDs, which they deemed to be enough.
This means that out of 4095, only 6 PCIDs on an Intel CPU are used.
Let us now do a case-by-case analysis. Assume that the kernel in the course
of execution tries to access the invalidated page – this will create a correctness
issue if the mapping is still there. Note that since we are in the lazy TLB
mode, the mapping is still valid in the TLB of the CPU on which the kernel
thread is executing. Hence, in theory, the kernel may access the user-level page
that is not valid at the moment. However, this cannot happen in the current
implementation of the kernel. This is because access to user-level pages does not
happen arbitrarily. Instead, such accesses happen via functions with well-defined
entry points in the kernel. Some examples of such functions are copy from user
and copy to user. At these points, special checks can be made to find out if the
pages that the kernel is trying to access are currently valid or not. If they are
not valid because another core has invalidated them, then an exception needs
to be thrown.
Next, assume that the kernel switches to another user process. In this case,
either we flush out all the pages of the previous user process (solves the problem)
or if we are using ASIDs, then the pages remain but the current task’s ASID/P-
CID changes. Now consider shared memory-based inter-process communication
that involves the invalidated page. This happens through well-defined entry
points. Here checks can be carried out – the invalidated page will thus not be
accessed.
Finally, assume that the kernel switches back to a thread that belongs to
the same multi-threaded user-level process. In this case, prior to doing so, the
kernel checks if the CPU is in the lazy TLB mode and if any TLB invalidations
have been deferred. If this is the case, then all such deferred invalidations are
completed immediately prior to switching from the kernel mode. This finishes
the work.
The sum total of this discussion is that to maintain TLB consistency, we do
not have to do it in mission mode. There is no need to immediately interrupt all
the other threads running on the other CPUs and invalidate some of their TLB
entries. Instead, this can be done lazily and opportunistically as and when there
is sufficient computational bandwidth available – critical high-priority processes
need not be interrupted for this purpose.
Figure 6.10: A NUMA machine: nodes (groups of CPUs) connected over a shared interconnect
Refer to Figure 6.10 that shows a NUMA machine where multiple chips
(groups of CPUs) are connected over a shared interconnect. They are typically
organized into clusters of chips/CPUs, and there is a notion of local memory
within a cluster, which is much faster than remote memory (present in another
cluster). We would thus like the data and code that is accessed within a cluster
to remain within the local memory, and we need to minimize the number of
remote memory accesses as far as possible. This needs to be explicitly done to
guarantee the locality of data and ensure a lower average memory access time.
In the parlance of NUMA machines, each cluster of CPUs or chips is known as
a node. All the computing units (e.g. cores) within a node have roughly the
same access latency to local memory as well as remote memory. We thus need
to organize the physical address space hierarchically: the local memory forms
the lowest level, and the next level comprises references to remote memory.
Zones
Given that the physical address space is not flat, there is a need to partition
it. Linux refers to each partition as a zone [Rapoport, 2019]. The aim is to
partition the set of physical pages (frames) in the physical address space into
different nonoverlapping sets.
Each such set is referred to as a zone. The zones are treated separately and
differently. This concept can easily be extended to also encompass frames that
are stored on different kinds of memory devices. We need to understand that
in modern systems, we may have memories of different types. For instance,
we could have regular DRAM memory, flash/NVMe drives, plug-and-play USB
memory, and so on. This is an extension of the NUMA concept where we have
different kinds of physical memories and they clearly have different characteris-
tics with respect to the latency, throughput and power consumption. Hence, it
makes a lot of sense to partition the frames across the devices and assign each
group of frames (within a memory device) to a zone. Each zone can then be
managed efficiently and appropriately (according to the device that it is associ-
ated with). Memory-mapped I/O and pages reserved for communicating with
the DMA controller can also be brought within the ambit of such zones.
Listing 6.4 shows the details of the enumeration type zone type. It lists the
different types of zones that are normally supported in a regular kernel.
The first is ZONE DMA, which is a memory area that is reserved for physical
pages that are meant to be accessed by the DMA controller. It is a good idea to
partition the memory and create an exclusive region for the DMA controller. It
can then access all the pages within its zone freely, and we can ensure that data
in this zone is not cached. Otherwise, we will have a complex sequence of cache
evictions to maintain consistency with the DMA device. Hence, partitioning
the set of physical frames helps us clearly mark a part of the memory that
needs to remain uncached, as is normally the case with DMA pages. This
makes DMA operations fast and reduces the number of cache invalidations and
writebacks substantially.
Next, we have ZONE NORMAL, which is for regular kernel and user pages.
Sometimes we may have a peculiar situation where the size of the physical
memory actually exceeds the total size of the virtual address space. This can
happen on some older processors and also on some embedded systems that use
16-bit addressing. In such special cases, we would like to have a separate zone
of the physical memory that keeps all the pages that are currently not mapped
to virtual addresses. This zone is known as ZONE HIGHMEM.
User data pages, anonymous pages (stack and heap), regions of memory used
by large applications, and regions created to handle large file-based applications
can all benefit from placing their pages in contiguous zones of physical memory.
/* Normal pages */
ZONE_NORMAL ,
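The ZONE NORMAL member shown above is one entry in the zone type enumeration (Listing 6.4). A hedged sketch of the full enumeration (simplified from include/linux/mmzone.h; the exact set of zones present depends on the kernel configuration) is as follows:

/* A simplified sketch of enum zone_type; the exact members are
 * configuration dependent. */
enum zone_type {
    ZONE_DMA,       /* frames reserved for (legacy) DMA controllers */
    ZONE_NORMAL,    /* regular kernel and user pages */
    ZONE_HIGHMEM,   /* frames not covered by the virtual address space */
    ZONE_MOVABLE,   /* frames whose contents can be migrated */
    __MAX_NR_ZONES
};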
Sections
Recall that in Listing 6.3, we had talked about converting page frame numbers
to page structures and vice versa. We had discussed the details of a simple linear
layout of page structures and then a more complicated hierarchical layout that
divides the zones into sections.
It is necessary to take a second look at this concept now (refer to Figure 6.11).
To manage all of this memory efficiently, it is sometimes necessary to divide it
into sections and create a 2-level hierarchical structure. The first reason is that
we can efficiently manage the list of free frames within a section because we use
smaller data structures. Second, sometimes zones can be noncontiguous.
It is thus a good idea to break a noncontiguous zone into a set of sections,
where each section is a contiguous chunk of physical memory. Finally, sometimes
there may be intra-zone heterogeneity in the sense that the latencies of different
Figure 6.11: Sections within a zone: a one-to-one mapping between page frame numbers (PFNs) and struct page entries
Figure 6.12: Allocating 20 KB in the buddy system: a free 64 KB block is split into two 32 KB blocks, and the 20 KB request is satisfied from one of them
Now we can clearly see that 20 KB is between two powers of two: 16 KB and
32 KB. Hence, we take the leftmost 32 KB region and out of that we allocate
20 KB to the current request. We basically split a large free region into two
equal-sized smaller regions until the request lies between the region size and the
region size divided by two. We are basically overlaying a binary tree on top of
a linear array of pages.
If we traverse the leaves of this buddy tree from left to right, then they
essentially form a partition of the single large region. An allocation can only be
made at the leaves. If the request size is less than half the size of a leaf node
that is unallocated, then we split it into two equal-sized regions (contiguous in
memory), and continue to do so until we can just about fit the request. Note
that throughout this process, the size of each sub-region is still a power of 2.
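As a small worked example (a standalone sketch, not kernel code), the order of the block needed for a request can be computed as follows; a 20 KB request needs an order-3 block, i.e., 8 pages or 32 KB, matching the figure above:

#include <stdio.h>

/* Compute the buddy-system order for a request: the smallest k such that
 * 2^k pages can hold the requested number of bytes. */
static unsigned int order_for(unsigned long bytes, unsigned long page_size)
{
    unsigned long pages = (bytes + page_size - 1) / page_size; /* round up */
    unsigned int order = 0;
    while ((1UL << order) < pages)
        order++;
    return order;
}

int main(void)
{
    /* 20 KB -> 5 pages -> order 3 (an 8-page, i.e., 32 KB block) */
    printf("order = %u\n", order_for(20 * 1024, 4096));
    return 0;
}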
Now assume that after some time, we get a request for a 64 KB block of
memory. Then as shown in the second part of Figure 6.12, we allocate the
remaining 64 KB region (right child of the parent) to the request.
Figure 6.13: Freeing the 20 KB region allocated earlier. The freed 32 KB block and its free 32 KB buddy are coalesced into a single 64 KB block within the 128 KB region.
Let us now free the 20 KB region that was allocated earlier (see Figure 6.13).
In this case, we will have two 32 KB regions that are free and next to each other
(they are siblings in the tree). There is no reason to have two free regions at
the same level. Instead, we can get rid of them and just keep the parent,
whose size is 64 KB. We are essentially merging free regions (holes) and creating
a larger free region. In other words, we can say that if both the children of a
parent node are free (unallocated), they should be removed, and we should only
have the parent node that coalesces the full region. Let us now look at the
implementation. Let us refer to the region represented by each node in the
buddy tree as a block.
Implementation
Let us look at the implementation of the buddy allocator by revisiting the
free area array in struct zone (refer to Section 6.2.5). Let us define the
order of a node in the buddy tree. The order of a leaf node that corresponds
to the smallest possible region – one page – is 0. Its parent has order 1. The
order keeps increasing by 1 till we reach the root. Let us now represent the tree
as an array of lists: one list per order. All the nodes of the tree (of the same
order) are stored one after the other (left to right) in an order-specific list. A
node represents an aggregate page, which stores a block of memory depending
upon the order. Thus we can say that each linked list is a list of pages, where
each page is actually an aggregate page that may point to N contiguous 4 KB
pages, where N is a power of 2.
The buddy tree is thus represented by an array of linked lists – struct
free area free area[MAX ORDER]. Refer to Listing 6.7, where each struct
free area is a linked list of nodes (of the same order). The root’s order is
limited to MAX ORDER - 1. In each free area structure, the member nr free
refers to the number of free blocks (=number of pages in the associated linked
list).
Note that there is a little bit of a twist here. We actually have multiple
linked lists – one for each migration type. The Linux kernel classifies pages
based on their migration type, which captures whether they can move once they
have been allocated. One class of pages cannot move after allocation, another
class can freely move around physical memory, a third class can be reclaimed,
and some pages are reserved for specific purposes. These are different examples
of migration types. We maintain separate lists for different migration types. It
is as if their memory is managed separately.
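A hedged sketch of the relevant bookkeeping (simplified from the definitions in include/linux/mmzone.h; MIGRATE TYPES and MAX ORDER are kernel configuration constants) is shown below:

/* Simplified sketch of the buddy allocator's free-list bookkeeping inside
 * struct zone: one free_area per order, and within it one list per
 * migration type. MIGRATE_TYPES and MAX_ORDER are kernel constants. */
struct free_area {
    struct list_head free_list[MIGRATE_TYPES];  /* lists of free blocks */
    unsigned long    nr_free;                   /* number of free blocks */
};

struct zone {
    /* ... many other fields ... */
    struct free_area free_area[MAX_ORDER];      /* indexed by order */
    /* ... */
};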
Figure 6.14: Buddies within a zone. Each free area entry of the zone holds one list of buddy blocks per migration type (type 0, type 1, type 2, and so on).
Listing 6.9 shows the code for freeing an aggregate page (block in the buddy
system). In this case, we start from the block that we want to free and keep
proceeding towards the root. Given the page, we find the page frame number
of the buddy. If the buddy is not free then the find buddy page pfn returns
NULL. Then, we exit the for loop and go to label done merging. If this is not
the case, we delete the buddy and coalesce the page with the buddy.
Let us explain this mathematically. Assume that pages with frame numbers
A and B are buddies of each other. Let the order be ϕ. Without loss of
generality, let us assume that A < B. Then we can say that B = A + 2^ϕ, where
ϕ = 0 for the lowest level (the unit here is pages). Now, if we want to combine
A and B and create one single block that is twice the block size of A and B,
then it needs to start at A and its size needs to be 2^(ϕ+1) pages.
Let us now remove the restriction that A < B. Let us just assume that they
are buddies of each other. We then have A = B ⊕ 2^ϕ. Here ⊕ stands for the XOR
operator. Then, if we coalesce them, the aggregate page corresponding to the
parent node needs to have its starting pfn (page frame number) at min(A, B).
This is the same as A & B, where & stands for the bitwise AND operation. This
is because they differ at a single bit: the (ϕ + 1)-th bit (LSB is bit #1). If we
compute a bitwise AND, then this bit gets set to 0, and we get the minimum of
the two pfns. Let us now compute min(A, B) − A. It can either be 0 or −2^ϕ,
where the order is ϕ.
We implement exactly the same logic in Listing 6.9, where A and B are
buddy pfn and pfn, respectively. The combined pfn represents the minimum:
the starting address of the new aggregate page. The expression combined pfn -
pfn is the same as min(A, B) − A. If A < B, it is equal to 0, which means
that the aggregate page (corresponding to the parent) starts at struct page* page.
However, if A > B, then it starts at page minus an offset. The offset should
be equal to A − B struct page entries. In this case A − B is equal to pfn -
combined pfn. The reason that the offset combined pfn - pfn automatically gets
scaled by the size of struct page is that when we do pointer arithmetic in C, any
constant that gets added to or subtracted from a pointer gets multiplied by the
size of the data type that the pointer points to. In this case, the pointer points
to data of type struct page. Hence, the negative offset combined pfn - pfn
effectively gets multiplied by sizeof(struct page). This gives the starting
address of the aggregate page (corresponding to the parent node).
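The arithmetic can be summarized by the following standalone sketch (the helper names are illustrative, and we assume that pfn is aligned to a block of 2^order pages):

/* Illustrative sketch of the buddy arithmetic used while freeing a block. */
static unsigned long find_buddy_pfn_sketch(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);   /* flip the (order+1)-th bit: A = B xor 2^order */
}

static unsigned long combined_pfn_sketch(unsigned long pfn, unsigned long buddy_pfn)
{
    return pfn & buddy_pfn;        /* min(A, B): the start of the parent block */
}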
done_merging:
    /* set the order of the newly coalesced block */
    set_buddy_order(page, order);
    add_to_free_list(page, zone, order, migratetype);
}
Once we combine a page and its buddy, we increment the order and try to
combine the parent with its buddy, and so on. This process continues as long as
merging is possible. When the buddy is not free, we break from the loop and
reach the label done merging.
Here we set the order of the merged (coalesced) page and add it to the free list
at the corresponding order. This completes the process of freeing a node in the
buddy tree.
The buddy system overlays a possibly unbalanced binary tree over a lin-
ear array of pages. Each node of the tree corresponds to a set of contigu-
ous pages (the number is a power of 2). The range of pages represented
by a node is equally split between its children (left-half and right-half).
This process continues recursively. The allocations are always made at
the leaf nodes that are also constrained to have a capacity of N pages,
where N is a power of 2. It is never the case that two children of the
same node are free (unallocated). In this case, we need to delete them
and make the parent a leaf node. Whenever an allocation is made in
a leaf node that exceeds the minimum page size, the allocated memory
always exceeds 50% of the capacity of that node (otherwise we would
have split that node).
Figure 6.15: The slab allocator. A kmem cache node maintains three lists of slabs: full, partial and free. Each slab stores a set of objects carved out of a contiguous memory region.
The slab cache has a per-CPU array of free objects (array cache). These
are recently freed objects, which can be quickly reused. This is a very fast way
of allocating an object without accessing other data structures to find which
object is free. Every object in this array is associated with a slab. Sadly, when
such an object is allocated or freed, the state in its encapsulating slab needs to
also be changed. We will see later that this particular overhead is not there in
the slub allocator.
Now, if there is a high demand for objects, then we may run out of free
objects in the per-CPU array cache. In such a case, we need to find a slab
that has a free object available.
It is very important to appreciate the relationship between a slab and the
slab cache at this point of time. The slab cache is a system-wide pool whose job
is to provide a free object and also take back an object after it has been used
(added back to the pool). A slab on the other hand is just a storage area for
storing a set of k objects: both active as well as inactive.
The slab cache maintains three kinds of slab lists – full, partial and free –
for each NUMA node. The full list contains only slabs that do not have any
free object. The partial list contains a set of partially full slabs and the free
list contains a set of slabs that do not even have a single allocated object. The
algorithm is to first query the list of partially full slabs and find a partially full
slab. Then in that slab, it is possible to find an object that has not been allo-
cated yet. The state of the object can then be initialized using an initialization
function whose pointer must be provided by the user of the slab cache. The
object is now ready for use.
However, if there are no partially full slabs, then one of the empty slabs
needs to be taken and converted to a partially full slab by allocating an object
within it.
We follow the reverse process when returning an object to the slab cache.
We add it to the array cache. We also set the state of the slab that the object
is a part of. This can easily be found out by looking at the address of the object
and then doing a little bit of pointer math to find the nearest slab boundary. If
the slab was full, then now it is partially full. It needs to be removed from the
full list and added to the partially full list. If this was the only allocated object
in a partially full slab, then the slab is empty now.
We assume that a dedicated region in the kernel's memory map is used to
store the slabs. Each slab occupies a contiguous region of memory, which is why
simple pointer arithmetic suffices to find the encapsulating slab of an object.
The memory corresponding to the slabs and the slab cache can be allocated in
bulk using the high-level buddy allocator.
This is a nice, flexible and rather elaborate way of managing physical memory
for storing objects of only a particular type. A criticism of this approach is that
there are too many lists and we frequently need to move slabs from one list to
the other.
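As a usage sketch (the struct foo type and the function names below are purely illustrative), kernel code typically interacts with a slab cache as follows:

#include <linux/slab.h>
#include <linux/list.h>
#include <linux/errno.h>

/* An illustrative object type that we want to allocate frequently. */
struct foo {
    int state;
    struct list_head link;
};

static struct kmem_cache *foo_cache;

static int foo_cache_init(void)
{
    /* Create a dedicated cache of struct foo objects. */
    foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
                                  0, SLAB_HWCACHE_ALIGN, NULL);
    return foo_cache ? 0 : -ENOMEM;
}

static struct foo *foo_alloc(void)
{
    /* Grab a free object from the pool (may refill from a partial slab). */
    return kmem_cache_alloc(foo_cache, GFP_KERNEL);
}

static void foo_free(struct foo *f)
{
    /* Return the object to the pool. */
    kmem_cache_free(foo_cache, f);
}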
Figure 6.16: The slub allocator. The kmem cache stores the object size, a constructor function (ctor), a per-CPU kmem cache cpu structure (*cpu slab, with a pointer to the current slab and a freelist of free objects), and per-NUMA-node kmem cache node structures. Empty slabs are returned to the memory system, full slabs are not tracked, and only partial slabs are kept on the node's lists.
We reuse the same slab structure that was used in the slab allocator. We
specifically make use of the inuse field to find the number of objects that are
currently being used and the freelist. Note that we have compressed the slab
part in Figure 6.16 and just summarized it. This is because it has been shown
in its full glory in Figure 6.15.
Here also every slab has a pointer to the slab cache (kmem cache). However,
the slab cache is architected differently. Every CPU in this case is given a private
slab that is stored in its per-CPU region. We do not have a separate set of free
objects for quick allocation. It is necessary to prioritize regularity for achieving
Chapter 8
Appendix A
In this book, we have concerned ourselves only with the Linux kernel and that
too in the context of the x86-64 (64-bit) ISA. This section will thus provide a
brief introduction to this ISA. It is not meant to be a definitive reference. For a
deeper explanation, please refer to the book on basic computer architecture by
your author [Sarangi, 2021].
The x86-64 architecture is a logical successor of the 32-bit x86 architec-
ture, which in turn succeeded the 16-bit and 8-bit versions. It is the default
architecture of all Intel and AMD processors as of 2023. This CISC ISA became
more complicated with the passage of time. From its early 8-bit origins, the
development of these processors passed through several milestones. The 16-bit
version arrived in 1978, and the 32-bit version arrived with the Intel 80386 that
was released in 1985. Intel and AMD introduced the x86-64 ISA starting from
2003. The ISA has become increasingly complex over the years and hundreds of
new instructions have been added since then, particularly vector extensions (a
single instruction can work on a full vector of data).
A.1 Registers
Figure A.1: The 16-bit registers ax, bx, cx and dx. Each register is split into a high byte (ah, bh, ch, dh) and a low byte (al, bl, cl, dl).
These registers have an interesting history. In the 8-bit version, they were named simply a, b, c and d. In the
16-bit avatar of the ISA, these registers were simply extended to 16 bits. Their
names changed though, for instance a became ax, b became bx, and so on. As
shown in Figure A.1, the original 8-bit registers continued to be accessible. Each
16-bit register was split into a high part and a low part. The lower 8 bits are
addressable using the specifier al (low), and the upper 8 bits (bits 9-16) are
addressable using the specifier ah (high).
A few more registers are present in the 16-bit ISA. There is a stack pointer
sp (top of the stack), a frame pointer bp (beginning of the activation block for
the current function), and two index registers for performing computations in a
loop via a single instruction (si and di). In the 32-bit variant, a prefix ‘e’ was
added: ax became eax, and so on. Furthermore, in the 64-bit variant the
prefix ‘e’ was replaced with the prefix ‘r’. Along with this, 8 new registers were
added: r8 to r15. This is shown in Figure A.2. Note that even in the 64-bit
variant of the ISA known as x86-64 the 8, 16 and 32-bit registers are accessible.
It is just that these registers exist virtually (as a part of larger registers).
64 bits: rax, rbx, rcx, rdx, rsp, rbp, rsi, rdi, r8 to r15
32 bits: eax, ebx, ecx, edx, esp, ebp, esi, edi
16 bits: ax, bx, cx, dx, sp, bp, si, di
Figure A.2: The registers in the x86-64 ISA
Note that unlike newer RISC ISAs, the program counter is not directly acces-
sible. It is known as the instruction pointer in the x86 ISA, which is not visible
to the programmer. Along with the program counter, there is also a flags
register that becomes rflags in x86-64. It stores all the ALU flags. For exam-
ple, it stores the result of compare instructions. Subsequent branch instructions
use the result of the last compare instruction for deciding the outcome of a
conditional branch instruction. Refer to Figure A.3.
Figure A.3: The instruction pointer: rip (64 bits), eip (32 bits) and ip (16 bits)
There are a few flag fields in the rflags register that are commonly used.
Each flag corresponds to a bit position: if the bit is set to 1, the flag is set,
otherwise it is unset (false). OF is the integer overflow flag, CF is the carry
flag (generated in an addition), ZF is the zero flag, which is set when the last
operation produced a zero result (for example, when a comparison found its
operands to be equal), and SF is the sign flag, which is set when the last flag-
setting operation produced a negative result. Note that a comparison operation
is basically implemented as a subtraction operation. If the two operands are
equal, then the result is zero (the zero flag is set); otherwise, if the first operand
is less than the second operand, then the result is negative and the sign flag is
set to 1.
Figure: The x87 floating-point register stack (registers st0 to st7)
The basic mov operation moves the first operand to the second operand.
The first operand is the source and the second operand is the destination in this
format. Each instruction admits a suffix, which specifies the number of bits that
we want it to operate on. The ‘q’ modifier means that we wish to operate on 64
bits, whereas the ‘l’ modifier indicates that we wish to operate on 32-bit values.
In the instruction movq $3, %rax, we move the number 3 (prefixed with a ‘$’)
to the register rax. Note that all registers are prefixed with a percentage (‘%’)
symbol. Similarly, the next instruction movq $4, %rbx moves the number 4 to
the register rbx. The third instruction addq %rbx, %rax adds the contents of
register rbx to the contents of register rax, and stores the result in rax. Note
that in this case, the second operand %rax is both a source and a destination.
The final instruction stores the contents of rax (that was just computed) to
memory. In this case, the memory address is computed by adding the base
address that is stored in the stack pointer (%rsp) to the offset 8. The movq
instruction moves data between registers as well as between a register and a
memory location. It thus works as both a load and a store. Note that we
cannot transfer data from one memory location to another memory location. It
is basically not possible to have two memory operands in an instruction.
Let us look at the code for computing the factorial in Listing A.1. In this
case, we use the 32-bit version of the ISA. Note that it is perfectly legal to do
so in a 64-bit processor for power and performance reasons. In the code shown
in Listing A.1, eax stores the number that we are currently multiplying and
edx stores the product. The imull instruction multiplies the partial product
The frontend basically reads a sequence of bytes (the C file) and converts
it into a sequence of tokens. This process is known as lexical analysis. The
sequence of tokens is then used to create a parse tree. Often, programs like
yacc and bison are used to specify the grammar associated with a programming
language and automatically create a parse tree for a source code file (like a C
file). The parse tree contains the details of all the code that is there in the
C file. This includes a list of all the global and statically defined variables,
their data types, all the functions, their arguments, return values and all the
code statements within the functions. In a certain sense, the parse tree is a
representation that passes certain syntactic and semantic checks, completely
represents the contents of the C file, and is very easy to handle. The parse
tree incorporates many syntactic details, which are not really required to create
machine code. Hence, the parse tree is used to construct a simpler representation
that is far easier to process and is devoid of unnecessary details – this simpler
tree is known as the Abstract Syntax Tree or AST. The AST is the primary data
structure that is processed by the backend of the compiler.
the topmost priority. Almost all modern compilers are designed to handle such
concerns and generate code accordingly.
Figure: gcc -c x.c compiles the C file x.c into the object file x.o
All that a C file needs to do is simply include the header file. Here the term include
means that a pre-compilation pass needs to copy the contents of the header file
into the C file that is including it. This is a very easy and convenient mechanism
for providing a bunch of signatures to a C file. For instance, there could be a set
of C files that provide cryptographic services. All of them could share a common
header file via which they export the signatures of the functions that they define
to other modules in a large software project. Other C files could include this
header file and call the relevant functions defined in it to obtain cryptographic
services. The header file thus facilitates a logical grouping of variable, function
and structure/class declarations. It is much easier for programmers to include
a single header file that provides a cohesive set of declarations as opposed to
manually adding declarations at the beginning of every C file.
Header files have other interesting uses as well. Sometimes, it is easier to
simply go through a header file to figure out the set of functions that a set of C
files provides to the rest of the world.
Barring a few exceptions, header files never contain function definitions or
any other form of executable code. Their role is not to have regular C state-
ments. This is the role of regular source code files and header files should be
reserved only for signatures that aid in the process of compilation. For the
curious reader, it is important to mention that the only exception to this rule
is C++ templates. A template is basically a class definition that takes another
class or structure as an argument and generates code based on the type of the
class that is passed to it at compile time.
Now, let us look at a set of examples to understand how header files are
meant to be used.
Listing B.1: factorial.h
# ifndef FACTORIAL_H
# define FACTORIAL_H
extern int factorial ( int ) ;
# endif
Listing B.1 shows the code for the header file factorial.h. First, we check if
a preprocessor variable FACTORIAL H is already defined. If it is already defined,
it means that the header file has already been included. This can happen for
a variety of reasons. It is possible that some other header file has included
factorial.h, and that header file has been included in a C file. Given that the
contents of factorial.h are already present in the C file, there is no need to include
it again explicitly. This is ensured using preprocessor variables. If FACTORIAL H
has not been defined, then we define it and declare the function's signature: int
factorial(int);. This basically says that the function takes a single integer as
input and returns an integer.
Listing B.2: factorial.c
# include " factorial . h "
Listing B.2 shows the code of the factorial.c file. Note the way in which
we are including the factorial.h file. It is being included by specifying its name
in between double quotes. This basically means that the header file should
be there in the same directory as the C file (factorial.c). We can also use the
traditional way of including a header file between the ’<’ and ’>’ characters.
In this case, the directory containing the header file should be there in the
include path. The include path is a set of directories in which the C compiler
searches for header files. The directories are searched in the order in which
they appear in the include path. There is always an option
of adding an additional directory to the include path by using the ‘-I’ flag in
gcc. Any directory that succeeds the ‘-I’ flag is made a part of the include path
and the compiler searches that directory as well for the presence of the header
file. Now, when the compiler compiles factorial.c, it can create factorial.o (the
corresponding object file). This object file can now be used by other C files
whenever they want to use the factorial function. They know that it is defined
there. The signature is always available in factorial.h.
Let us now try to write the file that will use the factorial function. Let us
name it prog.c. Its code is shown in Listing B.3.
Listing B.3: prog.c
#include <stdio.h>
#include "factorial.h"
int main() {
    printf("%d\n", factorial(3));
}
All that the programmer needs to do is include the factorial.h header file and
simply call the factorial function. The compiler knows how to generate the code
for prog.c and create the corresponding object file prog.o. Given that we have
two object files now – prog.o and factorial.o – we need to link them together and
create a single binary that can be executed. This is the job of the linker that we
shall see next. Before we look at the linker in detail, an important point that
needs to be understood here is that we are separating the signature from the
implementation. The signature was specified in factorial.h that allowed prog.c to
be compiled without knowing how exactly the factorial function is implemented.
The signature was enough information for the compiler to compile prog.c.
The other interesting part of having a header file is that the signature and the
implementation are decoupled to a large extent. The programmer can happily
change the implementation as long as the signature stays the same.
The rest of the world will not be affected and they can continue to use the same
function as if nothing has changed. This allows multiple teams of programmers
to work independently as long as they agree on the signatures of functions that
their respective modules export.
The standard library is a set of object files that defines functions that many
programs typically use such as printf and scanf. The final executable needs
to link these library object files as well. In this case, we think of the standard
library as a library of functions that make it easy for accessing system services
such as reading and writing to files or the terminal.
There are two ways of linking: static and dynamic. Static linking is a simple
approach where we just combine all the .o files and create a single executable.
This is an inefficient method as we shall quickly see. This is why dynamic
linking is used where all the .o files are not necessarily combined into a single
executable.
The precise role of the linker is shown in Figure B.3. Each object file contains
two tables: the symbol table and the relocation table. The symbol table contains
a list of all the symbols – variables and functions – defined in the .o file. Each
entry contains the name of the symbol, sometimes its type and scope, and its
address. The relocation table contains a list of symbols whose address has not
been determined as yet. Let us now explain the linking process that uses these
tables extensively.
Each object file contains some text (program code), read-only constants and
global variables that may or may not be initialized. Along with that it references
variables and functions that are defined in other object files. All the symbols
that an object file exports to the world are defined in the symbol table, and all
the symbols that an object file needs from other object files are listed in the
relocation table. The linker thus operates in two passes.
Figure B.3: Compiling the code in the factorial program and linking the components. Both factorial.c and prog.c include factorial.h, which contains the declaration of the factorial function.
Pass 1: It scans through all the object files and concatenates all the text sec-
tions (instructions), global variable, function definition and constant def-
inition sections. It also makes a list of all the symbols that have been
defined in the object files. This allows the linker to compute the final
sizes of all the sections: text, data (initialized global/static variables), bss
(uninitialized global/static variables) and constants. All the program code
and symbols can be concatenated and the final addresses of all variables
and functions are computed. The concatenated code is however incom-
plete. The addresses of all the relocated variables and functions (defined
in other object files) are set to zero.
Pass 2: In this stage, the addresses of all the relocated variables and functions
are set to their real values. We know the address of each variable at the
end of Pass 1. In the second pass, the linker replaces the zero-valued
addresses of relocated variables and functions with the actual addresses
computed as a part of the first pass.
thus not the first function to be invoked. Even when the main function returns,
the process does not immediately terminate. Again, a sequence of functions is
invoked that releases all the resources that the process owned, such as open files
and network connections.
Surprisingly, all this overhead is not much. The dominant source of over-
heads here is the code of all the C libraries that is added to the executable.
This means that if we invoke the printf function, then the code of printf as
well as the set of all the library functions that printf invokes (and in turn they
invoke) are added to the executable. These overheads can be quite prohibitive.
Assume that a program has one hundred unique library calls, but in any practi-
cal execution only 25 unique library calls are made. The size overhead is 100/25
(4×). Sadly, at compile time, we don’t know which library calls will be made
and which ones should not be made. Hence, we conservatively assume that
every single library call that is made in any object file will actually be made
and it is not dead code. Consequently, the code for all those library functions
(and their backward slice) needs to be included in the executable. Here, the
backward slice of a library function such as printf comprises the set S of li-
brary functions called by printf, as well as all the library functions invoked by
functions in S, so on and so forth. Because of this, we need to include a lot of
code in executables and hence they become very large. This can be visualized
in Figure B.4.
Figure B.4: Statically linking a small program (test.c, which simply calls printf). Compiling it with gcc -static produces an a.out that ldd reports as "not a dynamic executable"; du -h shows that the binary is about 892K because the code of the entire library is included in a.out.
Figure: The stub mechanism in dynamic linking. The first time the stub function is invoked, it locates the printf function in a library and maps its code into the address space of the process; subsequent calls directly use the function's address.
In this case, where printf is dynamically linked, the address of the printf
symbol is not resolved at link time. Instead, the address of printf is set to a
dummy function known as a stub function. The first time that the stub function
is called, it locates the path of the library that contains the printf function,
then it copies the code of the printf function to a memory address that is
within the memory map of the process. Finally, it stores the address of the
first byte of the printf function in a dedicated table known as the jump table.
The next time the stub function is called, it directly accesses the address of the
printf function in the jump table. This basically means that the first access
to the printf function is slow. Henceforth, it is very fast.
The advantages of this scheme are obvious. We only load library functions
on-demand. This minimizes the size of the executable. Furthermore, we can
have one copy of the shared library code in physical memory and simply map
regions of the virtual address space of each process to the physical addresses
corresponding to the library code. This also minimizes the memory footprint
and allows as much runtime code reuse as possible. Of course, there is a very
minor performance penalty. Whenever a library function is accessed for the first
time, we need to first search for the library and then find the address of the
function within it. Searching for a library proceeds in the same manner as
searching for header files.
During the process of compilation, a small note is made about which function
table (list of variables and functions defined in the file) and the relocation table
(variables and functions that are defined elsewhere). There is some information
with regards to dynamic linking in the dynamic section.
B.3 Loader
The loader is the component of the operating system whose job is to execute a
program. When we execute a program in a terminal window, a new process is
spawned that runs the code of the loader. The loader reads the executable file
from the file system and lays it out in main memory. It needs to parse the ELF
executable to realize this.
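From user space, this machinery is typically triggered via the fork and exec family of system calls. The following standalone sketch (running /bin/ls purely as an example) shows the sequence; the kernel-side loader then maps the ELF sections of the new program and starts executing it:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();                      /* create a new process */
    if (pid == 0) {
        /* Child: replace its image with /bin/ls. The loader parses the
         * ELF file, lays out its sections and jumps to the entry point. */
        char *argv[] = { "ls", "-l", NULL };
        execv("/bin/ls", argv);
        perror("execv");                     /* reached only on failure */
        exit(1);
    }
    waitpid(pid, NULL, 0);                   /* parent waits for the child */
    return 0;
}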
It creates space for all the sections, loads the constants into memory and
allocates regions for the stack, heap and data/bss sections (static and global
variables). It also copies all the instructions into memory. If they are already
present in the memory system, then instead of creating a new copy, we can
simply map the instructions to the virtual memory of the new process. If there
is a need for dynamic linking, then all the information regarding dynamically
linked symbols is stored in the relocation table and dynamic section in the
process’s memory image. It also initializes the jump tables.
Next, it initializes the execution environment such as setting the state of all
the environment variables, copying the command line arguments to variables
accessible to the process and setting up exception handlers. Sometimes for
security reasons, we wish to randomize the starting addresses of the stack and
heap such that it is hard for an attacker to guess runtime addresses. This can
be done by the loader. It can generate random values within a pre-specified
range and initialize base addresses in the program such as the starting value of
the stack pointer and the heap memory region.
The very last step is to issue a system call to erase the memory state of
the loader and start the process from the first address in its text section. The
process is now alive and the program is considered to be loaded.
Appendix C
Data Structures
the next and previous entries. This is enough information to operate on the
linked list. For example, we can add new entries as well as remove entries. The
crucial question that we need to answer here is, “Where is the encapsulating
object that needs to be linked together?” In general, we define a linked list in
the context of an object. We interpret the linked list to be a list of objects.
Here, we do not see an object with its fields; instead, we just see a generic linked
list node with pointers to the next and previous nodes. It is true that it satisfies
our demand for generality; however, it does not align with our intuitive notion
of a linked list as we have studied in a data structures course.
Listing C.1: The definition of a linked list
source : include/linux/types.h
struct list_head {
    struct list_head *next, *prev;
};
We will use two macros to answer this question as shown in Listing C.2.
Listing C.2: The list entry and container of macros
source : include/linux/list.h and
source : include/linux/container of.h (resp.)
#define list_entry(ptr, type, member) container_of(ptr, type, member)
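The container of macro that list entry expands to is defined along the following lines (a simplified sketch based on include/linux/container of.h, with the type-checking assertions omitted):

#include <stddef.h>   /* for offsetof */

/* Simplified sketch: given a pointer to a member embedded inside a structure,
 * return a pointer to the enclosing structure. */
#define container_of(ptr, type, member) ({                         \
    void *__mptr = (void *)(ptr);                                   \
    ((type *)(__mptr - offsetof(type, member))); })

Note that pointer arithmetic on a void pointer is a GCC extension that the kernel relies on.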
Let us start by describing the container of macro. It takes three inputs:
a pointer, a type and a member name. The first statement simply typecasts
the pointer to void*. This is needed because we want to create a generic
implementation, which is not dependent on any particular type of object. The
offsetof macro provides the offset of the starting address of the member from
the beginning of the structure. Consider the structures in Listing C.3. In the
case of struct abc, the value of offsetof(struct abc, list) is 4. This is
because we are assuming that the size of an integer is four bytes. This integer is
stored in the first four bytes of struct abc. Hence, the offset of the list member
is 4 here. On the same lines, we can argue that the offset of the member list
in struct def is 8. This is because the size of an integer as well as a float is
4 bytes each. Hence, __mptr − offsetof(type,member) provides the starting
address of the structure that can be thought of as the linked list node. To
summarize, the container of macro returns the starting address of the linked
list node, or in other words the encapsulating object, given the offset of the list
member in the object.
struct def {
    int x;              /* field names here are illustrative */
    float y;
    struct list_head list;
};
Listing C.4: Example of code that uses the list entry macro
struct abc *current = ...;
struct abc *next = list_entry(current->list.next, struct abc, list);
Listing C.4 shows a code snippet that uses the list entry macro where
struct abc is the linked list node. The current node that we are considering is
called current. To find the next node (the next one after current), which will
again be of type struct abc, all that we need to do is invoke the list entry
macro. In this case, the pointer (ptr) is current->list.next. This is a pointer
to the struct list head object in the next node. From this pointer, we need
to find the starting address of the encapsulating abc structure. The type is thus
struct abc and the member is list. The list entry macro will internally call
offsetof, which will return an integer. This integer will be subtracted from
the starting address of the struct list head member in the next node. This
will provide the pointer to the encapsulating object.
This is a very fast and generic mechanism to traverse linked lists in Linux
that is independent of the type of the encapsulating object. Note that we
can stretch this discussion to create a linked list that has different kinds of
encapsulating objects. Note that, in theory, this is possible as long as we know
the type of the encapsulating object for each struct list head on the list.
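In practice, kernel code rarely walks such lists by hand; it uses helper macros from include/linux/list.h. A short hedged sketch follows (head of list is an assumed struct list head that anchors the list):

struct abc *pos;

/* Iterate over every encapsulating struct abc on the list anchored at
 * head_of_list; 'list' is the name of the embedded struct list_head member. */
list_for_each_entry(pos, &head_of_list, list) {
    /* use pos->... here */
}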
Listing C.5: The hlist head and hlist node structures
source : include/linux/types.h
struct hlist_head {
    struct hlist_node *first;
};

struct hlist_node {
    struct hlist_node *next, **pprev;
};
Let us now describe singly-linked lists that have a fair amount of value in
kernel code. Here the explicit aim is a one-way traversal of the linked list. An
example would be a hashtable where we resolve collisions by chaining entries
that hash to the same entry. Linux uses the struct hlist head structure that
is shown in Listing C.5. It points to a node that is represented using struct
hlist node.
The hlist_node is a very simple data structure: it has a next pointer to another hlist_node. However, this information alone is not enough if we wish to delete an hlist_node from the linked list; we also need a way to reach whatever points to the current node. This is where a small optimization is possible and a few instructions can be saved. Instead of a pointer to the previous node, the field pprev stores the address of the next field inside the previous node (or of the head's first field, if the current node is the first one). The advantage is that while deleting the current node we can directly overwrite that field. Backward traversal is not possible, but that is deliberate: the structure is meant to be traversed in only one direction.
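The deletion logic is easiest to see in code. The following is a sketch along the lines of the kernel's __hlist_del helper in include/linux/list.h: pprev lets us unlink a node without ever knowing the address of the previous hlist_node (or of the hlist_head, if the node happens to be the first one).

static inline void __hlist_del(struct hlist_node *n)
{
        struct hlist_node *next = n->next;
        struct hlist_node **pprev = n->pprev;

        /* Make whatever pointed to n (a next field or the head's
         * first field) point past it. */
        *pprev = next;
        if (next)
                next->pprev = pprev;
}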
Such data structures, which are primarily designed to be singly-linked lists, are very space and time efficient. Their encapsulating objects are accessed in exactly the same way as with the doubly-linked struct list_head.
The maximum depth of any leaf is at most twice the minimum depth of
a leaf.
This is quite easy to prove. As we have mentioned, the black depth of all the leaves is the same. Furthermore, a red node can never have a red child. Consider any path from the root to a leaf, and assume that it contains r red nodes and b black nodes. We know that b is the same for all root-to-leaf paths. Moreover, every red node on the path is followed by a black node (note that all the leaves, i.e., the sentinel nodes, are black). Hence, r ≤ b. The total depth of the leaf is therefore r + b ≤ 2b. Since every path contains exactly b black nodes, the minimum depth of a leaf is at least b; the maximum depth is thus at most twice the minimum depth.
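In symbols, for any root-to-leaf path with r red nodes and b black nodes:

depth = r + b ≤ b + b = 2b ≤ 2 × (minimum leaf depth).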
This vital property ensures that all search operations complete in O(log(n)) time. Note that a search in an RB tree proceeds in exactly the same manner as in a regular binary search tree. Insert and delete operations also complete in O(log(n)) time. They are, however, not as simple, because we need to ensure that the black depth of all the leaves stays the same and that a red parent never has a red child. This requires a sequence of recolorings and rotations; nevertheless, we can prove that at the end all the properties hold and the overall height of the tree remains O(log(n)).
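For completeness, the Linux kernel exposes red-black trees through include/linux/rbtree.h; the user embeds a struct rb_node in her own structure (exactly like struct list_head) and writes the descent herself. A minimal sketch of an insert, assuming a hypothetical struct my_node keyed by an integer, could look as follows.

#include <linux/rbtree.h>

struct my_node {                  /* hypothetical encapsulating object */
        int key;
        struct rb_node rb;        /* embedded rbtree linkage */
};

/* Insert 'node' into the tree rooted at 'root', ordered by key. */
static void my_insert(struct rb_root *root, struct my_node *node)
{
        struct rb_node **link = &root->rb_node, *parent = NULL;

        /* Standard BST descent to find the insertion point. */
        while (*link) {
                struct my_node *cur = rb_entry(*link, struct my_node, rb);

                parent = *link;
                if (node->key < cur->key)
                        link = &(*link)->rb_left;
                else
                        link = &(*link)->rb_right;
        }

        /* Link the node and let the kernel rebalance (recolor/rotate). */
        rb_link_node(&node->rb, parent, link);
        rb_insert_color(&node->rb, root);
}

Lookups are written the same way: a plain BST descent that uses rb_entry (a container_of wrapper) to get back to the encapsulating object.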
C.3 B-Tree
A B-tree is a self-balancing generalization of a binary search tree: unlike a red-black tree, a node can have more than two children. The tree is balanced and all of its operations can be realized in logarithmic time. B-trees are typically used in systems that store a lot of data, where quickly accessing a given datum or a contiguous subset of the data is essential. Hence, databases and file systems tend to use B-trees quite extensively.
Let us start with the main properties of a B-tree of order m.
1. Every node has at most m children and can therefore store at most m − 1 keys.
2. All the leaves appear at the same level; this is what keeps the tree balanced.
3. It is important that the tree does not become too sparse. Hence, every internal node (other than the root) needs to have at least ⌈m/2⌉ children (alternatively, at least ⌈m/2⌉ − 1 keys).
4. If a node has k children, then it stores k−1 keys. These k−1 keys partition
the space of keys into k non-overlapping regions. Each child then stores
keys in the key space assigned to it. In this sense, an internal node’s key
acts as a key space separator.
Figure: An example B-tree of order 3. The root stores the keys 6 and 12; its three children store the keys (2, 4), (8, 10) and (16, 25); the leaves store the keys 1, 3, 5, 7, 9, 11, 14, 17 and 26.
Now, if we consider the root node, we find that it stores two keys: 6 and 12.
The leftmost child is an internal node, which stores two keys: 2 and 4. Its three children are leaf nodes that store a single key each. Note that as per our definition
(order=3), this is allowed. Now, let us consider the second child of the root
node. It needs to store keys that are strictly greater than 6 and strictly less
than 12. We again see a similar structure with an internal node that stores two
keys – 8 and 10 – and has three leaf nodes. Finally, the last child of the root
only stores keys that are greater than 12.
It is quite obvious that traversing a B-tree is similar to traversing a regular BST (binary search tree); it takes O(log(n)) time. For added precision, we can account for the time it takes to find the pointer to the right subtree within an internal node. Given that an internal node has at most m subtrees, this takes O(log(m)) time (using binary search over its sorted keys).
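Putting these two costs together shows why the overall search time is still logarithmic in the number of keys n:

O(log_m(n)) levels × O(log(m)) per level = O((log(n)/log(m)) · log(m)) = O(log(n)).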
much the reverse of what we did while adding a new key. Here again, a situation may arise where this cannot be done, and we will then be forced to reduce the height of the tree.
C.3.3 B+ Tree
The B+ tree is a variant of the classical B-tree. In a B-tree, internal nodes can store both keys and values, whereas in a B+ tree, internal nodes store only keys. All the values (or pointers to them) are stored in the leaf nodes. Furthermore, all the leaf nodes are connected to each other using a linked list, which allows for very efficient range queries. It also makes it possible to scan the linked list sequentially and quickly locate data whose keys are close to each other.
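To make the distinction concrete, here is a small illustrative sketch of B+ tree nodes of order M, where only the leaves carry values and are chained into a list. This is not a kernel structure; all names and the order M are made up for the example.

#define M 4   /* order of the tree (maximum number of children) */

struct bplus_node {
        int is_leaf;                  /* 1 for a leaf, 0 for an internal node */
        int nkeys;                    /* number of keys currently stored */
        long keys[M - 1];             /* sorted keys (separators in internal nodes) */
        union {
                struct bplus_node *children[M];       /* internal node: child pointers */
                struct {
                        void *values[M - 1];          /* leaf: one value per key */
                        struct bplus_node *next_leaf; /* leaf-level linked list */
                } leaf;
        } u;
};

A range query then descends once to the first qualifying leaf and simply follows the next_leaf pointers from there.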
Figure C.2: A Radix tree storing the strings "travel", "tram", "tractor", "tread", "truck", "trust", "trim", "trick", "try" and "tryst". The shared prefix "tr" is stored at the root, and the edges below it are labeled with the remaining fragments such as "a", "u", "i", "y", "ead", "vel", "ctor", "m", "ck" and "st".
A Radix tree stores a set of keys very efficiently. Each key is represented as a string (see Figure C.2). The task is to store all the keys in a single data structure such that it is possible to query the data structure and find out whether it contains a given string (key) or not. Here, values can be stored at both the leaf nodes and the internal nodes.
The algorithm works on the basis of common prefixes. Consider two keys
“travel” and “truck”. In this case, we store the common prefix “tr” at the
root node and add two children to the root node: ‘a’ and ‘u’, respectively. We
proceed in a similar fashion and continue to create common prefix nodes across
keys. Consider two more keys “tram” and “tractor”. In this case, after we
traverse the path with the prefix “tra”, we create two leaf nodes “ctor” and
“m”. If we were now to add a new key “trams”, we would need to create a new child “s” whose parent is the erstwhile leaf node labeled “m”. In this case, both “tram” and “trams” would be valid keys. Hence, every internal node needs to be annotated with an extra bit that indicates whether the path from the root to that node corresponds to a valid key. A value can then be associated with any node whose bit is set, regardless of whether it is a leaf or an internal node.
The advantage of such a structure is that we can store a lot of keys very
efficiently and the time it takes to traverse it is proportional to the number
of letters within the key. Of course, this structure works well only when the keys share reasonably long prefixes; otherwise, little sharing happens and the tree degenerates into a collection of mostly separate paths. Hence, whenever there is a fair
amount of overlap in the prefixes, a Radix tree should be used. It is important
to understand that the lookup time complexity is independent of the number of
keys – it is theoretically only dependent on the number of letters (digits) within
a key.
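The lookup itself is a short walk down the tree. The following is a toy sketch of a string-keyed Radix tree node and its lookup routine; all field names are hypothetical, and keys are assumed to consist of lowercase letters only.

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Each node stores the (possibly multi-letter) label on its incoming edge,
 * a flag marking whether the root-to-here path is a stored key, and up to
 * 26 children indexed by the next lowercase letter. The root's edge label
 * is the empty string (or a shared prefix such as "tr"). */
struct radix_node {
        const char *edge;
        bool is_key;
        void *value;
        struct radix_node *children[26];
};

/* Returns the value stored for 'key', or NULL if it is absent. The cost is
 * proportional to the length of the key, not to the number of stored keys. */
static void *radix_lookup(struct radix_node *node, const char *key)
{
        while (node) {
                size_t len = strlen(node->edge);

                if (strncmp(key, node->edge, len) != 0)
                        return NULL;              /* the edge label does not match */
                key += len;                       /* consume the matched prefix */
                if (*key == '\0')
                        return node->is_key ? node->value : NULL;
                node = node->children[*key - 'a']; /* branch on the next letter */
        }
        return NULL;
}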
Insertion and deletion are easy. To insert a key, we first perform a lookup and find the point at which the non-matching part of the new key needs to be added. We then add a new node that branches out of an
existing node. Deletion follows the reverse process. We locate the key first,
delete the node that stores the suffix of the string that is unique to the key and
then possibly merge nodes.
There is a popular data structure known as a trie, which is a prefix tree
like a Radix tree with one important difference: in a trie, we proceed letter
by letter. This means that each edge corresponds to a single letter. Consider
a system with two keys “tractor” and “tram”. In this case, we will have the
root node, an edge corresponding to ‘t’, then an edge corresponding to ‘r’, an
edge corresponding to ‘a’, and so on. There is no point in having a long chain of nodes with a single child each. We can compress such chains to create a more efficient data structure, which is precisely a Radix tree. In a Radix tree, edges can be labeled with multiple letters; in this case, we can have a single edge labeled “tra” (all the single-child nodes are fused).
Such a trie will always have three nodes – a root and two children. The root node stores the shared prefix, and the two children contain the non-shared suffixes of the binary keys. Incidentally, PATRICIA stands for Practical Algorithm To Retrieve Information Coded In Alphanumeric.
level of the tree. This process continues towards the root in a similar fashion: we keep grouping adjacent internal nodes and creating a parent for them until we reach the root. We thus end up with a balanced binary tree if n is a power of 2. The key idea behind the vEB tree lies in the contents of the internal nodes. To explain this, let us start with the root.
If the root node stores a 1, it means that at least a single location in the bit
vector stores a 1. This is a very convenient trick because we instantly know if
the bit vector contains all 0s or has at least one 1. Each of the root's children is itself the root of a subtree that covers a contiguous region of the bit vector, and it stores exactly the same kind of information. If the root of a subtree stores a 0, then all the bit vector locations covered by that subtree store a 0; if it stores a 1, then at least one of those locations stores a 1.
Now let us consider the problem of locating the first 1 starting from the lowest address (from the left in the figure). We first check the root. If it contains a 1, then there is at least one 1 in the underlying bit vector. We then look at the left child. If it contains a 1, then the first half of the bit vector contains a 1; otherwise, we look at the right child. This process continues recursively until we reach a leaf node. If the root contains a 0, we can immediately conclude that no location in the bit vector stores a 1; otherwise, this descent is guaranteed to end at the leftmost 1.
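A sketch of this descent is shown below. It assumes, purely for illustration, that the tree is stored as an implicit binary heap in an array: tree[1] is the root, node i has children 2i and 2i + 1, and the last n entries are the leaves that mirror the bit vector (n is a power of two).

#include <stdint.h>

/* Returns the index (0..n-1) of the first set bit, or -1 if there is none.
 * The array 'tree' has 2n entries; index 0 is unused. */
static int first_set_bit(const uint8_t *tree, int n)
{
        int i = 1;

        if (!tree[1])
                return -1;            /* the whole bit vector is all 0s */
        while (i < n) {               /* descend until we reach a leaf  */
                i = 2 * i;            /* try the left child first       */
                if (!tree[i])
                        i++;          /* otherwise the 1 is on the right */
        }
        return i - n;                 /* convert leaf position to bit index */
}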
This is a very fast process that runs in logarithmic time. Whenever we change a value from 0 → 1 in the bit vector, we need to walk up the tree and set all the 0s on the path to 1. However, when we change a value from 1 → 0, it is slightly trickier: we still traverse the tree towards the root, but we cannot blindly convert 1s to 0s. Whenever we reach a node on the path from the leaf to the root, we need to look at the contents of its sibling (the other child of the parent) and decide accordingly. If the sibling contains a 1, the process terminates right there, because the parent is then the root of a subtree that still contains a 1 (via the sibling). If the sibling contains a 0, the parent's value needs to be set to 0 as well, and the process continues towards the root.
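Continuing with the same illustrative array layout as before, clearing a bit and propagating the change upwards could look as follows; note how the walk stops as soon as a sibling still contains a 1.

/* Clear bit 'pos' (1 -> 0) in the bit vector and update the tree. */
static void clear_bit_in_tree(uint8_t *tree, int n, int pos)
{
        int i = n + pos;              /* leaf corresponding to the bit */

        tree[i] = 0;
        while (i > 1) {
                int sibling = i ^ 1;  /* the other child of the parent */

                if (tree[sibling])
                        break;        /* the parent still covers a 1; stop */
                i = i / 2;            /* move to the parent */
                tree[i] = 0;
        }
}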