Think OS
A Brief Introduction to Operating Systems
Version 0.5.5
Allen B. Downey
The original form of this book is LATEX source code. Compiling this code has the
effect of generating a device-independent representation of a textbook, which can
be converted to other formats and printed.
The cover for this book is based on a photo by Paul Friel (http://flickr.com/
people/frielp/), who made it available under the Creative Commons Attribution
license. The original photo is at http://flickr.com/photos/frielp/11999738/.
Preface
This book is intended for a different audience than a typical Operating Systems
course, and it has different goals. I developed it for a class at Olin College
called Software Systems.
This book does not assume that you have studied Computer Architecture.
As we go along, I will explain what we need.
Chapter 2 explains how the operating system uses processes to protect run-
ning programs from interfering with each other.
Chapter 5 describes how numbers, letters, and other values are encoded,
and presents the bitwise operators.
• If you don’t want to use Git at all, you can download the files in a Zip
file using the button in the lower-right corner of the GitHub page.
Contributor List
If you have a suggestion or correction, please send email to
[email protected]. If I make a change based on your feed-
back, I will add you to the contributor list (unless you ask to be omitted).
If you include at least part of the sentence the error appears in, that makes it
easy for me to search. Page and section numbers are fine, too, but not quite
as easy to work with. Thanks!
Other people who found typos and errors include Jim Tyson, Donald Robertson,
Jeremy Vermast, Yuzhong Huang, and Ian Hill.
Contents

Preface

1 Compilation
    1.6 Preprocessing

2 Processes
    2.2 Isolation

3 Virtual memory

6 Memory management
    6.3 Implementation

7 Caching
    7.3 Locality

8 Multitasking
    8.1 Hardware state
    8.2 Context switching
    8.3 The process life cycle
    8.4 Scheduling
    8.5 Real-time scheduling

9 Threads
    9.1 Creating threads
    9.2 Creating threads
    9.3 Joining threads
    9.4 Synchronization errors
    9.5 Mutex

10 Condition variables
    10.1 The work queue
    10.2 Producers and consumers
    10.3 Mutual exclusion
    10.4 Condition variables
    10.5 Condition variable implementation

11 Semaphores in C
    11.1 POSIX Semaphores
    11.2 Producers and consumers with semaphores
    11.3 Make your own semaphores
Chapter 1
Compilation
These checks happen before the program starts executing, so errors can be
found more quickly. More importantly, errors can be found in parts of the
program that have never run. Furthermore, these checks don’t have to hap-
pen at run time, which is one of the reasons compiled languages generally
run faster than interpreted languages.
This shows that the name of the variable is stored in memory while the pro-
gram is running (along with some other values that are part of the default
run-time environment).
2. Parsing: During parsing, the compiler reads the source code and
builds an internal representation of the program, called an “abstract
syntax tree.” Errors detected during this step are generally syntax er-
rors.
Normally when you run gcc, it runs all of these steps and generates an
executable file. For example, here is a minimal C program:
#include <stdio.h>

int main()
{
    printf("Hello World\n");
}
If you save this code in a file called hello.c, you can compile and run it like
this:
$ gcc hello.c
$ ./a.out
By default, gcc stores the executable code in a file called a.out (which orig-
inally stood for “assembler output”). The second line runs the executable.
The prefix ./ tells the shell to look for it in the current directory.
It is usually a good idea to use the -o flag to provide a better name for the
executable:
$ gcc hello.c -o hello
$ ./hello
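You can also ask gcc to stop after generating an object file, and then use nm to
list the names the object file defines and uses. The exact output depends on your
toolchain; on my system it looks roughly like this (a sketch, not a verbatim
transcript):

$ gcc -c hello.c -o hello.o
$ nm hello.o
0000000000000000 T main
                 U puts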
This output indicates that hello.o defines the name main and uses a func-
tion named puts, which stands for “put string.” In this example, gcc per-
forms an optimization by replacing printf, which is a large and compli-
cated function, with puts, which is relatively simple.
You can control how much optimization gcc does with the -O flag. By de-
fault, it does very little optimization, which can make debugging easier.
The option -O1 turns on the most common and safe optimizations. Higher
numbers turn on additional optimizations that require longer compilation
time.
1.6 Preprocessing
Taking another step backward through the compilation process, you can
use the -E flag to run the preprocessor only:
$ gcc hello.c -E
The result is the output from the preprocessor. In this example, it contains
the included code from stdio.h, and all the files included from stdio.h,
and all the files included from those files, and so on. On my machine, the
total is more than 800 lines of code. Since almost every C program includes
stdio.h, those 800 lines of code get compiled a lot. If, like many C pro-
grams, you also include stdlib.h, the result is more than 1800 lines of code.
Once the program starts, C does very little run-time checking, so there are
only a few run-time errors you are likely to see. If you divide by zero, or
perform another illegal floating-point operation, you will get a “Floating
point exception.” And if you try to read or write an incorrect location in
memory, you will get a “Segmentation fault.”
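For example, a program like this (a hypothetical illustration, not from the book's
repository) triggers those run-time errors; comment out the division to see the
segmentation fault instead:

#include <stdio.h>

int main()
{
    int zero = 0;
    printf("%d\n", 12 / zero);   /* integer division by zero: "Floating point exception" */

    int *p = NULL;
    *p = 17;                     /* write to an invalid address: "Segmentation fault" */
    return 0;
}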
Chapter 2
Processes
The word “virtual” is often used in the context of a virtual machine, which is
software that creates the illusion of a dedicated computer running a particu-
lar operating system, when in reality the virtual machine might be running,
along with many other virtual machines, on a computer running a different
operating system.
In the context of virtualization, we sometimes call what is really happening
“physical”, and what is virtually happening either “logical” or “abstract.”
2.2 Isolation
One of the most important principles of engineering is isolation: when you
are designing a system with multiple components, it is usually a good idea
to isolate them from each other so that a change in one component doesn’t
have undesired effects on other components.
One of the most important goals of an operating system is to isolate each
running program from the others so that programmers don’t have to think
about every possible interaction. The software object that provides this iso-
lation is a process.
A process is a software object that represents a running program. I mean
“software object” in the sense of object-oriented programming; in general,
an object contains data and provides methods that operate on the data. A
process is an object that contains the following data:
• The hardware state of the program, which includes data stored in reg-
isters, status information, and the program counter, which indicates
which instruction is currently executing.
Usually one process runs one program, but it is also possible for a process
to load and run a new program.
It is also possible, and common, to run the same program in more than
one process. In that case, the processes share the same program text, but
generally have different data and hardware states.
Most operating systems provide a fundamental set of capabilities to isolate
processes from each other:
• Virtual memory: Most operating systems create the illusion that each
process has its own chunk of memory, isolated from all other pro-
cesses. Again, programmers generally don’t have to think about how
virtual memory works; they can proceed as if every program has a
dedicated chunk of memory.
As a programmer, you don’t need to know much about how these capabili-
ties are implemented. But if you are curious, you will find a lot of interest-
ing things going on under the metaphorical hood. And if you know what’s
going on, it can make you a better programmer.
When I move the mouse, the window manager wakes up, sees that the
mouse is over the terminal window, and wakes up the terminal. The ter-
minal wakes up the shell. If I type make in the shell, it creates a new process
to run Make, which creates another process to run LaTeX and then another
process to display the results.
And those are just the processes I am aware of. At the same time there
are many other processes running “in the background.” Many of them are
performing operations related to the operating system.
The third column is the total processor time used by the process, in hours,
minutes, and seconds. The last column is the name of the running program.
In this example, bash is the name of the shell that interprets the commands
I type in the terminal, emacs is my text editor, and ps is the program gener-
ating this output.
By default, ps lists only the processes associated with the current terminal.
If you use the -e flag, you get every process (including processes belonging
to other users, which is a security flaw, in my opinion).
On my system there are currently 233 processes. Here are some of them:
PID TTY TIME CMD
1 ? 00:00:17 init
2 ? 00:00:00 kthreadd
3 ? 00:00:02 ksoftirqd/0
4 ? 00:00:00 kworker/0:0
8 ? 00:00:00 migration/0
9 ? 00:00:00 rcu_bh
10 ? 00:00:16 rcu_sched
47 ? 00:00:00 cpuset
48 ? 00:00:00 khelper
49 ? 00:00:00 kdevtmpfs
50 ? 00:00:00 netns
51 ? 00:00:00 bdi-default
52 ? 00:00:00 kintegrityd
53 ? 00:00:00 kblockd
54 ? 00:00:00 ata_sff
55 ? 00:00:00 khubd
56 ? 00:00:00 md
57 ? 00:00:00 devfreq_wq
init is the first process created when the operating system starts. It creates
many of the other processes, and then sits idle until the processes it created
are done.
Based on the name, you can infer that ksoftirqd is also a kernel daemon;
specifically, it handles software interrupt requests, or “soft IRQ”.
I won’t go into more details about the other processes, but if you are inter-
ested you can search for more information about them. You should run ps
on your system and compare your results to mine.
Chapter 3
Virtual memory
If the process reads and writes files, those files are usually stored on a hard
disk drive (HDD) or solid state drive (SSD). These storage devices are non-
volatile, so they are used for long-term storage. Currently a typical desktop
computer has an HDD with a capacity of 500 GB to 2 TB. GB stands for “gigabyte,”
which is 10^9 bytes. TB stands for “terabyte,” which is 10^12 bytes.
You might have noticed that I used the binary unit GiB for the size of main
memory and the decimal units GB and TB for the size of the HDD. For
historical and technical reasons, memory is measured in binary units, and
disk drives are measured in decimal units. In this book I will be careful
to distinguish binary and decimal units, but you should be aware that the
word “gigabyte” and the abbreviation GB are often used ambiguously.
In casual use, the term “memory” is sometimes used for HDDs and SSDs
as well as RAM, but the properties of these devices are very different, so
we will need to distinguish them. I will use “storage” to refer to HDDs and
SSDs.
system with 1 GiB of physical memory, the highest valid address is 2^30 − 1,
which is 1,073,741,823 in decimal, or 0x3fff ffff in hexadecimal (the prefix 0x
indicates a hexadecimal number).
Instead, programs work with virtual addresses, which are numbered from
0 to M − 1, where M is the number of valid virtual addresses. The size of
the virtual address space is determined by the operating system and the
hardware it runs on.
You have probably heard people talk about 32-bit and 64-bit systems. These
terms indicate the size of the registers, which is usually also the size of a
virtual address. On a 32-bit system, virtual addresses are 32 bits, which
means that the virtual address space runs from 0 to 0xffff ffff. The size of
this address space is 2^32 bytes, or 4 GiB.
On a 64-bit system, the size of the virtual address space is 2^64 bytes, or
2^4 · 1024^6 bytes. That’s 16 exbibytes, which is about a billion times bigger than
current physical memories. It might seem strange that a virtual address
space can be so much bigger than physical memory, but we will see soon
how that works.
Thus, virtual memory is one important way the operating system isolates
processes from each other. In general, a process cannot access data belong-
ing to another process, because there is no virtual address it can generate
that maps to physical memory allocated to another process.
• The text segment contains the program text; that is, the machine lan-
guage instructions that make up the program.
• The static segment contains variables that are allocated by the com-
piler, including global variables and local variables that are declared
static.
• The stack segment contains the run-time stack, which is made up of
stack frames. Each stack frame contains the parameters and local vari-
ables of a function.
• The heap segment contains chunks of memory allocated at run time,
usually by calling the C library function malloc.
The arrangement of these segments is determined partly by the compiler
and partly by the operating system. The details vary from one system to
another, but in the most common arrangement:
• The text segment is near the “bottom” of memory; that is at addresses
near 0.
• The static segment is often just above the text segment.
• The stack is near the top of memory; that is, near the highest addresses
in the virtual address space. As the stack expands, it grows down
toward smaller addresses.
• The heap is often above the static segment. As it expands, it grows up
toward larger addresses.
To determine the layout of these segments on your system, try running
this program, which is in aspace.c in the repository for this book (see Sec-
tion 0.1).
#include <stdio.h>
#include <stdlib.h>

int global;
int main()
{
    int local = 5;
    void *p = malloc(128);
    printf("Address of main is %p\n", (void *) main);
    printf("Address of global is %p\n", (void *) &global);
    printf("Address of local is %p\n", (void *) &local);
    printf("Address of p is %p\n", p);
    return 0;
}
When I run this program, the output looks like this (I added spaces to make
it easier to read):
Address of main is 0x 40057c
Address of global is 0x 60104c
Address of local is 0x7fffd26139c4
Address of p is 0x 1c3b010
As expected, the address of main is the lowest, followed by global and p.
The address of local is much bigger. It has 12 hexadecimal digits. Each hex
digit corresponds to 4 bits, so it is a 48-bit address. That suggests that the
usable part of the virtual address space is 2^48 bytes.
As an exercise, run this program on your computer and compare your re-
sults to mine. Add a second call to malloc and check whether the heap on
your system grows up (toward larger addresses). Add a function that prints
the address of a local variable, and check whether the stack grows down.
For example, the following function keeps track of how many times it has
been called.
int times_called()
{
    static int counter = 0;
    counter++;
    return counter;
}
The keyword static indicates that counter is a static local variable. The
initialization happens only once, when the program starts.
If you add this function to aspace.c you can confirm that counter is allo-
cated in the static segment along with global variables, not in the stack.
Most processors provide a memory management unit (MMU) that sits be-
tween the CPU and main memory. The MMU performs fast translation be-
tween VAs and PAs.
2. The MMU splits the VA into two parts, called the page number and
the offset. A “page” is a chunk of memory; the size of a page depends
on the operating system and the hardware, but common sizes are 1–4
KiB.
3. The MMU looks up the page number in the “page table” and gets the
corresponding physical page number. Then it combines the physical
page number with the offset to produce a PA.
• Since 1 GiB is 2^30 bytes and 1 KiB is 2^10 bytes, there are 2^20 physical
pages, sometimes called “frames.”
• The size of the virtual address space is 2^32 B and the size of a page is
2^10 B, so there are 2^22 virtual pages.
• The size of the offset is determined by the page size. In this example
the page size is 2^10 B, so it takes 10 bits to specify a byte on a page.
• Since there are 2^20 physical pages, each physical page number is 20
bits. Adding in the 10-bit offset, the resulting PAs are 30 bits.
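As a concrete sketch of the split described in step 2 (my own illustration, not
code from the book), here is how a 32-bit VA divides into a page number and an
offset when pages are 1 KiB:

#include <stdio.h>

#define PAGE_BITS 10                          /* 1 KiB pages, as in the example above */
#define OFFSET_MASK ((1u << PAGE_BITS) - 1)

int main()
{
    unsigned int va = 0x12345678;             /* an arbitrary 32-bit virtual address */
    unsigned int vpn = va >> PAGE_BITS;       /* virtual page number: the upper 22 bits */
    unsigned int offset = va & OFFSET_MASK;   /* offset within the page: the lower 10 bits */
    printf("VPN = 0x%x, offset = 0x%x\n", vpn, offset);
    /* The MMU would look up vpn in the page table to get a 20-bit physical
       page number, ppn, and form the PA as (ppn << PAGE_BITS) | offset. */
    return 0;
}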
So far this all seems feasible. But let’s think about how big a page table
might have to be. The simplest implementation of a page table is an array
with one entry for each virtual page. Each entry would contain a physical
page number, which is 20 bits in this example, plus some additional infor-
mation about each frame. So we expect 3–4 bytes per entry. But with 2^22
virtual pages, the page table would require 2^24 bytes, or 16 MiB.
And since we need a page table for each process, a system running 256
processes would need 2^32 bytes, or 4 GiB, just for page tables! And that’s
just with 32-bit virtual addresses. With 48- or 64-bit VAs, the numbers are
ridiculous.
Fortunately, nothing like that much space is actually needed, because most
processes don’t use even a small fraction of their virtual address space. And
if a process doesn’t use a virtual page, we don’t need an entry in the page
table for it.
Another way to say the same thing is that page tables are “sparse”, which
implies that the simple implementation, an array of page table entries, is a
bad idea. Fortunately, there are several good implementations for sparse
arrays.
One option is a multilevel page table, which is what many operating sys-
tems, including Linux, use. Another option is an associative table, where
each entry includes both the virtual page number and the physical page
number. Searching an associative table can be slow in software, but in hard-
ware we can search the entire table in parallel, so associative arrays are often
used to represent page tables in the MMU.
You can read more about these implementations at http://en.wikipedia.
org/wiki/Page_table; you might find the details interesting. But the fun-
damental idea is that page tables are sparse, so we have to choose a good
implementation for sparse arrays.
I mentioned earlier that the operating system can interrupt a running pro-
cess, save its state, and then run another process. This mechanism is called
a “context switch”. Since each process has its own page table, the operating
system has to work with the MMU to make sure each process gets the right
page table. In older machines, the page table information in the MMU had
to be replaced during every context switch, which was expensive. In newer
systems, each page table entry in the MMU includes the process ID, so page
tables from multiple processes can be in the MMU at the same time.
Chapter 4
Files and file systems
When a process completes (or crashes), any data stored in main memory is
lost. But data stored on a hard disk drive (HDD) or solid state drive (SSD)
is “persistent;” that is, it survives after the process completes, even if the
computer shuts down.
Hard disk drives are complicated. Data is stored in blocks, which are laid
out in sectors, which make up tracks, which are arranged in concentric cir-
cles on platters.
Solid state drives are simpler in one sense, because blocks are numbered se-
quentially, but they raise a different complication: each block can be written
a limited number of times before it becomes unreliable.
Abstractly:
File names are usually strings, and they are usually “hierarchical”; that is,
the string specifies a path from a top-level directory (or folder), through a
series of subdirectories, to a specific file.
The primary difference between the abstraction and the underlying mecha-
nism is that files are byte-based and persistent storage is block-based. The
operating system translates byte-based file operations in the C library into
block-based operations on storage devices. Typical block sizes are 1–8 KiB.
For example, the following code opens a file and reads the first byte:
FILE *fp = fopen("/home/downey/file.txt", "r");
char c = fgetc(fp);
fclose(fp);
When this code runs:
1. fopen uses the filename to find the top-level directory, called /, the
subdirectory home, and the sub-subdirectory downey.
2. It finds the file named file.txt and “opens” it for reading, which
means it creates a data structure that represents the file being read.
Among other things, this data structure keeps track of how much of
the file has been read, called the “file position”.
In DOS, this data structure is called a File Control Block, but I want to
avoid that term because in UNIX it means something else. In UNIX,
there seems to be no good name for it. It is an entry in the open file
table, so I will call it an OpenFileTableEntry.
3. When we call fgetc, the operating system checks whether the next
character of the file is already in memory. If so, it reads the next char-
acter, advances the file position, and returns the result.
5. When the I/O operation is complete, the new block of data is stored
in memory, and the process resumes. It reads the first character and
stores it as a local variable.
6. When the process closes the file, the operating system completes or
cancels any pending operations, removes data stored in memory, and
frees the OpenFileTableEntry.
The process for writing a file is similar, but there are some additional steps.
Here is an example that opens a file for writing and changes the first character.
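The code itself is short. Here is a sketch (my own, not necessarily the book's
exact listing, and with error checking omitted as in the reading example); note
that mode "w" truncates an existing file, so to modify only the first character
in place you would open with "r+" instead:

FILE *fp = fopen("/home/downey/file.txt", "w");
fputc('b', fp);
fclose(fp);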
1. Again, fopen uses the filename to find the file. If it does not already
exist, it creates a new file and adds an entry in the parent directory,
/home/downey.
3. fputc attempts to write (or re-write) the first byte of the file. If the
file already exists, the operating system has to load the first block into
memory. Otherwise it allocates a new block in memory and requests
a new block on disk.
5. When the file is closed, any buffered data is written to disk and the
OpenFileTableEntry is freed.
To put these numbers in perspective, let’s compare them to the clock cycle
of the CPU. A processor with clock rate 2 GHz completes one clock cycle
every 0.5 ns. The time to get a byte from memory to the CPU is typically
around 100 ns. If the processor completes one instruction per clock cycle, it
would complete 200 instructions while waiting for a byte from memory.
• Block transfers: The time it takes to load a single byte from disk is
about 5 ms. By comparison, the additional time to load an 8 KiB block
is negligible. If the processor does 5 ms of work on each block, it might
be possible to keep the processor busy.
Similarly, if a program writes data to disk and then the computer loses
power, the data might be lost if it is in a cache and not yet on disk.
In many UNIX file systems, that data structure is called an “inode,” which
stands for “index node”. More generally, information about files, including
the location of their blocks, is called “metadata”. (The content of the file is
data, so information about the file is data about data, hence “meta”.)
Since inodes reside on disk along with the rest of the data, they are designed
to fit neatly into disk blocks. A UNIX inode contains information about the
file including: the user ID of the file owner, permission flags indicating who
is allowed to read, write, or execute it, and timestamps that indicate when it
was last modified and accessed. In addition, it contains block numbers for
the first 12 blocks that make up the file.
If the block size is 8 KiB, the first 12 blocks make up 96 KiB. On most sys-
tems, that’s big enough for a large majority of files, but it’s obviously not
big enough for all of them. That’s why the inode also contains a pointer to
an “indirection block”, which contains nothing but pointers to other blocks.
That’s why the inode also contains a pointer to a “double indirection block”,
which contains pointers to indirection blocks. With 1024 indirection blocks,
we can address 8 GiB.
And if that’s not big enough, there is (finally) a triple indirection block,
which contains pointers to double indirection blocks, yielding a maximum
file size of 8 TiB. When UNIX inodes were designed, that seemed big enough
to serve for a long time. But that was a long time ago.
It is hard to design a file system that achieves all of these goals, especially
since file system performance depends on “workload characteristics” in-
cluding file sizes, access patterns, etc. A file system that is well tuned for
one workload might not perform as well for another.
For this reason, most operating systems support several kinds of file sys-
tems, and file system design is an active area of research and development.
In the last decade, Linux systems have migrated from ext2, which was a
conventional UNIX file system, to ext3, a “journaling” file system intended
to improve speed and contiguity, and more recently to ext4, which can han-
dle larger files and file systems. Within the next few years, there might be
another migration to the B-tree file system, Btrfs.
Reusing the file abstraction makes life easier for programmers, since they
only have to learn one API (application program interface). It also makes
programs more versatile, since a program intended to work with files can
also work with data coming from pipes and other sources.
Chapter 5
More bits and bytes
For negative numbers, the most obvious representation uses a sign bit to
indicate whether a number is positive or negative. But there is another rep-
resentation, called “two’s complement” that is much more common because
it is easier to work with in hardware.
In two’s complement, the leftmost bit acts like a sign bit; it is 0 for positive
numbers and 1 for negative numbers.
To convert from an 8-bit number to 16 bits, we have to add more 0’s for a
positive number and more 1’s for a negative number. In effect, we have to
copy the sign bit into the new bits. This process is called “sign extension”.
In C all integer types are signed (able to represent positive and negative
numbers) unless you declare them unsigned. The difference, and the reason
this declaration is important, is that operations on unsigned integers don’t
use sign extension.
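Here is a small demonstration of the difference (my own example, not from the
book): the same bit pattern widens differently depending on whether the type is
signed:

#include <stdio.h>
#include <stdint.h>

int main()
{
    int8_t s = -3;        /* bits: 1111 1101 */
    uint8_t u = 0xFD;     /* the same bits, interpreted as unsigned: 253 */

    int16_t s16 = s;      /* sign extension: 0xFFFD, still -3 */
    uint16_t u16 = u;     /* zero extension: 0x00FD, still 253 */

    printf("%d %d\n", s16, u16);   /* prints: -3 253 */
    return 0;
}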
If you AND a vector with 3, it selects the two rightmost bits and clears the rest.
In this context, the value 3 is called a “mask” because it selects some bits
and masks the rest.
Setting bits: if you OR a vector with 3, it sets the two rightmost bits to 1 and
leaves the rest alone:

  xxxx
| 0011
  ----
  xx11
Toggling bits: Finally, if you XOR a vector with 3, it flips the rightmost bits
and leaves the rest alone. As an exercise, see if you can compute the two’s
complement of 12 using ^. Hint: what’s the two’s complement representa-
tion of -1?
C also provides shift operators, << and >>, which shift bits left and right. Each
left shift doubles a number, so 5 << 1 is 10, and 5 << 2 is 20. Each right shift
divides by two (rounding down), so 5 >> 1 is 2 and 2 >> 1 is 1.
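Here is a small program (my own example) that exercises these operators:

#include <stdio.h>

int main()
{
    unsigned char x = 0xA5;               /* 1010 0101 */
    printf("x & 3 = 0x%02X\n", x & 3);    /* select the two rightmost bits: 0x01 */
    printf("x | 3 = 0x%02X\n", x | 3);    /* set the two rightmost bits: 0xA7 */
    printf("x ^ 3 = 0x%02X\n", x ^ 3);    /* toggle the two rightmost bits: 0xA6 */
    printf("5 << 1 = %d\n", 5 << 1);      /* 10 */
    printf("5 >> 1 = %d\n", 5 >> 1);      /* 2 */
    return 0;
}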
Most computers use the IEEE standard for floating-point arithmetic. The C
type float usually corresponds to the 32-bit IEEE standard; double usually
corresponds to the 64-bit standard.
In the 32-bit standard, the leftmost bit is the sign bit, s. The next 8 bits are
the exponent, q, and the last 23 bits are the coefficient, c. The value of a
floating-point number is
(−1)^s · c · 2^q
Well, that’s almost correct, but there is one more wrinkle. Floating-point
numbers are usually normalized so that there is one digit before the point.
For example, in base 10, we prefer 2.998 · 10^8 rather than 2998 · 10^5 or any
other equivalent expression. In base 2, a normalized number always has the
digit 1 before the binary point. Since the digit in this location is always 1,
we can save space by leaving it out of the representation.
Well, that’s almost correct. There’s one more wrinkle. The exponent is
stored with a “bias”. In the 32-bit standard, the bias is 127, so the expo-
nent 3 would be stored as 130.
union {
    float f;
    unsigned int u;
} p;

p.f = -13.0;
unsigned int sign = (p.u >> 31) & 1;
unsigned int exp = (p.u >> 23) & 0xff;

unsigned int coef_mask = (1 << 23) - 1;
unsigned int coef = p.u & coef_mask;

printf("%d\n", sign);
printf("%d\n", exp);
printf("0x%x\n", coef);
This code is in float.c in the repository for this book (see Section 0.1).
The union allows us to store a floating-point value using p.f and then read
it as an unsigned integer using p.u.
To get the sign bit, we shift the bits to the right 31 places and then use a 1-bit
mask to select only the rightmost bit.
To get the exponent, we shift the bits 23 places, then select the rightmost 8
bits (the hexadecimal value 0xff has eight 1’s).
To get the coefficient, we need to extract the 23 rightmost bits and ignore the
rest. We do that by making a mask with 1s in the 23 rightmost places and 0s
on the left. The easiest way to do that is by shifting 1 to the left by 23 places
and then subtracting 1.
1
130
0x500000
As expected, the sign bit for a negative number is 1. The exponent is 130,
including the bias. And the coefficient, which I printed in hexadecimal, is
101 followed by 20 zeros.
Actually, the same thing can happen if you read a location in memory in-
correctly. One way that can happen is if you read past the end of an array.
To see what happens, I’ll start with a function that allocates an array on the
stack and fills it with the numbers from 0 to 99.

void f1() {
    int i;
    int array[100];
    for (i = 0; i < 100; i++) {
        array[i] = i;
    }
}

Then a second function allocates a smaller array and deliberately reads elements
before the beginning and past the end:

void f2() {
    int x = 17;
    int array[10];
    int y = 123;
    printf("%d\n", array[-2]);
    printf("%d\n", array[-1]);
    printf("%d\n", array[10]);
    printf("%d\n", array[11]);
}
If I call f1 and then f2, I get these results:
17
123
98
99
The details here depend on the compiler, which arranges variables on the
stack. From these results, we can infer that the compiler put x and y next to
each other, “below” the array (at a lower address). And when we read past
the array, it looks like we are getting values that were left on the stack by
the previous function call.
In this example, all of the variables are integers, so it is relatively easy to fig-
ure out what is going on. But in general when you read beyond the bounds
of an array, the values you read might have any type. For example, if I
change f1 to make an array of floats, the results are:
17
123
1120141312
1120272384
The latter two values are what you get if you interpret a floating-point value
as an integer. If you encountered this output while debugging, you would
have a hard time figuring out what’s going on.
Also, remember that the letters and numbers in C strings are encoded in
ASCII. The ASCII codes for the digits “0” through “9” are 48 through 57,
not 0 through 9. The ASCII code 0 is the NUL character that marks the end
of a string. And the ASCII codes 1 through 9 are special characters used in
some communication protocols.
The ASCII code for the letter “A” is 65; the code for “a” is 97. Here are those
codes in binary:
65 = b0100 0001
97 = b0110 0001
A careful observer will notice that they differ by a single bit. And this pat-
tern holds for the rest of the letters; the sixth bit (counting from the right)
acts as a “case bit”, 0 for upper-case letters and 1 for lower case letters.
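For example, here is a quick illustration of the case bit (my own example; the
sixth bit from the right has value 0x20):

#include <stdio.h>

int main()
{
    printf("%c\n", 'a' & ~0x20);   /* clear the case bit: prints A */
    printf("%c\n", 'A' | 0x20);    /* set the case bit: prints a */
    return 0;
}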
As an exercise, write a function that takes a string and converts from lower-
case to upper-case by flipping the sixth bit. As a challenge, you can make a
faster version by reading the string 32 or 64 bits at a time, rather than one
character at a time. This optimization is made easier if the length of the
string is a multiple of 4 or 8 bytes.
If you read past the end of a string, you are likely to see strange characters.
Conversely, if you write a string and then accidentally read it as an int or
float, the results will be hard to interpret.
Chapter 6
Memory management
• calloc, which is the same as malloc except that it also clears the newly
allocated chunk; that is, sets all bytes in the chunk to 0.
• If you access any chunk that has not been allocated, that’s a paddling.
• If you free an allocated chunk and then access it, that’s a paddling.
• If you try to free a chunk that has not been allocated, that’s a paddling.
• If you free the same chunk more than once, that’s a paddling.
• If you call realloc with a chunk that was not allocated, or was allo-
cated and then freed, that’s a paddling.
It might not sound difficult to follow these rules, but in a large program a
chunk of memory might be allocated in one part of the program, used in
several other parts, and freed in yet another part. So changes in one part of
the program can require changes in many other parts.
To make matters worse, memory errors can be difficult to find because the
symptoms are unpredictable. For example:
• If you read a value from an unallocated chunk, the system might detect
the error, trigger a run-time error called a “segmentation fault”, and
stop the program. This outcome is desirable, because it indicates the
location in the program that caused the error. But, sadly, this outcome
is rare. More often, the program reads unallocated memory without
detecting the error, and the value it gets is whatever happened to be
stored at a particular location. If the value is not interpreted as the
right type, the results might be hard to interpret. For example, if you
read bytes from a string and interpret them as a floating-point value,
you will probably get a number that makes no sense.
One conclusion you should draw from this is that safe memory manage-
ment requires design and discipline. If you write a library or module that
allocates memory, you should also provide an interface to free it, and mem-
ory management should be part of the API design from the beginning.
If you use a library that allocates memory, you should be disciplined in your
use of the API. For example, if the library provides functions to allocate and
deallocate storage, you should use those functions and not, for example,
call free on a chunk you did not malloc. And you should avoid keeping
multiple references to the same chunk in different parts of your program.
For some programs, memory leaks are ok. For example, if your program
allocates memory, performs computations on it, and then exits, it is prob-
ably not necessary to free the allocated memory. When the program exits,
all of its memory is deallocated by the operating system. Freeing memory
immediately before exiting might feel more responsible, but it is mostly a
waste of time.
But if a program runs for a long time and leaks memory, its total memory
use will increase indefinitely. At that point, a few things might happen:
• On systems with virtual memory, the operating system can move an-
other process’s pages from memory to disk and then allocate more
space to the leaking process. I explain this mechanism in Section 7.8.
• Eventually, a process might fill its virtual address space (or the usable
part). After that, there are no more addresses to allocate, so malloc
returns NULL.
If malloc returns NULL, but you persist and access the chunk you think
you allocated, you get a segmentation fault. For this reason, it is considered
good style to check the result from malloc before using it. One option is to
add a condition like this after every malloc call:
void *p = malloc(size);
if (p == NULL) {
    perror("malloc failed");
    exit(-1);
}
perror is declared in stdio.h; it prints an error message and additional
information about the last error that occurred.
6.3 Implementation
When a process starts, the system allocates space for the text segment and
statically allocated data, space for the stack, and space for the heap, which
contains dynamically allocated data.
Not all programs allocate data dynamically, so the initial size of the heap
might be small or zero. Initially the heap contains only one free chunk.
When malloc is called, it checks whether it can find a free chunk that’s big
enough. If not, it has to request more memory from the system. The func-
tion that does that is sbrk, which sets the “program break”, which you can
think of as a pointer to the end of the heap.
When sbrk is called, it allocates new pages of physical memory, updates the
process’s page table, and sets the program break.
In theory, a program could call sbrk directly (without using malloc) and
manage the heap itself. But malloc is easier to use and, for most memory-
use patterns, it runs fast and uses memory efficiently.
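If you are curious, you can watch the program break move with a sketch like this
(for exploration only; whether and how far the break moves depends on the
allocator, and some allocations are satisfied with mmap instead of sbrk):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main()
{
    void *before = sbrk(0);          /* current program break */
    void *p = malloc(32 * 1024);     /* likely forces malloc to extend the heap */
    void *after = sbrk(0);

    printf("break before malloc: %p\n", before);
    printf("break after malloc:  %p\n", after);
    free(p);
    return 0;
}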
To implement the memory management API, most Linux systems use
ptmalloc, which is based on dlmalloc, written by Doug Lea. A short
paper that describes key elements of the implementation is available at
http://gee.cs.oswego.edu/dl/html/malloc.html.
• The run time of malloc does not usually depend on the size of the
chunk, but might depend on how many free chunks there are. free is
usually fast, regardless of the number of free chunks. Because calloc
clears every byte in the chunk, the run time depends on chunk size (as
well as the number of free chunks).
realloc is sometimes fast, if the new size is smaller than the current
size, or if space is available to expand the existing chunk. If not, it has
to copy data from the old chunk to the new; in that case, the run time
depends on the size of the old chunk.
• Space overhead: Boundary tags and free list pointers take up space.
The minimum chunk size on most systems is 16 bytes. So for very
small chunks, malloc is not space efficient. If your program requires
large numbers of small structures, it might be more efficient to allocate
them in arrays.
• Fragmentation: If you allocate and free chunks with varied sizes, the
heap will tend to become fragmented. That is, the free space might be
broken into many small pieces. Fragmentation wastes space; it also
slows the program down by making memory caches less effective.
• Binning and caching: The free list is sorted by size into bins, so when
malloc searches for a chunk with a particular size, it knows what bin
to search in. So if you free a chunk and then immediately allocate a
chunk with the same size, malloc will usually be fast.
Chapter 7
Caching
When a program starts, the code (or text) is usually on a hard disk or solid
state drive. The operating system creates a new process to run the program,
then the “loader” copies the text from storage into main memory and starts
the program by calling main.
While the program is running, most of its data is stored in main memory,
but some of the data is in registers, which are small units of memory on the
CPU. These registers include:
• The program counter, or PC, which contains the address (in memory)
of the next instruction in the program.
• The instruction register, or IR, which contains the machine code in-
struction currently executing.
• The stack pointer, or SP, which contains the address of the stack frame
for the current function, which contains its parameters and local vari-
ables.
When a program is running, the CPU executes the following steps, called
the “instruction cycle”:
• Fetch: The next instruction is fetched from memory and stored in the
instruction register.
• Decode: Part of the CPU, called the “control unit”, decodes the in-
struction and sends signals to the other parts of the CPU.
• Execute: Signals from the control unit cause the appropriate compu-
tation to occur.
Most computers can execute a few hundred different instructions, called the
“instruction set”. But most instructions fall into a few general categories:
During each instruction cycle, one instruction is read from the program text.
In addition, about half of the instructions in a typical program load or store
data. And therein lies one of the fundamental problems of computer archi-
tecture: the “memory bottleneck”.
When the CPU loads a value from memory, it stores a copy in the cache. If
the same value is loaded again, the CPU gets the cached copy and doesn’t
have to wait for memory.
Eventually the cache gets full. Then, in order to bring something new in,
we have to kick something out. So if the CPU loads a value and then loads
it again much later, it might not be in cache any more.
The cache “hit rate”, h, is the fraction of memory accesses that find data in
cache; the “miss rate”, m, is the fraction of memory accesses that have to go
to memory. If the time to process a cache hit is Th and the time for a cache
miss is Tm , the average time for each memory access is
hTh + mTm
Equivalently, we could define the “miss penalty” as the extra time to process
a cache miss, Tp = Tm − Th . Then the average access time is
Th + mTp
When the miss rate is low, the average access time can be close to Th . That
is, the program can perform as if memory ran at cache speeds.
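For example (numbers of my own choosing, just to illustrate): if Th = 1 ns and
Tp = 99 ns, a miss rate of 1% gives an average access time of 1 + 0.01 · 99 ≈ 2 ns,
close to cache speed; a miss rate of 10% gives about 10.9 ns, dominated by the
misses.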
7.3 Locality
When a program reads a byte for the first time, the cache usually loads
a “block” or “line” of data that includes the requested byte and some of
its neighbors. If the program goes on to read one of the neighbors, it will
already be in cache.
As an example, suppose the block size is 64 B; you read a string with length
64, and the first byte of the string happens to fall at the beginning of a block.
When you load the first byte, you incur a miss penalty, but after that the
rest of the string would be in cache. After reading the whole string, the hit
rate would be 63/64. If the string spans two blocks, you would incur 2 miss
penalties. But even then the hit rate would be 62/64, or almost 97%.
The tendency of a program to use the same data more than once is called
“temporal locality”. The tendency to use data in nearby locations is called
“spatial locality”. Fortunately, many programs naturally display both kinds
of locality:
The next section explores the relationship between a program’s access pat-
tern and cache performance.
iters = iters + 1;
sec = sec + (get_seconds() - sec0);
sec keeps track of the total CPU time used by the inner loop. The outer
loop runs until sec exceeds 0.1 seconds, which is long enough that we can
compute the average time with sufficient precision.
Figure 7.1: Average miss penalty as a function of array size and stride. (In the
original figure, the x-axis shows array sizes from 2^12 B to 2^26 B, and each
curve corresponds to a stride from 16 B to 512 B.)
iters2 = iters2 + 1;
sec = sec - (get_seconds() - sec0);
• The program reads through the array many times, so it has plenty
of temporal locality. If the entire array fits in cache, we expect the
average miss penalty to be near 0.
• When the stride is 4 bytes, we read every element of the array, so the
program has plenty of spatial locality. If the block size is big enough
to contain 64 elements, for example, the hit rate would be 63/64, even
if the array does not fit in cache.
• If the stride is equal to the block size (or greater), the spatial locality
is effectively zero, because each time we read a block, we only access
one element. In that case we expect to see the maximum miss penalty.
In Figure 7.1, cache performance is good, for all strides, as long as the array
is less than 2^22 B. We can infer that the cache size is near 4 MiB; in fact,
according to the specs, it is 3 MiB.
Many processors use “multi-level caches” that include a small, fast cache
and a bigger, slower cache. In this example, it looks like the miss penalty
increases a little when the array size is bigger than 2^14 B, so it’s possible that
this processor also has a 16 KiB cache with an access time less than 1 ns.
For example, if you are working with a large array, it might be faster to
traverse the array once, performing several operations with each element,
rather than traversing the array several times.
If you are working with a 2-D array, it might be stored as an array of rows. If
you traverse through the elements, it would be faster to go row-wise, with
stride equal to the element size, rather than column-wise, with stride equal
to the row length.
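Here is a sketch of the two traversal orders (my own example; C stores 2-D arrays
in row-major order, so the row-wise loops touch memory sequentially):

#include <stdio.h>

#define ROWS 1000
#define COLS 1000

static double a[ROWS][COLS];

int main()
{
    double sum = 0.0;
    int i, j;

    /* Row-wise: the inner loop steps through adjacent elements, so each
       cache block is used completely before the next one is loaded. */
    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            sum += a[i][j];

    /* Column-wise: consecutive reads are a whole row apart, so each read
       touches a different block and spatial locality is mostly lost. */
    for (j = 0; j < COLS; j++)
        for (i = 0; i < ROWS; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}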
Linked data structures don’t always exhibit spatial locality, because the
nodes aren’t necessarily contiguous in memory. But if you allocate many
nodes at the same time, they are usually co-located in the heap. Or, even
better, if you allocate an array of nodes all at once, you know they will be
contiguous.
Recursive strategies like mergesort often have good cache behavior because
they break big arrays into smaller pieces and then work with the pieces.
Sometimes these algorithms can be tuned to take advantage of cache be-
havior.
Without going too far into computer architecture, there are two reasons:
electronics and economics. Caches are fast because they are small and close
to the CPU, which minimizes delays due to capacitance and signal propa-
gation. If you make a cache big, it will be slower.
Also, caches take up space on the processor chip, and bigger chips are
more expensive. Main memory is usually dynamic random-access memory
(DRAM), which uses only one transistor and one capacitor per bit, so it is
possible to pack more memory into the same amount of space. But this way
of implementing memory is slower than the way caches are implemented.
The tradeoff between speed, size, and cost is the fundamental reason for
caching. If there were one memory technology that was fast, big, and cheap,
we wouldn’t need anything else.
The following table shows typical access times, sizes, and costs for each of
these technologies.
• Who moves data up and down the hierarchy? At the top of the hierar-
chy, register allocation is usually done by the compiler. Hardware on
the CPU handles the memory cache. Users implicitly move data from
storage to memory when they execute programs and open files. But
the operating system also moves data back and forth between mem-
ory and storage. At the bottom of the hierarchy, administrators move
data explicitly between disk and tape.
• What gets moved? In general, block sizes are small at the top of the
hierarchy and bigger at the bottom. In a memory cache, a typical block
size is 128 B. Pages in memory might be 4 KiB, but when the operating
system reads a file from disk, it might read 10 or 100 blocks at a time.
• When does data get moved? In the most basic cache, data gets moved
into cache when it is used for the first time. But many caches use some
kind of “prefetching”, meaning that data is loaded before it is explic-
itly requested. We have already seen a simple form of prefetching:
loading an entire block when only part of it is requested.
• Where in the cache does the data go? When the cache is full, we can’t
bring anything in without kicking something out. Ideally, we want to
keep data that will be used again soon and replace data that won’t.
The answers to these questions make up the “cache policy”. Near the top of
the hierarchy, cache policies tend to be simple because they have to be fast
and they are implemented in hardware. Near the bottom of the hierarchy,
there is more time to make decisions, and well-designed policies can make
a big difference.
Most cache policies are based on the principle that history repeats itself;
if we have information about the recent past, we can use it to predict the
immediate future. For example, if a block of data has been used recently,
we expect it to be used again soon. This principle suggests a replacement
policy called “least recently used,” or LRU, which removes from the cache
a block of data that has not been used recently. For more on this topic, see
http://en.wikipedia.org/wiki/Cache_algorithms.
7.8 Paging
In systems with virtual memory, the operating system can move pages back
and forth between memory and storage. As I mentioned in Section 6.2, this
mechanism is called “paging” or sometimes “swapping”.
3. If there are no free pages, the paging system chooses a “victim page”
belonging to Process B. It copies the contents of the page from memory
to disk, then it modifies the page table for Process B to indicate that
this page is “swapped out”.
4. Once the data from Process B is written, the page can be reallocated
to Process A. To prevent Process A from reading Process B’s data, the
page should be cleared.
5. At this point the call to sbrk can return, giving malloc additional space
in the heap. Then malloc allocates the requested chunk and returns.
Process A can resume.
7. When the operating system handles the interrupt, it sees that the page
is swapped out, so it transfers the page back from disk to memory.
When paging works well, it can greatly improve the utilization of physical
memory, allowing more processes to run in less space. Here’s why:
• Most processes don’t use all of their allocated memory. Many parts of
the text segment are never executed, or execute once and never again.
Those pages can be swapped out without causing any problems.
• On most systems, there are processes like daemons that sit idle most of
the time and only occasionally “wake up” to respond to events. While
they are idle, these processes can be swapped out.
• A user might have many windows open, but only a few are active at
a time. The inactive processes can be swapped out.
If you add up the total memory allocated to all processes, it can greatly
exceed the size of physical memory, and yet the system can still behave
well.
Up to a point.
When a process accesses a page that’s swapped out, it has to get the data
back from disk, which can take several milliseconds. The delay is often
noticeable. If you leave a window idle for a long time and then switch back
to it, it might start slowly, and you might hear the disk drive working while
pages are swapped in.
Occasional delays like that might be acceptable, but if you have too many
processes using too much space, they start to interfere with each other.
When Process A runs, it evicts the pages Process B needs. Then when B
runs, it evicts the pages A needs. When this happens, both processes slow
to a crawl and the system can become unresponsive. This regrettable sce-
nario is called “thrashing”.
Chapter 8
Multitasking
In many current systems, the CPU contains multiple cores, which means it
can run several processes at the same time. In addition, each core is capable
of “multitasking”, which means it can switch from one process to another
quickly, creating the illusion that many processes are running at the same
time.
The part of the operating system that implements multitasking is the “ker-
nel”. In a nut or seed, the kernel is the innermost part, surrounded by a
shell. In an operating system, the kernel is the lowest level of software,
surrounded by several other layers, including an interface called a “shell.”
Computer scientists love extended metaphors.
1. When the interrupt occurs, the hardware saves the program counter
in a special register and jumps to the appropriate interrupt handler.
2. The interrupt handler stores the program counter and the flag register
in memory, along with the contents of any data registers it plans to
use.
3. The interrupt handler runs whatever code is needed to handle the in-
terrupt.
If this mechanism works correctly, there is generally no way for the inter-
rupted process to know there was an interruption, unless it detects small
changes in the time between instructions.
• Ready, if the process could be running, but isn’t, usually because there
are more runnable processes than cores.
• Done, if the process has completed, but has exit status information
that has not been read yet.
Here are the events that cause a process to transition from one state to an-
other:
• When a process calls exit, the interrupt handler stores the exit code
in the PCB and changes the process’s state to done.
8.4 Scheduling
As we saw in Section 2.3 there might be hundreds of processes on a com-
puter, but usually most of them are blocked. Most of the time, there are
only a few processes that are ready or running. When an interrupt occurs,
the scheduler decides which process to start or resume.
On a workstation or laptop, the primary goal of the scheduler is to minimize
response time; that is, the computer should respond quickly to user actions.
Response time is also important on a server, but in addition the scheduler
might try to maximize throughput, which is the number of requests that
complete per unit of time.
Usually the scheduler doesn’t have much information about what processes
are doing, so its decisions are based on a few heuristics:
• Processes might be limited by different resources. A process that does a lot
of computation is probably CPU-bound; it will finish sooner if it gets more
CPU time. A process that spends its time waiting for data from a disk or
network is I/O-bound; it would not run faster with more CPU time. Finally, a
process that interacts with the user is probably blocked, most of the time,
waiting for user actions.
The operating system can sometimes classify processes based on their
past behavior, and schedule them accordingly. For example, when an
interactive process is unblocked, it should probably run immediately,
because a user is probably waiting for a reply. On the other hand, a
CPU-bound process that has been running for a long time might be
less time-sensitive.
• If a process is likely to run for a short time and then make a blocking
request, it should probably run immediately, for two reasons: (1) if
the request takes some time to complete, we should start it as soon as
possible, and (2) it is better for a long-running process to wait for a
short one, rather than the other way around.
As an analogy, suppose you are making an apple pie. The crust takes
5 minutes to prepare, but then it has to chill for half an hour. It takes
20 minutes to prepare the filling. If you prepare the crust first, you can
prepare the filling while the crust is chilling, and you can finish the
pie in 35 minutes. If you prepare the filling first, the process takes 55
minutes.
• If a process makes a request and blocks before its time slice is com-
plete, it is more likely to be interactive or I/O-bound, so its priority
should go up.
• If a task blocks for a long time and then becomes ready, it should get
a priority boost so it can respond to whatever it was waiting for.
• The system call nice allows a process to decrease (but not increase)
its own priority, allowing programmers to pass explicit information to
the scheduler.
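For example (a sketch using the POSIX nice function; the valid range of nice
values is system-dependent):

#include <stdio.h>
#include <unistd.h>

int main()
{
    int ret = nice(10);    /* raise the nice value by 10, lowering this process's priority */
    printf("new nice value: %d\n", ret);
    return 0;
}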
Chapter 9
Threads
When you create a process, the operating system creates a new address
space, which includes the text segment, static segment, and heap; it also
creates a new “thread of execution”, which includes the program counter
and other hardware state, and the run-time stack.
The processes we have seen so far are “single-threaded”, which means that
only one thread of execution runs in each address space. In this chapter,
you will learn about “multi-threaded” processes that have multiple threads
running in the same address space.
Within a single process, all threads share the same text segment, so they run
the same code. But different threads often run different parts of the code.
And they share the same static segment, so if one thread changes a global
variable, other threads see the change. They also share the heap, so threads
can share dynamically-allocated chunks.
But each thread has its own stack, so threads can call functions without
interfering with each other. And usually threads can’t access each other’s
local variables.
The example code for this chapter is in the repository for this book, in a
directory named counter. For information on downloading this code, see
Section 0.1.
• When you compile the program, you link it with the Pthread library.
pthread_t make_thread(void *(*entry)(void *), Shared *shared)
{
    pthread_t thread;
    int n = pthread_create(&thread, NULL, entry, (void *) shared);
    if (n != 0) {
        perror("pthread_create failed");
        exit(-1);
    }
    return thread;
}
make_thread is a wrapper I wrote to make pthread_create easier to use,
and to provide error-checking.
The return type from pthread_create is pthread_t, which you can think of
as an id or “handle” for the new thread.
The parameter is the handle of the thread you want to wait for. All the
wrapper does is call pthread_join and check the result.
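So the wrapper looks roughly like this (a sketch consistent with that
description; the book's version may differ in detail):

void join_thread(pthread_t thread)
{
    int n = pthread_join(thread, NULL);   /* wait for the thread; ignore its return value */
    if (n != 0) {
        perror("pthread_join failed");
        exit(-1);
    }
}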
Any thread can join any other thread, but in the most common pattern the
parent thread creates and joins all child threads. Continuing the example
from the previous section, here’s the code that waits on the children:
for (i=0; i<NUM_CHILDREN; i++) {
    join_thread(child[i]);
}
This loop waits for the children one at a time in the order they were created.
There is no guarantee that the child threads complete in that order, but this
loop works correctly even if they don’t. If one of the children is late, the loop
might have to wait, and other children might complete in the meantime. But
regardless, the loop exits only when all children are done.
If you have downloaded the repository for this book (see Section 0.1), you’ll
find this example in counter/counter.c. You can compile and run it like
this:
$ make counter
gcc -Wall counter.c -o counter -lpthread
$ ./counter
When I ran it with 5 children, I got the following output:
counter = 0
counter = 0
counter = 1
counter = 0
counter = 3
When you run it, you will probably get different results. And if you run it
again, you might get different results each time. What’s going on?
Here is a sequence of events that could explain the output in the previous
section:
Child A reads 0
Child B reads 0
Child C reads 0
Child A prints 0
Child B prints 0
Child A sets counter=1
Child D reads 1
Child D prints 1
Child C prints 0
Child B sets counter=2
Child C sets counter=3
Child E reads 3
Child E prints 3
Child D sets counter=4
Child E sets counter=5
Each time you run the program, threads might be interrupted at different
points, or the scheduler might choose different threads to run, so the se-
quence of events, and the results, will be different.
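For reference, the body of each child thread is something like this
sketch; the Shared structure with a counter field is an assumption based
on the output above:

void child_code(Shared *shared)
{
    // Both the read in printf and the read-modify-write in the
    // increment can interleave with other threads, which is what
    // produces the inconsistent output.
    printf("counter = %d\n", shared->counter);
    shared->counter++;
}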
I have written a small module called mutex.c that provides mutex objects.
I’ll show you how to use it first; then I’ll explain how it works.
9.5 Mutex
My definition of Mutex is a wrapper for a type called pthread_mutex_t,
which is defined in the POSIX threads API.
One of the problems with this API is that pthread_mutex_t behaves like a
structure, so if you pass it as an argument, it makes a copy, which makes
the mutex behave incorrectly; to avoid that, you have to pass a pointer to
the mutex, and it is easy to get that wrong.
My code makes it easier to get that right. It defines a type, Mutex, which is
just a more readable name for pthread_mutex_t:
#include <pthread.h>

typedef pthread_mutex_t Mutex;
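Allocating and initializing a Mutex is wrapped as well; this sketch assumes
the check_malloc and perror_exit helpers used elsewhere in the book's code:

Mutex *make_mutex()
{
    Mutex *mutex = check_malloc(sizeof(Mutex));
    int n = pthread_mutex_init(mutex, NULL);
    if (n != 0) perror_exit("make_mutex failed");
    return mutex;
}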
The functions to lock and unlock the mutex are simple wrappers for POSIX
functions:
void mutex_lock(Mutex *mutex)
{
    int n = pthread_mutex_lock(mutex);
    if (n != 0) perror_exit("lock failed");
}
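mutex_unlock follows the same pattern, and with both in hand the counter's
child code can be protected. The Shared field names in this sketch are
assumptions, not code from the repository:

void mutex_unlock(Mutex *mutex)
{
    int n = pthread_mutex_unlock(mutex);
    if (n != 0) perror_exit("unlock failed");
}

// Hypothetical synchronized version of the child code: only one
// thread at a time can read, print, and increment the counter.
void child_code(Shared *shared)
{
    mutex_lock(shared->mutex);
    printf("counter = %d\n", shared->counter);
    shared->counter++;
    mutex_unlock(shared->mutex);
}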
Condition variables
I’ll start with a simple queue that is not thread safe, then we’ll see what goes
wrong and fix it. The code for this example is in the repository for this book,
in a folder called queue. The file queue.c contains a basic implementation of
a circular buffer, in which the indices wrap around the end of the array.
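The structure definition is not reproduced here; based on the fields used
below, a minimal sketch of the unsynchronized queue looks like this:

typedef struct {
    int *array;      // circular buffer of items
    int length;      // size of the array
    int next_in;     // index where the next item is pushed
    int next_out;    // index of the next item to pop
} Queue;

Queue *make_queue(int length)
{
    Queue *queue = (Queue *) malloc(sizeof(Queue));
    queue->length = length;
    queue->array = (int *) malloc(length * sizeof(int));
    queue->next_in = 0;
    queue->next_out = 0;
    return queue;
}

queue_push adds a new item to the queue: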
void queue_push(Queue *queue, int item) {
    if (queue_full(queue)) {
        perror_exit("queue is full");
    }

    queue->array[queue->next_in] = item;
    queue->next_in = queue_incr(queue, queue->next_in);
}
If the queue is full, queue_push prints an error message and exits. I will
explain queue_full soon.
If the queue is not full, queue_push inserts the new element and then incre-
ments next_in using queue_incr:
int queue_incr(Queue *queue, int i)
{
    return (i+1) % queue->length;
}
When the index, i, gets to the end of the array, it wraps around to 0. And
that’s where we run into a tricky part. If we keep adding elements to the
queue, eventually next_in wraps around and catches up with next_out.
But if next_in == next_out, we would incorrectly conclude that the queue
was empty.
To avoid that, we define another special case to indicate that the queue is
full:
int queue_full(Queue *queue)
{
    return (queue_incr(queue, queue->next_in) == queue->next_out);
}
If incrementing next_in lands on next_out, that means we can’t add an-
other element without making the queue seem empty. So we stop one el-
ement before the “end” (keeping in mind that the end of the queue can be
anywhere, not necessarily the end of the array).
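The matching emptiness check compares the two indices directly; a minimal
sketch consistent with the discussion above:

int queue_empty(Queue *queue)
{
    // the queue is empty when next_in has not moved past next_out
    return (queue->next_in == queue->next_out);
}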
Now we can write queue_pop, which removes and returns the next element
from the queue:
int queue_pop(Queue *queue) {
    if (queue_empty(queue)) {
        perror_exit("queue is empty");
    }

    int item = queue->array[queue->next_out];
    queue->next_out = queue_incr(queue, queue->next_out);
    return item;
}
If you try to pop from an empty queue, queue_pop prints an error message
and exits.
The queue is wrapped in a Shared structure, which make_shared allocates
and initializes:

Shared *make_shared()
{
    Shared *shared = check_malloc(sizeof(Shared));
    shared->queue = make_queue(QUEUE_LENGTH);
    return shared;
}
The code we have so far is a good starting place, but it has several problems:
• Access to the queue is not thread safe. Different threads could access
array, next_in, and next_out at the same time and leave the queue in
a broken, “inconsistent” state.
• If the queue is empty, queue_pop exits with an error, and if it is full,
queue_push does the same. It would be better if producers and consumers
could block until those operations can complete.
In the next section, we solve the first problem with a Mutex. In the following
section, we solve the second problem with condition variables.
Here is queue_push with the mutex added:

void queue_push(Queue *queue, int item) {
    mutex_lock(queue->mutex);                //-- new
    if (queue_full(queue)) {
        mutex_unlock(queue->mutex);          //-- new
        perror_exit("queue is full");
    }

    queue->array[queue->next_in] = item;
    queue->next_in = queue_incr(queue, queue->next_in);
    mutex_unlock(queue->mutex);              //-- new
}
Before checking whether the queue is full, we have to lock the mutex. If
the queue is full, we have to unlock the mutex before exiting; otherwise the
thread would leave the mutex locked and no other threads could proceed.
queue_pop gets the same treatment:

int queue_pop(Queue *queue) {
    mutex_lock(queue->mutex);                //-- new
    if (queue_empty(queue)) {
        mutex_unlock(queue->mutex);          //-- new
        perror_exit("queue is empty");
    }

    int item = queue->array[queue->next_out];
    queue->next_out = queue_incr(queue, queue->next_out);
    mutex_unlock(queue->mutex);              //-- new
    return item;
}
Note that the other Queue functions, queue_full, queue_empty, and
queue_incr do not try to lock the mutex. Any thread that calls these func-
tions is required to lock the mutex first; this requirement is part of the doc-
umented interface for these functions.
With this additional code, the queue is thread safe; if you run it, you should
not see any synchronization errors. But it is likely that the consumer will
exit at some point because the queue is empty, or the producer will exit
because the queue is full, or both.
Rather than exiting when the queue is full, queue_push might want to block
(“wait”) until the queue is not full. Similarly, queue_pop might want to
wait for a condition like “queue not empty”.
Here’s how we can add these capabilities to the queue code. First we add
two condition variables to the Queue structure:
typedef struct {
    int *array;
    int length;
    int next_in;
    int next_out;
    Mutex *mutex;
    Cond *nonempty;   //-- new
    Cond *nonfull;    //-- new
} Queue;
And initialize them in make_queue:
Queue *make_queue(int length)
{
    Queue *queue = (Queue *) malloc(sizeof(Queue));
    queue->length = length;
    queue->array = (int *) malloc(length * sizeof(int));
    queue->next_in = 0;
    queue->next_out = 0;
    queue->mutex = make_mutex();
    queue->nonempty = make_cond();
    queue->nonfull = make_cond();
    return queue;
}
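Based on the discussion that follows, and by symmetry with queue_push
below, queue_pop with the condition variable added looks something like
this sketch: the consumer locks the mutex and, as long as the queue is
empty, waits on nonempty.

int queue_pop(Queue *queue) {
    mutex_lock(queue->mutex);
    while (queue_empty(queue)) {
        cond_wait(queue->nonempty, queue->mutex);   //-- new
    }

    int item = queue->array[queue->next_out];
    queue->next_out = queue_incr(queue, queue->next_out);
    mutex_unlock(queue->mutex);
    cond_signal(queue->nonfull);                    //-- new
    return item;
}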
When the thread that locked the mutex calls cond_wait, it unlocks the mu-
tex and then blocks. This is important. If cond_wait did not unlock the
mutex before blocking, no other thread would be able to access the queue,
no more items could be added, and the queue would always be empty.
So while the consumer is blocked on nonempty, the producer can run. Let’s
see what happens when the producer runs queue_push:
void queue_push(Queue *queue, int item) {
    mutex_lock(queue->mutex);
    while (queue_full(queue)) {
        cond_wait(queue->nonfull, queue->mutex);    //-- new
    }

    queue->array[queue->next_in] = item;
    queue->next_in = queue_incr(queue, queue->next_in);
    mutex_unlock(queue->mutex);
    cond_signal(queue->nonempty);                   //-- new
}
Let’s assume that the queue is not full, so the producer doesn’t run
cond_wait and doesn’t block. It adds a new element to the queue and then
unlocks the mutex. But before exiting, it does one more thing: it “signals”
the condition variable nonempty.
If there are threads waiting on the condition variable, one of them gets un-
blocked and resumes execution of cond_wait. But before the awakened
thread can return from cond_wait, it has to wait for and lock the mutex
again.
Now go back to queue_pop and see what happens when the thread returns
from cond_wait. It loops back to the top of the while loop and checks the
condition again. I’ll explain why in just a second, but for now let’s assume
that the condition is true; that is, the queue is not empty.
When the thread exits the while loop, we know two things: (1) the condition
is true, so there is at least one item in the queue, and (2) the mutex is locked,
so it is safe to access the queue.
After removing an item, queue_pop unlocks the mutex, signals that the
queue is not full, and returns.
In the next section I'll show you how my Cond code works, but first I want
to answer two frequently-asked questions:
• The first question is “Why is cond_wait inside a while loop? Why does
the thread have to check the condition again after it wakes up?” The
reason is that another thread can get in first: suppose Thread A is
waiting on nonempty, and a producer pushes an item and signals the
condition variable, but before Thread A can lock the mutex, another
consumer locks it and pops the item. Now the queue is empty again, but
Thread A is not blocked any more. Thread A could lock the mutex
and return from cond_wait. If Thread A does not check the condi-
tion again, it would try to pop an element from an empty queue, and
probably cause an error.
• The other question that comes up when people learn about condition
variables is “How does the condition variable know what condition it
is associated with?”
This question is understandable because there is no explicit connec-
tion between a Cond structure and the condition it relates to. The
connection is implicit in the way it is used.
Here’s one way to think of it: the condition associated with a Cond
is the thing that is false when you call cond_wait and true when you
call cond_signal. Of course, there might be many conditions that are
false in the first case and true in the second. The right one is only in
the mind of the programmer, so it should be explained carefully in the
documentation.
Like Mutex, my Cond type is a more readable name for the POSIX type, and
make_cond allocates and initializes one:

typedef pthread_cond_t Cond;

Cond *make_cond()
{
    Cond *cond = check_malloc(sizeof(Cond));
    int n = pthread_cond_init(cond, NULL);
    if (n != 0) perror_exit("make_cond failed");
    return cond;
}
And here are the wrappers for cond_wait and cond_signal.
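They follow the same pattern as the mutex wrappers; a sketch:

void cond_wait(Cond *cond, Mutex *mutex)
{
    int n = pthread_cond_wait(cond, mutex);
    if (n != 0) perror_exit("cond_wait failed");
}

void cond_signal(Cond *cond)
{
    int n = pthread_cond_signal(cond);
    if (n != 0) perror_exit("cond_signal failed");
}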
Semaphores in C
Semaphores are a good way to learn about synchronization, but they are
not as widely used, in practice, as mutexes and condition variables.
Nevertheless, there are some synchronization problems that can be solved
simply with semaphores, yielding solutions that are more demonstrably
correct.
This chapter presents a C API for working with semaphores and my code
for making it easier to work with. And it presents a final challenge: can
you write an implementation of a semaphore using mutexes and condition
variables?
The code for this chapter is in directory semaphore in the repository for this
book (see Section 0.1).
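As context for the names used below, here is a sketch of wrappers around
the POSIX semaphore API, following the same pattern as the mutex and
condition-variable wrappers; check_malloc and perror_exit are the assumed
helpers:

#include <semaphore.h>

typedef sem_t Semaphore;

Semaphore *make_semaphore(int value)
{
    Semaphore *sem = check_malloc(sizeof(Semaphore));
    // second argument 0 means the semaphore is shared among the
    // threads of this process, not among processes
    int n = sem_init(sem, 0, value);
    if (n != 0) perror_exit("sem_init failed");
    return sem;
}

void semaphore_wait(Semaphore *sem)
{
    int n = sem_wait(sem);
    if (n != 0) perror_exit("sem_wait failed");
}

void semaphore_signal(Semaphore *sem)
{
    // "post" in POSIX terms
    int n = sem_post(sem);
    if (n != 0) perror_exit("sem_post failed");
}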
I prefer to call this operation “signal” rather than “post”, although both
terms are common.
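For example, a semaphore initialized to 1 can be used like a mutex; this
is a sketch, not code from the repository:

Semaphore *mutex = make_semaphore(1);   // initial value 1 means "unlocked"

semaphore_wait(mutex);
// protected code goes here
semaphore_signal(mutex);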
Here I am using the variable name mutex to indicate that the semaphore is
being used as a mutex. But remember that the behavior of a semaphore is
not the same as a Pthread mutex.
Here’s the new definition of Queue, replacing the mutex and condition vari-
ables with semaphores:
typedef struct {
    int *array;
    int length;
    int next_in;
    int next_out;
    Semaphore *mutex;    //-- new
    Semaphore *items;    //-- new
    Semaphore *spaces;   //-- new
} Queue;
And here’s the new version of make_queue:
Queue *make_queue(int length)
{
    Queue *queue = (Queue *) malloc(sizeof(Queue));
    queue->length = length;
    queue->array = (int *) malloc(length * sizeof(int));
    queue->next_in = 0;
    queue->next_out = 0;
    queue->mutex = make_semaphore(1);
    queue->items = make_semaphore(0);
    queue->spaces = make_semaphore(length-1);
    return queue;
}
mutex is used to guarantee exclusive access to the queue; the initial value is
1, so the mutex is initially unlocked.
items is the number of items in the queue, which is also the number of
consumer threads that can execute queue_pop without blocking. Initially
there are no items in the queue.
spaces is the number of empty spaces in the queue, which is the number
of producer threads that can execute queue_push without blocking. Initially
the number of spaces is the capacity of the queue, which is length-1, as
explained in Section 10.1.
Here is the new version of queue_push:

void queue_push(Queue *queue, int item) {
    semaphore_wait(queue->spaces);
    semaphore_wait(queue->mutex);

    queue->array[queue->next_in] = item;
    queue->next_in = queue_incr(queue, queue->next_in);

    semaphore_signal(queue->mutex);
    semaphore_signal(queue->items);
}
Notice that queue_push doesn’t have to call queue_full any more; instead,
the semaphore keeps track of how many spaces are available and blocks
producers if the queue is full.
And here is queue_pop:

int queue_pop(Queue *queue) {
    semaphore_wait(queue->items);
    semaphore_wait(queue->mutex);

    int item = queue->array[queue->next_out];
    queue->next_out = queue_incr(queue, queue->next_out);

    semaphore_signal(queue->mutex);
    semaphore_signal(queue->spaces);
    return item;
}
This solution is explained, using pseudo-code, in Chapter 4 of The Little Book
of Semaphores.
Using the code in the repo for this book, you should be able to compile and
run this solution like this:
$ make queue_sem
$ ./queue_sem
Before you go on, you might want to try this as an exercise: write func-
tions that implement the semaphore API in sem.h using condition
variables and mutexes. You can put your solution in mysem.c and mysem.h
in the repo for this book, and you’ll find my solution in mysem_soln.c and
mysem_soln.h.
If you have trouble getting started, you can use the following structure def-
inition, from my solution, as a hint:
typedef struct {
    int value, wakeups;
    Mutex *mutex;
    Cond *cond;
} Semaphore;
value is the value of the semaphore. wakeups counts the number of pend-
ing signals; that is, the number of threads that have been woken but
have not yet resumed execution. The reason for wakeups is to make sure
that our semaphores have Property 3, described in The Little Book of
Semaphores.
mutex provides exclusive access to value and wakeups; cond is the condition
variable threads wait on if they wait on the semaphore.
Here is semaphore_wait:

void semaphore_wait(Semaphore *semaphore)
{
    mutex_lock(semaphore->mutex);
    semaphore->value--;

    if (semaphore->value < 0) {
        do {
            cond_wait(semaphore->cond, semaphore->mutex);
        } while (semaphore->wakeups < 1);
        semaphore->wakeups--;
    }
    mutex_unlock(semaphore->mutex);
}
When a thread waits on the semaphore, it has to lock the mutex before it
decrements value. If the value of the semaphore becomes negative, the
thread blocks until a “wakeup” is available. While it is blocked, the mutex
is unlocked, so another thread can signal.
And here is semaphore_signal:

void semaphore_signal(Semaphore *semaphore)
{
    mutex_lock(semaphore->mutex);
    semaphore->value++;

    if (semaphore->value <= 0) {
        semaphore->wakeups++;
        cond_signal(semaphore->cond);
    }
    mutex_unlock(semaphore->mutex);
}
Again, a thread has to lock the mutex before it increments value. If the
semaphore was negative, that means threads are waiting, so the signalling
thread increments wakeups and signals the condition variable.
At this point one of the waiting threads might wake up, but the mutex is
still locked until the signalling thread unlocks it.
At that point, one of the waiting threads returns from cond_wait and checks
whether a wakeup is still available. If not, it loops and waits on the condi-
tion variable again. If so, it decrements wakeups, unlocks the mutex, and
exits.
One thing about this solution that might not be obvious is the use of a
do...while loop. Can you figure out why it is not a more conventional
while loop? What would go wrong?
The problem is that with a while loop this implementation would not have
Property 3. It would be possible for a thread to signal and then run around
and catch its own signal.
With the do...while loop, it is guaranteed that when a thread signals, one
of the waiting threads will get the signal, even if the signalling thread runs
around and gets the mutex before one of the waiting threads resumes.