Think OS

A Brief Introduction to Operating Systems

Version 0.3

Allen B. Downey
The original form of this book is LaTeX source code. Compiling this code has the
effect of generating a device-independent representation of a textbook, which can
be converted to other formats and printed.
The cover for this book is based on a photo by Paul Friel (http://flickr.com/people/frielp/), who made it available under the Creative Commons Attribution license. The original photo is at http://flickr.com/photos/frielp/11999738/.
Preface
This book is intended for a different audience than the typical operating systems text, and it has different goals. I developed it for a class at Olin College called Software Systems.
Few of my students will ever write an operating system, but many of them
will write low-level applications in C, and some of them will work on em-
bedded systems. My class includes material from operating systems, net-
works, databases, and embedded systems, but it emphasizes the topics pro-
grammers need to know.
This book does not assume that you have studied Computer Architecture.
As we go along, I will explain what we need.
Chapter 2 explains how the operating system uses processes to protect run-
ning programs from interfering with each other.
Contributor List
If you have a suggestion or correction, please send email to
[email protected]. If I make a change based on your feed-
back, I will add you to the contributor list (unless you ask to be omitted).
If you include at least part of the sentence the error appears in, that makes it
easy for me to search. Page and section numbers are fine, too, but not quite
as easy to work with. Thanks!
Chapter 1

Compilation

1.2 Static types

Many interpreted languages, like Python, are dynamically typed: you don't know the type of a variable until the program is running. In general, “static” refers to things that happen at compile time, and “dynamic” refers to things that happen at run time.
For example, in Python you can write a function like this:
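def add(x, y):
    return x + y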
Looking at this code, you can’t tell what type x and y will refer to. At run
time, this function could be called several times with different types. Any
types that support the addition operator will work; any other types will
cause an exception or “run time error.”
In C you would write the same function like this:
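int add(int x, int y) {
    return x + y;
}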
The first line of the function includes “type declarations” for the parameters
and the return value. x and y are declared to be integers, which means that
we can check at compile time whether the addition operator is legal for this
type (it is). The return value is also declared to be an integer.
Because of these declarations, when this function is called elsewhere in the
program, the compiler can check whether the arguments provided have the
right type, and whether the return value is used correctly.
These checks happen before the program starts executing, so errors can be
found more quickly. More importantly, errors can be found in parts of the
program that have never run. Furthermore, these checks don’t have to hap-
pen at run time, which is one of the reasons compiled languages generally
run faster than interpreted languages.
Declaring types at compile time also saves space. In dynamic languages,
variable names are stored in memory while the program runs, and they
are often accessible by the program. For example, in Python the built-in
function locals returns a dictionary that contains variable names and their
values. Here’s an example in a Python interpreter:
>>> x = 5
>>> print locals()
{'x': 5, '__builtins__': <module '__builtin__' (built-in)>,
'__name__': '__main__', '__doc__': None, '__package__': None}
This shows that the name of the variable is stored in memory while the pro-
gram is running (along with some other values that are part of the default
run-time environment).
1.3 The compilation process

Compiling a C program involves several steps, including preprocessing, parsing, static checking, code generation, linking, and optimization. Parsing, for example, is the step where the compiler reads the source code and builds an internal representation of the program, called an “abstract syntax tree.” Errors detected during this step are generally syntax errors.
Normally when you run gcc, it runs all of these steps and generates an
executable file. For example, here is a minimal C program:
#include <stdio.h>
int main()
{
printf("Hello World\n");
return 0;
}
If you save this code in a file called hello.c, you can compile and run it like
this:
$ gcc hello.c
$ ./a.out
By default, gcc stores the executable code in a file called a.out (which orig-
inally stood for “assembler output”). The second line runs the executable.
The prefix ./ tells the shell to look for it in the current directory.
It is usually a good idea to use the -o flag to provide a better name for the executable:

$ gcc hello.c -o hello
$ ./hello

1.4 Object code

The -c flag tells gcc to compile the program and generate object code, without linking it or generating an executable:

$ gcc hello.c -c
The result is a file named hello.o, where the o stands for “object code,”
which is the compiled program. Object code is not executable, but it can be
linked into an executable.
The Unix command nm reads an object file and generates information about
the names it defines and uses. For example:
$ nm hello.o
0000000000000000 T main
U puts
This output indicates that hello.o defines the name main and uses a func-
tion named puts, which stands for “put string.” In this example, gcc per-
forms an optimization by replacing printf, which is a large and compli-
cated function, with puts, which is relatively simple.
In general you can control how much optimization gcc does with the -O
flag. By default, it does very little optimization, which can help with de-
bugging. The option -O1 turns on the most common and safe optimiza-
tions. Higher numbers turn on additional optimizations that require longer
compilation time.
In theory, optimization should not change the behavior of the program,
other than to speed it up. But if your program has a subtle bug, you might
find that optimization makes the bug appear or disappear. It is usually
a good idea to turn off optimization while you are developing new code.
Once the program is working and passing appropriate tests, you can turn
on optimization and confirm that the tests still pass.
1.5 Assembly code

Similarly, the -S flag tells gcc to compile the program and generate assembly code, which is basically a human-readable form of machine code:

$ gcc hello.c -S
The result is a file named hello.s, which might look something like this:
.file "hello.c"
.section .rodata
.LC0:
.string "Hello World"
1.6 Preprocessing
Taking another step backward through the compilation process, you can
use the -E flag to run the preprocessor only:
$ gcc hello.c -E
The result is the output from the preprocessor. In this example, it contains
the included code from stdio.h, and all the files included from stdio.h,
and all the files included from those files, and so on. On my machine, the
total is more than 800 lines of code. Since almost every C program includes
stdio.h, those 800 lines of code get compiled a lot. If, like many C pro-
grams, you also include stdlib.h, the result is more than 1800 lines of code.
If you use a function that’s not defined in any of the standard libraries, you
get a message from the linker:
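For example, if your program calls a function named bar (a hypothetical name) that is not defined anywhere, the message looks something like this; the exact wording depends on your system:

hello.c:(.text+0x10): undefined reference to `bar'
collect2: error: ld returned 1 exit status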
Once the program starts, C does very little run-time checking, so there are
only a few run-time errors you are likely to see. If you divide by zero, or
perform another illegal floating-point operation, you will get a “Floating
point exception.” And if you try to read or write an incorrect location in
memory, you will get a “Segmentation fault.”
Chapter 2
Processes
The word “virtual” is often used in the context of a virtual machine, which is
software that creates the illusion of a dedicated computer running a particu-
lar operating system, when in reality the virtual machine might be running,
along with many other virtual machines, on a computer running a different
operating system.
In the context of virtualization, we sometimes call what is really happening
“physical”, and what is virtually happening either “logical” or “abstract.”
2.2 Isolation
One of the most important principles of engineering is isolation: when you
are designing a system with multiple components, it is usually a good idea
to isolate them from each other so that a change in one component doesn’t
have undesired effects on other components.
One of the most important goals of an operating system is to isolate each
running program from the others so that programmers don’t have to think
about every possible interaction. The software object that provides this iso-
lation is a process.
A process is a software object that represents a running program. I mean
“software object” in the sense of object-oriented programming; in general,
an object contains data and provides methods that operate on the data. A
process is an object that contains the following data:
• The hardware state of the program, which includes data stored in reg-
isters, status information, and the program counter, which indicates
which instruction is currently executing.
Usually one process runs one program, but it is also possible for a process
to load and run a new program.
It is also possible, and common, to run the same program in more than
one process. In that case, the processes share the same program text, but
generally have different data and hardware states.
Most operating systems provide a fundamental set of capabilities to isolate
processes from each other:
• Virtual memory: Most operating systems create the illusion that each
process has its own chunk of memory, isolated from all other pro-
cesses. Again, programmers generally don’t have to think about how
virtual memory works; they can proceed as if every program has a
dedicated chunk of memory.
As a programmer, you don’t need to know much about how these capabili-
ties are implemented. But if you are curious, you will find a lot of interest-
ing things going on under the metaphorical hood. And if you know what’s
going on, it can make you a better programmer.
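2.3 Unix processes

When you run the Unix command ps in a terminal, you get a list of processes. The details vary from system to system, but the output looks something like this:

  PID TTY          TIME CMD
 2687 pts/1    00:00:00 bash
 2801 pts/1    00:01:24 emacs
24762 pts/1    00:00:00 ps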
The first column is the unique numerical process ID. The second column
is the terminal that created the process; “TTY” stands for teletypewriter,
which was the original mechanical terminal.
The third column is the total processor time used by the process, and the
last column is the name of the running program. In this example, bash is
the name of the shell that interprets the commands I type in the terminal,
emacs is my text editor, and ps is the process generating this output.
By default, ps lists only the processes associated with the current terminal.
If you use the -e flag, you get every process (including processes belonging
to other users, which is a security flaw, in my opinion).
On my system there are currently 233 processes. Here are some of them:
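On a Linux system, the first few lines might look like this (your results will differ):

  PID TTY          TIME CMD
    1 ?        00:00:17 init
    3 ?        00:02:13 ksoftirqd/0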
init is the first process created when the operating system starts. It creates
many of the other processes, and then sits idle until the processes it created
are done.
Based on the name, you can infer that ksoftirqd is also a kernel daemon;
specifically, it handles software interrupt requests, or “soft IRQ”.
I won’t go into more details about the other processes, but you might be
interested to search for more information about some of them. Also, you
should run ps on your system and compare your results to mine.
Chapter 3
Virtual memory
If the process reads and writes files, those files are usually stored on a hard
disk drive (HDD) or solid state drive (SSD). These storage devices are non-
volatile, so they are used for long-term storage. Currently a typical desktop
computer has a HDD with a capacity of 500 GB to 2 TB. GB stands for “gigabyte,” which is 10^9 bytes. TB stands for “terabyte,” which is 10^12 bytes.
You might have noticed that I used the binary unit GiB for the size of main
memory and the decimal units GB and TB for the size of the HDD. For
historical and technical reasons, memory is measured in binary units, and
disk drives are measured in decimal units. In this book I will be careful
to distinguish binary and decimal units, but you should be aware that the
word “gigabyte” and the abbreviation GB are often used ambiguously.
In casual use, the term “memory” is sometimes used for HDDs and SSDs as well as RAM, but the properties of these devices are very different, so we will need to distinguish them. I will use “storage” to refer to HDDs and SSDs.
Instead, programs work with virtual addresses, which are numbered from 0 to M − 1, where M is the number of valid virtual addresses. The size of
the virtual address space is determined by the operating system and the
hardware it runs on.
You have probably heard people talk about 32-bit and 64-bit systems. These
terms indicate the size of the registers, which is usually also the size of a
virtual address. On a 32-bit system, virtual addresses are 32 bits, which
means that the virtual address space runs from 0 to 0xffff ffff. The size of this address space is 2^32 bytes, or 4 GiB.
On a 64-bit system, the size of the virtual address space is 2^64 bytes, or 2^4 · 1024^6 bytes. That's 16 exbibytes, which is about a billion times bigger than
current physical memories. It might seem strange that a virtual address
space can be so much bigger than physical memory, but we will see soon
how that works.
Thus, virtual memory is one important way the operating system isolates
processes from each other. In general, a process cannot access data belong-
ing to another process, because there is no virtual address it can generate
that maps to physical memory allocated to another process.
• The text segment contains the program text; that is, the machine lan-
guage instructions that make up the program.
• The static segment contains variables that are allocated by the com-
piler, including global variables and local variables that are declared
static.
• The stack is near the top of memory; that is, near the highest addresses
in the virtual address space. As the stack expands, it grows down
toward smaller addresses.
To determine the layout of these segments on your system, try running this program (you can download it from http://todo):
#include <stdio.h>
#include <stdlib.h>

int global;

int main ()
{
    int local = 5;
    void *p = malloc(128);

    printf ("Address of main is %p\n", main);
    printf ("Address of global is %p\n", &global);
    printf ("Address of local is %p\n", &local);
    printf ("Address of p is %p\n", p);

    return 0;
}
main is the name of a function; when it is used as a variable, it refers to the
address of the first machine language instruction in main, which we expect
to be in the text segment.
global is a global variable, so we expect it to be in the static segment. local
is a local variable, so we expect it to be on the stack.
And p contains an address returned by malloc, which allocates space in the
heap. “malloc” stands for “memory allocate.”
When I run this program, the output looks like this (I added spaces to make
it easier to read):
Address of main is 0x 40057c
Address of global is 0x 60104c
Address of local is 0x7fffd26139c4
Address of p is 0x 1c3b010
As expected, the address of main is the lowest, followed by global and p.
The address of local is much bigger. It has 12 hexadecimal digits. Each hex
digit corresponds to 4 bits, so it is a 48-bit address. That suggests that the usable part of the virtual address space is 2^48 bytes.
Exercise 3.1 Run this program on your computer and compare your results
to mine.
Add a second call to malloc and check whether the heap on your system
grows up (toward larger addresses). Add a function that prints the address
of a local variable, and check whether the stack grows down.
3.5 Address translation

When a program reads or writes a variable, the CPU generates a virtual address (VA). A part of the processor called the memory management unit (MMU) translates each VA into a physical address (PA):

1. The CPU sends the VA to the MMU.

2. The MMU splits the VA into two parts, called the page number and the offset. A “page” is a chunk of memory; the size of a page depends on the operating system and the hardware, but common sizes are 1–4 KiB.
3. The MMU looks up the page number in the “page table” and gets the
corresponding physical page number. Then it combines the physical
page number with the offset to produce a PA.
For example, suppose the physical memory is 1 GiB, the page size is 1 KiB, and the virtual addresses are 32 bits:

• Since 1 GiB is 2^30 bytes and 1 KiB is 2^10 bytes, there are 2^20 physical pages, sometimes called “frames.”

• The size of the virtual address space is 2^32 B and the size of a page is 2^10 B, so there are 2^22 virtual pages.

• The size of the offset is determined by the page size. In this example the page size is 2^10 B, so it takes 10 bits to specify a byte on a page.

• Since there are 2^20 physical pages, each physical page number is 20 bits. Adding in the 10-bit offset, the resulting PAs are 30 bits.
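This arithmetic is easy to express in C. The following sketch is my own illustration, not from the text; it assumes the 1 KiB pages from this example and a hypothetical flat page_table array (real MMUs do this in hardware):

#include <stdint.h>

#define PAGE_BITS 10                        /* 1 KiB pages */
#define OFFSET_MASK ((1u << PAGE_BITS) - 1)

uint32_t translate(uint32_t va, const uint32_t *page_table)
{
    uint32_t vpn = va >> PAGE_BITS;         /* virtual page number */
    uint32_t offset = va & OFFSET_MASK;     /* byte offset within the page */
    uint32_t ppn = page_table[vpn];         /* look up physical page number */
    return (ppn << PAGE_BITS) | offset;     /* 30-bit physical address */
}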
So far this all seems feasible. But let’s think about how big a page table
might have to be. The simplest implementation of a page table is an array
with one entry for each virtual page. Each entry would contain a physical
page number, which is 20 bits in this example, plus some additional infor-
mation about each frame. So we expect 3–4 bytes per entry. But with 2^22 virtual pages, the page table would require 2^24 bytes, or 16 MiB.
And since we need a page table for each process, a system running 256 processes would need 2^32 bytes, or 4 GiB, just for page tables! And that's
just with 32-bit virtual addresses. With 48- or 64-bit VAs, the numbers are
ridiculous.
Fortunately, nothing like that much space is actually needed, because most
processes don’t use even a small fraction of their virtual address space. And
if a process doesn’t use a virtual page, we don’t need an entry in the page
table for it.
Another way to say the same thing is that page tables are “sparse,” which
implies that the simple implementation, an array of page table entries, is a
bad idea. Fortunately, there are several good implementations for sparse
arrays.
One option is a multilevel page table, which is what many operating sys-
tems, including Linux, use. Another option is an associative table, where
each entry includes both the virtual page number and the physical page
number. Searching an associative table can be slow in software, but in hard-
ware we can search the entire table in parallel, so associative arrays are often
used to represent page tables in the MMU.
I mentioned earlier that the operating system can interrupt a running pro-
cess, save its state, and then run another process. This mechanism is called
a “context switch.” Since each process has its own page table, the operating
system has to work with the MMU to make sure that each process gets the
right page table. In older machines, the page table information in the MMU
had to be replaced during every context switch, which was expensive. In
newer systems, each page table entry in the MMU includes the process ID,
so page tables from multiple processes can be in the MMU at the same time.
Chapter 4

Files and file systems
When a process completes (or crashes), any data stored in main memory is
lost. But data stored on a hard disk drive (HDD) or solid state drive (SSD)
is “persistent;” that is, it survives after the process completes, even if the
computer shuts down.
Hard disk drives are complicated. Data is stored in blocks, which are laid
out in sectors, which make up tracks, which are arranged in concentric cir-
cles on platters.
Solid state drives are simpler in one sense, because blocks are numbered se-
quentially, but they raise a different complication: each block can be written
a limited number of times before it becomes unreliable.
Abstractly, a file is a sequence of bytes.
File names are usually strings, and they are usually “hierarchical;” that is,
the string specifies a path from a top-level directory (or folder), through a
series of subdirectories, to a specific file.
The primary difference between the abstraction and the underlying mecha-
nism is that files are byte-based and persistent storage is block-based. The
operating system translates byte-based file operations in the C library into
block-based operations on storage devices. Typical block sizes are 1–8 KiB.
For example, the following code opens a file and reads the first byte:
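FILE *fp = fopen("/home/downey/file.txt", "r");
char c = fgetc(fp);
fclose(fp);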
1. fopen uses the filename to find the top-level directory, called /, the
subdirectory home, and the sub-subdirectory downey.
2. It finds the file named file.txt and “opens” it for reading, which
means it creates a data structure that represents the file being read.
Among other things, this data structure keeps track of how much of
the file has been read, called the “file position.”
In DOS, this data structure is called a File Control Block, but I want to
avoid that term because in UNIX it means something else. In UNIX,
there seems to be no good name for it. It is an entry in the open file
table, but “open file table entry” is hard to parse, so I will call it an
OpenFileTableEntry.
3. When we call fgetc, the operating system checks whether the next
character of the file is already in memory. If so, it reads the next char-
acter, advances the file position, and returns the result.
4. If the next character is not in memory, the operating system issues an
I/O request to get the next block. Disk drives are slow, so a process
waiting for a block from disk is usually interrupted so another process
can run until the data arrives.
5. When the I/O operation is complete, the new block of data is stored
in memory, and the process resumes.
6. When the process closes the file, the operating system completes any
pending operations, removes any data stored in memory, and frees
the OpenFileTableEntry.
The process for writing a file is similar, but there are some additional steps.
Here is an example that opens a file for writing and changes the first char-
acter.
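FILE *fp = fopen("/home/downey/file.txt", "w");
fputc('b', fp);
fclose(fp);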
1. Again, fopen uses the filename to find the file. If it does not already
exist, it creates a new file and adds an entry in the parent directory,
/home/downey.
3. fputc attempts to write (or re-write) the first byte of the file. If the
file already exists, the operating system has to load the first block into
memory. Otherwise it allocates a new block in memory and requests
a new block on disk.
5. When the file is closed, any buffered data is written to disk and the
OpenFileTableEntry is freed.
4.1 Disk performance

Disk drives are slow compared to main memory. Operating systems and hardware use several strategies to mitigate the cost of disk access:

• Block transfers: The time it takes to load a single byte from disk is about 5 ms. By comparison, the additional time to load an 8 KiB block is negligible. If the processor does 5 ms of work on each block, it might be possible to keep the processor busy.
• Prefetching: Sometimes the operating system can predict that a pro-
cess will read a block and start loading it before it is requested. For
example, if you open a file and read the first block, there is a good
chance you will go on to read the second block. The operating system
might start loading additional blocks before they are requested.
• Buffering: As I mentioned, when you write a file, the operating sys-
tem stores the data in memory and only writes it to disk later. If you
modify the block several times while it is in memory, the system only
has to write it to disk once.
• Caching: If a process has used a block recently, it is likely to use it
again soon. If the operating system keeps a copy of the block in mem-
ory, it can handle future requests at memory speed.
The operating system also has to decide which blocks on disk to allocate for each file; among the goals of a block allocator:

• Minimal space use: The data structures used by the allocator should be small, leaving as much space as possible for data.
It is hard to design a file system that achieves all of these goals, especially
since file system performance depends on “workload characteristics,” which include file sizes, access patterns, etc. A file system that is well tuned for
one workload might not perform as well for a different workload.
For this reason, most operating systems support several kinds of file sys-
tems, and file system design is an active area of research and development.
In the last decade, Linux systems have migrated from ext2, which was a
conventional UNIX file system, to ext3, a “journaling” file system intended
to improve speed and contiguity, and more recently to ext4, which can han-
dle larger files and file systems. Within the next few years, there might be
another migration to the B-tree file system, Btrfs.
A pipe behaves like a file for the processes at either end. For the first process, the pipe behaves like a file open for writing, so it uses C library functions like fputs and fprintf. For the second process, the pipe behaves like a file open for reading, so it uses fgets and fscanf.
Reusing the file abstraction makes life easier for programmers, since they
only have to learn one API (application program interface). It also makes
programs more versatile, since a program intended to work with files can
also work with data coming from pipes and other sources.
Chapter 5

More bits and bytes
For negative numbers, the most obvious representation uses a sign bit to indicate whether a number is positive or negative. But there is another representation, called “two's complement,” which is much more common because it is easier to work with in hardware.
In two’s complement, the leftmost bit acts like a sign bit; it is 0 for positive
numbers and 1 for negative numbers.
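For example, here is the 8-bit two's complement representation of 5 and -5 (to negate a number, invert the bits and add 1):

 5 = b0000 0101
-5 = b1111 1011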
To convert from an 8-bit number to 16-bits, we have to add more 0’s for a
positive number and add 1’s for a negative number. In effect, we have to
copy the sign bit into the new bits. This process is called “sign extension.”
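For example, extending -5 from 8 to 16 bits copies the sign bit into the new positions:

-5 as 8 bits:  b1111 1011
-5 as 16 bits: b1111 1111 1111 1011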
In C all integer types are signed (able to represent positive and negative
numbers) unless you declare them unsigned. Operations on unsigned inte-
gers don’t use sign extension.
For example, & computes the AND operation, which yields 1 if both
operands are 1, and 0 otherwise. Here is an example of & applied to two
4-bit numbers:
1100
& 1010
----
1000
Similarly, | computes the OR operation, which yields 1 if either operand is 1:

1100
| 1010
----
1110
And ^ computes the XOR operation, which yields 1 if exactly one operand is 1:

1100
^ 1010
----
0110
Most commonly, & is used to clear a set of bits from a bit vector, | is used to
set bits, and ^ is used to flip, or “toggle” bits. Here are the details:
Clearing bits: For any value x, x&0 is 0, and x&1 is x. So if you AND a
vector with 3, it selects only the two rightmost bits, and sets the rest to 0.
xxxx
& 0011
----
00xx
In this context, the value 3 is called a “mask” because it selects some bits
and masks the rest.
Setting bits: Similarly, for any value x, x|0 is x, and x|1 is 1. So if you OR a vector with 3, it sets the two rightmost bits and leaves the rest alone:

xxxx
| 0011
----
xx11
Toggling bits: Finally, if you XOR a vector with 3, it flips the rightmost bits
and leaves the rest alone. As an exercise, see if you can compute the two’s
complement of 12 using ^. Hint: what’s the two’s complement representa-
tion of -1?
C also provides shift operators, << and >>, which shift bits left and right. Each left shift doubles a number, so 5 << 1 is 10, and 5 << 2 is 20. Each right shift divides by two (rounding down), so 5 >> 1 is 2 and 2 >> 1 is 1.
Most computers use the IEEE standard for floating-point arithmetic. The C
type float usually corresponds to the 32-bit IEEE standard; double usually
corresponds to the 64-bit standard.
In the 32-bit standard, the leftmost bit is the sign bit, s. The next 8 bits are
the exponent, q, and the last 23 bits are the coefficient, c. The value of a
floating-point number is
(−1)^s · c · 2^q
Well, that’s almost correct, but there is one more wrinkle. Floating-point
numbers are usually normalized so that there is one digit before the point.
For example, in base 10, we prefer 2.998 · 108 rather than 2998 · 105 or any
other equivalent expression. In base 2, a normalized number always has the
digit 1 before the binary point. Since the digit in this location is always 1,
we can save space by leaving it out of the representation.
Well, that’s almost correct. But there’s one more wrinkle. The exponent is
stored with a “bias”. In the 32-bit standard, the bias is 127, so the exponent
3 would be stored as 130.
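For example, take -13.0: in binary, 13 is b1101, so -13.0 is -1.101 · 2^3 in normalized form. The sign bit is 1, the stored exponent is 3 + 127 = 130, and the coefficient, with the leading 1 left out, is 101 followed by 20 zeros. The following code unpacks these pieces: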
union {
    float f;
    unsigned int u;
} p;

p.f = -13.0;
unsigned int sign = (p.u >> 31) & 1;
unsigned int exp = (p.u >> 23) & 0xff;

unsigned int coef_mask = (1 << 23) - 1;
unsigned int coef = p.u & coef_mask;

printf("%d\n", sign);
printf("%d\n", exp);
printf("0x%x\n", coef);
The union allows us to store a floating-point value using p.f and then read
it as an unsigned integer using p.u.
To get the sign bit, we shift the bits to the right 31 places and then use a 1-bit
mask to select only the rightmost bit.
To get the exponent, we shift the bits 23 places, then select the rightmost 8
bits (the hexadecimal value 0xff has eight 1’s).
To get the coefficient, we need to extract the 23 rightmost bits and ignore the
rest. We do that by making a mask with 1s in the 23 rightmost places and 0s
on the left. The easiest way to do that is by shifting 1 to the left by 23 places
and then subtracting 1.
The output of this program is:
1
130
0x500000
As expected, the sign bit for a negative number is 1. The exponent is 130,
including the bias. And the coefficient, which I printed in hexadecimal, is
101 followed by 20 zeros.
As an exercise, try assembling and disassembling a double, which uses
the 64-bit standard. See http://en.wikipedia.org/wiki/IEEE_floating_point.
Next I’ll define a function that creates a smaller array and deliberately ac-
cesses elements before the beginning and after the end:
void f2() {
    int x = 17;
    int array[10];
    int y = 123;

    printf("%d\n", array[-2]);    /* reads before the beginning */
    printf("%d\n", array[-1]);
    printf("%d\n", array[10]);    /* reads past the end */
    printf("%d\n", array[11]);
}
When f2 runs, the output looks like this:

17
123
98
99
The details here depend on the compiler, which arranges variables on the
stack. From these results, we can infer that the compiler put x and y next
each other, “below” the array (at a lower address). And when we read past
the array, it looks like we are getting values that were left on the stack by
the previous function call.
In this example, all of the variables are integers, so it is relatively easy to fig-
ure out what is going on. But in general when you read beyond the bounds
of an array, the values you read might have any type. For example, if I
change f1 to make an array of floats, the results are:
17
123
1120141312
1120272384
The latter two values are what you get if you interpret a floating-point value
as an integer. If you encountered this output while debugging, you would
have a hard time figuring out what’s going on.
Also, remember that the letters and numbers in C strings are encoded in
ASCII. The ASCII codes for the digits “0” through “9” are 48 through 57,
not 0 through 9. The ASCII code 0 is the NUL character that marks the end
of a string. And the ASCII codes 1 through 9 are special characters used in
some communication protocols. ASCII code 7 is a bell; on some terminals,
printing it makes a sound.
The ASCII code for the letter “A” is 65; the code for “a” is 97. Here are those
codes in binary:
65 = b0100 0001
97 = b0110 0001
A careful observer will notice that they differ by a single bit. And this pat-
tern holds for the rest of the letters; the sixth bit (counting from the right)
acts as a “case bit,” 0 for upper-case letters and 1 for lower case letters.
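For example, here is a hypothetical helper (a sketch of my own, not from the text) that converts a single character using the case bit:

char to_upper_char(char c)
{
    return c ^ (1 << 5);   /* bit 5 has value 32; 'a' ^ 32 is 'A' */
}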
As an exercise, write a function that takes a string and converts from lower-
case to upper-case by flipping the sixth bit. As a challenge, you can make a
faster version by reading the string 32 or 64 bits at a time, rather than one
character at a time. This optimization is made easier if the length of the
string is a multiple of 4 or 8 bytes.
If you read past the end of a string, you are likely to see strange characters.
Conversely, if you write a string and then accidentally read it as an int or
float, the results will be hard to interpret.
For example, the ASCII representation of the first 8 characters of my name, interpreted as a double-precision floating point number, is 69779713878800585457664.
Chapter 6
Memory management
The C library provides functions for dynamic memory management, including malloc and free, and also:

• calloc, which is the same as malloc except that it also clears the newly allocated chunk; that is, it sets all bytes in the chunk to 0.

• realloc, which resizes a previously allocated chunk, copying its contents to a new location if necessary.

Using these functions correctly means following some rules:
• If you access any chunk that has not been allocated, that’s a paddling.
• If you free an allocated chunk and then access it, that’s a paddling.
• If you try to free a chunk that has not been allocated, that’s a paddling.
• If you free the same chunk more than once, that’s a paddling.
• If you call realloc with a chunk that was not allocated, or was allo-
cated and then freed, that’s a paddling.
It might not sound difficult to follow these rules, but in a large program a
chunk of memory might be allocated in one part of the program, used in
several other parts, and freed in yet another part. So changes in one part of
the program can require changes in many other parts.
To make matters worse, memory errors can be difficult to find because the
symptoms are unpredictable. For example:
• If you read a value from an unallocated chunk, the system might de-
tect the error, trigger a Segmentation Fault, and stop the program. This
outcome is desirable, because it indicates the location in the program
that caused the error. But, sadly, this outcome is rare. More often, the
program reads unallocated memory without detecting the error, and
the value is whatever happened to be stored at a particular location.
If the value is not interpreted as the right type, it will often cause pro-
gram behavior that is unexpected and hard to interpret. For example,
if you read bytes from a string and interpret them as a floating-point value, the results will be hard to interpret.
One conclusion you should draw from this is that safe memory manage-
ment requires design and discipline. If you write a library or module that
allocates memory, you should also provide an interface to free it, and mem-
ory management should be part of the API design from the beginning.
If you use a library that allocates memory, you should be disciplined in your
use of the API. For example, if the library provides functions to allocate and
deallocate storage, you should use those functions and not, for example,
call free on a chunk you did not malloc. And you should avoid keeping
multiple references to the same chunk in different parts of your program.
6.2 Memory leaks

If a program allocates memory and never frees it, that's a “memory leak.” For some programs, memory leaks are ok. For example, if your program
allocates memory, performs computations on it, and then exits, it is prob-
ably not necessary to free the allocated memory. When the program exits,
all of its memory is deallocated by the operating system. Freeing memory
immediately before exiting might feel more responsible, but it is mostly a
waste of time.
But if a program runs for a long time and leaks memory, its total memory
use will increase indefinitely. Eventually the program will run out of mem-
ory and, probably, crash. But even before that, a memory hog might slow
down other processes (for reasons we’ll see soon) or cause them to run out
of memory and fail.
Many large complex programs, like web browsers, leak memory, causing
their performance to degrade over time. Users who have observed this pat-
tern are often in the habit of restarting these programs periodically.
To see which programs on your system are using the most memory, you can
use the UNIX utilities ps and top.
6.3 Implementation
When a process starts, the system allocates space for the text segment and
statically allocated data, space for the stack, and space for the heap, which
contains dynamically allocated data.
Not all programs allocate data dynamically, so the initial size of the heap
might be small or zero. Initially the heap contains only one free chunk.
When malloc is called, it checks whether it can find a free chunk that’s big
enough. If not, it has to request more memory from the system. The func-
tion that does that is sbrk, which sets the “program break,” which you can
think of as a pointer to the end of the heap.
When sbrk is called, it allocates new pages of physical memory, updates the
process’s page table, and updates the program break.
In theory, a program could call sbrk directly (without using malloc) and
manage the heap itself. But malloc is easier to use and, for most memory-
use patterns, it runs fast and uses memory efficiently.
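To watch the program break move, you can compare sbrk(0) before and after a call to malloc. This sketch is my own illustration, not from the text; on many systems malloc may satisfy a small request from memory it has already reserved, or use mmap for large ones, in which case the break does not move:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main()
{
    void *before = sbrk(0);     /* current program break */
    void *p = malloc(4096);     /* may force the heap to grow */
    void *after = sbrk(0);

    printf("malloc returned %p\n", p);
    printf("break moved by %ld bytes\n",
           (long)((char *) after - (char *) before));
    return 0;
}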
Here are some performance characteristics of malloc and its relatives:

• The run time of malloc does not usually depend on the size of the chunk, but does depend on how many free chunks there are. free is usually fast, regardless of the number of free chunks. Because calloc clears every byte in the chunk, the run time depends on chunk size (as well as the number of free chunks).
realloc is sometimes fast, if the new size is smaller or if space is avail-
able to expand the chunk. If not, it has to copy data from the old chunk
to the new; in that case, the run time depends on the size of the old
chunk.
• Space overhead: Boundary tags and free list pointers take up space.
The minimum chunk size on most systems is 16 bytes. So for very
small chunks, malloc is not space efficient. If your program requires
large numbers of small structures, it might be more efficient to allocate
them in arrays.
• Fragmentation: If you allocate and free chunks with varied sizes, the
heap will tend to become fragmented. That is, the free space might be
broken into many small pieces. Fragmentation wastes space; it also
slows the program down by making memory caches less effective.
• Binning and caching: The free list is sorted by size into bins, so when
malloc searches for a chunk with a particular size, it knows what bin
to search in. Also, if you free a chunk and then immediately allocate a
chunk with the same size, malloc will usually be very fast.
Chapter 7
Caching
To understand caching, it helps to know a little about how the CPU runs a program. While a program is running, the CPU keeps some data in registers:

• The program counter, or PC, which contains the address (in memory) of the next instruction in the program.
• The instruction register, or IR, which contains the instruction currently
executing.
• The stack pointer, or SP, which contains the address of the stack frame
for the current function, which contains its parameters and local vari-
ables.
• General-purpose registers that hold the data the program is currently
working with.
• A status register, or flag register, that contains information about the
current computation. For example, the flag register usually contains a
bit that is set if the result of the previous operation was zero.
When a program is running, the CPU executes the following steps, called
the instruction cycle:
• Fetch: The next instruction is fetched from memory and stored in the
instruction register.
• Decode: Part of the CPU, called the control unit, decodes the instruction and sends signals to the other parts of the CPU.
• Execute: Signals from the control unit cause the appropriate compu-
tation to occur.
Most computers can execute a few hundred different instructions, called the
instruction set. But most instructions fall into a few general categories:
During each instruction cycle, one instruction is read from the program text.
In addition, about half of the instructions in a typical program load or store
data. And therein lies one of the fundamental problems of computer archi-
tecture: the memory bottleneck.
Moving data between the CPU and main memory is much slower than the rate at which the CPU can execute instructions. To bridge this gap, most computers provide a cache: a small, fast memory close to the CPU. When the CPU loads a value from memory, it stores a copy in the cache. If the same value is loaded again, the CPU gets the cached copy and doesn't have to wait for memory.
Eventually the cache gets full. Then, in order to bring something new in,
we have to kick something out. So if the CPU loads a value and then loads
it again much later, it might not be in cache any more.
The cache hit rate, h, is the fraction of memory accesses that find data in
cache; the miss rate, m, is the fraction of memory accesses that have to go
to memory. If the time to process a cache hit is Th and the time for a cache
miss is Tm , the average time for each memory access is
hTh + mTm
Equivalently, we could define the miss penalty as the extra time to process
a cache miss, Tp = Tm − Th . Then the average access time is
Th + mTp
When the miss rate is low, the average access time can be close to Th . That
is, the program can perform as if memory ran at cache speeds.
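For example, with illustrative numbers: if Th is 1 ns, the miss penalty Tp is 100 ns, and the miss rate m is 1%, the average access time is 1 ns + 0.01 · 100 ns = 2 ns, only twice the cache speed.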
7.2 Locality
When a program reads a byte for the first time, the cache usually loads
a block or line of data that includes the requested byte and some of its
neighbors. If the program goes on to read one of the neighbors, it will find
it in cache.
As an example, suppose that the block size is 64 B. And suppose you read
a string with length 64, and the first byte of the string happens to fall at
the beginning of a block. When you load the first byte, you would incur a
miss penalty, but after that the rest of the string would be in cache. After
reading the whole string, the hit rate would be 63/64. If the string spans
two blocks, you would incur 2 miss penalties. But even then the hit rate
would be 62/64, or almost 97%.
On the other hand, if the program jumps around unpredictably, reading
data from scattered locations in memory, and seldom accessing the same
location twice, cache performance would be poor.
The tendency of a program to use the same data more than once is called
temporal locality. The tendency to use data in nearby locations is called
spatial locality. Fortunately, many programs naturally display both kinds of locality.
The next section explores the relationship between a program’s access pat-
tern and cache performance.
iters = 0;
do {
    sec0 = get_seconds();

    for (index = 0; index < limit; index += stride) {
        array[index] = array[index] + 1;
    }

    iters = iters + 1;
    sec = sec + (get_seconds() - sec0);
} while (sec < 0.1);
The inner for loop traverses the array. limit determines how much of the
array it traverses; stride determines how many elements it skips over. For
example, if limit is 16 and stride is 4, the loop would increment elements
0, 4, 8, and 12.
sec keeps track of the total CPU time used by the inner loop. The outer
loop runs until sec exceeds 0.1 seconds, which is long enough that we can
compute the average time with sufficient precision.
double get_seconds(){
struct timespec ts;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
return ts.tv_sec + ts.tv_nsec / 1e9;
}
To isolate the time to access the elements of the array, the program runs a
second loop that is almost identical except that the inner loop doesn’t touch
the array; it always increments the same variable:
iters2 = 0;
do {
    sec0 = get_seconds();

    iters2 = iters2 + 1;
    sec = sec - (get_seconds() - sec0);
} while (iters2 < iters);

Figure 7.1: Average miss penalty as a function of array size and stride. (The x-axis shows array sizes from 2^12 to 2^26 B; each curve corresponds to a stride, from 16 B to 512 B.)

Here is what we expect to see:
• The program reads through the array many times, so it has plenty
of temporal locality. If the entire array fits in cache, we expect the
average miss penalty to be near 0.
• When the stride is 4 bytes, we read every element of the array, so the
program has plenty of spatial locality. If the block size is big enough
to contain 64 elements, for example, the hit rate would be 63/64, even
if the array does not fit in cache.
• If the stride is equal to the block size (or greater), the spatial locality
is effectively zero, because each time we read a block, we only access
one element. In that case we expect to see the maximum miss penalty.
In Figure 7.1, cache performance is good, for all strides, as long as the array is less than 2^22 B. We can infer that the cache size is near 4 MiB; in fact, according to the specs, it is 3 MiB.
Many processors use multi-level caches that include a small, fast cache and a bigger, slower cache. In this example, it looks like the miss penalty increases a little when the array size is bigger than 2^14 B, so it's possible that this processor also has a 16 KiB cache with an access time less than 1 ns.
For example, if you are working with a large array, it might be faster to
traverse the array once, performing several operations with each element,
rather than traversing the array several times.
If you are working with a 2-D array, it might be stored as an array of rows. If you traverse through the elements, it would be faster to go row-wise, so the stride is the element size, than column-wise, with stride equal to the row length.

Caching happens at every level of the memory hierarchy, and every level has to answer the same basic questions of caching:
• What gets moved? In general, block sizes are small at the top of the
hierarchy, and bigger at the bottom. In a memory cache, a typical block
size is 128 B. Pages in memory might be 4 KiB, but when the operating
system reads a file from disk, it might read 10 or 100 blocks at a time.
• When does data get moved? In the most basic cache, data gets moved
into cache when it is used for the first time. But many caches use some
kind of prefetching, meaning that data is loaded before it is explicitly
requested. We have already seen a simple form of preloading: loading
an entire block when only part of it is requested.
• Where in the cache does the data go? When the cache is full, we can’t
bring anything in without kicking something out. Ideally, we want to
keep data that will be used again soon and replace data that will not
be used again.
The answers to these questions make up the cache policy. Near the top of
the hierarchy, cache policies tend to be simple because they have to be fast
and they are implemented in hardware. Near the bottom of the hierarchy,
there is more time to make decisions, and well-designed policies can make
a big difference.
Most cache policies are based on the principle that history repeats itself;
if we have information about the recent past, we can use it to predict the
immediate future. For example, if a block of data has been used recently,
we expect it to be used again soon. This principle suggests a replacement
policy called “least recently used,” or LRU, which removes from the cache
a block of data that has not been used recently. For more on this topic, see
http://en.wikipedia.org/wiki/Cache_algorithms.
Chapter 8
Multitasking
In many current systems, the CPU contains multiple cores, which means it
can run several processes at the same time. In addition, each core is capable
of multitasking, which means it can switch from one process to another
quickly, creating the illusion that many processes are running at the same
time.
The part of the operating system that implements multitasking is the kernel.
In a nut or seed, the kernel is the innermost part, surrounded by a shell. In
an operating system, the kernel is the lowest level of software, surrounded
by several other layers, including an interface called a “shell.” Computer
scientists love extended metaphors.
It’s hard to specify which parts of the operating system should be in the
kernel. But at its most basic, the kernel’s job is to handle interrupts. An
interrupt is an event that stops the normal instruction cycle and causes the
flow of execution to jump to a special section of code called an interrupt
handler.
1. When the interrupt occurs, the hardware saves the program counter
in a special register and jumps to the appropriate interrupt handler.
2. The interrupt handler stores the program counter and the flag register
in memory, along with the contents of any data registers it plans to
use.
3. The interrupt handler runs whatever code is needed to handle the in-
terrupt.
4. Then it restores the contents of the saved registers. Finally, it restores
the program counter of the interrupted process, which has the effect
of jumping back to the interrupted instruction.
If this mechanism works correctly, there is generally no way for the inter-
rupted process to know there was an interruption, unless it detects small
changes in the time between instructions.
While a program is running, the kernel keeps track of its state, which is one of the following:

• Running, if the process is currently running on a core.

• Ready, if the process could be running, but isn't, usually because there are more runnable processes than cores.

• Blocked, if the process cannot run because it is waiting for a future event like user input or the completion of an I/O request.

• Done, if the process has completed, but has exit status information that has not been read yet.
Here are the events that cause a process to transition from one state to an-
other:
• When a process calls exit, the interrupt handler stores the exit code
in the PCB and changes the process’s state to done.
8.4 Scheduling
As we saw in Section 2.3 there might be hundreds of processes on a com-
puter, but usually most of them are blocked. Most of the time, there are
only a few processes that are ready or running. When an interrupt occurs,
the scheduler decides which process to start or resume.
On a workstation or laptop, the primary goal of the scheduler is to minimize
response time; that is, the computer should respond quickly to user actions.
Response time is also important on a server, but in addition the scheduler
might try to maximize throughput, which is the number of requests that
complete per unit of time.
Usually the scheduler doesn’t have much information about what processes
are doing, so its decisions are based on a few heuristics:
• A process that is I/O-bound, spending most of its time waiting for input or output, will not run faster with more CPU time, while a CPU-bound process will. Finally, a process that interacts with the user is probably blocked, most of the time, waiting for user actions.
The operating system can sometimes classify processes based on their
past behavior, and schedule them accordingly. For example, when an
interactive process is unblocked, it should probably run immediately,
because a user is probably waiting for a reply. On the other hand, a
CPU-bound process that has been running for a long time might be
less time-sensitive.
• If a process is likely to run for a short time and then make a blocking
request, it should probably run immediately, for two reasons: (1) if
the request takes some time to complete, we should start it as soon as
possible, and (2) it is better for a long-running process to wait for a
short one, rather than the other way around.
As an analogy, suppose you are making an apple pie. The crust takes
5 minutes to prepare, but then it has to chill for half an hour. It takes
20 minutes to prepare the filling. If you prepare the crust first, you can
prepare the filling while the crust is chilling, and you can finish the
pie in 35 minutes. If you prepare the filling first, the process takes 55
minutes.
• If a process makes a request and blocks before its time slice is com-
plete, it is more likely to be interactive or I/O-bound, so its priority
should go up.
• If a task blocks for a long time and then becomes ready, it should get
a priority boost so it can respond to whatever it was waiting for.
• The system call nice allows a process to decrease (but not increase)
its own priority, allowing programmers to pass explicit information to
the scheduler.
Chapter 9

Cleaning up POSIX threads

To use Pthreads, you include the required header files at the beginning of your program; when you compile, you link with the Pthread library. My example programs include the following headers:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <semaphore.h>
The first two are standard; the third is for Pthreads and the fourth is for
semaphores. To compile with the Pthread library in gcc, you can use the -l
option on the command line:
$ gcc -g -O2 -o array array.c -lpthread
This compiles array.c with debugging info and optimization, links with
the Pthread library, and generates an executable named array.
If you are used to a language like Python that provides exception handling,
you will probably be annoyed with languages like C that require you to
check for error conditions explicitly. I often mitigate this hassle by wrapping
library function calls together with their error-checking code inside my own
functions. For example, here is a version of malloc that checks the return
value.
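void *check_malloc(int size)
{
    void *p = malloc(size);
    if (p == NULL) {
        perror("malloc failed");
        exit(-1);
    }
    return p;
}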
When you create a thread with pthread_create, it fills in a pthread_t, which you can think of as a handle for the new thread (the return value of pthread_create itself is an error code). You shouldn't have to worry about the implementation of pthread_t, but you do have to know that it has the semantics of a primitive type. You can think of a thread handle as an immutable
value, so you can copy it or pass it by value without causing problems. I
point this out now because it is not true for semaphores, which I will get to
in a minute.
typedef struct {
int counter;
} Shared;
Shared *make_shared ()
{
Shared *shared = check_malloc (sizeof (Shared));
shared->counter = 0;
return shared;
}
Now that we have a shared data structure, let’s get back to pthread_create.
The first parameter is a pointer to a function that takes a void pointer and
returns a void pointer. If the syntax for declaring this type makes your
eyes bleed, you are not alone. Anyway, the purpose of this parameter is to
specify the function where the execution of the new thread will begin. By
convention, this function is named entry:
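void *entry(void *arg)
{
    /* cast the void pointer back to Shared; child_code, defined by
       the program, does the real work of the thread */
    Shared *shared = (Shared *) arg;
    child_code(shared);
    pthread_exit(NULL);
}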
When the parent thread wants to wait for a child thread to finish, it calls pthread_join; the parameter is the handle of the thread you want to wait for. My wrapper, join_thread, just calls pthread_join and checks the result:
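void join_thread(pthread_t thread)
{
    int ret = pthread_join(thread, NULL);
    if (ret != 0) perror_exit("pthread_join failed");
}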
9.4 Semaphores
The POSIX standard specifies an interface for semaphores. This inter-
face is not part of Pthreads, but most UNIXes that implement Pthreads
also provide semaphores. If you find yourself with Pthreads and without
semaphores, you can make your own; see Section 10.2.
POSIX semaphores have type sem_t. You shouldn’t have to know about the
implementation of this type, but you do have to know that it has structure
semantics, which means that if you assign it to a variable you are making a
copy of the contents of a structure. Copying a semaphore is almost certainly
a bad idea. In POSIX, the behavior of the copy is undefined.
In my programs, I use capital letters to denote types with structure seman-
tics, and I always manipulate them with pointers. Fortunately, it is easy to
put a wrapper around sem_t to make it behave like a proper object. Here is
the typedef and the wrapper that creates and initializes semaphores:
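typedef sem_t Semaphore;

Semaphore *make_semaphore(int value)
{
    Semaphore *sem = check_malloc(sizeof(Semaphore));
    int n = sem_init(sem, 0, value);
    if (n == -1) perror_exit("sem_init failed");
    return sem;
}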
Chapter 10

Synchronization in C

In this chapter we use the Pthread utilities from Chapter 9 to write a multithreaded program in which several child threads increment a shared counter. The threads share this structure:
typedef struct {
int counter;
int end;
int *array;
} Shared;
And here is make_shared, which allocates the structure, initializes the counter, and allocates the array:

Shared *make_shared (int end)
{
    Shared *shared = check_malloc (sizeof (Shared));
    shared->counter = 0;
    shared->end = end;
    shared->array = check_malloc (shared->end * sizeof (int));
    return shared;
}
int main ()
{
    int i;
    pthread_t child[NUM_CHILDREN];

    Shared *shared = make_shared (100000);   /* hypothetical end value, consistent with the output below */

    for (i = 0; i < NUM_CHILDREN; i++) {
        child[i] = make_thread (entry, shared);
    }

    for (i = 0; i < NUM_CHILDREN; i++) {
        join_thread (child[i]);
    }

    check_array (shared);
    return 0;
}
The first loop creates the child threads; the second loop waits for them to
complete. When the last child has finished, the parent invokes check_array
to check for errors. make_thread and join_thread are defined in Chapter 9.
Here is child_code, the code each child thread runs:

void child_code (Shared *shared)
{
    printf ("Starting child at counter %d\n", shared->counter);

    while (1) {
        if (shared->counter >= shared->end) {
            return;
        }

        shared->array[shared->counter]++;
        shared->counter++;

        if (shared->counter % 10000 == 0) {
            printf ("%d\n", shared->counter);
        }
    }
}
Each time through the loop, the child threads use counter as an index
into array and increment the corresponding element. Then they increment
counter and check to see if they’re done.
You can download this program (including the cleanup code) from
greenteapress.com/semaphores/counter.c
If you compile and run the program, you should see output like this:
Starting child at counter 0
10000
20000
30000
40000
50000
60000
70000
80000
90000
Child done.
But as end gets bigger, there are more context switches between the children.
On my system I start to see errors when end is 100,000,000.
To synchronize the threads, we can add a semaphore, used as a mutex, to the shared structure:

typedef struct {
    int counter;
    int end;
    int *array;
    Semaphore *mutex;
} Shared;
make_shared initializes the mutex along with the other fields:

Shared *make_shared (int end)
{
    Shared *shared = check_malloc (sizeof(Shared));
    shared->counter = 0;
    shared->end = end;
    shared->array = check_malloc (shared->end * sizeof(int));
    shared->mutex = make_semaphore(1);
    return shared;
}

And in child_code, the loop body is protected by the mutex:

void child_code (Shared *shared)
{
    while (1) {
        sem_wait(shared->mutex);
        if (shared->counter >= shared->end) {
            sem_signal(shared->mutex);
            return;
        }

        shared->array[shared->counter]++;
        shared->counter++;
        sem_signal(shared->mutex);
    }
}
There is nothing too surprising here; the only tricky thing is to remember to
release the mutex before the return statement.
10.2 Make your own semaphores

Puzzle: read about mutexes and condition variables, and then use them to write an implementation of semaphores.
You might want to use the following utility code in your solutions. Here is
my wrapper for Pthreads mutexes:
typedef pthread_mutex_t Mutex;

Mutex *make_mutex ()
{
    Mutex *mutex = check_malloc (sizeof(Mutex));
    int n = pthread_mutex_init (mutex, NULL);
    if (n != 0) perror_exit ("make_mutex failed");
    return mutex;
}
And here is my wrapper for Pthreads condition variables:

typedef pthread_cond_t Cond;

Cond *make_cond ()
{
    Cond *cond = check_malloc (sizeof(Cond));
    int n = pthread_cond_init (cond, NULL);
    if (n != 0) perror_exit ("make_cond failed");
    return cond;
}
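The lock, unlock, wait, and signal wrappers follow the same pattern. Here is a minimal sketch of my own, consistent with how the wrappers are used below:

void mutex_lock (Mutex *mutex)
{
    int n = pthread_mutex_lock (mutex);
    if (n != 0) perror_exit ("lock failed");
}

void mutex_unlock (Mutex *mutex)
{
    int n = pthread_mutex_unlock (mutex);
    if (n != 0) perror_exit ("unlock failed");
}

void cond_wait (Cond *cond, Mutex *mutex)
{
    int n = pthread_cond_wait (cond, mutex);
    if (n != 0) perror_exit ("cond_wait failed");
}

void cond_signal (Cond *cond)
{
    int n = pthread_cond_signal (cond);
    if (n != 0) perror_exit ("cond_signal failed");
}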
typedef struct {
int value, wakeups;
Mutex *mutex;
Cond *cond;
} Semaphore;
value is the value of the semaphore. wakeups counts the number of pending
signals; that is, the number of threads that have been woken but have not
yet resumed execution. The reason for wakeups is to make sure that our
semaphores have Property 3, described in Section ??.
mutex provides exclusive access to value and wakeups; cond is the condition
variable threads wait on if they wait on the semaphore.
Here is the implementation of sem_wait:

void sem_wait (Semaphore *semaphore)
{
    mutex_lock (semaphore->mutex);
    semaphore->value--;

    if (semaphore->value < 0) {
        do {
            cond_wait (semaphore->cond, semaphore->mutex);
        } while (semaphore->wakeups < 1);
        semaphore->wakeups--;
    }
    mutex_unlock (semaphore->mutex);
}
And here is sem_signal:

void sem_signal (Semaphore *semaphore)
{
    mutex_lock (semaphore->mutex);
    semaphore->value++;

    if (semaphore->value <= 0) {
        semaphore->wakeups++;
        cond_signal (semaphore->cond);
    }
    mutex_unlock (semaphore->mutex);
}
Most of this is straightforward; the only thing that might be tricky is the do...while loop in sem_wait. This is an unusual way to use a condition variable, but in this case it is necessary.

Puzzle: why can't we replace this do...while loop with a while loop?
With the do...while loop, it is guaranteed that when a thread signals, one of the waiting threads will get the signal, even if another thread gets the mutex before one of the waiting threads resumes.