Synthesis - An Efficient Implementation of Fundamental Operating System Services
Henry Massalin
Columbia University
1992
© Henry Massalin 1992
ALL RIGHTS RESERVED
Synthesis: An Efficient Implementation
of Fundamental Operating System Services
Henry Massalin
Abstract
This dissertation shows that operating systems can provide fundamental services
an order of magnitude more efficiently than traditional implementations. It describes
the implementation of a new operating system kernel, Synthesis, that achieves this
level of performance.
The Synthesis kernel combines several new techniques to provide high performance
without sacrificing the expressive power or security of the system. The new ideas
include:

- Run-time code synthesis, a systematic way of creating executable machine code at
  runtime to optimize frequently-used kernel routines (queues, buffers, context
  switchers, interrupt handlers, and system call dispatchers) for specific situations,
  greatly reducing their execution time.

- Fine-grain scheduling, a new process-scheduling technique based on the idea of
  feedback that performs frequent scheduling actions and policy adjustments (at
  sub-millisecond intervals), resulting in an adaptive, self-tuning system that can
  support real-time data streams.

- Lock-free optimistic synchronization, shown to be a practical, efficient alternative
  to lock-based synchronization methods for the implementation of multiprocessor
  operating system kernels.

- An extensible kernel design that provides for simple expansion to support new kernel
  services and hardware devices while allowing a tight coupling between the kernel and
  the applications, blurring the distinction between user and kernel services.

The result is a significant performance improvement over traditional operating system
implementations, in addition to providing new services.
Contents

Table of Contents
List of Figures
List of Tables

1 Introduction
  1.1 Purpose
  1.2 History and Motivation
  1.3 Synthesis Overview
      1.3.1 Kernel Structure
      1.3.2 Implementation Ideas
      1.3.3 Implementation Language
      1.3.4 Target Hardware
      1.3.5 Unix Emulator

2 Previous Work
  2.1 Overview
  2.2 The Tradeoff Between Throughput and Latency
  2.3 Kernel Structure
      2.3.1 The Trend from Monolithic to Diffuse
      2.3.2 Services and Interfaces
      2.3.3 Managing Diverse Types of I/O
      2.3.4 Managing Processes

3 Kernel Code Generator
  3.1 Fundamentals
  3.2 Methods of Runtime Code Generation
      3.2.1 Factoring Invariants
      3.2.2 Collapsing Layers
      3.2.3 Executable Data Structures
      3.2.4 Performance Gains
  3.3 Uses of Code Synthesis in the Kernel
      3.3.1 Buffers and Queues
      3.3.2 Context Switches
      3.3.3 Interrupt Handling
      3.3.4 System Calls
  3.4 Other Issues
      3.4.1 Kernel Size
      3.4.2 Protecting Synthesized Code
      3.4.3 Non-coherent Instruction Cache
  3.5 Summary

4 Kernel Structure
  4.1 Quajects
      4.1.1 Quaject Interfaces
      4.1.2 Creating and Destroying Quajects
      4.1.3 Resolving References
      4.1.4 Building Services
      4.1.5 Summary
  4.2 Procedure-Based Kernel
      4.2.1 Calling Kernel Procedures
      4.2.2 Protection
      4.2.3 Dynamic Linking
  4.3 Threads of Execution
      4.3.1 Execution Modes
      4.3.2 Thread Operations
      4.3.3 Scheduling
  4.4 Input and Output
      4.4.1 Producer/Consumer
      4.4.2 Hardware Devices
  4.5 Virtual Memory
  4.6 Summary

5 Concurrency and Synchronization
  5.1 Synchronization in OS Kernels
      5.1.1 Disabling Interrupts
      5.1.2 Locking Synchronization Methods
      5.1.3 Lock-Free Synchronization Methods
      5.1.4 Synthesis Approach
  5.2 Lock-Free Quajects
      5.2.1 Simple Linked Lists
      5.2.2 Stacks and Queues
      5.2.3 General Linked Lists
      5.2.4 Lock-Free Synchronization Overhead
  5.3 Threads
      5.3.1 Scheduling and Dispatching
      5.3.2 Thread Operations
      5.3.3 Cost of Thread Operations
  5.4 Summary

6 Fine-Grain Scheduling
  6.1 Scheduling Policies and Mechanisms
  6.2 Principles of Feedback
      6.2.1 Hardware Phase Locked Loop
      6.2.2 Software Feedback
      6.2.3 FLL Example
      6.2.4 Application Domains
  6.3 Uses of Feedback in Synthesis
      6.3.1 Real-Time Signal Processing
      6.3.2 Rhythm Tracking and The Automatic Drummer
      6.3.3 Digital Oversampling Filter
      6.3.4 Discussion
  6.4 Other Applications
      6.4.1 Clocks
      6.4.2 Real-Time Scheduling
      6.4.3 Multiprocessor and Distributed Scheduling
  6.5 Summary

7 Measurements and Evaluation
  7.1 Measurement Environment
      7.1.1 Hardware
      7.1.2 Software
  7.2 User-Level Measurements
      7.2.1 Comparing Synthesis with SUNOS 3.5
      7.2.2 Comparing Window Systems
  7.3 Detailed Measurements
      7.3.1 File and Device I/O
      7.3.2 Virtual Memory
      7.3.3 Window System
      7.3.4 Other Figures
  7.4 Experience
      7.4.1 Assembly Language
      7.4.2 Porting Synthesis to the Sony NEWS Workstation
      7.4.3 Architecture Support
  7.5 Other Opinions

8 Conclusion

Bibliography

A Unix Emulator Test Programs

List of Figures

3.1 Hand-crafted assembler implementation of a buffer
3.2 Better buffer implementation using code synthesis
3.3 Context Switch
3.4 Thread Context
3.5 Synthesized Code for Sound Interrupt Processing -- CD Active
3.6 Sound Interrupt Processing, Hand-Assembler
3.7 Sound Interrupt Processing, C Code
3.8 User-to-Kernel Procedure Call
4.1 Queue Quaject
4.2 Blocking write
4.3 Non-blocking write
5.1 Atomic Update of Single-Word Data
5.2 Definition of Compare-and-Swap
5.3 Definition of Double-Word Compare-and-Swap
5.4 Insert and Delete at Head of Singly-Linked List
5.5 Stack Push and Pop
5.6 Queue Put and Get
5.7 Linked List Traversal
5.8 Lock-Free Delete from Head of Singly-Linked List
5.9 Locked Delete from Head of Singly-Linked List
5.10 Thread State Transition Diagram
6.1 PLL Picture
6.2 Relationship between ILL and FLL
6.3 General FLL
6.4 Low-pass Filter
6.5 Integrator Filter
6.6 Derivative Filter
6.7 Program to Play a CD
6.8 Two Processors, Static Scheduling
6.9 Two Processors, Fine-Grain Scheduling
A.1 Test 1: Compute
A.2 Test 2, 3, and 4: Read/Write to a Pipe
A.3 Test 5 and 6: Opening and Closing
A.4 Test 7: Read/Write to a File

List of Tables

2.1 Overhead of Various System Calls, Unix Release 4.0C
2.2 Overhead of Various System Calls, Mach
3.1 CPU Cycles for Buffer-Put
3.2 Comparison of C-Language "stdio" Libraries
3.3 Cost of Thread Scheduling and Context Switch
3.4 Processing Time for Sound-IO Interrupts
3.5 Cost of Null System Call
3.6 Kernel Memory Requirements
4.1 List of Basic Quajects
4.2 Interface to I/O Quajects
4.3 Interface to other Kernel Quajects
4.4 Interface to Device Quajects
5.1 Comparison of Different Synchronization Methods
5.2 Thread operations
5.3 Overhead of Thread Scheduling and Context Switch
7.1 Measured Unix System Calls (in seconds)
7.2 Time to "cat /etc/termcap" to an 80*24 TTY window
7.3 File and Device I/O (in microseconds)
7.4 Low-level Memory Management Overhead (Page Size = 4KB)
7.5 Selected Window System Operations
Acknowledgements

Many people contributed to making this research effort a success.

First and foremost, I want to thank my advisor, Calton Pu. He was instrumental in
bringing this thesis to fruition. He helped clarify the ideas buried in my "collection
of fast assembly-language routines," and his dedication through difficult times
encouraged me to keep pushing forward. Without him, this dissertation would not exist.

I am also greatly indebted to the other members of my committee: Dan Duchamp,
Bob Sproull, Sal Stolfo, and John Zahorjan. Their valuable insight and timely
suggestions helped speed this dissertation to completion.

My sincerest appreciation and deepest `Qua!'s go to Renate Valencia. Her unselfish
love and affection and incredible amount of emotional support helped me through some
of my darkest hours here at Columbia and gave me the courage to continue on. Thanks
also to Matthew, her son, for letting me borrow Goofymeyer, his stuffed dog.

Many other friends in many places have helped in many ways; I am grateful to Emilie
Dao for her generous help and support trying to help me understand myself and for the
fun times we had together; to John Underkoffler and Clea Waite for their ear in times
of personal uncertainty; to Mike Hawley and Olin Shivers, for their interesting
conversation, rich ideas, and untiring willingness to "look at a few more sentences";
to Ken Phillips, for the thoughts we shared over countless cups of coffee; to Mort
Meyerson, whose generosity in those final days helped to dissipate some of the
pressure; to Brewster Kahle, who always has a ready ear and a warm hug to offer; to
Domenic Frontiere and family, who are some of the most hospitable people I know; and
to all my friends at Cooper Union, who made my undergrad and teaching years there so
enjoyable.

I also wish to thank Ming-Chao Chiang, Tom Matthews, and Tim Jones, the project
students who worked so hard on parts of the Synthesis system. Thanks also go to all
the people in the administrative offices, particularly Germaine, who made sure all
the paperwork flowed smoothly between the various offices and who helped schedule my
thesis defense on record short notice. I particularly want to thank my friends here
at Columbia -- Cliff Beshers, Shu-Wie Chen, Ashutosh Dutta, Edward Hee, John
Ioannidis, Paul Kanevsky, Fred Korz, David Kurlander, Jong Lim, James Tanis, and
George Wolberg, to name just a few. The countless dinners, good times, and piggy-back
rides we shared helped make my stay here that much more enjoyable.

I also wish to extend special thanks to the people at the University of Washington,
especially Ed Lazowska, Hank Levy, Ed Felten, David Keppel (a.k.a. Pardo), Dylan
McNamee, and Raj Vaswani, whose boundless energy and happiness always gave me
something to look forward to when visiting Seattle or traveling to conferences and
workshops. Special thanks to Raj, Dylan, Ed Felten, and Jan and Denny Prichard, for
creating that `carry' tee shirt and making me feel special; and to Lauren Bricker,
Denise Draper, and John Zahorjan for piggy-back rides of unparalleled quality and
length.

Thanks go to Sony Corporation for the use of their machine; to Motorola for supplying
most of the parts used to build my computer, the Quamachine; and to Burr-Brown for
their generous donation of digital audio chips.

And finally, I want to thank my family, whose patience endured solidly to the end.
Thanks to my mother and father, who always welcomed me home even when I was too busy
to talk to them. Thanks, too, to my sister Lucy, sometimes the only person with whom
I could share my feelings, and to my brother, Peter, who is always challenging me to
a bicycle ride.

In appreciation, I offer to all a warm, heartfelt

-- Qua! --
1 Introduction
I must Create a System, or be enslav'd by another Man's;
I will not Reason and Compare: my business is to Create.
-- William Blake, Jerusalem
1.1 Purpose
This dissertation shows that operating systems can provide fundamental services
an order of magnitude more efficiently than traditional implementations. It describes
the implementation of a new operating system kernel, Synthesis, that achieves this
level of performance.

The Synthesis kernel combines several new techniques to provide high performance
without sacrificing the expressive power or security of the system. The new ideas
include:

- Run-time code synthesis, a systematic way of creating executable machine code at
  runtime to optimize frequently-used kernel routines (queues, buffers, context
  switchers, interrupt handlers, and system call dispatchers) for specific situations,
  greatly reducing their execution time.
In 1983, the first Unix-based workstations were being introduced. I was unhappy
with the performance of computers of that day, particularly that of workstations
relative to what DOS-based PCs could deliver. Among other things, I found it hard to
believe that the workstations could not drive even one serial line at a full 19,200
baud -- approximately 2000 characters per second.1 I remember asking myself and
others: "There is a full half-millisecond time between characters. What could the
operating system possibly be doing for that long?" No one had a clear answer. Even at
the relatively slow machine speed of that day -- approximately one million machine
instructions per second -- the processor could execute 500 machine instructions in
the time a character was transmitted. I could not understand why 500 instructions
were not sufficient to read a character from a queue and have it available to write
to the device's control register by the time the previous one had been transmitted.
That summer, I decided to try building a small computer system and writing some
operating systems software. I thought it would be fun, and I wanted to see how far I could
get. I teamed up with a fellow student, James Arleth, and together we built the precursor
of what was later to become an experimental machine known as the Quamachine. It was a
two-processor machine based on the 68000 CPU [4], but designed in such a way that it could
be split into two independently-operating halves, so we each would have a computer to take
with us after we graduated. Jim did most of the hardware design while I concentrated on
software.
The first version of the software [19] consisted of drivers for the machine's serial
ports and 8-bit analog I/O ports, a simple multi-tasker, and an unusual debug monitor
that included a rudimentary C-language compiler/interpreter as its front end. It was
quite small
1 This is still true today despite an order-of-magnitude speed increase in the
processor hardware, and attests to a comparable increase in operating system
overhead. Specifically, the Sony NEWS 1860 workstation, running release 4.0 of Sony's
version of UNIX, places a software limit of 9600 baud on the machine's serial lines.
If I force the line to go faster through the use of kernel hackery, the operating
system loses data each time a burst longer than about 200 characters arrives at high
speed.
-- everything fit into the machine's 16 kilobyte ROM, and ran comfortably in its 16
kilobyte RAM. And it did drive the serial ports at 19,200 baud. Not just one, but all
four of them, concurrently. Even though it lacked many fundamental services, such as
a filesystem, and could not be considered a "real" operating system in the sense of
Unix, it was the precursor of the Synthesis kernel, though I did not know it at the
time.
After entering the PhD program at Columbia in the fall of 1984, I continued to
develop the system in my spare time, improving both the hardware and the software,
and also experimenting with other interests -- electronic music and signal
processing. During this time, the CPU was upgraded several times as Motorola released
new processors in the 68000 family. Currently, the Quamachine uses a 68030 processor
rated for 33 MHz, but running at 50 MHz, thanks to a homebrew clock circuit, special
memory decoding tricks, a higher-than-spec operating voltage, and an ice-cube to cool
the processor.
But as the software was fleshed out with more features and new services, it became
slower. Each new service required new code and data structures that often interacted
with other, unrelated, services, slowing them down. I saw my system slowly acquiring
the ills of Unix, going down the same road to inefficiency. This gave me insight into
the inefficiency of Unix. I noticed that, often, the mere presence of a feature or
capability incurs some cost, even when not being used. For example, as the number of
services and options multiply, extra code is required to select from among them, and
to check for possible interference between them. This code does no useful work
processing the application's data, yet it adds overhead to each and every call.
Suddenly, I had a glimmer of an idea of how to prevent this inefficiency from
creeping into my system: runtime code generation! All along I had been using a
monitor program with a C-language front end as my "shell." I could install and remove
services as needed, so that no service would impose its overhead until it was used. I
thought that perhaps there might be a way to automate the process, so that the
correct code would be created and installed each time a service was used, and
automatically removed when it was no longer needed. This is how the concept of
creating code at runtime came to be. I hoped that this could provide relief from the
inefficiencies that plague other full-featured operating systems.
I was dabbling with these ideas, still in my spare time, when Calton Pu joined the
faculty at Columbia as I was entering my third year. I went to speak with him since I
was still unsure of my research plans and looking for a new advisor. Calton brought
with him some interesting research problems, among them the efficient implementation
of object-based systems. He had labored through his dissertation and knew where the
problems were. Looking at my system, he thought that my ideas might solve that
problem one day, and encouraged me to forge ahead.
The project took shape toward the end of that semester. Calton had gone home for
Christmas, and came back with the name Synthesis, chosen for the main idea: run-time
kernel code synthesis. He helped package the ideas into a coherent set of concepts,
and we wrote our first paper in February of 1987.
I knew then what the topic of my dissertation would be. I started mapping out the
structure of the basic services and slowly restructured the kernel to use code
synthesis throughout. Every operation was subject to intense scrutiny. I recall the
joy I felt the day I discovered how to perform a "putchar" (place character into
buffer) operation in four machine instructions rather than the five I had been using
(or eight, using the common C-language macro). After all, "putchar" is a common
operation, and I found it both satisfying and amusing that eliminating one machine
instruction resulted in a 4% overall gain in performance for some of my music
applications. I continued experimenting with electronic music, which by then had
become more than a hobby, and, as shown in Section 6.3, offered a convincing
demonstration that Synthesis did deliver the kind of performance claimed.
Over time, this type of semi-playful, semi-serious work toward a fully functional
kernel inspired the other features in Synthesis -- fine-grained scheduling, lock-free
synchronization, and the kernel structure.
Fine-grained scheduling was inspired by work in music and signal processing. The
early kernel's scheduler often needed tweaking in order to get a new music synthesis
program to run in real time. While early Synthesis was fast enough to make real-time
signal processing possible by handling interrupts and context switches efficiently,
it lacked a guarantee that real-time tasks got sufficient CPU as the machine load
increased. I had considered the use of task priorities in scheduling, but decided
against them, partly because of the programming effort involved, but mostly because I
had observed other systems that used priorities, and they did not seem to fully solve
the problem. Instead, I got the idea that the scheduler could deduce how much CPU
time to give each stage of processing by measuring the data accumulation at each
stage. That is how fine-grained scheduling was born. It seemed easy enough to do, and
a few days later I had it working.
The overall structure of the kernel was another idea developed over time. Initially,
the kernel was an ad-hoc mass of procedures, some of which created code, some of
which didn't. Runtime code generation was not well understood, and I did not know the
best way to structure such a system. For each place in the kernel where code
synthesis would be beneficial, I wrote special code to do the job. But even though
the kernel was lacking in overall structure, I did not see that as negative. This was
a period where freedom to experiment led to valuable insights, and, as I found myself
repeating certain things, an overall structure gradually became clear.
Optimistic synchronization was a result of these experiments. I had started writing
the kernel using disabled interrupts to implement critical sections, as is usually
done in other single-processor operating systems. But the limitations of this method
were soon brought out in my real-time signal processing work, which depends on the
timely servicing of frequent interrupts, and therefore cannot run in a system that
disables interrupts for too long. I therefore looked for alternatives to
inter-process synchronization. I observed that in many cases, such as in a
single-producer/single-consumer queue, the producer and consumer interact only when
the queue is full or empty. During other times, they each work on different parts of
the queue, and can do so independently, without synchronization. My interest in this
area was further piqued when I read about the "Compare-and-Swap" instructions on the
68030 processor, which allow concurrent data structures to be implemented without
using locks.
the following quajects: a raw serial device driver, two queues, an input editor, an
output format converter, and a system-call dispatcher. The wide choice of quajects
and linkages allows Synthesis to support a wide range of different system interfaces
at the user level. For example, Synthesis includes a (partial) Unix emulator that
runs some SUN-3 binaries. At the same time, a different application might use a
different interface, for example, one that supports asynchronous I/O.

devices: stereo 16-bit analog output, stereo 16-bit analog input, and a compact disc
(CD) player digital interface.
The Sony NEWS 1860 is a workstation with two 68030 processors. It is a commercially
available machine, making Synthesis potentially accessible to other interested
researchers. It has two processors, which, while not a large number, nevertheless
demonstrates Synthesis multiprocessor support. While its architecture is not
symmetric -- one processor is the main processor and the other is the I/O processor
-- Synthesis treats it as if it were a symmetric multiprocessor, scheduling tasks on
either processor without preference, except those that require something that is
accessible from one processor and not the other.
2 Previous Work

If I have seen farther than others, it is because
I was standing on the shoulders of giants.
-- Isaac Newton

If I have not seen as far as others, it is because
giants were standing on my shoulders.
-- Hal Abelson

In computer science, we stand on each other's feet.
-- Brian K. Reid
2.1 Overview
This chapter sketches an overview of some of the classical goals of operating system
design and tells how existing designs have addressed them. This provides a background
against which the new techniques in Synthesis can be contrasted. I argue that some of the
classical goals need to be reconsidered in light of new requirements and point out the new
goals that have steered the design of Synthesis.
There are four areas in which Synthesis makes strong departures from classical
designs: overall kernel structure, the pervasive use of run-time code generation, the
management of concurrency and synchronization, and novel use of feedback mechanisms
in scheduling. The rest of this chapter discusses each of these four topics in turn,
but first, it is useful to consider some broad design issues.
Table 2.1: Overhead of Various System Calls, Unix Release 4.0C

System Function     Time for 1 char (µs)    Time for 1024 chars (µs)
Write to a pipe          260                      450
Write to a file          340                      420
Read from a pipe         190                      610
Read from a file         380                      460

(Sony NEWS 1860 workstation, 68030 processor, 25 MHz, 1 w/s, Unix Release 4.0C.)

Table 2.2: Overhead of Various System Calls, Mach

System Function     Time for 1 char (µs)    Time for 1024 chars (µs)
Write to a pipe          470                      740
Write to a file          370                      600
Read from a pipe         550                      760
Read from a file         350                      580
are exercised often, such as context switch and system call dispatch. We find that
operating systems have historically exhibited large invocation overheads. Due to its
popularity and wide availability, Unix is one of the more-studied systems, and I use
it here as a baseline for performance comparisons.

In one study, Feder compares the evolution of Unix system performance over time and
over different machines [13]. He studies the AT&T releases of Unix -- System 3 and
System 5 -- spanning a time period from the mid-70's to late 1982 and shows that Unix
performance had improved roughly 25% during this time. Among the measurements shown
is the time taken to execute the getpid (get process id) system call. This system
call fetches a tiny amount of information (one integer) from the kernel, and its
speed is a good indicator of the cost of the system call mechanism. For the
VAX-11/750 minicomputer, Feder reports a time of 420 microseconds for getpid and 1000
microseconds for context switch.

I have duplicated some of these experiments on the Sony NEWS workstation, a machine
of roughly 10 times the performance of the VAX-11/750. Table 2.1 summarizes the
results.1 On this machine, getpid takes 40 microseconds, and a context switch takes
170 microseconds. These numbers suggest that, since 1982, the performance of Unix has
remained relatively constant compared to the speed of the hardware.
A study done by Ousterhout [22] shows that operating system speed has not kept pace
with hardware speed. The reasons he finds are that memory bandwidth and disk speed
have not kept up with the dramatic increases in processor speed. Since operating
systems tend to make heavier use of these resources than the typical application,
this has a negative effect on operating system performance relative to how the
processor's speed is measured.
But I believe there are further reasons for the large overhead in existing systems.
As new applications demand more functionality, the tendency has been simply to layer on
more functions. This can slow down the whole system because often the mere existence of a
feature forces extra processing steps, regardless of whether that feature is being used or not.
New features often require extra code or more levels of indirection to select from among
them. Kernels become larger and more complicated, leading designers to restructure their
operating systems to manage the complexity and improve understandability and maintain-
ability. This restructuring, if not carefully done, can reduce performance by introducing
1 Even though the Sony machine has two processors, one of them is dedicated
exclusively to handling device I/O and does not run any Unix code. This second
processor does not affect the outcome of the tests, which are designed to measure
Unix system overhead, not device I/O capacity. The file read and write benchmarks
were to an in-core file system. There was no disk activity.
The factor of 5 is significant because, ultimately, programs interact with the
outside world through kernel invocations. Increasing the overhead limits the rate at
which a program can invoke the kernel and therefore, interact with the outside world.

Taken to the limit, the things that remain fast are those local to an application,
those that can be done at user-level without invoking the kernel often. But in a
world of increasing interactions and communications between machines -- all of which
require kernel intervention -- I do not think this is a wise optimization strategy.
Distributed computing stresses the importance of low latency, both because throughput
can actually suffer if machines spend time waiting for each others' responses rather
than doing work, and because there are so many interactions with other machines that
even a small delay in each is magnified, leading to uneven response time to the user.
Improvement is clearly required to ensure consistent performance and controlled
latencies, particularly when processing richer media like interactive sound and
video. For example, in an application involving 8-bit audio sampled at 8KHz, a
4096-byte buffer holds 4096 samples -- at 8000 samples per second, that is roughly a
1/2-second delay per stage of processing. This is unacceptable for real-time,
interactive audio work. The basic system overhead must be reduced so that
time-sensitive applications can use smaller buffers, reducing latency while
maintaining throughput. But there is little room for revolutionary increases in
performance when the fundamental operating system mechanisms, such as system call
dispatch and context switch, are slow, and furthermore, show no trend in becoming
faster. In general, existing designs have not focused on lower-level, low-overhead
mechanisms, preferring instead to solve performance problems with more buffering.
This dissertation shows that the unusual goal of providing high throughput with low
latency can be achieved. There are many factors in the design of Synthesis that accomplish
this result, which will be discussed at length in subsequent chapters. But let us now consider
four important aspects of the Synthesis design that depart from common precedents and
trends.
moved to user-level services. Some people argue that the size of user-level services does not
count as much, because they are pageable and are not constrained to occupy real memory.
But I argue: is it really a good idea to page out operating system services? This can only
result in increased latency and unpredictable response time.
In general, I agree that the diffusion of the kernel structure is a good idea but
find it unfortunate that current-generation kernelized systems tend to be slow, even
in spite of ongoing efforts to make them faster. Perhaps people commonly accept that
some loss of performance is the inevitable result of partitioning, and are willing to
suffer that loss in return for greatly increased maintainability and extensibility.
My dissertation shows that this need not be the case: Synthesis addresses the issues
of structuring and performance. Its quaject-based kernel structure keeps the
modularity, protection, and extensibility demanded of modern-day operating systems.
At the same time Synthesis delivers performance an order of magnitude better than
existing systems, as evidenced by the experiments in Chapter 7. Its kernel services
are subdivided into even finer chunks than kernelized systems like Mach. Any service
can be composed of pieces that run at either user- or kernel-level: the distinction
is blurred.
Synthesis breaks the batch-mode thinking that has led to systems that wait for all
the data to arrive before any subsequent processing is allowed to take place, when in
fact subsequent processing could proceed in parallel with the continuing arrival of
data. Witness a typical system's handling of network packets: the whole packet is
received, buffered, and checksummed before being handed over for further processing,
when instead the address fields could be examined and lookups performed in parallel
with the reception of the rest of the packet, reducing packet-handling latency. Some
network gateways do this type of cut-through routing for packet forwarding. But in a
general-purpose operating system, the high overhead of system calls and context
switches in existing systems discourages this type of thinking in preference to
batching. By reconsidering the design, Synthesis compounds the savings. Low-overhead
system calls and context switches encourage frequent use to better streamline
processing and take advantage of the inherent parallelism achieved by a pipeline,
reducing overhead and latency even further.
Later versions of Unix extended the model, making up some of the loss, but these
extensions were not "clean" in the sense of the original Unix design. They were added
piecemeal as the need arose. For example, ioctl (for I/O controls) and the select
system call help support out-of-band stream controls and non-blocking (polled) I/O,
but these solutions are neither general nor uniform. Furthermore, the granularity
with which Unix considers an operation "non-blocking" is measured in tens of
milliseconds. While this was acceptable for the person-typing-on-a-terminal mode of
user interaction of the early 1980's, it is clearly inappropriate for handling
higher-rate interactive data, such as sound and video.
Interactive games and real-time processing are two examples of areas where the
classic models are insufficient. Unix and its variants have no asynchronous read, for
example, that would allow a program to monitor the keyboard while also updating the
player's screen. A conceptually simple application to record a user's typing along
with its timing and later play it back with the correct timing takes several pages of
code to accomplish under Unix, and then it cannot be done well enough if, say,
instead of a keyboard we have a musical instrument.
The newer systems, such as Mach, provide extensions and new capabilities but within
the framework of the same basic model, hence the problems persist. The result is that
the finer aspects of stream control, of real-time processing, or of the handling of
time-sensitive data in general have not been satisfactorily addressed in existing
systems.
necessary. These translations can be expensive if the "distance" between the internal
format and a particular device is large. In addition, some functionality might be
lost, because common formats, however general, cannot capture everything. There could
be some features in a device that do not map well into the chosen format, and those
features become difficult if not impossible to access. Since operating systems tend
to be structured around older formats, chosen at a time when the prevalent I/O
devices were terminals and disks, it is not surprising that they have difficulty
handling the new rich-media devices, such as music and video.
Synthesis breaks this tradeoff. The quaject structuring of the kernel allows new I/O
formats to be created to suit the performance characteristics of unusual devices.
Indeed, it is not inconceivable that every device could have its own format,
specially tailored to precisely fit its characteristics. Differences between a device
format and what the application expects are spanned using translation, as in existing
systems. But unlike existing systems, where translation is used to map into a common
format, Synthesis maps directly from the device format to the needs of the
application, eliminating the intermediate, internal format and its associated
buffering and translation costs. This lets knowledgeable applications use the highly
efficient device-level interfaces when very high performance and detailed control are
of utmost importance, but also preserves the ability of any application to work with
any device, as in the Unix common-I/O approach. Since the code is runtime-generated
for each specific translation, performance is good. The efficient emulation of Unix
under Synthesis bears witness to this.
In this sense, Mach threads only add parallelism to an existing abstraction -- the
Unix process; Mach does not develop the thread idea to its fullest potential. Both
these systems lack general functions to start, stop, query, and modify an arbitrary
task's execution without arrangements having been made beforehand, for example, by
starting the task from within a debugger.

In contrast, Synthesis provides detailed thread control, comparable to the level of
control found for other operating system services, such as I/O. Section 4.3.2 lists
the operations supported, which work between any pair of threads, even between
unrelated threads in different address spaces and even on the kernel's threads, if
there is sufficient privilege. Because of their exceptionally low overhead -- only
ten to twenty times the cost of a null procedure call -- they provide unprecedented
data collection and measurement abilities and unparalleled support for debuggers.
3 Kernel Code Generator

For, behold, I create new heavens and a new earth.
-- The Bible, Isaiah
3.1 Fundamentals
Kernel code synthesis is the name given to the idea of creating executable machine
code at runtime as a means of improving operating system performance. This idea
distinguishes Synthesis from all other operating systems research efforts, and is
what helps make Synthesis efficient.

Runtime code generation is the process of creating executable machine code during
program execution for use later during the same execution [16]. This is in contrast
to the usual way, where all the code that a program runs has been created at compile
time, before program execution starts. In the case of an operating system kernel like
Synthesis, the "program" is the operating system kernel, and the term "program
execution" refers to the kernel's execution, which lasts from the time the system is
started to the time it is shut down.
There are performance benefits in doing runtime code generation because there is
more information available at runtime. Special code can be created based on the
particular data to be processed, rather than relying on general-purpose code that is
slower. Runtime code generation can extend the benefits of detailed compile-time
analysis by allowing certain data-dependent optimizations to be postponed to runtime,
where they can be done more effectively because there is more information about the
data. We want to make the best possible use of the information available at compile
time, and use runtime code generation to optimize data-dependent execution.
The goal of runtime code generation can be stated simply:

    Never evaluate something more than once.

For example, suppose that the expression "A*A + A*B + B*B" is to be evaluated for
many different A while holding B = 1. It is more efficient to evaluate the reduced
expression obtained by replacing B with 1: "A*A + A + 1". Finding opportunities for
such optimizations and performing them is the focus of this chapter.
The problem is one of knowing how soon we can know what value a variable has, and
how that information can be used to improve the program's code. In the previous
example, if it can be deduced at compile time that B = 1, then a good compiler can
perform precisely the reduction shown. But usually we can not know ahead of time what
value a variable will have. B might be the result of a long calculation whose value
is hard if not impossible to predict until the program is actually run. But when it
is run, and we know B, runtime code generation allows us to use the newly-acquired
information to reduce the expression.

Specifically, we create specialized code once the value of B becomes known, using an
idea called partial evaluation [15]. Partial evaluation is the building of simpler,
easier-to-evaluate expressions from complex ones by substituting variables that have
a known, constant value with that constant. When two or more of these constants are
combined in an arithmetic or logical operation, or when one of the constants is an
identity for the operation, the operation can be eliminated. In the previous example,
we no longer have to compute B*B, since we know it is 1, and we do not need to
compute A*B, since we know it is A.
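The arithmetic payoff can be sketched in a few lines of C. This sketch is mine, not
the kernel's: Synthesis emits actual machine code, while the code below merely
precomputes the subexpressions that depend only on B. Each later evaluation then
costs two multiplies instead of three (true code synthesis does better still, folding
C1 = 1 away entirely, as in the reduced expression above).

#include <stdio.h>

/* Specialized form of A*A + A*B + B*B for a fixed B: precompute
   c1 = B and c2 = B*B once, when B becomes known. */
struct poly { double c1, c2; };

static struct poly specialize(double b)
{
    struct poly p = { b, b * b };   /* partial evaluation of the B terms */
    return p;
}

static double eval(const struct poly *p, double a)
{
    return a * a + p->c1 * a + p->c2;   /* A*A + C1*A + C2 */
}

int main(void)
{
    struct poly p = specialize(1.0);    /* hold B = 1 */
    printf("%g\n", eval(&p, 3.0));      /* 9 + 3 + 1 = 13 */
    return 0;
}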
There are strong parallels between runtime code generation and compiler code
generation, and many of the ideas and terminology carry over from one to the other.
Indeed, anything that a compiler does to create executable code can also be performed
at runtime. But because compilation is an off-line process, there is usually less
concern about the cost of code generation and therefore one has a wider palette of
techniques to choose from. A compiler can afford to use powerful, time-consuming
analysis methods and perform sophisticated optimizations -- a luxury not always
available at runtime.
Three optimizations are of special interest to us, not only because they are easy to
do, but because they are also effective in improving code quality. They are: constant
folding, constant propagation, and procedure inlining. Constant folding replaces
constant expressions like 5*4 with the equivalent value, 20. Constant propagation
replaces variables that have a known, constant value with that constant. For example,
the fragment x = 5; y = 4*x becomes x = 5; y = 4*5 through constant propagation;
4*5 then becomes 20 through constant folding. Procedure inlining substitutes the body
of a procedure, with its local variables appropriately renamed to avoid conflicts, in
place of its call.
There are three costs associated with runtime code generation: creation cost, paid
each time a piece of code is created; execution cost, paid each time the code is used; and
management costs, to keep track of where the code is and how it is being used. The hope
is to use the information available at runtime to create better code than would otherwise
be possible. In order to win, the savings of using the runtime-created code must exceed the
cost of creating and managing that code. This means that for many applications, a fast
code generator that creates good code will be superior to a slow code generator that creates
excellent code. (The management problem is analogous to keeping track of ordinary, heap-
allocated data structures, and the costs are similar, so they will not be considered further.)
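A rough break-even rule makes this tradeoff concrete (the numbers here are
illustrative, not measurements from this chapter): if creating the specialized code
costs Tc cycles and each invocation saves s = t_general - t_special cycles, the
runtime-created code wins once it is used more than n = Tc / s times. With Tc = 500
and s = 50, the generated code pays for itself after 10 uses; a generator ten times
slower would need 100 uses to break even. That is why a fast generator producing good
code can beat a slow generator producing excellent code.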
Synthesis focuses on techniques for implementing very fast runtime code generation.
The goal is to broaden its applicability and extend its benefits, making it cheap
enough so that even expressions and procedures that are not re-used often still
benefit from having their code custom-created at runtime. To this end, the places
where runtime code generation is used are limited to those where it is clear at
compile time what the possible reductions will be. The following paragraphs describe
the idea, while the next section describes the specific techniques.
A fast runtime code generator can be built by making full use of the information
available at compile time. In our example, we know at compile time that B will be
held constant, but we do not know what the constant will be. But we can predict at
compile time what form the reduced expression will have: A*A + C1*A + C2. Using this
knowledge, we can build a simple code generator for the expression that copies a code
template representing A*A + C1*A + C2 into newly allocated memory and computes and
fills in the constants: C1 = B and C2 = B*B. A code template is a fragment of code
which has been compiled
move a constant into a machine register in the most efficient way possible.
For example, we can apply this idea to improve the performance of the read system
function. When reading a particular file, constant parameters include the device that
the file resides on, the address of the kernel buffers, and the process performing
the read. We can use file open as F^create; the F^small it generates becomes our read
function. F^create consists of many small procedure templates, each of which knows
how to generate code for a basic operation such as "read disk block", "process TTY
input", or "enqueue data." The parameters passed to F^create determine which of these
code-generating procedures are called and in what order. The final F^small is created
by filling these templates with addresses of the process table, device registers, and
the like.
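The selection aspect of this structure can be approximated in C with function
pointers (a sketch of mine, not Synthesis source; the names file_open, read_tty, and
read_disk are invented). Real code synthesis goes further than this sketch: instead
of indirecting through a struct on each call, it fills a code template with the
invariant values themselves, embedded as immediate operands.

#include <stdio.h>
#include <stddef.h>

struct file;
typedef size_t (*readfn)(struct file *, char *, size_t);

struct file {
    readfn read;    /* the specialized F^small, chosen once at open time */
    void  *kbuf;    /* invariant: kernel buffer for this open file */
};

/* Two specialized readers; each assumes its own invariants and tests nothing. */
static size_t read_tty(struct file *f, char *buf, size_t n)
{ (void)f; (void)buf; return n; }   /* ... TTY input processing ... */

static size_t read_disk(struct file *f, char *buf, size_t n)
{ (void)f; (void)buf; return n; }   /* ... block-device path ... */

/* file_open plays the role of F^create: it examines parameters that stay
   constant for the life of the open file and binds the reader once. */
static void file_open(struct file *f, int is_tty)
{
    f->read = is_tty ? read_tty : read_disk;
    f->kbuf = NULL;
}

int main(void)
{
    struct file f;
    char buf[16];
    file_open(&f, 1);
    printf("%zu\n", f.read(&f, buf, sizeof buf));   /* no per-call device test */
    return 0;
}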
The process to eliminate the procedure call can be embedded into two second-order
functions. F^create_present returns code equivalent to F_present and suitable for
in-line insertion. F^create_applica incorporates that code to generate
F^flat_applica:

    F^create_applica(p1, F^create_present(p2, ...)) => F^flat_applica(p1, p2, ...)

This technique is analogous to in-line code substitution for procedure calls in
compiler code generation. In addition to the elimination of procedure calls, the
resulting code typically exhibits opportunities for further optimization, such as
Factoring Invariants and elimination of data copying.
29
create can eliminate the procedure call to the Session layer, and
By induction, Fpresent
create to establish a virtual circuit, the F flat
down through all layers. When we execute Fapplica applica
code used thereafter to send and receive messages may consist of only sequential code. The
performance gain analysis is similar to the one for factoring invariants.
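The effect can be shown by hand in C (an invented example, not Synthesis code;
Synthesis produces the flattened version mechanically, at runtime, once the layer
parameters are known). Here a per-character send crosses two layers, and the
conversion flag conv is an invariant of the circuit:

#include <stdio.h>

/* Layered path: each layer is a separate call. */
static int session_send(int circuit, int c)  { return circuit + c; }
static int present_send(int conv, int circuit, int c)
{
    return session_send(circuit, conv ? c ^ 0x20 : c);   /* e.g. case folding */
}
static int applica_send(int conv, int circuit, int c)
{
    return present_send(conv, circuit, c);
}

/* Collapsed path for the case conv = 0: the layer calls and the dead
   conversion branch are gone, leaving only sequential code. */
static int applica_send_flat(int circuit, int c)
{
    return circuit + c;
}

int main(void)
{
    printf("%d %d\n", applica_send(0, 7, 'x'), applica_send_flat(7, 'x'));
    return 0;
}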
Put(c)
{
    *bufp++ = c;        /* store character, advance buffer pointer */
    if(bufp == endp)    /* buffer full? */
        flush();
}
the constant in the instruction stream, so there is an immediate savings that comes
from eliminating one or two levels of indirection and obviating the need to pass the
structure pointer. These can be attributed to "saving the cost of interpretation."
But hardwiring also opens up the possibility of further optimizations, such as
constant folding, while fetching from a data structure admits no such optimizations.
Constant folding becomes possible because once it is known that a parameter will be,
say, 2, all pure functions of that parameter will likewise be constant and can be
evaluated once and the constant result used thereafter. A similar flavor of
optimization arises with IF-statements. In the code fragment "if(C) S1; else S2;",
where the conditional, C, depends only on constant parameters, the generated code
will contain either S1 or S2, never both, and no test. It is with this cascade of
optimization possibilities that code synthesis obtains its most significant
performance gains. The following section illustrates some of the places in the kernel
where runtime code generation is used to advantage.
original bufp pointer. The character is stored in the buffer using the
"move.b d0,(a0,D)" instruction, which is just as fast as a simple register-indirect
store. The displacement, D, is chosen so that when P + D points to the end of the
buffer, P is 0 modulo 2^16; that is, the least significant 16 bits of P are zero. The
"addq.w #1,(P+2)" instruction then increments the lower 16 bits of the buffer pointer
and also implicitly tests for end-of-buffer, which is indicated by a 0 result. For
buffer sizes greater than 2^16 bytes, the flush routine can propagate the carry-out
to the upper bits, flushing the buffer when the true end is reached.

This performance gain can only be had using runtime code generation, because D must
be a constant, embedded in the buffer's machine code, to take advantage of the fast
memory-reference instruction. Were D a variable, the loss of fetching its value and
indexing would offset the gain from eliminating the compare instruction. The 40%
savings is significant because buffers and queues are used often. Another advantage
is improved locality of reference: code synthesis puts both code and data in the same
page of memory, increasing the likelihood of cache hits in the memory management
unit's address translation cache.
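The wraparound trick can be rendered in C, at some loss of fidelity: the 68030
version folds the store, the increment, and the end test into two instructions, while
this sketch (mine, with an assumed 64-kilobyte buffer) relies on a 16-bit offset
wrapping to zero exactly at the end of the buffer.

#include <stdint.h>
#include <stdlib.h>

enum { BUFSIZE = 1u << 16 };      /* assumption: 2^16-byte buffer */

struct cbuf {
    char    *base;                /* BUFSIZE bytes of storage */
    uint16_t off;                 /* low 16 bits of the buffer offset */
};

static void cbuf_flush(struct cbuf *b)
{
    (void)b;                      /* write out b->base[0..BUFSIZE-1] ... */
}

static void cbuf_put(struct cbuf *b, char c)
{
    b->base[b->off] = c;          /* one store, like "move.b d0,(a0,D)" */
    if (++b->off == 0)            /* 16-bit wraparound doubles as the end test, */
        cbuf_flush(b);            /*   like "addq.w #1,(P+2)" leaving zero */
}

int main(void)
{
    struct cbuf b = { malloc(BUFSIZE), 0 };
    for (long i = 0; i < 70000; i++)
        cbuf_put(&b, 'x');        /* flushes once, when the offset wraps */
    free(b.base);
    return 0;
}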
Outside the kernel, the Synthesis implementation of the C-language I/O library,
"stdio," uses code-synthesized buffers at the user level. In a simple experiment, I
replaced the Unix stdio library with the Synthesis version. I compiled and ran a
simple test program that invokes the putchar macro ten million times, using first the
native Unix stdio library supplied with the Sony NEWS workstation, and then the
Synthesis version. Table 3.2 shows the Synthesis macro version is 1.6 times faster,
and over 4 times smaller, than the Unix version.

The drastic reduction in code size comes about because code synthesis can take
advantage of the extra knowledge available at runtime to eliminate execution paths
that cannot be taken. The putchar operation, as defined in the C library, actually
supports three kinds of buffering: block-buffered, line-buffered, and unbuffered.
Even though only one of these can be in effect at any one time, the C putchar macro
must include code to handle all of them, since it cannot know ahead of time which one
will be used. In contrast, code synthesis creates only the code handling the kind of
buffering actually desired for the particular file being written to. Since putchar,
being a macro, is expanded in-line every time it appears in the source code, the
savings accumulate rapidly.
Table 3.2 also shows that the Synthesis "putchar" function is slightly faster than
the Unix macro -- a dramatic result: even while paying procedure call overhead, code
synthesis still shows a speed advantage over conventional code in-lined with a macro.
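The shape of the savings can be suggested in C (a sketch; the real stdio macro and
the Synthesis-generated code differ in detail, and flush and write1 here are
hypothetical helpers). A general putchar carries all three buffering cases on every
call; a stream specialized at open time contains only its own:

enum bufmode { UNBUFFERED, LINEBUF, BLOCKBUF };

struct stream { enum bufmode mode; char *p, *end; };

static int flush(struct stream *s)          { (void)s; return 0; }
static int write1(struct stream *s, int c)  { (void)s; return c; }

/* General form: every call re-tests the buffering mode. */
int put_general(struct stream *s, int c)
{
    if (s->mode == UNBUFFERED)
        return write1(s, c);
    *s->p++ = c;
    if ((s->mode == LINEBUF && c == '\n') || s->p == s->end)
        return flush(s);
    return c;
}

/* Specialized form for a block-buffered stream: the mode tests are gone. */
int put_block(struct stream *s, int c)
{
    *s->p++ = c;
    if (s->p == s->end)
        return flush(s);
    return c;
}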
proc:
:
:
{Save necessary context}
bsr swtch
res:
{Restore necessary context}
:
:
swtch:
move.l (Current),a0 // (1) Get address of current thread's TTE
move.l sp,(a0) // (2) Save its stack pointer
bsr find_next_thread // (3) Find another thread to run
move.l a0,(Current) // (4) Make that one current
move.l (a0),sp // (5) Load its stack pointer
rts // (6) Go run it!
read procedure needs to block and wait for I/O to finish. Since it has already saved
some registers on the stack as part of the normal procedure-call mechanism, there is
no need to preserve them again as they will only be overwritten upon return.

Figure 3.3 illustrates the general idea. When a kernel thread executes code that
decides that it should block, it saves whatever context it wishes to preserve on the
active stack. It then calls the scheduler, swtch; doing so places the thread's
program counter on the stack. At this point, the top of stack contains the address
where the thread is to resume execution when it unblocks, with the machine registers
and the rest of the context below that. In other words, the thread's context has been
reduced to a single register: its stack pointer. The scheduler stores the stack
pointer into the thread's control block, known as the thread table entry (TTE), which
holds the thread state when it is not executing. It then selects another thread to
run, shown as a call to the find_next_thread procedure in the figure, but actually
implemented as an executable data structure as discussed later. The variable Current
is updated to reflect the new thread and its stack pointer is loaded into the CPU. A
return-from-subroutine (rts) instruction starts the thread running. It continues
where it had left off (at label res), where it pops the previously-saved state off
the stack and proceeds with its work.
Figure 3.4 shows two TTEs. Each TTE contains code fragments that help with
context switching: sw_in and sw_in_mmu, which load the processor state from the TTE; and
sw_out, which stores processor state back into the TTE. These code fragments are created
specially for each thread. To switch in a thread for execution, the processor executes the
tt0.reg:     The integer registers
tt0.usp:     The user stack pointer
tt0.vbr:     The vector table
tt0.ptab:    The page map table
tt0.fpr:     The floating-point registers

sw_in_mmu:   pmove.q tt0.ptab,%crp
sw_in:       move.l  tt0.vbr,a0
             move.l  a0,%vbr
             move.l  tt0.usp,a0
             move.l  a0,%usp
             movem.l (tt0.reg),<d0-a7>
             rte

(Thread-1's TTE is laid out identically, with tt1.* in place of tt0.*.)
thread's sw_in or sw_in_mmu procedure. To switch out a thread, the processor executes the
thread's sw_out procedure.
Notice how the ready-to-run threads (waiting for CPU) are chained in an executable
circular queue. A jmp instruction at the end of the sw_out procedure of the preceding thread
points to the sw_in procedure of the following thread. Assume thread-0 is currently running.
When its time quantum expires, the timer interrupt is vectored to thread-0's sw_out. This
procedure saves the CPU registers into thread-0's register save area (tt0.reg). The jmp
instruction then directs control flow to one of two entry points of the next thread's (thread-1)
context-switch-in procedure, sw_in or sw_in_mmu. Control flows to sw_in_mmu when a change
of address space is required; otherwise control flows to sw_in. The switch-in procedure then
loads the CPU's vector base register with the address of thread-1's vector table, restores the
processor's general registers, and resumes execution of thread-1. The entire switch takes 10.5
microseconds for integer-only contexts between threads in the same address space, or
56 microseconds including the floating-point context and a change in address space.[1]
Table 3.3 summarizes the time taken by the various types of context switches in
Synthesis, saving and restoring all the integer registers. These times include the hardware
interrupt service overhead: they show the elapsed time from the execution of the last
instruction in the suspended thread to the first instruction in the next thread. Previously
published papers report somewhat lower figures [25] [18]. This is because they did not
include the interrupt-service overhead, and because of some extra overhead incurred in
handling the 68882 floating-point unit on the Sony NEWS workstation that does not occur
on the Quamachine, as discussed later. For comparison, a call to a null procedure in the C
language takes 1.4 microseconds, and the Sony Unix context switch takes 170 microseconds.
[1] Previous papers incorrectly cite a floating-point context switch time of 15 µs [25] [18]. This error is
believed to have been caused by a bug in the Synthesis assembler, which incorrectly filled the operand field
of the floating-point move-multiple-registers instruction, causing it to preserve just one register instead of
all eight. Since very few Synthesis applications use floating point, this bug remained undetected for a long
time.
Synthesis reduces this overhead through lazy evaluation of floating-point context switches. Switching in a thread for execution loads its
integer state and disables the floating-point unit. When a thread executes its first floating-
point instruction since the switch, it takes an illegal instruction trap. The kernel then
loads the necessary state, first saving any prior state that may have been left there, re-
enables the floating-point unit, and the thread resumes with the interrupted instruction.
The trap is taken only on the first floating-point instruction following a switch, and adds
only 3 µs to the overhead of restoring the state. This is more than compensated for by
the other savings: the integer context-switch becomes 1.5 µs faster because there is no need for
an fsave instruction to test for possible floating-point use; and even floating-point threads
benefit when they block without a floating-point instruction having been issued since they were
switched in, saving the cost of restoring and then saving that context. Indeed, if only a
single thread is using floating point, the floating-point context is never switched, remaining
in the coprocessor.
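The protocol can be sketched in C (all names here are hypothetical; the real mechanism is a few 68030 instructions plus the coprocessor trap):

    typedef struct thread Thread;

    /* Assumed primitives, supplied elsewhere by the kernel: */
    void load_integer_state(Thread *t);
    void save_fp_state(Thread *t);
    void load_fp_state(Thread *t);
    void fpu_enable(void);
    void fpu_disable(void);

    static Thread *fpu_owner;        /* thread whose state currently sits in the FPU */

    void switch_in(Thread *t)
    {
        load_integer_state(t);
        fpu_disable();               /* the first FP instruction will now trap */
    }

    void fp_trap_handler(Thread *t)  /* runs on the first FP instruction after a switch */
    {
        if (fpu_owner != t) {
            if (fpu_owner)
                save_fp_state(fpu_owner);  /* save the prior owner's state */
            load_fp_state(t);
            fpu_owner = t;
        }
        fpu_enable();                /* resume, retrying the trapped instruction */
    }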
Examples taken from the Synthesis Sound-IO device driver illustrate the ideas and
provide performance numbers. The Sound-IO device is a general-purpose, high-quality
audio input and output device with stereo, 16-bit analog-to-digital and digital-to-analog
converters, and a direct-digital input channel from a CD player. This device interrupts the
processor once for every sound sample, 44,100 times per second, a very high rate
by conventional measures. It is normally inconceivable to attach such high-rate interrupt
sources to the main processor. Sony Unix, for example, can service a maximum of 20,000
interrupts per second, and such a device could not be handled at all.[2] Efficient interrupt
handling is mandatory, and the rest of this section shows how Synthesis can service high
interrupt rates efficiently.
Several benefits of runtime code generation combine to improve the efficiency of
interrupt handling in Synthesis: the use of the high-speed buffering code described in Sec-
tion 3.3.1, the ability to create interrupt routines that save and restore only the part of the
context being used, and the use of layer-collapsing to merge separate functions together.
Figure 3.5 shows the actual Synthesis code created to handle the Sound-IO interrupts
when only the CD player is active. It begins by saving a single register, a0, since that is
the only one used. This is followed by the code for the specific sound I/O channels, in this
case, the CD player. The code is similar to the fast buffer described in Section 3.3.1, synthesized
to move data from the CD port directly into the user's buffer. If the other input sources
(such as the A-to-D input) also become active, the interrupt routine is re-written, placing
their buffer code immediately following the CD player's. The code ends by restoring the a0
register and returning from interrupt.
Figure 3.6 shows the best I have been able to achieve using hand-written assembly
language, without the use of code synthesis. Like the Synthesis version, this uses only a
single register, so we save and restore only that one.[3] But without code synthesis, we must
include code for all the Sound-IO sources (CD, AD, and DA), testing and branching
around the parts for the currently inactive channels. In addition, we can no longer use the
[2] The Sony workstation has two processors, one of which is dedicated to I/O, including servicing I/O
interrupts using a somewhat lighter-weight mechanism. They solve the overhead problem with specialized
processors, a trend that appears to be increasingly common. But this solution compounds latency, and
does not negate my point, which is that existing operating systems have high overheads that discourage
frequent invocations.
[3] Most existing systems neglect even this simple optimization. They save and restore all the registers, all
the time.
intr:
move.l a0,-(sp) // Save register a0
move.l (P),a0 // Get buffer pointer into reg. a0
move.l (cd_port),(a0,D)// Store CD data into address P+D
addq.w #4,(P+2) // Increment low 16 bits of P.
beq cd_done // ... flush buffer if full
move.l (sp)+,a0 // Restore register a0
rte // Return from interrupt
s_intr:
move.l a0,-(sp) // Save register a0
tst.b (cd_active) // Is the CD device active?
beq cd_no // ... no, jump
move.l (cd_buf),a0 // Get CD buffer pointer into reg. a0
move.l (cd_port),(a0)+ // Store CD data; increment pointer
move.l a0,(cd_buf) // Update CD buffer pointer
subq.l #1,(cd_cnt) // Decrement buffer count
beq cd_flush // ... jump if buffer full
cd_no:
tst.b (ad_active) // Is the AD device active?
beq ad_no // ... no, jump
:
: [handle AD device, similar to CD code]
:
ad_no:
tst.b (da_active) // Is the DA device active?
beq da_no // ... no, jump
:
: [handle DA device, similar to CD code]
:
da_no:
move.l (sp)+,a0 // Restore register a0
rte // Return from interrupt
s_intr:
movem.l <d0-d2,a0-a2>,-(sp)
bsr _sound_intr
movem.l (sp)+,<d0-d2,a0-a2>
rte
sound_intr()
{
if(cd_active) {
*cd_buf++ = *cd_port;
if(--cd_cnt < 0)
cd_flush();
}
if(ad_active) {
...
}
if(da_active) {
...
}
}
                     Time in µs   Speedup
Null Interrupt           2.0         --
CD-in, code-synth        3.7         --
CD-in, assembler         6.0        2.4
CD-in, C                 9.7        4.5
CD-in, C & Unix         17.1        8.9
CD+DA, code-synth        5.1         --
CD+DA, assembler         7.7        1.8
CD+DA, C                11.3        3.0
CD+DA, C & Unix         18.9        5.5
Within each group of measurements, there are four rows. The first three rows show
the time taken by the code synthesis, hand-assembler, and C implementations of the inter-
rupt handler, in that order. The code fragments measured are those of Figures 3.5, 3.6,
and 3.7; the C code was compiled on the Sony NEWS workstation using "cc -O". The last
row shows the time taken by the C version of the handler, but dispatched the way that Sony
Unix does, preserving all the machine's registers prior to the call. The left column gives the
elapsed execution time, in microseconds. The right column gives the ratio of times between
the code-synthesis implementation and the others. The null-interrupt time is subtracted
before computing the ratio to give a better picture of the implementation-specific
performance increases.
As can be seen from the table, the performance gains from code synthesis are
impressive. With only one channel active, we get more than twice the performance of hand-
written assembly language, almost five times that of well-written C, and very
nearly an order of magnitude better than traditional Unix interrupt service. Furthermore,
the non-code-synthesis versions of the driver support only the two-channel, 16-bit linear-
encoding sound format. Extending support, as Synthesis does, to other sound formats,
such as µ-Law, either requires more tests in the sound interrupt handler or an extra level of
format-conversion code between the application and the sound driver. Either option adds
overhead that is not present in the code-synthesis version, and would increase the times
shown.
With two channels active, the gain is still significant, though somewhat less than that
for one channel. The reason is that the overhead-reducing optimizations of code synthesis,
collapsing layers and preserving only context that is used, become less important as
the amount of work increases. But other optimizations of code synthesis, such as the fast
buffer, continue to be effective and scale with the work load. In the limit, as the number of
active channels becomes large, the C and assembly versions perform equally well, and the
code synthesis version is about 40% faster.
In Synthesis, a kernel invocation is simply a procedure call that happens to cross the protection boundary between user and kernel. This is important
because, as we will see in Chapter 4, each Synthesis service has a set of procedures associ-
ated with it that delivers that service. Since the set of services provided is extensible, we
need a more general way of invoking them. Combining procedure calls with runtime code
generation lets us do this efficiently.
Figure 3.8 shows how. The generated code consists of two parts: a user part, shown
at the top of the figure, and a kernel part, shown at the bottom. The user part loads the
procedure index number into the d2 register and executes the trap instruction, switching the
processor into kernel mode, where it executes the kernel part of the code, beginning at label
trap15. The kernel part begins with a limit check on the procedure index number, ensuring
that the index is inside the table area and preventing cheating by buggy or malicious user
code that may pass a bogus number. It then indexes the table and calls the kernel procedure.
The kernel procedure typically performs its own checks, such as verifying that all pointers
are valid, before proceeding with the work. It returns with the rte instruction, which
takes the thread back into user mode, where it returns control to the caller. Since the user
program can only specify an index into the procedure table, and not the procedure address
itself, only the allowed procedures may be called, and only at the appropriate entry points.
Even if the user part of the generated code is overwritten, either accidentally or maliciously,
it can never cause the kernel to do something that could not have been done through some
other, untampered, sequence of calls.
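The dispatch path can be rendered in C as follows (a sketch: the constants and table layout are illustrative, and the trap/rte pair is collapsed into an ordinary call):

    #define MAX_PROC 64                        /* assumed size of the per-thread table */
    typedef long (*kproc)(long, long, long);
    static kproc kpt[MAX_PROC];                /* thread-specific kernel procedure table */

    long kernel_call(unsigned index, long a, long b, long c)
    {
        /* user part: load index, trap (modeled here as the function call itself) */
        /* kernel part, reached through the trap: */
        if (index >= MAX_PROC)                 /* limit check: reject bogus indices */
            return -1;
        return kpt[index](a, b, c);            /* indexed call; rte returns to user mode */
    }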
Runtime code generation gives the following advantages: each thread has its own
table of vectors for exceptions and interrupts, including trap 15. This means that each
thread's kernel calls vector directly to the correct dispatch procedure, saving a level of
indirection that would otherwise have been required. This dispatch procedure, since it is
thread-specific, can hard-wire certain constants, such as MAX and the base address of the
kernel procedure table, saving the time of fetching them from a data structure.
Furthermore, by thinking of kernel invocation not as a system call, which conjures
up thoughts of heavyweight processing and large overheads, but as a procedure call,
many other optimizations become easier to see. For example, ordinary procedures preserve
only those registers which they use; kernel procedures can do likewise. Procedure calling
conventions also do not require that all the registers be preserved across a call. Often,
a number of registers are allowed to be "trashed" by the call, so that simple procedures
can execute without preserving anything at all. Kernel procedures can follow this same
convention. The fact that kernel procedures are called from user level does not make them
special; one merely has to properly address the issues of protection, which are discussed
further in Section 3.4.2.
Besides dispatch, we also need to address the problem of how to move data between
user space and kernel as efficiently as possible. There are two kinds of moves required:
passing procedure arguments and return values, and passing large buffers of data. For
passing arguments, the user-level stub procedures are generated to pass as many arguments
as possible in the CPU registers, bypassing the expense of accessing the user stack from
kernel mode. Return values are likewise passed in registers, and moved elsewhere as needed
by the user-level stub procedure. This is similar in idea to using CPU registers for passing
short messages in the V system [9].
Passing large data buffers is made efficient using virtual memory tricks. The main
idea is: when the kernel is invoked, it has the user address space mapped in. Synthesis
reserves part of each address space for the kernel. This part is normally inaccessible from
user programs. But when the processor executes the trap and switches into kernel mode,
the kernel part of the address space becomes accessible in addition to the user part, and
the kernel procedure can easily move data back and forth using ordinary machine
instructions. Prior to beginning such a move, the kernel procedure checks that no pointer
refers to locations outside the user's address space, an easy check due to the way the
virtual addresses are chosen: a single limit-comparison (two instructions) suffices.
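Assuming, for illustration, that user space occupies the low half of each address space, the check reduces to a comparison or two, as in this hypothetical sketch:

    #include <stdint.h>

    #define USER_LIMIT 0x80000000u   /* assumed boundary: user space lies below it */

    /* Nonzero if [p, p+len) lies entirely within user space. */
    static int user_range_ok(const void *p, uint32_t len)
    {
        uint32_t a = (uint32_t)(uintptr_t)p;
        return a + len <= USER_LIMIT    /* the limit comparison */
            && a + len >= a;            /* guard against wrap-around */
    }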
Further optimizations are also possible. Since the user-level stub is a real procedure,
it can be in-line substituted into its caller. This can be done lazily: the stub is written so
that each time a call happens, it fetches the return address from the stack and modifies that
point in the caller. Since the stubs are small, space expansion is minimal. Besides being
effective, this mechanism requires minimal support from the language system to identify
potential in-lineable procedure calls.
Another optimization bypasses the kernel procedure dispatcher. There are 16 possi-
ble traps on the 68030. Three of these are already used, leaving 13 free for other purposes,
such as directly calling heavily-used kernel procedures. If a particular kernel procedure is
expected to be used often, an application can invoke the cache procedure call, and Syn-
thesis will allocate an unused trap, set it to call the kernel procedure directly, and re-create
the user-level stub to issue this trap. Since this trap directly calls the kernel procedure,
there is no longer any need for a limit check or a dispatch table. Pre-assigned traps can
also be used to import execution environments. Indeed, the Synthesis equivalent of the
Unix concept of "stdin" and "stdout" is implemented with cached kernel procedure calls.
Specifically, trap 1 writes to stdout, and trap 2 reads from stdin.
Combining both optimizations results in a kernel procedure call that costs just a
little more than a trap instruction. The various costs are summarized in Table 3.5. The
middle block of measurements shows the cost of various Synthesis null kernel procedure
calls: the general-dispatched, non-inlined case; the general-dispatched, with the user-level
stub inlined into the application's code; cached-kernel-trap, non-inlined; and cached-kernel-
trap, inlined. For comparison, the cost of a null trap and a null procedure call in the C
language is shown on the top two lines, and the cost of the trivial getpid system call in
Unix and Mach is shown on the bottom two lines.
The decision of when to expand in-line is made by the programmer writing F^create. Full, memory-hungry
in-line expansion is usually reserved for specific uses where its benefits are greatest: the
performance-critical, frequently-executed paths of a function, where the performance gains
justify increased memory use. Less frequently executed parts of a function are stored in a
common area, shared by all instances through subroutine calls.
In-line expansion does not always cost memory. If a function is small enough, ex-
panding it in-line can take the same or less space than calling it. Examples of functions
that are small enough include character-string comparisons and buffer-copy. For functions
with many runtime-invariant parameters, the size expansion of inlining is offset by a size
decrease that comes from not having to pass as many parameters.
In practice, the actual memory needs are modest. Table 3.6 shows the total memory
used by a full kernel, including I/O buffers, virtual memory, network support, and a
window system with two memory-resident fonts.
When the kernel procedure finishes and returns control to the caller, the synthesized code reverts to the previous (user-level) mode.
There is still the question of invalidating the code when the operation it performs is
no longer valid, for example, invalidating the read procedure after a file has been closed.
Currently, this is done by setting the corresponding function pointers in the kernel procedure
table (KPT) to an invalid address, preventing further calls to the function. The function's
reference counter is then decremented, and its memory freed when the count reaches zero.
Besides the performance hit from a cold cache, cache flushing itself may be slow. On
the 68030 processor, for example, the instruction to flush the cache is privileged. Although
this causes no special problems for the Synthesis kernel, it does force user-level programs
that modify code to make a system call to flush the cache. I do not see any protection-
related reason why that instruction must be privileged; perhaps making it so simplified
the processor design.
3.5 Summary
This chapter showed: (1) that code synthesis allows important operating system
functions such as buffering, context switching, interrupt handling, and system call dispatch
to be implemented 1.4 to 2.4 times more efficiently than is possible using the best assembly-
language implementation without code synthesis, and 1.5 to 5 times better than well-written
C code; (2) that code synthesis is also effective at the user level, achieving an 80% improve-
ment for basic operations such as putchar; and (3) that the anticipated size penalty does
not, in fact, materialize.
Before leaving this section, I want to call a moment's more attention to the interrupt
handlers of Section 3.3.3. At first glance, and even on the second and third, the C-
language code looks to be as minimal as it can get. There does not seem to be any fat to
cut. Table 3.4 has shown otherwise. The point is that sometimes sources of overhead are
hidden, not easy to spot and optimize. They are a result of assumptions made and the
programming language used, whether in the form of a common calling convention for
procedures, or in conventions followed to simplify linking routines to interrupts. This section
has shown that code synthesis is an important technique that enables general procedure
interfacing while preserving, and often bettering, the efficiency of custom-crafted code.
The next chapter shows how Synthesis is structured and how synergy between
kernel code synthesis and good software engineering leads to a system that is general and
easily expandable, but at the same time efficient.
4 Kernel Structure
All things should be made as simple as possible, but no simpler.
-- Albert Einstein
4.1 Quajects
Quajects are the building blocks out of which all Synthesis kernel services are com-
posed. The name is derived from the term "object" of Object-Oriented (O-O) systems,
which they strongly resemble [32]. The similarity is strong, but the difference is significant.
Like objects, quajects encapsulate data and provide a well-defined interface to access it.
Unlike objects, quajects use a code-synthesis implementation to achieve high performance,
but lack high-level language support and inheritance.
Kernel quajects can be broadly classified into four kinds: thread, memory, I/O, and
device. Thread quajects encapsulate the unit of execution, memory quajects the unit of
data storage, I/O quajects the unit of data movement, and device quajects the machine's
interfaces to the outside world. Each kind of quaject is defined and implemented indepen-
dently.
Basic quajects implement fundamental services that cannot be had through any
combination of other quajects. Threads and queues are two examples of basic quajects;
Name         Purpose
Thread       Implements threads
Queue        Implements FIFO queues
Buffer       Data buffering
Dcache       Data caching (e.g., for disks)
FSmap        File to flat storage mapping
Clock        The system clock
CookTTYin    Keyboard input editor
CookTTYout   Output editor and format conversion
VT-100       Emulates DEC's VT100 terminal
Twindow      Text display window
Gwindow      Graphics (bit-mapped) display window
Probe        Measurements and statistics gathering
Sytab        Symbol table (associative mapping)
Table 4.1 contains a list of the basic quajects in Synthesis. More complex kernel services
are built out of the basic quajects by composition. For example, the Synthesis kernel has no
pre-defined notion of a "process." But a Unix-like process can be created by instantiating
a thread quaject, a memory quaject, some I/O quajects, and interconnecting them in a
particular way.
Figure 4.1: The queue quaject.

+----------+                      +----------+
| Q_put    |                      | Q_get    |
+----------+    +--+--+--+--+     +----------+
| Q_full   |    | o| o|  |  |     | Q_empty  |
+----------+    +--+--+--+--+     +----------+
| Q_full-1 |                      | Q_empty+1|
+----------+                      +----------+
Callouts are the points in a quaject's code where external calls to other quajects' callentries happen. Tables 4.2, 4.3, and 4.4 list the
interfaces to the Synthesis basic kernel quajects.
Callentries are analogous to methods in O-O systems. The other two, callbacks and
callouts, have no direct analogue in O-O systems. Conceptually, a callout is a function
pointer that has been initialized to point to another quaject's callentry; callbacks point
back to the invoker. Callouts are an important part of the interface because they specify
what type of external call is needed, making it possible to dynamically link any of several
different quajects' callentries to a particular callout, so long as the type matches. For
example, the Synthesis buffer quaject has a flush callout which is invoked when the
buffer is full. This enables the same buffer implementation to be used throughout the
kernel, simply by instantiating a buffer quaject and linking its flush callout to whatever
downstream processing is appropriate for the instance.
The quaject interface is best illustrated using a simple quaject as an example:
the FIFO queue, shown in Figure 4.1. The Synthesis kernel supports four different types
of queues, to optimize for the varying synchronization needs of different combinations of
single or multiple producers and consumers (synchronization is discussed in Chapter 5). All
four types support the same abstract type [6], defined by two callentry references, Q_put
and Q_get, which put and get elements of the queue. Both these callentry references return
synchronously under the normal condition (successful insertion or deletion). Under other
conditions, the queue returns through the callbacks.
The queue has four callbacks, which are used to return queue-full and queue-empty
conditions back to the caller. Q_empty is invoked when a Q_get fails because the queue is empty.
Q_full is invoked when a Q_put fails because the queue is full. Q_empty+1 is called after
a previous Q_get had failed and then an element was inserted. And Q_full-1 is called
after a previous Q_put had failed and then an element was deleted. The idea is: instead of
returning a condition code for interpretation by the invoker, the queue quaject directly calls
the appropriate handling routines supplied by the invoker, speeding execution by eliminating
the interpretation of return status codes.
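The flavor of this interface can be sketched in C (hypothetical names; the real quaject is generated code, with callbacks linked in at instantiation time): instead of a status code, Q_put transfers control straight to the caller-supplied handler.

    typedef struct queue Queue;
    struct queue {
        int  *base;                      /* element storage */
        int   head, tail, size;
        void (*q_full)(Queue *);         /* callback: a put failed, queue full */
        void (*q_full_1)(Queue *);       /* callback: space available after a failed put */
    };

    void Q_put(Queue *q, int elem)
    {
        int next = (q->head + 1) % q->size;
        if (next == q->tail) {           /* full: no status code is returned... */
            q->q_full(q);                /* ...control goes directly to the handler */
            return;
        }
        q->base[q->head] = elem;
        q->head = next;
    }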
Type information guides the kernel when linking one quaject to another. This is covered in the next section.
Kernel quajects are created whenever they are needed to build higher-level services.
For example, opening an I/O pipe creates a queue; opening a text window creates three
quajects: a window, a VT-100 terminal emulator, and a TTY-cooker. Which quajects
get created and how they are interconnected is determined by the implementation of each
service.
Quajects may also be created at the user level, simply by calling the class's create
callentry from a user-level thread. The effect is identical to creating kernel quajects, except
that user memory is allocated and filled, and the resulting quajects execute in user mode,
not kernel mode. The kernel does not concern itself with what happens to such user-level quajects.
It merely offers creation and linkage services to applications that want to use them.
Quajects are destroyed when they are no longer needed. Invoking the destroy
callentry signals that a particular thread no longer needs a quaject. The quaject itself is
not actually destroyed until all references to it are severed; reference counts are used.
There is the possibility that circular references prevent destruction of otherwise useless
quajects, but this has not been a problem because quajects tend to be connected in cycle-
free graphs. Destroying quajects does not immediately deallocate their memory. They are
instead placed on the inactive list for their class. This speeds subsequent creation, because
much of the code-generation and initialization work has already been done.[1] As heap
memory runs out, memory belonging to quajects on the inactive list is recycled.
Linking quajects is simple because it relies on structure that is known at system generation time. Basically, the linking consists of blindly storing
addresses in various places, with the assurance that they will always "land" in the correct place in
the generated code. Similarly, no runtime type checking is required, as all such information
has been resolved at system generation time.
Not all references must be specified or filled. Each quaject provides default values
for its callouts and callbacks that define what happens when a particular callout or callback
is needed but not connected. The action can be as simple as printing an error message
and aborting the operation, or as complicated as dynamically creating the missing quaject,
linking the reference, and continuing.
In addition, the kernel can also resolve references in response to execution traps that
invoke the dynamic linker. Such references are represented by ASCII names. The name
Q_get, for example, refers to the queue's callentry. A symbol-table quaject maps the string
names into the actual addresses and displacements. For example, the Q_get callentry is
represented in the symbol table as a displacement from the start of the queue quaject.
Which quaject is being referenced is usually clear from context. For example, callentries
are usually invoked using a register-plus-offset addressing mode; the register contains the
address of the quaject in question. When not, an additional parameter disambiguates the
reference.
Figure 4.2 shows a producer thread using the Q_put callentry to store bytes in the
queue. The ByteQueue's Q_full callback is linked to the thread's suspend callentry; the
ByteQueue's Q_full-1 callback is linked to the thread's resume callentry. As long as the
queue is not full, calls to Q_put enqueue the data and return normally. When the queue
becomes full, the queue invokes the Q_full callback, suspending the producer thread. When
[2] The actual implementation of Synthesis V.1 uses an optimized version of ByteQueue that has a string-
the ByteQueue's reader removes a byte, the Q_full-1 callback is invoked, awakening the
producer thread. This implements the familiar synchronous interface to an I/O stream.
Contrast this with Figure 4.3, which shows a non-blocking interface to the same
data channel, implemented using the same queue quaject. Only the connections between
the ByteQueue and the thread change. The thread's write callout still connects to the queue's
Q_put callentry. But the queue's callbacks no longer invoke procedures that suspend or
resume the producer thread. Instead, they return control back to the producer thread,
functioning, in effect, like interrupts that signal events (in this example, the filling and
emptying of the queue). When the queue fills, the Q_full callback returns control back to
the producer thread, freeing it to do other things without waiting for output to drain and
without having written the bytes that did not fit. The thread knows the write is incomplete
because control flow returns through the callback, not through Q_put. After output drains,
Q_full-1 is called, invoking an exception handler in the producer thread which checks
whether there are remaining bytes to write, and if so, goes back to Q_put to finish the
job.
Ritchie's Stream I/O system has a similar flavor: it too provides a framework for
attaching stages of processing to an I/O stream [27]. But Stream-I/O's queueing structure is
fixed, the implementation is based on messages, and the I/O is synchronous. Unlike Stream-
I/O, quajects offer a finer level of control and expanded possibilities for connection. The
previous example illustrates this by showing how the same queue quaject can be connected
in different ways to provide either synchronous or asynchronous I/O. Furthermore, quajects
extend the idea to include non-I/O services as well, such as threads.
4.1.5 Summary
In the implementation of the Synthesis kernel, quajects provide encapsulation and make
all inter-module dependencies explicit. Although quajects differ from objects in traditional
O-O systems because of their procedural interface and run-time code-generation implemen-
tation, the benefits of encapsulation and abstraction are preserved in a highly efficient
implementation.
I have shown, using the data channel as an example, how quajects are composed to
provide important services in the Synthesis kernel. That example also illustrates the main
points of a quaject interface:
Callentry references implement O-O-like methods and bypass interpretation in the
invoked quaject.
Callback references implement return codes and bypass interpretation in the invoker.
The operation semantics are determined dynamically by the quaject interconnections,
independent of the quaject's implementation.
This last point is fundamental in allowing a truly orthogonal quaject implementation, for
example, enabling a queue to be implemented without needing any knowledge of how threads
work, not even how to suspend and resume them.
The next section shows how the quaject ideas fit together to provide user-level ser-
vices.
Invoking a kernel quaject from user level is harder because the invocation crosses a protection boundary. A direct procedure call would not work because
the kernel routine needs to run in supervisor mode.
In a conventional operating system, such as Unix, application programs invoke the
kernel by making system calls. But while system calls provide a controlled, protected way
for a user-level program to invoke procedures in the kernel, they are limited in that they
allow access to only a fixed set of procedures in the kernel. For Synthesis to be extensible,
it needs an extensible kernel call mechanism: a mechanism that supports a protected, user-
level interface to arbitrary kernel quajects.
The user-level interface is supplied by stub quajects. Stub quajects reside in the
user address space and have the same callentries, with the same offsets, as the kernel quaject
which they represent. Invoking a stub's callentry from user level results in the corresponding
kernel quaject's callentry being invoked and the results returned back.
This is implemented in the following way. The stub's callentries consist of tiny
procedures that load a number into a machine register and then execute a trap instruction.
The number identifies the desired kernel procedure. The trap switches the processor into
kernel mode, where it executes the kernel-procedure dispatcher. The dispatcher uses the
procedure-number parameter to index a thread-specific table of kernel procedure addresses.
Simple limit checks ensure the index is in range and that only the allowed procedures are
called. If the checks pass, the dispatcher invokes the kernel procedure on behalf of the
user-level application.
There are many benefits to this design. One is that it extends the kernel quaject
interface transparently to user level, allowing kernel quajects to be composed with user-level
quajects. Its callentries are real procedures: their addresses can be passed to other functions
or stored in tables; they can be in-line substituted into other procedures and optimized using
the code-synthesis techniques of Section 3.2, applied at the user level. Another advantage,
which has already been discussed in Section 3.3.4, is that a very efficient implementation
exists. The result is that the protection boundary becomes fluid; what is placed in the
kernel and what is done at user level can be chosen at will, not dictated by the design of
the system. In short, all the advantages of kernel quajects have been extended out to user
level.
4.2.2 Protection
Kernel procedure calls are protected for two reasons. First, the user program can only specify
indices into the kernel procedure table (KPT), so kernel quajects are guaranteed to
execute only from legitimate entry points. Second, the index is checked before being
used, so only valid entries in the table can be accessed.
Each thread has its own copies of three structures that are global in conventional kernels:
The kernel procedure table: the thread-specific array of kernel procedure addresses
described above.
The vector table: the hardware-defined array of starting addresses of exception
handlers. The hardware consults this table to dispatch the hardware-detected excep-
tions: hardware interrupts, error traps (like division by zero), memory faults, and
software traps (system calls).
The context-switch-in and context-switch-out procedures comprising the executable
data structure of the ready queue.
Of these, the last two are unusual. The context-switch-in and -out procedures were
already discussed in Section 3.3.2, which explains how executable data structures are used
to implement fast context switching. Giving each thread its own vector table also differs
from usual practice, which makes the vector table a global structure, shared by all threads
or processes. By having a separate vector table per thread, Synthesis saves the dispatching
cost of thread-specific exceptions. Since most of the exceptions are thread-specific, the
savings are significant. Examples include all the error traps, such as division by zero, and
the VM-related traps, such as translation fault.
An invalid memory reference raises the translation-fault exception, even from supervisor mode; the fault handler then reads in the
referenced page from backing store if it was missing, or prints a diagnostic message if the
access is disallowed. (All this works because all quajects are reentrant, and since system
calls are built out of quajects, all system calls are reentrant.)
Synthesis threads also provide a mechanism whereby routines executing in supervisor
mode can make protected calls to user-mode procedures. It is mostly used to allow user-
mode handling of exceptions that arise during supervisor execution, for example, someone
typing "Control-C" while the thread is in the middle of a kernel call. It is also expected
to find use in a future implementation of remote procedure call. The hard part in allowing
user-level procedure calls is not making the call, but arranging for a protected return from
user mode back to supervisor mode. This is done by pushing a special, exception-causing return
address on the user stack. When the user procedure finishes and returns, the exception is
raised, putting the thread back into supervisor mode.
Interrupt forces a thread to execute a given interrupt procedure; it takes the procedure's address
and a mode parameter. Suspended threads can be interrupted: they will execute the interrupt
procedure and then re-enter the suspended state.
Signal is like interrupt, but with a level of indirection for protection and isolation.
It takes an integer parameter, the signal number, and indexes the thread's signal table
with it, obtaining the address and mode parameters that are then passed to interrupt.
Setsignal associates signal numbers with addresses of interrupt procedures and execution
modes. It takes three parameters: the signal number, an address, and a mode; and it fills
the table slot corresponding to the signal number with the address and mode.
Wait waits for events to happen. It takes one parameter, an integer representing an
event, and it suspends the thread until that event occurs. Notify informs the thread of
the occurrence of events. It too takes one parameter, an integer representing an event, and
it resumes the thread if it had been waiting for this event. The thread system does not
concern itself with what an event is, nor with how the assignment of events to integers is made.
4.3.3 Scheduling
The Synthesis scheduling policy is round-robin with an adaptively adjusted CPU
quantum per thread. Instead of priorities, Synthesis uses fine-grain scheduling, which assigns
larger or smaller quanta to threads based on a "need to execute" criterion. A detailed
explanation of fine-grain scheduling is postponed to Chapter 6. Here, I give only a brief
informal summary.
A thread's "need to execute" is determined by the rate at which I/O data flows
through its I/O channels compared to the rate at which the running thread produces
or consumes this I/O. Since CPU time consumed by the thread is an increasing function
of the data flow, the faster the I/O rate, the faster a thread needs to run. Therefore, the
scheduling algorithm assigns a larger CPU quantum to the thread. This kind of scheduling
must have a fine granularity, since the CPU requirements for a given I/O rate, and the I/O
rate itself, may change quickly, requiring the scheduling policy to adapt to the changes.
Effective CPU time received by a thread is determined by the quantum assigned to
that thread divided by the sum of the quanta assigned to all threads. Priorities can be simulated
and preferential treatment can be given to certain threads in two ways: raising a thread's CPU
quantum, and reordering the ready queue as threads block and unblock. As an event unblocks
a thread, its TTE is placed at the front of the ready queue, giving it immediate access to
the CPU. This minimizes response time to events. Synthesis' low-overhead context switch
allows quanta to be considerably shorter than those of other operating systems without
incurring excessive overhead. Nevertheless, to minimize time spent context switching, CPU
quanta are adjusted to be as large as possible while maintaining the fine granularity. A
typical quantum is on the order of a few hundred microseconds.
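In symbols: if thread i is assigned quantum q_i, its effective share of the CPU is

    s_i = q_i / (q_1 + q_2 + ... + q_n)

so doubling a thread's quantum roughly doubles its share while the other quanta stay fixed.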
4.4.1 Producer/Consumer
The Synthesis implementation of the channel model of I/O follows the well-known pro-
ducer/consumer paradigm. Each data channel has a control flow that directs its data flow.
Depending on the origin and scheduling of the control flow, a producer or consumer can
be either active or passive. An active producer (or consumer) runs on a thread and calls
functions submitting (or requesting) its output (or input). A thread performing writes is
active. A passive producer (or consumer) does not run on its own; it sits passively, waiting
for one of its I/O functions to be called, then uses the thread that called the function
to initiate the I/O. A TTY window is passive; characters appear on the window only in
response to other threads' I/O. There are three cases of producer/consumer relationships,
which we shall consider in turn.
The simplest is an active producer and a passive consumer, or vice-versa. This case,
called active-passive, has a simple implementation. When there is only one producer and
one consumer, a procedure call does the job. If there are multiple producers, we serialize
their access. If there are multiple consumers, each consumer is called in turn.
The most common producer/consumer relationship has both an active producer
and an active consumer. This case, called active-active, requires a queue to mediate the
two. For a single producer and a single consumer, an ordinary queue suffices. For cases
with multiple participants on either the producer or consumer side, we use one of the
optimistically-synchronized concurrent-access queues described in Section 5.2.2. Each queue
may be synchronous (blocking) or asynchronous (using signals) depending on the situation.
The last case is a passive producer and a passive consumer. Here, we use a pump
quaject that reads data from the producer and writes it to the consumer. This works for
multiple passive producers and consumers as well.
4.6 Summary
The positive experience in using quajects shows that a highly efficient implementation
of an object-based system can be achieved. The main ingredients of such an implementation
are:
a procedural interface using callout and callentry references,
explicit callback references for asynchronous return,
run-time code generation and linking.
Chapter 7 backs this up with measurements. But now, we will look at issues involving
multiprocessors.
5 Concurrency and Synchronization
The most exciting phrase to hear in science,
the one that heralds new discoveries, is not
"Eureka!" (I found it!) but "That's funny ..."
-- Isaac Asimov
Besides the overhead of acquiring and releasing locks, locking methods suffer from
three major disadvantages: contention, deadlock, and priority inversion. Contention occurs
when many competing processes all want to access the same lock. Important global data
structures are often points of contention. In Mach, for example, a single lock serializes
access to the global run-queue [7]. This becomes a point of contention if several processors
want to access the queue at the same time, as would occur when the scheduler clocks
are synchronized. One way to reduce the lock contention in Mach relies on scheduling
"hints" from the programmer. For example, hand-off hints may give control directly to the
destination thread, bypassing the run queue. Although hints may decrease lock contention
for specific cases, their use is difficult and their benefits uncertain.
Deadlock results when two or more processes each need locks held by the other.
Typically, deadlocks are avoided by imposing a strict request order for the resources. This
is a difficult solution because it requires system-wide knowledge to perform a local function,
which goes against the modern programming philosophy of information hiding.
Priority inversion occurs when a low-priority process in a critical section is
preempted, causing other, higher-priority processes to wait for that critical section. This
can be particularly problematic for real-time systems, where rapid response to urgent events
is essential. There are sophisticated solutions to the priority inversion problem [8], but
they make locks more costly and less appealing.
A final problem with locks is that they are state. In an environment that allows
partial failure, such as parallel and distributed systems, a process can set a lock and
then crash. All other processes needing that lock then hang indefinitely.
int data_val;

AtomicUpdate(update_function)
{
retry:
    old_val = data_val;
    new_val = update_function(old_val);     // compute the new value from the old
    if(CAS(&data_val, old_val, new_val) == FAIL)
        goto retry;                         // data_val changed under us; try again
    return new_val;
}
Herlihy provides methods of partitioning large data structures so that not all of the structure needs to be
copied, but in general his methods are expensive.
Herlihy defines an implementation of a concurrent data structure to be wait-free if
it guarantees that each process modifying the data structure will complete the operation in
a finite number of steps. He defines an implementation to be non-blocking if it guarantees
that some process will complete an operation in a finite number of steps. Both prevent
deadlock. Wait-free also prevents starvation. In this paper, we use the term lock-free as
synonymous with non-blocking. We have chosen to use lock-free synchronization instead of
wait-free because the cost of wait-free is much higher and the chances of starvation in an
OS kernel are low; I was unable to construct a test case where starvation would happen.
Even with the weaker goal of non-blocking, Herlihy's data structures are expensive,
even when there is no interference. For example, updating a limited-depth stack is im-
plemented by copying the entire stack to a newly allocated block of memory, making the
changes on the new version, and switching the pointer to the stack with a Compare-&-Swap.
This cost is much too high, and we want to find ways to reduce it.
We did not want to disable interrupts, because we wanted to support I/O devices that
interrupt at a very high rate, such as the Sound-IO devices. Also, disabling interrupts
by itself does not work for multiprocessors.
We wanted a synchronization method that does not have the problem of deadlock.
The reason is that we wanted as much flexibility as possible to examine and modify
running kernel threads. We wanted to be able to suspend threads to examine their
state without affecting the rest of the system.
Given these desires, lock-free synchronization is the method of choice. Lock-free
synchronization does not have the problems of priority inversion and deadlock. I also feel
it leads to more robust code, because there can never be the problem of a process getting
stuck and hanging while holding a lock.
Unfortunately, Herlihy's general wait-free methods are too expensive. So instead
of trying to implement arbitrary data structures lock-free, we take a different tack: we
ask "what data structures can be implemented lock-free, efficiently?" We then build the
kernel out of these structures. This differs from the usual way: typically, implementors
select a synchronization method that works generally, such as semaphores, then use that
everywhere. We want to use the cheapest method for each job. We rely on the quaject
structuring of the kernel and on code synthesis to create special synchronization for each
need.
The job is made easier because the Motorola 68030 processor supports a two-word
Compare-&-Swap operation. It is similar in operation to the one-word Compare-&-Swap,
except that two words are compared, and if they both match, two updates are performed.
Two-word Compare-&-Swap lets us efficiently implement many basic data structures, such
as stacks, queues, and linked lists, because we can atomically update both a pointer and the
location being pointed to in one step. In contrast, Herlihy's algorithms, using single-word
Compare-&-Swap, must resort to copying.
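For reference, the semantics of the two primitives are written out sequentially in C below (on the 68030 each is a single atomic instruction, CAS and CAS2; this sketch is not atomic, and FAIL is taken to be 0):

    #define FAIL 0
    typedef long word;

    int CAS(word *a, word oldv, word newv)
    {
        if (*a != oldv)
            return FAIL;          /* memory changed since it was read */
        *a = newv;
        return 1;
    }

    int CAS2(word *a1, word *a2, word old1, word old2, word new1, word new2)
    {
        if (*a1 != old1 || *a2 != old2)
            return FAIL;          /* both comparisons must succeed */
        *a1 = new1;               /* both updates happen together */
        *a2 = new2;
        return 1;
    }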
The first step is to see if synchronization is necessary at all. Many times the need for
synchronization can be avoided through code isolation, where only specialized code that is
known to be single-threaded handles the manipulation of data. An example of code isolation
is in the run-queue. Typically a run-queue is protected by semaphores or spin-locks, as
in the Unix and Mach implementations [7]. In Synthesis, only code residing in each element
can change it, so we separate the run-queue traversal, which is done lock-free, safely and
concurrently, from the queue-element update, which is done locally by its associated thread.
Another example occurs in a single-producer, single-consumer queue. As long as the queue
is neither full nor empty, the producer and consumer work on different parts of it and need
not synchronize.
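A sketch of such a queue in C (hypothetical; it assumes a uniprocessor or strongly-ordered memory, sidestepping the ordering issues a modern multiprocessor would add): the producer writes only head, the consumer writes only tail, so neither needs an atomic operation.

    #define QSIZE 256
    static int q[QSIZE];
    static volatile unsigned head, tail;   /* producer advances head; consumer advances tail */

    int put(int x)                         /* called only by the single producer */
    {
        if (head - tail == QSIZE)
            return 0;                      /* full */
        q[head % QSIZE] = x;
        head++;                            /* publish only after the data is written */
        return 1;
    }

    int get(int *x)                        /* called only by the single consumer */
    {
        if (head == tail)
            return 0;                      /* empty */
        *x = q[tail % QSIZE];
        tail++;
        return 1;
    }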
Once it has been determined that synchronization is unavoidable, the next step is to
try to encode the shared information into one or two machine words. If that succeeds, then
we can use Compare-&-Swap on the one or two words directly. Counters, accumulators,
and state-flags all fall in this category. If the shared data is larger than two words, then
we try to encapsulate it in one of the lock-free quajects we have designed, explained in the
next section: LIFO stacks, FIFO queues, and general linked lists. If that does not work,
we try to partition the work into two pieces: one part that can be done lock-free, such as
enqueueing the work and setting a "work-to-be-done" flag, and another part that can be
postponed and done at a time when it is known that interference will not happen (e.g., code
isolation). Suspension of threads, which is discussed in Section 5.3.2, follows this idea:
a thread is marked suspended; the actual removal of the thread from the run-queue occurs
when the thread is next scheduled.
When all else fails, it is possible to create a separate thread that acts as a server to
serialize the operations. Communication with the server happens using lock-free queues to
assure consistency. This method is used to update complex data structures, such as those
in the VT-100 terminal emulator. Empirically, I have found that after all the other causes
of synchronization have been eliminated or simplified as discussed above, the complex data
structures that remain are rarely updated concurrently. In these cases, we can optimize,
dispensing with the server thread except when interference occurs. Invoking an operation
sets a "busy" flag and then proceeds with the operation, using the caller's thread to do the
work. If a second thread now attempts to invoke an operation on the same data, it sees the
busy flag set, and instead enqueues the work. When the first thread finishes the operation,
it sees a non-empty work queue, and spawns a server thread to process the remaining work.
This server thread persists as long as there is work in the queue. When the last request has
been processed, the server dies.
In addition to using only lock-free objects and optimistic critical sections, we also
try to minimize the length of each critical section, to decrease the probability of retries. The
longer a process spends in the critical section, the greater the chance of outside interference
forcing a retry. Even a small decrease in length can have a profound effect on retries.
Insert(elem)
{
retry:
    old_first = list_head;
    *elem = old_first;                  // elem->next = old head of list
    if(CAS(&list_head, old_first, elem) == FAIL)
        goto retry;                     // head moved; try again
}

Delete()
{
retry:
    old_first = list_head;
    if(old_first == NULL)
        return NULL;
    second = *old_first;                // second = old_first->next
    if(CAS2(&list_head, old_first, old_first, second, second, 0) == FAIL)
        goto retry;                     // head or its next field changed; try again
    return old_first;
}
Sometimes a critical section can be divided into two shorter ones by finding a consistent
intermediate state. Shifting some code between readers and writers will sometimes produce
a consistent intermediate state.
Push(elem)
{
retry:
    old_SP = SP;
    new_SP = old_SP - 1;                // stack grows downward
    old_val = *new_SP;                  // read, to give CAS2 its second comparison value
    if(CAS2(&SP, new_SP, old_SP, old_val, new_SP, elem) == FAIL)
        goto retry;                     // SP moved or slot changed; try again
}

Pop()
{
retry:
    old_SP = SP;
    new_SP = old_SP + 1;
    elem = *old_SP;                     // read the top element
    if(CAS2(&SP, old_SP, old_SP, elem, new_SP, elem) == FAIL)
        goto retry;                     // SP moved or element changed; try again
    return elem;
}
Put(elem)
{
retry:
    old_head = Q_head;
    new_head = old_head + 1;            // advance the head pointer...
    if(new_head >= Q_end)
        new_head = Q_begin;             // ...wrapping around the circular buffer
    if(new_head == Q_tail)
        return FULL;
    old_elem = *new_head;               // read, to give CAS2 its second comparison value
    if(CAS2(&Q_head, new_head, old_head, old_elem, new_head, elem) == FAIL)
        goto retry;                     // head moved or slot changed; try again
}

Get()
{
retry:
    old_tail = Q_tail;
    if(old_tail == Q_head)
        return EMPTY;
    elem = *old_tail;                   // read the element at the tail
    new_tail = old_tail + 1;            // advance the tail pointer...
    if(new_tail >= Q_end)
        new_tail = Q_begin;             // ...wrapping around the circular buffer
    if(CAS2(&Q_tail, old_tail, old_tail, elem, new_tail, elem) == FAIL)
        goto retry;                     // tail moved or element changed; try again
    return elem;
}
To push, the stack pointer is first read into a private variable, then the new item is stored there and the stack pointer updated using
a two-word Compare-&-Swap. (The data must be read to give Compare-&-Swap-2 two
comparison values. Compare-&-Swap-2 always performs two tests; sometimes one of them
is undesirable.)
Figure 5.6 shows a lock-free implementation of a circular queue. It is very similar to
the stack implementation, and will not be discussed further.
VisitNextNode(current)
{
nextp = & current->next; // Point to current node's next-node field
retry:
next_node = *nextp; // Point to the next node
if(next_node != NULL) { // If node exists...
refp = & next_node->refcnt; // Point to next node's ref. count field
old_ref = *refp; // Get value of next node's ref. count
new_ref = old_ref + 1; // And increment
if(CAS2(nextp, refp, next_node, old_ref, next_node, new_ref) == FAIL)
goto retry;
}
return next_node;
}
ReleaseNode(current)
{
refp = & current->refcnt; // Point to current node's ref. count field
retry:
old_ref = *refp; // Get value of current node's ref. count
new_ref = old_ref - 1; // ... Decrement
if(CAS(refp, old_ref, new_ref) == FAIL)
goto retry;
if(new_ref == 0) {
Deallocate(current);
return NULL;
} else {
return current;
}
}
Times in microseconds; 68030 CPU, 25 MHz, 1-wait-state main memory, cold cache.
The table compares each operation in three forms: lock-free, locking-based,
and one that is not synchronized. The column labeled "Non Sync" shows the time taken
to execute the operation without synchronization. The column labeled "Locked" shows the
time taken by a locking-based implementation of the operation without interference. The
column labeled "Lock-free (no retry)" shows the time taken by the lock-free implementation
when there is no interference. The column labeled "Lock-free (one retry)" shows the time taken
when interference causes the first attempt to retry, with success on the second attempt.[1] For
reference, the first line of the table gives the cost of a null procedure call in the C language,
and the last line gives the cost of a get-semaphore operation in Sony's RTX kernel. (The
RTX kernel runs in the I/O processor of Sony's dual-processor workstation, and is meant
to be a light-weight kernel.)
The numbers shown are for an in-line assembly-code implementation and assume a
pointer to the relevant data structure is already in a machine register. The lock-free code
measured is the same as that produced by the Synthesis kernel code generator. The non-
synchronized code is the best I've been able to do writing assembler by hand. The lock-
based code is the same as the non-synchronized, but preceded by code that disables
interrupts and then obtains a spinlock, and followed by code to clear the spinlock and
re-enable interrupts. The reasoning behind disabling interrupts is to make sure that the thread
does not get preempted in its critical section, guaranteeing that the lock is cleared quickly.
This represents good use of spin-locks, since any contention quickly passes.
Besides avoiding the problems of locking, the table shows that the lock-free imple-
mentation is actually faster than the lock-based one, even in the case of no interference. In
fact, the performance of lock-free in the presence of interference is comparable to locking
without interference.
Let us study the reason for this surprising result. Figures 5.8 and 5.9 show the actual
code that was measured for the linked-list delete operation. Figure 5.8 shows the lock-free
code, while Figure 5.9 shows the locking-based code. The lock-free code closely follows the
principles of operation described earlier. The lock-based code begins by disabling processor
interrupts to guarantee that the process will not be preempted. It then obtains the spin-
lock; interference at this point is limited to that from other CPUs in a multiprocessor, and
the lock should clear quickly. The linked-list delete is then performed, followed by clearing
1
This case is produced by generating an interfering memory reference between the initial read and the
Compare-&-Swap. The Quamachine's memory controller, implemented using programmable gate arrays, lets
us do things like this. Otherwise the interference would be very dicult to produce and measure.
84
the lock and reenabling interrupts. (Don't be fooled by its longer length: part of the code
is executed only when the list is empty.)
Accounting for the costs, the actual process of deleting the element from the list takes almost the same time in both versions, with the lock-free code taking a few cycles longer. (This is because the Compare-&-Swap instruction requires its compare-operands to be in D registers, while indirection is best done through the A registers, whereas the lock-based code can use whatever registers are most convenient.) The cost advantage of lock-free comes from the much higher cost of obtaining and clearing the lock compared to the cost of Compare-&-Swap. The two-word Compare-&-Swap instruction takes 26 machine cycles to execute on the 68030 processor. By comparison, obtaining and then clearing the lock costs 46 cycles, with the following breakdown: 4 to save the CPU status register; 14 to disable interrupts; 12 to obtain the lock; 6 to clear the lock following the operation; and 10 to reenable interrupts. (For reference, fetching from memory costs 6 cycles and a single
5.3 Threads
This section describes how thread operations can be implemented so they are lock-free, and gives timing figures showing their cost.
The run queues work like this: on every second context switch, a thread from level 0 is scheduled, in round-robin fashion. On every fourth context switch, a thread from level 1 is scheduled, also in round-robin fashion. On every eighth context switch, a thread from level 2 is scheduled. And so on, for 8 levels. Each level gets half the attention of the previous level. If there are no threads at a particular level, that level's quanta are distributed among the rest of the levels.
A global counter and a lookup table tell the dispatcher which level's queue is next. The lookup table contains the scheduling policy described above: a 0 every other entry, a 1 every fourth entry, a 2 every eighth entry, like this: (0, 1, 0, 2, 0, 1, 0, 3, 0, 1, ...). Using the counter to follow the priority table, the kernel dispatches a thread from level 0 at every second context-switch, from level 1 at every fourth context-switch, from level 2 at every eighth, and so on, as sketched below.
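A minimal sketch of this dispatch step (the 256-entry table size and the helper names are assumptions; entry i holds the number of trailing zero bits of i+1, capped at 7, which produces exactly the pattern above; redistribution of quanta from empty levels is omitted):
static unsigned char level_table[256];   // (0,1,0,2,0,1,0,3,...)

void InitLevelTable(void)
{
    int i, level;
    unsigned n;
    for(i = 0; i < 256; i++) {
        n = i + 1;
        level = 0;
        while((n & 1) == 0 && level < 7) {   // count trailing zeros
            level++;
            n >>= 1;
        }
        level_table[i] = level;
    }
}

int NextLevel(void)
{
    static unsigned counter;
    counter = (counter + 1) & 255;   // global counter follows the table
    return level_table[counter];     // which level's run-queue is next
}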
When multiple CPUs attempt to dispatch threads from the run-queues, each active
dispatcher (switch-out routine) acquires a new TTE by marking it using Compare-&-Swap.
If successful, the dispatcher branches to the switch-in routine in the marked TTE. Otherwise,
some other dispatcher has just acquired the attempted TTE, so this dispatcher moves on to
try to mark the next TTE. The marks prevent other dispatchers from accessing a particular
TTE, but not from accessing the rest of the run queues.
[Figure: thread state-transition diagram. Running, Ready, and Stopped states are connected by Schedule, Suspend, and Resume transitions; the (STOPME, STOPPED) flag pairs label the states.]
When a scheduler encounters a thread with the STOPME flag set, it removes its TTE from the run-queue and sets the STOPPED flag to indicate that the thread has been stopped. This is done using the two-word compare-and-swap instruction to synchronize with other CPUs' schedulers that may be operating on the adjacent queue elements. The mark on the TTE guarantees that only one CPU is visiting each TTE at any given time. This also makes the delete operation safe.
Resume: First, the STOPME and STOPPED flags are read, and the STOPME flag is cleared to indicate that the thread is ready to run. If the previously-read STOPPED flag indicates that the thread had not yet been removed from the run-queue, we are done. Otherwise, the TTE has already been removed, and we insert the thread directly back into the run queue. The main problem we have to avoid is the case of a neighboring TTE being deleted because its thread was killed. To solve that problem, when a thread is killed, we mark its TTE as "killed," but do not remove it from the run-queue immediately. When a dispatcher finds the next TTE marked "killed" during a context switch, it can safely remove it.
Signal: Thread-signal is synchronized in a way that is similar to thread-resume. Each thread's TTE has a stack for pending signals, which contains the addresses of signal-handler procedures. Thread-signal uses a two-word Compare-&-Swap to push a new procedure address onto this stack. It then sets a signal-pending flag, which the scheduler tests. The scheduler removes procedures from the pending-signal stack, one at a time, and constructs procedure call frames on the thread's runtime stack to simulate the thread having called that procedure.
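A sketch of that push, in the style of the lock-free stack shown earlier (the field and type names here are assumptions, not the Synthesis definitions):
typedef void (*Handler)(void);

void PushSignal(struct TTE *t, Handler h)
{
    Handler *old_sp, *new_sp, old_val;
retry:
    old_sp = t->sig_sp;          // top of the pending-signal stack
    new_sp = old_sp - 1;
    old_val = *new_sp;           // CAS2's second comparison value
    if(CAS2(&t->sig_sp, new_sp, old_sp, old_val, new_sp, h) == FAIL)
        goto retry;
    t->signal_pending = 1;       // flag tested by the scheduler
}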
Step: Thread-step is intended for instruction-at-a-time debugging; concurrent calls defeat its purpose. So we do not give any particular meaning to concurrent calls of this function, except to preserve the consistency of the kernel. In the current implementation, all calls after the first fail. We implement this using an advisory lock.
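An advisory lock needs only the single-word Compare-&-Swap; a caller that loses the race returns failure instead of waiting. A sketch (the lock variable and helper names are assumptions):
static int step_lock = 0;        // 0 = free, 1 = a step is in progress

int ThreadStep(struct TTE *t)
{
    if(CAS(0, 1, &step_lock) == FAIL)
        return FAIL;             // concurrent call: fail, do not wait
    single_step(t);              // hypothetical: perform the actual step
    step_lock = 0;               // release the advisory lock
    return OK;
}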
Thread create has been made significantly faster with a copy-on-write optimization. Recall from Section 4.3 that each thread has a separate vector table. The vector table contains pointers to synthesized routines that handle the various system calls and hardware interrupts. These include the 16 system-call trap vectors, 21 program exception vectors, 19 vectors for hardware failure detection, and, depending on the hardware configuration, from 8 to 192 interrupt vectors. This represents a large amount of state information that had to be initialized: 1024 bytes.
Newly-created threads point their vector table to the vector table of their creator and defer the creation of their own until they need to change the vector table. There are only two operations that change a thread's vector table: opening and closing quajects. If a quaject is not to be shared, open and close test whether the TTE is being shared, and if so they first make a copy of the TTE and then modify the new copy. Alternatively, several threads may share the changes in the common vector table. For example, threads can now perform system calls such as open file and naturally share the resulting file access procedures with the other threads using the same vector table.
Table 5.3 shows the cost of context switching and scheduling. Context-switch is somewhat slower than shown earlier, in Table 3.3, because now we schedule from multiple run queues, and because there is synchronization that was not necessary in the single-CPU version discussed in Section 3.3.2. When changing address spaces, loading the memory management unit's translation table pointer and flushing the translation cache increase the context switch time. Extra time is then used up to fill the translation cache. This is the "+1.6 TLB fill" time. Depending on the thread's locality of reference, this can be as low as 4.5 microseconds for 3 pages (code, global data, and stack) to as high as 33 microseconds to fill the entire TLB cache.
5.4 Summary
We have used only lock-free synchronization techniques in the implementation of the Synthesis multiprocessor kernel on a dual-68030 Sony NEWS workstation. This is in contrast to other implementations of multiprocessor kernels, which use locking. Lock-based synchronization methods such as disabling interrupts, spin-locking, and waiting semaphores have many problems. Semaphores carry high management overhead, and spin-locks may waste a significant amount of CPU. (A typical argument for spin-locks is that the processor would be idle otherwise. This may not apply for synchronization inside the kernel.) A completely lock-free implementation of a multiprocessor kernel demonstrates that synchronization overhead can be reduced, concurrency increased, deadlock avoided, and priority inversion eliminated.
This completely lock-free implementation is achieved with a careful kernel design using the following five-point plan as a guide:
- Avoid synchronization whenever possible.
- Encode shared data into one or two machine words.
- Express the operation in terms of one or more fast lock-free data structure operations.
- Partition the work into two parts: a part that can be done lock-free, and a part that can be postponed to a time when there can be no interference.
- Use a server thread to serialize the operation. Communication with the server happens using concurrent, lock-free queues.
First, we reduced the kinds of data structures used in the kernel to a few simple abstract data types such as LIFO stacks, FIFO queues, and linked lists. Then we restricted the uses of these abstract data types to a small number of safe interactions. Finally, we implemented efficient special-purpose instances of these abstract data types using single-word and double-word Compare-&-Swap. The kernel is fully functional, supporting threads, virtual memory, and I/O devices such as window systems and file systems. The measured numbers show the very high efficiency of the implementation, competitive with user-level thread management systems.
Two lessons were learned from this experience. The first is that a lock-free implementation is a viable and desirable alternative for the development of shared-memory multiprocessor kernels. The usual strategy of evolving a single-processor kernel into a multiprocessor kernel by surrounding critical sections with locks carries some performance penalty and potentially limits the system concurrency. The second is that single- and double-word Compare-&-Swap are important for lock-free shared-memory multiprocessor kernels. Architectures that do not support these instructions may suffer performance penalties if operating system implementors are forced to use locks. Other synchronization instructions, such as the Load-Linked/Store-Conditional found on the MIPS processor, may also yield efficient lock-free implementations.
6
Fine-Grain Scheduling
There's no sense in being precise when you
don't even know what you're talking about.
-- John von Neumann
the job. A typical assumption in global scheduling is that all jobs are independent of each other. But in a pipeline of processes, where successive stages are coupled through their input and output, this assumption does not hold. In fact, a global adaptive scheduling algorithm may lower the priority of a CPU-intensive stage, making it the bottleneck and slowing down the whole pipeline.
To make better scheduling decisions for I/O-bound processes, we take into account local information and coupling between jobs in addition to the global properties. We call such scheduling policies fine-grain because they use local information. An example of interesting local information is the amount of data in the job's input queue: if it is empty, dispatching the job will merely block for lack of input. This chapter focuses on the coupling between jobs in a pipeline, using as the local information the amount of data in the queues linking the jobs.
Fine-grain scheduling is implemented in the Synthesis operating system. The approach is similar to feedback mechanisms in control systems. We measure the progress of each job and make scheduling decisions based on the measurements. For example, if the job is "too slow," say because its input queue is getting full, we schedule it more often and let it run longer. The measurements and adjustments occur frequently, accurately tracking each job's needs.
The key idea in the fine-grain scheduling policy is modeled after the hardware phase locked loop (PLL). A PLL outputs a frequency synchronized with a reference input frequency. Our software analogs of the PLL track a reference stream of interrupts to generate a new, stable source of interrupts locked in step. The reference stream can come from a variety of sources, for example an I/O device, such as disk index interrupts that occur once every disk revolution, or the interval timer, such as the interrupt at the end of a CPU quantum. For readers unfamiliar with control systems, the PLL is summarized in Section 6.2.
Fine-grain scheduling would be impractical without fast interrupt processing, fast context switching, and low dispatching overhead. Interrupt handling should be fast, since it is necessary for dispatching another process. Context switch should be cheap, since it occurs often. The scheduling algorithm should be simple, since we want to avoid a lengthy search or calculation for each decision. Chapter 3 already addressed the first two requirements. Section 6.2.3 shows that the scheduling algorithms are simple.
[Figure: block diagram of a phase locked loop. A phase comparator (gain Kd) compares the input with the output of a divide-by-N counter; the error voltage passes through a filter F(s) to a voltage-controlled oscillator (gain Kvco/s) that produces the output frequency.]
[Figure: relationships between the measurement domains. Differentiation and integration map between times and events; the reciprocal (1/x) maps between intervals and frequencies.]
minimize its error compared to the reference, in the same way the VCO adjusts the output frequency.
Let us consider a practical example from a disk driver: we would like to know which sector is under the disk head, to perform rotational optimization in addition to the usual seek optimizations. This information is not normally available from the disk controller. But by using feedback, we can derive it from the index-interrupt that occurs once per disk revolution, supplied by some ESDI disk controllers. The index-interrupt supplies the input reference. The rate divider, N, is set to the number of sectors per track. An interval timer functions as the VCO and generates periodic interrupts corresponding to the passage of new sectors under the drive head. The phase comparator and filter are algorithms described in Section 6.2.3.
When we use software to implement the PLL idea, we find more flexibility in measurement and control. Unlike hardware PLLs, which always measure phase differences, software can measure either the frequency of the input (events per second) or the time interval between inputs (seconds per event). Analogously, we can adjust either the frequency of generated interrupts or the intervals between them. Combining the two kinds of measurements with the two kinds of adjustments, we get four kinds of software locked loops. This dissertation looks only at software locked loops that measure and adjust the same variable. We call a software locked loop that measures and adjusts frequency an FLL (frequency locked loop) and one that measures and adjusts time intervals an ILL (interval locked loop).
In general, all stable locked loops minimize the error (feedback signal). Concretely, an FLL measures frequency by counting events, so its natural behavior is to maintain the number of events (and thus the frequency) equal to the input. An ILL measures intervals, so its natural behavior is to maintain the interval between consecutive output interrupts equal to the interval between inputs. At first, this seems to be two ways of looking at the same thing. And if the error were always zero, it would be. But when a change in the input happens, there is a period of time when the loop oscillates before it converges to the new output value. During this time, the differences between ILL and FLL show up. An FLL tends to maintain the correct number of events, although the interval between them may vary from the ideal. An ILL tends to maintain the correct interval, even though it might mean losing some events to do so.
This natural behavior can be modified with filters. The overall response of a software locked loop is determined by the kind of filter it uses to transform measurements into adjustments. A low-pass filter makes the FLL output frequency or the ILL output intervals more uniform, less sensitive to transient changes in the input. But it also delays the response to important changes in the input. An integrator filter allows the loop to track linearly changing input without error. Without an integrator, only constant input can be tracked error-free. Two integrators allow the loop to track quadratically changing input without error. But too many integrators tend to make the loop less stable and lengthen the time it takes to converge. A derivative filter improves response to sudden changes in the input, but also makes the loop more prone to noise. Like their hardware analogs, these filters can be combined to improve both the response time and stability of the software locked loop.
LoPass(x)
{
    static int lopass;

    lopass = (7*lopass + x) / 8;    // weighted average: 7/8 old, 1/8 new
    return lopass;
}
If i2 and i1 were running at the perfect relative rate of 4 to 1, residue would tend to zero and freq would not be changed. But if i2 is slower than 4 times i1, residue becomes positive, increasing the frequency of i2 interrupts. Similarly, if i2 is faster than 4 times i1, i2 will be slowed down. As the difference in relative speeds increases, the correction becomes correspondingly larger. As i1 and i2 approach the exact ratio of 1:4, the difference decreases and we reach the minimum correction, with residue being decremented by one and incremented by four, cycling between -2 and +2. Since residue can never converge to zero, only hover around it, the i2 execution frequency will always jitter slightly. In practice, residue would be scaled down by an appropriate factor so that the jitter is negligible.
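The code figure for this loop is not reproduced here; the following sketch reconstructs its shape from the description above (the timer interface and the scale factor are assumptions). i1 is the reference interrupt handler and i2 the generated one, locked at 4 times i1's rate:
int residue = 0;     // error term: +4 per i1 event, -1 per i2 event
int freq;            // current i2 interrupt frequency

i1()                 // reference interrupt handler
{
    residue += 4;    // expect four i2 events per i1 event
}

i2()                 // generated interrupt handler
{
    residue -= 1;                 // one i2 event has occurred
    freq += LoPass(residue) / 8;  // filtered, scaled correction
    set_timer_freq(freq);         // hypothetical timer interface
    // ... the work scheduled at this rate goes here ...
}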
Figures 6.4, 6.5, and 6.6 show some simple filters that can be used alone or in combination to improve the responsiveness and stability of the FLL. In particular, the low-pass filter shown in Figure 6.4 helps eliminate the jitter mentioned earlier, at the expense of a longer settling time. The variable lopass keeps a "history" of what the most recent residues were. Each update adds 1/8 of the new residue to 7/8 of the old lopass. This has the effect of taking a weighted average of recent residues. When residue is positive
Integrate(x)
{
    static int accum;

    accum = accum + x;      // running sum of all inputs seen so far
    return accum;
}
Deriv(x)
{
    static int old_x;
    int dx;

    dx = x - old_x;         // difference from the previous input
    old_x = x;
    return dx;
}
for many iterations, as is the case when i2 is too slow, lopass will eventually become equal to residue. But if residue oscillates rapidly, as in the situation described in the previous paragraph, lopass will go to zero. The derivative is never used alone, but can be used in combination with other filters to improve response to rapidly-changing inputs.
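Because each filter keeps its own static state, the filters compose by simple function application, as sketched below (the gains are arbitrary, chosen only for illustration):
FilteredCorrection(residue)
{
    return LoPass(residue)          // smooth out jitter
         + Integrate(residue) / 8   // track steadily changing input
         + Deriv(residue) / 2;      // react quickly to sudden changes
}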
A second scenario favors the FLL. Suppose you have an interval timer with a resolution of one-sixtieth of a second. The input event occurs 30 times a second. Since the FLL is independent of timer resolution, its output will still stabilize to within 10% after seeing about 50 events (in about 1.7 seconds). However, since the event interval is comparable to the resolution of the timer, an ILL will suffer a loss of accuracy. In this example, the measured interval will be either 1, 2, or 3 ticks, depending on the relative timing between the clock and the input. Thus the ILL's output can have an error of as much as 50%.
Generally, slow input rates and high-resolution timers favor the ILL, while high input rates and low-resolution timers favor the FLL. Sometimes the problem at hand forces a particular choice. For example, in queue handling procedures, the number of get-queue operations must equal the number of put-queue operations. This forces the use of an FLL, since the actual number of events controls the actions. In another example, subdivision of a time interval (as in the disk sector finder), an ILL is best.
main()
{
    char buf[100];
    int n, fd1, fd2;

    fd1 = open("/dev/cd", 0);           // CD player, opened for reading
    fd2 = open("/dev/speaker", 1);      // speaker, opened for writing
    for(;;) {
        n = read(fd1, buf, 100);        // move CD audio data...
        write(fd2, buf, n);             // ...to the speaker
    }
}
The effect of this scheduling policy is to allocate enough CPU to each thread in the pipeline so it can process its data. Threads connected to the high-speed Sound-IO devices find their input queues being filled, or their output queues being drained, at a high rate. Consequently, their share of CPU increases until the rate at which they process data equals the rate at which it arrives. As these threads run and produce output, the downstream threads find that their queues start to fill, and they too receive more CPU. As long as the total CPU necessary for the entire pipeline does not exceed 100%, the pipeline runs in real-time.
The simplification in applications programming that occurs using this scheduler cannot be overstated. One no longer needs to worry about assigning priorities to jobs, or about carefully crafting the inner loops so that everything is executed frequently enough. For example, in Synthesis, reading from the CD player is no different than reading from any other device or file. Simply open "/dev/cd" and read from it. To listen to the CD player, one could use the program in Figure 6.7. The scheduler FLL keeps the data flowing smoothly at the 44.1 KHz sampling rate (176 kilobytes per second for each channel) regardless of how many CPU-intensive jobs might be executing in the background.
Several music-oriented signal-processing applications have been written for Synthesis and run in real-time using the FLL-based thread scheduler. The Synthesis music and signal-processing toolkit includes many simple programs that take sound input, process it in some way, and produce sound output. These include delay elements, echo and reverberation filters, adjustable low-pass, band-pass and high-pass filters, Fourier transform, and a correlator and feature extraction unit. These programs can be connected together in a pipeline to perform more complex sound processing functions, in a similar way that text filters in Unix can be cascaded using the shell's "|" notation. The thread scheduler ensures that each stage of the pipeline receives the CPU time it needs to keep the sound flowing in real time.
6.3.4 Discussion
A formal analysis of fine-grain scheduling is beyond the scope of this dissertation. However, I would like to give readers an intuitive feeling for two situations: saturation and cheating. As the CPU becomes saturated, the FLL-based scheduler degrades gracefully. The processes closest to externally generated interrupts (device drivers) will still get the necessary CPU time. The CPU-intensive processes away from I/O interrupts will slow down first, as they should at saturation.

Footnote 1: This program runs on the Quamachine at a 50 MHz clock rate.
idea will find useful application in this area. In this section, I only outline the general idea, without offering proof or examples. For a good discussion of issues in real-time computing, see [29].
We divide hard-deadline jobs into two categories: the short ones and the long ones. A short job is one that must be completed in a time frame within an order of magnitude of interrupt and context switch overhead. For example, a job taking up to 100 microseconds would be a short job in Synthesis. Short jobs are scheduled as they arrive and run to completion without preemption.
Long jobs take longer than 100 times the overhead of an interrupt and context switch. In Synthesis this includes all jobs that take more than 1 millisecond, which covers most practical applications. The main problem with long jobs is the variance they introduce into scheduling. If we always assume the worst-case scenario, the resulting hardware requirement is usually very expensive and unused most of the time.
To use fine-grain scheduling policies for long jobs, we break the long job into small strips. For simplicity of analysis, we assume each strip has the same execution time, ET. We define the estimated CPU power needed to finish job J as:

    Estimate(J) = (strips in J) * ET / (Deadline(J) - Now)
For a long job, it is not necessary to know ET exactly, since the locked loop "measures" it and continually adjusts the schedule in lock step with the actual execution time. In particular, if Estimate(J) > 1, then we know from the current estimate that J will not make the deadline. If we have two jobs, A and B, with Estimate(A) + Estimate(B) > 1, then we may want to consider aborting the less important one and calling a short emergency routine to recover.
Unlike traditional hard-deadline scheduling algorithms, which either guarantee completion or nothing, fine-grain scheduling provides the ability to predict a deadline miss under dynamically changing system loads. I believe this is an important practical concern to real-time application programmers, especially in recovery from faults.
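A sketch of this predictor in integer arithmetic (the fixed-point scale and the structure fields are assumptions made for the example):
#define SCALE 1024                // fixed point: SCALE represents 1.0

long Estimate(struct Job *j, long now)
{
    // (strips remaining * ET) / (time remaining), in SCALE units
    return (j->strips_left * j->ET * SCALE) / (j->deadline - now);
}

// if Estimate(A, now) + Estimate(B, now) > SCALE, consider aborting
// the less important job and calling its emergency recovery routine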
[Figure 6.8: static scheduling timeline. Execute phases on P1 and P2 alternate with disk transfers, leaving idle gaps on both processors.]
buffer size and execution time. Recognize that at a given load, we can always find the optimal scheduling statically by calculating the best buffer size and CPU quantum. But I emphasize the main advantage of feedback: the ability to dynamically adjust towards the best buffer size and CPU quantum. This is important when we have a variable system load, jobs with variable demands, or a reconfigurable system with a variable number of CPUs.
Figure 6.8 shows the static scheduling for a two-processor shared-memory system with a common disk (transfer rate of 2 MByte/second). We assume that both processes access the disk drive at the full transfer rate, e.g. reading and writing entire tracks. Process 1 runs on processor 1 (P1) and process 2 runs on processor 2 (P2). Process 1 reads 100 KByte from the disk into a buffer, takes 100 milliseconds to process them, and writes 100 KByte through a pipe into process 2. Process 2 reads 100 KByte from the pipe, takes another 100 milliseconds to process them, and writes 100 KByte out to disk. In the figure, process 1 starts to read at time 0. All disk activities appear in the bottom row, P1 and P2 show the processor usage, and shaded quadrangles show idle time.
Figure 6.9 shows the fine-grain scheduling mechanism (using an FLL) for the same system. We assume that process 1 starts by filling its 100 KByte buffer, but soon after it starts to write to the output pipe, process 2 starts. Both processes run to exhaust the buffer, when process 1 will read from the disk again. After some settling time, depending on the filter used in the locked loop, the stable situation is for the disk to remain continuously active, alternately reading into process 1 and writing from process 2. Both processes also run continuously, with the smallest buffer that maintains the nominal transfer rate.
The above example illustrates the benefits of fine-grain scheduling policies in par-
[Figure 6.9: fine-grain scheduling timeline for the same example. After a brief settling period, the disk stays continuously busy and both processes execute continuously.]
6.5 Summary
We have generalized scheduling from job assignments as a function of time to job assignments as a function of any source of interrupts. The generalized scheduling is most useful when we have fine-grain scheduling, which uses frequent state checks and dispatching actions to adapt quickly to system changes. Relevant new applications of generalized fine-grain scheduling include I/O device management, such as a disk sector interrupt source, and adaptive scheduling, such as real-time scheduling and distributed scheduling.
7
Measurements and Evaluation
15. Everything should be built top-down, except the first time.
-- Alan J. Perlis, Epigrams on Programming
The Quamachine also provides I/O devices for music and audio signal processing: stereo 16-bit analog output, stereo 16-bit analog input, and a compact disc (CD) player digital interface.
The Sony NEWS 1860 is a commercially-available workstation with two 68030 processors. Its architecture is not symmetric: one processor is meant to be the main processor, and the other is meant to be the I/O processor. Synthesis tries to treat it as if it were a symmetric multiprocessor, scheduling most tasks on either processor without preference, except those that require something accessible from one processor and not the other. While this is not a large number of processors, it nevertheless helps demonstrate Synthesis multiprocessor support. For the measurements in this chapter, however, only one processor (the slower I/O processor) was used, with the kernel's multiprocessor support kept intact.
7.1.2 Software
A partial emulator for Unix runs on top of the Synthesis kernel and emulates some of the SUNOS (version 3.5) kernel calls. This provides a direct way of measuring and comparing two otherwise very different operating systems. Since the executables are the same, the comparison is direct. The emulator further demonstrates the generality of Synthesis by setting a lower bound: Synthesis is at least as general as Unix if it can emulate Unix. It also helps with the problem of acquiring application software for a new operating system by allowing the use of SUN-3 binaries instead. Although the emulator supports only a subset of the Unix system calls (time constraints have forced an "implement-as-the-need-arises" strategy), the set supported is sufficiently rich to provide a good idea of what the relative times for the basic operations are.
Ideally, we would want to run both Synthesis and SUNOS on the same hardware. Unfortunately, we could not obtain detailed information about the Sun-3 machine, so Synthesis has not been ported to the Sun. Instead, we closely emulate the hardware characteristics of a Sun-3 machine using the Quamachine. This involves three changes: replace the 68030 CPU with a 68020, set the CPU speed to 16 MHz, and introduce one wait-state into the main-memory access. To validate the faithfulness of the hardware emulation, the first benchmark program is a compute-bound test. This test program implements a function producing a chaotic sequence (see footnote 1). It touches a large array at non-contiguous points, which ensures that we are not just measuring the "in-the-cache" performance. Since it does not use any operating system resources, the measured times on the two machines should be the same.
Table 7.1 summarizes the results of the measurements. The columns under "Raw SUN data" were obtained using the Unix time command and verified with a stopwatch. The SUN was unloaded during these measurements, and time reported more than 99% CPU available for them. The columns labeled "usr," "sys," and "total" give the time spent in the user's program, in the SUNOS kernel, and the total elapsed time, as reported by the time command. The column labeled "usr+sys" is the sum of the user and system times, and is the number used for comparisons with Synthesis. The Synthesis emulator data were obtained by using the microsecond-resolution real-time clock on the Quamachine, rounded to hundredths of a second. These times were also verified with a stopwatch, sometimes by running each test 10 times to obtain a more easily measured time interval. The column labeled "Ratio" gives the ratio of the preceding two columns. The last column, labeled "I/O Rate," gives the overall Synthesis I/O rate in megabytes per second for those test programs performing I/O.

Footnote 1: Pages 137-138 in Godel, Escher, Bach: An Eternal Golden Braid, by Douglas Hofstadter.
The first program is a compute-intensive calibration function to validate the hardware emulation.
Programs 2, 3, and 4 write and then read back data from a Unix pipe in chunks of 1, 1024, and 4096 bytes. Program 2 shows a remarkable speed advantage (56 times) for the single-byte read/write operations. Here, the low overhead of the Synthesis kernel calls really makes a difference, since the amount of data moved is small and most of the time is spent in overhead. But even as the I/O size grows to the page size, the difference remains significant: 4 to 6 times. Part of the reason is that the SUNOS overhead is still significant even when amortized over more data. Another reason is the fast synthesized routines that move data across address spaces. The generated code loads words from one address space into registers and stores them back in the other address space. With unrolled loops this achieves a data transfer rate of about 8 MB per second.
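The actual routine is runtime-generated 68030 code; the following C sketch (names invented) conveys only the unrolling idea, moving four words per loop iteration to reduce the loop-control overhead per word:
void copy_words(long *dst, long *src, int nwords)
{
    int n;
    for(n = nwords / 4; n > 0; n--) {   // unrolled: 4 words per pass
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;
        src += 4;
    }
    for(n = nwords & 3; n > 0; n--)     // the remaining 0-3 words
        *dst++ = *src++;
}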
Program 5 reads and writes a file (cached in main memory) in chunks of 1K bytes. It too shows a remarkable speed improvement over SUNOS.
Programs 6 and 7 repeatedly open and close /dev/null and /dev/tty. They show that Synthesis kernel code generation is very efficient. The open operations create executable code for later read and write, yet they are 20 to 40 times faster than the Unix open, which does no code generation. Table 7.3 contains more details of file system operations; these are discussed in the next section.
7.4 Experience
7.4.1 Assembly Language
The current version of Synthesis is written in 68030 macro assembly language. This section reports on the experience.
Perhaps the first question people ask is, "Why is Synthesis written in assembler?" This is soon followed by "How much of Synthesis could be re-written in a high-level language?" and "At what performance loss?"
There are several reasons why assembly language was chosen, some of them research-related and some of them historical. One reason is that I felt it would be an interesting experiment to write a medium-size system in assembler, which allows unrestricted access to the machine's architecture, and perhaps discover new coding idioms that have not yet been captured in a higher-level language. Later paragraphs talk about these. Another reason is that much of the early work involved discovering the most efficient way of working with the machine and its devices. Assembler was a fast prototyping language, one in which I could write and test simple I/O drivers without the trouble of supporting a complex language runtime environment.
But perhaps the biggest reason is that in 1984, at the time the seed ideas were being developed, I could not find a good, reliable (bug-free) C compiler for the 68000 processor. I had tried the compilers on several 68000-based Unix machines and repeatedly found that compilation was slow, that the compilers were buggy, that they produced terrible machine code, and that their runtime libraries were not reentrant. These qualities interfered with my creativity and desire to experiment. Slow compilation dampens the enthusiasm for trying new ideas because the edit-compile-test cycle is lengthened. Buggy compilers make it that much harder to write correct code. Poor code generation makes optimization efforts seem meaningless. And non-reentrant runtime libraries make it harder to write a multi-threaded kernel that can take advantage of a multiprocessor architecture.
Having started coding in assembler, it was easier to continue that way than to change. I had written an extensive library of utilities, including a fully reentrant C-language runtime library and subroutines for music and signal processing. In particular, I found my signal processing algorithms difficult to express in C. To achieve the high performance necessary for real-time operation, I use fixed-point arithmetic for the calculations, not floating-point. The C language provides poor support for fixed-point math, particularly multiply and divide. The Synthesis "printf" output conversion and formatting function provides a stunning example of the performance improvements that result from carefully-coded fixed-point math. This function converts a floating-point number into a fully-formatted ASCII string 1.5 times faster than the machine instruction on the 68882 floating-point coprocessor converts binary floating-point to unformatted BCD (binary-coded decimal).
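As an illustration of the style (this is not the dissertation's code, and it uses a modern 64-bit intermediate type for clarity): in 16.16 fixed-point, a multiply needs the high part of a double-width product, which C could not express directly at the time:
typedef long fix;                  // 16.16 fixed-point value

fix fxmul(fix a, fix b)
{
    // the 64-bit intermediate keeps the full 32x32-bit product
    return (fix)(((long long)a * b) >> 16);
}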
Overall, the experience has been a positive one. A powerful macro facility helped minimize the difficulty of writing complex programs. The Synthesis assembler macro processor borrows heavily from the C-language macro processor, sharing much of the syntax and semantics. It provides important extensions, including macros that can define macros and quoting and "eval" mechanisms. Quaject definition, for example, is a declarative macro instruction in the assembler. It creates all the code and data structures needed by the kernel code generator, so the programmer need not worry about these details and can concentrate on the quaject's algorithms. Also, the Synthesis assembler (written in C, by the way) assembles 5000 lines per second. Complete system generation takes only 15 seconds. The elapsed time from making a change to the Synthesis source to having a new kernel booted and running is less than a minute. Since the turn-around time is so fast, I am much more likely to try different things.
To my surprise, I found that there are some things that were distinctly easier to do using the Synthesis assembler than using C. In many of these, the powerful macro processor played an important role, and I believe that the C language could be usefully improved with this macro processor. One example is the procedure that interprets receiver status code bits in the driver for the LANCE Ethernet controller chip. Interpreting these bits is a little tricky because some of the error conditions are valid only when present in conjunction with certain other conditions. One could always use a deeply-nested if-then-else structure to separate out the cases. It would work and also be quite readable and maintainable. But
The first step went fast, taking two to three weeks. The reason is that most of the quajects do not need to run in kernel mode in order to work. The difference between Synthesis under Unix and native Synthesis is that instead of connecting the final-stage I/O quajects to I/O device driver quajects (which are the only quajects that must be in the kernel), we connect them to Unix read and write system calls on appropriately opened file descriptors. This is ultimate proof that Synthesis services can run at user level as well as in the kernel.
Porting to the raw machine was much harder, primarily because we chose to do our own device drivers. Some problems were caused by incomplete documentation on how to program the I/O devices on the Sony NEWS workstation. It was further complicated by the fact that each CPU has a different mapping of the I/O devices onto memory addresses, and not everything is accessible by both CPUs. A simple program was written to patch the running Unix kernel and install a new system call: "execute function in kernel mode." Using this utility (carefully!), we were able to examine the running kernel and discover a few key addresses. After a bit more poking around, we discovered how to alter the page mappings so that sections of kernel and I/O memory were directly mapped into all user address spaces (see footnote 2). (The mmap system call on /dev/mem did not work.) Then, using the Synthesis kernel monitor running on minimal Synthesis under a Unix process, we were able to "hand access" the remaining I/O devices to verify their addresses and operation.
The Synthesis kernel monitor is basically a C-language parser front-end with direct access to the kernel code generators. It was crucial to both the development and porting of Synthesis because it let us run and test sections of code without having the full kernel present. A typical debug cycle goes something like this: using the kernel monitor, we instantiate the quaject we want to test. We create a thread and point it at one of the quaject's callentries. We then single-step the thread and verify that the control flows where it is supposed to.
But the most difficult porting problems were caused by timing sensitivities in the various I/O devices. Some devices would "freeze" when accessed twice in rapid succession. These problems never showed up in the Unix code, because Unix encapsulates device access in procedures. Calling a procedure to read a status value or change a control register allows enough time for the device to "recover" from the previous operation. But with code synthesis, device access frequently consists of a single machine instruction. Often the same device is accessed twice in rapid succession by two consecutive instructions, causing the timing problem. Once the cause of the problem was found, it was easy to correct: I made the kernel code generator insert an appropriate number of "nop" instructions between consecutive accesses.

Footnote 2: Talk about security holes!
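In the code generator, the fix amounts to remembering whether the previously emitted instruction touched a device, along the lines of this sketch (every name here is invented for illustration):
void emit_device_access(struct CodeBuf *cb, unsigned long dev_reg)
{
    int i;
    if(cb->last_was_device)              // back-to-back device access?
        for(i = 0; i < RECOVERY_NOPS; i++)
            emit_nop(cb);                // give the device time to recover
    emit_move_to(cb, dev_reg);           // emit the actual access
    cb->last_was_device = 1;
}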
Once we had the minimal kernel running, getting the rest of the kernel and its associated libraries working was relatively easy. All of the code that did not involve the I/O devices ran without change. This includes the user-level shared runtime libraries, such as the C function library and the signal-processing library. It also includes all the "intermediate" quajects that do not directly access the machine and its I/O devices, such as buffers, symbol tables (for name service), and mappers and translators (for file system mapping). Code involving I/O devices was harder, since that required writing new drivers. Finally, there are some unfinished drivers, such as the SCSI disk driver.
The thread system needed some changes to support the two CPUs on the Sony workstation; these were discussed in Chapter 5. Most of the changes were in the scheduling and dispatching code, to synchronize between the processors. This involved developing efficient, lock-free data structures, which were then used to implement the algorithms. The scheduling policy was also changed from a single round-robin queue to one that uses a multiple-level queue structure. This helped guarantee good response time to urgent events even when there are many threads running, making it feasible to run thousands of threads on Synthesis.
The most time-consuming part was implementing the new services: virtual memory, the Ethernet driver, and the window system. They were all implemented from scratch, using all the performance-improving ideas discussed in this dissertation, such as kernel code generation. The measurements in this chapter show high performance gains in these areas as well. The Ethernet driver, for example, is fast enough to record all the packet traffic of a busy Ethernet (400 kilobytes/second, or about 4 megabits per second) into RAM using only 20% of a 25 MHz 68030 CPU's time. This is a problem that has been worked on and dismissed as impractical except when using special hardware.
Besides the Sony workstation, the new kernel runs on the Quamachine as well. Of course, each machine must use the appropriate I/O drivers, but all the new services added to the Sony version work on the Quamachine.
Support for faster context switching. A floating-point coprocessor adds a large amount of state that must be moved: 8 registers, each 96 bits long, require 24 memory cycles to save and another 24 cycles to re-load. Newer architectures, for example one that supports hardware matrix multiply, can have even more state. I claim that a lot of this state does not change between switch-in and switch-out. I propose hardware support to efficiently save and restore only the part of the state that was used: a modified-bit on each register, and selective disabling of hardware function units. Modified-bits on each register let the operating system save only those registers that have been changed since switch-in. Selective disabling of function units lets the operating system defer loading that unit's state until it is needed. If a functional unit goes unused between switch-in and the subsequent switch-out, its state will be neither loaded nor saved.
Faster byte-operations. Many I/O-related functions tend to be byte-oriented, whereas CPUs and memory tend to be word-oriented, meaning it costs no more to fetch a full 32-bit word than to fetch a byte. We can take advantage of this with two new instructions: "load-4-bytes" and "store-4-bytes." These would move a word from memory into four registers, one byte per register. The program can then operate on the four bytes in registers without referencing memory again.
Another suggestion, probably less useful, is a "carry-suppress" option for addition, to suppress carry-out at byte boundaries, allowing four additions or subtractions to take place simultaneously on four bytes packed into a 32-bit integer. I foresee the primary use of this to be in low-level graphics routines that deal with 8-bit pixels.
Improved bit-wise operation support. The current complement of bitwise-logical operations and shifts is already pretty good; what is lacking is a perfect shuffle of the bits in a register. This is very useful for bit-mapped graphics operations, particularly things like bit-matrix transpose, which is heavily used when unpacking byte-wide pixels into separate bit-planes, as certain framebuffer architectures require.
Readers may come to their own conclusions. In this section, I try to address some of the more frequently raised objections regarding Synthesis, and rebut those that are, in my opinion, ill-founded.
Objection 1: "How much of the performance improvement is due to my ideas, and how much is due to writing in assembler, and tuning the hell out of the thing?"
This is often asked by people who believe it to be much more of the latter and much less of the former.
Section 3.3 outlined several places in the kernel where code synthesis was used to advantage. For data movement operations, it showed that code synthesis achieves 1.4 to 2.4 times better performance than the best assembly-language implementation not using code synthesis. For more specialized operations, such as context switching, code synthesis delivers as much as 10 times better performance. So, as a terse answer to the question, I would say "40% to 140%."
But those figures do not tell the whole story. They are detailed measurements, designed to compare two versions of the same thing in the same execution environment. Missing from those measurements is a sense of how the interaction between larger pieces of a program changes when code synthesis is used. For example, in that same section, I show that a procedural implementation of "putchar" using code synthesis is slightly faster than the C-language "putchar" macro, which is in-line expanded into the user's code. The fact that enough savings could be had through code synthesis to more than amortize the cost of a procedure call, even in a simple, not-easily-optimized operation such as "putchar," changes the nature of how data is passed between modules in a program. Many modules that process streams of data are currently written to take as input a buffer of data and produce as output a new buffer of data. Chaining several such modules involves calling each one in turn, passing it the previous module's output buffer as the input. With a fast "putchar" procedure, it is no longer necessary to pass buffers and pointers around; we can now pass the address of the downstream module for "putchar," and the address of the upstream module for "getchar." Each module makes direct calls to its neighbors to get the data, eliminating the memory copy and all consequent pointer and counter manipulations.
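A sketch of the resulting structure (names are illustrative; in Synthesis the "self" pointer is implicit in the runtime-generated code rather than passed explicitly as it is here):
struct stage {
    void (*put)(struct stage *self, int c);   // downstream entry point
    struct stage *down;                       // next stage in the pipeline
};

void upcase_put(struct stage *self, int c)    // an example filter stage
{
    if(c >= 'a' && c <= 'z')
        c += 'A' - 'a';
    self->down->put(self->down, c);   // direct call: no buffer, no copy
}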
Objection 2: "Self-modifying data structures are troublesome on pipelined machines, and code generation has problems with machines that don't allow fine-grained control of the instruction cache. In other words, Synthesis techniques are dependent on hardware features that aren't present in all machines, and, worse, are becoming increasingly scarce."
Pipelined machines pose no special difficulties because Synthesis does not modify instructions ahead of the program counter. Code modification, when it happens, is restricted to patching just-executed code or unrelated code. In both cases, even a long instruction pipeline is not a problem.
The presence of a non-coherent and hard-to-flush instruction cache is the harder problem. By "hard-to-flush," I mean a cache that must be flushed whole instead of line-at-a-time, or one that cannot be flushed in user mode without taking a protection exception. Self-modifying code is still effective, but such a cache changes the breakeven point at which it becomes more economical to interpret data than to modify code. For example, conditions that change frequently are best represented using a boolean flag, as is usually done. But for conditions that are tested much more frequently than changed, code modification remains the method of choice. The cost of flushing the cache determines at what ratio of testing to modification the decision is made.
Relief may come from advances in the design of multiprocessors. Recent studies show that, for a wide variety of workloads, software-controlled caches are nearly as effective as fully coherent hardware caches and much easier to build, as they require no hardware [23] [2]. Further extensions to this idea stem from the observation that full coherency is often not necessary, and that it is beneficial to rely on the compiler to maintain coherency in software only when required [2]. This line of thinking leads to cache designs that have the necessary control to efficiently support code-modifying programs.
But it is true that the assumption that code is read-only is increasingly common, and that hardware designs rely on this assumption more and more. Hardware manufacturers design according to the needs of their market. Since nobody is doing runtime code generation, it is little wonder that it is not well supported. But then, isn't this what research is for? To open people's eyes and to point out possibilities, both new and overlooked. This dissertation points out certain techniques that increase performance. It happens that the techniques are unusual and make demands of the hardware that are not commonly made. But just as virtual memory proved to be a useful idea and all new processors now support memory management, one can expect that if Synthesis ideas prove to be useful, they too will be better supported.
Objection 3: "Does this matter? Hardware is getting faster, and anything that is slow today will probably be fast enough in two years."
Yes, it matters!
There is more to Synthesis than raw speed. Cutting the cost of services by a factor of 10 is the kind of change that can fundamentally alter the structure of those services. One example is the PLL-based process scheduling. You couldn't do that if context switch was expensive; driving the time well below one millisecond is what made it possible to move to a radically different scheduler, with nice properties besides speed.
For another example, I want to pose a question: if threads were as cheap as procedure calls, what would you do with them? One answer is found in the music synthesizer applications that run on Synthesis. Most of them create a new thread for every note! Driving the cost of threads to within a few factors of the cost of a procedure call changes the way applications are structured. The programmer now only needs to be concerned that the waveform is synthesized correctly. The Synthesis thread scheduler ensures that each thread gets enough CPU time to perform its job. You could not do that if threads were expensive.
Finally, hardware may be getting faster, but it is not getting faster fast enough. Look at the window-system figures given in Table 7.2. Synthesis running on 5-year-old hardware technology outperforms conventional systems running on the latest hardware. Even with faster hardware, it is not fast enough to overtake Synthesis.
Objection 4: "Why is Synthesis written in assembler? How much of the reason is that you wanted no extraneous instructions? How much of the reason is that code synthesis requires assembler? How much of Synthesis could be re-written in a high-level language?"
8
Conclusion
A dissertation is never finished.
You just stop writing.
-- Everyone with a Ph.D.
This dissertation has described Synthesis, a new operating system kernel that provides fundamental services an order of magnitude more efficiently than traditional operating systems.
Two options define the direction in which research of this nature may proceed. First, an existing system may be adopted as a platform upon which incremental development may take place. Studying a piece of an existing system limits the scope of the work, ensures that one is never far from a functioning system that can be measured to guide development, and secures a preexisting base of users upon completion. On the down side, such an approach may limit the amount of innovation and creativity brought to the process and possibly carry along preexisting biases built in by the originators of the environment, reducing the impact the research might have on improving overall performance.
Alternatively, one can start anew. Such an effort removes the burden of preexisting decisions and tradeoffs, and allows use of knowledge and hindsight acquired from past systems to avoid making the same mistakes. The danger, however, is of making new, possibly fatal mistakes. In addition, so much valuable time can be spent building up the base and re-inventing wheels that little innovation takes place.
I have chosen the second direction. I felt that the potential benefits of an important breakthrough far outweighed the dangers of failure. Happily, I believe that a positive outcome may be reported. I would like to summarize both the major contributions and shortcomings of this effort.
A basic assumption of this research effort has been that low overhead and low latency are important properties of an operating system. Supporting this notion is the prediction that as distributed computing becomes ubiquitous, responsiveness and overall performance will suffer at the hands of the high overhead and latency of current systems. Advances in networking technology, impressive as they are, will bear little fruit unless operating systems software is efficient enough to make full use of the higher bandwidths. Emerging application areas such as interactive sound, video, and the future panoply of interface technologies subsumed under the umbrella of "multi-media" place strict timing requirements on operating system services, requirements that existing systems have difficulty meeting, in part because of their high overhead and lack of real-time support.
The current leading suggestion to address the performance problems is to move function out of the kernel, thus avoiding crossing the kernel boundary and allowing customization of traditional kernel services to individual applications. Synthesis shows that it is not necessary to accept that kernel services will be slow and to find work-arounds for them, but rather that it is possible to provide very efficient kernel services. This is important, because ultimately communications with the outside world must still go through the kernel.
With real-time support and an overhead factor ten times less than that of other systems, Synthesis may be considered a resounding success. Four key performance-improving dynamics differentiate Synthesis:
- Large-scale use of run-time code generation.
- Quaject-oriented kernel structure.
- Lock-free synchronization.
- Feedback-based process scheduling.
Synthesis constitutes the first large-scale use of run-time code generation to specifically improve operating system performance. Chapter 3 demonstrates that common operating system functions run five times faster when implemented using runtime-generated code than with a typical C language implementation, and nearly ten times faster when compared with the standard Unix implementation. The use of run-time code generation not only improves the performance of existing services, but allows for the addition of new services without incremental systems overhead.
Further differentiating Synthesis is its novel kernel structure, based around "quajects," which form the building blocks of all kernel services. In many respects, quajects resemble the objects of traditional object-oriented programming, including data encapsulation and abstraction. Quajects differ, however, in four important ways:
- A procedural rather than message-based interface.
- Explicit declaration of exceptions and external calls.
- Runtime binding of the external calls.
- Implementation using runtime code generation.
By making explicit the quaject's exceptions and external calls, the kernel may dynamically
link quajects. Rather than providing services monolithically, Synthesis builds them through
the use of one or more quajects eventually comprising the user's thread. This binding takes
place dynamically, at runtime, allowing for both the expansion of existing services and for an
enhanced capability for creating new ones. The traditional distinction between kernel and
user services becomes blurred, allowing for applications' direct participation in the delivery
of services. This is possible because a quaject's interface is extensible across the protection
boundaries which divide applications from the kernel and from each other. Such an approach
enjoys a further advantage: preserving the partitioning and modularity found in traditional
systems centered around user-level servers, while bettering the higher performance levels of
the monolithic kernels which, while fast, are often dicult to understand and modify.
The code-generation implementation and procedural interface of quajects enhance
performance by reducing argument passing and enabling in-line expansion of called
quajects into their callers at runtime. Quaject callentries, for example, require no
"self" parameter, since it is implicit in their runtime-generated code. Quajects thus show
that a highly efficient object-based system is possible.
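The fragment below sketches in plain C what a quaject's interface looks like: procedural
callentries operating on encapsulated state, with exception callouts (here q_full and
q_empty) left unbound until whoever composes the quaject binds them at runtime. The
names and the use of C function pointers are illustrative assumptions; the kernel realizes
the same structure with runtime-generated code, which is also why the explicit self
argument below disappears in Synthesis.

#include <stdio.h>

typedef struct queue_q queue_q;
struct queue_q {
    int buf[8];
    int head, tail;
    /* callouts: left unbound here; the composer fills them in at runtime */
    void (*q_full)(queue_q *);
    void (*q_empty)(queue_q *);
};

/* callentry: in Synthesis the "self" argument is implicit, because the
 * runtime-generated code has this quaject's address built in */
static void q_put(queue_q *q, int v)
{
    int next = (q->head + 1) % 8;
    if (next == q->tail) { q->q_full(q); return; }   /* raise the exception */
    q->buf[q->head] = v;
    q->head = next;
}

static int q_get(queue_q *q)
{
    if (q->tail == q->head) { q->q_empty(q); return -1; }
    int v = q->buf[q->tail];
    q->tail = (q->tail + 1) % 8;
    return v;
}

static void on_empty(queue_q *q) { (void)q; puts("queue empty"); }
static void on_full(queue_q *q)  { (void)q; puts("queue full"); }

int main(void)
{
    queue_q q = { .q_full = on_full, .q_empty = on_empty };  /* runtime binding */
    q_put(&q, 42);
    printf("%d\n", q_get(&q));
    q_get(&q);                                /* triggers the q_empty callout */
    return 0;
}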
or structuring things in a different way will yield even greater benefits.
Unfortunately, there is no high-level language available that makes programs using
runtime code generation easy to write and, at the same time, portable. Aside from the
obvious benefit of making the technique more accessible to all members of the profession,
a better understanding of the benefits of runtime code generation will surely accrue from
developing such a language.
An interesting direction to explore is extending the ideas of runtime code generation
to runtime-reconfigurable hardware. Chips now exist whose function is "programmed"
by downloading strings of bits that configure their internal logic blocks and the
routing of signals between blocks. Although such programmable gate array (PGA) chips
are generally programmed once, upon initialization, they could be reprogrammed at other
times, optimizing the hardware as the environment changes. Some PGAs could be set aside
for computational purposes: functions such as permuting bit vectors can be implemented
much more efficiently in PGA hardware than in software. Memory operations, such as a
fast memory-zero or a fast page copy, could be implemented to operate asynchronously
with the main processor. As-yet-unanticipated functions could be configured as research
identifies the need. One can envisage a machine architecture with no I/O device controllers
at all, just a large array of PGA chips wired to the processor and to various forms of I/O
connectors. The types of I/O devices the machine supports would then be a function of
the bit patterns loaded into its PGAs rather than of the boards plugged into its backplane.
This is highly advantageous: as new devices need to be supported, there is no need for
new boards and the attendant expense and delay of acquiring them.
Currently, under Synthesis, users cannot define their own services. Quaject
composition is a powerful mechanism for defining and implementing kernel services, but this
power has not yet been made accessible to the end user. At present, every service exists
because somewhere in the kernel there is a piece of code that knows which quajects to
create and how to link them in order to provide that service. It would be better if this were
not hard-coded into the kernel but made user-accessible via some sort of service description
language. To support such a language, the quaject type system would need to be extended
to provide runtime type checking, which is currently lacking.
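As a purely hypothetical illustration of the hard-coded composition just described (not of
the proposed description language, which does not exist), the sketch below shows the shape
of kernel code that creates quajects and wires callouts to callentries. Every name in it is
invented for this example.

#include <stdio.h>

/* Invented stand-ins: a real quaject is runtime-generated code, not a
 * struct, and these helper functions do not exist in Synthesis. */
struct quaject { const char *kind; };

static struct quaject *quaject_create(const char *kind)
{
    static struct quaject pool[8];
    static int used;
    pool[used].kind = kind;
    return &pool[used++];
}

static void q_bind(struct quaject *from, const char *callout,
                   struct quaject *to, const char *callentry)
{
    /* In the kernel this step instantiates code; here it only logs. */
    printf("bind %s.%s -> %s.%s\n", from->kind, callout, to->kind, callentry);
}

/* The shape of a hard-coded service: create the pieces, wire them up. */
int main(void)
{
    struct quaject *thread = quaject_create("thread");
    struct quaject *queue  = quaject_create("queue");
    struct quaject *editor = quaject_create("line-editor");

    q_bind(editor, "char-out", queue, "put");   /* editor feeds the queue */
    q_bind(thread, "read",     queue, "get");   /* thread reads the queue */
    return 0;
}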
Another open question concerns the generality of lock-free synchronization. Lock-free
synchronization has many desirable properties, as discussed in this dissertation. Synthesis
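A minimal sketch of the optimistic update pattern at the heart of this approach, written
with C11 atomics rather than the 68030 compare-and-swap instructions the kernel uses;
the node layout and function names are invented for illustration, and the version-count
pairing that Synthesis performs with the double-word CAS2 instruction to avoid the ABA
problem is omitted here.

#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next;
    int value;
};

static _Atomic(struct node *) head = NULL;

/* Optimistic insert: read the head, prepare the new link, and commit
 * with compare-and-swap; on interference, retry.  No thread ever
 * blocks holding a lock. */
static void push(struct node *n)
{
    struct node *old = atomic_load(&head);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&head, &old, n));
}

static struct node *pop(void)
{
    struct node *old = atomic_load(&head);
    /* A failed compare-exchange refreshes 'old', so the loop retries
     * with current state.  A production version must also handle the
     * ABA problem, which this sketch deliberately omits. */
    while (old != NULL &&
           !atomic_compare_exchange_weak(&head, &old, old->next))
        ;
    return old;
}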
Bibliography
[1] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young.
Mach: A New Kernel Foundation for Unix Development.
In Proceedings of the 1986 Usenix Conference, pages 93–112. Usenix Association, 1986.
[2] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, and Mary K. Vernon.
Comparison of Hardware and Software Cache Coherence Schemes.
In The 18th Annual International Symposium on Computer Architecture, volume 19,
pages 298–308, 1991.
[3] T.E. Anderson, B.N. Bershad, E.D. Lazowska, and H.M. Levy.
Scheduler Activations: Effective Kernel Support for the User-Level Management of
Parallelism.
In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages
95–109, Pacific Grove, CA, October 1991. ACM.
[4] James Arleth.
A 68010 multiuser development system.
Master's thesis, The Cooper Union for the Advancement of Science and Art, New York
City, 1984.
[5] Brian N. Bershad, Edward D. Lazowska, Henry M. Levy, and David B. Wagner.
An Open Environment for Building Parallel Programming Systems.
In Symposium on Parallel Programming: Experience with Applications, Languages and
Systems, pages 1–9, New Haven, Connecticut (USA), July 1988. ACM SIGPLAN.
[6] A. Black, N. Hutchinson, E. Jul, and H. Levy.
Object Structure in the Emerald System.
In Proceedings of the First Annual Conference on Object-Oriented Programming, Sys-
tems, Languages, and Applications, pages 78–86. ACM, September 1986.
[7] D.L. Black.
Scheduling Support for Concurrency and Parallelism in the Mach Operating System.
IEEE Computer, 23(5):35–43, May 1990.
[8] Min-Ih Chen and Kwei-Jay Lin.
A Priority Ceiling Protocol for Multiple-Instance Resources.
In IEEE Real-Time Systems Symposium, San Antonio, TX, December 1991.
[9] David Cheriton.
An Experiment Using Registers for Fast Message-Based Interprocess Communication.
[21] Motorola.
MC68030 User's Manual.
Prentice Hall, Englewood Cliffs, NJ, 07632, 1989.
[22] J. Ousterhout.
Why Aren't Operating Systems Getting Faster as Fast as Hardware?
In USENIX Summer Conference, pages 247–256, Anaheim, CA, June 1990.
[23] Susan Owicki and Anant Agarwal.
Evaluating the Performance of Software Cache Coherence.
In Proceedings of the 3rd Symposium on Programming Languages and Operating Sys-
tems. ACM, 1989.
[24] R. Pike, D. Presotto, K. Thompson, and H. Trickey.
Plan 9 from Bell Labs.
Technical Report CSTR #158, AT&T Bell Labs, 1991.
[25] C. Pu, H. Massalin, and J. Ioannidis.
The Synthesis Kernel.
Computing Systems, 1(1):11–32, Winter 1988.
[26] J.S. Quarterman, A. Silberschatz, and J.L. Peterson.
4.2BSD and 4.3BSD as Examples of the Unix System.
ACM Computing Surveys, 17(4):379–418, December 1985.
[27] D. Ritchie.
A Stream Input-Output System.
AT&T Bell Laboratories Technical Journal, 63(8):1897–1910, October 1984.
[28] D.M. Ritchie and K. Thompson.
The Unix Time-Sharing System.
Communications of the ACM, 17(7):365–375, July 1974.
[29] J.A. Stankovic.
Misconceptions About Real-Time Computing: A Serious Problem for Next-Generation
Systems.
IEEE Computer, 21(10):10–19, October 1988.
[30] M. Stonebraker.
Operating System Support for Database Management.
Communications of the ACM, 24(7):412–418, July 1981.
[31] Sun Microsystems Incorporated, 2550 Garcia Avenue, Mountain View, California
94043, 415-960-1300.
SunOS Reference Manual, May 1988.
[32] Peter Wegner.
Dimensions of Object-Based Language Design.
In Norman Meyrowitz, editor, Proceedings of the OOPSLA'87 Conference, pages 168–
182, Orlando, FL (USA), 1987. ACM.
[33] Mark Weiser, Alan Demers, and Carl Hauser.
The Portable Common Runtime Approach to Interoperability.
Appendix A
Unix Emulator Test Programs
#include <stdio.h>

#define N 500000
int x[N];

void g(void);

/* Test 1: compute-bound loop.  Fills x[] five times with a
 * self-referential integer sequence, exercising raw CPU and memory
 * performance with no system calls in the inner loop. */
int main(void)
{
    int i;
    for (i = 5; i--; )
        g();
    printf("%d\n%d\n", x[N-2], x[N-1]);
    return 0;
}

void g(void)
{
    int i;
    x[0] = x[1] = 1;
    for (i = 2; i < N; i++)
        x[i] = x[i - x[i-1]] + x[i - x[i-2]];
}
#include <sys/file.h>
#include <fcntl.h>      /* open() prototype and O_RDONLY on modern systems */
#include <unistd.h>     /* close() prototype */

#define Test_dev "/dev/null"    /* or /dev/tty */

/* Test 2: system-call overhead.  Opens and closes the test device
 * 10,000 times; nearly all the time goes to kernel entry and exit. */
int main(void)
{
    int f, i;
    for (i = 10000; i--; ) {
        f = open(Test_dev, O_RDONLY);
        close(f);
    }
    return 0;
}
#include <sys/file.h>
#include <fcntl.h>      /* open() and its flags on modern systems */
#include <unistd.h>     /* read(), write(), lseek(), close() */

#define N 1024
char x[N];

/* Test 3: file read/write throughput.  Each of the 1,000 passes
 * rewinds the file, writes 10 KB, rewinds again, and reads it back.
 * L_SET is the historical name for SEEK_SET. */
int main(void)
{
    int f, i, j;
    f = open("file", O_RDWR | O_CREAT | O_TRUNC, 0666);
    for (j = 1000; j--; ) {
        lseek(f, 0L, L_SET);
        for (i = 10; i--; )
            write(f, x, N);
        lseek(f, 0L, L_SET);
        for (i = 10; i--; )
            read(f, x, N);
    }
    close(f);
    return 0;
}