Grace: Safe Multithreading
Emery D. Berger
Ting Yang
Tongping Liu
Gene Novark
Abstract
The shift from single to multiple core architectures means
that programmers must write concurrent, multithreaded programs in order to increase application performance. Unfortunately, multithreaded applications are susceptible to numerous errors, including deadlocks, race conditions, atomicity
violations, and order violations. These errors are notoriously
difficult for programmers to debug.
This paper presents Grace, a software-only runtime system that eliminates concurrency errors for a class of multithreaded programs: those based on fork-join parallelism.
By turning threads into processes, leveraging virtual memory protection, and imposing a sequential commit protocol, Grace provides programmers with the appearance of
deterministic, sequential execution, while taking advantage
of available processing cores to run code concurrently and
efficiently. Experimental results demonstrate Graces effectiveness: with modest code changes across a suite of
computationally-intensive benchmarks (116 lines), Grace
can achieve high scalability and performance while preventing concurrency errors.
Categories and Subject Descriptors D.1.3 [Software]:
Concurrent Programming–Parallel Programming; D.2.0
[Software Engineering]: Protection mechanisms
General Terms Performance, Reliability
Keywords Concurrency, determinism, deterministic concurrency, fork-join, sequential semantics
1. Introduction
OOPSLA 2009, October 25–29, 2009, Orlando, Florida, USA.
Copyright © 2009 ACM 978-1-60558-734-9/09/10…$10.00
Power dissipation and energy consumption now limit the ability of hardware manufacturers to speed up chips by increasing their clock rate.
This phenomenon has led to a major shift in computer architecture, where single-core CPUs have been replaced by
CPUs consisting of a number of processing cores.
The implication of this switch is that the performance of
sequential applications is no longer increasing with each new
generation of processors, because the individual processing
components are not getting faster. On the other hand, applications rewritten to use multiple threads can take advantage
of these available computing resources to increase their performance by executing their computations in parallel across
multiple CPUs.
Unfortunately, writing multithreaded programs is challenging. Concurrent multithreaded applications are susceptible to a wide range of errors that are notoriously difficult
to debug [29]. For example, multithreaded programs that fail
to employ a canonical locking order can deadlock [16]. Because the interleavings of threads are non-deterministic, programs that do not properly lock shared data structures can
suffer from race conditions [30]. A related problem is atomicity violations, where programs may lock and unlock individual objects but fail to ensure the atomicity of multiple
object updates [14]. Another class of concurrency errors is
order violations, where a program depends on a sequence of
threads that the scheduler may not provide [26].
This paper introduces Grace, a runtime system that eliminates concurrency errors for a particular class of multithreaded programs: those that employ fully-structured, or
fork-join based parallelism to increase performance.
While fork-join parallelism does not capture all possible parallel programs, it is a popular model of parallel program execution: systems based primarily on fork-join
parallelism include Cilk, Intel's Threading Building
Blocks [35], OpenMP, and the fork-join framework proposed for Java [24]. Perhaps the most prominent use of fork-join parallelism today is in Google's Map-Reduce framework, a library that is used to implement a number of Google
services [9, 34]. However, none of these prevent concurrency errors.
2. Sequential Semantics
Concurrency Error    | Cause                                  | Prevention by Grace
Deadlock             | cyclic lock acquisition                | locks converted to no-ops
Race condition       | unguarded updates                      | all updates committed deterministically
Atomicity violation  | unguarded, interleaved updates         | threads run atomically
Order violation      | threads scheduled in unexpected order  | threads execute in program order

Table 1. The concurrency errors that Grace addresses, their causes, and how Grace eliminates them.
Because the executions of f(x) and g(y) are
not interleaved and execute deterministically, atomicity violations or race conditions are impossible. Similarly, the ordering of execution of these functions is fixed, so there cannot be order violations. Finally, a sequential program does
not need locks, so eliding them prevents deadlock.
2.1 Programming Model
3. Processes as Threads
Grace achieves concurrent speedup of multithreaded programs by executing threads speculatively, then committing
their updates in program order (see Section 4). A key challenge is how to enable low-overhead thread speculation in
C/C++.
One possible candidate would be some form of transactional memory [17, 36]. Unfortunately, no existing or proposed transactional memory system provides all of the features that Grace requires:

- full compatibility with C and C++ and commodity hardware,
- full support for long-lived transactions,
- management, and
- extremely low runtime and space overhead.
Figure 3. An overview of execution in Grace. Processes emulate threads (Section 3.1) with private mappings to mmapped files
that hold committed pages and version numbers for globals and the heap (Sections 3.2 and 3.3). Threads run concurrently but
are committed in sequential order: each thread waits until its logical predecessor has terminated in order to preserve sequential
semantics (Section 4). Grace then compares the version numbers of the read pages to the committed versions. If they match,
Grace commits the writes and increments version numbers; otherwise, it discards the pages and rolls back.
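The version-number check described in the caption can be sketched as follows. This is a hypothetical, simplified sketch, not Grace's actual code: the names (`speculation_t`, `on_first_read`, `try_commit`) are illustrative. Each thread records the version of every page it reads; a commit succeeds only if those versions are unchanged, and then bumps the versions of the pages it wrote.

```c
#include <stdbool.h>

#define NPAGES 4

// One version number per page of committed state.
static int committed_version[NPAGES];

// Per-thread speculative bookkeeping: the version observed on the
// first read of each page, and which pages were read or written.
typedef struct {
    int  read_version[NPAGES];
    bool read[NPAGES];
    bool written[NPAGES];
} speculation_t;

// Record the version at first read, as Grace does at page granularity.
void on_first_read(speculation_t *s, int page) {
    s->read[page] = true;
    s->read_version[page] = committed_version[page];
}

// Validate-and-commit: succeed only if every page we read still has
// the version we observed; on success, bump the versions of written
// pages. On failure, the caller discards its pages and rolls back.
bool try_commit(speculation_t *s) {
    for (int p = 0; p < NPAGES; p++)
        if (s->read[p] && s->read_version[p] != committed_version[p])
            return false;   // page committed by a predecessor: roll back
    for (int p = 0; p < NPAGES; p++)
        if (s->written[p])
            committed_version[p]++;
    return true;
}
```

Because threads commit in program order, the first thread's `try_commit` always succeeds; a later thread that read a since-updated page fails validation and re-executes.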
Instead of spawning new threads, Grace forks off new processes. Because each thread is in fact a separate process,
it is possible to use standard memory protection functions
and signal handlers to track reads and writes to memory.
Grace tracks accesses to memory at a page granularity, trading imprecision of object tracking for speed. Crucially, because only the first read or write to each page needs to be
tracked, all subsequent operations proceed at full speed.
To create the illusion that these processes are executing
in a shared address space, Grace uses memory mapped files
to share the heap and globals across processes. Each process has two mappings to the heap and globals: a shared
mapping that reflects the latest committed state, and a local (per-process), copy-on-write mapping that each process
uses directly. In addition, Grace establishes a shared and local map of an array of version numbers. Grace uses these
version numbers, one for each page in the heap and global
area, to decide when it is safe to commit updates.
3.2 Globals
Grace uses a fixed-size file to hold the globals, which it locates in the program image through linker-defined variables.
In ELF executables, the symbol end indicates the first address after uninitialized global data. Grace uses an ld-based
linker script to identify the area that indicates the start of
the global data. In addition, this linker script instructs the
linker to page align and separate read-only and global areas
of memory. This separation reduces the risk of false sharing
by ensuring that writes to a global object never conflict with
reads of read-only data.
3.3 Heap Organization
Grace's heap organization elegantly solves the problem of rolling back memory allocations. Grace rolls back memory allocations just as it rolls
back any other updates to heap data. Any conflict causes the
heap to revert to an earlier version.
However, a naïve implementation of the allocator would
give rise to an unacceptably large number of conflicts: any
threads that perform memory allocations would conflict. For
example, consider a basic freelist-based allocator. Any allocation or deallocation updates a freelist pointer. Thus, any
time two threads both invoke malloc or free on the same-sized object, one thread will be forced to roll back because
both threads are updating the page holding that pointer.
To avoid this problem of inadvertent rollbacks, Grace
uses a scalable per-thread heap organization that is loosely
based on Hoard [3] and built with Heap Layers [4]. Grace
divides the heap into a fixed number of sub-heaps (currently
16). Each thread uses a hash of its process id to obtain the
index of the heap it uses for all memory operations (malloc
and free).
This isolation of each thread's memory operations from
the others allows threads to operate independently most
of the time. Each sub-heap is initially seeded with a page-aligned 64K chunk of memory. As long as a thread does
not exhaust its own sub-heap's pool of memory, it will operate independently from any other sub-heap. If it runs out
of memory, it obtains another 64K chunk from the global
allocator. This allocation only causes a conflict with another
thread if that thread also runs out of memory during the same
period of time.
This allocation strategy has two benefits. First, it minimizes the number of false conflicts created by allocations
from the main heap. Second, it avoids an important source
of false sharing. Because each thread uses different pages to
satisfy object allocation requests, objects allocated by one
thread are unlikely to be on the same pages as objects allocated by other threads.
3.4 Thread Execution
Completion
At the end of each atomically-executed region (the end
of main() or an individual thread, right before a thread
spawn, and right before joining another thread), Grace invokes atomicEnd (Figure 5), which attempts to commit
all updates by calling atomicCommit (Figure 6). It first
4. Sequential Commit
bool atomicCommit (void) {
  // If we haven't read or written anything,
  // we don't have to wait or commit;
  // update local view of memory & return.
  if (heap.nop() && globals.nop()) {
    heap.updateAll();
    globals.updateAll();
    return true;
  }
  // Wait for immediate predecessor
  // to complete.
  waitExited (predecessor);
  // Now try to commit state. Iff we succeed,
  // return true.
  // Lock to make check & commit atomic.
  lock();
  bool committed = false;
  // Ensure heap and globals consistent.
  if (heap.consistent() &&
      globals.consistent()) {
    // OK, all consistent: commit.
    heap.commit();
    globals.commit();
    xio.commit(); // commits buffered I/O
    committed = true;
  }
  unlock();
  return committed;
}
We evaluate Grace's performance on real computation kernels with a range of benchmarks, listed in Table 2. One
benchmark, matmul, a recursive matrix-matrix multiply routine, comes from the Cilk distribution. We hand-translated this program to use the pthreads API (essentially replacing Cilk calls like spawn with their pthreads counterparts). We performed the same translation for the remaining Cilk benchmarks, but because they use unusually fine-grained threads, none of them scaled when using
pthreads.
The remaining benchmarks are from the Phoenix benchmark suite [34]. These benchmarks represent kernel computations and were designed to be representative of computeintensive tasks from a range of domains, including enterprise
computing, artificial intelligence, and image processing. We
use the pthreads-based variants of these benchmarks with
the largest available inputs.
In addition to describing the benchmarks, Table 2 also
presents detailed benchmark characteristics measured from
their execution with Grace, including the total number of
commits and rollbacks, together with the average number
of pages read and written and average wall-clock time per
atomic region. With the exception of matmul and kmeans,
the benchmarks read and write from relatively few pages in
each atomic region. matmul has a coarse grain size and
large footprint, but has no interference between threads due
to the regular structure of its recursive decomposition. On
the other hand, kmeans has a benign race which forces
Grace to trigger numerous rollbacks (see Section 6.1).
4.1 Transactional I/O
Grace's commit protocol not only enforces sequential semantics but also has an additional important benefit. Because
Grace imposes an order on thread commits, there is always
one thread running that is guaranteed to be able to commit its
state: the earliest thread in program order. This property ensures that Grace programs cannot suffer from livelock caused
by a failure of any thread to make progress, a problem with
some transactional memory systems.
This fact allows Grace to overcome an even more important limitation of most proposed transactional memory systems: it enables the execution of I/O operations in a system
with optimistic concurrency. Because some I/O operations
are irrevocable (e.g., network reads after writes), most I/O
operations appear to be fundamentally at odds with speculative execution. The usual approach is to ban I/O from speculative execution, or to arbitrarily pick a winner to obtain
a global lock prior to executing its I/O operations.
In Grace, each thread buffers its I/O operations and commits them at the same time it commits its updates to memory,
as shown in Figure 6. However, if a thread attempts to execute an irrevocable I/O operation, Grace forces it to wait for
5. Methodology
5.1.1 CPU-Intensive Benchmarks
Benchmark         | Description                               | Commits | Rollbacks | Pages read | Pages written | Time per region
histogram         | Analyzes images' RGB components           | 9       | 0         | 7.3        | 5.9           | 1512.3
kmeans            | Iterative clustering of 3-D points        | 6273    | 4887      | 404.5      | 2.3           | 8.7
linear regression | Computes best fit line for set of points  | 9       | 0         | 5.6        | 4.8           | 1024.0
matmul            | Recursive matrix-multiply                 | 11      | 0         | 4100       | 1865          | 2359.4
pca               | Principal component analysis on matrix    | 22      | 0         | 3.1        | 2.2           | 0.204
string match      | Searches file for encrypted word          | 11      | 0         | 5.9        | 4.3           | 191.1

Table 2. CPU-intensive multithreaded benchmark suite and detailed characteristics (see Section 5.1).
Figure 9. Performance of multithreaded benchmarks running with pthreads and Grace on an 8-core system (higher
is better). Grace generally performs nearly as well as the
pthreads version while ensuring the absence of concurrency errors.
underlying application. The reordering or modification involved a small number of lines of code (116).
6. Evaluation

6.1 Real Applications
Figure 10. Impact of thread running time on performance: (a) speedup over a sequential version (higher is better), (b)
normalized execution time with respect to pthreads (lower is better).
Figure 11. Impact of per-thread footprint on performance: (a) speedup over a sequential version (higher is better), (b)
normalized execution time with respect to pthreads (lower is better).
While the kmeans benchmark achieves a modest speedup
with pthreads (3.65X), it exhibits no speedup with Grace
(1.02X), which serializes execution. This benchmark iteratively clusters points in 3D space. Until no further
modifications occur, kmeans repeatedly spawns threads to find clusters (setting a cluster id for each point), and then spawns threads to
compute and store mean values in a shared array. It would be
straightforward to eliminate all rollbacks for the first threads
by simply rounding up the number of points assigned to
each thread, allowing each thread to work on independent
regions of memory. However, kmeans does not protect accesses or updates to the mean value array and instead uses
benign races as a performance optimization. Grace has no
way of knowing that these races are benign and serializes its
execution to prevent the races.
6.2 Application Characteristics
threads (16), len is the thread running time, and nIter is the
number of iterations.
Figure 10 shows the effect of thread running time on
performance. Because we expected the higher cost of thread
spawns to degrade Grace's performance relative to pthreads,
we were surprised to observe the opposite effect. We discovered
that the operating system's scheduling policy plays an important role in this set of experiments.
When the size of each thread is extremely small, neither
Grace nor pthreads makes effective use of available CPUs.
In both cases, the processes/threads finish so quickly that the
load balancer is not triggered and so does not run them on
different CPUs. As the thread running time becomes larger,
Grace tends to make better use of CPU resources, sometimes up
to 20% faster. We believe this is because the Linux CPU
scheduler attempts to put threads from the same process on
one CPU to exploit cache locality, which limits its ability to
use more CPUs, but is more liberal in its placement of processes across CPUs. However, once thread running time becomes large enough (over 50ms) for the load balancer to take
effect, both Grace and pthreads scale well. Figure 10(b)
shows that Grace has competitive performance compared to
pthreads, and the overhead of process creation is never
larger than 2%.
Footprint: In order to evaluate the impact of per-thread
footprint, we extend the previous benchmark so that each
thread also writes a value onto a number of private pages,
which only exercises Graces page protection mechanism
without triggering rollbacks. We conduct an extensive set of
tests, varying the thread footprint from 1 page to 1024 pages
(4MB). This experiment is the worst-case scenario for Grace,
since each write triggers two page faults.
Figure 11 summarizes the effect of thread footprint over
three representative thread running time settings: small
(10ms), medium (50ms) and large (200ms). When the thread
footprint is not too large (at most 64 pages), Grace has comparable performance to pthreads, with no more than a 5%
slowdown. As the thread footprint continues to grow, the
performance of Grace starts to degrade due to the overhead
of page protection faults. However, even when each thread
dirties one megabyte of memory (256 pages), Grace's performance is within an acceptable range for the medium and
large thread runtime settings. The overhead of page protection faults only becomes prohibitively large when the thread
footprint is large relative to the running time, which is unlikely to be representative of compute-intensive threads.
Conflict rate: We next measure the impact of conflicting
updates on Grace's performance by having each thread in the
microbenchmark update a global variable with a given probability, with the result that any other thread reading or writing that variable will need to roll back and re-execute. Grace
makes progress even with a 100% likelihood of conflicts because its sequential semantics provide a progress guarantee:
the first thread in commit order is guaranteed to succeed.
Figure 12. Impact of conflict rate (the likelihood of conflicting updates, which force rollbacks), versus a pthreads
baseline that never rolls back (higher is better).
6.3 Concurrency Errors
Bug type            | Benchmark description
deadlock            | Cyclic lock acquisition
race condition      | Race condition example, Lucia et al. [27]
atomicity violation | Atomicity violation from MySQL [26]
order violation     | Order violation from Mozilla 0.8 [25]
// Deadlock.
thread1 () {
  lock (A);
  // usleep();
  lock (B);
  // ...do something
  unlock (B);
  unlock (A);
}
thread2 () {
  lock (B);
  // usleep();
  lock (A);
  // ...do something
  unlock (A);
  unlock (B);
}
6.3.1 Deadlocks

6.3.2 Race Conditions
// Race condition.
int counter = 0;
increment() {
  print (counter);
  int temp = counter;
  temp++;
  // usleep();
  counter = temp;
  print (counter);
}
thread1() { increment(); }
thread2() { increment(); }
Figure 14. Race condition example: the race is on the variable counter, where the first update can be lost. Under
Grace, both increments always succeed.
// Atomicity violation.
// thread1
S1: if (thd->proc_info) {
      // usleep();
S2:   fputs (thd->proc_info, ..);
    }
// thread2
S3: thd->proc_info = NULL;
6.3.3 Atomicity Violations

6.3.4 Order Violations
// Order violation.
char * proc_info;
thread1() {
  // ...
  // usleep();
  proc_info = malloc(256);
}
thread2() {
  // ...
  strcpy(proc_info, "abc");
}
main() {
  spawn thread1();
  spawn thread2();
}

// Order violation.
int foo;
thread1() {
  foo = 0;
}
main() {
S1: spawn thread1();
    // usleep();
S2: foo = 1;
    // ...
    assert (foo == 0);
}
7. Related Work

7.1 Transactional Memory
when using lock-based synchronization. With Grace, program semantics are straightforward and unsurprising.
Welc et al. introduce support for irrevocable transactions
in the McRT-STM system for Java [40]. Like Grace, their
system supports one active irrevocable transaction at a time.
McRT-STM relies on a lock mechanism combined with
compiler-introduced read and write barriers, while Grace's
support for I/O falls out for free from its commit protocol.
The McRT system for C++ also includes a malloc implementation called McRT-malloc, which resembles Hoard [3]
but is extended to support transactions [19]. Ni et al. present
the design and implementation of a transactional extension
to C++ that enables transactional use of the system memory
allocator by wrapping all memory management functions
and providing custom commit and undo actions [31]. These
approaches differ substantially from Grace's memory allocator, which employs a far simpler design that leverages the
fact that in Grace, all code, including malloc and free,
executes transactionally. Grace also takes several additional
steps that reduce the risk of false sharing.
7.2

Automatic mutual exclusion, or AME, is a recently proposed programming model developed at Microsoft Research Cambridge. It is a language extension to C# that assumes that all shared state is private unless otherwise indicated [20]. These guarantees are weaker than Grace's, in that
AME programmers can still generate code with concurrency
errors. AME has a richer concurrent programming model
than Grace that makes it more flexible, but its substantially
more complex semantics preclude a sequential interpretation [1]. By contrast, Grace's semantics are straightforward
and thus likely easier for programmers to understand.
von Praun et al. present Implicit Parallelism with Ordered
Transactions (IPOT), a programming model that, like Grace, supports speculative concurrency and enforces determinism [38]. However, unlike Grace, IPOT requires a completely new programming language, with a wide
range of constructs including variable type annotations and
constructs to support speculative and explicit parallelism. In
addition, IPOT would require special hardware and compiler
support, while Grace operates on existing C/C++ programs
that use standard thread constructs.
Welc et al. present a future-based model for Java programming that, like Grace, is safe [39]. A future denotes
an expression that may be evaluated in parallel with the rest
of the program; when the program uses the expression's
value, it waits for the future to complete execution before
continuing. As with Grace's threads, safe futures ensure that
the concurrent execution of futures provides the same effect
as evaluating the expressions sequentially. However, the safe
future system assumes that writes are rare in futures (by contrast with threads), and uses an object-based versioning system optimized for this case. It also requires compiler support
and currently requires integration with a garbage-collected
environment, making it generally unsuitable for use with
C/C++.
Grace's use of virtual memory primitives to support speculation is a superset of the approach used by behavior-oriented parallelism (BOP) [12]. BOP allows programmers
to specify possibly parallelizable regions of code in sequential programs, and uses a combination of compiler analysis
and the strong isolation properties of processes to ensure that
speculative execution never prevents a correct execution.
While BOP seeks to increase the performance of sequential code by enabling safe, speculative parallelism, Grace
provides sequential semantics for concurrently-executing,
fork-join based multithreaded programs.
7.3
8. Future Work
9. Conclusion
10. Acknowledgements
The authors would like to thank Ben Zorn for his feedback
during the development of the ideas that led to Grace,
Luis Ceze for graciously providing benchmarks, and Cliff
Click, Dave Dice, Sam Guyer, and Doug Lea for their invaluable comments on earlier drafts of this paper. We also
thank Divya Krishnan for her assistance. This material is
References

[3] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX), pages 117–128, New York, NY, USA, Nov. 2000. ACM.

[4] E. D. Berger, B. G. Zorn, and K. S. McKinley. Composing high-performance memory allocators. In Proceedings of the 2001 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2001), pages 114–124, New York, NY, USA, June 2001. ACM.

[5] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. J. Parallel Distrib. Comput., 37(1):55–69, 1996.

[6] C. Blundell, E. C. Lewis, and M. M. K. Martin. Deconstructing transactions: The subtleties of atomicity. In WDDD '05: 4th Workshop on Duplicating, Deconstructing, and Debunking, June 2005.

[7] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and performance of Munin. In SOSP '91: Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, pages 152–164, New York, NY, USA, 1991. ACM.

[8] G.-I. Cheng, M. Feng, C. E. Leiserson, K. H. Randall, and A. F. Stark. Detecting data races in Cilk programs that use locks. In SPAA '98: Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 298–309, New York, NY, USA, 1998. ACM.

[9] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI '04: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation, pages 10–10, Berkeley, CA, USA, 2004. USENIX Association.

[10] J. Devietti, B. Lucia, L. Ceze, and M. Oskin. DMP: Deterministic shared memory multiprocessing. In ASPLOS '09: Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 85–96, New York, NY, USA, 2009. ACM.