Practical Parallel and Concurrent Programming
ABSTRACT
Multicore computers are now the norm. Taking advantage of these multiple cores entails parallel and concurrent programming. There is therefore a pressing need for courses that teach effective programming on multicore architectures. We believe that such courses should emphasize high-level abstractions for performance and correctness and be supported by tools.

This paper presents a set of freely available course materials for parallel and concurrent programming, along with a testing tool for performance and correctness concerns called Alpaca (A Lovely Parallelism And Concurrency Analyzer). These course materials can be used for a comprehensive parallel and concurrent programming course, à la carte throughout an existing curriculum, or as starting points for graduate special topics courses. We also discuss tradeoffs we made in terms of what to include in course materials.

Categories and Subject Descriptors
H.3.2 [Computers and Education]: Computer and Information Science Education; D.2.8 [Programming Techniques]: Concurrent Programming

General Terms
Human Factors, Languages, Performance

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGCSE’11, March 9–12, 2011, Dallas, Texas, USA.
Copyright 2011 ACM 978-1-4503-0500-6/11/03 ...$10.00.

1. INTRODUCTION
Once upon a time programmers were taught to write sequential programs, with the expectation that new hardware would make their programs perform faster. In the early 2000s, we hit a power wall [1]. Today, all major chip manufacturers have switched to producing computers that contain more than one CPU [24]. Parallel and concurrent programming has rapidly moved from a special-purpose technique to standard practice in writing scalable programs. The future is upon us, and many undergraduate courses are lagging behind.

Scheduling decisions made by an operating system or a language runtime can dramatically affect the behaviour of parallel and concurrent programs. This makes such programs difficult to reason about, and introduces a whole new set of code bugs students may encounter, such as data races or deadlocks. Students programming for multicore systems also face a new set of potential performance bottlenecks, such as lock contention or false sharing. Concurrency bugs are both insidious and prevalent [8, 14, 16]; we need to ensure that all computer science students know how to identify and avoid them.

We need to start teaching courses that prepare students for multicore architectures now, while concurrently updating the whole curriculum. Since parallel and concurrent programming was previously relevant only to the relatively small collection of students interested in high-performance computing, operating systems, or databases, instructors may need help developing course materials for this new subject. In this paper, we put forward several ideas about how to structure parallel and concurrent programming courses, while also providing a set of concrete course materials and tools.

1.1 Contributions
• We present course materials for a 16 week course in parallel and concurrent programming [2]. We begin by describing the course structure (Section 2), and then discuss the individual units (Section 3).

• We present a testing framework, Alpaca [2], which supports deterministic unit tests for parallel and concurrent programs, and allows students to effectively test their concurrent programs. This framework is described in Section 4.

• We discuss tradeoffs we made about what to include (Section 2). We situate these tradeoffs within related work, including comparisons with other courses (Section 5).

2. COURSE STRUCTURE
Practical Parallel and Concurrent Programming (PPCP) is a semester-long course for teaching students how to program parallel/concurrent applications, using the C# and F# languages with other .NET libraries. Taken as a whole,
this 16 week (8 unit) course is aimed at upper-division undergraduate students. However, individual units can also be taught à la carte and could be integrated into various stages of an existing curriculum. Alternatively, selections from the course materials can form the core structure for an in-depth graduate special topics course.

Parallel and Concurrent.
In the course title, we make a distinction between parallel and concurrent programming. Parallel programming is about improving performance by making good use of the underlying parallel resources of a particular machine (such as multiple cores), or the set of machines in a cluster. Concurrent programming is about improving responsiveness by reacting to simultaneously occurring events (coming from the network, user interface, other computers, or peripherals) in a timely and proper fashion. Programs may exhibit both concurrent and parallel characteristics, thus blurring the distinction; we make such a distinction to emphasize that PPCP includes both parallel and concurrent programming. Other courses may only focus on one of these two facets (Section 5).

Performance and Correctness.
In this course, students learn to write parallel algorithms, and explain (and fix) both performance and correctness bugs. PPCP has a strong emphasis on both correctness and performance, with a slight tilt towards correctness. On the performance side, students will learn how to predict and test parallel speedups, and explain actual performance bottlenecks. On the correctness side, students will develop a vocabulary for reasoning about parallelism and correctness, analyze parallel or concurrent code for correctness, and understand expected invariants in their code.

Breadth-first, Abstraction-first.
PPCP has a breadth-first structure. Since there is not a clear consensus on whether any one parallel or concurrent programming model will emerge as a clear winner [22], we feel that students should be exposed to a variety of paradigms and topics. This course is tilted towards shared memory systems with multiple threads, since we feel it is important to prepare students to succeed in this common case. However, a wide breadth of material is covered, including message passing, data parallelism and even some GPU programming.

This course emphasizes productivity and starts with a high level of abstraction. Abstraction layers, such as the .NET Task Parallel Library (TPL) [15], allow students to quickly write parallel or concurrent programs while avoiding explicit thread management. Figure 1 shows an example generic ParQuickSort function which uses Parallel.Invoke from the TPL to run quicksort recursively in parallel; programmers do not have to explicitly create or manage threads.

    static void ParQuickSort<T>(T[] arr, int lo, int hi)
        where T: IComparable<T> {
      if (hi-lo <= Threshold) InsertionSort(arr, lo, hi);
      else {
        int pivot = Partition(arr, lo, hi);
        Parallel.Invoke(
          delegate {ParQuickSort(arr, lo, pivot-1);},
          delegate {ParQuickSort(arr, pivot+1, hi);}
        );
      }
    }

Figure 1: Parallel quicksort with the TPL; Parallel.Invoke() runs a list of delegates in parallel.

Unfortunately, understanding performance bottlenecks requires moving through abstraction layers. For example, discovering which portions of a program lie along the “critical path” of places where optimizations result in parallel speedup requires analyzing the (high level) task dependency graph for a program, whereas preventing false sharing requires understanding the (low level) cache behaviour. After abstractions are introduced, PPCP shows students how to break those abstractions when necessary for performance. Students learn the low-level building blocks (such as monitor synchronization and threads) that are combined to build TPL-level abstractions, and the performance effects of these building blocks interacting with a parallel architecture.

Tool-based learning.
PPCP supports a tool-based approach to correctness and performance. We have developed a tool called Alpaca (A Lovely Parallelism and Concurrency Analyzer) which students can use to explore performance and correctness concepts. Students will build an understanding of correctness conditions and performance problems through experimentation. Throughout PPCP, students will regularly analyze their code, including tests of data race detection, deadlock detection and linearizability checking. These tests leverage the CHESS stateless model checker [20] to run unit tests on different thread schedules, increasing the chances that a scheduling-dependent assertion will be tripped. Alpaca tests can also link into a GUI-based performance analysis tool (called TaskoMeter). We believe that this tool-based approach will improve the learning experience by enabling quick and simple analysis to make debugging less frustrating; preliminary observations of students in our Fall 2010 pilot course support this assertion. We have also found Alpaca and TaskoMeter to be extremely useful in debugging our own code. The Alpaca framework is discussed in more depth in Section 4, and is available for download [2].

In addition to Alpaca tool support, the course materials also include slides and lecture notes, which are supported by a selection of programming samples and a recommended textbook. The programming samples include a variety of appealing examples, including small interactive games and visual examples such as ray tracing programs; additional samples are available online [18]. Further reading on course topics can be found in a recommended textbook [6], which is also available online; suggested sections from this textbook are listed at the end of the slides for each lecture and in the lecture notes.

3. UNITS
Figure 2 shows the dependency structure of the eight units. The course material starts at a high level of abstraction by introducing patterns rather than primitives (units 1-4). For example, in the DAG model of parallel computation tasks are represented as nodes and dependencies between parallel tasks are represented as directed edges. This model is presented in Unit 1 and can be used to reason about both performance and correctness in future units. In later units
tions and determinism analysis when adding internal parallelism to a component. This unit also discusses important thread-safety concepts (such as linearizability) for reasoning about how concurrent accesses to the same component interact.
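
As a minimal sketch of what a thread-safe component might look like, the C# example below guards a counter with a lock so that every concurrent call appears to take effect atomically at one point between its invocation and its return, which is the intuition behind linearizability. The HitCounter and HitCounterDemo names are hypothetical and chosen only for illustration; they are not taken from the PPCP course materials or from Alpaca.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical component for illustration only; not part of the PPCP materials.
public sealed class HitCounter
{
    private readonly object gate = new object();
    private int hits;

    // The lock makes each increment behave as one indivisible step,
    // so concurrent callers never lose an update.
    public void Record()
    {
        lock (gate) { hits++; }
    }

    // Reads take the same lock so they observe a consistent value.
    public int Total
    {
        get { lock (gate) { return hits; } }
    }
}

public static class HitCounterDemo
{
    // Calls Record() from many parallel iterations and returns the final total.
    public static int CountInParallel(int iterations)
    {
        var counter = new HitCounter();
        Parallel.For(0, iterations, i => counter.Record());
        return counter.Total;
    }

    public static void Main()
    {
        // Every increment is protected, so this prints 100000.
        Console.WriteLine(CountInParallel(100000));
    }
}
```

Removing the lock from Record() would turn hits++ into an unsynchronized read-modify-write, and the final total could then depend on the thread schedule.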
Object[] arr;
Figure 6: Screenshot of the TaskoMeter tool

• [PerformanceTestMethod]
This attribute will open up the TaskoMeter user interface (Figure 6) upon running the test. The programmer must demarcate code sections where performance measurements are desired. TaskoMeter displays timing information across different threads within these programmer-defined sections.

• [ScheduleTestMethod]
When a test with this attribute is run, it will run under the CHESS direct execution model checker, to test for deadlocks and assertion violations on different schedules.

• [DataRaceTestMethod]
A test with this attribute behaves very similarly to one with the [ScheduleTestMethod] attribute. However, the system explores fewer schedules but checks for data races on those observed schedules.

5. RELATED WORK

5.1 Unit Testing in Education
A number of researchers have advocated the inclusion of test-driven development into the computer science (CS) curriculum [12, 21, 23]. However, testing concurrent programs is extremely difficult [14]. Because testing frameworks usually do not control scheduling of multiple threads, the results of testing are nondeterministic. In particular, tests do not necessarily expose concurrency-specific bugs such as data races or deadlocks (Figure 3).

Previous pedagogical tool suites exist [3, 7]. However, Alpaca is unique in being tightly coupled with both a set of course materials and a unit testing framework. Ricken et al. [21] present some examples of concurrent unit tests in ConcJUnit; these examples are suitable for an educational context. ConcJUnit is a framework that ensures that all threads spawned in unit tests terminate, but does not explicitly check for concurrency-specific bugs (such as race conditions). Furthermore, since ConcJUnit tests do not control scheduling, they may be nondeterministic. In contrast, Alpaca tests check for concurrency-specific errors, control scheduler-dependent non-determinism, and run the unit tests on different thread schedules.

5.2 Multicore Curricula
Some researchers have advocated sprinkling parallelism and concurrency throughout the CS curriculum and introducing these topics in introductory courses [4, 9]. Although we believe that concurrency should be better integrated with the entire CS curriculum, doing so requires widespread changes. Additionally, special-purpose parallel and concurrent programming courses can help fill the gaps for current students while introduction of concurrency in introductory courses is phased in. For these reasons, the focus of this paper is on the content and structure for a dedicated parallel and concurrent programming course. However, individual course units could be used à la carte at different points throughout a curriculum. Additionally, the Alpaca framework could be used independently from the course materials.

Suzanne Rivoire [22] presents a breadth-first course in multicore programming. This course is an upper-division elective which covers three different paradigms: shared memory with OpenMP, pattern-based parallel programming with Intel’s Threading Building Blocks (TBB), and a graphics processing unit (GPU) programming model with NVIDIA’s CUDA library. These three paradigm-focused course subsections were followed by reading technical papers on other advanced topics within parallel and concurrent programming. Performance profiling was also an important part of the code development process for students. Students found that learning the different syntax and style of the three paradigms was challenging. Although PPCP covers a wide breadth of paradigms, these paradigms are presented in an interoperable framework. We discuss shared memory with threads and synchronization in Unit 5, pattern-based parallel programming with the TPL in Unit 1, and GPU programming with Accelerator in Unit 4: all of these programming models use C# syntax and the same development environment.

5.3 Comparison to Other Courses
The University of California at Berkeley and the University of Illinois at Urbana-Champaign together make up the well-funded Universal Parallel Computing Research Center (UPCRC). For this reason, we focus the following section on courses taught at these schools.

GPU-based courses.
Several courses exist which target programming on graphics processing units (GPUs). These include “Applied Parallel Programming” at UIUC, which is focused on CUDA programming tools and scientific computing workloads. We briefly touch on GPU programming with the presentation of Accelerator in Unit 4. However, PPCP is focused on general programming and has more breadth.

Multicore Programming Courses.
Several multicore-themed courses exist, including Berkeley’s “Parallel Programming for Multicore” and “Applications of Parallel Computers”. These courses are mostly focused on shared memory programming using POSIX threads. “Parallel Programming” at UIUC starts with several lectures on MPI, and then moves on to object-based parallel programming (Charm++), followed by OpenMP and a few other language models (such as High Performance Fortran) and performance bottlenecks.

All the above courses focus more on parallelism than on concurrency, and more on performance than on correctness. Also, they do not include the tool support found with Alpaca. Unlike these courses, PPCP leverages the TPL to start at a high level of abstraction, and provides a unified framework for introducing a variety of programming models.
6. FUTURE WORK
A pilot version of PPCP is being taught at the University of Utah in Fall 2010, while this article goes to press. A second pilot will be taught at the University of Washington in Spring 2011. We are collecting survey data from the students enrolled in the initial pilot course; we have also developed feedback surveys for teachers who use PPCP materials and for anyone who reviews the online course materials. We plan to incorporate feedback from these evaluations.

We are working to develop additional materials; we are currently writing more example programs, as well as slides for lab sessions. We are also incorporating new tests into Alpaca, such as MPI analyses.

7. ACKNOWLEDGEMENTS
Special thanks to Sherif Mahmoud and Chris Dern for their support.

8. ADDITIONAL AUTHORS
Additional Authors (Redmond, WA, USA): Madanlal Musuvathi (Microsoft Research), Shaz Qadeer (Microsoft Research), and Stephen Toub (Microsoft).

9. REFERENCES
[1] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A view of the parallel computing landscape. Communications of the ACM, 52(10):56–67, 2009.
[2] T. Ball, S. Burckhardt, G. Gopalakrishnan, J. Mayo, M. Musuvathi, S. Qadeer, and C. Sadowski. Practical parallel and concurrent programming course materials. http://ppcp.codeplex.com/.
[3] M. Ben-Ari. A suite of tools for teaching concurrency. SIGCSE Bulletin, 36(3):251–251, 2004.
[4] K. Bruce, A. Danyluk, and T. Murtagh. Introducing concurrency in CS 1. In Technical Symposium on Computer Science Education (SIGCSE), 2010.
[5] S. Burckhardt, A. Baldassin, and D. Leijen. Concurrent programming with revisions and isolation types. In Symposium on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), 2010.
[6] C. Campbell, R. Johnson, A. Miller, and S. Toub. Parallel Programming with Microsoft .NET: Design Patterns for Decomposition and Coordination on Multicore Architectures. Microsoft Press, 2010. http://parallelpatterns.codeplex.com/.
[7] S. Carr, J. Mayo, and C. Shene. ThreadMentor: a pedagogical tool for multithreaded programming. Journal on Educational Resources in Computing (JERIC), 3(1):1, 2003.
[8] S. Choi and E. Lewis. A study of common pitfalls in simple multi-threaded programs. SIGCSE Bulletin, 32(1):329, 2000.
[9] D. Ernst and D. Stevenson. Concurrent CS: preparing students for a multicore world. In Conference on Innovation and Technology in Computer Science Education (ITiCSE), 2008.
[10] A. Fekete. Teaching students to develop thread-safe Java classes. In Conference on Innovation and Technology in Computer Science Education (ITiCSE), 2008.
[11] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: portable parallel programming with the message passing interface. The MIT Press, 1999.
[12] D. Janzen and H. Saiedian. Test-driven learning: intrinsic integration of testing into the CS/SE curriculum. In Technical Symposium on Computer Science Education (SIGCSE), 2006.
[13] D. Joiner, P. Gray, T. Murphy, and C. Peck. Teaching parallel computing to science faculty: best practices and common pitfalls. In Symposium on Principles and Practice of Parallel Programming (PPoPP), 2006.
[14] E. Lee. The problem with threads. Computer, 39(5):33–42, 2006.
[15] D. Leijen, W. Schulte, and S. Burckhardt. The design of a task parallel library. SIGPLAN Notices, 44(10):227–242, 2009. http://msdn.microsoft.com/en-us/library/dd460717.aspx.
[16] S. Lu, S. Park, E. Seo, and Y. Zhou. Learning from mistakes: a comprehensive study on real world concurrency bug characteristics. SIGPLAN Notices, 43(3):329–339, 2008.
[17] Microsoft. Parallel Language-Integrated Queries (PLINQ). http://msdn.microsoft.com/en-us/library/dd460688.aspx.
[18] Microsoft. Parallel programming samples for .NET 4. http://code.msdn.microsoft.com/ParExtSamples.
[19] M. Musuvathi and S. Qadeer. Iterative context bounding for systematic testing of multithreaded programs. SIGPLAN Notices, 42(6):446–455, 2007.
[20] M. Musuvathi, S. Qadeer, T. Ball, G. Basler, P. Nainar, and I. Neamtiu. Finding and reproducing heisenbugs in concurrent programs. In Symposium on Operating Systems Design and Implementation (OSDI), 2008.
[21] M. Ricken and R. Cartwright. Test-first Java concurrency for the classroom. In Technical Symposium on Computer Science Education (SIGCSE), 2010.
[22] S. Rivoire. A breadth-first course in multicore and manycore programming. In Technical Symposium on Computer Science Education (SIGCSE), 2010.
[23] J. Spacco and W. Pugh. Helping students appreciate test-driven development (TDD). In Symposium on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), 2006.
[24] H. Sutter. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobbs Journal, 30(3):16–20, 2005.
[25] D. Tarditi, S. Puri, and J. Oglesby. Accelerator: using data parallelism to program GPUs for general-purpose uses. SIGOPS Operating Systems Review, 40(5):325–335, 2006. http://research.microsoft.com/en-us/projects/Accelerator/.