Multi-Core Programming
Shameem Akhter
Jason Roberts
Intel Press
Copyright 2006 Intel Corporation. All rights reserved.
ISBN 0-9764832-4-6
This publication is designed to provide accurate and authoritative information in regard to the
subject matter covered. It is sold with the understanding that the publisher is not engaged in
professional services. If professional advice or other expert assistance is required, the services
of a competent professional person should be sought.
Intel Corporation may have patents or pending patent applications, trademarks, copyrights, or
other intellectual property rights that relate to the presented subject matter. The furnishing of
documents and other materials and information does not provide any license, express or
implied, by estoppel or otherwise, to any such patents, trademarks, copyrights, or other
intellectual property rights.
Intel may make changes to specifications, product descriptions, and plans at any time, without
notice.
Fictitious names of companies, products, people, characters, and/or data mentioned herein are
not intended to represent any real individual, company, product, or event.
Intel products are not intended for use in medical, life saving, life sustaining, critical control or
safety systems, or in nuclear facility applications.
Intel, the Intel logo, Celeron, Intel Centrino, Intel NetBurst, Intel Xeon, Itanium, Pentium, MMX,
and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in
the United States and other countries.
To my mother.
J.R.
Contents
Preface xi
Chapter 1 Introduction to Multi-Core Architecture 1
Motivation for Concurrency in Software 2
Parallel Computing Platforms 5
Parallel Computing in Microprocessors 7
Differentiating Multi-Core Architectures from Hyper-Threading
Technology 10
Multi-threading on Single-Core versus Multi-Core Platforms 11
Understanding Performance 13
Amdahl's Law 14
Growing Returns: Gustafson's Law 18
Key Points 19
System Virtualization 33
Key Points 35
Deadlock 177
Heavily Contended Locks 181
Priority Inversion 181
Solutions for Heavily Contended Locks 183
Non-blocking Algorithms 186
ABA Problem 188
Cache Line Ping-ponging 190
Memory Reclamation Problem 190
Recommendations 191
Thread-safe Functions and Libraries 192
Memory Issues 193
Bandwidth 193
Working in the Cache 194
Memory Contention 197
Cache-related Issues 200
False Sharing 200
Memory Consistency 204
Current IA-32 Architecture 204
Itanium Architecture 207
High-level Languages 210
Avoiding Pipeline Stalls on IA-32 211
Data Organization for High Performance 212
Key Points 213
Glossary 303
References 317
Index 323
Preface
By now, most technology professionals have heard of the radical
transformation taking place in the way that modern computing platforms are
being designed. Intel, IBM, Sun, and AMD have all introduced microprocessors
that have multiple execution cores on a single chip. In 2005, consumers had
the opportunity to purchase desktop platforms, servers, and game consoles
that were powered by CPUs that had multiple execution cores. Future
product roadmaps show that this is only the beginning; rather than racing
towards being the first to 10 gigahertz, semiconductor manufacturers are now
working towards the goal of leading the industry in the number of execution
cores integrated onto a single die. In the future, computing platforms,
whether they are desktop, mobile, server, or specialized embedded platforms,
are most likely to be multi-core in nature.
The fact that the hardware industry is moving in this direction
presents new opportunities for software developers. Previous hardware
platforms presented a sequential programming model to the
programmer. Operating systems and other system software simulated
multitasking environments by exploiting the speed, or lack thereof, of
human perception. As a result, multi-threading was an effective illusion.
With modern multi-core architectures, developers are now presented
with a truly parallel computing platform. This affords software
developers a great deal more power in terms of the ways that they design
and implement their software. In this book, we'll take a look at a variety
of topics that are relevant to writing software for multi-core platforms.
Intended Audience
Our primary objective is to provide the material software developers need
to implement software effectively and efficiently on parallel hardware
Intel Software Development Products
As you'll see throughout the text, and especially in Chapter 11, Intel
provides more than just multi-core processors. In addition to the
hardware platform, Intel has a number of resources for software
developers, including a comprehensive tool suite for threading that
includes:
Intel C++ and Fortran compilers, which support multi-threading
by providing OpenMP and automatic parallelization support
Acknowledgements
This book is the culmination of the efforts of a number of talented
individuals. There are many people that need to be recognized. We'd like
to start off with the list of contributors that developed content for this
book. Chapter 6, OpenMP: A Portable Solution for Threading, was
written by Xinmin Tian. Chapter 7, Solutions to Common Parallel
Programming Problems, was written by Arch Robison. Finally, James
Reinders, with contributions by Eric Moore and Gordon Saladino,
developed Chapter 11, Intel Software Development Products. Other
contributors who developed material for this book include: Sergey
Zheltov, Stanislav Bratanov, Eugene Gorbatov, and Cameron McNairy.
Figure 1.1 End User View of Streaming Multimedia Content via the Internet
independently from one another. This decomposition allows us to break
down each task into a single isolated problem, making the problem much
more manageable.
Concurrency in software is a way to manage the sharing of resources
used at the same time. Concurrency in software is important for several
reasons:
Concurrency allows for the most efficient use of system resources.
Efficient resource utilization is the key to maximizing performance
of computing systems. Unnecessarily creating dependencies
on different components in the system drastically lowers overall
system performance. In the aforementioned streaming media example,
one might naively take the following serial approach on the client side:
1. Wait for data to arrive on the network
2. Uncompress the data
3. Decode the data
4. Send the decoded data to the video/audio hardware
This approach is highly inefficient. The system is completely idle
while waiting for data to come in from the network. A better
approach would be to stage the work so that while the system is
waiting for the next video frame to come in from the network,
the previous frame is being decoded by the CPU, thereby improving
overall resource utilization.
Many software problems lend themselves to simple concurrent
implementations. Concurrency provides an abstraction for
implementing software algorithms or applications that are naturally
parallel. Consider the implementation of a simple FTP server.
Multiple clients may connect and request different files. A single-
threaded solution would require the application to keep track
of all the different state information for each connection. A
more intuitive implementation would create a separate thread for
each connection. The connection state would be managed by this
separate entity. This multi-threaded approach, sketched below, provides
a solution that is much simpler and easier to maintain.
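As a rough sketch of this thread-per-connection idea (not from the original text), the accept loop might look like the following; the accept_connection() and handle_connection() helpers are assumed for illustration and are not part of any real FTP library:

#include <pthread.h>
#include <stdlib.h>

/* Assumed helpers, shown only to make the sketch self-describing:
   accept_connection() blocks until a client connects and returns a socket;
   handle_connection() serves that client and frees its argument. */
extern int accept_connection(int listen_socket);
extern void *handle_connection(void *client_socket);

/* One thread per connection: each client's state lives in its own thread
   instead of being tracked in a single hand-written state machine. */
void serve_forever(int listen_socket)
{
    for (;;) {
        int *client = malloc(sizeof *client);
        *client = accept_connection(listen_socket);
        pthread_t worker;
        pthread_create(&worker, NULL, handle_connection, client);
        pthread_detach(worker);   /* no join needed; the thread cleans up */
    }
}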
It's worth noting here that the terms concurrent and parallel are not
interchangeable in the world of parallel programming. When multiple
1. The term independently is used loosely here. Later chapters discuss the managing of
interdependencies that is inherent in multi-threaded programming.
2. A processor that is capable of executing multiple instructions in a single clock cycle is known as a
super-scalar processor.
Figure 1.6 Two Threads on a Dual-Core Processor with each Thread Running
Independently
In the case of memory caching, each processor core may have its
own cache. At any point in time, the cache on one processor core
may be out of sync with the cache on the other processor core. To
help illustrate the types of problems that may occur, consider the
following example. Assume two threads are running on a dual-core
processor. Thread 1 runs on core 1 and thread 2 runs on core 2. The
threads are reading and writing to neighboring memory locations.
Since cache memory works on the principle of locality, the data
values, while independent, may be stored in the same cache line. As a
result, the memory system may mark the cache line as invalid, even
though the data that the thread is interested in hasn't changed. This
problem is known as false sharing. On a single-core platform, there
is only one cache shared between threads; therefore, cache
synchronization is not an issue.
Thread priorities can also result in different behavior on single-core
versus multi-core platforms. For example, consider an application
that has two threads of differing priorities. In an attempt to improve
performance, the developer assumes that the higher priority thread
will always run without interference from the lower priority thread.
On a single-core platform, this may be valid, as the operating system's
scheduler will not yield the CPU to the lower priority thread.
However, on multi-core platforms, the scheduler may schedule both
threads on separate cores. Therefore, both threads may run
simultaneously. If the developer had optimized the code to assume
that the higher priority thread would always run without interference
from the lower priority thread, the code would be unstable on multi-
core and multi-processor systems.
One goal of this book is to help developers correctly utilize the number
of processor cores they have available.
Understanding Performance
At this point one may wonder: how do I measure the performance
benefit of parallel programming? Intuition tells us that if we can
subdivide disparate tasks and process them simultaneously, we're likely
3. Multi-core CPU architectures can be designed in a variety of ways: some multi-core CPUs will share the
on-chip cache between execution units; some will provide a dedicated cache for each execution core;
and others will take a hybrid approach, where the cache is subdivided into layers that are dedicated to a
particular execution core and other layers that are shared by all execution cores. For the purposes of
this section, we assume a multi-core architecture with a dedicated cache for each core.
Amdahl's Law
Given the previous definition of speedup, is there a way to determine the
theoretical limit on the performance benefit of increasing the number of
processor cores, and hence physical threads, in an application? When
examining this question, one generally starts with the work done by
Gene Amdahl in 1967. His rule, known as Amdahl's Law, examines the
maximum theoretical performance benefit of a parallel solution relative
to the best case performance of a serial solution.
Amdahl started with the intuitively clear statement that program
speedup is a function of the fraction of a program that is accelerated and
by how much that fraction is accelerated.
Speedup = 1 / ((1 - FractionEnhanced) + (FractionEnhanced / SpeedupEnhanced))
So, if you could speed up half the program by 15 percent, you'd get:
Speedup = 1 / ((1 - .50) + (.50/1.15)) = 1 / (.50 + .43) = 1.08
Speedup = 1 / (S + (1 - S)/n)          (Equation 1.1)
In this equation, S is the time spent executing the serial portion of the
parallelized version and n is the number of processor cores. Note that the
numerator in the equation assumes that the program takes 1 unit of time
to execute the best sequential algorithm.
If you substitute 1 for the number of processor cores, you see that no
speedup is realized. If you have a dual-core platform doing half the work,
the result is:
1 / (0.5S + 0.5S/2) = 1/(0.75S) = 1.33
or a 33-percent speed-up, because the run time, as given by the
denominator, is 75 percent of the original run time. For an 8-core
processor, the speedup is:
1 / (0.5S + 0.5S/8) = 1/(0.5625S) = 1.78
Setting n = ∞ in Equation 1.1, and assuming that the best sequential
algorithm takes 1 unit of time yields Equation 1.2.
Speedup = 1 / S
Given this outcome, you can see the first corollary of Amdahl's
law: decreasing the serialized portion by increasing the parallelized
portion is of greater importance than adding more processor cores. For
example, if you have a program that is 30-percent parallelized running on
a dual-core system, doubling the number of processor cores reduces run
time from 85 percent of the serial time to 77.5 percent, whereas
doubling the amount of parallelized code reduces run time from 85
percent to 70 percent. This is illustrated in Figure 1.7. Only when a
program is mostly parallelized does adding more processors help more
than parallelizing the remaining code. And, as you saw previously, you
have hard limits on how much code can be serialized and on how many
additional processor cores actually make a difference in performance.
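The arithmetic above is easy to check in code. The small helper below (not from the original text) evaluates Equation 1.1 for a given serial fraction and core count and reproduces the 2-core and 8-core numbers:

#include <stdio.h>

/* Amdahl's Law, Equation 1.1: speedup = 1 / (S + (1 - S)/n),
   where S is the serial fraction and n is the number of cores. */
double amdahl_speedup(double serial_fraction, int cores)
{
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
}

int main(void)
{
    printf("2 cores: %.2f\n", amdahl_speedup(0.5, 2));   /* about 1.33 */
    printf("8 cores: %.2f\n", amdahl_speedup(0.5, 8));   /* about 1.78 */
    return 0;
}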
where H(n) = overhead, and again, we assume that the best serial
algorithm runs in one time unit. Note that this overhead is not linear on a
good parallel machine.
where N is the number of processor cores and s is the ratio of the time
spent in the serial portion of the program versus the total execution time.
Key Points
This chapter demonstrated the inherent concurrent nature of many
software applications and introduced the basic need for parallelism in
hardware. An overview of the different techniques for achieving parallel
execution was discussed. Finally, the chapter examined techniques for
estimating the performance benefits of using proper multi-threading
techniques. The key points to keep in mind are:
Concurrency refers to the notion of multiple threads in progress
at the same time. This is often achieved on sequential processors
through interleaving.
Parallelism refers to the concept of multiple threads executing
simultaneously.
Modern software applications often consist of multiple processes
or threads that can be executed in parallel.
Most modern computing platforms are multiple instruction,
multiple data (MIMD) machines. These machines allow
programmers to process multiple instruction and data streams
simultaneously.
In practice, Amdahl's Law does not accurately reflect the benefit
of increasing the number of processor cores on a given platform.
Linear speedup is achievable by expanding the problem size with
the number of processor cores.
Chapter 2
System Overview of Threading
When implemented properly, threading can enhance performance by
making better use of hardware resources. However, the improper
use of threading can lead to degraded performance, unpredictable
behavior, and error conditions that are difficult to resolve. Fortunately, if
you are equipped with a proper understanding of how threads operate,
you can avoid most problems and derive the full performance benefits
that threads offer. This chapter presents the concepts of threading
starting from hardware and works its way up through the operating
system and to the application level.
To understand threading for your application you need to understand
the following items:
The design approach and structure of your application
The threading application programming interface (API)
The compiler or runtime environment for your application
The target platforms on which your application will run
From these elements, a threading strategy can be formulated for use in
specific parts of your application.
Since the introduction of instruction-level parallelism, continuous
advances in the development of microprocessors have resulted in
processors with multiple cores. To take advantage of these multi-core
processors you must understand the details of the software threading
model as well as the capabilities of the platform hardware.
You might be concerned that threading is difficult and that you might
have to learn specialized concepts. While it's true in general, in reality
threading can be simple, once you grasp the basic principles.
Defining Threads
A thread is a discrete sequence of related instructions that is executed
independently of other instruction sequences. Every program has at least
one threadthe main threadthat initializes the program and begins
executing the initial instructions. That thread can then create other
threads that perform various tasks, or it can create no new threads and
simply do all the work itself. In either case, every program has at least
one thread. Each thread maintains its current machine state.
At the hardware level, a thread is an execution path that remains
independent of other hardware thread execution paths. The operating
system maps software threads to hardware execution resources as
described later in this chapter.
The decision to thread your application should reflect the needs of
the program and the basic execution capabilities of the deployment
platform. Not everything should be threaded. Too much threading can
hurt performance. As with many aspects of programming, thoughtful
design and proper testing determine the right balance.
Figure: the computation model of threading. User-level threads are used by the executable application and handled by the user-level OS; kernel-level threads are used by the operating system kernel and handled by the kernel-level OS; hardware threads are used by each processor.
#include <stdio.h>
// Have to include 'omp.h' to get OpenMP definitions
#include <omp.h>
int main()
{
    int threadID, totalThreads;
    /* OpenMP pragma specifies that following block is
       going to be parallel and the threadID variable is
       private in this openmp block. */
    #pragma omp parallel private(threadID)
    {
        threadID = omp_get_thread_num();
        #pragma omp single
        totalThreads = omp_get_num_threads();
        printf("Hello from thread %d of %d threads\n", threadID, totalThreads);
    }
    return 0;
}

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 5
As can be seen, the OpenMP code in Listing 2.1 has no function that
corresponds to thread creation. This is because OpenMP creates threads
automatically in the background. Explicit low-level coding of threads
is more evident in Pthreads, shown in Listing 2.2, where a call to
pthread_create() actually creates a single thread and points it at the
work to be done in PrintHello().
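The Pthreads listing itself is cut off above. As a rough sketch of the pattern just described, and not necessarily the exact code of Listing 2.2, a complete program built around pthread_create() and a PrintHello() routine might look like this:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 5

/* Worker routine: each created thread starts executing here. */
void *PrintHello(void *threadid)
{
    printf("Hello World from thread %ld\n", (long)threadid);
    pthread_exit(NULL);
}

int main()
{
    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++) {
        /* Create one thread and point it at the work in PrintHello(). */
        int rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
        if (rc) {
            printf("ERROR: pthread_create() returned %d\n", rc);
            exit(-1);
        }
    }
    pthread_exit(NULL);   /* let the worker threads finish before the process exits */
}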
Figure: architecture view of the system. The application layer (applications and required service components) sits on the user-level OS (system libraries), which sits on the kernel-level OS (process and thread resource manager, scheduler, I/O manager, memory manager, and other internal operational units), which sits on the HAL (hardware abstraction layer) and the underlying architecture (processors and chipset).
threads, which are called fibers on the Windows platform, require the
programmer to create the entire management infrastructure for the
threads and to manually schedule their execution. Their benefit is that
the developer can manipulate certain details that are obscured in kernel-
level threads. However, because of this manual overhead and some
additional limitations, fibers might not add much value for well designed
multi-threaded applications.
Kernel-level threads provide better performance, and multiple kernel
threads from the same process can execute on different processors or
cores. The overhead associated with kernel-level threading is higher than
user-level threading and so kernel-level threads are frequently reused
once they have finished their original work.
Processes are discrete program tasks that have their own address space.
They are the coarse-level execution unit maintained as an independent entity
inside an operating system. There is a direct correlation between processes
and threads. Multiple threads can reside in a process. All threads in a process
share the same address space and so they benefit from simple inter-thread
communication. Instead of maintaining an individual process-based thread
list, the kernel maintains a thread table to keep track of all threads. The
operating system assigns a process control block (PCB) to each process; it
contains data on the process's unique identity, current machine state, the
priority of the process, and the address of the virtual memory where the
process resides.
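As a rough illustration only (the field names below are hypothetical and do not correspond to any particular operating system), a PCB can be pictured as a structure along these lines:

/* Hypothetical sketch of a process control block (PCB). */
typedef struct cpu_state {
    unsigned long registers[16];    /* saved general-purpose registers */
    unsigned long program_counter;  /* current machine state           */
} cpu_state_t;

typedef struct process_control_block {
    int                 pid;             /* unique process identity            */
    int                 priority;        /* scheduling priority of the process */
    cpu_state_t         machine_state;   /* saved register file and PC         */
    void               *virtual_memory;  /* address of the process's memory    */
    struct thread_info *threads;         /* threads belonging to this process  */
} pcb_t;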
Figure 2.4 shows the relationship between processors, processes, and
threads in modern operating systems. A processor runs threads from one
or more processes, each of which contains one or more threads.
Figure 2.4: threads T1, T2, ..., Tm map to processes OP1, OP2, ..., OPn, which in turn map to processors (through the MMU).
Figures: mappings of user-level threads, processes, and operating system threads across user space, kernel space (operating system scheduler, HAL), and hardware processor cores (P/C), shown for the different threading models.
Figure: operational paths over time, contrasting concurrency (multiple operations in progress on the same resource) with parallelism (operations executing simultaneously on separate resources).
Figure: memory layout of a multi-threaded process, with a separate stack region for Thread 1 and Thread 2 near address N and the heap below.
Figure: thread state diagram. A new thread enters the Ready state, is moved to Running by the scheduler dispatch, returns to Ready on an interrupt, moves to Waiting on an event wait and back to Ready on event completion, and terminates on exit.
Runtime Virtualization
Runtime virtualization is provided by a runtime
virtual machine. These virtual machines (VMs) can be considered as a
container and executor application on top of an operating system. There
are two mainstream VMs in use today: the Java VM and Microsofts
Common Language Runtime (CLR) that were discussed previously. These
VMs, for example, create at least three threads: the executing thread, a
garbage-collection thread that frees memory blocks that are no longer in
use, and a thread for just-in-time (JIT) compilation of bytecodes into
executable binary code. The VMs generally create other threads for
internal tasks. The VM and the operating system work in tandem to map
these threads to the available execution resources in a way that will
benefit performance as much as possible.
System Virtualization
System virtualization creates a different type of virtual machine. These
VMs recreate a complete execution context for software: they use
virtualized network adapters and disks and run their own instance of
the operating system. Several such VMs can run on the same hardware
platform, each with its separate operating system. The virtualization
layer that sits between the host system and these VMs is called the
virtual machine monitor (VMM). The VMM is also known as the
hypervisor. Figure 2.9 compares systems running a VMM with one that
does not.
running on the actual hardware. The VMM executes the instructions but
pays little notice to what application threads are running.
The only time the VMM interrupts this process is when it needs to
swap out a VM or perform internal tasks. In such a case, several issues
can arise. For example, when a VMM is running multiple guest VMs, it
has to time-slice between them. Suppose a thread is locked and waiting
for a thread running on a different virtual processor when that other
processor is swapped out. The original VM will encounter a substantial
delay that would not occur if both VMs had been running on their own
dedicated hardware systems. This problem, known as lock-holder pre-
emption, is one of several that arise from the fact that guest VM
resources must be swapped out at times and the exact state of all threads
might not expect this situation. However, as virtualization becomes more
widely adopted, it's likely that operating systems will offer features that
assist VMMs to coordinate this kind of activity.
Key Points
The concepts of threading depend on an understanding of the interaction
of various system components.
To properly comprehend the impact of threading, it is important
to understand the impact of threads on system components.
Software threads are different than hardware threads, but maintain
a direct relationship.
Application threading can be implemented using APIs or multi-
threading libraries.
Processes, threads, and fibers are different levels of the execution
mechanism within a system.
The thread life cycle has four stages: ready, running, waiting
(blocked), and terminated.
There are two types of virtualization on a system: runtime
virtualization and system virtualization.
A virtual machine monitor (VMM) typically makes no attempt to
match application threads to specific processor cores or to
second-guess the guest operating system's scheduler.
Chapter 3
Fundamental Concepts of Parallel Programming
As discussed in previous chapters, parallel programming uses threads
to enable multiple operations to proceed simultaneously. The entire
concept of parallel programming centers on the design, development,
and deployment of threads within an application and the coordination
between threads and their respective operations. This chapter examines
how to break up programming tasks into chunks that are suitable for
threading. It then applies these techniques to the apparently serial
problem of error diffusion.
point in the process, one step generally flows into the next, leading up to
a predictable conclusion, based on predetermined parameters.
To move from this linear model to a parallel programming model,
designers must rethink the idea of process flow. Rather than being
constrained by a sequential execution sequence, programmers should
identify those activities that can be executed in parallel. To do so, they
must see their programs as a set of tasks with dependencies between
them. Breaking programs down into these individual tasks and identifying
dependencies is known as decomposition. A problem may be decomposed
in several ways: by task, by data, or by data flow. Table 3.1 summarizes
these forms of decomposition. As you shall see shortly, these different
forms of decomposition mirror different types of programming activities.
Table 3.1 Summary of the Major Forms of Decomposition
Task: different activities assigned to different threads (common in GUI applications).
Data: multiple threads performing the same operation, but on different blocks of data (common in audio processing, imaging, and scientific programming).
Data Flow: one thread's output is the input to a second thread (special care is needed to eliminate startup and shutdown latencies).
Task Decomposition
Decomposing a program by the functions that it performs is called task
decomposition. It is one of the simplest ways to achieve parallel
execution. Using this approach, individual tasks are catalogued. If two of
them can run concurrently, they are scheduled to do so by the
developer. Running tasks in parallel this way usually requires slight
modifications to the individual functions to avoid conflicts and to
indicate that these tasks are no longer sequential.
If we were discussing gardening, task decomposition would suggest
that gardeners be assigned tasks based on the nature of the activity: if
two gardeners arrived at a client's home, one might mow the lawn while
the other weeded. Mowing and weeding are separate functions broken
out as such. To accomplish them, the gardeners would make sure to have
some coordination between them, so that the weeder is not sitting in the
middle of a lawn that needs to be mowed.
generally fall into one of several well known patterns. A few of the more
common parallel programming patterns and their relationship to the
aforementioned decompositions are shown in Table 3.2.
Table 3.2 Common Parallel Programming Patterns
Pattern Decomposition
Task-level parallelism Task
Divide and Conquer Task/Data
Geometric Decomposition Data
Pipeline Data Flow
Wavefront Data Flow
In this section, we'll provide a brief overview of each pattern and the
types of problems that each pattern may be applied to.
Task-level Parallelism Pattern. In many cases, the best way to
achieve parallel execution is to focus directly on the tasks
themselves. In this case, the task-level parallelism pattern makes
the most sense. In this pattern, the problem is decomposed into a
set of tasks that operate independently. It is often necessary to
remove dependencies between tasks or separate dependencies
using replication. Problems that fit into this pattern include the
so-called embarrassingly parallel problems, those where there
are no dependencies between threads, and replicated data
problems, those where the dependencies between threads may
be removed from the individual threads.
Divide and Conquer Pattern. In the divide and conquer pattern,
the problem is divided into a number of parallel sub-problems.
Each sub-problem is solved independently. Once each sub-
problem is solved, the results are aggregated into the final
solution. Since each sub-problem can be independently solved,
these sub-problems may be executed in a parallel fashion.
The divide and conquer approach is widely used on sequential
algorithms such as merge sort. These algorithms are very easy to
parallelize. This pattern typically does a good job of load
balancing and exhibits good locality, which is important for
effective cache usage.
Geometric Decomposition Pattern. The geometric decomposi-
tion pattern is based on the parallelization of the data structures
The numbers in Figure 3.1 illustrate the order in which the data
elements are processed. For example, elements in the diagonal
that contains the number 3 are dependent on data elements
1 and 2 being processed previously. The shaded data
elements in Figure 3.1 indicate data that has already been
processed. In this pattern, it is critical to minimize the idle time
spent by each thread. Load balancing is the key to success with
this pattern.
For a more extensive and thorough look at parallel programming design
patterns, refer to the book Patterns for Parallel Programming (Mattson
2004).
Original 8-bit image on the left, resultant 2-bit image on the right. At the resolution
of this printing, they look similar.
The same images as above but zoomed to 400 percent and cropped to 25 percent
to show pixel detail. Now you can clearly see the 2-bit black-white rendering on the
right and 8-bit gray-scale on the left.
The basic error diffusion algorithm does its work in a simple three-
step process:
1. Determine the output value given the input value of the current
pixel. This step often uses quantization, or in the binary case,
thresholding. For an 8-bit grayscale image that is displayed on a 1-bit
output device, all input values in the range [0, 127] are to be
displayed as a 0 and all input values between [128, 255] are to
be displayed as a 1 on the output device.
2. Once the output value is determined, the code computes the
error between what should be displayed on the output device
and what is actually displayed. As an example, assume that the
current input pixel value is 168. Given that it is greater than our
threshold value (128), we determine that the output value will be
a 1. This value is stored in the output array. To compute the
error, the program must normalize output first, so it is in the
same scale as the input value. That is, for the purposes of
computing the display error, the output pixel must be 0 if the
output pixel is 0 or 255 if the output pixel is 1. In this case, the
display error is the difference between the actual value that
should have been displayed (168) and the output value (255),
which is -87.
3. Finally, the error value is distributed on a fractional basis to the
neighboring pixels in the region, as shown in Figure 3.3.
to the pixel to the right of the current pixel that is being processed.
5/16ths of the error is added to the pixel in the next row, directly below
the current pixel. The remaining errors propagate in a similar fashion.
While you can use other error weighting schemes, all error diffusion
algorithms follow this general method.
The three-step process is applied to all pixels in the image. Listing 3.1
shows a simple C implementation of the error diffusion algorithm, using
Floyd-Steinberg error weights.
/**************************************
* Initial implementation of the error diffusion algorithm.
***************************************/
void error_diffusion(unsigned int width,
                     unsigned int height,
                     unsigned short **InputImage,
                     unsigned short **OutputImage)
{
    for (unsigned int i = 0; i < height; i++)
    {
        for (unsigned int j = 0; j < width; j++)
        {
            /* 1. Compute the value of the output pixel */
            if (InputImage[i][j] < 128)
                OutputImage[i][j] = 0;
            else
                OutputImage[i][j] = 1;

            /* 2. Compute the display error: input value minus the
                  normalized output value (0 or 255) */
            int err = InputImage[i][j] - 255 * OutputImage[i][j];

            /* 3. Distribute the error to the neighboring pixels using
                  the Floyd-Steinberg weights (boundary checks at the
                  image edges are omitted for clarity) */
            InputImage[i][j+1]   += (err * 7) / 16;
            InputImage[i+1][j-1] += (err * 3) / 16;
            InputImage[i+1][j]   += (err * 5) / 16;
            InputImage[i+1][j+1] += (err * 1) / 16;
        }
    }
}
Given that a pixel may not be processed until its spatial predecessors
have been processed, the problem appears to lend itself to an approach
where we have a producer (or in this case, multiple producers)
producing data (error values) which a consumer (the current pixel) will
use to compute the proper output pixel. The flow of error data to the
current pixel is critical. Therefore, the problem seems to break down
into a data-flow decomposition.
Now that we identified the approach, the next step is to determine the
best pattern that can be applied to this particular problem. Each
independent thread of execution should process an equal amount of work
(load balancing). How should the work be partitioned? One way, based on
the algorithm presented in the previous section, would be to have a thread
that processed the even pixels in a given row, and another thread that
processed the odd pixels in the same row. This approach is ineffective,
however; each thread will be blocked waiting for the other to complete,
and the performance could be worse than in the sequential case.
To effectively subdivide the work among threads, we need a way to
reduce (or ideally eliminate) the dependency between pixels. Figure 3.4
illustrates an important point that's not obvious in Figure 3.3that in
order for a pixel to be able to be processed, it must have three error
values (labeled eA, eB, and eC in Figure 3.3) from the previous row, and
one error value from the pixel immediately to the left on the current
row. Thus, once these pixels are processed, the current pixel may
complete its processing. This ordering suggests an implementation
where each thread processes a row of data. Once a row has completed
processing of the first few pixels, the thread responsible for the next row
may begin its processing. Figure 3.5 shows this sequence.
1. We assume eA = eD = 0 at the left edge of the page (for pixels in column 0); and that eC = 0 at the
right edge of the page (for pixels in column W-1, where W = the number of pixels in the image).
Notice that a small latency occurs at the start of each row. This
latency is due to the fact that the previous row's error data must be
calculated before the current row can be processed. These types of
latency are generally unavoidable in producer-consumer implementations;
however, you can minimize the impact of the latency as illustrated here.
The trick is to derive the proper workload partitioning so that each
thread of execution works as efficiently as possible. In this case, you
incur a two-pixel latency before processing of the next thread can begin.
An 8.5" X 11" page, assuming 1,200 dots per inch (dpi), would have
10,200 pixels per row. The two-pixel latency is insignificant here.
The sequence in Figure 3.5 illustrates the data flow common to the
wavefront pattern.
Other Alternatives
In the previous section, we proposed a method of error diffusion where
each thread processed a row of data at a time. However, one might
consider subdividing the work at a higher level of granularity.
Instinctively, when partitioning work between threads, one tends to look
for independent tasks. The simplest way of parallelizing this problem
would be to process each page separately. Generally speaking, each page
would be an independent data set, and thus, it would not have any
interdependencies. So why did we propose a row-based solution instead
of processing individual pages? The three key reasons are:
An image may span multiple pages. This implementation would
impose a restriction of one image per page, which might or might
not be suitable for the given application.
Increased memory usage. An 8.5 x 11-inch page at 1,200 dpi
consumes 131 megabytes of RAM. Intermediate results must be
saved; therefore, this approach would be less memory efficient.
An application might, in a common use-case, print only a
single page at a time. Subdividing the problem at the page level
would offer no performance improvement from the sequential
case.
A hybrid approach would be to subdivide the pages and process regions
of a page in a thread, as illustrated in Figure 3.6.
Note that each thread must work on sections from different pages.
This increases the startup latency involved before the threads can begin
work. In Figure 3.6, Thread 2 incurs a 1/3 page startup latency before it
can begin to process data, while Thread 3 incurs a 2/3 page startup
latency. While somewhat improved, the hybrid approach suffers from
similar limitations as the page-based partitioning scheme described
above. To avoid these limitations, you should focus on the row-based
error diffusion implementation illustrated in Figure 3.5.
Key Points
This chapter explored different types of computer architectures and how
they enable parallel software development. The key points to keep in
mind when developing solutions for parallel computing architectures are:
Decompositions fall into one of three categories: task, data, and
data flow.
Task-level parallelism partitions the work between threads based
on tasks.
Data decomposition breaks down tasks based on the data that the
threads work on.
Synchronization
Synchronization is an enforcing mechanism used to impose constraints
on the order of execution of threads. The synchronization controls the
relative order of thread execution and resolves any conflict among
threads that might produce unwanted behavior. In simple terms,
synchronization is used to coordinate thread execution and manage
shared data.
In an environment where messages are used for communicating
between a sender and a receiver, synchronization is implicit, as a
message must be sent before the message can be received. On the other
hand, for a shared-memory based environment, threads have no implicit
interdependency unless some constraints are imposed.
Two types of synchronization operations are widely used: mutual
exclusion and condition synchronization. In the case of mutual
Figure: threads Ti and Tj accessing shared data d = f(t) = s(..., ti, tj, tk, tl, ...) at times ti, tj, tk, and tl.
Figure: threads T1 through Tn entering a parallel code block in which a section needs multithread synchronization; the threads perform synchronization operations using parallel constructs Bi and Bj.
Critical Sections
A critical section is a section of a code block where shared
dependency variables reside and where those shared variables have dependencies
among multiple threads. Different synchronization primitives are used to
keep critical sections safe. With the use of proper synchronization
techniques, only one thread is allowed access to a critical section at any
one instance. The major challenge of threaded programming is to
implement critical sections in such a way that multiple threads perform
mutually exclusive operations for critical sections and do not use critical
sections simultaneously.
Deadlock
Deadlock occurs whenever a thread is blocked waiting on a resource of
another thread that will never become available. According to the
circumstances, different deadlocks can occur: self-deadlock, recursive
deadlock, and lock-ordering deadlock. In most instances, deadlock means
lock-ordering deadlock.
The self-deadlock is the instance or condition when a thread, Ti, wants
to acquire a lock that is already owned by thread Ti. In Figure 4.5 (a), at
time ta thread Ti owns lock li, where li is going to get released at tc.
However, there is a call at tb from Ti, which requires li. The release time of
li is td, where td can be either before or after tc. In this scenario, thread Ti is
in self-deadlock condition at tb. When the wakeup path of thread Ti, resides
in another thread, Tj, that condition is referred to as recursive deadlock, as
shown in Figure 4.5 (b). Figure 4.5 (c) illustrates a lock-ordering deadlock,
where thread Ti locks resource rj and waits for resource ri, which is being
locked by thread Tj. Also, thread Tj locks resource ri and waits for resource
rj, which is being locked by thread Ti. Here, both threads Ti and Tj are in
deadlock at td, and w is the wait-function for a lock.
Figure 4.5: deadlock scenarios. (a) Self-deadlock: thread Ti holds lock li from time ta (releasing it at tc) but issues a call at tb that requires li again, with release time td. (b) Recursive deadlock between threads Ti and Tj. (c) Lock-ordering deadlock: Ti holds rj and waits (wi = f(td)) for ri, while Tj holds ri and waits (wj = f(td)) for rj.
Figure: state diagram for thread Ti, with states ri, ai, fi, si, sj, and the deadlock state sd.
For any thread Ti, if the state transition of Ti becomes sd for all
possible scenarios and remains blocked at sd, thread Ti would not have
any way to transition from sd to any other state. That is why state sd is
called the deadlock state for thread Ti.
Avoiding deadlock is one of the challenges of multi-threaded
programming. There must not be any possibility of deadlock in an
application. A lock-holding prevention mechanism or the creation of lock
hierarchy can remove a deadlock scenario. One recommendation is to
use only the appropriate number of locks when implementing
synchronization. Chapter 7 has a more detailed description of deadlock
and how to avoid it.
Synchronization Primitives
Synchronization is typically performed by three types of primitives:
semaphores, locks, and condition variables. The use of these primitives
depends on the application requirements. These synchronization primitives
are implemented by atomic operations and use appropriate memory fences.
A memory fence, sometimes called a memory barrier, is a processor-dependent
operation that guarantees that threads see other threads' memory
operations by maintaining reasonable order. To hide the granularity of these
synchronization primitives, higher level synchronizations are used. That way
application developers have to concern themselves less about internal
details.
Semaphores
Semaphores, the first set of software-oriented primitives to accomplish
mutual exclusion of parallel process synchronization, were introduced by
the well known mathematician Edsger Dijkstra in his 1968 paper, The
Structure of the THE-Multiprogramming System (Dijkstra 1968). Dijkstra
illustrated that synchronization can be achieved by using only traditional
machine instructions or hierarchical structure. He proposed that a
semaphore can be represented by an integer, sem, and showed that a
semaphore can be bounded by two basic atomic operations, P (proberen,
which means test) and V (verhogen, which means increment). These
atomic operations are also referred as synchronizing primitives. Even
though the details of Dijkstra's semaphore representation have evolved,
the fundamental principle remains the same: P represents the
potential delay or wait, and V represents the barrier removal or
release of a thread. These two synchronizing primitives can be
represented for a semaphore s as follows:
Thread "T" performs operation "P":
P(s) atomic {sem = sem-1; temp = sem}
if (temp < 0)
{Thread T blocked and enlists on a
waiting list for s}
Thread "T" performs operation "V":
V(s) atomic {sem = sem +1; temp = sem}
if (temp <=0)
{Release one thread from the waiting
list for s}
where semaphore value sem is initialized with the value 0 or 1 before the
parallel processes get started. In Dijkstra's representation, T referred to
processes. Threads are used here instead to be more precise and to
remain consistent about the differences between threads and processes.
The P operation blocks the calling thread if the value remains 0, whereas
the V operation, independent of P operation, signals a blocked thread to
allow it to resume operation. These P and V operations are indivisible
actions and perform simultaneously. The positive value of sem
represents the number of threads that can proceed without blocking, and
the negative number refers to the number of blocked threads. When the
sem value becomes zero, no thread is waiting, and if a thread needs to
decrement, the thread gets blocked and keeps itself in a waiting list.
When the value of sem gets restricted to only 0 and 1, the semaphore is a
binary semaphore.
To use a semaphore, you can consider it as a counter that
supports two atomic operations. Implementations of semaphores vary.
From a usability perspective, two kinds of semaphores exist: strong and
weak. These differ in the guarantees they make about individual calls on P. A strong
semaphore maintains a First-Come-First-Served (FCFS) model and
guarantees that threads calling P are served in order, avoiding starvation. A weak
semaphore is one that does not provide any guarantee of service to
a particular thread, so the thread might starve. For example, POSIX
semaphores can leave threads starving; they are implemented
differently from what Dijkstra defined and are considered weak
semaphores (Reek 2002).
According to Dijkstra, the mutual exclusion of parallel threads using
P and V atomic operations is represented as follows:
semaphore s
s.sem = 1
begin
T: <non-critical section>
P(s)
<critical section>
V(s)
Goto T
end
semaphore s
void producer () {
while (1) {
<produce the next data>
s->release()
}
}
void consumer() {
while (1) {
s->wait()
<consume the next data>
}
}
void producer() {
while (1) {
sEmpty->wait()
<produce the next data>
sFull->release()
}
}
void consumer() {
while (1) {
sFull->release()
<consume the next data>
sEmpty->wait()
}
}
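As one concrete way to express the dual-semaphore pattern, the sketch below uses the POSIX semaphore API (sem_wait() and sem_post() play the roles of wait() and release()); the single-slot mailbox and the fixed loop count are illustrative choices, not part of the original listing:

#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>

/* One-slot mailbox guarded by two semaphores:
   sEmpty counts free slots, sFull counts filled slots. */
static sem_t sEmpty, sFull;
static int mailbox;

static void *producer(void *arg)
{
    for (int i = 0; i < 10; i++) {
        sem_wait(&sEmpty);        /* wait until the slot is free        */
        mailbox = i;              /* produce the next data              */
        sem_post(&sFull);         /* signal that the slot is now full   */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    for (int i = 0; i < 10; i++) {
        sem_wait(&sFull);         /* wait until data has arrived        */
        printf("consumed %d\n", mailbox);
        sem_post(&sEmpty);        /* signal that the slot is free again */
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    sem_init(&sEmpty, 0, 1);      /* the slot starts out empty */
    sem_init(&sFull, 0, 0);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}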
Locks
Locks are similar to semaphores except that a single thread handles a lock
at one instance. Two basic atomic operations get performed on a lock:
acquire(): Atomically waits for the lock state to be unlocked and
sets the lock state to lock.
release(): Atomically changes the lock state from locked to
unlocked.
At most one thread acquires the lock. A thread has to acquire a lock
before using a shared resource; otherwise it waits until the lock becomes
available. When one thread wants to access shared data, it first acquires
the lock, exclusively performs operations on the shared data and later
releases the lock for other threads to use. The level of granularity can be
either coarse or fine depending on the type of shared data that needs to
be protected from threads. The coarse granular locks have higher lock
contention than finer granular ones. To remove issues with lock
granularity, most of the processors support the Compare and Swap (CAS)
operation, which provides a way to implement lock-free synchronization.
The atomic CAS operations guarantee that the shared data remains
synchronized among threads. If you require the use of locks, it is
recommended that you use the lock inside a critical section with a single
entry and single exit, as shown in Figure 4.9.
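As a small illustration of how CAS supports lock-free synchronization, the sketch below implements a lock-free counter increment with the GCC __sync_bool_compare_and_swap() builtin; the builtin is just one of several ways to reach the hardware CAS instruction, and the retry loop is the essential pattern:

/* Lock-free increment of a shared counter using compare-and-swap.
   If another thread changed the counter between the read and the CAS,
   the CAS fails and the loop retries with the fresh value. */
static volatile long counter = 0;

void lock_free_increment(void)
{
    long old_value, new_value;
    do {
        old_value = counter;
        new_value = old_value + 1;
    } while (!__sync_bool_compare_and_swap(&counter, old_value, new_value));
}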
Lock Types
An application can have different types of locks according to the
constructs required to accomplish the task. You must avoid mixing lock
types within a given task. For this reason, special attention is required
when using any third party library. If your application has some third
party dependency for a resource R and the third party uses lock type L
for R, then if you need to use a lock mechanism for R, you must use lock
type L rather any other lock type. The following sections cover these
locks and define their purposes.
Mutexes. The mutex is the simplest lock an implementation can use.
Some texts use the mutex as the basis to describe locks in general. The
release of a mutex does not depend on the release() operation only. A
timer attribute can be added with a mutex. If the timer expires before a
release operation, the mutex releases the code block or shared memory
to other threads. A try-finally clause can be used to make sure that the
mutex gets released when an exception occurs. The use of a timer or try-
finally clause helps to prevent a deadlock scenario.
Recursive Locks. Recursive locks are locks that may be repeatedly
acquired by the thread that currently owns the lock without causing the
thread to deadlock. No other thread may acquire a recursive lock until
the owner releases it once for each time the owner acquired it. Thus
when using a recursive lock, be sure to balance acquire operations with
release operations. The best way to do this is to lexically balance the
operations around single-entry single-exit blocks, as was shown for
ordinary locks. The recursive lock is most useful inside a recursive
function. In general, the recursive locks are slower than nonrecursive
locks. An example of recursive lock use is shown in Figure 4.10.
Recursive_Lock L
void recursiveFunction (int count) {
L->acquire()
if (count > 0) {
count = count - 1;
recursiveFunction(count);
}
L->release();
}
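For comparison, the same pattern can be expressed with the Pthreads API, which supports this behavior through the PTHREAD_MUTEX_RECURSIVE attribute; the sketch below simply mirrors the names used in Figure 4.10:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t L;

void recursiveFunction(int count)
{
    pthread_mutex_lock(&L);        /* re-acquiring our own lock is legal */
    if (count > 0)
        recursiveFunction(count - 1);
    pthread_mutex_unlock(&L);      /* one release for every acquire */
}

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&L, &attr);
    recursiveFunction(3);
    printf("done\n");
    return 0;
}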
efficiently for those instances where multiple threads need to read shared
data simultaneously but do not necessarily need to perform a write
operation. For lengthy shared data, it is sometimes better to break the
data into smaller segments and operate multiple read-write locks on
the dataset rather than having a data lock for a longer period of time.
Spin Locks. Spin locks are non-blocking locks owned by a thread.
Waiting threads must spin, that is, poll the state of a lock rather than
get blocked. Spin locks are used mostly on multiprocessor systems. This
is because while the thread spins in a single-core processor system, no
processor resources are available to run the other thread that will release
the lock. The appropriate condition for using spin locks is whenever the
hold time of a lock is less than the time of blocking and waking up a
thread. The change of control for threads involves context switching of
threads and updating thread data structures, which could require more
instruction cycles than spin locks. The spin time of spin locks should be
limited to about 50 to 100 percent of a thread context switch (Kleiman
1996) and should not be held during calls to other subsystems. Improper
use of spin locks might cause thread starvation. Think carefully before
using this locking mechanism. The thread starvation problem of spin
locks can be alleviated by using a queuing technique, where every
waiting thread spins on a separate local flag in memory, using a First-In,
First-Out (FIFO) queue construct.
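To make the spin behavior concrete, here is a minimal test-and-set spin lock sketched with GCC atomic builtins; a production lock would add back-off and the queuing technique just mentioned:

/* Minimal test-and-set spin lock built on GCC atomic builtins. */
typedef struct { volatile int locked; } spinlock_t;

void spin_acquire(spinlock_t *s)
{
    /* Atomically set locked to 1; if it was already 1, poll until it is free. */
    while (__sync_lock_test_and_set(&s->locked, 1))
        while (s->locked)
            ;   /* spin (poll) rather than block */
}

void spin_release(spinlock_t *s)
{
    __sync_lock_release(&s->locked);   /* reset to 0 with release semantics */
}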
Condition Variables
Condition variables are also based on Dijkstras semaphore semantics,
with the exception that no stored value is associated with the operation.
This means condition variables do not contain the actual condition to
test; a shared data state is used instead to maintain the condition for
threads. A thread waits or wakes up other cooperative threads until a
condition is satisfied. Condition variables are preferable to locks
when polling would otherwise be required and some scheduling behavior among
threads is needed. To operate on shared data, a condition variable C uses a lock, L.
Three basic atomic operations are performed on a condition variable C:
wait(L): Atomically releases the lock and waits; when wait
returns, the lock has been acquired again
signal(L): Enables one of the waiting threads to run; when signal
returns, the lock is still acquired
broadcast(L): Enables all of the waiting threads to run; when
broadcast returns, the lock is still acquired
Condition C;
Lock L;
Bool LC = false;
void producer() {
while (1) {
L ->acquire();
// start critical section
while (LC == true) {
C ->wait(L);
}
// produce the next data
LC = true;
C ->signal(L);
// end critical section
L ->release();
}
}
void consumer() {
while (1) {
L ->acquire();
// start critical section
while (LC == false) {
C ->wait(L);
}
// consume the next data
LC = false;
// end critical section
L ->release();
}
}
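A minimal sketch of the same protocol expressed with the Pthreads condition-variable API follows; the flag LC plays the role of the shared data state described above, and the while loops guard against spurious wakeups:

#include <pthread.h>

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  C = PTHREAD_COND_INITIALIZER;
static int LC = 0;                  /* 0: slot empty, 1: slot full */

void produce_one(void)
{
    pthread_mutex_lock(&L);
    while (LC == 1)                 /* wait until the consumer empties the slot */
        pthread_cond_wait(&C, &L);
    /* ... produce the next data ... */
    LC = 1;
    pthread_cond_signal(&C);
    pthread_mutex_unlock(&L);
}

void consume_one(void)
{
    pthread_mutex_lock(&L);
    while (LC == 0)                 /* wait until the producer fills the slot */
        pthread_cond_wait(&C, &L);
    /* ... consume the next data ... */
    LC = 0;
    pthread_cond_signal(&C);
    pthread_mutex_unlock(&L);
}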
Monitors
For structured synchronization, a higher level construct is introduced
for simplifying the use of condition variables and locks, known as a
monitor. The purpose of the monitor is to simplify the complexity of
primitive synchronization operations and remove the implementation
details from application developers. The compiler for the language that
supports monitors automatically inserts lock operations at the
beginning and the end of each synchronization-aware routine. Most
recent programming languages do not support monitors explicitly;
rather, they expose lock and unlock operations to the developers. The
Java language supports explicit monitor objects along with
synchronized blocks inside a method. In Java, the monitor is maintained
by the synchronized constructs, such as
synchronized (object) {
<Critical Section>
}
Messages
The message is a special method of communication to transfer
information or a signal from one domain to another. The definition of
domain is different for different scenarios. For multi-threading
environments, the domain is referred to as the boundary of a thread.
The three Ms of message passing are multi-granularity,
multithreading, and multitasking (Ang 1996). In general, the
conceptual representations of messages get associated with processes
rather than threads. From a message-sharing perspective, messages get
shared using an intra-process, inter-process, or process-process
approach, as shown in Figure 4.12.
Two threads that communicate with messages and reside in the same
process use intra-process messaging. Two threads that communicate and
reside in different processes use inter-process messaging. From the
developers perspective, the most common form of messaging is the
process-process approach, when two processes communicate with each
other rather than depending on the thread.
In general, the messaging could be devised according to the memory
model of the environment where the messaging operation takes place.
Messaging for the shared memory model must be synchronous, whereas
for the distributed model messaging can be asynchronous. These
operations can be viewed from a somewhat different angle. When there is
nothing to do after sending the message and the sender has to wait for
the reply to come, the operations need to be synchronous, whereas if the
sender does not need to wait for the reply in order to proceed, then the
operation can be asynchronous.
The generic form of message communication can be represented as
follows:
Sender:
<sender sends message to one or more recipients
through structure>
\\ Here, structure can be either queue or port
<if shared environment>
   <wait for the acknowledgement>
<else>
   <sender does the next possible operation>
Receiver:
<might wait to get message from sender from
appropriate structure>
<receive message from appropriate structure and
process>
The generic form of message passing gives the impression to developers
that there must be some interface used to perform message passing. The
most common interface is the Message Passing Interface (MPI). MPI is
used as the medium of communication, as illustrated in Figure 4.13.
Figure 4.13: nodes Ni and Nj communicating through the MPI interface over a base network protocol.
Figure: use of hardware system components (microprocessor registers and the cache hierarchy L0, L1, L2 = LLC) according to the size of the messages.
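As a small illustration of what message passing through MPI looks like in practice (a sketch, not taken from the book's figures), two ranks can exchange a single integer as follows:

#include <mpi.h>
#include <stdio.h>

/* Rank 0 sends one integer to rank 1, which receives and prints it. */
int main(int argc, char *argv[])
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}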
Fence
The fence mechanism is implemented using instructions and in fact, most
of the languages and systems refer to this mechanism as a fence
instruction. On a shared memory multiprocessor or multi-core
environment, a fence instruction ensures consistent memory operations.
At execution time, the fence instruction guarantees completeness of all
pre-fence memory operations and halts all post-fence memory operations
until the completion of fence instruction cycles. This fence mechanism
ensures proper memory mapping from software to hardware memory
models, as shown in Figure 4.15. The semantics of the fence instruction
depend on the architecture. The software memory model implicitly
supports fence instructions. Using fence instructions explicitly could be
error-prone and it is better to rely on compiler technologies. Due to the
performance penalty of fence instructions, the number of memory fences
needs to be optimized.
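In C, the fence is usually reached through a compiler intrinsic or the standard atomics library rather than a hand-written instruction. A minimal C11 sketch of the publishing side of a flag protocol might look like this (the reading side would pair it with an acquire fence); the names here are illustrative only:

#include <stdatomic.h>

int data;                  /* payload published to other threads          */
atomic_int ready;          /* flag that tells readers the payload is set  */

void publish(int value)
{
    data = value;
    /* The fence keeps the store to data from being reordered
       past the store to the ready flag. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}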
Barrier
The barrier mechanism is a synchronization method by which threads
in the same set coordinate around a logical point in the control flow
of operations. Through this method, a thread in the set must wait for
all other threads in that set to complete before it can proceed to the
next execution step. The method guarantees that no thread proceeds
beyond a logical execution point until all threads have arrived at that
point. Barrier synchronization is one of the most common operations in
shared memory multiprocessor and multi-core environments. Because each
thread waits at a barrier control point in the execution flow, the
barrier synchronization wait function for the ith thread can be
represented as
(W_barrier)_i = f((T_barrier)_i, (R_thread)_i)
where W_barrier is the wait time for a thread, T_barrier is the number
of threads that have arrived at the barrier, and R_thread is the
arrival rate of the threads.
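As an illustration of barrier semantics, the following minimal sketch uses the POSIX pthread_barrier API; the API choice, thread count, and printed phases are assumptions for illustration and are not taken from the text.

#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4

pthread_barrier_t barrier;

void *Worker( void *arg )
{
    int id = *((int *) arg);
    printf( "thread %d: phase 1 done\n", id );
    pthread_barrier_wait( &barrier );   // no thread passes until all arrive
    printf( "thread %d: phase 2 starts\n", id );
    return NULL;
}

int main()
{
    pthread_t t[NUM_WORKERS];
    int id[NUM_WORKERS];
    int i;

    pthread_barrier_init( &barrier, NULL, NUM_WORKERS );
    for ( i = 0; i < NUM_WORKERS; i++ )
    {
        id[i] = i;
        pthread_create( &t[i], NULL, Worker, (void *) &id[i] );
    }
    for ( i = 0; i < NUM_WORKERS; i++ )
        pthread_join( t[i], NULL );
    pthread_barrier_destroy( &barrier );
    return 0;
}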
For performance consideration and to keep the wait time within a
reasonable timing window before hitting a performance penalty, special
Key Points
This chapter presented parallel programming constructs and later
chapters provide details about the implementation of the constructs. To
become proficient in threading techniques and face fewer issues during
design and development of a threaded solution, an understanding of the
theory behind different threading techniques is helpful. Here are some of
the points you should remember:
- For synchronization, an understanding of the atomic actions of
operations will help avoid deadlock and eliminate race conditions.
- Use a proper synchronization construct-based framework for
threaded applications.
- Use higher-level synchronization constructs over primitive types.
- An application must not contain any possibility of a deadlock
scenario.
- Threads can perform message passing using three different
approaches: intra-process, inter-process, and process-process.
- Understand the way threading features of third-party libraries are
implemented. Different implementations may cause applications
to fail in unexpected ways.
Chapter 5
Threading APIs
Previous chapters introduced the basic principles behind writing
concurrent applications. While describing every thread package
available today is beyond the scope of this book, it is important to
illustrate the aforementioned principles with practical examples. This
chapter will provide an overview of several popular thread packages
used by developers today.
Creating Threads
All processes start with a single thread of execution: the main thread. In
order to write multi-threaded code, one must have the ability to create
new threads. The most basic thread creation mechanism provided by
Microsoft is CreateThread():
HANDLE CreateThread(
LPSECURITY_ATTRIBUTES lpThreadAttributes,
SIZE_T dwStackSize,
LPTHREAD_START_ROUTINE lpStartAddress,
LPVOID lpParameter,
DWORD dwCreationFlags,
LPDWORD lpThreadId );
The first parameter, lpThreadAttributes, is a data structure that
specifies several different security parameters. It also defines whether or
not processes created from the current process (child processes) inherit
this handle. In other words, this parameter gives advanced control over
how the thread handle may be used in the system. If the programmer
does not need control over these attributes, the programmer may specify
NULL for this field.
The second parameter, dwStackSize, specifies the stack size of the
thread. The size is specified in bytes, and the value is rounded up to the
nearest page size.
// do something
...
// ready to exit thread
return 0; // will implicitly call ExitThread(0);
}
Note that in C++, calling ExitThread() will exit the thread before
any destructors/automatic variables are cleaned up. Thus, Microsoft
recommends that the program simply return from ThreadFunc()
rather than call ExitThread() explicitly.
The CreateThread() and ExitThread() functions provide a
flexible, easy-to-use mechanism for creating threads in Windows
applications. There's just one problem: CreateThread() does not
perform per-thread initialization of C runtime datablocks and variables.
Hence, you cannot use CreateThread() and ExitThread() in any
application that uses the C runtime library. Instead, Microsoft provides
two other functions, _beginthreadex() and _endthreadex(), that
perform the necessary initialization prior to calling CreateThread().
CreateThread() and ExitThread() are adequate for writing
applications that just use the Win32 API; however, for most cases, it is
recommended that developers use _beginthreadex() and
_endthreadex() to create threads.
The definition of _beginthreadex() is similar to that of
CreateThread(); the only difference being one of semantics.
unsigned long _beginthreadex( // unsigned long
// instead of HANDLE,
// but technically the
// same
void *security, // same as CreateThread()
unsigned stack_size, // same as CreateThread()
unsigned ( __stdcall *func )( void * ), // ptr to func
// returning unsigned
// instead of void
void *arglist, // same as CreateThread()
unsigned initflag, // same as CreateThread()
unsigned* threadID); // same as CreateThread()
Similarly, the definition of _endthreadex() follows that of
ExitThread():
void _endthreadex( unsigned retval );
Managing Threads
Now that you know how to create a thread, let's examine the process of
controlling or manipulating the execution of threads. It was previously
demonstrated that Windows allows developers to create threads in one
of two initial states: suspended or running. For the remainder of the
1 There are two types of threads that can be created using AfxBeginThread(): worker threads and user-interface threads. This text only considers worker threads.
31 ResetEvent(hTerminate);
32 break;
33 }
34 // we can do our work now...
35 // simulate the case that it takes 1 s
36 // to do the work the thread has to do
37 Sleep(1000);
38 }
39 _endthreadex(0);
40 return 0;
41 }
42
43
44 int main( int argc, char* argv[] )
45 {
46 unsigned int threadID[NUM_THREADS];
47 HANDLE hThread[NUM_THREADS];
48 ThreadArgs threadArgs[NUM_THREADS];
49
50 // Create 10 threads
51 for (int i = 0; i < NUM_THREADS; i++)
52 {
53 threadArgs[i].Id = i;
54 threadArgs[i].hTerminate = CreateEvent(NULL, TRUE,
55 FALSE, NULL);
56 hThread[i] = (HANDLE)_beginthreadex(NULL, 0,
57 &ThreadFunc, &threadArgs[i], 0, &threadID[i]);
58 }
59
60 printf("To kill a thread (gracefully), press 0-9, " \
61 "then <Enter>.\n");
62 printf("Press any other key to exit.\n");
63
64 while (1)
65 {
66 int c = getc(stdin);
67 if (c == '\n') continue;
68 if (c < '0' || c > '9') break;
69 SetEvent(threadArgs[c - '0'].hTerminate);
70 }
71
72 return 0;
73 }
2 WaitForXXX() may wait on events, jobs, mutexes, processes, semaphores, threads, and timers, among other objects.
3 The meaning of a signaled state varies based on the type of object being waited on. In the example in Figure 5.1, we wait on an Event object; hence, WAIT_OBJECT_0 is returned once SetEvent() sets the event's state to signaled.
4 If bWaitAll is set to FALSE, and if the number of objects in the signaled state happens to be greater than 1, the array index of the first signaled or abandoned value in the array, starting at array index 0, is returned.
Listing 5.2 Computing the Index of the Event that Has Been Signaled while Waiting
on Multiple Objects
Now that the thread has a mechanism for waiting for a particular
event to occur, we need a mechanism to signal the thread when it is time
to terminate. Microsoft provides the SetEvent() call for this purpose.
SetEvent() sets an event object to the signaled state. This allows a
thread to notify another thread that the event has occurred. SetEvent()
has the following signature:
BOOL SetEvent( HANDLE hEvent );
SetEvent() takes a single parameter which is the HANDLE value of
the specific event object, and returns TRUE if the event was signaled
successfully. The handle to the event object must be modifiable; in
other words, the access rights for the handle must have the
EVENT_MODIFY_STATE field set.
In the case of a manual reset event, the programmer must return the
event object to the non-signaled state. To do this, a programmer uses the
ResetEvent() function. The ResetEvent() function has the following
prototype:
BOOL ResetEvent( HANDLE hEvent );
ResetEvent() accepts as a parameter the handle to reset and
returns TRUE upon success. Like SetEvent(), the handle to the event
object must have the appropriate access rights set; otherwise, the call to
ResetEvent() will fail.
It is important to contrast the example program in Listing 5.1 with the
case where the TerminateThread() function is used to terminate a
thread. TerminateThread() fails to give the thread any chance of a
graceful exit; the thread is terminated immediately and without any
chance to properly free any resources it may have acquired. It is
recommended that you use a notification mechanism such as the one
defined above to give the thread a chance to do proper cleanup.
Thread Synchronization
Generally speaking, creating a thread is a relatively simple task, and one
that does not consume the bulk of the development time when writing a
multi-threaded application. The challenge in writing a multi-threaded
application lies in making sure that in a chaotic, unpredictable, real-world
runtime environment threads act in an orderly, well-known manner,
avoiding such nasty conditions as deadlock and data corruption caused
by race conditions. The example in Figure 5.1 showed one Windows
mechanism for coordinating the actions of multiple threads: events. This
section will look at the different object types Microsoft provides for
sharing data among threads.
Microsoft defines several different types of synchronization objects as
part of the Win32 API. These include events, semaphores, mutexes, and
critical sections. In addition, the Wait methods allow the developer to
wait on thread and process handles, which may be used to wait for
thread and process termination. Finally, atomic access to variables and
linked lists can be achieved through the use of interlocked functions.
Before we discuss the different data structures provided by Windows,
let's review a few of the basic concepts that are used to synchronize
concurrent access requests to shared resources. The critical section is
the block of code that can only be accessed by a certain number of
threads at a single time. In most cases, only one thread may be executing
in a critical section at one time. A semaphore is a data structure that
limits access of a particular critical section to a certain number of
threads. A mutex is a special case of a semaphore that grants exclusive
access of the critical section to only one thread. With these basic
5 Microsoft defines an additional function for signaling events: PulseEvent(). PulseEvent() combines the functionality of SetEvent() with ResetEvent(). It is not covered in this text, other than in this footnote, as Microsoft's documentation indicates that the function is unreliable and should not be used.
1 HANDLE hSemaphore;
2 DWORD status;
3
4 // Create a binary semaphore that is unlocked
5 // We don't care about the name in this case
6 hSemaphore = CreateSemaphore(NULL, 1, 1, NULL);
7
8 // verify semaphore is valid
9 if (NULL == hSemaphore)
10 {
11 // Handle error
12 ;
13 }
14
15 ...
16
17 // We are now testing our critical section
18 status = WaitForSingleObject(hSemaphore, 0);
19
20 if (status != WAIT_OBJECT_0)
21 {
22 // cannot enter critical section - handle appropriately
23 }
24 else
25 {
26 // enter critical section
27 // time to exit critical section
28 status = ReleaseSemaphore(hSemaphore, 1, NULL);
29 if (!status)
30 {
31 // release failed, recover from error here
32 }
33 }
6 If the increment value were to cause the semaphore's count to exceed the maximum count, the count will remain unchanged, and the function will return FALSE, indicating an error condition. Always check return values for error conditions!
1 HANDLE hMutex;
2 DWORD status;
3
4 // Create a mutex
5 // Note that there aren't count parameters
6 // A mutex only allows a single thread to be executing
7 // in the critical section
8 // The second parameter indicates whether or not
9 // the thread that creates the mutex will automatically
10 // acquire the mutex. In our case it won't
11 // We don't care about the name in this case
12 hMutex = CreateMutex(NULL, FALSE, NULL);
13 if (NULL == hMutex) // verify mutex is valid
14 {
15 // handle error here
16 }
17
18 ...
19
20 // We are now testing our critical section
21 status = WaitForSingleObject(hMutex, 0);
22
23 if (status != WAIT_OBJECT_0)
24 {
25 // cannot enter critical section - handle appropriately
26 }
27 else
28 {
29 // enter critical section
30 // do some work
31
32 ...
33
34 // time to exit critical section
35 status = ReleaseMutex(hMutex);
36 if (!status)
37 {
38 // release failed, recover from error here
39 }
40 }
There's one important point to note regarding both the mutex
and semaphore objects. These objects are kernel objects, and can be
used to synchronize access between process boundaries. This ability
comes at a price; in order to acquire a semaphore, a call to the kernel
must be made. As a result, acquiring a semaphore or mutex incurs
overhead, which may hurt the performance of certain applications. In
the case that the programmer wants to synchronize access to a group
of threads in a single process, the programmer may use the
CRITICAL_SECTION data structure. This object will run in user space,
and does not incur the performance penalty of transferring control to the
kernel to acquire a lock.
The semantics of using CRITICAL_SECTION objects are different
from those of mutex and semaphore objects. The CRITICAL_SECTION
API defines a number of functions that operate on CRITICAL_SECTION
objects:
void InitializeCriticalSection( LPCRITICAL_SECTION lpCS );
BOOL InitializeCriticalSectionAndSpinCount(
LPCRITICAL_SECTION lpCS,
DWORD dwSpinCount );
void EnterCriticalSection( LPCRITICAL_SECTION lpCS );
BOOL TryEnterCriticalSection( LPCRITICAL_SECTION lpCS );
void LeaveCriticalSection( LPCRITICAL_SECTION lpCS );
DWORD SetCriticalSectionSpinCount( LPCRITICAL_SECTION lpCS,
DWORD dwSpinCount );
void DeleteCriticalSection( LPCRITICAL_SECTION lpCS );
EnterCriticalSection() blocks on a critical section object when
it is not available. The non-blocking form of this operation is
TryEnterCriticalSection().
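A minimal usage sketch, assuming a shared counter as the protected data (the counter and function names are illustrative, not from the text):

#include <windows.h>

CRITICAL_SECTION cs;
LONG sharedCounter = 0;

void Setup()
{
    InitializeCriticalSection( &cs );   // once, before any thread uses it
}

void IncrementCounter()
{
    EnterCriticalSection( &cs );        // blocks if another thread holds the lock
    sharedCounter++;                    // protected update
    LeaveCriticalSection( &cs );
}

void Teardown()
{
    DeleteCriticalSection( &cs );       // once, after all threads are done
}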
Atomic Operations
Acquiring mutexes and other locking primitives can be very
expensive. Many modern computer architectures support special
instructions that allow programmers to quickly perform common
atomic operations without the overhead of acquiring a lock. Microsoft
supports the operations through the use of the Interlocked API.
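A minimal sketch of the Interlocked functions, assuming a shared counter and state word (the variable and function names are illustrative):

#include <windows.h>

volatile LONG counter = 0;
volatile LONG state = 0;

void OnEvent()
{
    InterlockedIncrement( &counter );          // atomic ++counter
}

void OnEventDone()
{
    InterlockedDecrement( &counter );          // atomic --counter
}

LONG PublishState( LONG newState )
{
    // atomically store newState and return the previous value
    return InterlockedExchange( &state, newState );
}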
Thread Pools
In certain applications, the developer may need to dynamically allocate a
number of threads to perform some task. The number of threads may
vary greatly, depending on variables that are completely out of the
developer's control. For example, in a Web server application, there may
be times where the server is sitting idle, with no work to be done.
During other times, the server may be handling thousands of requests at
any given time. One approach to handling this scenario in software
would be dynamic thread creation. As the system starts receiving
more and more work, the programmer would create new threads to
handle incoming requests. When the system slows down, the
programmer may decide to destroy a number of the threads created
during peak load, as there isn't any work to be done and the threads are
occupying valuable system resources.
A couple of problems are associated with dynamic thread creation.
First, thread creation can be an expensive operation. During peak traffic,
a Web server will spend more time creating threads than it will spend
actually responding to user requests. To overcome that limitation, the
developer may decide to create a group of threads when the application
starts. These threads would be ready to handle requests as they come in.
This certainly helps solve the overhead problem, but other problems still
remain. What is the optimal number of threads that should be created?
How can these threads be scheduled optimally based on current system
load? At the application level, most developers don't have visibility into
these parameters, and as a result, it makes sense for the operating system
to provide some support for the notion of a thread pool.
Beginning with Windows 2000, Microsoft started providing a thread
pool API that greatly reduces the amount of code that the developer
needs to write to implement a thread pool. The principal function for
using the thread pool is QueueUserWorkItem():
BOOL QueueUserWorkItem ( LPTHREAD_START_ROUTINE Function,
PVOID Context,
ULONG Flags );
The first two parameters are of the kind you've seen before in creating
Windows threads. The routine Function() is a pointer to a function that
represents the work the thread in the pool must perform. This function
must have the form:
DWORD WINAPI Function( LPVOID parameter );
The return value is the threads exit code, which can be obtained by
calling GetExitCodeThread(). The parameter argument contains a
pointer to void. This construct is a generic way of allowing a program to
pass a single parameter or a structure containing multiple parameters.
Simply cast this parameter within the Function routine to point to the
desired data type. The Flags parameter will be examined shortly.
When QueueUserWorkItem() is called for the first time, Windows
creates a thread pool. One of these threads will be assigned to Function.
When it completes, the thread is returned to the pool, where it awaits a
new assignment. Because Windows relies on this process, Function()
must not make any calls that terminate the thread. If no threads are
available when QueueUserWorkItem() is called, Windows has the
option of expanding the number of threads in the pool by creating
additional threads. The size of the thread pool is dynamic and under the
control of Windows, whose internal algorithms determine the best way
to handle the current thread workload.
If you know the work you're assigning will take a long time to
complete, you can pass WT_EXECUTELONGFUNCTION as the third
parameter in the call to QueueUserWorkItem(). This option helps the
thread pool management functions determine how to allocate threads. If
all threads are busy when a call is made with this flag set, a new thread is
automatically created.
Threads in Windows thread pools come in two types: those that
handle asynchronous I/O and those that dont. The former rely on I/O
completion ports, a Windows kernel entity that enables threads to be
associated with I/O on specific system resources. How to handle I/O
with completion ports is a complex process that is primarily the
province of server applications. A thorough discussion of I/O completion
ports may be found in Programming Applications for Microsoft
Windows (Richter 1999).
When calling QueueUserWorkItem(), you should identify which
threads are performing I/O and which ones are not by setting the
WT_EXECUTEDEFAULT flag in the QueueUserWorkItem() Flags
parameter. This tells the thread pool that the thread does not perform
asynchronous I/O and it should be managed accordingly. Threads that do
perform asynchronous I/O should use the WT_EXECUTEINIOTHREAD flag.
When using many threads and functional decomposition, consider
using the thread pool API to save some programming effort and to allow
Windows the best possible opportunities to achieve maximum
performance.
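A minimal sketch of queuing a work item, assuming a trivial work routine and a crude Sleep() to let it run before the process exits (both are illustrative choices, not from the text):

#include <windows.h>
#include <stdio.h>

DWORD WINAPI DoWork( LPVOID parameter )
{
    printf( "Worker says: %s\n", (const char *) parameter );
    return 0;                           // the thread returns to the pool
}

int main()
{
    static const char msg[] = "hello from the pool";

    if ( !QueueUserWorkItem( DoWork, (PVOID) msg, WT_EXECUTEDEFAULT ) )
    {
        printf( "QueueUserWorkItem failed: %lu\n", GetLastError() );
        return 1;
    }
    Sleep( 1000 );   // crude wait so the work item can run before exit
    return 0;
}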
Thread Priority
All operating systems that support threads use a priority scheme to
determine how threads should be allocated time to run on a particular
core processor. This enables important work to proceed while lesser
tasks wait for processing resources to become available. Every operating
system has a different way of handling priorities. Much of the time,
priorities are of no great concern; however, every once in a while it is
important to know how a particular thread will run in the context of
competing threads.
Windows uses a scheme in which threads have priorities that range
from 0 (lowest priority) to 31 (highest priority). The Windows scheduler
always schedules the highest priority threads first. This means that
higher-priority threads could hog the system, causing lower-priority
threads to starve, if it wasn't for priority boosts. Windows can
dynamically boost a threads priority to avoid thread starvation. Windows
automatically does this when a thread is brought to the foreground, a
window receives a message such as a mouse input, or a blocking
condition (event) is released. Priority boosts can somewhat be controlled
by the user via the following four functions:
SetProcessPriorityBoost( HANDLE hProc, BOOL disable )
SetThreadPriorityBoost( HANDLE hThread, BOOL disable )
GetProcessPriorityBoost( HANDLE hProc, PBOOL disable )
GetThreadPriorityBoost( HANDLE hThread, PBOOL disable )
All threads are created, by default, with their priority set to normal.
After creation, a threads priority is changed using this function:
BOOL SetThreadPriority( HANDLE threadHandle,
int newPriority );
The possible values for newPriority are specified in Table 5.1,
which lists the priorities in descending order. The values are self-
explanatory.
Table 5.1 Thread Priority Values (listed from highest to lowest)
THREAD_PRIORITY_TIME_CRITICAL
THREAD_PRIORITY_HIGHEST
THREAD_PRIORITY_ABOVE_NORMAL
THREAD_PRIORITY_NORMAL
THREAD_PRIORITY_BELOW_NORMAL
THREAD_PRIORITY_LOWEST
THREAD_PRIORITY_IDLE
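A minimal sketch of changing the current thread's priority (the chosen level is an illustrative assumption):

#include <windows.h>
#include <stdio.h>

int main()
{
    HANDLE hThread = GetCurrentThread();   // pseudo-handle; no CloseHandle needed

    if ( !SetThreadPriority( hThread, THREAD_PRIORITY_ABOVE_NORMAL ) )
        printf( "SetThreadPriority failed: %lu\n", GetLastError() );

    printf( "Current priority value: %d\n",
            GetThreadPriority( hThread ) );
    return 0;
}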
Processor Affinity
When a thread is scheduled for execution, Windows chooses which
processor should run it. The policy it uses in selecting the processor is
called soft affinity. This policy suggests to the Windows scheduler that
#include <windows.h>
#include <stdio.h>
void main()
{
SYSTEM_INFO sysInfo;
method of scheduling multiple tasks that are known to not need parallel
execution.
The first step in using Windows fibers is to convert the current
thread into a fiber. Once this is done, additional fibers can be added. So,
the following function is the first one to call:
PVOID ConvertThreadToFiber( PVOID parameters );
This function returns the address of the fiber's internal data area, which
contains housekeeping items. This address should be saved. Later on, when
you switch fibers, the address of this data area will be needed. The sole
parameter to this function is a pointer to arguments for this fiber. It seems a bit
strange for a thread to pass arguments to itself. However, this parameter can be
retrieved from the fiber's internal data area using the function:
PVOID GetFiberData();
There is no point in converting a thread into a fiber unless you plan
to run multiple fibers on the thread. So, once you've converted the
thread into a fiber, you should next create the other fibers you plan to
run. The function to do this is:
PVOID CreateFiber ( DWORD fiberStackSize,
PFIBER_START_ROUTINE fiberProc,
PVOID fiberProcParameters );
The first parameter specifies how large the stack for the fiber should
be. Normally, a value of 0 is passed here. Passing a 0 will cause
Windows to allocate two pages of storage and to limit the stack size
to the default 1 MB. The next two parameters should look familiar
from thread-creation functions you have previously seen. The first is
a pointer to a fiber function; the second is a pointer to the
parameters of that function. Note that unlike a thread function this
fiber function does not return a value. This function has the form:
VOID WINAPI fiberProc( PVOID fiberProcParameters );
An important characteristic of the fiber function is that it must not exit.
Remember that when using threads and the thread function exits, the
thread is terminated. However, with fibers, the effect is more dramatic:
the thread and all the fibers associated with it are terminated.
Again, it's important to save the address returned by CreateFiber()
because it is used in the following function to switch among the fibers:
VOID SwitchToFiber( PVOID addressOfFiberEnvironment );
The sole parameter to this function is the address returned by
CreateFiber() or ConvertThreadToFiber(). Switching to a fiber is
the only way to activate a fiber. You can switch anytime you desire to.
You basically receive total control over scheduling in exchange for the
fact that only one fiber at a time can run on a thread. Only a fiber can
switch to another fiber. This explains why you must convert the original
thread into a fiber at the start of this process.
The function to delete a fiber is:
VOID DeleteFiber( PVOID addressOfFiberEnvironment );
A fiber can kill itself this way. However, when it does so, it kills the
current thread and all fibers associated with it.
A final function that is useful is
PVOID GetCurrentFiber();
which returns the address of the fiber environment of the currently
executing fiber.
Listing 5.6 shows the code for a program that creates some fibers and
has them print their identification.
#define _WIN32_WINNT 0x400
#include <stdio.h>
#include <windows.h>
#define FIBER_COUNT 10
void *fiber_context[FIBER_COUNT];
void main()
{
int i;
int fibers[FIBER_COUNT];
if ( fiber_context[i] != NULL )
printf ( "fiber %d created\n", i );
}
Listing 5.6 Program to Create Fibers that Print an Identifying Message to the Console
Notice the #defined manifest constant at the very start of the listing.
Fibers were introduced in Windows NT 4.0. The value of 0x400 in:
#define _WIN32_WINNT 0x400
tells the compiler to include features in windows.h that appeared in
Microsoft Windows NT 4.0 and later; hence, it includes support for
function calls used by the fiber APIs. Failing to include the constant will
result in compilation errors. The output from this program is:
fiber 1 created
fiber 2 created
fiber 3 created
fiber 4 created
fiber 5 created
fiber 6 created
fiber 7 created
fiber 8 created
fiber 9 created
Hello from fiber 1
Hello from fiber 2
Hello from fiber 3
Hello from fiber 4
Hello from fiber 5
Hello from fiber 6
7 One library is designed for single-threaded applications; it is not re-entrant. The other library is designed for multi-threaded applications; it is re-entrant.
The libraries for the C++ runtime are different, as shown in Table 5.4.
8 Note that this is the second time that using the standard C library routines has introduced an additional level of complexity to using threads in Windows; the first was when using the CreateThread() call. In general, Microsoft encourages the use of the Win32 API over the standard C library, for instance, CreateFile() instead of fopen(). Using the Win32 API exclusively will simplify writing Windows-based multi-threaded applications.
Creating Threads
On the whole, .NET APIs tend to be somewhat leaner than their Win32
counterparts. This is especially visible in the call for creating a new
thread:
using System.Threading;
. . .
Thread t = new Thread( new ThreadStart( ThreadFunc ));
Listing 5.7 illustrates a simple creation of a thread and the call to the
ThreadFunc.
1 using System;
2 using System.Threading;
3
4 public class ThreadDemo1
5 {
6 public static void ThreadFunc()
7 {
8 for ( int i = 0; i < 3; i++ )
9 Console.WriteLine(
10 "Hello #{0} from ThreadFunc", i );
11 Thread.Sleep( 10000 );
12 }
13
14 // The main entry point for the application.
15 public static void Main()
16 {
17 Thread t =
18 new Thread( new ThreadStart( ThreadFunc ));
19 t.Start();
20 Thread.Sleep( 40 );
21
22 for ( int j = 0; j < 4; j++ )
23 {
24 Console.WriteLine( "Hello from Main Thread" );
25 Thread.Sleep( 0 );
26 }
27 }
28 }
Managing Threads
The simplest and safest way to terminate a thread is to exit it. Doing so
permits the CLR to perform the necessary cleanup without any difficulty.
At times, however, it's necessary to terminate some other thread. As part
of the .NET threading API, an Abort() method is supplied for this
purpose. A call to Abort() generates a ThreadAbortException in the
target thread, which is where any code to handle an abort signal in the
middle of an unfinished operation should go. In addition, the call will
execute any code in the aborted thread's finally blocks. Listing 5.8
shows the calls.
1 using System;
2 using System.Threading;
3
4 public class ThreadAbortExample
5 {
6 public static void Thread2()
7 {
8 try
9 {
10 Console.WriteLine( "starting t2" );
11 Thread.Sleep( 500 );
12 Console.WriteLine( "finishing t2" );
13 }
14 catch( ThreadAbortException e )
15 {
16 Console.WriteLine( "in t2\'s catch block");
17 }
18 finally
19 {
20 Console.WriteLine( "in t2\'s finally" );
21 }
22 }
23
24 public static void Main()
25 {
26 Thread t = new Thread( new ThreadStart(Thread2) );
27 Console.WriteLine( "starting main thread" );
28 t.Start();
29 Thread.Sleep( 500 );
30 t.Abort();
31 t.Join();
32 Console.WriteLine( "main thread finished.\n" +
33 "Press <Enter> to exit" );
34 Console.ReadLine();
35 }
36 }
Waiting on a Thread
Threads often need to wait on each other. This concept is presented in
the Win32 APIs as waiting on an event. The .NET Framework borrows
the model used by Pthreads, the API employed in Linux and several
versions of UNIX. There the concept is known as joining a thread, and
it simply means waiting for that thread to finish. Line 31 of Listing 5.8
shows how this method is called. In that program, the main thread
creates and aborts Thread2 then joins it. This is the preferred way of
knowing with certainty that a given thread has aborted.
It is important to note that the thread calling Join() blocks until the
joined thread exits. In some circumstances, this might not be desirable.
In such cases, Join() can be called with a 32-bit integer parameter,
which indicates the maximum number of milliseconds to wait for the
joined thread to complete. When called this way, Join() returns the
Boolean value true if the thread terminated, and false if the thread
did not terminate and the return occurred because the maximum wait
expired.
Thread Pools
The creation of a new thread is an expensive operation. A lot of system-
level activity is generated to create the new thread of execution, create
thread-local storage, and set up the system structures to manage the
thread. As a result of this overhead, conservation of created threads is a
recommended practice. The effect on performance, especially on slower
machines, can be compelling.
1 using System;
2 using System.Threading;
3 public class ThreadPoolExample
4 {
5 public static void Main()
6 {
7 // Queue a piece of work
8 ThreadPool.QueueUserWorkItem(
9 new WaitCallback( WorkToDo ));
10
11 Console.WriteLine( "Greetings from Main()" );
12 Thread.Sleep( 1000 );
13
14 Console.WriteLine( "Main thread exiting...\n" +
15 "Press <enter> to close" );
16 Console.ReadLine();
17 }
18
19 // This thread procedure performs the task.
20 static void WorkToDo( Object dataItems )
21 {
22 Console.WriteLine("Greetings from thread pool");
23 }
24 }
The thread pool is created by the first call to the work queue, which
occurs in line 8. As in thread creation, it is passed a delegate, which, in this
case, points to the method defined in lines 20-23. As can be seen from the
signature on line 20, there is an overloaded version that permits an
object to be passed to the work procedure. Frequently, this data object
contains state information about the status of the application when the
work was queued, but it can, in fact, contain any data object.
Notice the call to Sleep() on line 12. It is necessary for successful
completion of this program. Without this statement, the program could
exit without the work queue ever having completed its work. Because
the work can be assigned to any available thread, the main thread has no
way to join any of the pool's threads, so it has no mechanism for waiting
until they complete. Of course, the threads in the pool can modify a data
item to indicate activity, but that is not a .NET-specific solution.
The output from this program is:
Greetings from Main()
Greetings from thread pool
Main thread exiting...
Press <enter> to close
In addition to being work engines that consume queued work items,
thread pools are effective means of assigning threads to wait on specific
events, such as waiting on network traffic and other asynchronous
events. The .NET Framework provides several methods for waiting. They
require registering a call-back function that is invoked when the waited-
for event occurs. One of the basic methods for registering a call-back and
waiting is RegisterWaitForSingleObject(), which enables you to
also specify a maximum wait period. The call-back function is called if
the event occurs or the wait period expires. Listing 5.10, which is
adapted from a Microsoft example, shows the necessary code.
1 using System;
2 using System.Threading;
3 // TaskInfo contains data that is passed to the callback
4 // method.
5 public class TaskInfo
6 {
7 public RegisteredWaitHandle Handle = null;
8 public string OtherInfo = "default";
9 }
10
11 public class Example
12 {
13 public static void Main( string[] args )
14 {
15
16 AutoResetEvent ev = new AutoResetEvent( false );
17
18 TaskInfo ti = new TaskInfo();
19 ti.OtherInfo = "First task";
20 ti.Handle =
21 ThreadPool.RegisterWaitForSingleObject(
22 ev,
23 new WaitOrTimerCallback( WaitProc ),
24 ti,
25 1000,
26 false );
27
28 // The main thread waits three seconds,
29 // to demonstrate the time-outs on the queued
30 // thread, and then signals.
31
32 Thread.Sleep( 3100 );
33 Console.WriteLine( "Main thread signals." );
34 ev.Set();
35
36 Thread.Sleep( 1000 );
37 Console.WriteLine( "Press <enter> to close." );
38 Console.ReadLine();
39 }
40
41 // The callback method executes when the registered
42 // wait times-out, or when the WaitHandle (in this
43 // case, AutoResetEvent) is signaled.
44
45 public static void WaitProc( object passedData,
46 bool timedOut )
47 {
48 TaskInfo ti = (TaskInfo) passedData;
49
50 string cause = "TIMED OUT";
51 if ( !timedOut )
52 {
53 cause = "SIGNALED";
54 if ( ti.Handle != null )
55 ti.Handle.Unregister( null );
56 }
57
58 Console.WriteLine(
59 "WaitProc({0}) on thread {1}; cause={2}",
60 ti.OtherInfo,
61 Thread.CurrentThread.GetHashCode().ToString(),
62 cause
63 );
64 }
65 }
The number of the thread on which the task is executed will vary
from system to system. As can be seen, while the main thread is waiting,
the 1-second duration expires three times, as expected. Then, the
callback function is called one more time when the signal is sent.
The .NET Framework enables threads to start up based on more than
a single event. The WaitHandle.WaitAll() and WaitHandle.WaitAny()
methods fire when all events in an array have been signaled, or when
any one event in an array is signaled, respectively. Events themselves do
not need to be automatic as in Listing 5.10; they can also be manual by
using ManualResetEvent(). The difference is that an automatic reset
will issue the signal and then reset itself so that it is not in the signaled
state, whereas a manual reset event persists in the signaled state until it is
manually reset. The choice between them depends entirely on the
applications needs.
As this section has illustrated, thread pools are a very useful
mechanism that enables sophisticated threading to be implemented
conveniently in many applications. The range of options regarding events
and the characteristics of signals give thread pools considerable
flexibility.
Thread Synchronization
The mechanisms for synchronizing thread actions in .NET are similar to
those found in all other threading APIs, such as Win32 and Pthreads.
They include capabilities for mutual exclusion and for atomic actions
on specific variables. By and large, .NET maintains the simplicity of
expression seen in the previous examples. No synchronization is simpler,
in fact, than use of the lock keyword in C#.
The usual way to use lock is to place it in front of a block of code
delimited by braces. Then, that block can be executed by only one
thread at a time. For example:
lock(this)
{
shared_var = other_shared_var + 1;
other_shared_var = 0;
}
The C# lock statement makes several calls to the .NET Framework.
The previous example is equivalent to the following snippet:
Monitor.Enter( this );
try
{
shared_var = other_shared_var + 1;
other_shared_var = 0;
}
finally
{
Monitor.Exit( this );
}
Monitor is a class that enforces mutual exclusion and locking in .NET.
When used as in the previous example, Monitor.Enter() locks a code
block. In this respect, it is similar to critical sections in the Win32 API.
Monitor can also be used to lock a data structure by passing that data
structure as a parameter to the Monitor.Enter() call. Monitor.Exit()
releases the lock. If Monitor.Enter() was called with an object,
Monitor.Exit() should be called with the same object to release the
lock. When Monitor.Enter() is called, the .NET Framework sets up two
queues: one containing references to threads waiting to obtain the lock
once its released, and another queue containing references to threads that
want to be signaled that the lock is available. When Monitor.Exit() is
called, the next thread in the first queue gets the lock.
Monitors have unusual aspects. For example, the Monitor.Wait()
method enables a thread to temporarily give up a lock to another thread and
then reclaim it. A system of signals called pulses is used to notify the
original thread that the lock has been released.
As you have learned, mutexes are a similar mechanism for providing
mutual exclusion to resources. Mutexes differ from monitors in that they
can be used with wait handles, as shown in the following example. They
also can be locked multiple times. In such a case, they must be unlocked
the same number of times before the lock is actually released.
To use a mutex, one must be created. Then a call to WaitOne is
issued to grab the lock as soon as it becomes available, if it's not already
available. Once the lock is no longer needed, it is made available with the
ReleaseMutex method.
private static Mutex mutx = new Mutex();
. . .
Thread.Sleep( 100 );
Atomic Actions
Actions are atomic if they can only be performed as a single indivisible
act. The term is commonly used in database operations to refer to a series
of steps that must all be completed. If any of them can't be completed, all
steps so far completed are rolled back, so that at no time does the
database record a partial series. It's all or nothing. Threads present similar
problems. Consider what happens if a thread is suspended while it is
updating the values of an important variable. Suddenly, the application or
the system can be left in a degenerate or corrupted state. One solution is
the Interlocked class. Although not discussed in the Win32 portion of
this chapter, the Win32 API does have corresponding APIs.
The three most common methods of the Interlocked class are:
Decrement, Increment, and Exchange. These are all simple methods to
use and should be used anytime a variable shared between threads is
being modified.
int intCounter = 0;
. . .
// Drop value to -1
Interlocked.Decrement( ref intCounter );
// Raise it back to 0
Interlocked.Increment( ref intCounter );
Several aspects are worthy of note. Firstly, the Interlocked class
uses references to the variables to be modified, not the variables
themselves; so make sure to include the ref keyword, as in the
9 It might come as a surprise to some readers that incrementing or decrementing a variable is not inherently an indivisible action. It takes three instructions: the variable is copied into a register in the processor core by a process called loading, incremented, and then copied from the register back to the variable's location in memory.
POSIX Threads
POSIX threads, or Pthreads, is a portable threading library designed with
the intent of providing a consistent programming interface across multiple
operating system platforms. Pthreads is now the standard threading
interface for Linux and is also widely used on most UNIX platforms. An
open-source version for Windows, called pthreads-win32, is available as
well. For more information on pthreads-win32, refer to References. If you
want to work in C and need a portable threading API that provides more
direct control than OpenMP, Pthreads is a good choice.
Most core Pthreads functions focus on thread creation and destruction,
synchronization, and a few miscellaneous operations. Capabilities such as
thread priorities are not part of the core Pthreads library; instead, they
are part of optional, vendor-specific extensions.
Creating Threads
The POSIX threads call to create a thread is pthread_create():
pthread_create (
&a_thread, // thread ID goes here
NULL, // thread attributes (NULL = none)
PrintThreads, // function name
(void *) msg ); // parameter
As in Windows, the third parameter represents a pointer to the function
called by the launched thread, while the fourth parameter is a pointer to a
void, which is used to pass arguments to the called function.
Listing 5.11 illustrates the usage of pthread_create() to create a
thread.
1 #include <stdio.h>
2 #include <stdlib.h>
3 #include <pthread.h>
4
5 void *PrintThreads ( void * );
6
7 #define NUM_THREADS 9
8
9 int main()
10 {
11 int i, ret;
12 pthread_t a_thread;
13
14 int thdNum [NUM_THREADS]; //thread numbers go here
15
16 for ( i = 0; i < NUM_THREADS; i++ )
17 thdNum[i] = i;
18
19 for ( i = 0; i < NUM_THREADS; i++ )
20 {
21 ret = pthread_create (
22 &a_thread,
23 NULL,
24 PrintThreads,
25 (void *) &thdNum[i] );
26
27 if ( ret == 0 )
28 printf ( "Thread launched successfully\n" );
29 }
30
31 printf ( "Press any key to exit..." );
32 i = getchar();
33 return ( 0 );
34 }
35
36 // Make the threads print out their thread number.
37
38 void *PrintThreads ( void *num )
39 {
40 int i;
41
42 for ( i = 0; i < 3; i++ )
43 printf ( "Thread number is %d\n",
44 *((int*)num));
45
46 return ( NULL );
47 }
Managing Threads
When a thread is created under Pthreads, developers have the option of
indicating the nature of that threads interaction with other threads. For
example,
pthread_detach( pthread_t thread_to_detach );
can be used to detach the thread from the other threads when it has no
need to interact with them. This option asserts that no other thread will
interact with this thread, and that the operating system is free to use
this information in managing the thread. The operating system uses this
information particularly at thread exit, when it knows that no return value
needs to be passed back to some other thread.
The complementary function,
pthread_join( pthread_t thread, void **ret_val );
tells the operating system to block the calling thread until the specified
thread exits. Attaching to a thread in this way is called joining, just as we
saw in the section on .NET threads. The function takes two parameters:
the pthread_t identifier of the thread being joined, and a pointer to a
pointer to void where the threads return value should be placed. If the
thread does not return a value, NULL can be passed as the second
parameter.
To wait on multiple threads, simply join all those threads. Listing 5.12
shows how this is done.
int main()
{
int i, ret;
{
ret = pthread_create (
&thdHandle[i],
NULL,
PrintThreads,
(void *) &thdNum[i] );
if ( ret == 0 )
printf ( "Thread launched successfully\n" );
}
One caveat should be noted: two threads cannot join the same
thread. Once a thread has been joined, no other threads can join it. To
have two or more threads wait on a thread's execution, other devices
such as those presented in the section on signaling can be used.
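A minimal, self-contained sketch of waiting on several threads by joining each one in turn (the worker function and thread count are illustrative assumptions, not the text's PrintThreads example):

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

void *Worker( void *arg )
{
    return arg;                     // returned value is retrieved by the join
}

int main()
{
    pthread_t thdHandle[NUM_THREADS];
    void *retVal;
    long i;

    for ( i = 0; i < NUM_THREADS; i++ )
        pthread_create( &thdHandle[i], NULL, Worker, (void *) i );

    for ( i = 0; i < NUM_THREADS; i++ )
    {
        pthread_join( thdHandle[i], &retVal );   // block until thread i exits
        printf( "thread %ld returned %ld\n", i, (long) retVal );
    }
    return 0;
}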
Thread Synchronization
The Pthreads library has mutexes that function similarly to those in
Win32 and .NET. Terminology and coding syntax, predictably, are
different; as are some details of implementation.
Whereas Windows refers to mutexes as being signaled, that is,
available or unlocked, Pthreads refers to mutexes by the more intuitive
terms locked and unlocked. Obviously, when a mutex is locked, the code
it's protecting is not accessible. The syntax of the Pthreads API calls
follows this nomenclature:
pthread_mutex_lock( &aMutex );
. . . code to be protected goes here . . .
pthread_mutex_unlock( &aMutex );
The sole parameter to both functions is the address of a previously
declared mutex object:
pthread_mutex_t aMutex = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_lock( &testMutex );
pthread_mutex_unlock( &testMutex );
return ( NULL );
}
Earlier in the program, at the global level, the following definition
appeared:
pthread_mutex_t testMutex = PTHREAD_MUTEX_INITIALIZER;
In the discussion of Win32 mutexes, we saw that calling
WaitForSingleObject(hMutex, 0) would test hMutex right away and
return. By examining the return value and comparing it to WAIT_TIMEOUT,
we can tell whether the mutex was locked. The Pthreads library has a
similar function, pthread_mutex_trylock(&mutex), which tests the mutex
to see whether it's locked and then returns. If it returns EBUSY, the mutex
is already locked. It's important to note that in both the Windows and
Pthreads versions of this function, if the mutex is unlocked, the call will
lock it. It therefore behooves you to check the return value, so as to avoid
inadvertently locking a mutex simply because you were trying to see
whether it was available. This test-and-lock behavior is intended for
situations where you would like to lock a mutex, but if the mutex is
already locked, you might want to perform other activities before testing
the mutex again.
Signaling
Many multi-threading programmers find the event model of
communication error prone. As a result, certain APIs exclude them. The
Pthreads model has no direct counterpart to the Windows concept of
events. Rather, two separate constructs can be used to achieve the same
ends. They are condition variables and the semaphore.
Condition Variables
A condition variable is a mechanism that is tightly bound to a mutex and
a data item. It is used when one or more threads are waiting for the value
of the data item to change. Rather than spinning, the threads block on
the condition variable and wait for it to be signaled by some other thread.
This signal notifies the waiting threads that the data item has changed
and enables the threads to begin or resume processing.
This works in a very mechanical way. The data item is declared, for
instance, with a flag that tells a consumer thread that the producer thread
has data ready for it, and that the data is protected by a mutex. The data
item and the mutex together are associated with a condition variable.
When the producer thread changes the flag, after unlocking and
relocking the mutex, it signals the condition variable, which announces
that the flag has changed value. This announcement can be sent
optionally to a single thread or broadcast to all threads blocking on the
condition variable. In addition to the announcement, the signal unblocks
the waiting thread or threads.
Listing 5.13 illustrates how this works by showing two threads
waiting on a condition variable. The listing is somewhat longer than the
others presented in this book, but it shows how to address a very typical
problem in programming with threads.
1 #include <stdio.h>
2 #include <stdlib.h>
3
4 #include <pthread.h>
5
6 #define BLOCK_SIZE 100
7 #define BUF_SIZE 1000000
8
9 size_t bytesRead;
10
11 typedef struct {
12 pthread_mutex_t mutex; // mutex
13 pthread_cond_t cv; // condition variable
14 int data; // data item used as a flag
15 } flag;
16
17 flag ourFlag = { // default initialization
18 PTHREAD_MUTEX_INITIALIZER,
19 PTHREAD_COND_INITIALIZER,
20 0 }; // data item set to 0
21
22 pthread_t hThread1, hThread2; // the waiting threads
23 void* PrintCountRead( void* ); // the thread function
24
25 int main( int argc, char *argv[] )
26 {
27 FILE *infile;
28 char *inbuf;
29 int status;
30
31 if ( argc != 2 )
32 {
33 printf( "Usage GetSetEvents filename\n" );
34 return( -1 );
35 }
36
37 infile = fopen( argv[1], "r+b" );
38 if ( infile == NULL )
39 {
40 printf( "Error opening %s\n", argv[1] );
41 return( -1 );
42 }
43
44 inbuf = (char*) malloc ( BUF_SIZE );
45 if ( inbuf == NULL )
46 {
47 printf( "Could not allocate read buffer\n" );
48 return( -1 );
49 }
50
51 // now start up two threads
52 pthread_create( &hThread1, NULL,
53 PrintCountRead, (void *) NULL );
54 pthread_create( &hThread2, NULL,
55 PrintCountRead, (void *) NULL );
56
57 bytesRead = fread( inbuf, 1, BLOCK_SIZE, infile );
58 if ( bytesRead < BLOCK_SIZE )
59 {
60 printf( "Need a file longer than %d bytes\n",
61 BLOCK_SIZE );
62 return( -1 );
63 }
64 else // now we tell the waiting thread(s)
65 {
66 // first, lock the mutex
67 status = pthread_mutex_lock( &ourFlag.mutex );
68 if ( status != 0 )
69 {
70 printf( "error locking mutex in main func.\n" );
71 exit( -1 );
72 }
73
74 ourFlag.data = 1; // change the data item
75 // then broadcast the change
76 status = pthread_cond_broadcast( &ourFlag.cv ) ;
77 if ( status != 0 )
78 {
79 printf( "error broadcasting condition var\n" );
80 exit( -1 );
81 }
82
83 // unlock the mutex
84 status = pthread_mutex_unlock( &ourFlag.mutex );
85 if ( status != 0 )
86 {
87 printf( "error unlocking mutex in waiting \
88 function\n" );
89 exit( -1 );
90 }
91 }
92
93 while ( !feof( infile ) &&
94 bytesRead < BUF_SIZE - BLOCK_SIZE )
95 bytesRead += fread(inbuf, 1, BLOCK_SIZE, infile );
96
97 printf("Read a total of %d bytes\n", (int)bytesRead);
98 return( 0 );
99 }
100
101 // the thread function, which waits on the
102 // condition variable
103 void *PrintCountRead( void* pv )
104 {
105 int status;
106
107 // lock the mutex
108 status = pthread_mutex_lock( &ourFlag.mutex );
109 if ( status != 0 )
110 {
111 printf( "error locking mutex in waiting func.\n" );
112 exit( -1 );
113 }
114
115 // now wait on the condition variable
116 // (loop should spin once only)
117 while ( ourFlag.data == 0 )
118 {
now proves false, and execution flows to the next statement. Here, the
flag's value is checked again (lines 128-130) and the dependent action,
printing the number of bytes read by the principal thread, is performed.
The mutex is then unlocked (lines 135-141) and the worker thread exits.
After starting up the two worker threads, which are both blocked
waiting for their condition variables to be signaled, the main thread reads
one buffer of data (line 57). When this read is successful, it signals the
worker threads that they can proceed. It does this by locking the mutex
and broadcasting the signal to all waiting threads via
pthread_cond_broadcast() (line 76). It then unlocks the mutex and
finishes reading the file. This routine could have instead used
pthread_cond_signal() to emit the signal. However, that call would
have signaled only one waiting thread, rather than all of them. Such an
option would be useful if several waiting threads are all waiting to do the
same thing, but the desired activity cannot be parallelized.
The program in Listing 5.13 generates the following output when run
on a file consisting of 435,676 bytes.
Condition was signaled. Main thread has read 002700 bytes
Condition was signaled. Main thread has read 011200 bytes
Read a total of 435676 bytes
You might be tempted to use condition variables without the
required mutex. This will lead to problems. Pthreads is designed to use a
mutex with condition variables, as can be seen in the parameters in
pthread_cond_wait(), which takes a pointer to the condition variable
and one to the mutex. In fact, without the mutex, the code will not
compile properly. The mutex is needed by the Pthreads architecture to
correctly record the occurrence of the signal used by the condition
variable.
The code in Listing 5.13 is typical of producer/consumer situations.
In those, typically, the program starts up a number of threads. The
producer threads, in this case the one reading the file, must generate
data or actions for the consumer or worker threads to process. Typically,
the consumer threads are all suspended pending a signal sent when there
is data to consume. .NET handles this situation via a thread pool;
however, Pthreads has no built-in thread pool mechanism.
Semaphores
The semaphore is comparable to the semaphores in the Win32 API,
described earlier. A semaphore is a counter that can have any nonnegative value.
Threads wait on a semaphore. When the semaphores value is 0, all
threads are forced to wait. When the value is nonzero, a waiting thread is
released to work. The thread that gets released is determined first by
thread priority, then by whoever attached to the semaphore first. When
a thread is released, that is, becomes unblocked, it decrements the value of
the semaphore. In typical constructs, the semaphore is set to 0
(blocking), which forces dependent threads to wait. Another thread
increments the semaphore; this process is known as posting. One
waiting thread is thereby released and in releasing, it decrements the
semaphore back to 0. This blocks all other threads still waiting on
the semaphore. This design makes the semaphore a convenient way to
tell a single waiting thread that it has work to be performed.
Technically speaking, Pthreads does not implement semaphores; they
are a part of a different POSIX specification. However, semaphores are
used in conjunction with Pthreads thread-management functionality, as
you shall see presently. Listing 5.14 illustrates the use of Pthreads with
semaphores. The program reads a file and signals another thread to print
the count of bytes read. Nonessential parts of the listing have been
removed.
1 #include <stdio.h>
2 #include <stdlib.h>
3
4 #include <pthread.h>
5 #include <semaphore.h>
6
7 #define BLOCK_SIZE 100
8 #define BUF_SIZE 1000000
9
10 size_t bytesRead;
11
12 sem_t sReadOccurred; // the semaphore we'll use
13 pthread_t hThread; // the waiting thread
14 void*
15 PrintCountRead( void* ); // the thread function
16
17 int main( int argc, char *argv[] )
18 {
19 . . . open the input file here. . .
20
21 // first initialize the semaphore
22 sem_init( &sReadOccurred, // address of the semaphore
23 0, // 0 = share only with threads in this program
24 0 ); // initial value. 0 = make threads wait
25
26 // now start up the thread
27 pthread_create(
28 &hThread,
29 NULL,
30 PrintCountRead,
31 (void *) NULL );
32
33 bytesRead = fread( inbuf, 1, BLOCK_SIZE, infile );
34 if ( bytesRead < BLOCK_SIZE )
35 {
36 printf( "Need a file longer than %d bytes\n",
37 BLOCK_SIZE );
38 return( -1 );
39 }
40 else
41 sem_post( &sReadOccurred ); // release the
42 // waiting threads
43
44 . . . finish reading file and print total bytes read. . .
45
46 return( 0 );
47 }
48
49 // the thread function, which waits for the event before
50 // proceeding
51 void *PrintCountRead( void* pv )
52 {
53 int i;
54
55 sem_wait( &sReadOccurred ); // wait on the semaphore
56 printf( "Have now read %06d bytes\n",
57 (int) bytesRead );
58 return( pv );
59 }
Key Points
This chapter provided an overview of two threading APIs: the Microsoft
Windows model, and the POSIX threads (Pthreads) model. When
developing applications based on these APIs, you should keep the
following points in mind:
- Multi-threaded applications targeting Microsoft Windows can be
written in either native or managed code.
- Since the CreateThread() function does not perform per-thread
initialization of C runtime datablocks and variables, you cannot
reliably use CreateThread() in any application that uses the C
runtime library. Use the _beginthreadex() function instead.
- Thread termination should be handled very carefully. Avoid using
functions such as TerminateThread().
Chapter 6
OpenMP: A Portable Solution for Threading
Lets take a closer look at the loop. First, the example uses work-sharing,
which is the general term that OpenMP uses to describe distributing
work across threads. When work-sharing is used with the for construct,
as shown in this example, the iterations of the loop are distributed
among multiple threads. The OpenMP implementation determines how
many threads to create and how best to manage them. All the
programmer needs to do is to tell OpenMP which loop should be
threaded. Programmers do not need to add code for creating, initializing,
managing, and killing threads in order to exploit parallelism; the OpenMP
compiler and runtime library take care of these and many other details
behind the scenes.
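As a minimal sketch of the idea (the arrays a, b, and c, the bound N, and the index k are assumed to be declared elsewhere), a loop such as the following is threaded simply by adding the pragma:

#pragma omp parallel for
for ( k = 0; k < N; k++ )
   c[k] = a[k] + b[k];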
In the current OpenMP specification Version 2.5, OpenMP places the
following five restrictions on which loops can be threaded:
The loop variable must be of type signed integer. Unsigned
integers will not work. Note: this restriction is to be removed in
the future OpenMP specification Version 3.0.
The comparison operation must be in the form loop_variable
<, <=, >, or >= loop_invariant_integer.
The third expression or increment portion of the for loop must
be either integer addition or integer subtraction and by a loop-
invariant value.
Loop-carried Dependence
Even if the loop meets all five loop criteria and the compiler threaded
the loop, it may still not work correctly, given the existence of data
dependencies that the compiler ignores due to the presence of OpenMP
pragmas. The theory of data dependence imposes two requirements
// This fails: iteration k needs the values produced by iteration k-1,
// a loop-carried dependence that the OpenMP pragma does not remove.
x[0] = 0;
y[0] = 1;
#pragma omp parallel for private(k)
for ( k = 1; k < 100; k++ ) {
   x[k] = y[k-1] + 1;
   y[k] = x[k-1] + 2;
}
Besides using the parallel for pragma, for the same example, you
can also use the parallel sections pragma to parallelize the original
loop that has loop-carried dependence for a dual-core processor system.
// Effective threading of a loop using parallel sections
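// (A sketch consistent with the fragments above: each half of the index
//  range runs in its own section, and the second half is seeded from the
//  closed forms x(k) = x(k-2) + 3 and y(k) = y(k-2) + 3 so that the two
//  halves are independent of each other.)
#pragma omp parallel sections private(k)
{
   {
      x[0] = 0;
      y[0] = 1;
      for ( k = 1; k < 49; k++ ) {
         x[k] = y[k-1] + 1;
         y[k] = x[k-1] + 2;
      }
   }
   #pragma omp section
   {
      x[49] = 74;   // derived from the equation x(k) = x(k-2) + 3
      y[49] = 74;   // derived from the equation y(k) = y(k-2) + 3
      for ( k = 50; k < 100; k++ ) {
         x[k] = y[k-1] + 1;
         y[k] = x[k-1] + 2;
      }
   }
}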
With this simple example, you can learn several effective methods
from the process of parallelizing a loop with loop-carried dependences.
Sometimes, a simple code restructuring or transformation, beyond simply
adding OpenMP pragmas, is necessary to get your code threaded so it can
take advantage of dual-core and multi-core processors.
Data-race Conditions
Data-race conditions that are mentioned in the previous chapters could
be due to output dependences, in which multiple threads attempt
to update the same memory location, or variable, after threading. In
general, the OpenMP C++ and Fortran compilers honor OpenMP pragmas or
directives when they encounter them during the compilation phase; however,
the compiler does not detect data-race conditions. Thus, a loop similar to
the following example, in which multiple threads update the variable x,
will lead to undesirable results. In such a situation, the code needs to be modified
via privatization or synchronized using mechanisms like mutexes. For
example, you can simply add the private(x) clause to the parallel
for pragma to eliminate the data-race condition on variable x for
this loop.
Each of the four clauses takes a list of variables, but their semantics are
all different. The private clause says that each variable in the list should
have a private copy made for each thread. This private copy is initialized
with its default value, using its default constructor where appropriate. For
example, the default value for variables of type int is 0. In OpenMP,
memory can be declared as private in the following three ways.
Use the private, firstprivate, lastprivate, or reduction
clause to specify variables that need to be private for each
thread.
Use the threadprivate pragma to specify the global variables
that need to be private for each thread.
Declare the variable inside the loop (really, inside the OpenMP
parallel region) without the static keyword. Because static
variables are statically allocated in a designated memory area by
the compiler and linker, they are not truly private like other
variables declared within a function, which are allocated within
the stack frame for the function.
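A compact sketch of the three approaches follows; the names x, array, do_work, n, k, and counter are used only for illustration and are not from the original listings.

// (1) A clause on the work-sharing construct.
#pragma omp parallel for private(x)
for ( k = 0; k < n; k++ ) {
   x = array[k];
   array[k] = do_work(x);
}

// (2) The threadprivate pragma applied to a global variable.
int counter;
#pragma omp threadprivate(counter)

// (3) A non-static variable declared inside the parallel region.
#pragma omp parallel
{
   int local_sum = 0;   // allocated on each thread's own stack
}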
The following loop fails to function correctly because the variable x is
shared; it needs to be private. The loop fails due to the loop-carried
output dependence on the variable x. By OpenMP's default sharing rule,
x is shared among all threads, so there is a data-race condition on x:
while one thread is reading x, another thread might be writing to it.
#pragma omp parallel for
for ( k = 0; k < 100; k++ ) {
x = array[k];
array[k] = do_work(x);
}
This problem can be fixed in either of the following two ways, which
both declare the variable x as private memory.
// This works. The variable x is specified as private.
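// (The body of this listing is missing from the extract; the intended fix
//  is reconstructed below from the loop shown earlier.)
#pragma omp parallel for private(x)
for ( k = 0; k < 100; k++ ) {
   x = array[k];
   array[k] = do_work(x);
}

// This also works. Declaring x inside the loop makes it private to each thread.
#pragma omp parallel for
for ( k = 0; k < 100; k++ ) {
   int x = array[k];
   array[k] = do_work(x);
}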
Every time you use OpenMP to parallelize a loop, you should carefully
examine all memory references, including the references made by called
functions.
For dynamic scheduling, the chunks are handed out on a first-come,
first-served basis, and the default chunk size is 1. Each time, a thread
grabs a number of iterations equal to the chunk size specified in the
schedule clause, except for the last chunk. After a thread has finished
executing the iterations given to it, it requests another set of chunk-size
iterations. This continues until all of the iterations are completed. The
last set of iterations may be less than the chunk size. For example, if
the chunk size is specified as 16 with the schedule(dynamic,16)
clause and the total number of iterations is 100, the partition would be
16,16,16,16,16,16,4 with a total of seven chunks.
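A sketch of such a clause follows; process() is only a placeholder for the per-iteration work.

// 100 iterations handed out first-come, first-served in chunks of 16,
// giving the partition 16,16,16,16,16,16,4 described above.
#pragma omp parallel for schedule(dynamic, 16)
for ( k = 0; k < 100; k++ )
   process(k);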
For guided scheduling, the partitioning of a loop is done based on the
following formula, with a start value of β_0 equal to the number of loop
iterations:

   π_k = ⌈ β_k / (2N) ⌉

where N is the number of threads, π_k denotes the size of the k-th chunk
(starting from the 0-th chunk), and β_k denotes the number of remaining
unscheduled loop iterations at the time the size of the k-th chunk is computed.
When π_k gets too small, the value gets clipped to the chunk size S
that is specified in the schedule(guided, chunk-size) clause. The
default chunk size is 1 if it is not specified in the schedule
clause. Hence, for guided scheduling, the way a loop is partitioned
depends on the number of threads (N), the number of iterations (β_0), and
the chunk size (S).
For example, given a loop with β_0 = 800, N = 2, and S = 80, the loop
partition is {200, 150, 113, 85, 80, 80, 80, 12}. When π_4 falls below
80, it gets clipped to 80. When the number of remaining unscheduled
iterations is smaller than S, the upper bound of the last chunk is trimmed
as necessary. The guided scheduling supported in the Intel
C++ and Fortran compilers is a compliant implementation of
the OpenMP Version 2.5 standard.
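As an illustration, with the same placeholder process() function, the partition described above would come from a loop such as the following.

#pragma omp parallel for schedule(guided, 80)
for ( k = 0; k < 800; k++ )
   process(k);   // chunk sizes shrink from 200 down to the 80-iteration floor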
With dynamic and guided scheduling mechanisms, you can tune your
application to deal with those situations where each iteration has variable
amounts of work or where some cores (or processors) are faster than
others. Typically, guided scheduling performs better than dynamic
scheduling due to less overhead associated with scheduling.
The runtime scheduling scheme is actually not a scheduling scheme
per se. When runtime is specified in the schedule clause, the OpenMP
runtime uses the scheduling scheme specified in the OMP_SCHEDULE
environment variable for this particular for loop. The format for the
OMP_SCHEDULE environment variable is schedule-type[,chunk-size].
For example:
export OMP_SCHEDULE=dynamic,16
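The sample code that the next paragraph analyzes is not part of this extract; a sketch consistent with its description, an array of 4-byte floats updated with a chunk size of 8, might look like the following.

float x[1000];
#pragma omp parallel for schedule(dynamic, 8)
for ( k = 0; k < 1000; k++ )
   x[k] = x[k] + 1.0f;   // each 8-element chunk spans only 32 bytes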
Assume you have a dual-core processor system and the cache line size
is 64 bytes. For the sample code shown above, two chunks (or array
sections) can be in the same cache line because the chunk size is set to 8
in the schedule clause. So each chunk of array x takes 32 bytes per cache
line, which leads to two chunks placed in the same cache line. Because
two chunks can be read and written by two threads at the same time, this
will result in many cache line invalidations, although two threads do not
read/write the same chunk. This is called false-sharing, as it is not
necessary to actually share the same cache line between two threads. A
simple tuning method is to use schedule(dynamic,16), so one chunk
takes the entire cache line to eliminate the false-sharing. Eliminating false-
sharing through the use of a chunk size setting that is aware of cache line
size will significantly improve your application performance.
// Thread 0:
temp = 0;
for ( k = 0; k < 50; k++ ) {
   temp = temp + func(k);
}

// Thread 1:
temp = 0;
for ( k = 50; k < 100; k++ ) {
   temp = temp + func(k);
}
At the synchronization point, you can combine the partial sum results
from each thread to generate the final sum result. In order to perform
this form of recurrence calculation in parallel, the operation must be
mathematically associative and commutative. You may notice that the
variable sum in the original sequential loop must be shared to guarantee
the correctness of the multithreaded execution of the loop, but each
update of it must also be protected by a lock or a critical section so
that the addition to sum is atomic and free of data races. To solve the
problem of both sharing and protecting sum
without using a lock inside the threaded loop, OpenMP provides the
reduction clause that is used to efficiently combine certain associative
arithmetical reductions of one or more variables in a loop. The following
loop uses the reduction clause to generate the correct results.
sum = 0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < 100; k++) {
sum = sum + func(k);
}
Given the reduction clause, the compiler creates private copies of the
variable sum for each thread, and when the loop completes, it adds the
values together and places the result in the original variable sum.
Other reduction operators besides + exist. Table 6.3 lists those C++
reduction operators specified in the OpenMP standard, along with the initial
values (which are also the mathematical identity values) for the temporary
private variables. You can also find a list of Fortran reduction operators along
with their initial values in the OpenMP specification Version 2.5.
reduced to 5.0 microseconds. Note that all measured costs are subject to
change if you measure these costs on a different processor or under a
different system configuration. The key point is that no matter how well the
compiler and runtime are developed and tuned to minimize the overhead of
OpenMP constructs and clauses, you can always find ways to reduce the
overhead by exploring the use of OpenMP in a more effective way.
Earlier, you saw how the parallel for pragma could be used to
split the iterations of a loop across multiple threads. When the
compiler-generated threaded code is executed, the iterations of the loop are distributed
among threads. At the end of the parallel region, the threads are
suspended and they wait for the next parallel region, loop, or sections.
A suspend or resume operation, while significantly lighter weight than
create or terminate operations, still creates overhead and may be
unnecessary when two parallel regions, loops, or sections are adjacent as
shown in the following example.
#pragma omp parallel for
for ( k = 0; k < m; k++ ) {
   fn1(k); fn2(k);
}
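Only the first of the two adjacent constructs survives in this extract; the second, adjacent loop would look much the same, with fn3 and fn4 standing in as hypothetical functions.

#pragma omp parallel for
for ( k = 0; k < m; k++ ) {
   fn3(k); fn4(k);
}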
Work-sharing Sections
The work-sharing sections construct directs the OpenMP compiler and
runtime to distribute the identified sections of your application among
threads in the team created for the parallel region. The following
example uses work-sharing for loops and work-sharing sections
together within a single parallel region. In this case, the overhead of
forking or resuming threads for parallel sections is eliminated.
#pragma omp parallel
{
#pragma omp for
for ( k = 0; k < m; k++ ) {
x = fn1(k) + fn2(k);
}
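   // (The rest of this listing is missing from the extract; it presumably
   //  continues with a sections construct along the following lines, where
   //  fn3, fn4, y, and z are only illustrative.)
   #pragma omp sections
   {
      #pragma omp section
      { y = fn3(x); }
      #pragma omp section
      { z = fn4(x); }
   }
}   // end of the parallel region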
Here, OpenMP first creates several threads. Then, the iterations of the
loop are divided among the threads. Once the loop is finished, the
sections are divided among the threads so that each section is executed
exactly once, but in parallel with the other sections. If the program
contains more sections than threads, the remaining sections get
scheduled as threads finish their previous sections. Unlike loop
scheduling, the schedule clause is not defined for sections. Therefore,
OpenMP is in complete control of how, when, and in what order threads
are scheduled to execute the sections. You can still control which
variables are shared or private, using the private and reduction
clauses in the same fashion as the loop construct.
Performance-oriented Programming
OpenMP provides a set of important pragmas and runtime functions that
enable thread synchronization and related actions to facilitate correct
parallel programming. Using these pragmas and runtime functions
effectively with minimum overhead and thread waiting time is extremely
important for achieving optimal performance from your applications.
Consider the code example below to see how it works. The following
code converts a color image to black and white.
for ( row = 0; row < height; row++ ) {
for ( col = 0; col < width; col++ ) {
pGray[col] = (BYTE)
( pRGB[col].red * 0.299 +
pRGB[col].green * 0.587 +
pRGB[col].blue * 0.114 );
}
pGray += GrayStride;
pRGB += RGBStride;
}
The issue is how to move the pointers pGray and pRGB to the correct
place within the bitmap while threading the outer row loop. The
address computation for each pixel can be done with the following code:
pDestLoc = pGray + col + row * GrayStride;
pSrcLoc = pRGB + col + row * RGBStride;
The above code, however, executes extra math on each pixel for the
address computation. Instead, the firstprivate clause can be used to
perform the necessary initialization to get the initial addresses of the
pointers pGray and pRGB for each thread. Notice that the initial addresses
of pGray and pRGB have to be computed only once per thread, based on the
row number and their initial addresses in the master thread; after that,
pGray and pRGB are induction pointers that are updated in the outer loop
for each row iteration. This is the reason the bool-type variable doInit
is introduced with an initial value of TRUE: it ensures the initialization
is done only once per thread to compute the initial addresses of pGray
and pRGB. The parallelized code follows:
#pragma omp parallel for private (row, col) \
firstprivate(doInit, pGray, pRGB)
for ( row = 0; row < height; row++ ) {
// Need this init test to be able to start at an
// arbitrary point within the image after threading.
if (doInit == TRUE) {
doInit = FALSE;
pRGB += ( row * RGBStride );
pGray += ( row * GrayStride );
}
for ( col = 0; col < width; col++ ) {
pGray[col] = (BYTE) ( pRGB[col].red * 0.299 +
pRGB[col].green * 0.587 +
pRGB[col].blue * 0.114 );
}
pGray += GrayStride;
pRGB += RGBStride;
}
If you take a close look at this code, you may find that the four variables
GrayStride, RGBStride, height, and width are read-only variables. In
other words, no write operation is performed to these variables in the
parallel loop. Thus, you can also specify them on the parallel for pragma
by adding the following clause:
firstprivate (GrayStride, RGBStride, height, width)
You may get better performance in some cases, because privatizing these
read-only variables helps the compiler perform more aggressive
registerization and code motion on loop invariants, which reduces memory traffic.
In the previous code, dynamically nested critical sections are used.
When the function do_work is called inside a parallel loop, multiple
threads compete to enter the outer critical section. The thread that
succeeds in entering the outer critical section calls the dequeue
function; however, dequeue cannot make any further progress, because
its inner critical section has the same name as the critical section
already held in do_work. Thus, the do_work function can never
complete: this is a deadlock situation. The simple way to fix the
problem in the previous code is to inline the dequeue
function into the do_work function, as follows:
void do_work(NODE *node)
{
#pragma omp critical (x)
{
node->next->data = fn1(node->data);
node = node->next;
}
}
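For contrast, a sketch of the dynamically nested arrangement the text warns about follows; the book's original listing is not part of this extract, so the function bodies here are only illustrative.

void dequeue(NODE *node)
{
   #pragma omp critical (x)
   {
      node->next->data = fn1(node->data);
      node = node->next;
   }
}

void do_work(NODE *node)
{
   #pragma omp critical (x)   // the same named critical section is
   {                          // requested again inside dequeue, so the
      dequeue(node);          // thread deadlocks on itself
   }
}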
When using multiple critical sections, be very careful to examine
critical sections that might be lurking in subroutines. In addition to using
critical sections, you can also use the atomic pragma for updating
shared variables. When executing code in parallel, it is impossible to
know when an operation will be interrupted by the thread scheduler.
It is possible that the thread is swapped out between two of these machine
instructions. The atomic pragma directs the compiler to generate code to
ensure that the specific memory storage is updated atomically. The
following code example shows a usage of the atomic pragma.
int main()
{ float y[1000];
int k, idx[1000];
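   // (The remainder of the listing is not in this extract; a minimal sketch
   //  of the usage it illustrates, with assumed loop bounds, follows.)
   #pragma omp parallel for shared(y, idx)
   for ( k = 0; k < 1000; k++ ) {
      idx[k] = k % 100;       // several iterations map to the same slot of y
      #pragma omp atomic
      y[idx[k]] += 1.0f;      // the atomic pragma makes the shared update safe
   }
   return 0;
}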
(Figure: a thread pool containing threads T1, T2, ..., TK, ..., TN.)
Figure 6.2 uses these functions to perform data processing for each
element in array x. This example illustrates a few important concepts
when using the function calls instead of pragmas. First, your code must
be rewritten, and with any rewrite comes extra documentation,
debugging, testing, and maintenance work. Second, it becomes difficult
or impossible to compile without OpenMP support. Finally, because
the thread count has been hard coded, you lose the ability to have loop
scheduling adjusted for you, and this threaded code does not scale
beyond four cores or processors, even if more than four are present
in the system.
float x[8000];
omp_set_num_threads(4);
#pragma omp parallel private(k)
{ // This code has a shortcoming. Can you find it?
int num_thds = omp_get_num_threads();
int ElementsPerThread = 8000 / num_thds;
int Tid = omp_get_thread_num();
int LowBound = Tid*ElementsPerThread;
int UpperBound = LowBound + ElementsPerThread;
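      // (The loop body is truncated in this extract; presumably each thread
      //  then walks its own slice along these lines, where DataProcess() is
      //  a hypothetical per-element function.)
      for ( k = LowBound; k < UpperBound; k++ )
         DataProcess( x[k] );
}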
Figure 6.2 Loop that Uses OpenMP Functions and Illustrates the Drawbacks
Compilation
Using the OpenMP pragmas requires an OpenMP-compatible compiler
and thread-safe runtime libraries. The Intel C++ Compiler version 7.0
or later and the Intel Fortran compiler both support OpenMP on Linux
and Windows. This book's discussion of compilation and debugging
will focus on these compilers. Several other choices are available as
well, for instance, Microsoft supports OpenMP in Visual C++ 2005 for
Windows and the Xbox 360 platform, and has also made OpenMP
work with managed C++ code. In addition, OpenMP compilers for
C/C++ and Fortran on Linux and Windows are available from the
Portland Group.
The /Qopenmp command-line option given to the Intel C++ Compiler
instructs it to pay attention to the OpenMP pragmas and to create
multithreaded code. If you omit this switch from the command line, the
compiler will ignore the OpenMP pragmas. This action provides a very
simple way to generate a single-threaded version without changing any
source code. Table 6.7 provides a summary of invocation options for
using OpenMP.
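For example, a command line for the Intel compiler on Windows might look like the following; MyProgram.cpp is a placeholder source file, and for the Linux compiler drivers of that generation the corresponding switch is spelled -openmp.

icl /Qopenmp MyProgram.cpp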
Debugging
Debugging multithreaded applications has always been a challenge due
to the nondeterministic execution of multiple instruction streams caused
by runtime thread-scheduling and context switching. Also, debuggers
may change the runtime performance and thread scheduling behaviors,
which can mask race conditions and other forms of thread interaction.
Even print statements can mask issues because they use synchronization
and operating system functions to guarantee thread-safety.
Debugging an OpenMP program adds further difficulty: after threaded code
generation, the OpenMP compiler must communicate to the debugger all the
necessary information about private variables, shared variables,
threadprivate variables, and the various constructs, and it also generates
additional code that is impossible to examine and step through without a
specialized OpenMP-aware debugger. Therefore, the key is narrowing the
problem down to a small code section that causes the same problem. It would
be even better if you could come up with a very small test case that can
reproduce the problem.
In the general form, the if clause can be any scalar expression, like the
one shown in the following example that causes serial execution when
the number of iterations is less than 16.
#pragma omp parallel for if(n>=16)
for ( k = 0; k < n; k++ ) fn2(k);
Another method is to pick the region of the code that contains the bug
and place it within a critical section, a single construct, or a master
construct. Try to find the section of code that suddenly works when it is
within a critical section and fails without the critical section, or executed
with a single thread.
The goal is to use the abilities of OpenMP to quickly shift code back
and forth between parallel and serial states so that you can identify the
locale of the bug. This approach only works if the program does in
fact function correctly when run completely in serial mode. Notice that
only OpenMP gives you the possibility of testing code this way without
rewriting it substantially. Standard programming techniques used in the
Windows API or Pthreads irretrievably commit the code to a threaded
model and so make this debugging approach more difficult.
Performance
OpenMP provides a simple and portable way for you to parallelize your
applications or to develop threaded applications.
First look at the amount of time spent in the operating system's idle
loop. The Intel VTune Performance Analyzer is a great tool to help with
the investigation. Idle time can indicate unbalanced loads, lots of blocked
synchronization, and serial regions. Fix those issues, then go back to the
VTune Performance Analyzer to look for excessive cache misses and
memory issues like false-sharing. Solve these basic problems, and you will
have a well-optimized parallel program that will run well on multi-core
systems as well as multiprocessor SMP systems.
Optimizations are really a combination of patience, trial and error, and
practice. Make little test programs that mimic the way your application
uses the computer's resources to get a feel for what things are faster than
others. Be sure to try the different scheduling clauses for the parallel
sections. Chapter 7 provides additional advice on how to tune parallel
code for performance, and Chapter 11 covers the tools you'll need.
Key Points
Keep the following key points in mind while programming with OpenMP:
The OpenMP programming model provides an easy and portable
way to parallelize serial code with an OpenMP-compliant compiler.
OpenMP consists of a rich set of pragmas, environment variables,
and a runtime API for threading.
The environment variables and APIs should be used sparingly
because they can affect performance detrimentally. The pragmas
represent the real added value of OpenMP.
With the rich set of OpenMP pragmas, you can incrementally
parallelize loops and straight-line code blocks such as sections
without re-architecting the application. The Intel task-queuing
extension makes OpenMP even more powerful by covering more
application domains for threading.
If your application's performance is saturating a core or
processor, threading it with OpenMP will almost certainly
increase the application's performance on a multi-core or
multiprocessor system.
You can easily use pragmas and clauses to create critical sections,
identify private and public variables, copy variable values, and
control the number of threads operating in one section.
Chapter 7: Solutions to Common Parallel Programming Problems
scheduler gives each software thread a short turn, called a time slice, to
run on one of the hardware threads. When a software thread's time slice
runs out, the scheduler preemptively suspends the thread in order to run
another software thread on the same hardware thread. The software
thread freezes in time until it gets another time slice.
Time slicing ensures that all software threads make some progress.
Otherwise, some software threads might hog all the hardware threads
and starve other software threads. However, this equitable distribution of
hardware threads incurs overhead. When there are too many software
threads, the overhead can severely degrade performance. There are
several kinds of overhead, and it helps to know the culprits so you can
spot them when they appear.
The most obvious overhead is the process of saving and restoring a
thread's register state. Suspending a software thread requires saving the
register values of the hardware thread, so the values can be restored
later, when the software thread resumes on its next time slice. Typically,
thread schedulers allocate big enough time slices so that the save/restore
overheads for registers are insignificant, so this obvious overhead is in
fact not much of a concern.
A more subtle overhead of time slicing is saving and restoring a
thread's cache state. Modern processors rely heavily on cache
memory, which can be about 10 to 100 times faster than main
memory. Accesses that hit in cache are not only much faster; they also
consume no bandwidth of the memory bus. Caches are fast, but finite.
When the cache is full, a processor must evict data from the cache to
make room for new data. Typically, the choice for eviction is the least
recently used data, which more often than not is data from an earlier
time slice. Thus threads tend to evict each other's data. The net effect
is that too many threads hurt performance by fighting each other for
cache.
A similar overhead, at a different level, is thrashing virtual memory.
Most systems use virtual memory, where the processors have an address
space bigger than the actual available memory. Virtual memory resides
on disk, and the frequently used portions are kept in real memory.
Similar to caches, the least recently used data is evicted from memory
when necessary to make room. Each software thread requires virtual
memory for its stack and private data structures. As with caches, time
slicing causes threads to fight each other for real memory and thus hurts
performance. In extreme cases, there can be so many threads that the
program runs out of even virtual memory.
The cache and virtual memory issues described arise from sharing
limited resources among too many software threads. A very different, and
often more severe, problem arises called convoying, in which software
threads pile up waiting to acquire a lock. Consider what happens when a
thread's time slice expires while the thread is holding a lock. All threads
waiting for the lock must now wait for the holding thread to wake up
and release the lock. The problem is even worse if the lock
implementation is fair, in which the lock is acquired in first-come first-
served order. If a waiting thread is suspended, then all threads waiting
behind it are blocked from acquiring the lock.
The solution that usually works best is to limit the number of
runnable threads to the number of hardware threads, and possibly
limit it to the number of outer-level caches. For example, a dual-core
Intel Pentium Processor Extreme Edition has two physical cores, each
with Hyper-Threading Technology, and each with its own cache. This
configuration supports four hardware threads and two outer-level
caches. Using all four runnable threads will work best unless the
threads need so much cache that they fight over it, in which
case only two threads may be best. The only way to be sure is to
experiment. Never hard code the number of threads; leave it as a
tuning parameter.
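A minimal sketch of leaving the count as a tuning parameter, where the environment variable name is purely an assumption for illustration:

#include <stdlib.h>
#include <omp.h>

/* Pick the number of worker threads at run time instead of hard coding it. */
int choose_thread_count(void)
{
   int n = omp_get_num_procs();             /* default: one per hardware thread */
   const char *s = getenv("MYAPP_THREADS"); /* hypothetical tuning knob */
   if (s != NULL && atoi(s) > 0)
      n = atoi(s);                          /* user override wins */
   return n;
}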
Runnable threads, not blocked threads, cause time-slicing overhead.
When a thread is blocked waiting for an external event, such as a mouse
click or disk I/O request, the operating system takes it off the round-
robin schedule. Hence a blocked thread does not cause time-slicing
overhead. A program may have many more software threads than
hardware threads, and still run efficiently if most of the OS threads are
blocked.
A helpful organizing principle is to separate compute threads from
I/O threads. Compute threads should be the threads that are runnable
most of the time. Ideally, the compute threads never block on external
events, and instead feed from task queues that provide work. The
number of compute threads should match the processor resources. The
I/O threads are threads that wait on external events most of the time, and
thus do not contribute to having too many threads.
Because building efficient task queues takes some expertise, it is
usually best to use existing software to do this. Common useful practices
are as follows:
Let OpenMP do the work. OpenMP lets the programmer specify
loop iterations instead of threads. OpenMP deals with managing
such an access straddles a cache line, the processor performs the access
as two separate accesses to the two constituent cache lines.
Data races can arise not only from unsynchronized access to shared
memory, but also from synchronized access that was synchronized at too
low a level. Figure 7.3 shows such an example. The intent is to use a list
to represent a set of keys. Each key should be in the list at most once.
Even if the individual list operations have safeguards against races, the
combination suffers a higher level race. If two threads both attempt to
insert the same key at the same time, they may simultaneously determine
that the key is not in the list, and then both would insert the key. What is
needed is a lock that protects not just the list, but that also protects the
invariant that no key occurs twice in the list.
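A sketch of the pattern just described, where list_contains, list_insert, and set_lock are hypothetical and each list operation is assumed to be individually thread-safe:

/* Racy: two threads can both see the key as absent and both insert it. */
if ( !list_contains(&the_list, key) )
   list_insert(&the_list, key);

/* Safe: one lock protects the whole invariant "no key occurs twice". */
pthread_mutex_lock(&set_lock);
if ( !list_contains(&the_list, key) )
   list_insert(&the_list, key);
pthread_mutex_unlock(&set_lock);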
Deadlock
Race conditions are typically cured by adding a lock that protects the
invariant that might otherwise be violated by interleaved operations.
Unfortunately, locks have their own hazards, most notably deadlock.
Figure 7.4 shows a deadlock involving two threads. Thread 1 has
acquired lock A. Thread 2 has acquired lock B. Each thread is trying to
acquire the other lock. Neither thread can proceed.
has the further benefit of possibly improving scalability, because the lock
that was removed might have been a source of contention.
If replication cannot be done, that is, in such cases where there
really must be only a single copy of the resource, common wisdom is to
always acquire the resources (locks) in the same order. Consistently
ordering acquisition prevents deadlock cycles. For instance, the
deadlock in Figure 7.4 cannot occur if threads always acquire lock A
before they acquire lock B.
The ordering rules that are most convenient depend upon the
specific situation. If the locks all have associated names, even something
as simple as alphabetical order works. This order may sound silly, but it
has been successfully used on at least one large project.
For multiple locks in a data structure, the order is often based on the
topology of the structure. In a linked list, for instance, the agreed upon
order might be to lock items in the order they appear in the list. In a tree
structure, the order might be a pre-order traversal of the tree. Somewhat
similarly, components often have a nested structure, where bigger
components are built from smaller components. For components nested
that way, a common order is to acquire locks in order from the outside to
the inside.
If there is no obvious ordering of locks, a solution is to sort the
locks by address. This approach requires that a thread know all locks
that it needs to acquire before it acquires any of them. For instance,
perhaps a thread needs to swap two containers pointed to by pointers x
and y, and each container is protected by a lock. The thread could
compare x < y to determine which container comes first, and acquire
the lock on the first container before acquiring a lock on the second
container, as Figure 7.5 illustrates.
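A sketch of the address-comparison idea follows; the Container type, its mutex field, and swap_contents are placeholders, and <pthread.h> is assumed.

void swap_containers(Container *x, Container *y)
{
   Container *first  = (x < y) ? x : y;   /* order the two locks by address */
   Container *second = (x < y) ? y : x;
   pthread_mutex_lock(&first->mutex);
   pthread_mutex_lock(&second->mutex);
   swap_contents(x, y);                   /* both containers are now held */
   pthread_mutex_unlock(&second->mutex);
   pthread_mutex_unlock(&first->mutex);
}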
Figure 7.6 has some timing delays in it to prevent the hazard of live
lock. Live lock occurs when threads continually conflict with each other
and back off. Figure 7.6 applies exponential backoff to avoid live lock. If
a thread cannot acquire all the locks that it needs, it releases any that it
acquired and waits for a random amount of time. The random time is
chosen from an interval that doubles each time the thread backs off.
Eventually, the threads involved in the conflict will back off sufficiently
that at least one will make progress. The disadvantage of backoff schemes
is that they are not fair. There is no guarantee that a particular thread will
make progress. If fairness is an issue, then it is probably best to use lock
ordering to prevent deadlock.
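A sketch of exponential backoff for a pair of locks, assuming <pthread.h>, <unistd.h>, and <stdlib.h> are included and lockA and lockB are initialized mutexes; the delay bounds are arbitrary:

int delay_us = 1;
for (;;) {
   if (pthread_mutex_trylock(&lockA) == 0) {
      if (pthread_mutex_trylock(&lockB) == 0)
         break;                        /* got both locks; proceed */
      pthread_mutex_unlock(&lockA);    /* back off: release what we hold */
   }
   usleep(rand() % (delay_us + 1));    /* wait a random time in [0, delay] */
   if (delay_us < 1024)
      delay_us *= 2;                   /* double the interval on each conflict */
}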
Priority Inversion
Some threading implementations allow threads to have priorities. When
there are not enough hardware threads to run all software threads, the
higher priority software threads get preference. For example,
foreground tasks might be running with higher priorities than
background tasks. Priorities can be useful, but paradoxically, can lead
to situations where a low-priority thread blocks a high-priority thread
from running.
Figure 7.7 illustrates priority inversion. Continuing our analogy with
software threads as cars and hardware threads as drivers, three cars are
shown, but there is only a single driver. A low-priority car has acquired a
lock so it can cross a single-lane critical section bridge. Behind it waits
a high-priority car. But because the high-priority car is blocked, the driver
is attending the highest-priority runnable car, which is the medium-
priority one. As contrived as this sounds, it actually happened on the
NASA Mars Pathfinder mission.
Figure 7.7 Priority Inversion Scenario, Where High Priority Gets Blocked and
Medium Priority Gets the Cycles
bucket's mutex. The thread acquires a reader lock, not a writer lock, on
the reader-writer mutex even if it is planning to modify a bucket,
because the reader-writer mutex protects the array descriptor, not the
buckets. If a thread needs to resize the array, it requests a writer lock
on the reader-writer mutex. Once granted, the thread can safely modify
the array descriptor without introducing a race condition. The overall
advantage is that during times when the array is not being resized,
multiple threads accessing different buckets can proceed concurrently.
The principal disadvantage is that a thread must obtain two locks
instead of one. This increase in locking overhead can overwhelm the
advantages of increased concurrency if the table is typically not subject
to contention.
Non-blocking Algorithms
One way to solve the problems introduced by locks is to not use locks.
Algorithms designed to do this are called non-blocking. The defining
characteristic of a non-blocking algorithm is that stopping a thread does
not prevent the rest of the system from making progress. There are
different non-blocking guarantees:
Obstruction freedom. A thread makes progress as long as there is
no contention, but live lock is possible. Exponential backoff can
be used to work around live lock.
Lock freedom. The system as a whole makes progress.
Wait freedom. Every thread makes progress, even when faced
with contention. Very few non-blocking algorithms achieve this.
Non-blocking algorithms are immune from lock contention, priority
inversion, and convoying. Non-blocking algorithms have a lot of
advantages, but with these come a new set of problems that need to be
understood.
Non-blocking algorithms are based on atomic operations, such as the
methods of the Interlocked class discussed in Chapter 5. A few non-
blocking algorithms are simple. Most are complex, because the
algorithms must handle all possible interleaving of instruction streams
from contending processors.
A trivial non-blocking algorithm is counting via an interlocked
increment instead of a lock. The interlocked instruction avoids lock
overhead and pathologies. However, simply using atomic operations is
not enough to avoid race conditions, because as discussed before,
composing thread-safe operations does not necessarily yield a thread-safe
procedure. As an example, the C code in Figure 7.10 shows the wrong
way and right way to decrement and test a reference count
p->ref_count. In the wrong code, if the count was originally 2, two
threads executing the wrong code might both decrement the count, and
then both see it as zero at the same time. The correct code performs the
decrement and test as a single atomic operation.
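Figure 7.10 itself is not reproduced in this extract; the essence of the wrong and right versions can be sketched with the Win32 InterlockedDecrement call, assuming p->ref_count is a LONG and destroy() is a hypothetical cleanup routine.

/* Wrong: the decrement and the test are separate steps, so two threads
   can both observe zero and both destroy the object. */
InterlockedDecrement(&p->ref_count);
if (p->ref_count == 0)
   destroy(p);

/* Right: test the value returned by the single atomic operation. */
if (InterlockedDecrement(&p->ref_count) == 0)
   destroy(p);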
ABA Problem
In Figure 7.11, there is a time interval between when a thread executes
x_old = x and when the thread executes InterlockedCompareEx-
change. During this interval, other processors might perform other fetch-
and-op operations. For example, suppose the initial value read is A. An
intervening sequence of fetch-and-op operations by other processors
might change x to B and then back to A. When the original thread
executes InterlockedCompareExchange, it will be as if the other
processors actions never happened. As long as the order in which op is
executed does not matter, there is no problem. The net result is the same
as if the fetch-and-op operations were reordered such that the
intervening sequence happens before the first read.
But sometimes fetch-and-op has uses where changing x from A to B
to A does make a difference. The problem is indeed known as the ABA
problem. Consider the lockless implementation of a stack shown in
Figure 7.12. It is written in the fetch-and-op style, and thus has the
advantage of not requiring any locks. But the op is no longer a pure
function, because it deals with another shared memory location: the field
next. Figure 7.13 shows a sequence where the function
BrokenLocklessPop corrupts the linked stack. When Thread 1 starts
out, it sees B as next on the stack. But intervening pushes and pops make C
next on the stack instead, and Thread 1's final InterlockedCompareExchange does
not catch this switch because it only examines Top.
Item* BrokenLocklessPop() {
Item *t_old, *t_was, *t_new;
do {
t_old = Top;
t_new = t_old->next;
// ABA problem may strike below!
t_was = InterlockedCompareExchange(&Top,t_new,t_old);
} while( t_was!=t_old );
return t_old;
}
Figure 7.12 Lockless Implementation of a Linked Stack that May Suffer from ABA
Problem
Figure 7.13 Sequence Illustrates ABA Problem for Code in Figure 7.12
The problem occurs for algorithms that remove nodes from linked
structures, and do so by performing compare-exchange operations on
fields in the nodes. For example, non-blocking algorithms for queues do
this. The reason is that when a thread removes a node from a data
structure, without using a lock to exclude other threads, it never knows
if another thread is still looking at the node. The algorithms are usually
designed so that the other thread will perform a failing compare-
exchange on a field in the removed node, and thus know to retry.
Unfortunately, if in the meantime the node is handed to free, the field
might be coincidentally set to the value that the compare-exchange
expects to see.
The solution is to use a garbage collector or mini-collector like
hazard pointers. Alternatively you may associate a free list of nodes with
the data structure and not free any nodes until the data structure itself
is freed.
Recommendations
Non-blocking algorithms are currently a hot topic in research. Their big
advantage is avoiding lock pathologies. Their primary disadvantage is that
they are much more complicated than their locked counterparts. Indeed,
the discovery of a lockless algorithm is often worthy of a conference
paper. Non-blocking algorithms are difficult to verify. At least one
incorrect algorithm has made its way into a conference paper. Non-
experts should consider the following advice:
Atomic increment, decrement, and fetch-and-add are generally
safe to use in an intuitive fashion.
The fetch-and-op idiom is generally safe to use with operations
that are commutative and associative.
The creation of non-blocking algorithms for linked data
structures should be left to experts. Use algorithms from the
peer-reviewed literature. Be sure to understand any memory
reclamation issues.
Otherwise, for now, stick with locks. Avoid having more runnable
software threads than hardware threads, and design programs to avoid
lock contention. This way, the problems solved by non-blocking
algorithms will not come up in the first place.
Figure 7.14 Implementer Should Ensure Thread Safety of Hidden Shared State
not update hidden global state, because with multiple threads, it may not
be clear whose global state is being updated. The C library function
strtok is one such offender. Clients use it to tokenize a string. The first
call sets the state of a hidden parser, and each successive call advances
the parser. The hidden parser state makes the interface thread unsafe.
Thread safety can be obtained by having the implementation put the
parser in thread-local storage. But this introduces the complexity of a
threading package into something that really should not need it in the
first place. A thread-safe redesign of strtok would make the parser
object an explicit argument. Each thread would create its own local
parser object and pass it as an argument. That way, concurrent calls
could proceed blissfully without interference.
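POSIX's strtok_r follows exactly this redesign: the caller owns the parser state and passes it on every call. A small sketch, with an arbitrary delimiter:

#include <stdio.h>
#include <string.h>

void print_tokens(char *buf)
{
   char *state;                              /* per-caller parser state */
   char *tok = strtok_r(buf, ",", &state);
   while (tok != NULL) {
      printf("%s\n", tok);
      tok = strtok_r(NULL, ",", &state);     /* no hidden global state */
   }
}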
Some libraries come in thread-safe and thread-unsafe versions. Be sure
to use the thread-safe version for multi-threaded code. For example, on
Windows, the compiler option /MD is required to dynamically link with
the thread-safe version of the run-time library. For debugging, the
corresponding option is /MDd, which dynamically links with the debug
version of the thread-safe run-time. Read your compiler documentation
carefully about these kinds of options. Because the compilers date back to
the single-core era, the defaults are often for code that is not thread safe.
Memory Issues
When most people perform calculations by hand, they are limited by
how fast they can do the calculations, not how fast they can read and
write. Early microprocessors were similarly constrained. In recent
decades, microprocessors have grown much faster in speed than in
memory. A single microprocessor core can execute hundreds of
operations in the time it takes to read or write a value in main memory.
Programs now are often limited by the memory bottleneck, not
processor speed. Multi-core processors can exacerbate the problem
unless care is taken to conserve memory bandwidth and avoid memory
contention.
Bandwidth
To conserve bandwidth, pack data more tightly, or move it less
frequently between cores. Packing the data tighter is usually
speed, the extra bookkeeping pays off dramatically. Figure 7.18 shows
this performance difference. On this log plot, the cache friendly code has
a fairly straight performance plot, while the cache-unfriendly version's
running time steps up from one straight line to another when n reaches
approximately 10^6. The step is characteristic of algorithms that transition
from running in cache to running out of cache as the problem size
increases. The restructured version is five times faster than the original
version when n significantly exceeds the cache size, despite the extra
processor operations required by the restructuring.
Figure 7.18 Performance Difference between Figure 7.16 and Figure 7.17
Memory Contention
For multi-core programs, working within the cache becomes trickier,
because data is not only transferred between a core and memory, but
also between cores. As with transfers to and from memory, mainstream
programming languages do not make these transfers explicit. The
transfers arise implicitly from patterns of reads and writes by different
cores. The patterns correspond to two types of data dependencies:
Read-write dependency. A core writes a cache line, and then a
different core reads it.
Write-write dependency. A core writes a cache line, and then a
different core writes it.
An interaction that does not cause data movement is two cores
repeatedly reading a cache line that is not being written. Thus if multiple
cores only read a cache line and do not write it, then no memory
bandwidth is consumed. Each core simply keeps its own copy of the
cache line.
To minimize memory bus traffic, minimize core interactions by
minimizing shared locations. Hence, the same patterns that tend to
reduce lock contention also tend to reduce memory traffic, because it is
the shared state that requires locks and generates contention. Letting
each thread work on its own local copy of the data and merging the data
after all threads are done can be a very effective strategy.
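As a sketch of this pattern, consider a hypothetical histogram of n byte-sized values; all names here are illustrative, not from the text.

long global_hist[256] = {0};

void histogram(const unsigned char *data, long n)
{
   #pragma omp parallel
   {
      long local_hist[256] = {0};            /* private copy per thread */
      #pragma omp for
      for (long i = 0; i < n; i++)
         local_hist[data[i]]++;              /* no sharing inside the loop */
      #pragma omp critical
      for (int b = 0; b < 256; b++)
         global_hist[b] += local_hist[b];    /* one merge per thread */
   }
}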
Consider writing a multi-threaded version of the function
CacheFriendlySieve from Figure 7.17. A good decomposition for this
problem is to fill the array factor sequentially, and then operate on the
windows in parallel. The sequential portion takes time O(√n), and
hence has minor impact on speedup for large n. Operating on the
windows in parallel requires sharing some data. Looking at the nature of
the sharing will guide you on how to write the parallel version.
The array factor is read-only once it is filled. Thus each thread
can share the array.
The array composite is updated as primes are found. However,
the updates are made to separate windows, so they are unlikely
to interfere except at window boundaries that fall inside a cache
line. Better yet, observe that the values in the window are used
only while the window is being processed. The array composite
no longer needs to be shared, and instead each thread can have a
private portion that holds only the window of interest. This
change benefits the sequential version too, because now the
space requirements for the sieve have been reduced from O(n) to
O(√n). The reduction in space makes counting primes up to 10^11
possible on even a 32-bit machine.
The variable count is updated as primes are found. An atomic
increment could be used, but that would introduce memory
contention. A better solution, as shown in the example, is to have
each thread maintain a private partial count, and sum the partial
counts at the end.
The array striker is updated as the window is processed. Each
thread will need its own private copy. The tricky part is that
striker induces a loop-carried dependence between windows.
For each window, the initial value of striker is the last value it
had for the previous window. To break this dependence, the
initial values in striker have to be computed from scratch. This
Cache-related Issues
As remarked earlier in the discussion of time-slicing issues, good
performance depends on processors fetching most of their data from
cache instead of main memory. For sequential programs, modern caches
generally work well without too much thought, though a little tuning
helps. In parallel programming, caches open up some much more serious
pitfalls.
False Sharing
The smallest unit of memory that two processors interchange is a cache
line or cache sector. Two separate caches can share a cache line when
they both need to read it, but if the line is written in one cache, and read
in another, it must be shipped between caches, even if the locations of
interest are disjoint. Like two people writing in different parts of a log
book, the writes are independent, but unless the book can be ripped
apart, the writers must pass the book back and forth. In the same way,
two hardware threads writing to different locations contend for a cache
sector to the point where it becomes a ping-pong game.
Figure 7.20 illustrates such a ping-pong game. There are two threads,
each running on a different core. Each thread increments a different
location belonging to the same cache line. But because the locations
belong to the same cache line, the cores must pass the sector back and
forth across the memory bus.
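One common remedy is to pad or align per-thread data so that no two threads' hot locations share a line; a sketch, assuming a 64-byte cache sector and a hypothetical MAX_THREADS constant:

#define CACHE_LINE 64

struct PaddedCounter {
   volatile long value;
   char pad[CACHE_LINE - sizeof(long)];   /* keep neighbors on other lines */
};

struct PaddedCounter counter[MAX_THREADS]; /* one exclusive line per thread */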
Figure 7.21 shows how bad the impact can be for a generalization of
Figure 7.20. Four single-core processors, each enabled with Hyper-
Threading Technology (HT Technology), are used to give the flavor of a
hypothetical future eight-core system. Each hardware thread increments
a separate memory location. The ith thread repeatedly increments
x[i*stride]. The performance is worse when the locations are
adjacent, and improves as they spread out, because the spreading puts
the locations into more distinct cache lines. Performance improves
sharply at a stride of 16. This is because the array elements are 4-byte
integers. The stride of 16 puts the locations 16 × 4 = 64 bytes apart. The
data is for a Pentium 4 based processor with a cache sector size of
64 bytes. Hence when the locations were 64 bytes apart, each thread is
It may not be obvious that there is always enough room before the
aligned block to store the pointer. Sufficient room depends upon two
assumptions:
A cache line is at least as big as a pointer.
A malloc request for at least a cache line's worth of bytes returns
a pointer aligned on a boundary that is a multiple of
sizeof(char*).
These two conditions hold for IA-32 and Itanium-based systems. Indeed,
they hold for most architectures because of the alignment restrictions
specified for malloc by the C standard.
return result;
}
Figure 7.22 Memory Allocator that Allocates Blocks Aligned on Cache Line
Boundaries
Memory Consistency
At any given instant in time in a sequential program, memory has a well
defined state. This is called sequential consistency. In parallel programs,
it all depends upon the viewpoint. Two writes to memory by a hardware
thread may be seen in a different order by another thread. The reason is
that when a hardware thread writes to memory, the written data goes
through a path of buffers and caches before reaching main memory.
Along this path, a later write may reach main memory sooner than an
earlier write. Similar effects apply to reads. If one read requires a fetch
from main memory and a later read hits in cache, the processor may
allow the faster read to pass the slower read. Likewise, reads and writes
might pass each other. Of course, a processor has to see its own reads
and writes in the order it issues them, otherwise programs would break.
But the processor does not have to guarantee that other processors see
those reads and writes in the original order. Systems that allow this
reordering are said to exhibit relaxed consistency.
Because relaxed consistency relates to how hardware threads observe
each other's actions, it is not an issue for programs running time-sliced
on a single hardware thread. Inattention to consistency issues can result
in concurrent programs that run correctly on single-threaded hardware,
or even hardware running with HT Technology, but fail when run on
multi-threaded hardware with disjoint caches.
The hardware is not the only cause of relaxed consistency. Compilers
are often free to reorder instructions. The reordering is critical to most
major compiler optimizations. For instance, compilers typically hoist
loop-invariant reads out of a loop, so that the read is done once per loop
instead of once per loop iteration. Language rules typically grant the
compiler license to presume the code is single-threaded, even if it is not.
This is particularly true for older languages such as Fortran, C, and
C++ that evolved when parallel processors were esoteric. For recent
languages, such as Java and C#, compilers must be more circumspect,
Itanium Architecture
The Itanium architecture had no legacy software to preserve, and thus
could afford a cutting-edge relaxed memory model. The model
theoretically delivers higher performance than sequential consistency by
giving the memory system more freedom of choice. As long as locks are
properly used to avoid race conditions, there are no surprises. However,
programmers writing multiprocessor code with deliberate race
conditions must understand the rules. Though far more relaxed than
IA-32, the rules for memory consistency on Itanium processors are
simpler to remember because they apply uniformly. Furthermore,
compilers for Itanium-based systems interpret volatile in a way that
makes most idioms work.
Figure 7.24(a) shows a simple and practical example where the rules
come into play. It shows two threads trying to pass a message via
memory. Thread 1 writes a message into variable Message, and Thread 2
reads the message. Synchronization is accomplished via the flag
IsReady. The writer sets IsReady after it writes the message. The reader
busy waits for IsReady to be set, and then reads the message. If the
writes or reads are reordered, then Thread 2 may read the message
before Thread 1 is done writing it. Figure 7.24(b) shows how the Itanium
architecture may reorder the reads and writes. The solution is to declare
the flag IsReady as volatile, as shown in Figure 7.24(c). Volatile writes
are compiled as store with release and volatile reads are compiled as
load with acquire. Memory operations are never allowed to move
downwards over a release or upwards over an acquire, thus
enforcing the necessary orderings.
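A sketch of this idiom: the names Message and IsReady follow the example above, while the sender/receiver framing and the use() call are illustrative only.

volatile int IsReady = 0;   /* volatile: store-with-release / load-with-acquire */
int Message;

void sender(void)           /* Thread 1 */
{
   Message = 42;            /* write the payload first... */
   IsReady = 1;             /* ...then publish it by setting the flag */
}

void receiver(void)         /* Thread 2 */
{
   while (!IsReady)
      ;                     /* spin until the flag is seen */
   use(Message);            /* guaranteed to see the completed message */
}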
writes some data, and then signals a receiver thread that it is ready by
modifying a flag location. The modification might be a write, or some
other atomic operation. As long as the sender performs a release
operation after writing the data, and the receiver performs an acquire
operation before reading the data, the desired ordering will be
maintained. Typically, these conditions are guaranteed by declaring
the flag volatile, or using an atomic operation with the desired
acquire/release characteristics.
(Figure panels: the Message Passing Idiom and the Cage Idiom, each showing memory operations ordered over time and bounded by acquire and release operations; the cage idiom confines the operations within a cage boundary.)
Figure 7.25 Two Common Idioms for Using Shared Memory without a Lock
High-level Languages
When writing portable code in a high-level language, the easiest way to
deal with memory consistency is through the language's existing
synchronization primitives, which normally have the right kind of fences
built in. Memory consistency issues appear only when programmers roll
while( !IsReady )
_asm pause;
R2 = Message;
Key Points
The key to successful parallel programming is choosing a good program
decomposition. Keep the following points in mind when choosing a
decomposition:
Match the number of runnable software threads to the available
hardware threads. Never hard-code the number of threads into
your program; leave it as a tuning parameter.
Parallel programming for performance is about finding the zone
between too little and too much synchronization. Too little
synchronization leads to incorrect answers. Too much
synchronization leads to slow answers.
Use tools like Intel Thread Checker to detect race conditions.
Keep locks private. Do not hold a lock while calling another
package's code.
Avoid deadlock by acquiring locks in a consistent order.
Chapter 8: Multi-threaded Debugging Techniques
There are a number of different software development methodologies that are applicable to parallel
programming. For example, parallel programming can be done using traditional or rapid prototyping
(Extreme Programming) techniques.
Code Reviews
Many software processes suggest frequent code reviews as a means
of improving software quality. The complexity of parallel
programming makes this task challenging. While not a replacement
for using well established parallel programming design patterns,
code reviews may, in many cases, help catch bugs in the early stages
of development.
One technique for these types of code reviews is to have individual
reviewers examine the code from the perspective of one of the threads
in the system. During the review, each reviewer steps through the
sequence of events as the actual thread would. Have objects that
represent the shared resources of the system available and have the
individual reviewers (threads) take and release these resources. This
technique will help you visualize the interaction between different
threads in your system and hopefully help you find bugs before they
manifest themselves in code.
As a developer, when you get the urge to immediately jump into
coding and disregard any preplanning or preparation, you should
consider the following scenarios and ask yourself which situation you'd
rather be in. Would you rather spend a few weeks of work up front to
validate and verify the design and architecture of your application, or
would you rather deal with having to redesign your product when you
find it doesn't scale? Would you rather hold code reviews during
development or deal with the stress of trying to solve mysterious,
unpredictable showstopper bugs a week before your scheduled ship
date? Good software engineering practices are the key to writing reliable
2. In the interest of making the code more readable, Listing 8.1 uses the time() system call to record system time. Due to the coarse granularity of this timer, most applications should use a high-performance counter instead to keep track of the time at which events occurred.
Listing 8.1 creates a trace buffer that can store 1,024 events. It stores these events in a circular buffer. As you'll see shortly, once the circular buffer is full, the atomic index will wrap around and replace the oldest event. This simplifies the implementation because it doesn't require dynamically resizing the trace buffer or storing the data to disk. In some instances, these operations may be desirable, but in general, a circular buffer should suffice.
Lines 1-13 define the data structures used in this implementation. The event descriptor traceBufferElement is defined in lines 4-9. It contains three fields: a field to store the thread ID, a timestamp value that indicates when the event occurred, and a generic message string that is associated with the event. This structure could include a number of additional parameters, including the name of the thread.
The trace buffer in Listing 8.1 defines three operations. The first method, InitializeTraceBuffer(), initializes the resources used by the trace buffer. The initialization of the atomic counter occurs on line 16. The atomic counter is initialized to -1. The initial value of this counter is -1 because adding a new entry in the trace buffer requires us to first increment (line 29) the atomic counter; the first entry should then be stored in slot 0. Once the trace buffer is initialized, threads may call AddEntryToTraceBuffer() to update the trace buffer with events as they occur. PrintTraceBuffer() dumps a listing of all the events that the trace buffer has logged to the screen. This function is very useful when combined with a debugger that allows users to execute code at a breakpoint. Both Microsoft Visual Studio and GDB support this capability. With a single command, the developer can see a log of all the events recorded in the trace buffer leading up to that point.
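A minimal sketch along these lines (the buffer size, field names, and use of the Win32 InterlockedIncrement() call are assumptions; Listing 8.1 itself is not reproduced here):

    #include <windows.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define NUM_ENTRIES 1024            // circular buffer of 1,024 events
    #define MSG_LEN     128

    typedef struct {
        DWORD  threadId;                // thread that logged the event
        time_t timestamp;               // when the event occurred
        char   msg[MSG_LEN];            // message associated with the event
    } traceBufferElement;

    static traceBufferElement traceBuffer[NUM_ENTRIES];
    static volatile LONG tracePos;      // atomic index into the buffer

    void InitializeTraceBuffer(void)
    {
        tracePos = -1;                  // first increment yields slot 0
        memset((void *)traceBuffer, 0, sizeof(traceBuffer));
    }

    void AddEntryToTraceBuffer(const char *msg)
    {
        // Atomically claim the next slot, wrapping when the buffer is full.
        LONG slot = InterlockedIncrement(&tracePos) % NUM_ENTRIES;
        traceBuffer[slot].threadId  = GetCurrentThreadId();
        traceBuffer[slot].timestamp = time(NULL);
        strncpy(traceBuffer[slot].msg, msg, MSG_LEN - 1);
        traceBuffer[slot].msg[MSG_LEN - 1] = '\0';
    }

    void PrintTraceBuffer(void)
    {
        int i;
        for (i = 0; i < NUM_ENTRIES; i++)
            if (traceBuffer[i].msg[0] != '\0')
                printf("%4d  thread %6u  %s\n", i,
                       (unsigned)traceBuffer[i].threadId, traceBuffer[i].msg);
    }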
The following excerpt of the instrumented thread code (the complete sample is on this book's Web site) brackets the updates to the shared variable with trace buffer entries:

    LockTraceBuffer();
    m_global = do_work();
    AddEntryToTraceBuffer(msg);
    UnlockTraceBuffer();
    Thread_local_data = m_global;
    AddEntryToTraceBuffer(msg);
    UnlockTraceBuffer();
    // ... finish thread
}
Threads Window
As part of the debugger, Visual Studio provides a Threads window that
lists all of the current threads in the system. From this window, you can:
Freeze (suspend) or thaw (resume) a thread. This is useful
when you want to observe the behavior of your application
without a certain thread running.
Switch the current active thread. This allows you to manually
perform a context switch and make another thread active in the
application.
Examine thread state. When you double-click an entry in the Threads window, the source window jumps to the source line that the thread is currently executing. This tells you the thread's current program counter. You will be able to examine the state of local variables within the thread.
The Threads window acts as the command center for examining and
controlling the different threads in an application.
Tracepoints
As previously discussed, determining the sequence of events that lead to
a race condition or deadlock situation is critical in determining the root
cause of any multi-thread related bug. In order to facilitate the logging of
events, Microsoft has implemented tracepoints as part of the debugger
for Visual Studio 2005.
Most developers are familiar with the concept of a breakpoint. A tracepoint is similar to a breakpoint except that instead of stopping program execution when the application's program counter reaches that point, the debugger takes some other action. This action can be printing a message or running a Visual Studio macro.
Enabling tracepoints can be done in one of two ways. To create a
new tracepoint, set the cursor to the source line of code and select
Insert Tracepoint. If you want to convert an existing breakpoint to a
tracepoint, simply select the breakpoint and pick the When Hit option
from the Breakpoint submenu. At this point, the tracepoint dialog
appears.
When a tracepoint is hit, one of two actions is taken based on the
information specified by the user. The simplest action is to print a
message. The programmer may customize the message based on a set of
predefined keywords. These keywords, along with a synopsis of what
gets printed, are shown in Table 8.1. All values are taken at the time the
tracepoint is hit.
Breakpoint Filters
Breakpoint filters allow developers to trigger breakpoints only when certain conditions are met. Breakpoints may be filtered by machine name, process, and thread. The list of different breakpoint filters is
shown in Table 8.2.
Naming Threads
When debugging a multi-threaded application, it is often useful to assign
unique names to the threads that are used in the application. In
Chapter 5, you learned that assigning a name to a thread in a managed
application was as simple as setting a property on the thread object. In
this environment, it is highly recommended that you set the name field
when creating the thread, because managed code provides no way to
identify a thread by its ID.
In native Windows code, a thread ID can be directly matched to an individual thread. Nonetheless, it can be hard to keep track of individual thread IDs, which makes the job of debugging more difficult. An astute reader might have noticed in Chapter 5
the conspicuous absence of any sort of name parameter in the methods
used to create threads. In addition, there was no function provided to get
or set a thread name. It turns out that the standard thread APIs in Win32
lack the ability to associate a name with a thread. As a result, this
association must be made by an external debugging tool.
Microsoft has enabled this capability through predefined exceptions
built into their debugging tools. Applications that want to see a thread
referred to by name need to implement a small function that raises an
exception. The exception is caught by the debugger, which then takes
the specified name and assigns it to the associated ID. Once the
exception handler completes, the debugger will use the user-supplied
name from then on.
The implementation of this function can be found on the Microsoft
Developer Network (MSDN) Web site at msdn.microsoft.com by
searching for: setting a thread name (unmanaged). The function,
named SetThreadName(), takes two arguments. The first argument is
the thread ID. The recommended way of specifying the thread ID is to
send the value -1, indicating that the ID of the calling thread should be
used. The second parameter is the name of the thread. The
SetThreadName() function calls RaiseException(), passing in a
special thread exception code and a structure that includes the thread
ID and name parameters specified by the programmer.
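For reference, a sketch following that MSDN pattern (the exception code and the THREADNAME_INFO layout below come from the published MSDN example, not from this book's listing):

    #include <windows.h>

    const DWORD MS_VC_EXCEPTION = 0x406D1388;

    #pragma pack(push, 8)
    typedef struct tagTHREADNAME_INFO {
        DWORD  dwType;      // must be 0x1000
        LPCSTR szName;      // pointer to the thread name
        DWORD  dwThreadID;  // thread ID, or -1 for the calling thread
        DWORD  dwFlags;     // reserved, must be zero
    } THREADNAME_INFO;
    #pragma pack(pop)

    void SetThreadName(DWORD dwThreadID, const char *threadName)
    {
        THREADNAME_INFO info;
        info.dwType     = 0x1000;
        info.szName     = threadName;
        info.dwThreadID = dwThreadID;
        info.dwFlags    = 0;

        __try {
            // The attached debugger catches this exception and records the name.
            RaiseException(MS_VC_EXCEPTION, 0,
                           sizeof(info) / sizeof(ULONG_PTR),
                           (ULONG_PTR *)&info);
        }
        __except (EXCEPTION_EXECUTE_HANDLER) {
            // No debugger attached: nothing to do.
        }
    }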
Once the application has the SetThreadName() function defined,
the developer may call the function to name a thread. This is shown in
Listing 8.5. The function Thread1 is given the name Producer, indicating that it is producing data for a consumer. Note that the function is called at the start of the thread, and that the thread ID is specified as -1, which tells the debugger to associate the name with the calling thread's ID.
3. Admittedly, the function name Thread1 should be renamed to Producer as well, but it is left somewhat ambiguous for illustration purposes.
The consumer thread is named in the same way; the relevant portion of the listing is:
50 SetThreadName(-1, "Consumer");
51 while (1)
52 {
53 process_data();
54
55 Sleep(1000);
56 count++;
57 if (count > 30)
58 break;
59 }
60 return 0;
61 }
void sample_data()
{
    EnterCriticalSection(&hLock);
    m_global = rand();
    if ((m_global % 0xC5F) == 0)
    {
        // handle error
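        // note: this early-return path skips LeaveCriticalSection(&hLock)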
        return;
    }
    LeaveCriticalSection(&hLock);
}
Figure 8.1 Examining Thread State Information Using Visual Studio 2005
When you examine the state of the application, you can see that the
consumer thread is blocked, waiting for the process_data() call to
return. To see what occurred prior to this failure, access the trace buffer.
With the application stopped, call the PrintTraceBuffer() method
directly from Visual Studios debugger. The output of this call in this
sample run is shown in Figure 8.2.
Figure 8.2 Output from trace buffer after Error Condition Occurs
Examination of the trace buffer log shows that the producer thread is still making forward progress. However, no data values after the first two make it to the consumer. This, coupled with the fact that the thread state for the consumer thread indicates that the thread is stuck, points to an error where the critical section is not properly released. Upon closer inspection, it appears that the data value in line 7 of the trace buffer log is an error value. This leads back to your new error handling code, which handles the error but forgets to release the critical section. As a result, the consumer thread is blocked indefinitely, which leads to the consumer thread being starved. Technically this isn't a deadlock situation, as the producer thread is not waiting on a resource that the consumer thread holds.
The complete data acquisition sample application is provided on this book's Web site, www.intel.com/intelpress/mcp.
Not all GDB implementations support all of the features outlined here. Please refer to your system's manual pages for a complete list of supported features.
Keep in mind that the systag is the operating system's identification for a thread, not GDB's. GDB assigns each thread a unique number that identifies it for debugging purposes.
The second point to keep in mind is that GDB does not single-step all threads in lockstep. Therefore, when single-stepping a line of code in one thread, you may end up executing a lot of code in other threads prior to returning to the thread that you are debugging. If you have breakpoints set in those other threads, execution may stop in them along the way.
In this example, the thread command makes thread number 2 the active
thread.
The GDB backtrace (bt) command is applied to all threads in the system.
In this scenario, this command is functionally equivalent to: thread
apply 2 1 bt.
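For example, a short GDB session might look like the following (the thread numbers here are hypothetical):

    (gdb) info threads
    (gdb) thread 2
    (gdb) thread apply all bt

The info threads command lists every thread with its systag and GDB number, thread 2 makes thread number 2 the active thread, and thread apply all bt applies the backtrace command to every thread; on a two-thread program this last command is the equivalent mentioned above.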
Key Points
This chapter described a number of general purpose debugging
techniques for multi-threaded applications. The important points to
remember from this chapter are:
Proper software engineering principles should be followed when
writing and developing robust multi-threaded applications.
When trying to isolate a bug in a multi-threaded application, it is useful to have a log of the sequence of events that led up to the failure. A trace buffer is a simple mechanism that allows
programmers to store this event information.
Bracket events that are logged in the trace buffer with before
and after messages to determine the order in which the events
occurred.
Running the application in the debugger may alter the timing
conditions of your runtime application, masking potential race
conditions in your application.
Tracepoints can be a useful way to log or record the sequence of
events as they occur.
For advanced debugging, consider using the Intel software tools,
specifically, the Intel Debugger, the Intel Thread Checker, and
the Intel Thread Profiler.
Chapter 9
Single-Core
Processor
Fundamentals
To gain a better understanding of threading in multi-core hardware, it is best to review the fundamentals of how single-core processors
operate. During the debugging, tracing, and performance analysis of some types of programs, knowing a processor's details is a necessity rather than an option. This chapter and Chapter 10 provide the architectural concepts of processors that are pertinent to an understanding of multi-threaded programming. For internal instruction-level details, you should consult the Intel Software Developer's Guides at Intel's Web site.
This chapter discusses single-core processors as a basis for
understanding processor architecture. If you are already familiar with the
basics of processors and chipsets, you might skip this chapter and move
directly to Chapter 10.
You might be familiar with two chips in the chipset. Previously these
chips were known as the Northbridge and Southbridge and they were
connected by a shared PCI bus. Intel changed the implementation and
started using dedicated point-to-point connections or direct media
interface (DMI) between these two chips and introduced Intel Hub
Architecture (IHA), as shown in Figure 9.2. IHA replaced the Northbridge
and Southbridge with the Memory Controller Hub (MCH) and the I/O
Controller Hub (ICH). When graphics and video features are built into
the MCH, it is called the Graphics Memory Controller Hub (GMCH). A
front side bus (FSB) attaches the chipset to the main processor.
To understand the impact of the hardware platform on an application, the questions to pose are: which processor is being used, how much memory is present, what is the FSB of the system, what is the cache size, and how do the I/O operations take place? The answer to most of these questions is dictated by the processor.
The smallest unit of work in a processor is handled by a single transistor. A combination of transistors forms a logic block, and a set of logic blocks creates a functional unit; some examples are the Arithmetic Logic Unit (ALU), Control Units, and Prefetch Units. These functional units receive instructions to carry out operations. Some functional units
are more influential than others and some remain auxiliary. The functional units, or blocks, form a microprocessor or Central Processing Unit (CPU). A high-level block diagram of a microprocessor is shown in Figure 9.3(a). The manufacturing process of a microprocessor produces a physical die, and the packaged die is called the processor. Figure 9.3(b) shows a photo of a die. Inside a computer system, the processor sits on a socket; the physical processor and socket are shown in Figure 9.3(c). Sometimes the processor is referred to as the CPU. For simplicity's sake, this book uses processor and microprocessor interchangeably. Different processors usually have a different number of functional units.
[Figure 9.2: Intel Hub Architecture. On the motherboard, the processor connects over the front side bus to the chipset's Memory Controller Hub (MCH), which attaches the required physical memory and optional add-on graphics; the MCH connects through the Direct Media Interface (DMI) to the I/O Controller Hub (ICH), which handles optional add-on cards.]
[Figure 9.3: (a) high-level block diagram of a microprocessor, showing the on-die cache (L0 ... Ln), Local APIC (Advanced Programmable Interrupt Controller), control logic, interface unit, execution resources (ALU, control unit), and register array; (b) photo of a die; (c) a single-core or multi-core processor with heat sink seated in a single socket.]
[Figure 9.4 callouts: the fetch functional block gets instructions from main memory during the fetch cycle, the control logic decodes and executes the operations, and after operations complete inside the processor the results go back to main memory during the post-execution cycle. The processor internals repeat those of Figure 9.3(a): on-die cache (L0 ... Ln), Local APIC, control logic, interface unit, execution resources, and register array.]
Figure 9.4 Processor Attached with the System Bus Showing Basic
Operational Steps
The Local APIC handles the interrupts delivered to its own processor, not the I/O APIC. The I/O APIC is a part of the chipset that supports interrupt handling of different I/O devices through the Local APIC. The I/O APIC is an off-chip unit and usually a part of a multi-processor-based chipset.
The interface unit is the functional block that helps to
interface a processor with the system bus or front side bus
(FSB).
The register array is the set of registers present in a processor.
The number of registers can vary significantly from one
generation of processor to another: 32-bit processors without
Intel Extended Memory 64 Technology (Intel EM64T) have
only eight integer registers, whereas 64-bit Itanium
processors have 128 integer registers.
The execution resources include the integer ALU, Floating-Point execution, and branch units. The number of execution units and the number of active execution units per cycle (referred to as the number of issue ports) vary from processor to processor. The execution speed of the functional blocks varies as well, and these implementations get improved from generation to generation. The number of execution units is sometimes referred to as the machine width. For example, if the processor has six execution units, the processor is said to be a six-wide machine.
Other types of functional blocks are available in the processor and they
vary with respect to the type of processor as well. There are areas in a
processor referred to as queues that temporarily retain instructions prior
to going into the next phase of operation through the pipeline stages.
The scheduler is another functional block. It determines when micro-
operations are ready to execute based on the readiness of their
dependent input register operand sources and the availability of the
execution resources the micro-operations need to complete their
operation.
The execution flow of operations in a processor is shown in
Figures 9.5 and 9.6. These figures depict the basic four steps of the
pipeline: fetch, decode, execute, and write. In reality the process is
somewhat more complicated.
[Figures 9.5 and 9.6: execution flow in a processor. Instructions from the static program are fetched (with branch prediction), dispatched into the window of execution, issued to the execution units, and finally reordered and committed; the L1 and L2 caches feed the processor.]
1. A processor with a single pipeline is called a scalar processor and a CPU with multiple pipelines is called a superscalar processor.
[Figure: comparison of how sequential machine code uses execution units: a single execution unit is used efficiently but is limited, multiple execution units are difficult to use efficiently, and compiler-driven designs with massive resources use the execution units more efficiently but are more complex.]
Key Points
Understanding the basics of a single-core processor is essential to
comprehend how threading works on a multi-core processor. The
important concepts and terms to keep in mind are:
There are different functional blocks that form a microprocessor, such as the Arithmetic Logic Unit, Control Units, and Prefetch Units.
A chipset is used to interface the processor to physical memory
and other components.
A processor is the container of the dies, and the die is the
microprocessor or CPU. In loose terms, processor and
microprocessor get used interchangeably.
The high-level operations for multi-core processors remain the
same as for single-core processors.
Two fundamentally different generic architectures are available
from Intel: wide superscalar and EPIC.
Now that the basic building blocks of a processor have been covered, the
following chapter explores multi-core processor architecture from a
hardware perspective, focusing on the Pentium 4 processor and Itanium
architecture.
Chapter 10
Threading on Intel
Multi-Core
Processors
The concepts of threading from a software perspective were covered in previous chapters. Chapter 2 also touched briefly on threading
inside hardware and Chapter 9 covered the concepts of single-core
processors. This chapter describes in more detail what threading inside
hardware really means, specifically inside the processor. Understanding
hardware threading is important for those developers whose software
implementation closely interacts with hardware and who have control
over the execution flow of the underlying instructions. The degree to
which a developer must understand hardware details varies. This chapter
covers the details of the multi-core architecture on Intel processors for
software developers, providing the details of hardware internals from a
threading point of view.
Hardware-based Threading
Chapter 9 describes the basics of the single-core processor. In most
cases, threaded applications use this single-core multiple-issue
superscalar processor. The threading illusion materializes from the
processor and that is called instruction level parallelism (ILP). This is
done through a context-switch operation. The operational overhead of
context switching should be limited to a few processor cycles. To
perform a context switch operation, the processor must preserve the
current processor state of the current instruction before switching to
[Figure: issue-slot diagrams comparing a single-issue, single-thread processor, a single-threaded superscalar, fine-grained (interleaved) multi-threading on a superscalar, and coarse-grained (blocked) multi-threading on a superscalar; unused cycles appear as empty issue slots.]
The concept of CMP has been around for a while in the specialized
processor domain, in areas like communication and DSP. In CMP
technology, multiple processors reside on a single die. To extend the
usability of the CMP in the general processor domain, the industry
introduced the concept of multi-core processors, which are slightly
different than CMPs even though many publications started using CMP as
a generic description of multi-core processors. In CMPs, multiple
processors get packed on a single die, whereas for multi-core processors,
a single die contains multiple cores rather than multiple processors. The
concept of CMP can be confused with the existence of multiprocessor
systems-on-chip (MPSoC). CMP and MPSoC are two different types of
processors used for two different purposes. CMP is for general-purpose
ISA hardware solutions, whereas MPSoC is used in custom architectures.
In simple terms, the CMP is the one that has two or more conventional
processors on a single die, as shown in Figure 10.4.
[Figure: die diagram of a dual-core processor with 2-way multi-threading per core, showing a 1 MB L2 instruction cache and 256 KB L2 cache per core, L3 tags, two 12 MB L3 caches, bus logic, an arbiter, and I/O.]
launch of 64-bit Itanium processors. The next wave from Intel came
with the addition of dual-core processors in 2005, and further
developments are in the works. To understand Intel threading solutions
from the processor level and the availability of systems based on these
processors, review the features of brands like Intel Core Duo, Intel
Pentium Processor Extreme Edition, Intel Pentium D, Intel Xeon, Intel Pentium 4, and Intel Itanium 2. As stated before, when you are going to select different types of processors for your solution, you have to make sure the processor is compatible with the chipset. To learn more details on processors and compatibility, visit the Intel Web site.
Hyper-Threading Technology
Hyper-Threading Technology (HT Technology) is a hardware mechanism
where multiple independent hardware threads get to execute in a single
cycle on a single superscalar processor core, as shown in Figure 10.5.
The implementation of HT Technology was the first SMT solution for general processors from Intel. In the current generation of Pentium 4 processors, only two threads run on a single core by sharing and replicating processor resources.
[Figure 10.5: a pool of threads T0-T5 scheduled over time on a single CPU versus on a processor with HT Technology, where two logical processors (LP0 and LP1) each execute threads, giving two threads per processor.]
[Figure: a multiprocessor (MP) system with two physical packages, each containing one architectural state, execution engine, Local APIC, and bus interface, compared with an HT Technology package in which two logical processors each have their own architectural state and Local APIC but share the execution engine and bus interface.]
shared with entries that are tagged with logical processor IDs. The
decode logic preserves two copies of all the necessary states required to
perform an instruction decode, even though the decoding operations are
done through a coarse-grained scheme in a single cycle.
[Figure: front end of a processor with HT Technology. The BTB and ITLB feed the decoder and microcode ROM; decoded micro-ops flow into the execution trace cache and the micro-op queue, and then to the allocator/register renamer; the L2 cache and control and the FSB bus interface connect to the system bus. Legend: ITLB = Instruction Translation Lookaside Buffer, DTLB = Data Translation Lookaside Buffer, BTB = Branch Target Buffer, AGU = Address Generation Unit, FPx = Floating Point (FP, MMX, SSE, SSE2, SSE3), FPm = Floating Point Move, FXCH.]
The decode logic passes decoded instructions to the trace cache, also referred to as the advanced instruction cache. In reality, this is somewhat different than the conventional instruction cache. The trace cache stores already-decoded instructions in the form of micro-ops and maintains data integrity by associating a logical processor ID. The inclusion of the trace cache helps to remove the complicated decode logic from the main execution phase. The trace cache orders the decoded micro-ops into program-ordered sequences, or traces. If both hardware threads need access to the trace cache, the trace cache provides access with a fine-grained approach rather than a coarse-grained one. The trace cache can hold up to 12K micro-ops, but not every required instruction can reside in the trace cache. That is why, when the requested instruction's micro-ops are not available in the trace cache, they need to be brought in from the L2 cache; this event is called a Trace Cache Miss. On the other hand, when the instruction's micro-ops are available in the trace cache and the instruction flow does not need to take extra steps to get the required instructions from L2, the event is referred to as a Trace Cache Hit.
In the front end of a processor with HT Technology, both hardware threads make independent progress and keep data association. The micro-op queues decouple the front end from the execution core and have a hard partition to accommodate the two hardware threads. Once the front end is ready to prepare the microcode, the operational phase gets transferred to the back-end out-of-order execution core, where appropriate execution parallelism takes place among the microcode streams. This is done with the help of distributor micro-op queues and schedulers, which keep the correct execution semantics of the program. To maintain register resource allocation for the two hardware threads, two Register Allocation Tables (RATs) support the two threads. The register renaming operation is done in parallel with the allocator logic. The execution is done by the advanced dynamic execution engine (DEE) and the rapid execution engine (REE). Six micro-ops get dispatched in each cycle through the DEE, and certain instructions are executed in each half cycle by the REE. When the two hardware threads want to utilize the back end, each thread gets its allocation through a fine-grained scheme, and a policy is established to limit the number of active entries each hardware thread can have in each scheduler queue. To provide ready micro-ops for the different ports, the collective dispatch bandwidth across all of the schedulers is twice the number of micro-ops received by the out-of-order core.
The out-of-order execution core allows instructions from both threads, interleaved in an arbitrary fashion, to complete execution, and it places issued micro-ops in the reorder buffer by alternating between the two hardware threads. If one hardware thread is not ready to retire micro-ops, the other thread can utilize the full retirement bandwidth.
Multi-Core Processors
To understand multi-core processors, this section extends the concepts
of single core and differentiates the meaning of core from that of
processor. The following sections also cover the basics of the multi-core
architecture, what is available today, and what may be available beyond
multi-core architecture.
Architectural Details
Chapter 9 reviewed how a single-core processor contains functional
blocks, where most of the functional blocks perform some specific
operations to execute instructions. The core in this case is a combination
of all of the functional blocks that directly participate in executing
instructions. The unified on-die Last Level Cache (LLC) and Front Side
Bus (FSB) interface unit could be either part of the core or not,
depending on the configuration of the processor.
Some documents exclusively differentiate between core and
execution core. The only difference is that an execution core is the main
set of functional blocks that directly participate in an execution, whereas
core encompasses all available functional blocks. To remain consistent,
this book tries to distinguish the differences. In Figure 10.9, different core configurations are shown. Using a shared LLC in a multi-core processor reduces cache coherency complexity, but there needs to be a mechanism by which each cache line keeps an identifying tag for core association, or the cache is split dynamically among all the cores. Also, when the FSB interface gets shared, this helps to minimize FSB traffic. Proper utilization of a multi-core processor also comes from a compatible chipset.
[Figure 10.9: different core configurations. Each configuration shows an execution core (EC), a Last Level Cache (LLC), and an FSB bus interface unit attached to the system bus (FSB); the panels include (d) a multi-core processor with two cores and a shared FSB and (e) a multi-core processor with two cores and a shared LLC and FSB.]
The number of cores can vary, but the cores remain symmetrical; that is why you see product announcements for two, four, eight, or more cores in processors. You will be seeing the number of cores represented as 2^n (where, in theory, 0 < n < ∞). Projected theoretical representations always remain blocked by available technologies. With the constraints of current technology, the proposed geometry of current and upcoming multi-core processors is shown in Table 10.2.
Table 10.3 shows only two physical cores. The number of threads supported by these processors is currently limited to two per core, but with respect to the platform, the number of threads varies with the chipset where these processors are being used. If the chipset supports N processors, then the number of hardware threads for that platform can be as high as N × 2 × 2. For the latest updates about available processors, visit the Intel Web site.
[Figure: layout of a dual-core Itanium-family processor die. Each core contains branch units (B), integer units (I), memory/integer units (M), and floating point units (F), along with queues/control and a 12 MB L3 cache; the two cores connect through synchronizers and an arbiter to the system interface.]
The FSB interface is shared among the cores. Montecito supports two
cores in each socket and two hardware threads on each core. So, one
socket has four contexts. This can be seen as comparable to a dual-core
platform with HT Technology.
In Montecito, each core attaches to the FSB interface through the
arbiter, which provides a low-latency path for a core to initiate and
respond to system events while ensuring fairness and forward progress.
[Figure: (a) a single-core Itanium 2 processor, with one execution core (EC) and its Last Level Cache (LLC) on the system bus (FSB), compared with (b) a Montecito processor, in which two execution cores each have their own Last Level Cache and share the system bus.]
Each core tracks the occupancy of the arbiter's queues using a credit
system for flow control. As requests complete, the arbiter informs the
appropriate core of the type and number of de-allocated queue entries.
The cores use this information to determine which, if any, transaction to
issue to the arbiter. The arbiter manages the system interface protocols
while the cores track individual requests. The arbiter tracks all in-order
requests and maintains the system interface protocol. Deferred or out-of-
order transactions are tracked by the core with the arbiter simply passing
the appropriate system interface events on to the appropriate core. The arbiter has the ability to support various legacy configurations by adjusting where the agent identifier (socket, core, and/or thread) is driven on the system interface. The assignment of socket and core must be made at power-on and cannot be changed dynamically. The assignment of a thread is fixed, but the driving of the thread identifier is under Processor Abstraction Layer (PAL) control, since it is for information purposes only and is not needed for correctness or forward progress.
In the core, one thread has exclusive access to the execution resources (the foreground thread) for a period of time while the other thread is suspended (the background thread). Thread control logic evaluates the threads' ability to make progress and may dynamically decrease the foreground thread's time quantum if it appears that it will make less effective use of the core than the background thread. This ensures better overall utilization of the core resources than strict temporal multi-threading (TMT) and effectively hides the cost of long-latency operations such as memory accesses, especially misses in the on-die LLC cache, which has a latency of 14 cycles. Other events, such as the time-out and forward progress events, provide fairness, and switch hint events provide paths for the software to influence thread switches. These events have an impact on a thread's urgency, which indicates a thread's ability to effectively use core resources. Many switch events change a thread's urgency, or the prediction that a thread is likely to make good use of the core resources. Each thread has an urgency value that is used as an indication of a thread's ability to make effective use of the core execution resources. The urgency of the foreground thread is compared against that of the background thread at every LLC event. If the urgency of the foreground thread is lower than that of the background thread, then the LLC event may initiate a thread switch. Thread switches may be delayed from when the control logic requests a switch to when the actual switch occurs.
Figure 10.13 Relationship of Each Processor's Local APIC, I/O APIC, and System Bus in a Multiprocessor System [External interrupts arrive at the I/O APIC in the chipset over the PCI bus and are delivered, along with inter-processor interrupts (IPIs), to the Local APIC of each processor over the system bus.]
Again, by employing the IPI scheme, the system ensures that all
waiting threads are executed immediately after the synchronization
object has been triggered, and executed with a predictable delay time,
much less than the normal rescheduling period.
But in some situations a thread's wait time does not exceed the time quantum granted to the thread by the operating system. In this case it would be inefficient to reschedule the thread's execution by returning
control to the OS or by making other threads issue an IPI to wake up
your waiting thread, since the interrupt delivery delay may be much
greater than the actual wait interval. The only solution would be to keep
control and wait for other threads in a loop. This is where the hardware
monitor/wait approach yields the best benefit to a programmer, because
one can boost the performance by providing a hint to the processor that
the current thread does not need any computational resources since all it
does is wait for a variable to change.
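A minimal sketch of such a short spin-wait, using the PAUSE instruction as the hint (the flag variable and the use of the _mm_pause() intrinsic are assumptions; any equivalent pause idiom works):

    #include <emmintrin.h>      // _mm_pause()

    volatile int dataReady = 0; // set by another thread

    void wait_for_data(void)
    {
        while (!dataReady)
            _mm_pause();        // hint: this thread is only spin-waiting
    }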
Power Consumption
You might be surprised to find this section in a software book. You
might know that you can control your system power by using available
system level APIs such as GetSystemPowerStatus and
GetDevicePowerState. Mobile application developers understand the
issue with power more than others. Traditionally, systems and
applications have been designed for high performance. In fact, our entire
discussion up to this point has been concerned with architectural and
programming innovations to increase performance of applications and
systems. However, recently the power consumption of the platform has
become a critical characteristic of a computing system.
Power Metrics
Increases in power consumption have occurred despite dramatic and revolutionary improvements in process technology and circuit design. The primary reason behind the increase in power has been the continued emphasis on higher performance. As the complexity and frequency of processors have increased over the years to provide unprecedented levels of performance, the power required to supply these processors has increased steadily too. A simplified equation that demonstrates the power-performance relationship for the CMOS circuits on which all modern processors are based is:
P ≈ A C V² f
where A is the activity factor (how often the transistors switch), C is the switched capacitance, V is the supply voltage, and f is the clock frequency.
[Figure: Intel silicon process roadmap. The nominal feature size shrinks by roughly 0.7X every 2 years across technology generations: 90 nm (2003), 65 nm (2005), 45 nm (2007), 32 nm (2009), and beyond (2011+), moving from research through development to manufacturing and using techniques such as strained silicon and SiGe source/drain.]
Key Points
When developing a software application, the focus usually remains on
the implementation layer and the layer below. Several layers separate the
abstracted application and the hardware. With the recent development of
more than one core in a single package, developers have to consider
every component in the solution domain to optimize the capabilities of
these new processors.
Overview
Intel has been working with multiprocessor designs and the tools to
support them for well over a decade. In order to assist programmers,
Intel has made available a number of tools for creating, debugging, and
tuning parallel programs.
Investigate
Most programming work begins with an existing application. For those working to program something entirely new, it often begins with a prototype of a framework or of the critical elements of the application. Whether a prototype or a preexisting application, some initial investigation plays a critical role in guiding future work. Tools such as the Intel VTune Performance Analyzer and the Intel Thread Profiler are extremely useful. The Intel compilers can play a strong role in what-if experiments by simply throwing some switches, or inserting a few directives in the code, and doing a recompile to see what happens.
Create/Express
Applications are written in a programming language, so a compiler is a
natural place to help exploit parallelism. No programming languages
in wide usage were designed specifically with parallelism in mind.
This creates challenges for the compiler writer to automatically find
and exploit parallelism. The Intel compilers do a great deal to find
parallelism automatically. Despite this great technology, there are too
many limitations in widely used programming languages, and
limitations in the way code has been written for decades, for this
to create a lot of success. Automatic parallelization by the compiler
is nevertheless a cheap and easy way to get some help
all automatically. Auto-parallelization is limited by all popular
programming languages because the languages were designed without
regard to expressing parallelism. This is why extensions like OpenMP
are needed, but they are still limited by the programming languages
they extend. There is no cheap and easy way to achieve parallelism
using these languages.
To overcome limitations imposed by conventional programming
languages, the Intel compilers support OpenMP, which allows a
developer to add directives to the code base that specify how different
code segments may be parallelized. This allows programs to get
significant performance gains in a simple, easy-to-maintain fashion. The
OpenMP extensions have been covered in some detail in Chapter 6. Intel
libraries also help make the production of threaded applications easier. In
this case, Intel engineers have done the work for you and buried it in the
implementation of the libraries. These may be the very same libraries you
were using before threading.
Debugging
Having multiple threads combine to get the work of an application done
gives rise to new types of programming errors usually not possible with
single-threaded applications. Up until recently, these threading errors were simply bugs that needed to be debugged the old-fashioned way: seek and find. With the Intel Thread Checker, developers can directly locate
threading errors. It can detect the potential for these errors even if the
error does not occur during an analysis session. This is because a well-
behaved threaded application needs to coordinate the sharing of memory
between threads in order to avoid race conditions and deadlock. The
Intel Thread Checker is able to locate examples of poor behavior that
should be removed by the programmer to create a stable threaded
application.
Tuning
Performance tuning of any application is best done with non-intrusive
tools that supply an accurate picture of what is actually happening on a
system. Threaded applications are no exception to this. A programmer,
armed with an accurate picture of what is happening, is able to locate
suboptimal behavior and opportunities for improvement. The Intel
Thread Profiler and the Intel VTune Performance Analyzer help tune a
threaded application by making it easy to see and probe the activities of
all threads on a system.
Intel Thread Checker
The Intel Thread Checker is all about checking to see that a threaded
program is not plagued by coding errors in how threads interoperate that
can cause the program to fail. It is an outstanding debugging tool, even
for programs that seem to be functioning properly. Just knowing that
such a tool exists is a big step since this is such a new area for most
programmers. Finding this class of programming error is especially
difficult and frustrating because the errors manifest themselves as
nondeterministic failures that often change from run to run of a program
and most often change behavior when being examined using a debugger.
Developers use the Intel Thread Checker to locate a special class of
threading coding errors in multi-threaded programs that may or may not
be causing the program to fail. The Intel Thread Checker creates diagnostic messages for places in a program where its behavior in a multi-threaded run may be incorrect or may differ from run to run.
How It Works
The Intel Thread Checker can do its analysis using built-in binary
instrumentation and therefore can be used regardless of which compiler is
used. This is particularly important with modern applications that rely on
dynamically linked libraries (DLLs) for which the source code is often
unavailable. The Intel Thread Checker is able to instrument an application
and the shared libraries, such as DLLs, that the application utilizes.
When combined with the Intel compiler and its compiler-inserted
instrumentation functionality, Intel Thread Checker gives an even better
understanding by making it possible to drill down to specific variables on
each line. Figure 11.1 shows the diagnostic view and Figure 11.2 shows
the source view of the Intel Thread Checker.
Usage Tips
Because the Intel Thread Checker relies on instrumentation, a
program under test will run slower than it does without
instrumentation due to the amount of data being collected.
Therefore, the most important usage tip is to find the smallest data
set that will thoroughly exercise the program under analysis.
Selecting an appropriate data set, one that is representative of your
code without extra information, is critical so as not to slow execution
unnecessarily. It is generally not practical to analyze a long program
or run an extensive test suite using this tool.
In practice, three iterations of a loop (first, middle, and last) are usually sufficient to uncover all the problems that the Intel Thread Checker is able to find within each loop. The exception is when if conditions within the loop do different things on specific iterations.
Because of the overhead involved in the Intel Thread Checker operation,
you should choose a data set for testing purposes that operates all the
loops that you are trying to make run in parallel, but has the smallest
amount of data possible so that the parallel loops are only executed a
small number of iterations. Extra iterations only serve to increase the
execution time. If you have a particular section of code you would like to
focus on, you can either craft your data and test case to exercise just that
part, or you can use the Pause/Resume capabilities of the Intel Thread
Checker.
The Intel Thread Checker prioritizes each issue it sees as an error,
warning, caution, information, or remark, as shown in Figure 11.3.
Sorting errors by severity and then focusing on the most important issues
first is the best way to use the tool.
Before you prepare your code for use with the Intel Thread
Checker, you should ensure that your code is safe for parallel
execution by verifying that it is sequentially correct. That is, debug it
sequentially before trying to run in parallel. Also, if your language or
compiler needs special switches to produce thread-safe code, use
them. This comes up in the context of languages like Fortran, where
use of stack (automatic) variables is usually necessary, and not always
the default for a compiler. The appropriate switch on the Intel Fortran Compiler is /Qauto. Use of this option on older code may cause
issues, and the use of a SAVE statement in select places may be
required for subroutines that expect variables to be persistent from
invocation to invocation.
[Bar chart: diagnostic groups by severity (Filtered, Error, Warning, Caution, Information, Remark, Unclassified) versus number of occurrences (0 to 5).]
Figure 11.3 Intel Thread Checker Bar Chart with Error Categories
Intel Compilers
Just as with previous hardware technologies, the compiler can play a central or supporting role in taking advantage of the multi-processor/multi-core/multi-threading capabilities of your shared-memory system.
OpenMP
In Chapter 6, you learned how OpenMP can be used as a portable
parallel solution and that Intel compilers have support for OpenMP
within the Windows and Linux environments. Intel compilers support
all the implementation methodologies discussed in Chapter 6. At the
time of this writing, Version 9.1 of the Intel compilers support the
OpenMP API 2.5 specification as well as the workqueuing extension, a
feature proposed by Intel for OpenMP 3.0. To get the Intel compiler to
recognize your OpenMP constructs, compile with the following
switch:
Windows: /Qopenmp
Linux: -openmp
Atomic
The OpenMP Atomic directive is probably the most obvious example of a feature where the compiler is able to provide a fast implementation. The atomic directive is similar to a critical section, in that only one thread may enter the atomic section of code at a time, but it places the limitation on the developer that only very simplistic and specific statements can follow it.
When you use the atomic directive as follows:
#pragma omp atomic
workunitdone++;
The compiler can issue the following instructions that allow the hardware to atomically add one to the variable:
    mov       eax, 0x1
    lock xadd DWORD PTR [rcx], eax
This is much more efficient than locking the code using a critical section
or a mutex, then updating the variable, and finally releasing the lock,
which can take hundreds or thousands of cycles, depending on the
implementation. This could be created using inline assembly or compiler
intrinsics, except that then the code would not be portable to other
architectures or OS environments.
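For comparison, the heavier lock-based form the text mentions could be written with an OpenMP critical section (a sketch; the variable name is taken from the atomic example above):

    #pragma omp critical
    {
        workunitdone++;
    }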
The Intel compilers will perform other optimization algorithms when compiling OpenMP code; the atomic example was chosen for its simplicity. As optimization techniques are developed by Intel's compiler developers, those techniques usually get added to the compiler so that everyone who uses OpenMP with the Intel compiler benefits, whether they are aware of it or not.
Auto-Parallel
The Intel compilers have another feature to help facilitate threading. The
auto-parallelization feature automatically translates serial source code into
equivalent multi-threaded code. The resulting binary behaves as if the
user inserted OpenMP pragmas around various loops within their code.
The switch to do this follows:
Windows: /Qparallel
Linux: -parallel
For some programs this can yield a free performance gain on SMP
systems. For many programs the resulting performance is less than expected, but don't give up on the technology immediately. There are several things that can be done to increase the probability of a performance gain from this auto-parallel switch.
Increasing or decreasing the threshold for which loops will be made
parallel might guide the compiler in creating a more successful binary.
The following switch guides the compiler heuristics for loops:
Windows: /Qpar_threshold[:n]
Linux: -par_threshold[n]
where the condition 0 <= n <= 100 holds and represents the threshold for the auto-parallelization of loops. If n=0, then loops get auto-parallelized regardless of the amount of computational work they contain; larger values require the compiler to be more confident that parallelization will pay off. The compiler can also emit an auto-parallelization report (the Linux form, -par_report[n], is used in the example below):
Windows: /Qpar_report[:n]
Linux: -par_report[n]
where 0 <= n <= 3. If n=3, then the report gives diagnostic information
about the loops it analyzed. The following demonstrates the use of this
report on a simplistic example. Given the following source:
1 #define NUM 1024
2 #define NUMIJK 1024
3 void multiply_d( double a[][NUM], double b[][NUM],
4 double c[][NUM] )
5 {
6 int i,j,k;
7 double temp;
8 for(i=0; i<NUMIJK; i++) {
9 for(j=0; j<NUMIJK; j++) {
10 for(k=0; k<NUMIJK; k++) {
11 c[i][j] = c[i][j] + a[i][k] * b[k][j];
12 }
13 }
14 }
15 }
The compiler produces the following report:
$ icc multiply_d.c -c -parallel -par_report3
procedure: multiply_d
serial loop: line 10: not a parallel candidate due to insufficent
work
serial loop: line 8
anti data dependence assumed from line 11 to line 11, due to "b"
anti data dependence assumed from line 11 to line 11, due to "a"
flow data dependence assumed from line 11 to line 11, due to "c"
flow data dependence assumed from line 11 to line 11, due to "c"
serial loop: line 9
anti data dependence assumed from line 11 to line 11, due to "b"
anti data dependence assumed from line 11 to line 11, due to "a"
flow data dependence assumed from line 11 to line 11, due to "c"
flow data dependence assumed from line 11 to line 11, due to "c"
Based on this report, you can see the compiler thinks a dependency exists between iterations of the loop on the a, b, and c arrays. This dependency is due to an aliasing possibility: basically, it is possible that the a or b array points to a memory location within the c array. It is easy to notify the compiler that this is not possible.1 To handle such instances, any of the following techniques can be used:
Inter-Procedural Optimization (IPO)
Windows: /Qipo
Linux: -ipo
Restrict keyword
Aliasing switches: /Oa, /Ow, /Qansi_alias
#pragma ivdep
After modifying the code as follows:
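(The modified listing itself falls on a page that is not reproduced here; a sketch of one possible modification, using the C99 restrict keyword from the list above so the compiler knows the arrays cannot overlap, follows.)

    #define NUM 1024
    #define NUMIJK 1024
    void multiply_d( double a[restrict][NUM], double b[restrict][NUM],
                     double c[restrict][NUM] )
    {
        int i, j, k;
        for (i = 0; i < NUMIJK; i++) {
            for (j = 0; j < NUMIJK; j++) {
                for (k = 0; k < NUMIJK; k++) {
                    c[i][j] = c[i][j] + a[i][k] * b[k][j];
                }
            }
        }
    }

With the Intel compiler, recognizing the restrict keyword typically requires enabling C99 or restrict support (for example, the -restrict switch on Linux); #pragma ivdep or the IPO and aliasing switches listed above are alternatives.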
1. In this case, the programmer assumes the responsibility of ensuring that this aliasing doesn't occur. If the programmer is wrong, unpredictable results will occur.
With this technique, a helper thread brings data into the cache before the main thread needs the data. Since the hardware threading resources would have been idle otherwise, this technique effectively eliminates performance penalties associated with memory
latencies. This technique will work for any system that can execute
threads simultaneously and includes a shared cache that multiple threads
can access directly.
In order for this technique to yield a performance gain the compiler
needs detailed data about cache misses within your application. The
compiler needs to gather an execution profile of your application and
data on cache misses from the Performance Monitoring Unit (PMU) in
order to identify where cache misses are occurring in your application.
Intel Debugger
Chapter 8 covered a number of general purpose debugging techniques
for multi-threaded applications. In order to provide additional help to
developers Intel has developed a debugging tool appropriately named the
Intel Debugger (IDB). The Intel Debugger is shipped as part of the Intel
compilers. It is a full-featured symbolic source-code application debugger
that helps programmers to locate software defects that result in run-time
errors in their code. It provides extensive debugging support for C, C++
and Fortran, including Fortran 90. It also provides a choice of control
from the command line, including both dbx and gdb modes, or from a
graphical user interface, including a built-in GUI, ddd, Eclipse CDT, and
Allinea DDT.
The Intel compilers enable effective debugging on the platforms they
support. Intel compilers are debugger-agnostic and work well with
native debuggers, the Intel Debugger, and selected third-party debuggers.
By the same token, the Intel Debugger is compiler-agnostic and works
well with native compilers, the Intel compilers, and selected third-party
compilers. This results in a great deal of flexibility when it comes to
mixing and matching development tools to suit a specific environment.
In addition, the Intel Debugger provides excellent support for the latest Intel processors; robust performance; superior language-feature support, including C++ templates, user-defined operators, and modern Fortran dialects (with Fortran module support); and support for Intel Compiler features not yet thoroughly supported by other debuggers.
The Intel Debugger is a comprehensive tool in general and also provides extensive support for threaded applications. Some of the advanced capabilities of the Intel Debugger for threaded applications are:
Support for both native threads and OpenMP threads
An all-threads-stop / all-threads-go execution model
Thread control acquired on attach and at thread creation
The ability to list all threads and show which thread is currently in focus
The ability to set focus to a specific thread
Breakpoints and watchpoints that apply to all threads or to a subset of threads (including a specific thread)
Most commands apply to the thread currently in focus or to any/all threads, as appropriate
Intel Libraries
Libraries are an ideal way to utilize parallelism: the library writer can hide all the parallelism, and the programmer can call the routines without needing to write parallel code. Intel has two libraries that implement functions that have been popular for many years and that Intel has gradually made more and more parallel, to the point where today they are parallelized to a great extent. Both of Intel's libraries are programmed using OpenMP for their threading and are pre-built with the Intel compilers. This is a great testimonial to the power of OpenMP, since these libraries produce exceptional performance using this important programming method.
The Future
Libraries will expand as a popular method for achieving parallelism. The need for more standardization, so that compilers, users, and libraries cooperate with regard to the creation and activation of threads, will grow. Right now, a careful programmer can pore over the documentation for libraries and compilers and sort out how to resolve potential conflicts. As time passes, we hope to see some consensus on how to solve this problem and make programming a little easier.
Intel VTune Performance Analyzer
The Intel VTune Performance Analyzer is a system-wide analysis tool that
offers event sampling and call graphs that include all information
available broken down not only by processes/tasks, but also by the
threads running within the processes. Intel Press offers a whole book on
the Analyzer, which dives into its numerous capabilities. This section
gives you just a flavor for the features, and highlights some of the ways
the Analyzer feature can be used in the tuning of threaded applications.
Users have summed up the tool by saying that the VTune analyzer
finds things in unexpected places. Users of the VTune analyzer are
enthusiastic about this tool largely because of this remarkable capability.
Threading adds a dimension to already complex modern computer
systems. It is no surprise when things happen on a system that cannot be
easily anticipated. When you seek to refine a computer system, the best
place to start is with a tool that can find these hidden problems by giving
a comprehensive performance exam.
Measurements are the key to refinement. The Intel VTune
Performance Analyzer is a tool to make measurements. It also has
wonderful features to help you understand those measurements, and
even advises you on what exceptional values may mean and what you
can do about them.
Taking a close look at the execution characteristics of an application can guide decisions about how to thread it. Starting with the hotspots, the main performance bottlenecks in the application, one can see whether threading can be applied to that section of code. Hotspots are found using the event-sampling features of the VTune analyzer. If a hotspot is in a location with little opportunity for parallelism, a hunt up the calling sequence will likely find better opportunities; the calling sequence can be traced back using the call-graph capability of the analyzer. Thread implementations can then be refined by examining balance with the Samples Over Time feature and the Intel Thread Profiler in the analyzer.
Figure 11.4 Sampling Results Using the Intel VTune Performance Analyzer
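A minimal sketch of moving the parallelism up the call chain (the functions and data layout are hypothetical): if sampling puts the hotspot inside a small routine whose own loop is too short to thread profitably, the call graph may point to a caller whose iterations are independent, which is a much better place to introduce OpenMP.

    /* Hotspot identified by sampling: too little work per call to thread profitably. */
    static double score_one(const double *row, int n)
    {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += row[j] * row[j];
        return s;
    }

    /* Caller found through the call graph: its iterations are independent,
       so this loop, not the hotspot itself, is parallelized. */
    void score_all(const double *matrix, double *scores, int rows, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < rows; i++)
            scores[i] = score_one(&matrix[(long)i * n], n);
    }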
If you can distribute the work currently done on one processor across two processors, you can theoretically double the performance of an application. Amdahl's law, however, reminds us that we cannot make a program run faster than its sequential portion (the part not written to run in parallel) allows, so don't expect doubled performance every time.
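As a quick worked example (the 80-percent figure is chosen purely for illustration), the simple form of Amdahl's law is

    speedup = 1 / ((1 - p) + p / n)

where p is the fraction of execution time that runs in parallel and n is the number of cores. With p = 0.80 and n = 2, the speedup is 1 / (0.20 + 0.40), or about 1.67, rather than 2.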
Intel Thread Profiler
The Intel Thread Profiler is implemented as a view within the VTune
Performance Analyzer, but it is so significant that it should be discussed
as if it were an entirely separate product. Unlike the rest of the VTune
analyzer, the Intel Thread Profiler is aware of synchronization objects
used to coordinate threads. Coordination can require that a thread wait,
so knowing about the synchronization objects allows Intel Thread
Profiler to display information about wait time, or wasted time. The Intel
Thread Profiler helps a developer tune for optimal performance by
providing insights into synchronization objects and thread workload
imbalances that cause delays along the longest flows of execution.
The Intel Thread Profiler shows an application's critical path as it moves from thread to thread, helping a developer decide how to use threads more efficiently, as shown in Figure 11.7. It can identify the synchronization issues and excessive blocking time that cause delays in Win32, POSIX threaded, and OpenMP code. It can also show thread workload imbalances, so a developer can work to maximize threaded application performance by maximizing the application time spent in parallel regions doing real work. The Intel Thread Profiler has special knowledge of OpenMP and can graphically display the performance results of a parallel application that has been instrumented with calls to the OpenMP statistics-gathering run-time engine.
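A minimal sketch of the kind of imbalance the Thread Profiler makes visible (the triangular work pattern is hypothetical): with a static schedule, threads that receive the early, short iterations finish long before the threads that receive the late, long ones, and the idle time shows up as time spent off the critical path doing no real work. A dynamic schedule is one common remedy.

    #include <math.h>

    /* Iteration i does work proportional to i, so a plain static schedule gives
       the last threads far more work than the first ones; the profiler shows
       the early threads sitting idle at the loop's implicit barrier. */
    void triangular_work(double *out, int n)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int j = 0; j < i; j++)
                s += sin((double)j);
            out[i] = s;
        }
    }

    /* One fix the profiler's timeline typically suggests: hand out iterations
       in small chunks as threads become free. */
    void triangular_work_balanced(double *out, int n)
    {
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int j = 0; j < i; j++)
                s += sin((double)j);
            out[i] = s;
        }
    }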
The Timeline view shows the contribution of each thread to the total program, whether on the critical path or not. The Thread Profiler can also zero in on the critical path: the Critical Paths view shows how time was spent along your program's critical path, and the Profile view displays a high-level summary of that time.
Using the VTune Performance Analyzer and the Intel Thread Profiler together gives a developer insight into the threading in an application and on the system as a whole. These analysis tools provide direct feedback, helping the developer avoid searching for opportunities through trial and error.
MPI Programming
Threading is a convenient model in which each thread has access to the memory of every other thread. That model is portable only across shared-memory machines; in general, parallel machines may not share memory between processors. While this is not the case with multi-core processors, it is important to point out that parallel programs need not be written assuming shared memory.
When shared memory is not assumed, the parts of a program communicate by passing messages back and forth. It does not matter how the messages are passed; the details of the interconnect are hidden in a library. On a shared-memory machine, such as a multi-core processor, message passing is done through shared memory. On a supercomputer with thousands of processors, it may be done through an expensive, very high speed special-purpose network. On other machines, it may be done over a local area network or even a wide area network.
For a message-passing program to be portable, a standard for the message-passing library was needed. This was the motivation behind the Message Passing Interface (MPI), which is the widely used standard for message passing. Many implementations exist, including vendor-specific versions tuned for particular machines or interconnects. Two of the most widely used are MPICH, hosted by Argonne National Laboratory, and LAM/MPI, an open-source implementation hosted by Indiana University.
MPI makes source-code portability of message-passing programs written in C, C++, or Fortran possible. This has many benefits, including protecting the investment in a program and allowing the code to be developed on one machine, such as a desktop computer, before it is run on the target machine, which might be an expensive supercomputer with limited availability.
MPI enables developers to create portable and efficient programs using tightly coupled algorithms that require nodes to communicate during the course of a computation. It consists of a standard set of API calls that manage all aspects of communication and data transfer between processes and nodes. MPI allows a program to be coordinated as multiple processes in a distributed (non-shared) memory environment, yet it is flexible enough to be used on a shared-memory system such as a multi-core processor.
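A minimal MPI sketch in C (the message contents are arbitrary): each process learns its rank, and rank 0 sends a value that rank 1 receives, using only standard MPI calls, so the same source runs unchanged on a multi-core desktop or a distributed-memory cluster.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id, 0..size-1 */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

        if (rank == 0 && size > 1) {
            int payload = 42;
            MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int payload;
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", payload);
        }

        MPI_Finalize();
        return 0;
    }

With a typical implementation, such a program is built with a compiler wrapper such as mpicc and launched with mpirun or mpiexec, specifying the number of processes; the exact commands vary by implementation.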
Figure: an MPI application layered on top of the Intel MPI Library.
Scalability is a key concern with any parallel program, and the Intel Trace Analyzer provides views that are particularly useful for a developer seeking to enhance scalability. A user can navigate through the trace data at several levels of abstraction: cluster, node, process, thread, and function.
Key Points
Parallel programming is more natural than forcing our thinking into sequential code streams. Yet developers have been trained for decades to think in sequential terms, so making the change means we all need to learn to think differently.
These are complex and powerful tools, and describing all of their features and capabilities is beyond the scope of this book. For a more complete discussion of the different features and capabilities, refer to the documentation included with each product and stay up to date with the latest information at the Intel Software Network Web site, www.intel.com/software.
Glossary
64-bit mode The mode in which 64-bit applications run on platforms
with Intel Extended Memory 64 Technology (Intel EM64T). See
compatibility mode.
advanced programmable interrupt controller (APIC) The hardware
unit responsible for managing hardware interrupts on a computing
platform.
aliasing A situation in which two or more distinct references map to the
same address in memory or in cache.
alignment The need for data items to be located on specific boundaries
in memory. Misaligned data can cause the system to hang in certain
cases, but mostly it detrimentally affects performance. Padding helps
keep data items aligned within aggregate data types.
architecture state The physical resources required by a logical
processor to provide software with the ability to share a single set of
physical execution resources. The architecture state consists of the
general purpose CPU registers, the control registers, and the
advanced programmable interrupt controller (APIC). Each copy
of the architecture state appears to software as a separate physical
processor.
associativity The means by which a memory cache maps the main RAM
to the smaller cache. It defines the way cache entries are looked up
and found in a processor.
Major Contributors
James Reinders is a senior engineer who joined Intel in 1989 and has contributed to projects including the world's first TeraFLOP supercomputer, as well as compilers and architecture work for Intel processors. He is currently director of business development and marketing for Intel's Software Development Products group and serves as its chief evangelist and spokesperson. James is also the author of the book VTune Performance Analyzer Essentials.
Arch D. Robison has been a Principal Engineer at Intel since 2000. Arch received his Ph.D. in computer science from the University of Illinois. Prior to his work at Intel, Arch worked at Shell on massively parallel programs for seismic imaging. He was lead developer for the KAI C++ compiler and holds five patents on compiler optimization.
Xinmin Tian holds a Ph.D. in computer science and leads an Intel development group working on exploiting thread-level parallelism in high-performance Intel C++ and Fortran compilers for Intel Itanium, IA-32, Intel EM64T, and multi-core architectures. Xinmin is a co-author of The Software Optimization Cookbook, 2nd Edition.
Sergey Zheltov is a project manager and senior software engineer at the Advanced Computer Center of Intel's Software Solutions Group. He holds an M.S. degree in theoretical and mathematical physics. Sergey's research includes parallel software and platform architecture, operating systems, media compression and processing, signal processing, and high-order spectra.
Multi-Core Programming
Increasing Performance through Software Multi-threading
Shameem Akhter and Jason Roberts

Discover programming techniques for Intel multi-core architecture and Hyper-Threading Technology. Software developers can no longer rely on increasing clock speeds alone to speed up single-threaded applications; instead, to gain a competitive advantage, developers must learn how to properly design their applications to run in a threaded environment. Multi-core architectures have a single processor package that contains two or more processor "execution cores," or computational engines, and deliver, with appropriate software, fully parallel execution of multiple software threads. Hyper-Threading Technology enables additional threads to operate on each core.

This book helps software developers write high-performance multi-threaded code for Intel's multi-core architecture while avoiding the common parallel programming issues associated with multi-threaded programs. Highlights include:
Elements of parallel programming and multi-threading
Programming with threading APIs
OpenMP: The portable solution
Solutions to common parallel programming problems
Debugging and testing multi-threaded applications
Software development tools for multi-threading

This book is a practical, hands-on volume with immediately usable code examples that enable readers to quickly master the necessary programming techniques. The companion Web site, www.intel.com/intelpress/mcp, contains pointers to threading and optimization tools, code samples from the book, and extensive technical documentation on Intel multi-core architecture.

ABOUT THE AUTHORS
SHAMEEM AKHTER is a platform architect at Intel, focusing on single socket multi-core architecture and performance analysis. He has also worked as a senior software engineer with the Intel Software and Solutions Group, designing application optimizations for desktop and server platforms. Shameem holds a patent on a threading interface for constraint programming, developed as a part of his master's thesis in computer science.
JASON ROBERTS is a senior software engineer at Intel Corporation. Over the past 10 years, Jason has worked on a number of different multi-threaded software products that span a wide range of applications targeting desktop, handheld, and embedded DSP platforms.