Programming with Pthreads
Outline
Execution Flow on One-Core or Multi-Core Systems
• Example: concurrent execution of threads A, B, and C on a single-core system
  (their executions are interleaved over time; pairs such as A & B, A & C, and B & C overlap in time but do not run simultaneously)
Benefits of Threads
• Responsiveness
• Resource sharing
  Shared memory
• Economy
• Scalability
  Exploit multi-core CPUs
Thread Programming with Shared Memory
• A program is a collection of threads of control.
  Threads can be created dynamically.
• Each thread has a set of private variables, e.g., local stack variables.
• Each thread also has a set of shared variables, e.g., static variables, shared common blocks, or the global heap.
  Threads communicate implicitly by writing and reading shared variables.
  Threads coordinate by synchronizing on shared variables.
Diagram: a shared variable s lives in shared memory and is visible to all threads P0 … Pn; each thread also has private memory holding its own copy of a local variable i (e.g., i = 2, 5, 8).
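A minimal sketch of this model (variable names and values are illustrative, not from the slides): the global s lives in shared memory and is visible to every thread, while each thread's local i lives on its own private stack.

#include <pthread.h>
#include <stdio.h>

int s = 0;                        /* shared: one copy, visible to all threads */

void* worker(void* arg) {
    int i = *(int*)arg;           /* private: each thread has its own i on its stack */
    s += i;                       /* unsynchronized update of shared s (a data race) */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int ids[2] = {2, 5};
    for (int k = 0; k < 2; k++)
        pthread_create(&t[k], NULL, worker, &ids[k]);
    for (int k = 0; k < 2; k++)
        pthread_join(t[k], NULL);
    printf("s = %d\n", s);        /* may vary from run to run because of the race */
    return 0;
}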
Shared Memory Programming
Several Thread Libraries/Systems
• Pthreads: the POSIX standard
  Relatively low level
  Portable but possibly slow; relatively heavyweight
• OpenMP: standard for application-level programming
  Support for scientific programming on shared memory
  http://www.openMP.org
• Java Threads
• TBB: Threading Building Blocks (Intel)
• CILK: language of the C “ilk”
  Lightweight threads embedded into C
Creation of Unix processes vs. Pthreads
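A hedged sketch of the contrast (not the original slide's code): fork() creates a new Unix process with a separate copy of the address space, while pthread_create() starts a thread that shares its process's address space.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

void* thread_fn(void* arg) {
    printf("child thread: shares the parent's memory\n");
    return NULL;
}

int main(void) {
    /* Unix process creation: fork() duplicates the address space. */
    pid_t pid = fork();
    if (pid == 0) {
        printf("child process: separate copy of memory\n");
        _exit(0);
    }
    waitpid(pid, NULL, 0);

    /* Pthread creation: the new thread shares the address space. */
    pthread_t tid;
    pthread_create(&tid, NULL, thread_fn, NULL);
    pthread_join(tid, NULL);
    return 0;
}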
C function for starting a thread (declared in pthread.h)
• One pthread_t object for each thread

int pthread_create(
    pthread_t*             thread_p         /* out */,
    const pthread_attr_t*  attr_p           /* in  */,
    void*                (*start_routine)(void*)  /* in  */,
    void*                  arg_p            /* in  */);
A closer look

int pthread_create(
    pthread_t*             thread_p         /* out */,  /* receives the new thread's id */
    const pthread_attr_t*  attr_p           /* in  */,  /* thread attributes; NULL for defaults */
    void*                (*start_routine)(void*)  /* in  */,  /* function the thread runs */
    void*                  arg_p            /* in  */); /* argument passed to start_routine */
• pthread_yield();
  Informs the scheduler that the thread is willing to yield the processor.
• pthread_exit(void *value);
  Exits the thread and passes value to a joining thread (if one exists).
Others:
• pthread_t me; me = pthread_self();
  Allows a thread to obtain its own identifier.
• Synchronizing access to shared variables:
  pthread_mutex_init, pthread_mutex_[un]lock
  pthread_cond_init, pthread_cond_[timed]wait
Compiling a Pthread program
./pth_hello
Hello from thread 0
Hello from thread 1
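A minimal pth_hello.c sketch that would produce output like the above (an assumption: the original program may instead read the thread count from the command line). It can be compiled with, e.g., gcc -g -Wall -o pth_hello pth_hello.c -lpthread.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 2

void* hello(void* rank) {
    long my_rank = (long)rank;              /* thread id passed by value */
    printf("Hello from thread %ld\n", my_rank);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, hello, (void*)t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    return 0;
}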
Difference between Single-Threaded and Multithreaded Processes
• Threads in a process share access to code and data in memory
• Each thread has a separate control flow -> separate stack and registers
CRITICAL SECTIONS
Data Race Example
static int s = 0;

Thread 0:                        Thread 1:
  for i = 0 to n/2-1               for i = n/2 to n-1
    s = s + f(A[i])                  s = s + f(A[i])

Both threads read and update the shared variable s with no synchronization, so updates can be lost.
Synchronization mechanisms:
1. Busy waiting
2. Mutex (lock)
3. Semaphore
4. Condition variables
Example of Busy Waiting
static int s = 0;
static int flag = 0;

Thread 0 (my_rank = 0):           Thread 1 (my_rank = 1):
  int temp0, my_rank;               int temp, my_rank;
  for i = 0 to n/2-1                for i = n/2 to n-1
    temp0 = f(A[i])                   temp = f(A[i])
    while flag != my_rank;            while flag != my_rank;
    s = s + temp0                     s = s + temp
    flag = (flag+1) % 2               flag = (flag+1) % 2
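A compact C sketch of the flag-based busy waiting above (A, f, and N are placeholders assumed for illustration; flag is declared volatile so the compiler does not optimize the spin away, but busy waiting like this is illustrative rather than a portable synchronization method):

#include <pthread.h>
#include <stdio.h>

#define N 1000
static int A[N];
static int s = 0;
static volatile int flag = 0;          /* whose turn it is to update s */
static int f(int x) { return x * x; }

void* work(void* arg) {
    long my_rank = (long)arg;
    long first = my_rank * (N / 2), last = first + N / 2;
    for (long i = first; i < last; i++) {
        int temp = f(A[i]);
        while (flag != my_rank)        /* spin (busy wait) until it is our turn */
            ;
        s = s + temp;                  /* only the thread whose turn it is touches s */
        flag = (flag + 1) % 2;         /* hand the turn to the other thread */
    }
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) A[i] = 1;
    pthread_t t[2];
    for (long r = 0; r < 2; r++) pthread_create(&t[r], NULL, work, (void*)r);
    for (long r = 0; r < 2; r++) pthread_join(t[r], NULL);
    printf("s = %d\n", s);
    return 0;
}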
Diagram: Thread 1 and Thread 2 each lock/acquire the mutex, execute the critical section, then unlock/release the mutex; only one thread can be inside the critical section at a time.
Mutexes in Pthreads
• To initialize: pthread_mutex_init(&mutex, NULL);
• To acquire:    pthread_mutex_lock(&mutex);
• To release:    pthread_mutex_unlock(&mutex);
• To destroy:    pthread_mutex_destroy(&mutex);
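A hedged sketch of the global-sum update from the earlier example, now protected by a Pthreads mutex (the array, f, and sizes are illustrative):

#include <pthread.h>
#include <stdio.h>

#define N 1000
static int A[N];
static int s = 0;
static pthread_mutex_t s_mutex = PTHREAD_MUTEX_INITIALIZER;
static int f(int x) { return x + 1; }

void* work(void* arg) {
    long my_rank = (long)arg;
    long first = my_rank * (N / 2), last = first + N / 2;
    int local = 0;
    for (long i = first; i < last; i++)
        local += f(A[i]);              /* private partial sum: no lock needed */
    pthread_mutex_lock(&s_mutex);      /* critical section: update the shared sum */
    s += local;
    pthread_mutex_unlock(&s_mutex);
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) A[i] = 1;
    pthread_t t[2];
    for (long r = 0; r < 2; r++) pthread_create(&t[r], NULL, work, (void*)r);
    for (long r = 0; r < 2; r++) pthread_join(t[r], NULL);
    printf("s = %d\n", s);             /* deterministic, unlike the unsynchronized version */
    return 0;
}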
Message Passing among 3 Threads (without synchronization)

Thread 0:              Thread 1:              Thread 2:
  Write a msg to #1      Write a msg to #2      Write a msg to #0
  Set msg[1]             Set msg[2]             Set msg[0]
  If msg[0] is ready     If msg[1] is ready     If msg[2] is ready
    Print msg[0]           Print msg[1]           Print msg[2]   (consume a message)

A thread may find that its own message is not ready yet, so checking the flag once is not enough.
Semaphore Synchronization with 3 Threads

Thread 0:              Thread 1:              Thread 2:
  Write a msg to #1      Write a msg to #2      Write a msg to #0
  Set msg[1]             Set msg[2]             Set msg[0]
  Post(semp[1])          Post(semp[2])          Post(semp[0])
  Wait(semp[0])          Wait(semp[1])          Wait(semp[2])
  Print msg[0]           Print msg[1]           Print msg[2]
Message sending with semaphores
sem_post(&semaphores[dest]);       /* signal the dest thread */
sem_wait(&semaphores[my_rank]);    /* wait until the source message is created */
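A sketch of this pattern for 3 threads using POSIX unnamed semaphores, each initialized to 0 (message contents are illustrative; unnamed semaphores assume Linux, since macOS does not support sem_init):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define T 3
static char msg[T][64];
static sem_t semaphores[T];            /* semaphores[k] counts messages ready for thread k */

void* send_and_receive(void* arg) {
    long my_rank = (long)arg;
    long dest = (my_rank + 1) % T;

    snprintf(msg[dest], sizeof msg[dest], "greetings to #%ld from thread %ld", dest, my_rank);
    sem_post(&semaphores[dest]);       /* signal the dest thread: its message is ready */

    sem_wait(&semaphores[my_rank]);    /* wait until the source message is created */
    printf("Thread %ld received: %s\n", my_rank, msg[my_rank]);
    return NULL;
}

int main(void) {
    pthread_t threads[T];
    for (int k = 0; k < T; k++)
        sem_init(&semaphores[k], 0, 0);    /* not shared across processes, initial value 0 */
    for (long k = 0; k < T; k++)
        pthread_create(&threads[k], NULL, send_and_receive, (void*)k);
    for (int k = 0; k < T; k++)
        pthread_join(threads[k], NULL);
    for (int k = 0; k < T; k++)
        sem_destroy(&semaphores[k]);
    return 0;
}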
Readers-Writers Problem
• Shared data
  The data set
  A lock mutex (to protect readcount)
  A semaphore wrt initialized to 1 (to synchronize between readers and writers)
  An integer readcount initialized to 0
• A writer
  do {
      sem_wait(wrt);    // acquire exclusive access via semaphore wrt
      // writing is performed
      sem_post(wrt);    // release wrt
  } while (TRUE);
Readers-Writers Problem (Cont.)
• A reader
  do {
      mutex_lock(mutex);
      readcount++;
      if (readcount == 1)
          sem_wait(wrt);     // first reader checks whether anybody is writing
      mutex_unlock(mutex);

      // reading is performed

      mutex_lock(mutex);
      readcount--;
      if (readcount == 0)
          sem_post(wrt);     // last reader: writing is allowed now
      mutex_unlock(mutex);
  } while (TRUE);
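A Pthreads rendering of the pseudocode above, with mutex as a pthread_mutex_t and wrt as a POSIX semaphore initialized to 1; the trivial main is only there to make the sketch compile and run:

#include <pthread.h>
#include <semaphore.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;   /* protects readcount */
static sem_t wrt;                                           /* initialized to 1 in main */
static int readcount = 0;

void writer(void) {
    sem_wait(&wrt);                 /* exclusive access for the writer */
    /* ... writing is performed ... */
    sem_post(&wrt);
}

void reader(void) {
    pthread_mutex_lock(&mutex);
    readcount++;
    if (readcount == 1)
        sem_wait(&wrt);             /* first reader blocks writers */
    pthread_mutex_unlock(&mutex);

    /* ... reading is performed; many readers may be here at once ... */

    pthread_mutex_lock(&mutex);
    readcount--;
    if (readcount == 0)
        sem_post(&wrt);             /* last reader lets writers proceed */
    pthread_mutex_unlock(&mutex);
}

int main(void) {
    sem_init(&wrt, 0, 1);
    writer();
    reader();
    sem_destroy(&wrt);
    return 0;
}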
Barriers
• Why? They are additional programming primitives that simplify code for synchronizing threads; a sketch follows below.
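One such primitive is the Pthreads barrier; a minimal sketch (the thread count of 4 and the two phases are arbitrary):

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
static pthread_barrier_t barrier;

void* phase_worker(void* arg) {
    long my_rank = (long)arg;
    printf("thread %ld: phase 1 done\n", my_rank);
    pthread_barrier_wait(&barrier);    /* no thread enters phase 2 until all finish phase 1 */
    printf("thread %ld: phase 2\n", my_rank);
    return NULL;
}

int main(void) {
    pthread_t t[NUM_THREADS];
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    for (long r = 0; r < NUM_THREADS; r++)
        pthread_create(&t[r], NULL, phase_worker, (void*)r);
    for (long r = 0; r < NUM_THREADS; r++)
        pthread_join(t[r], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}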
Synchronization Functionality
Busy waiting   Spinning on a condition. Wastes resources. Not safe.
Mutex lock     Supports code needing simple mutual exclusion.
Semaphore      Signal-based synchronization. Allows sharing (no wait unless the semaphore is 0).
Producer thread:
  mutex_lock(&m);
  Produce next item;
  avail = avail + 1;
  cond_signal(&cond);    // notify that an item is available
  mutex_unlock(&m);
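A runnable sketch pairing the producer above with a consumer, using the actual Pthreads calls (pthread_cond_signal / pthread_cond_wait); the item count of 5 is arbitrary, and the while loop guards against spurious wakeups:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int avail = 0;                        /* items produced but not yet consumed */

void* producer(void* arg) {
    for (int k = 0; k < 5; k++) {
        pthread_mutex_lock(&m);
        avail = avail + 1;                   /* produce next item */
        pthread_cond_signal(&cond);          /* notify that an item is available */
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

void* consumer(void* arg) {
    for (int k = 0; k < 5; k++) {
        pthread_mutex_lock(&m);
        while (avail == 0)                   /* while, not if: re-check after every wakeup */
            pthread_cond_wait(&cond, &m);    /* releases m while waiting, reacquires on wakeup */
        avail = avail - 1;                   /* consume an item */
        pthread_mutex_unlock(&m);
        printf("consumed item %d\n", k);
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}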
When to use condition broadcast?
Issues with Threads: False Sharing, Deadlocks, Thread-Safety
Problem: False Sharing
• Occurs when two or more processors/cores access different data in the same cache line, and at least one of them writes.
  This leads to a ping-pong effect: the cache line bounces between the cores.
• Let’s assume we parallelize this code with p = 2:
    for (i = 0; i < n; i++)
        a[i] = b[i];
  Each array element takes 8 bytes; a cache line holds 64 bytes (8 elements).
False Sharing: Example
Two CPUs execute:
    for (i = 0; i < n; i++)
        a[i] = b[i];
Diagram: a[0] … a[7] sit in one cache line. With a cyclic partitioning, CPU 0 writes a[0], a[2], a[4], … and CPU 1 writes a[1], a[3], a[5], …, so both CPUs keep writing to the same cache line and it ping-pongs between their caches.
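An illustrative sketch (sizes and iteration counts are assumptions, not from the slides): two threads repeatedly write elements that share a cache line; changing STRIDE from 1 to 8 puts the two elements on different cache lines and typically removes the slowdown.

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L
#define STRIDE 1                   /* 1: adjacent doubles (false sharing); 8: one cache line apart */

static double a[2 * 8];            /* 8 doubles = 64 bytes = one typical cache line */

void* bump(void* arg) {
    long r = (long)arg;
    for (long i = 0; i < ITERS; i++)
        a[r * STRIDE] += 1.0;      /* each thread writes only its own element */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long r = 0; r < 2; r++) pthread_create(&t[r], NULL, bump, (void*)r);
    for (long r = 0; r < 2; r++) pthread_join(t[r], NULL);
    printf("a[0] = %.0f, a[STRIDE] = %.0f\n", a[0], a[STRIDE]);
    /* With STRIDE 1 both elements share one cache line, so the line ping-pongs
       between the cores; with STRIDE 8 each thread's element has its own line. */
    return 0;
}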
Matrix-Vector Multiplication y = A*x
• Task partitioning
    for (i = 0; i < m; i++) {        /* Task Si computes row i */
        y[i] = 0;
        for (j = 0; j < n; j++)
            y[i] = y[i] + a[i][j] * x[j];
    }
• Task graph: independent tasks S0, S1, …, Sm-1
• Mapping to threads: e.g., S0, S1 -> Thread 0; S2, S3 -> Thread 1; …
Using 3 Pthreads for 6 Rows: 2 Rows per Thread
  Thread 0: S0, S1
  Thread 1: S2, S3
  Thread 2: S4, S5

Pthread code for the thread with ID rank (see the sketch below)
• The i-th thread calls Pth_mat_vect(&i) and executes its block of tasks Si
• m is the # of rows in matrix A
• n is the # of columns in matrix A
• local_m is the # of rows handled by this thread
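A sketch of Pth_mat_vect under these assumptions (A is stored row-major in a 1-D array; A, x, y, m, n, and thread_count are shared globals; block partitioning of rows; the small test matrix in main is illustrative):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static double *A, *x, *y;                    /* shared: matrix and vectors */
static int m = 6, n = 4, thread_count = 3;   /* 6 rows, 3 threads: 2 rows per thread */

void* Pth_mat_vect(void* rank) {
    long my_rank = *(long*)rank;             /* the i-th thread is passed &i */
    int local_m = m / thread_count;          /* # of rows handled by this thread */
    int my_first_row = my_rank * local_m;
    int my_last_row = my_first_row + local_m - 1;

    for (int i = my_first_row; i <= my_last_row; i++) {   /* task Si: compute row i */
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];
    }
    return NULL;
}

int main(void) {
    A = malloc(m * n * sizeof *A);
    x = malloc(n * sizeof *x);
    y = malloc(m * sizeof *y);
    for (int k = 0; k < m * n; k++) A[k] = 1.0;
    for (int j = 0; j < n; j++) x[j] = 1.0;

    pthread_t threads[3];
    long ranks[3];
    for (long r = 0; r < thread_count; r++) {
        ranks[r] = r;
        pthread_create(&threads[r], NULL, Pth_mat_vect, &ranks[r]);
    }
    for (int r = 0; r < thread_count; r++)
        pthread_join(threads[r], NULL);

    for (int i = 0; i < m; i++) printf("y[%d] = %.1f\n", i, y[i]);
    free(A); free(x); free(y);
    return 0;
}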
Impact of false sharing on performance of matrix-vector multiplication
• Why is the performance for an 8 x 8,000,000 matrix bad?
• How can we fix it?
Deadlock and Starvation