
Shared Memory Programming

with Pthreads
Outline

• Shared memory programming: overview
• POSIX Pthreads
• Critical sections & thread synchronization
   Mutexes
   Producer-consumer synchronization and semaphores
   Barriers and condition variables
   Read-write locks
• Thread safety
Shared Memory Architecture
Processes and Threads

• A process is an instance of a running (or suspended) program.
• A thread is analogous to a “light-weight” process.
• In a shared memory program, a single process may have multiple threads
  of control.
Logical View of Threads

• Threads are created within a process.

[Figure: a process whose threads T1-T5 share code, data, and kernel
context, shown next to a Unix process hierarchy in which P1 spawns
several sh processes and foo.]
Concurrent Thread Execution

• Two threads run concurrently if their logical flows overlap in time.
• Examples from the timeline figure of threads A, B, and C:
   Concurrent: A & B, A & C
   Not concurrent: B & C
Execution Flow on One-Core or Multi-Core Systems

• Concurrent execution on a single-core system: thread executions are
  interleaved in time on the one core.
• Parallel execution on a multi-core system: threads execute
  simultaneously on different cores.
Benefits of Multi-Threading

• Responsiveness
• Resource sharing
   Shared memory
• Economy
• Scalability
   Exploit multi-core CPUs
Thread Programming with Shared Memory

• A program is a collection of threads of control.
   Threads can be created dynamically.
• Each thread has a set of private variables, e.g., local stack
  variables.
• There is also a set of shared variables, e.g., static variables,
  shared common blocks, or the global heap.
   Threads communicate implicitly by writing and reading shared
   variables.
   Threads coordinate by synchronizing on shared variables.

[Figure: threads P0, P1, ..., Pn, each with a private memory holding its
own i (2, 5, ..., 8), all reading and writing a variable s in shared
memory.]

Shared Memory Programming
Several Thread Libraries/Systems

• Pthreads is the POSIX standard.
   Relatively low level.
   Portable but possibly slow; relatively heavyweight.
• OpenMP: standard for application-level programming.
   Support for scientific programming on shared memory.
   http://www.openMP.org
• Java threads.
• TBB: Threading Building Blocks (Intel).
• Cilk: language of the C “ilk”.
   Lightweight threads embedded into C.
Creation of Unix processes vs. Pthreads
C Function for Starting a Thread

Declared in pthread.h. One pthread_t object is needed for each thread.

int pthread_create(
      pthread_t*             thread_p        /* out */,
      const pthread_attr_t*  attr_p          /* in  */,
      void*                (*start_routine)(void*)  /* in */,
      void*                  arg_p           /* in  */);
A closer look (1)

int pthread_create(
      pthread_t*             thread_p        /* out */,
      const pthread_attr_t*  attr_p          /* in  */,
      void*                (*start_routine)(void*)  /* in */,
      void*                  arg_p           /* in  */);

• thread_p: allocate the pthread_t object before calling.
• attr_p: thread attributes; we won’t be using them, so we just pass
  NULL.
A closer look (2)

int pthread_create(
      pthread_t*             thread_p        /* out */,
      const pthread_attr_t*  attr_p          /* in  */,
      void*                (*start_routine)(void*)  /* in */,
      void*                  arg_p           /* in  */);

• start_routine: the function that the thread is to run.
• arg_p: pointer to the argument that should be passed to the function
  start_routine.
Function Started by pthread_create

• Prototype:
   void* thread_function(void* args_p);
• void* can be cast to any pointer type in C.
• So args_p can point to a list containing one or more values needed by
  thread_function.
• Similarly, the return value of thread_function can point to a list of
  one or more values.
Wait for Completion of Threads

int pthread_join(pthread_t thread, void** result);

   Waits for the specified thread to finish and places its exit value
   into *result.
• We call pthread_join once for each thread.
• A single call to pthread_join blocks until the thread associated with
  the pthread_t object completes.
Example of Pthreads

#include <pthread.h>
#include <stdio.h>

void *PrintHello(void *id) {
    printf("Thread %ld: Hello World!\n", (long) id);
    return NULL;
}

int main(void) {
    pthread_t thread0, thread1;
    pthread_create(&thread0, NULL, PrintHello, (void *) 0);
    pthread_create(&thread1, NULL, PrintHello, (void *) 1);
    return 0;   /* main may exit before the threads get to run! */
}
Example of Pthreads with join

#include <pthread.h>
#include <stdio.h>

void *PrintHello(void *id) {
    printf("Hello from thread %ld\n", (long) id);
    return NULL;
}

int main(void) {
    pthread_t thread0, thread1;
    pthread_create(&thread0, NULL, PrintHello, (void *) 0);
    pthread_create(&thread1, NULL, PrintHello, (void *) 1);
    pthread_join(thread0, NULL);
    pthread_join(thread1, NULL);
    return 0;
}
Some More Pthread Functions

• pthread_yield();
   Informs the scheduler that the thread is willing to yield.
   (Non-standard; the portable POSIX spelling is sched_yield().)
• pthread_exit(void *value);
   Exits the thread and passes value to the joining thread (if one
   exists).
Others:
• pthread_t me; me = pthread_self();
   Allows a thread to obtain its own identifier.
• Synchronizing access to shared variables:
   pthread_mutex_init, pthread_mutex_[un]lock
   pthread_cond_init, pthread_cond_[timed]wait
Compiling a Pthread Program

gcc -g -Wall -o pth_hello pth_hello.c -lpthread

The -lpthread flag links in the Pthreads library.

Running a Pthreads Program

./pth_hello
Hello from thread 1
Hello from thread 0

./pth_hello
Hello from thread 0
Hello from thread 1

The order of the two lines varies from run to run because thread
scheduling is nondeterministic.
Difference between Single and Multithreaded Processes

• Shared memory access for code/data
• Separate control flow -> separate stack/registers
CRITICAL SECTIONS
Data Race Example

static int s = 0;

Thread 0                      Thread 1
for i = 0, n/2-1              for i = n/2, n-1
   s = s + f(A[i])               s = s + f(A[i])

• This is also called the critical section problem.
• A race condition or data race occurs when:
   - two processors (or two threads) access the same variable, and at
     least one does a write;
   - the accesses are concurrent (not synchronized), so they could
     happen simultaneously.
Synchronization Solutions

1. Busy waiting
2. Mutex (lock)
3. Semaphore
4. Condition variables
Example of Busy Waiting

static int s = 0;
static int flag = 0;

Thread 0                       Thread 1
int temp0, my_rank             int temp, my_rank
for i = 0, n/2-1               for i = n/2, n-1
   temp0 = f(A[i])                temp = f(A[i])
   while flag != my_rank;         while flag != my_rank;
   s = s + temp0                  s = s + temp
   flag = (flag+1) % 2            flag = (flag+1) % 2

• A thread repeatedly tests a condition, but, effectively, does no
  useful work until the condition has the appropriate value.
• Weakness: wastes CPU resources, and it is sometimes unsafe under
  compiler optimization (the compiler may cache flag in a register).
Mutexes (Locks)

• Code structure:
   Acquire mutex lock
   Critical section
   Unlock/release mutex
• A mutex (mutual exclusion) is a special type of variable used to
  restrict access to a critical section to a single thread at a time.
• It guarantees that one thread “excludes” all other threads while it
  executes the critical section.
• When a thread waits on a mutex/lock, the CPU can be used by others.
• Only the thread that has acquired the lock can release it.
Execution Example with 2 Threads

Thread 1                  Thread 2
Acquire mutex lock        Acquire mutex lock (blocks)
Critical section
Unlock/release mutex
                          Critical section
                          Unlock/release mutex
Mutexes in Pthreads

• A special type for mutexes: pthread_mutex_t.
• To initialize, call pthread_mutex_init.
• To gain access to a critical section, call pthread_mutex_lock.
• To release, call pthread_mutex_unlock.
• When finished using a mutex, call pthread_mutex_destroy.
Global Sum Function That Uses a Mutex
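The global-sum code referred to here is an image in the original deck; the following is a minimal sketch of the idea, with illustrative names (THREAD_COUNT, Thread_sum, parallel_sum) that are assumptions rather than the book's exact code:

```c
#include <pthread.h>

#define THREAD_COUNT 4
#define N 1000000L

static double global_sum = 0.0;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

/* Each thread sums its own block into a private my_sum, then adds the
   partial result to global_sum inside the critical section.  Keeping the
   critical section down to one addition (instead of locking inside the
   loop) is what keeps this version fast. */
static void *Thread_sum(void *rank) {
    long my_rank = (long) rank;
    long chunk = N / THREAD_COUNT;
    double my_sum = 0.0;

    for (long i = my_rank * chunk; i < (my_rank + 1) * chunk; i++)
        my_sum += 1.0;                 /* stand-in for f(A[i]) */

    pthread_mutex_lock(&mutex);        /* enter critical section */
    global_sum += my_sum;
    pthread_mutex_unlock(&mutex);      /* leave critical section */
    return NULL;
}

/* Spawn the threads, wait for them, and return the protected total. */
double parallel_sum(void) {
    pthread_t threads[THREAD_COUNT];
    global_sum = 0.0;
    for (long t = 0; t < THREAD_COUNT; t++)
        pthread_create(&threads[t], NULL, Thread_sum, (void *) t);
    for (int t = 0; t < THREAD_COUNT; t++)
        pthread_join(threads[t], NULL);
    return global_sum;
}
```

Without the mutex, concurrent `global_sum += my_sum` updates could interleave and lose contributions.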
Semaphore: Generalization of Mutex Locks

• A semaphore S is an integer variable.
• It can only be accessed/modified via two (atomic) operations with the
  following semantics:

   wait(S) {   // also called P()
      while S <= 0, wait in a queue;
      S--;
   }

   post(S) {   // also called V()
      S++;
      wake up a thread that waits in the queue;
   }
Why Semaphores?

Synchronization   Functionality/weakness
Busy waiting      Spinning for a condition. Wastes resources. Not safe.
Mutex lock        Supports code with simple mutual exclusion.
Semaphore         Handles more complex signal-based synchronization.

• Examples of complex synchronization:
   Allow a resource to be shared among multiple threads.
      - A mutex allows no more than 1 thread in one protected region.
   Allow a thread to wait for a condition until it is signaled.
      - E.g., control the order in which threads enter the critical
        section.
      - With mutexes, the order is left to chance and the system.
Syntax of Pthread Semaphore Functions

Semaphores are not part of Pthreads; you need to add the header
semaphore.h.
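The semaphore calls themselves are standard POSIX; a small sketch of the syntax (the helper sem_demo is illustrative, and note that unnamed semaphores via sem_init are unavailable on macOS):

```c
#include <semaphore.h>   /* the extra header the slide alludes to */

/* Initialize, use, and destroy an unnamed POSIX semaphore.
   Returns the semaphore's final value (should equal `initial`). */
int sem_demo(unsigned initial) {
    sem_t sem;
    int value = -1;

    sem_init(&sem, 0 /* pshared: 0 = shared by threads of one process */,
             initial);

    sem_wait(&sem);      /* P(): decrement; blocks while the value is 0 */
    /* ... guarded work would go here ... */
    sem_post(&sem);      /* V(): increment; wakes one waiting thread    */

    sem_getvalue(&sem, &value);
    sem_destroy(&sem);
    return value;
}
```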
Producer-Consumer Synchronization and Semaphores

Producer-Consumer Example

   T0 -> T1 -> T2 -> T0 (a ring)

• Thread x produces a message for thread x+1.
   The last thread produces a message for thread 0.
• Each thread prints the message sent from its source.
• Will null messages be printed?
   Yes, if a consumer thread prints its source message before that
   message has been produced.
   How do we avoid that?
Flag-Based Synchronization with 3 Threads

Thread 0             Thread 1             Thread 2
Write a msg to #1    Write a msg to #2    Write a msg to #0
Set msg[1]           Set msg[2]           Set msg[0]
If msg[0] is ready   If msg[1] is ready   If msg[2] is ready
   Print msg[0]         Print msg[1]         Print msg[2]

To make sure a message is received/printed, use busy waiting.
First Attempt at Sending Messages Using Pthreads

[Code figure: each thread first produces a message for its destination
thread, then consumes the message from its source thread.]
Semaphore Synchronization with 3 Threads

Thread 0             Thread 1             Thread 2
Write a msg to #1    Write a msg to #2    Write a msg to #0
Set msg[1]           Set msg[2]           Set msg[0]
Post(semp[1])        Post(semp[2])        Post(semp[0])
Wait(semp[0])        Wait(semp[1])        Wait(semp[2])
Print msg[0]         Print msg[1]         Print msg[2]
Message Sending with Semaphores

sprintf(my_msg, "Hello to %ld from %ld", dest, my_rank);
messages[dest] = my_msg;
sem_post(&semaphores[dest]);    /* signal the dest thread */

sem_wait(&semaphores[my_rank]); /* wait until the source message is created */
printf("Thread %ld > %s\n", my_rank, messages[my_rank]);
READERS-WRITERS PROBLEM

Synchronization Example: the Readers-Writers Problem

• A data set is shared among a number of concurrent threads.
   Readers only read the data set; they do not perform any updates.
   Writers can both read and write.
• Requirements:
   Allow multiple readers to read at the same time.
   Only one writer may access the shared data at a time.
• Reader/writer access permission table:

            Reader   Writer
   Reader   OK       No
   Writer   No       No
Readers-Writers (First Try with 1 Mutex Lock)

• Writer:
   do {
      mutex_lock(w);
      // writing is performed
      mutex_unlock(w);
   } while (TRUE);
• Reader:
   do {
      mutex_lock(w);
      // reading is performed
      mutex_unlock(w);
   } while (TRUE);

Which combinations can proceed concurrently?

            Reader   Writer
   Reader   ?        ?
   Writer   ?        ?
Readers-Writers (First Try with 1 Mutex Lock)

• Writer:
   do {
      mutex_lock(w);
      // writing is performed
      mutex_unlock(w);
   } while (TRUE);
• Reader:
   do {
      mutex_lock(w);
      // reading is performed
      mutex_unlock(w);
   } while (TRUE);

Answer: nothing proceeds concurrently; even two readers exclude each
other, which violates the requirement that readers may share access.

            Reader   Writer
   Reader   no       no
   Writer   no       no
2nd Try: Using a Lock + readcount

• Writer:
   do {
      mutex_lock(w);   // use the writer mutex lock
      // writing is performed
      mutex_unlock(w);
   } while (TRUE);
• Reader:
   do {
      readcount++;     // count this reader
      if (readcount == 1) mutex_lock(w);
      // reading is performed
      readcount--;
      if (readcount == 0) mutex_unlock(w);
   } while (TRUE);

Problem: readcount is itself a shared variable updated without
protection, so this version has its own race condition.
Readers-Writers Problem with Semaphore

• Shared data:
   The data set.
   Mutex lock mutex (to protect readcount).
   Semaphore wrt initialized to 1 (to synchronize between
   readers/writers).
   Integer readcount initialized to 0.
Readers-Writers Problem

• A writer:
   do {
      sem_wait(wrt);   // semaphore wrt
      // writing is performed
      sem_post(wrt);
   } while (TRUE);
Readers-Writers Problem (Cont.)

• A reader:
   do {
      mutex_lock(mutex);
      readcount++;
      if (readcount == 1)
         sem_wait(wrt);   // check if anybody is writing
      mutex_unlock(mutex);

      // reading is performed

      mutex_lock(mutex);
      readcount--;
      if (readcount == 0)
         sem_post(wrt);   // writing is allowed now
      mutex_unlock(mutex);
   } while (TRUE);
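The pseudocode above maps directly onto Pthreads. A sketch with illustrative helper names (rw_init, reader_enter, and so on are assumptions, not part of any standard API):

```c
#include <pthread.h>
#include <semaphore.h>

static int readcount = 0;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; /* guards readcount */
static sem_t wrt;                                         /* "room is writable" */

void rw_init(void)      { sem_init(&wrt, 0, 1); }

void reader_enter(void) {
    pthread_mutex_lock(&mutex);
    if (++readcount == 1)          /* first reader locks writers out */
        sem_wait(&wrt);
    pthread_mutex_unlock(&mutex);
}

void reader_exit(void) {
    pthread_mutex_lock(&mutex);
    if (--readcount == 0)          /* last reader lets writers in */
        sem_post(&wrt);
    pthread_mutex_unlock(&mutex);
}

void writer_enter(void) { sem_wait(&wrt); }
void writer_exit(void)  { sem_post(&wrt); }

/* Number of readers currently inside (exposed for testing). */
int readers_inside(void) { return readcount; }
```

Note that this scheme favors readers: a writer can starve if new readers keep arriving before readcount drops to 0.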
Barriers

• Synchronizing the threads to make sure that they all are at the same
  point in a program is called a barrier.
• No thread can cross the barrier until all the threads have reached it.
• Availability:
   Core Pthreads provides no barrier (pthread_barrier_t exists only as
   an optional POSIX extension), so one often needs a custom
   implementation.
   Barriers are implicit in OpenMP and available in MPI.
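A custom barrier can be built from one mutex and one condition variable. A sketch (the type and function names my_barrier_* are illustrative); the cycle counter lets the same barrier be reused safely across iterations:

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int count;   /* threads arrived in the current cycle */
    int total;   /* threads that must arrive             */
    int cycle;   /* distinguishes successive barrier uses */
} my_barrier_t;

void my_barrier_init(my_barrier_t *b, int total) {
    pthread_mutex_init(&b->mutex, NULL);
    pthread_cond_init(&b->cond, NULL);
    b->count = 0;
    b->total = total;
    b->cycle = 0;
}

void my_barrier_wait(my_barrier_t *b) {
    pthread_mutex_lock(&b->mutex);
    int my_cycle = b->cycle;
    if (++b->count == b->total) {      /* last arrival releases everyone */
        b->count = 0;
        b->cycle++;
        pthread_cond_broadcast(&b->cond);
    } else {
        while (my_cycle == b->cycle)   /* loop guards against spurious wakeups */
            pthread_cond_wait(&b->cond, &b->mutex);
    }
    pthread_mutex_unlock(&b->mutex);
}
```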
Condition Variables

• Why? More programming primitives to simplify code for synchronization
  of threads.

Synchronization        Functionality
Busy waiting           Spinning for a condition. Wastes resources. Not safe.
Mutex lock             Supports code with simple mutual exclusion.
Semaphore              Signal-based synchronization. Allows sharing
                       (no wait unless the semaphore is 0).
Barrier                Rendezvous-based synchronization.
Condition variables    More complex synchronization: let threads wait
                       until a user-defined condition becomes true.
Synchronization Primitive: Condition Variables

• Used together with a lock.
• One can specify a more general waiting condition than with semaphores.
• A thread blocks when the condition is not true:
   it is placed in a waiting queue, yielding the CPU to somebody else;
   it does not wake up until it receives a signal.
Pthread Synchronization: Condition Variables

int status; pthread_cond_t cond;
const pthread_condattr_t attr;
pthread_mutex_t mutex;

status = pthread_cond_init(&cond, &attr);
status = pthread_cond_destroy(&cond);
status = pthread_cond_wait(&cond, &mutex);
   - Wait in a queue until somebody wakes us up; the mutex is then
     reacquired.
status = pthread_cond_signal(&cond);
   - Wake up one waiting thread.
status = pthread_cond_broadcast(&cond);
   - Wake up all threads waiting on that condition.
How to Use Condition Variables: Typical Flow

• Thread 1: // try to enter the critical section; wait for the condition
   mutex_lock(mutex);
   while (condition is not satisfied)
      cond_wait(cond, mutex);
   Critical section;
   mutex_unlock(mutex);

• Thread 2: // try to create the condition
   mutex_lock(mutex);
   When the condition can be satisfied: cond_signal(cond);
   mutex_unlock(mutex);
Condition Variables in the Producer-Consumer Problem with Unbounded Buffer

• The producer deposits data in a buffer for others to consume.
First Version for the Consumer-Producer Problem with Unbounded Buffer

• int avail = 0;  // # of data items available for consumption
• Consumer thread:
   while (avail <= 0);   // busy-wait
   Consume next item; avail = avail - 1;
• Producer thread:
   Produce next item; avail = avail + 1;
   // notify that an item is available
Condition Variables for the Consumer-Producer Problem with Unbounded Buffer

• int avail = 0;  // # of data items available for consumption
• Pthread mutex m and condition cond;
• Consumer thread:
   mutex_lock(&m);
   while (avail <= 0) cond_wait(&cond, &m);
   Consume next item; avail = avail - 1;
   mutex_unlock(&m);
• Producer thread:
   mutex_lock(&m);
   Produce next item; avail = avail + 1;
   cond_signal(&cond);   // notify that an item is available
   mutex_unlock(&m);
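The pseudocode above can be turned into a complete Pthreads program. A sketch (ITEMS, run_demo, and the counters are illustrative additions for demonstration):

```c
#include <pthread.h>

static int avail = 0;        /* items available for consumption */
static int consumed = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

#define ITEMS 5

static void *producer(void *arg) {
    for (int i = 0; i < ITEMS; i++) {
        pthread_mutex_lock(&m);
        avail++;                      /* "produce next item" */
        pthread_cond_signal(&cond);   /* notify a waiting consumer */
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

static void *consumer(void *arg) {
    for (int i = 0; i < ITEMS; i++) {
        pthread_mutex_lock(&m);
        while (avail <= 0)            /* re-test: wakeups may be spurious */
            pthread_cond_wait(&cond, &m);
        avail--;                      /* "consume next item" */
        consumed++;
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

/* Run one producer against one consumer; returns items consumed. */
int run_demo(void) {
    pthread_t p, c;
    avail = 0; consumed = 0;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return consumed;
}
```

The while loop around cond_wait is essential: a wakeup only means the condition may have become true, so it must be re-checked under the lock.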
When to Use Condition Broadcast?

• When waking up one thread to run is not sufficient.
• Example: concurrent malloc()/free() for allocation and deallocation of
  objects with non-uniform sizes.
Running Trace of malloc()/free()

• Initially 10 bytes are free.
• m() stands for malloc(), f() for free().

Thread 1:            Thread 2:             Thread 3:
m(10) - succeeds     m(5) - waits          m(5) - waits
f(10) - broadcast
                     Resume: m(5) - succeeds
                                           Resume: m(5) - succeeds
m(7) - waits
                     m(3) - waits
                     f(5) - broadcast
Resume: m(7) - still waits                 Resume: m(3) - succeeds

(time flows downward)
Issues with Threads: False Sharing, Deadlocks, Thread-Safety

Problem: False Sharing

• Occurs when two or more processors/cores access different data in the
  same cache line, and at least one of them writes.
   Leads to a ping-pong effect.
• Let’s assume we parallelize this code with p = 2:
   for (i = 0; i < n; i++)
      a[i] = b[i];
   Each array element takes 8 bytes.
   A cache line has 64 bytes (8 numbers).
False Sharing: Example (2 of 3)

Execute this program on two processors:
   for (i = 0; i < n; i++)
      a[i] = b[i];

One cache line holds a[0] ... a[7]. With a cyclic mapping, the even
elements are written by CPU 0 and the odd elements by CPU 1, so both
CPUs write into the same cache line.
False Sharing: Example (3 of 3)

Two CPUs execute:
   for (i = 0; i < n; i++)
      a[i] = b[i];

CPU 0 writes a[0], a[2], a[4], ... while CPU 1 writes a[1], a[3], a[5], ...
Each write by one CPU invalidates the cache line in the other CPU's
cache, so the line "ping-pongs" between the two caches and is
repeatedly reloaded.
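One standard fix is to replace the cyclic mapping with a block mapping, so each thread's writes stay within its own cache lines (except at one block boundary). A sketch (N, copy_block, and run_copy are illustrative names):

```c
#include <pthread.h>

#define N 1024
static double a[N], b[N];

/* Block mapping: thread t copies the contiguous half it owns.
   A cyclic mapping (i = t, t+2, t+4, ...) would instead put both
   threads' writes into every cache line. */
static void *copy_block(void *rank) {
    long t = (long) rank;
    long half = N / 2;
    for (long i = t * half; i < (t + 1) * half; i++)
        a[i] = b[i];
    return NULL;
}

/* Returns 1 if the parallel copy produced a == b. */
int run_copy(void) {
    for (int i = 0; i < N; i++) { b[i] = (double) i; a[i] = -1.0; }
    pthread_t th[2];
    for (long t = 0; t < 2; t++)
        pthread_create(&th[t], NULL, copy_block, (void *) t);
    for (int t = 0; t < 2; t++)
        pthread_join(th[t], NULL);
    for (int i = 0; i < N; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}
```

Both mappings compute the same result; the difference is purely in cache-line traffic, which is why false sharing is a performance bug rather than a correctness bug.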
Matrix-Vector Multiplication with Pthreads

Parallel Programming book by Pacheco, pp. 159-162.

Sequential code
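The sequential code is an image in the original deck; a minimal sketch in the spirit of Pacheco's version, storing A row-major in a 1-D array:

```c
/* y = A x, where A is an m x n matrix stored row-major. */
void mat_vect(const double A[], const double x[], double y[], int m, int n) {
    for (int i = 0; i < m; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];
    }
}
```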
Block Mapping for Matrix-Vector Multiplication

• Task partitioning:
   for (i = 0; i < m; i++)
      Task Si for row i:
         y[i] = 0;
         for (j = 0; j < n; j++)
            y[i] = y[i] + a[i][j]*x[j];
• Task graph: tasks S0, S1, ..., Sm-1 are independent of one another.
• Mapping to threads: consecutive tasks are grouped into blocks, and
  each block is assigned to one thread (e.g., S0, S1 to thread 0;
  S2, S3 to thread 1; ...).
Using 3 Pthreads for 6 Rows: 2 Rows per Thread

• Thread 0: S0, S1
• Thread 1: S2, S3
• Thread 2: S4, S5

[Code figure: the code for task S0, generalized to the code for task Si.]
Pthread Code for the Thread with ID rank

• The i-th thread calls Pth_mat_vect(&i).
• m is the number of rows in matrix A.
• n is the number of columns in matrix A.
• local_m is the number of rows handled by this thread (each thread runs
  task Si for its own rows).
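The Pth_mat_vect code is an image in the original deck; a sketch along the lines of Pacheco's version, which keeps m, n, thread_count, A, x, and y as globals (the wrapper parallel_mat_vect is an illustrative addition):

```c
#include <pthread.h>
#include <stdlib.h>

/* Globals shared by all threads, as in Pacheco's version. */
static int m, n, thread_count;
static double *A, *x, *y;

/* Thread `rank` computes local_m = m / thread_count consecutive rows. */
static void *Pth_mat_vect(void *rank) {
    long my_rank = (long) rank;
    int local_m = m / thread_count;
    int my_first_row = my_rank * local_m;
    int my_last_row  = my_first_row + local_m;

    for (int i = my_first_row; i < my_last_row; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];
    }
    return NULL;
}

/* Set up the globals, spawn the threads, and join them. */
void parallel_mat_vect(double *A_in, double *x_in, double *y_out,
                       int m_in, int n_in, int threads) {
    A = A_in; x = x_in; y = y_out;
    m = m_in; n = n_in; thread_count = threads;

    pthread_t *handles = malloc(threads * sizeof(pthread_t));
    for (long t = 0; t < threads; t++)
        pthread_create(&handles[t], NULL, Pth_mat_vect, (void *) t);
    for (int t = 0; t < threads; t++)
        pthread_join(handles[t], NULL);
    free(handles);
}
```

No mutex is needed here: each thread writes only its own block of y, so the tasks are independent.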
Impact of False Sharing on Performance of Matrix-Vector Multiplication

(times are in seconds)

• Why is the performance for the 8 x 8,000,000 matrix bad?
• How can we fix it?
   With only 8 rows, the entire 8-element y vector fits in a single
   64-byte cache line, so every thread's update of its y[i] invalidates
   the line in the other cores' caches.
Deadlock and Starvation

• Deadlock: two or more threads are waiting indefinitely for an event
  that can only be caused by one of these waiting threads.
• Starvation: indefinite blocking (stuck in a waiting queue forever).
• Let S and Q be two mutex locks:

   P0               P1
   Lock(S);         Lock(Q);
   Lock(Q);         Lock(S);
   ...              ...
   Unlock(Q);       Unlock(S);
   Unlock(S);       Unlock(Q);
Deadlock Avoidance

• Order the locks and always acquire them in that order.
• This eliminates circular waiting:

   P0               P1
   Lock(S);         Lock(S);
   Lock(Q);         Lock(Q);
   ...              ...
   Unlock(Q);       Unlock(Q);
   Unlock(S);       Unlock(S);
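One way to enforce a global lock order in code is to order mutexes by address, a common idiom (though strictly speaking C leaves comparisons of unrelated pointers unspecified). The helper names lock_pair/unlock_pair are illustrative:

```c
#include <pthread.h>

/* Acquire two mutexes in a single global order (here: by address) so
   that no two threads can ever wait on each other in a cycle. */
void lock_pair(pthread_mutex_t *s, pthread_mutex_t *q) {
    if (s < q) { pthread_mutex_lock(s); pthread_mutex_lock(q); }
    else       { pthread_mutex_lock(q); pthread_mutex_lock(s); }
}

void unlock_pair(pthread_mutex_t *s, pthread_mutex_t *q) {
    pthread_mutex_unlock(s);   /* release order does not matter */
    pthread_mutex_unlock(q);
}
```

With this helper, the P0/P1 code above becomes deadlock-free regardless of the argument order each thread uses.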
Thread-Safety

• A block of code is thread-safe if it can be simultaneously executed by
  multiple threads without causing problems.
• When you program your own functions, you know whether they are safe to
  be called by multiple threads.
• But you may forget to check whether the system library functions you
  use are thread-safe.
   Unsafe function: strtok() from the C string.h library.
   Other examples:
      - The random number generator random() in stdlib.h.
      - The time conversion function localtime() in time.h.
Concluding Remarks

• A thread in shared-memory programming is analogous to a process in
  distributed-memory programming.
   However, a thread is often lighter-weight than a full-fledged
   process.
• When multiple threads access a shared resource without control, we may
  get an error: a race condition.
   A critical section is a block of code that updates a shared resource
   and can only be executed by one thread at a time.
   Tools: mutexes, semaphores, condition variables.
• Issues: false sharing, deadlock, thread safety.
