Multicore
Dr. M. Schwind
Prof. Praktische Informatik
WS 2013/2014
Overview
- Introduction
- Basics
  - Generic Programming with C++
  - Concepts of Threading Building Blocks
  - Initialization
- Parallel Constructs
  - parallel_for
  - parallel_reduce
  - parallel_do
  - pipeline
  - parallel_sort
  - Additional Algorithm Templates
- Synchronization
  - Mutex
  - Atomic Operations
- Container
  - concurrent_vector
  - concurrent_hash_map
  - concurrent_queue
- Task-Programming
Intel Threading Building Blocks
- C++ library for shared-memory parallel programming, mainly for multicore CPUs
- Implements important parallel programming patterns:
  - Parallel loops
  - Pipelining
  - Task programming
- Provides data structures that allow concurrent access from several threads:
  - Queue (FIFO)
  - Associative container
  - Vector
- Developed by Intel
- Commercial and open-source versions
- Homepage: http://www.threadingbuildingblocks.org/
- Literature:
  - Website: http://www.threadingbuildingblocks.org/documentation.php
    - Reference Manual
    - Installation Guide
    - Getting Started Guide
  - Book: Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism
    - Author: James Reinders
    - Publisher: O'Reilly
    - ISBN: 0596514808
    - Publication date: 2007

Basics
- Generic Programming with C++
- Concepts of Threading Building Blocks
- Initialization
Example
    class IntStack {
    public:
        IntStack() : pos(0) {}                            // start with an empty stack
        void push(const int& item) { mem[pos++] = item; }
        int pop() { return mem[--pos]; }
        int mem[100];
        int pos;
    };

    IntStack s;
    // Usage
    s.push(5);
    int x = s.pop();
Example
    template <typename T>   // use type T instead of int
    class Stack {
    public:
        Stack() : pos(0) {}                               // start with an empty stack
        void push(const T& item) { mem[pos++] = item; }
        T pop() { return mem[--pos]; }
        T mem[100];
        int pos;
    };
Declaration of objects from a template class uses the class name followed by the types specified in <> brackets.

Example
    // Declaration of an integer stack
    Stack<int> int_stack;
    int_stack.push(5);
    // Declaration of a stack using double precision numbers
    Stack<double> double_stack;
    double_stack.push(5.0);

Type Requirements
- Templates can be used with self-defined types.
- Example: usage of the stack class for the storage of self-defined tuple classes.
- Analysis of the implementation of the stack class shows that an assignment operator must be defined.
- Template implementations impose certain semantic and syntactic requirements on their type arguments.

Example
    class IntTupel {
    public:
        IntTupel() : s1(0), s2(0) {}               // needed for the array T mem[100]
        IntTupel(int a, int b) : s1(a), s2(b) {}   // needed for IntTupel(5, 6)
        // Assignment operator
        IntTupel& operator=(const IntTupel& other) {
            s1 = other.s1;
            s2 = other.s2;
            return *this;
        }
        int s1, s2;  // elements of the tuple
    };
    ...
    // Usage
    Stack<IntTupel> s;
    s.push(IntTupel(5, 6));
Concepts of Threading Building Blocks
- A model is a type which fulfills all requirements of a concept.
- In Threading Building Blocks, concepts are described by pseudo signatures.

Example (CopyConstructible)
pseudo signature
    T(const T&)
    ~T()
    T* operator&()
    const T* operator&() const

Splittable Concept
pseudo signature
    X::X(X& x, Split)

- The splitting constructor splits an object into two parts.
- The argument Split is used to distinguish the splitting constructor from the copy constructor.
- Used for:
  - Partitioning of an index range into two subranges which can be computed in parallel
  - Duplication of function objects which are computed in parallel
- Models:
  - blocked_range and blocked_range2d
  - Body objects of parallel_reduce and parallel_scan
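The splitting constructor is easier to follow with a concrete type. The following is a minimal sketch in standard C++, not TBB code: the names Split, IndexRange and collect_leaves are hypothetical, and the choice that the original keeps the lower half is an assumption made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical tag type that plays the role of the Split argument.
struct Split {};

// Hypothetical half-open index range [begin, end) with a grain size.
class IndexRange {
public:
    IndexRange(std::size_t begin, std::size_t end, std::size_t grainsize = 1)
        : begin_(begin), end_(end), grainsize_(grainsize) {}

    // Splitting constructor: r keeps the lower half, *this takes the upper half.
    IndexRange(IndexRange& r, Split)
        : begin_(r.begin_ + (r.end_ - r.begin_) / 2),
          end_(r.end_),
          grainsize_(r.grainsize_) {
        r.end_ = begin_;
    }

    bool empty() const { return begin_ >= end_; }
    bool is_divisible() const { return end_ - begin_ > grainsize_; }
    std::size_t begin() const { return begin_; }
    std::size_t end() const { return end_; }

private:
    std::size_t begin_, end_, grainsize_;
};

// Splits r recursively until no part is divisible and collects the parts.
// In a parallel library, the two recursive calls could run on different threads.
void collect_leaves(IndexRange r, std::vector<IndexRange>& leaves) {
    if (r.is_divisible()) {
        IndexRange upper(r, Split());
        collect_leaves(r, leaves);
        collect_leaves(upper, leaves);
    } else {
        leaves.push_back(r);
    }
}
```

With a range [0, 10) and grain size 3, this sketch yields the four parts [0, 2), [2, 5), [5, 7) and [7, 10).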
Range Concept
- Represents index sets
- Typically used in parallel loops

pseudo signature                 semantics
R::R(const R&)                   Copy constructor
R::~R()                          Destructor
bool R::empty() const            true if the index range is empty
bool R::is_divisible() const     true if the index range can be divided
R::R(R& r, Split)                Subdivision of r into two index sets

Models: blocked_range and blocked_range2d

blocked_range
- Represents a half-open interval [i, j); i and j have type Value
- Models for Value are built-in types such as int, unsigned int or pointers to vector elements

    template <typename Value>
    class blocked_range {
    public:
        typedef size_t size_type;
        typedef Value const_iterator;

        blocked_range(Value begin, Value end, size_type grainsize = 1);
        blocked_range(blocked_range& r, split);

        size_type size() const;
        bool empty() const;

        size_type grainsize() const;
        bool is_divisible() const;

        const_iterator begin() const;
        const_iterator end() const;
    };

Concept of Value
pseudo signature
    Value::Value(const Value&)
    Value::~Value()
    operator-(const Value& i, const Value& j)
    operator+(const Value& i, size_t k)

Initialization of TBB
Example
    #include <cstdlib>
    #include "tbb/task_scheduler_init.h"
    using namespace tbb;

    int main() {
        task_scheduler_init init;
        ...
        return EXIT_SUCCESS;
    }

task_scheduler_init object
- Each program requires a tbb::task_scheduler_init object.
- After initialization, the threads are started and wait for work assignment.
- An additional parameter can specify the number of threads. Example: task_scheduler_init init(8) creates 8 threads.
- The threads stay alive as long as the task_scheduler_init object is not destroyed.
Parallel Constructs
- parallel_for
- parallel_reduce
- parallel_do
- pipeline
- parallel_sort
- Additional Algorithm Templates
parallel_for
- Parallel iteration over a range object
- The range object is subdivided into parts
- For each part, the operator() of the body is called
- An additional version of parallel_for takes a partitioner as third argument
- Requirements for the body:

pseudo signature
    Body::Body(const Body&);
    Body::~Body();
    void Body::operator()(Range& r) const;

Example
    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"
    using namespace tbb;

    class DoubleAll {
        int* input;
    public:
        DoubleAll(int* _input) : input(_input) {}
        // Doubles all elements in the subrange
        void operator()(const blocked_range<int>& range) const {
            for (int i = range.begin(); i != range.end(); ++i)
                input[i] *= 2;
        }
    };

    void ParallelDoubleAll(int* input, size_t n) {
        DoubleAll da(input);
        parallel_for(blocked_range<int>(0, static_cast<int>(n), 1000), da);
    }

Reduction Operation
parallel_reduce
    template<typename Range, typename Body>
    void parallel_reduce(const Range& range, Body& body);
- Builds a single object by applying a reduction operator to a set of objects
- Computes e.g. the sum, minimum or maximum of vector elements
- An additional version takes a partitioner
- The reduction operator should be associative
- Body:

pseudo signature                     semantics
Body::Body(Body&, split);            Splitting constructor
Body::~Body();                       Destructor
void Body::operator()(Range& r);     Reduction of the elements in the subrange r
void Body::join(Body& rhs);          Combining the values of subranges; combines rhs with the value of *this

Operation of parallel_reduce
- Recursive subdivision of the range object into subranges until a call to is_divisible returns false
- Body object:
  - Is replicated for each subrange
  - Its operator() is applied to each subrange
  - Stores the value of a reduction over a subrange
  - The values of the subranges are combined with join

Example
    class Sum {
    public:
        float* array;
        float value;
        Sum(float* _array) : array(_array), value(0) {}
        // Splitting constructor: the new body starts with an empty sum
        Sum(Sum& s, split) { value = 0; array = s.array; }
        void operator()(const blocked_range<int>& range) {
            float temp = value;
            for (int i = range.begin(); i != range.end(); ++i)
                temp += array[i];
            value = temp;
        }
        void join(Sum& rhs) { value += rhs.value; }
    };

    float ParallelSum(float* array, size_t n) {
        Sum total(array);
        parallel_reduce(blocked_range<int>(0, static_cast<int>(n), 1000), total);
        return total.value;
    }
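How the splitting constructor, operator() and join interact can be traced without any threads. The sketch below emulates the reduction tree sequentially in standard C++. It is not the TBB implementation: SplitTag, SumBody and reduce_sketch are hypothetical names, and the range is passed as plain begin/end indices to keep the example short.

```cpp
#include <cstddef>

struct SplitTag {};  // hypothetical stand-in for the split argument

// Body in the style of the Sum class, reduced to what the sketch needs.
struct SumBody {
    const float* array;
    float value;
    explicit SumBody(const float* a) : array(a), value(0.0f) {}
    // Splitting constructor: same input array, fresh partial sum.
    SumBody(SumBody& other, SplitTag) : array(other.array), value(0.0f) {}
    void operator()(std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i != end; ++i) value += array[i];
    }
    void join(SumBody& rhs) { value += rhs.value; }
};

// Sequential emulation of the reduction tree: split the range, hand the
// right half to a body created by the splitting constructor, reduce both
// halves, then join the partial results.
template <typename Body>
void reduce_sketch(std::size_t begin, std::size_t end,
                   std::size_t grainsize, Body& body) {
    const std::size_t size = end - begin;
    if (size <= grainsize || size < 2) {   // not divisible any further
        body(begin, end);
        return;
    }
    const std::size_t middle = begin + size / 2;
    Body right(body, SplitTag());
    reduce_sketch(begin, middle, grainsize, body);   // these two calls are
    reduce_sketch(middle, end, grainsize, right);    // independent of each other
    body.join(right);
}
```

Because the two recursive calls touch different bodies and different index ranges, they are the natural points at which a parallel library could hand work to another thread.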
Partitioner
- Controls the subdivision of range objects and the assignment of range objects to threads.
- Used for parallel_for, parallel_reduce and parallel_scan.

simple_partitioner
- Subdivides the range object until Range::is_divisible returns false.

auto_partitioner
- Does not necessarily subdivide the range object until Range::is_divisible returns false.
- Balances the work among the processors by ensuring that the ranges assigned to the threads have nearly equal size.

affinity_partitioner
- Subdivision similar to auto_partitioner
- When iterating several times over the range object, the partitioner assigns the subranges to the same threads in every iteration.
- Increases cache efficiency if the data fits in the cache.

parallel_do
    template<typename InputIterator, typename Body>
    void parallel_do(InputIterator first, InputIterator last, const Body& body);

- Iterates sequentially over the elements of a container and applies the operator() of the body object to each element.
- Particularly useful when the elements of the container are not randomly accessible, e.g. in a list.
- Iterator object required:
  - An iterator is an abstract interface to access the elements of a container.
  - Iterator objects are defined for STL (Standard Template Library) containers and TBB containers.
- The body can also be applied to objects which are generated while the computation proceeds.
Pseudo signature of the body:

pseudo signature and semantics
    void B::operator()(B::argument_type& item,
                       parallel_do_feeder<B::argument_type>& feed) const;
        item: element to which the operator is applied
        feed: used to store newly created elements
    B::argument_type
        Type of the elements
    B::argument_type(const B::argument_type&)
        Copy constructor of argument_type
    ~B::argument_type()
        Destructor of argument_type

Example
    class ListEl {};  // is CopyConstructible

    struct Body {
        typedef ListEl argument_type;
        void operator()(ListEl c, tbb::parallel_do_feeder<ListEl>& feed) const {
            ListEl new_item = process_item(c);  // process_item is defined elsewhere
            feed.add(new_item);                 // new_item is processed by parallel_do as well
        }
    };

    std::list<ListEl> list;
    ...
    tbb::parallel_do(list.begin(), list.end(), Body());

Pipeline
Class definition
    class pipeline {
    public:
        pipeline();
        virtual ~pipeline();
        void add_filter(filter& f);
        void run(size_t max_number_of_live_tokens);
        void clear();
    };

- A pipeline object (class pipeline) uses several pipeline stages f1, ..., fn, called filters in TBB.
- Filters are created outside of the pipeline and put into the pipeline by calling pipeline::add_filter().
- The method pipeline::run() starts the pipeline; max_number_of_live_tokens limits the number of items that are in the pipeline at the same time.
- pipeline::clear() removes all filters from the pipeline; after that call the filters can be destroyed.
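The flow of items through the stages can be sketched in a few lines of standard C++. This is a sequential model under stated assumptions, not the TBB classes: FilterSketch and PipelineSketch are hypothetical stand-ins, each stage is a function object taking and returning void*, and run() handles one item at a time, so it has no counterpart to max_number_of_live_tokens.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a pipeline stage.
class FilterSketch {
public:
    virtual ~FilterSketch() {}
    virtual void* operator()(void* item) = 0;
};

// Hypothetical stand-in for the pipeline: runs all stages for one item
// before the first stage is asked for the next item.
class PipelineSketch {
public:
    void add_filter(FilterSketch& f) { filters_.push_back(&f); }
    void run() {
        if (filters_.empty()) return;
        for (;;) {
            void* item = (*filters_[0])(nullptr);   // first stage produces an item
            if (item == nullptr) break;             // nullptr: no more input
            for (std::size_t i = 1; i < filters_.size(); ++i)
                item = (*filters_[i])(item);        // result feeds the next stage
        }
    }
    void clear() { filters_.clear(); }
private:
    std::vector<FilterSketch*> filters_;
};

// Example stages: produce 1..n, square each value, accumulate the results.
class Source : public FilterSketch {
public:
    explicit Source(int n) : next_(1), n_(n), current_(0) {}
    void* operator()(void*) override {
        if (next_ > n_) return nullptr;
        current_ = next_++;
        return &current_;
    }
private:
    int next_, n_, current_;
};

class Square : public FilterSketch {
public:
    void* operator()(void* item) override {
        int* v = static_cast<int*>(item);
        *v *= *v;
        return v;
    }
};

class Sink : public FilterSketch {
public:
    Sink() : total(0) {}
    void* operator()(void* item) override {
        total += *static_cast<int*>(item);
        return nullptr;   // return value of the last stage is not used
    }
    int total;
};
```

A typical use creates the three stages outside the pipeline, adds them in order with add_filter, and calls run(); for Source(4) the sink ends with 1 + 4 + 9 + 16.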
Filters
- Each filter class has to override the virtual method void* filter::operator()(void*).
- The return value of operator() is used as the argument item of the next pipeline stage.
- The first filter f1 generates the data; a return value of NULL tells TBB that no more elements need to be processed.
- The last stage fn should manage the output; the return value of that stage is ignored.
- A filter can be marked as a parallel filter: several items are then computed in parallel in that stage.

Parallel Sorting
parallel_sort
    template<typename RandomAccessIterator, typename Compare>
    void parallel_sort(RandomAccessIterator begin,
                       RandomAccessIterator end,
                       const Compare& comp);

- Used for sorting a container object
- Unstable: the order of elements with the same key is not preserved.
- Deterministic: sorting the same sequence of elements produces the same sorted sequence in each sorting run.
- RandomAccessIterator is defined in the STL; it allows random access to elements.

Example
    const int N = 100000;
    float b[N];
    ...
    parallel_sort(b, b + N, std::greater<float>());
Additional Algorithm Templates

parallel_scan
- Computes the prefix sum in parallel
- Used e.g. for parallel sorting

parallel_for_each
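A prefix sum looks inherently sequential, so it helps to see why it can be parallelized at all. The sketch below shows one common two-pass scheme in standard C++: per-block sums, a short sequential pass over the block sums, then per-block prefix sums shifted by an offset. It is an illustration under assumptions, not the algorithm parallel_scan is documented to use; blocked_prefix_sum and block_size are hypothetical names, and the per-block loops run sequentially here.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive prefix sum computed block by block (hypothetical helper).
std::vector<int> blocked_prefix_sum(const std::vector<int>& in,
                                    std::size_t block_size) {
    const std::size_t n = in.size();
    std::vector<int> out(n);
    if (n == 0) return out;
    if (block_size == 0) block_size = 1;
    const std::size_t blocks = (n + block_size - 1) / block_size;

    // Pass 1: sum of each block. The blocks are independent of each other.
    std::vector<int> block_sum(blocks, 0);
    for (std::size_t b = 0; b < blocks; ++b) {
        const std::size_t lo = b * block_size;
        const std::size_t hi = std::min(n, lo + block_size);
        block_sum[b] = std::accumulate(in.begin() + lo, in.begin() + hi, 0);
    }

    // Short sequential step: offset of a block = sum of all earlier blocks.
    std::vector<int> offset(blocks, 0);
    for (std::size_t b = 1; b < blocks; ++b)
        offset[b] = offset[b - 1] + block_sum[b - 1];

    // Pass 2: local prefix sum per block, shifted by the block offset.
    // Again the blocks are independent of each other.
    for (std::size_t b = 0; b < blocks; ++b) {
        const std::size_t lo = b * block_size;
        const std::size_t hi = std::min(n, lo + block_size);
        int running = offset[b];
        for (std::size_t i = lo; i < hi; ++i) {
            running += in[i];
            out[i] = running;
        }
    }
    return out;
}
```

The result agrees with std::partial_sum; only the middle step is sequential, and its length is the number of blocks rather than the number of elements.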
Synchronization
- Mutex
- Atomic Operations

Problems with explicit locking and unlocking:
- Unlocking a lock may be forgotten, and not only when exceptions are used.
- The complexity of the program text increases.
- The programming effort for the programmer increases.
Solution 2: Division of the lock variable and the locking functionality into two objects
- Mutex: globally visible
- Scoped lock: used for locking the mutex
  - For each thread and each mutex, one scoped lock instance exists
  - Locks the mutex at its construction
  - Unlocks the mutex at its destruction
- Tip: Using a code block (braces { } in C++) and declaring a scoped lock object at the beginning of the code block locks the associated mutex within the whole code block.

Example:
    ...
    {
        // Construction of myLock locks mutex myMutex
        mutex::scoped_lock myLock(myMutex);
        // Computations are protected by myMutex
        ...
        // Unlocking of myMutex
        // (destructor of myLock is called implicitly)
    }

Mutex Concept
All of the following mutex models have to implement the following functions:

Pseudo Signature                                 Semantic
M()                                              Construction of a mutex object
~M()                                             Destruction of a mutex object
typename M::scoped_lock                          Type of the scoped lock class
M::scoped_lock()                                 Construction of a scoped lock object without locking a mutex
M::scoped_lock(M& mutex)                         Construction of a scoped lock object and locking of the mutex
M::~scoped_lock()                                Releases the mutex, if locked
M::scoped_lock::acquire(M& mutex)                Locks the mutex
bool M::scoped_lock::try_acquire(M& mutex)       Tries to lock the mutex; returns false if already locked, otherwise true
M::scoped_lock::release()                        Unlocks the mutex
static const bool M::is_rw_mutex                 true if reader-writer mutex
static const bool M::is_recursive_mutex          true if recursive mutex
static const bool M::is_fair_mutex               true if fair mutex
spin_mutex-Class
- Lock implementation using a busy-waiting loop; uses a flag variable in memory.
- Suitable only for short delays, since processor time and memory bandwidth are used while waiting.
- Unfair implementation: the order of the locking requests is ignored.

recursive_mutex-Class
- Wrapper for the recursive mutex implementation of the operating system (e.g. for pthread_mutex_t)
- A recursive lock can be locked several times by one and the same thread.
- If a mutex was locked n times, the thread has to unlock it n times too.

queuing_mutex-Class
- Implementation using a busy-waiting loop
- Fair implementation: locking requests are served in FIFO order.
- The implementation scales.
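To connect the pseudo signatures of the Mutex concept with the busy-waiting idea, here is a minimal sketch of a spin mutex with a nested scoped_lock, written with std::atomic from C++11. It is not the TBB spin_mutex: the class name SpinMutexSketch and the memory orderings are choices made for this illustration, and the sketch is unfair in the same sense as described above.

```cpp
#include <atomic>

// Hypothetical spin mutex that follows the shape of the Mutex concept.
class SpinMutexSketch {
public:
    SpinMutexSketch() : locked_(false) {}

    class scoped_lock {
    public:
        // Construction without locking a mutex.
        scoped_lock() : mutex_(nullptr) {}
        // Construction and locking of the mutex.
        explicit scoped_lock(SpinMutexSketch& m) : mutex_(nullptr) { acquire(m); }
        // Releases the mutex, if locked.
        ~scoped_lock() { if (mutex_ != nullptr) release(); }

        scoped_lock(const scoped_lock&) = delete;
        scoped_lock& operator=(const scoped_lock&) = delete;

        void acquire(SpinMutexSketch& m) {
            // Busy waiting: retry until the flag changes from false to true.
            while (m.locked_.exchange(true, std::memory_order_acquire)) {
            }
            mutex_ = &m;
        }
        bool try_acquire(SpinMutexSketch& m) {
            if (m.locked_.exchange(true, std::memory_order_acquire))
                return false;            // already locked by someone else
            mutex_ = &m;
            return true;
        }
        void release() {
            mutex_->locked_.store(false, std::memory_order_release);
            mutex_ = nullptr;
        }
    private:
        SpinMutexSketch* mutex_;         // mutex currently held, or nullptr
    };

private:
    std::atomic<bool> locked_;           // the flag variable in memory
};
```

Declaring a SpinMutexSketch::scoped_lock at the top of a block protects the block; while that lock exists, try_acquire on the same mutex reports false, and it succeeds again once the block has been left.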
ReaderWriterMutex-Concept
Read-/Write Locks:
- Several threads that only read the protected data structure are allowed to read in parallel.
- A thread which modifies the data structure needs exclusive write access.
- The mutex can be locked either by several readers or by one writer.
- Additional requirements compared to the Mutex concept:

Pseudo Signature                                               Semantic
M::scoped_lock(M& mutex, bool write=true)                      Constructs a scoped lock object and locks the mutex
M::scoped_lock::acquire(M& mutex, bool write=true)             Locks the mutex
bool M::scoped_lock::try_acquire(M& mutex, bool write=true)    Tries to lock the mutex; returns false if already locked, otherwise true
bool M::scoped_lock::upgrade_to_writer()                       Reader lock -> writer lock
bool M::scoped_lock::downgrade_to_reader()                     Writer lock -> reader lock

- write=true: exclusive write access; write=false: read access
- Models: class spin_rw_mutex and class queuing_rw_mutex

Summary of Locks
(x = yes, - = no)

Class               recursive    release of CPU
mutex               -            x
recursive_mutex     x            x
spin_mutex          -            -
queuing_mutex       -            -
spin_rw_mutex       -            -
queuing_rw_mutex    -            -
Example
- Data structure queue (FIFO)
- Implementation using a linked list
- Attaching elements at the end: enq operation
- Taking elements at the start: deq operation
- Later: implementation by TBB (concurrent_queue)

    template <typename T>
    struct Node {
        Node() : next(NULL) {}
        Node(const T& v) : val(v), next(NULL) {}
        T val;       // value
        Node* next;  // pointer to next element
    };

    struct EmptyException {};  // thrown by deq() on an empty queue

    template <typename T>
    class LockQueue {
        // Mutexes for taking and attaching
        mutex enqLock, deqLock;
        // Pointers to beginning and end of the linked list
        Node<T> *head, *tail;
    public:
        // Queue has one sentinel element
        LockQueue() {
            head = new Node<T>();
            tail = head;
        }
        ~LockQueue() {
            while (head != NULL) {   // delete the sentinel and all remaining elements
                Node<T>* h = head;
                head = head->next;
                delete h;
            }
        }
        void enq(const T& x) {
            mutex::scoped_lock l(enqLock);
            Node<T>* e = new Node<T>(x);
            tail->next = e;
            tail = e;
        }
        T deq() {
            mutex::scoped_lock l(deqLock);
            if (head->next == NULL)
                throw EmptyException();
            T val = head->next->val;
            Node<T>* h = head;
            head = head->next;
            delete h;
            return val;
        }
    };

Notes:
- By using two mutexes it is possible to take elements from the queue and attach elements to the queue in parallel.
- Deadlock-free, since no thread holds two locks at the same time.
- head points to the sentinel element; its successor is the first element of the queue.
Atomic Operations
    template<typename T>
    struct atomic {
        typedef T value_type;

        value_type fetch_and_add(value_type addend);         // x = x + addend
        value_type fetch_and_increment();                    // x = x + 1
        value_type fetch_and_decrement();                    // x = x - 1
        value_type compare_and_swap(value_type new_value,
                                    value_type comparand);   // if (x == comparand) x = new_value
        value_type fetch_and_store(value_type new_value);    // swap(x, new_value)
        operator value_type() const;
        value_type operator+=(value_type);
        value_type operator-=(value_type);
        value_type operator++();
        value_type operator--();
    };

T: integer or pointer type
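The typical use of a compare-and-swap operation is a retry loop. The sketch below uses std::atomic from C++11 instead of the atomic template shown above, so the member names differ: fetch_add corresponds to fetch_and_add, and compare_exchange_weak plays the role of compare_and_swap, with a different argument order and a bool result. The helper names atomic_max and next_ticket are hypothetical.

```cpp
#include <atomic>

// Raises target to at least candidate with a compare-and-swap retry loop.
// Returns the value that was observed when the loop ended.
int atomic_max(std::atomic<int>& target, int candidate) {
    int observed = target.load();
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate)) {
        // On failure, compare_exchange_weak stores the current value in
        // 'observed', so the loop condition is re-checked with fresh data.
    }
    return observed;
}

// Hands out consecutive ticket numbers; fetch_add returns the old value.
int next_ticket(std::atomic<int>& counter) {
    return counter.fetch_add(1);
}
```

Both helpers may be called from several threads at once without an additional lock; no update is lost because each modification is a single atomic read-modify-write step.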
Container
- concurrent_vector
- concurrent_hash_map
- concurrent_queue

concurrent_vector
    template<typename T> class concurrent_vector;

Properties:
- Random access to elements (addressed by index)
- The data structure can grow
- After growing, indices and iterators are still valid
- No shrinking is possible

Selected Methods
Access to elements:
- T& operator[](size_type i): access to the i-th element without index checking
- T& at(size_type i): access to the i-th element; exception std::out_of_range when the index is invalid
- T& front() and T& back(): access to the first or last element

Iterators and ranges:
- iterator begin() and iterator end(): random access iterators for the vector elements in increasing order of indices
- reverse_iterator rbegin() and reverse_iterator rend(): random access iterators for visiting the vector elements in reverse order
- range_type range(int grainsize): range object for the vector

Enlargement:
- size_type grow_by(size_type delta, const T& t=T()): enlargement by delta elements
- void grow_to_at_least(size_type n): enlargement to at least n elements
- size_t push_back(const T& val): appends the value val at the end; returns the ...
- ... before new memory is allocated
- size_type max_size()
concurrent_hash_map
- Hash table for the storage of key-value pairs with parallel access
- Key: type of the keys; T: type of the values
- HashCompare: class for mapping keys to integer values
- Concept of HashCompare:

Pseudo signature
    HashCompare::HashCompare(const HashCompare&)
    HashCompare::~HashCompare()
    bool HashCompare::equal(const Key& j, const Key& k) const
    size_t HashCompare::hash(const Key& k) const

Conditions:
- i, j have type Key; h is an object which implements the concept HashCompare.
- If h.equal(i,j) is true, then h.hash(i) == h.hash(j) must hold.

Element Access
- An accessor object (proxy) allows concurrent access to key-value pairs.
- An accessor object uses an implicit lock for each key-value pair.
- Construction of an accessor object: locking of the corresponding key-value pair
- Destruction of the accessor object: unlocking of the implicit lock
- There are two different accessor types:
  - const_accessor: read
  - accessor: read/write

const_accessor
    template <typename Key, typename T,
              typename HashCompare, typename A>
    class concurrent_hash_map<Key, T, HashCompare, A>::const_accessor {
        ...
        typedef const std::pair<const Key, T> value_type;

        bool empty() const;                        // Element present?
        const value_type& operator*() const;       // Reference to entry
        const value_type* operator->() const;      // Pointer to entry
        void release();                            // Unlocking the implicit lock
    };
Selected Methods
- size_type count(const Key& key) const: returns one if key is present in the table
- bool insert(accessor& result, const Key& key): similar to find; difference: if the entry is not present, a new key-value pair is created and inserted with pair<Key,T>(key,T()).
- bool erase(const Key& key)

Example: Compute the frequency of words
    struct MyHashCompare {
        static size_t hash(const string& x) {
            size_t h = 0;
            for (const char* s = x.c_str(); *s; s++)
                h = (h * 17) ^ *s;
            return h;
        }
        static bool equal(const string& x, const string& y) {
            return x == y;
        }
    };

    typedef concurrent_hash_map<string, int, MyHashCompare> StringTable;
Example
    struct Tally {
        StringTable& table;
        Tally(StringTable& _table) : table(_table) {}
        void operator()(const blocked_range<string*>& r) const {
            for (string* p = r.begin(); p != r.end(); ++p) {
                StringTable::accessor a;
                table.insert(a, *p);   // creates the entry with count 0 if absent
                a->second += 1;        // entry stays locked while a exists
            }
        }
    };

    void CountOccurrences(string* data, int nitems) {
        task_scheduler_init init;
        StringTable table;
        const size_t grainsize = 1000;  // assumed value; tune for the workload
        parallel_for(blocked_range<string*>(data, data + nitems, grainsize),
                     Tally(table));
        for (StringTable::iterator i = table.begin(); i != table.end(); ++i)
            cout << i->first << " " << i->second << endl;
    }
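The HashCompare condition (keys that compare equal must have equal hash values) only becomes interesting when equality is not plain ==. As an illustration, the following sketch counts words case-insensitively. The name CaseInsensitiveHashCompare and the helper fold are hypothetical; the hash formula is the one from MyHashCompare, applied to the lower-cased characters so that the condition holds.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical HashCompare that treats keys as case-insensitive.
struct CaseInsensitiveHashCompare {
    // Maps a character to its lower-case form.
    static unsigned char fold(char c) {
        return static_cast<unsigned char>(
            std::tolower(static_cast<unsigned char>(c)));
    }
    // Hashes the folded characters, so "Thread" and "THREAD" collide on purpose.
    static std::size_t hash(const std::string& x) {
        std::size_t h = 0;
        for (std::size_t i = 0; i < x.size(); ++i)
            h = (h * 17) ^ fold(x[i]);
        return h;
    }
    // Compares the folded characters.
    static bool equal(const std::string& x, const std::string& y) {
        if (x.size() != y.size()) return false;
        for (std::size_t i = 0; i < x.size(); ++i)
            if (fold(x[i]) != fold(y[i])) return false;
        return true;
    }
};
```

If hash() used the raw characters while equal() ignored case, two equal keys could land in different buckets and would be counted separately, which is exactly what the condition rules out.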
concurrent_queue
    template<typename T> class concurrent_queue;

- FIFO queue
- Elements can be inserted and removed concurrently
- Limited capacity
- The implementation uses locks
- Busy waiting in some (blocking) operations
- Important methods:
  - void push(const T& source): inserts an element at the end
  - void pop(T& destination): removes and returns the element at the beginning; blocks if the queue is empty
  - bool pop_if_present(T& destination): removes and returns an element
  - size_type size() const: number of elements stored; if the queue is empty, returns the number of waiting threads as a negative number
  - size_t capacity() const: returns the maximum capacity
Task-Programming
- A task is composed of data and code which uses the data for computation.
- Tasks can be executed in parallel.
- Tasks can be divided into subtasks: the parent-child relationship creates a tree of tasks.
- Child tasks should be independent: computation on different cores is then possible.
- The programmer defines the subdivision.
- The scheduler component within TBB manages the computation order.
- Examples of algorithms:
  - Linear algebra (matrix multiplication, matrix decomposition)
  - Sorting (merge sort, quicksort)
  - Search
Split-Join
- Decomposition of a task into subtasks: split operation
- Waiting for the completion of the children: join operation

Task-Depth
- Each task carries implicit information about its task depth.
- The task depth of a child is one greater than the task depth of its parent.
- The root task has task depth 0.

Reference counter
- Each task has a reference counter.
- The reference counter counts the number of existing children.
- If the reference counter reaches zero, the task is deleted and the reference counter of its parent is decremented.

Blocking
    task* T::execute() {
        if (/* no further division possible */) {
            /* sequential computation */
        } else {
            set_ref_count(k + 1);
            task& tk = *new(allocate_child()) T(...); spawn(tk);
            ...
            task& t2 = *new(allocate_child()) T(...); spawn(t2);
            task& t1 = *new(allocate_child()) T(...);
            spawn_and_wait_for_all(t1);
        }
        return NULL;
    }

Explanation:
- T inherits from the class task and reimplements the method execute().
- execute() controls the subdivision into tasks. Steps:
  - Allocation of the task objects
  - set_ref_count(): initializes the reference counter to #children + 1
  - spawn(): marks a task for execution
  - spawn_and_wait_for_all(): waits until the reference counter reaches 1
- Important: the set_ref_count call has to come before the spawn call.
- execute() returns a task which is computed immediately.
Example (Blocking)
    struct Tree { int val; Tree *left, *right; };

    class SumTask : public task {
        int* sum;
        Tree* tree;
    public:
        SumTask(Tree* _tree, int* _sum) : sum(_sum), tree(_tree) {}

        task* execute() {
            SumTask *a = NULL, *b = NULL;
            int ref = 1, x = 0, y = 0;
            if (tree->right != NULL) {
                a = new(allocate_child()) SumTask(tree->right, &x);
                ref++;
            }
            if (tree->left != NULL) {
                b = new(allocate_child()) SumTask(tree->left, &y);
                ref++;
            }
            if (ref > 1) {
                set_ref_count(ref);
                if (a != NULL) spawn(*a);
                if (b != NULL) spawn(*b);
                wait_for_all();
            }
            *sum = tree->val + x + y;
            return NULL;
        }
    };

Solution:
- The method task::execute() ends without waiting for the child tasks.
- The computation using the results from the child tasks is moved into a continuation task.
- The continuation task is executed after all children have finished.
Continuation-Passing
    task* T::execute() {
        if (/* no further division possible */) {
            /* sequential computation */
            return NULL;
        } else {
            set_ref_count(k);
            recycle_as_continuation();
            task& tk = *new(allocate_child()) T(...); spawn(tk);
            ...
            task& t1 = *new(allocate_child()) T(...);
            return &t1;   // t1 is not spawned; the returned task is executed next
        }
    }

- In the example there is no further computation after the child tasks have been started.
- From the algorithm's point of view there is no need for a continuation task.
- The internals of TBB require a continuation task: recycle_as_continuation()

Example (Continuation-Passing)
    class SumContTask : public task {
    public:
        int* sum;
        int val, x, y;   // value of the node and results of the children
        SumContTask(int* _sum, int _val) : sum(_sum), val(_val), x(0), y(0) {}
        task* execute() { *sum = val + x + y; return NULL; }
    };

    class SumTask : public task {
        int* sum;
        Tree* tree;
    public:
        SumTask(Tree* _tree, int* _sum) : sum(_sum), tree(_tree) {}

        task* execute() {
            if (tree->left == NULL && tree->right == NULL) {
                *sum = tree->val;   // leaf: no continuation needed
                return NULL;
            }
            SumContTask* c = new(allocate_continuation()) SumContTask(sum, tree->val);
            SumTask *a = NULL, *b = NULL;
            int ref = 0;
            if (tree->right != NULL) {
                a = new(c->allocate_child()) SumTask(tree->right, &c->x);
                ref++;
            }
            if (tree->left != NULL) {
                b = new(c->allocate_child()) SumTask(tree->left, &c->y);
                ref++;
            }
            c->set_ref_count(ref);
            if (a != NULL) spawn(*a);
            if (b != NULL) spawn(*b);
            return NULL;
        }
    };

Methods of class task (semantics):
- Wait for the children to finish
- Mark a child for execution
- Mark a list of children for execution
- Mark a child for execution and wait for the children
- Mark the children in a list for execution and wait for the children
- Return the task depth
- Set the task depth
- Increment the task depth
- Return the reference counter
- Set the reference counter
- Recycling of a task as continuation task
- Recycling as child with parent task parent
- Recycling as child
Root Task
- The root task starts the task computation.
- The root task has to be allocated with new(task::allocate_root()).
- The result is stored in sum.
- The static method task::spawn_root_and_wait(task& root) starts the root task and waits for its completion.
- The static method task::spawn_root_and_wait(task_list& root) executes a list of root tasks.
Execution Orders

Ready-Pool
- New tasks are stored at the beginning of the list corresponding to their task depth and are removed from the beginning of that list (LIFO).

A thread selects the next task in the following order:
1. The task returned by execute() of the last executed task.
2. The task which is the parent of the last executed task.
3. A task from the list with the highest task depth.
4. A task with an affinity for that thread.
5. A task from the ready pool of another thread with the lowest depth (task stealing).
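The interplay of reference counters, continuation tasks and a LIFO ready pool can be traced with a miniature single-threaded model. The sketch below is written in standard C++ and is not the TBB scheduler: MiniTask and tree_sum are hypothetical names, there is only one ready pool without task depths, and no stealing takes place. It sums the values of a binary tree in continuation-passing style.

```cpp
#include <vector>

struct Tree { int val; Tree* left; Tree* right; };

// Hypothetical miniature task, not the TBB task class.
struct MiniTask {
    MiniTask* parent;  // task to notify on completion, or nullptr
    int ref_count;     // number of children that have not finished yet
    Tree* tree;        // node to sum, or nullptr for a continuation task
    int* sum;          // where the result is written
    int val, x, y;     // continuation state: node value and child results
};

int tree_sum(Tree* root) {
    int result = 0;
    std::vector<MiniTask*> ready;   // ready pool, used as a LIFO stack
    ready.push_back(new MiniTask{nullptr, 0, root, &result, 0, 0, 0});

    while (!ready.empty()) {
        MiniTask* t = ready.back();
        ready.pop_back();
        MiniTask* successor = t->parent;

        if (t->tree == nullptr) {
            // Continuation task: combine the results of the children.
            *t->sum = t->val + t->x + t->y;
        } else if (t->tree->left == nullptr && t->tree->right == nullptr) {
            // Leaf: sequential computation.
            *t->sum = t->tree->val;
        } else {
            // Inner node: the continuation takes over the link to the parent,
            // the children report to the continuation.
            MiniTask* c = new MiniTask{t->parent, 0, nullptr, t->sum,
                                       t->tree->val, 0, 0};
            if (t->tree->left != nullptr) {
                ++c->ref_count;
                ready.push_back(new MiniTask{c, 0, t->tree->left, &c->x, 0, 0, 0});
            }
            if (t->tree->right != nullptr) {
                ++c->ref_count;
                ready.push_back(new MiniTask{c, 0, t->tree->right, &c->y, 0, 0, 0});
            }
            successor = nullptr;   // the continuation notifies the parent later
        }
        delete t;

        // A finished task decrements the counter of its successor; when the
        // counter reaches zero, the successor becomes ready for execution.
        if (successor != nullptr && --successor->ref_count == 0)
            ready.push_back(successor);
    }
    return result;
}
```

No task ever waits: a task either finishes immediately or hands its remaining work to a continuation, which only enters the ready pool once its reference counter has dropped to zero.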