
Intel Threading Building Blocks

Dr. M. Schwind
Prof. Praktische Informatik

Winter Term 2013/2014


Overview

Intel Threading Building Blocks
- Introduction
- Basics
  - Generic Programming with C++
  - Concepts of Threading Building Blocks
  - Initialization
- Parallel Constructs
  - parallel_for
  - parallel_reduce
  - parallel_do
  - pipeline
  - parallel_sort
  - Additional Algorithm Templates
- Synchronization
  - Mutex
  - Atomic Operations
- Container
  - concurrent_vector
  - concurrent_hash_map
  - concurrent_queue
- Task-Programming

Introduction

- C++ library for shared-memory parallel programming, mainly for multicore CPUs
- Implements important parallel programming patterns:
  - Parallel loops
  - Pipelining
  - Task programming
- Provides data structures that allow parallel access from several threads:
  - Queue (FIFO)
  - Associative container
  - Vector
- Developed by Intel
- Not restricted to Intel processors
- The implementation uses generic programming (C++ templates)

- Available in a commercial and an open-source version
- Homepage: http://www.threadingbuildingblocks.org/
- Literature:
  - Website: http://www.threadingbuildingblocks.org/documentation.php
    - Reference Manual
    - Installation Guide
    - Getting Started Guide
  - Book: Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism
    - Author: James Reinders
    - Publisher: O'Reilly
    - ISBN: 0596514808
    - Publication date: 2007

Basics

Motivation for Generic Programming

- Example: simplified implementation of a stack for storing integer values
- Problem: a type-safe stack for different element types requires one implementation per type
- Solution:
  - Use of the preprocessor (awkward, confusing, difficult to maintain)
  - Use of templates

Example

    class IntStack {
    public:
        void push(const int& item) { mem[pos++] = item; }
        int pop() { return mem[--pos]; }
        int mem[100];
        int pos = 0;   // index of the next free slot
    };

    IntStack s;        // Usage
    s.push(5);
    int x = s.pop();

Generic Programming with C++

- Functions, classes, and methods can be declared with types that remain variable until compile time
- To define a class with variable types, the class declaration is preceded by template<typename T1, typename T2, ...>
  - T1, T2, ... are identifiers for the variable types
  - typename is a keyword preceding each identifier

Example

    template <typename T>   // use type T instead of int
    class Stack {
    public:
        void push(const T& item) { mem[pos++] = item; }
        T pop() { return mem[--pos]; }
        T mem[100];
        int pos = 0;        // index of the next free slot
    };

Declaration of Objects using Variable Types

An object of a template class is declared with the class name followed by the concrete types in angle brackets <>.

Example

    // Declaration of an integer stack
    Stack<int> int_stack;
    int_stack.push(5);

    // Declaration of a stack of double-precision numbers
    Stack<double> double_stack;
    double_stack.push(5.0);

Type Requirements

- Templates can be used with user-defined types
- Example: using the stack class to store objects of a user-defined tuple class
- Analysis of the stack implementation shows that the element type must define an assignment operator
- In general, a template implementation places syntactic and semantic requirements on its type arguments

Example

    class IntTupel {
    public:
        // Constructor; with its default arguments it also serves the array T mem[100]
        IntTupel(int a = 0, int b = 0) : s1(a), s2(b) {}
        // Assignment operator
        IntTupel& operator=(const IntTupel& other) {
            s1 = other.s1;
            s2 = other.s2;
            return *this;
        }
        int s1, s2;   // elements of the tuple
    };
    ...
    // Usage
    Stack<IntTupel> s;
    s.push(IntTupel(5, 6));
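The type requirements of the stack template can also be written down so that the compiler checks them. The following is a minimal sketch in standard C++11, not part of the slides: the names CheckedStack and IntPair are hypothetical, and the two static_assert lines reflect the assumption that the stack stores its elements in an array of T and assigns into it.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical fixed-capacity stack, modeled on the Stack<T> shown above.
template <typename T>
class CheckedStack {
    // The array member needs a default constructor, push() needs assignment.
    static_assert(std::is_default_constructible<T>::value,
                  "T must be default-constructible");
    static_assert(std::is_copy_assignable<T>::value,
                  "T must be copy-assignable");
public:
    void push(const T& item) { mem_[pos_++] = item; }
    T pop() { return mem_[--pos_]; }
    std::size_t size() const { return pos_; }
private:
    T mem_[100];
    std::size_t pos_ = 0;   // index of the next free slot
};

// A user-defined pair of integers that meets both requirements.
struct IntPair {
    int s1 = 0;
    int s2 = 0;
    IntPair() = default;
    IntPair(int a, int b) : s1(a), s2(b) {}
};
```

Instantiating CheckedStack with a type that has no default constructor then fails at compile time with the message from the static_assert, instead of with an error deep inside the class body.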

Concepts and Models

- A concept is a collection of requirements for a type
  - Syntactic requirements (e.g. a class defines a method with a specific name)
  - Semantic requirements (a method performs a computation in a specific way)
- A model is a type that fulfills all requirements of a concept
- In Threading Building Blocks, concepts are described by pseudo signatures

Example (CopyConstructible)

    pseudo signature              semantics
    T(const T&)                   Copy constructor
    ~T()                          Destructor
    T* operator&()                Address of T
    const T* operator&() const    Address of const T

Splittable Concept

    pseudo signature      semantics
    X::X(X& x, Split)     Splits x into x and a newly constructed object

- The splitting constructor splits an object into two parts
- The argument Split distinguishes the splitting constructor from the copy constructor
- Used for:
  - Partitioning an index range into two subranges that can be processed in parallel
  - Duplicating function objects that are executed in parallel
- Models:
  - blocked_range and blocked_range2d
  - The body objects of parallel_reduce and parallel_scan
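The splitting constructor can be illustrated without TBB. The sketch below is plain standard C++ under stated assumptions: IndexRange, split_tag, and subdivide are hypothetical names, the grain size is assumed to be at least 1, and the recursion runs sequentially where a parallel library would hand the two halves to different threads.

```cpp
#include <cstddef>
#include <vector>

struct split_tag {};   // hypothetical stand-in for the Split argument

// Half-open index interval [begin, end) with a splitting constructor.
class IndexRange {
public:
    IndexRange(std::size_t begin, std::size_t end, std::size_t grain)
        : begin_(begin), end_(end), grain_(grain) {}

    // Splitting constructor: the new object takes the upper half,
    // and r is shrunk to the lower half.
    IndexRange(IndexRange& r, split_tag)
        : begin_(r.begin_ + (r.end_ - r.begin_) / 2),
          end_(r.end_), grain_(r.grain_) {
        r.end_ = begin_;
    }

    std::size_t begin() const { return begin_; }
    std::size_t end() const { return end_; }
    std::size_t size() const { return end_ - begin_; }
    bool is_divisible() const { return size() > grain_; }

private:
    std::size_t begin_, end_, grain_;
};

// Split r recursively until no piece is divisible; collect the pieces in order.
inline void subdivide(IndexRange r, std::vector<IndexRange>& out) {
    if (!r.is_divisible()) {
        out.push_back(r);
        return;
    }
    IndexRange upper(r, split_tag());   // r now holds the lower half
    subdivide(r, out);
    subdivide(upper, out);
}
```

Because the extra split_tag parameter changes the signature, the compiler-generated copy constructor stays available next to the splitting constructor, which is the distinction the concept asks for.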

Range Concept

- Represents index sets
- Typically used in parallel loops

    pseudo signature                 semantics
    R::R(const R&)                   Copy constructor
    R::~R()                          Destructor
    bool R::empty() const            true if the index range is empty
    bool R::is_divisible() const     true if the index range can be divided
    R::R(R& r, Split)                Subdivision of r into two index sets

- Models: blocked_range and blocked_range2d

Models for Ranges

blocked_range

    template<typename Value> class blocked_range;

- Represents a half-open interval [i, j); i and j have type Value
- Models for Value are built-in types such as int or unsigned int, or pointers to vector elements

    template <typename Value>
    class blocked_range {
    public:
        typedef size_t size_type;
        typedef Value const_iterator;

        blocked_range(Value begin, Value end, size_type grainsize = 1);
        blocked_range(blocked_range& r, split);

        size_type size() const;
        bool empty() const;

        size_type grainsize() const;
        bool is_divisible() const;

        const_iterator begin() const;
        const_iterator end() const;
    };

Concept of Value

    pseudo signature                              semantics
    Value::Value(const Value&)                    Copy constructor
    Value::~Value()                               Destructor
    operator-(const Value& i, const Value& j)     Number of elements in the range [i, j)
    operator+(const Value& i, size_t k)           k-th value after i

Initialization of TBB

Example

    #include "tbb/task_scheduler_init.h"
    #include <cstdlib>
    using namespace tbb;

    int main() {
        task_scheduler_init init;
        ...
        return EXIT_SUCCESS;
    }

- Each program requires a tbb::task_scheduler_init object
- After initialization, the threads are started and wait for work
- An additional parameter can specify the number of threads
  - Example: task_scheduler_init init(8) creates 8 threads
- The threads stay alive as long as the task_scheduler_init object exists; when the object is destroyed, the threads are destroyed as well

Parallel Constructs

Template for Parallel Loops

parallel_for

    template<typename Range, typename Body>
    void parallel_for(const Range& range, const Body& body);

- Parallel iteration over a range object
- The range object is subdivided into parts
- For each part, operator() of the body object is called
- An additional version of parallel_for takes a partitioner as third argument
- Requirements for the body:

    pseudo signature                          semantics
    Body::Body(const Body&)                   Copy constructor
    Body::~Body()                             Destructor
    void Body::operator()(Range& r) const     Applies the body to the range r

Operation of parallel_for

- parallel_for subdivides the index range recursively until the call to is_divisible() returns false
- For each part of the index range, the body object is replicated and applied to that part

Example

    class DoubleAll {
        int* input;
    public:
        DoubleAll(int* _input) : input(_input) {}
        void operator()(const blocked_range<int>& range) const {
            for (int i = range.begin(); i != range.end(); ++i)
                input[i] *= 2;
        }
    };

    void ParallelDoubleAll(int* input, size_t n) {
        DoubleAll da(input);
        parallel_for(blocked_range<int>(0, n, 1000), da);
    }

Reduction Operations

parallel_reduce

    template<typename Range, typename Body>
    void parallel_reduce(const Range& range, Body& body);

- Builds a single object by applying a reduction operator to a set of objects
- Used to compute e.g. the sum, minimum, or maximum of vector elements
- An additional version takes a partitioner
- The reduction operator should be associative
- Requirements for the body:

    pseudo signature                    semantics
    Body::Body(Body&, split)            Splitting constructor
    Body::~Body()                       Destructor
    void Body::operator()(Range& r)     Reduction over the elements of the subrange r
    void Body::join(Body& rhs)          Combines the value of rhs with the value of *this

Operation of parallel_reduce

- The range object is subdivided recursively into subranges until a call to is_divisible() returns false
- The body object:
  - is replicated for each subrange
  - has its operator() applied to each subrange
  - stores the value of the reduction over a subrange
- Intermediate results are combined with join

Example

    class Sum {
    public:
        float* array;
        float value;

        Sum(float* _array) : array(_array), value(0) {}
        Sum(Sum& s, split) : array(s.array), value(0) {}

        void operator()(const blocked_range<int>& range) {
            float temp = value;
            for (int i = range.begin(); i != range.end(); ++i)
                temp += array[i];
            value = temp;
        }

        void join(Sum& rhs) { value += rhs.value; }
    };

    float ParallelSum(float* array, size_t n) {
        Sum total(array);
        parallel_reduce(blocked_range<int>(0, n, 1000), total);
        return total.value;
    }
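The interplay of splitting constructor, operator(), and join can be traced in a sequential sketch. The code below is standard C++ only and makes several assumptions that are not in the slides: SumBody, split_tag, and reduce_sketch are hypothetical names, the range is passed as a pair of indices instead of a range object, the grain size is at least 1, and both halves run on the calling thread.

```cpp
#include <cstddef>
#include <vector>

struct split_tag {};   // hypothetical stand-in for the split argument

// Body in the style of the Sum class above: accumulates a partial sum.
struct SumBody {
    const float* data;
    float value;

    explicit SumBody(const float* d) : data(d), value(0.0f) {}
    // Splitting constructor: same input array, fresh accumulator.
    SumBody(SumBody& other, split_tag) : data(other.data), value(0.0f) {}

    // Reduce the half-open index range [begin, end) into value.
    void operator()(std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i != end; ++i) value += data[i];
    }

    // Merge the partial result of another body into this one.
    void join(const SumBody& rhs) { value += rhs.value; }
};

// Sequential stand-in for the recursion: split while the range is larger
// than grain, reduce each half with its own body, then join the results.
template <typename Body>
void reduce_sketch(std::size_t begin, std::size_t end, std::size_t grain,
                   Body& body) {
    if (end - begin <= grain) {
        body(begin, end);
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    Body right(body, split_tag());   // a parallel runtime could run this half elsewhere
    reduce_sketch(begin, mid, grain, body);
    reduce_sketch(mid, end, grain, right);
    body.join(right);
}
```

The order in which partial sums are joined depends on how the range was split, which is why the reduction operator should be associative.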

Partitioner

- Controls the subdivision of range objects and the assignment of range objects to threads
- Used for parallel_for, parallel_reduce, and parallel_scan

simple_partitioner
- Recursive subdivision of range objects until Range::is_divisible() returns false

auto_partitioner
- Does not necessarily subdivide a range object until Range::is_divisible() returns false
- Balances the work between processors by ensuring that the ranges assigned to threads have nearly equal size

affinity_partitioner
- Subdivision similar to auto_partitioner
- When the same range object is iterated over several times, the partitioner assigns subranges to the same threads in every iteration
- Increases cache efficiency if the data fits in cache

parallel_do

    template<typename InputIterator, typename Body>
    void parallel_do(InputIterator first, InputIterator last, const Body& body);

- Iterates sequentially over the elements of a container and applies operator() of the body object to each element
- Particularly useful when the elements of the container are not randomly accessible, e.g. in a list
- An iterator object is required:
  - An iterator is an abstract interface for accessing the elements of a container
  - Iterator objects are defined for STL (Standard Template Library) containers and TBB containers
- The body can also be applied to objects that are generated while the computation proceeds

Requirements for the body:
- void B::operator()(B::argument_type& item, parallel_do_feeder<B::argument_type>& feed) const
  - item: element to which the operator is applied
  - feed: used to store newly created elements
- B::argument_type: type of the elements
- B::argument_type(const B::argument_type&): copy constructor of argument_type
- ~B::argument_type(): destructor of argument_type

Example

    class ListEl {};   // is CopyConstructible

    struct Body {
        typedef ListEl argument_type;
        void operator()(ListEl& c, tbb::parallel_do_feeder<ListEl>& feed) const {
            ListEl new_item = prozess_item(c);
            feed.add(new_item);
        }
    };

    std::list<ListEl> list;
    ...
    tbb::parallel_do(list.begin(), list.end(), Body());

Pipeline

Class definition

    class pipeline {
    public:
        pipeline();
        virtual ~pipeline();
        void add_filter(filter& f);
        void run(size_t max_number_of_live_tokens);
        void clear();
    };

- A pipeline object (class pipeline) uses several pipeline stages f1, ..., fn, called filters in TBB
- Filters are created outside of the pipeline and put into the pipeline by calling pipeline::add_filter()
- The method pipeline::run() starts the pipeline; max_number_of_live_tokens limits the number of items in flight and thereby the number of stages that can run in parallel
- pipeline::clear() removes all filters from the pipeline; after that call the filters can be destroyed
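How items travel through the stages of a pipeline can be pictured with a small sequential sketch. It is written in standard C++ and does not use TBB: Stage, Source, Square, Sink, and run_sequential are hypothetical names, and the sketch assumes that each stage receives the previous stage's result as an untyped pointer and that the first stage signals the end of the input with a null pointer.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stage interface, not TBB's filter class.
struct Stage {
    virtual ~Stage() {}
    // Receives the previous stage's result; returns the input of the next stage.
    virtual void* process(void* item) = 0;
};

// First stage: hands out pointers to the numbers 1..n, then a null pointer.
struct Source : Stage {
    std::vector<int> items;
    std::size_t next = 0;
    explicit Source(int n) {
        for (int i = 1; i <= n; ++i) items.push_back(i);
    }
    void* process(void*) override {
        return next < items.size() ? &items[next++] : nullptr;
    }
};

// Middle stage: squares the value in place.
struct Square : Stage {
    void* process(void* item) override {
        int* p = static_cast<int*>(item);
        *p *= *p;
        return p;
    }
};

// Last stage: accumulates the output; its return value is ignored.
struct Sink : Stage {
    long total = 0;
    void* process(void* item) override {
        total += *static_cast<int*>(item);
        return nullptr;
    }
};

// Drive the stages one item at a time until the first stage signals the end.
inline void run_sequential(const std::vector<Stage*>& stages) {
    for (;;) {
        void* item = stages.front()->process(nullptr);
        if (item == nullptr) break;
        for (std::size_t i = 1; i < stages.size(); ++i)
            item = stages[i]->process(item);
    }
}
```

This driver keeps exactly one item in flight. A parallel pipeline overlaps the stages for different items, and a limit on the number of live items bounds how much of that overlap is possible.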

Class Definition of Filter

    class filter {
    public:
        enum mode { parallel, serial };
        filter(mode);
        bool is_serial() const;
        virtual void* operator()(void* item) = 0;
        virtual ~filter();
    };

- Each filter class has to override the virtual method void* filter::operator()(void*)
- The return value of operator() is passed as the argument to the next pipeline stage
- The first filter f1 generates the data; a return value of NULL tells TBB that there are no more items to process
- The last stage fn should handle the output; the return value of that stage is ignored
- A filter can be marked as a parallel filter: several items are then processed in parallel in that stage

Parallel Sorting

parallel_sort

    template<typename RandomAccessIterator, typename Compare>
    void parallel_sort(RandomAccessIterator begin,
                       RandomAccessIterator end,
                       const Compare& comp);

- Used for sorting a container object
- Unstable: the order of elements with the same key is not preserved
- Deterministic: sorting the same sequence of elements produces the same sorted sequence in every run
- RandomAccessIterator is defined in the STL; it allows random access to elements

Example

    const int N = 100000;
    float b[N];
    ...
    parallel_sort(b, b + N, std::greater<float>());

Additional Algorithm Templates

parallel_scan
- Computes the prefix sum in parallel
- Used e.g. for parallel sorting

parallel_for_each
- Parallel application of a function to the elements of a random-access iterator range

parallel_invoke
- Parallel invocation of up to 10 functions

Synchronization

- Mutex
- Atomic Operations

Scoped Locking Pattern

Motivation: use of exceptions

    void fun1() {
        ...
        throw new Exception();
    }

    void fun2() {
        lock.lock();
        fun1();          // throws, so the next line is never reached
        lock.unlock();   // mutex is not unlocked
    }

    void fun3() {
        try {
            fun2();
        } catch (Exception* e) {
            // exception handling
        }
    }

Problem: if fun1 throws an exception, the lock variable lock is not unlocked.

Solution 1: Modification of fun2

    void fun2() {
        lock.lock();
        try {
            fun1();
        } catch (Exception* e) {
            lock.unlock();
            // exception handling
            return;          // the lock has already been released
        } catch (...) {
            lock.unlock();
            throw;
        }
        lock.unlock();
    }

Disadvantages:
- Unlocking may be forgotten, and not only when exceptions are used
- The complexity of the program text increases
- More programming effort for the programmer

Solution 2: Separation of the lock variable and the locking functionality into two objects

- Mutex: globally visible
- Scoped lock: used for locking the mutex
  - One scoped lock instance exists per thread and mutex
  - Locks the mutex in its constructor
  - Unlocks the mutex in its destructor
- Tip: declaring a scoped lock object at the beginning of a code block (braces { } in C++) keeps the associated mutex locked for the whole code block

Example

    ...
    {
        // Construction of myLock locks the mutex myMutex
        mutex::scoped_lock myLock(myMutex);
        // Computations are protected by myMutex
        ...
        // Unlocking of myMutex
        // (the destructor of myLock is called implicitly)
    }

Mutex Concept

All of the following mutex models have to implement these functions:

    pseudo signature                                semantics
    M()                                             Construction of a mutex object
    ~M()                                            Destruction of a mutex object
    typename M::scoped_lock                         Type of the scoped lock class
    M::scoped_lock()                                Construction of a scoped lock object without locking a mutex
    M::scoped_lock(M& mutex)                        Construction of a scoped lock object and locking of the mutex
    M::~scoped_lock()                               Releases the mutex if it is locked
    M::scoped_lock::acquire(M& mutex)               Locks the mutex
    bool M::scoped_lock::try_acquire(M& mutex)      Tries to lock the mutex; returns false if it is already locked, otherwise true
    M::scoped_lock::release()                       Unlocks the mutex
    static const bool M::is_rw_mutex                true if reader-writer mutex
    static const bool M::is_recursive_mutex         true if recursive mutex
    static const bool M::is_fair_mutex              true if fair mutex
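The scoped locking pattern is also available in the C++11 standard library, which makes it easy to try without TBB. The following sketch uses std::mutex and std::lock_guard as stand-ins for a TBB mutex and its scoped_lock; the names counter_mutex, counter, guarded_increment, and unlocked_after_exception are hypothetical.

```cpp
#include <mutex>
#include <stdexcept>

std::mutex counter_mutex;   // globally visible mutex
int counter = 0;            // data protected by counter_mutex

// Increments the counter; throws while still inside the critical section if asked to.
inline void guarded_increment(bool fail) {
    std::lock_guard<std::mutex> guard(counter_mutex);   // constructor locks
    ++counter;
    if (fail) throw std::runtime_error("failure inside critical section");
}   // destructor of guard unlocks, on normal exit and during stack unwinding

// Returns true if the mutex is free again after a call that threw.
inline bool unlocked_after_exception() {
    try {
        guarded_increment(true);
    } catch (const std::runtime_error&) {
        // the lock has already been released by stack unwinding
    }
    if (!counter_mutex.try_lock()) return false;
    counter_mutex.unlock();
    return true;
}
```

No catch block in guarded_increment is needed to release the lock, which removes the duplicated unlock calls of Solution 1.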

Models Implementing the Mutex Concept

mutex class
- Wrapper for the operating system implementation of locks

recursive_mutex class
- Wrapper for the recursive operating system implementation (e.g. for pthread_mutex_t)
- A recursive lock can be locked several times by one and the same thread
- If a mutex was locked n times by a thread, it has to be unlocked n times by that thread as well

spin_mutex class
- Lock implementation using a busy-waiting loop; uses a flag variable in memory
- Good for short waiting times only, since processor time and memory bandwidth are consumed while waiting
- Unfair implementation: the order of locking requests is ignored

queuing_mutex class
- Implementation using a busy-waiting loop
- Fair implementation: locking requests are served in FIFO order
- The implementation scales

ReaderWriterMutex Concept

Read/write locks:
- Several threads that only read the protected data structure are allowed to read in parallel
- A thread that wants to modify the data structure needs exclusive write access
- The mutex can be locked either by several readers or by one writer
- Additional requirements compared to the mutex concept:

    pseudo signature                                               semantics
    M::scoped_lock(M& mutex, bool write=true)                      Constructs a scoped lock object and locks the mutex
    M::scoped_lock::acquire(M& mutex, bool write=true)             Locks the mutex
    bool M::scoped_lock::try_acquire(M& mutex, bool write=true)    Tries to lock the mutex; returns false if it is already locked, otherwise true
    bool M::scoped_lock::upgrade_to_writer()                       Reader lock -> writer lock
    bool M::scoped_lock::downgrade_to_reader()                     Writer lock -> reader lock

Summary of Locks

    class               scalable        fair            recursive    release of CPU
    mutex               OS dependent    OS dependent    -            x
    recursive_mutex     OS dependent    OS dependent    x            x
    spin_mutex          -               -               -            -
    queuing_mutex       x               x               -            -
    spin_rw_mutex       -               -               -            -
    queuing_rw_mutex    x               x               -            -

    template <typename T>
    struct atomic {
        typedef T value_type;

        value_type fetch_and_add(value_type addend);       // x = x + addend
        value_type fetch_and_increment();                  // x = x + 1
        value_type fetch_and_decrement();                  // x = x - 1
        value_type compare_and_swap(value_type new_value,
                                    value_type comparand); // (*)
        value_type fetch_and_store(value_type new_value);  // swap(x, new_value)
        operator value_type() const;
        value_type operator+=(value_type);
        value_type operator-=(value_type);
        value_type operator++();
        value_type operator--();
    };

- T: integer or pointer type
- The operations are executed atomically
- (*) compare_and_swap:
  - Compares comparand with the value of *this; if they are equal, sets *this = new_value
  - Returns the old value of *this
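The described semantics of compare_and_swap can be reproduced with std::atomic from the C++11 standard library. This is a sketch under that assumption and not TBB's implementation: the free functions compare_and_swap and atomic_multiply are hypothetical helpers built on std::atomic<int>::compare_exchange_strong.

```cpp
#include <atomic>

// If x equals comparand, store new_value; in either case return the old value of x.
inline int compare_and_swap(std::atomic<int>& x, int new_value, int comparand) {
    int expected = comparand;
    // On failure, compare_exchange_strong writes the current value of x into expected.
    x.compare_exchange_strong(expected, new_value);
    return expected;
}

// Typical use: an atomic update that has no dedicated operation,
// built from a retry loop. Returns the value before the update.
inline int atomic_multiply(std::atomic<int>& x, int factor) {
    int old_value = x.load();
    while (compare_and_swap(x, old_value * factor, old_value) != old_value) {
        old_value = x.load();   // another thread changed x in between; retry
    }
    return old_value;
}
```

The caller detects success by comparing the returned old value with comparand, which is the reason the operation returns the old value instead of a flag.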

concurrent_vector

    template<typename T> class concurrent_vector;

Properties:
- Random access to elements (addressed by index)
- The data structure can grow
- After growing, indices and iterators are still valid
- Shrinking is not possible

Selected Methods

Access to elements:
- T& operator[](size_type i): access to the i-th element without index checking
- T& at(size_type i): access to the i-th element; throws std::out_of_range when the index is invalid
- T& front() and T& back(): access to the first or last element

Enlargement:
- size_type grow_by(size_type delta, const T& t=T()): enlargement by delta elements
- void grow_to_at_least(size_type n): enlargement to at least n elements
- size_t push_back(const T& val): appends the value val at the end; returns the number of elements stored

Selected Methods (Continued)

Memory:
- size_type size(): number of elements stored
- bool empty(): returns size() == 0
- size_type capacity(): maximum number of elements before new memory is allocated
- size_type max_size(): maximum number of elements

Iterators and Ranges:
- iterator begin() and iterator end(): random-access iterators for visiting the vector elements in increasing order of indices
- reverse_iterator rbegin() and reverse_iterator rend(): random-access iterators for visiting the vector elements in reverse order
- range_type range(int grainsize): range object for the vector

concurrent_hash_map

    template<typename Key, typename T, typename HashCompare>
    class concurrent_hash_map;

- Hash table for the storage of key-value pairs with parallel access
- Key: type of the keys; T: type of the values
- HashCompare: class for mapping keys to integer values
- Concept of HashCompare:

    pseudo signature                                         semantics
    HashCompare::HashCompare(const HashCompare&)             Copy constructor
    HashCompare::~HashCompare()                              Destructor
    bool HashCompare::equal(const Key& j, const Key& k)      true if j and k are equal
    size_t HashCompare::hash(const Key& k) const             Maps k to an integer

Conditions:
- i and j have type Key; h is an object that implements the concept HashCompare
- If h.equal(i,j) is true, then h.hash(i) == h.hash(j) must hold

Element Access

- An accessor object (proxy) allows concurrent access to key-value pairs
- An accessor object uses an implicit lock for each key-value pair
- Construction of an accessor object: locking of the corresponding key-value pair
- Destruction of the accessor object: unlocking of the implicit lock
- There are two different accessor types:

    type              access        lock
    const_accessor    read          read lock
    accessor          read/write    read/write lock

const_accessor

    template <typename Key, typename T,
              typename HashCompare, typename A>
    class concurrent_hash_map<Key, T, HashCompare, A>::const_accessor {
        ...
        typedef const std::pair<const Key, T> value_type;

        bool empty() const;                      // true if no element is held
        const value_type& operator*() const;     // reference to the entry
        const value_type* operator->() const;    // pointer to the entry
        void release();                          // unlocks the implicit lock
    };

Selected Methods

- size_type count(const Key& key) const: returns one if key is present, zero otherwise
- bool find(accessor& res, const Key& key): searches for key; if present, returns the entry in res and locks the entry with a write lock
- bool insert(accessor& res, const Key& key): similar to find; difference: if the entry is not present, a new key-value pair is created and inserted with pair<Key,T>(key, T())
- bool erase(const Key& key): searches for key; if present, deletes it

Iteration over elements:
1. by using iterator begin() and iterator end()
2. by using a range object returned by range_type range(size_t grainsize)

Example: Compute the frequency of words

    struct MyHashCompare {
        static size_t hash(const string& x) {
            size_t h = 0;
            for (const char* s = x.c_str(); *s; s++)
                h = (h * 17) ^ *s;
            return h;
        }
        static bool equal(const string& x, const string& y) {
            return x == y;
        }
    };

    typedef concurrent_hash_map<string, int, MyHashCompare> StringTable;

Example

    struct Tally {
        StringTable& table;
        Tally(StringTable& _table) : table(_table) {}
        void operator()(const blocked_range<string*>& r) const {
            for (string* p = r.begin(); p != r.end(); ++p) {
                StringTable::accessor a;
                table.insert(a, *p);   // finds the entry or creates it with count 0
                a->second += 1;        // the entry stays locked while a exists
            }
        }
    };

    void CountOccurrences(string* data, int nitems) {
        task_scheduler_init init;
        StringTable table;
        const size_t grainsize = 1000;   // chunk size; value chosen for illustration
        parallel_for(blocked_range<string*>(data, data + nitems, grainsize),
                     Tally(table));
        for (StringTable::iterator i = table.begin(); i != table.end(); ++i)
            cout << i->first << " " << i->second << endl;
    }
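The counting logic of the example can be tried sequentially with the standard library. The sketch below assumes std::unordered_map as a single-threaded stand-in for concurrent_hash_map; WordHash, WordEqual, WordTable, and count_words are hypothetical names, and the hash follows the scheme of MyHashCompare with the character converted to unsigned char first.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hash and equality in the style of MyHashCompare, split into the two
// function objects that std::unordered_map expects.
struct WordHash {
    std::size_t operator()(const std::string& x) const {
        std::size_t h = 0;
        for (const char* s = x.c_str(); *s; ++s)
            h = (h * 17) ^ static_cast<unsigned char>(*s);
        return h;
    }
};

struct WordEqual {
    bool operator()(const std::string& x, const std::string& y) const {
        return x == y;
    }
};

typedef std::unordered_map<std::string, int, WordHash, WordEqual> WordTable;

// Sequential counterpart of the Tally loop: find or insert, then increment.
inline WordTable count_words(const std::vector<std::string>& words) {
    WordTable table;
    for (const std::string& w : words)
        table[w] += 1;   // operator[] creates a missing entry with value 0
    return table;
}
```

Since the hash is computed from the same characters that the equality test compares, equal keys always produce the same hash value, as the HashCompare conditions require. The sequential version needs no accessor because no other thread can touch an entry during the increment.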

concurrent_queue

    template<typename T> class concurrent_queue;

- FIFO queue
- Elements can be inserted and removed concurrently
- Limited capacity
- The implementation uses locks
- Busy waiting on some (blocking) operations
- Important methods:
  - void push(const T& source): inserts an element at the end
  - void pop(T& destination): removes the element at the beginning and returns it in destination; blocks if the queue is empty
  - bool pop_if_present(T& destination): removes and returns the element at the beginning if one is present; does not block
  - size_type size() const: number of elements stored; if the queue is empty, returns the number of waiting threads as a negative number
  - size_t capacity() const: returns the maximum capacity
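The difference between a blocking pop and pop_if_present is easiest to see in a small lock-based queue. The following standard C++ sketch is not TBB's implementation: BoundedQueue and push_if_not_full are hypothetical, only the non-blocking operations are shown, and one std::mutex protects the whole queue.

```cpp
#include <cstddef>
#include <mutex>
#include <queue>

// Hypothetical lock-based FIFO with a capacity limit.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    // Non-blocking insert at the end; returns false when the queue is full.
    bool push_if_not_full(const T& source) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.size() >= capacity_) return false;
        items_.push(source);
        return true;
    }

    // Non-blocking removal from the beginning: returns false and leaves
    // destination untouched when the queue is empty.
    bool pop_if_present(T& destination) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty()) return false;
        destination = items_.front();
        items_.pop();
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return items_.size();
    }

    std::size_t capacity() const { return capacity_; }

private:
    mutable std::mutex mutex_;   // mutable so that size() can lock it
    std::queue<T> items_;
    const std::size_t capacity_;
};
```

A blocking pop would have to wait inside the function until another thread pushes an element, either by spinning or by sleeping on a condition variable.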

Task-Programming

- A task is composed of data and the code that uses the data for a computation
- Tasks can be executed in parallel
- Tasks can be divided into subtasks; the parent-child relationship creates a tree of tasks
- Child tasks should be independent, so that they can be computed on different cores
- The programmer defines the subdivision
- The scheduler component within TBB manages the order of computation
- Example algorithms:
  - Linear algebra (matrix multiplication, matrix decomposition)
  - Sorting (merge sort, quicksort)
  - Search

Split-Join

- Decomposition of a task into subtasks: split operation
- Waiting for the completion of the children: join operation

Task depth
- Each task implicitly knows its task depth
- The task depth of a child is one greater than the task depth of its parent
- The root task has task depth 0

Reference counter
- Each task has a reference counter
- The reference counter counts the number of existing children
- When the reference counter reaches zero, the task is deleted and the reference counter of its parent is decremented

Split/join parallelism can be implemented in two ways:
- Blocking
- Continuation-Passing

Blocking

    task* T::execute() {
        if (/* no further division possible */) {
            /* sequential computation */
        } else {
            set_ref_count(k + 1);
            task& tk = *new(allocate_child()) T(...); spawn(tk);
            ...
            task& t2 = *new(allocate_child()) T(...); spawn(t2);
            task& t1 = *new(allocate_child()) T(...);
            spawn_and_wait_for_all(t1);
        }
        return NULL;
    }

Explanation:
- T inherits from the class task and reimplements the method execute()
- execute() controls the subdivision into tasks
- Steps:
  - Allocation of the task objects
  - set_ref_count() initializes the reference counter to the number of children + 1
  - spawn() marks a task as ready for execution
  - spawn_and_wait_for_all() waits until the reference counter reaches 1
- Important: call set_ref_count() before spawn()
- execute() may return a task, which is then executed immediately

Example (Blocking)
1 struct Tree { int val ; Tree * left ,* right ; } 2 class SumTask : public Task { 3 int * sum ; 4 Tree * tree ; 5 6 SumTask ( Tree * _tree , int * _sum ) : tree ( _tree ) , sum ( _sum ) {}; 7 8 task * execute () { 9 SumTask *a ,* b ; 10 int ref =1 , x =0 , y =0; 11 if ( tree - > right != NULL ) { 12 a = new ( alloc ate_child ()) SumTask ( tree - > right ,& x ); 13 ref ++; } 14 if ( tree - > left != NULL ) { 15 b = new ( alloc ate_child ()) SumTask ( tree - > left ,& y ); 16 ref ++; } 17 if ( ref > 1) { 18 set_ref_count ( ref ); 19 if ( tree - > right != NULL ) spawn (* a ); 20 if ( tree - > left != NULL ) spawn (* b ); 21 wait_for_all (); } 22 * sum = tree - > val + x + y ; 23 } 24 return NULL ; 25 } }

Problems with Blocking

Problems:
Local variables of task::execute() remain on the stack of the executing OS thread while task::spawn_and_wait_for_all() is running.
Task stealing in conjunction with blocking may result in stack growth; remember that the stack size is limited.
The scheduler tries to limit the stack growth by choosing only ready tasks with a task depth higher than that of the last blocking task. Consequence: limited parallelism.

Solution:
Instead of calling task::spawn_and_wait_for_all(), the method task::execute() ends.
The computation that uses the results of the child tasks is moved into a continuation task.
The continuation task is executed after all children have finished.
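The idea behind this solution can be sketched without TBB. The following single-threaded model in standard C++ is an illustration only: Node, Continuation, child_done, sum_task and sum_tree are hypothetical names, and a vector serves as ready pool. Each task returns right after creating its children; the last child to finish triggers the continuation, so the run loop stays flat and no task waits on the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

struct Node { int val; Node* left; Node* right; };

// Hypothetical continuation: runs once its last child has finished.
struct Continuation {
    int pending;            // children still outstanding (the reference counter)
    int x, y;               // results delivered by the children
    int node_val;           // value of the node that created the continuation
    int* out;               // where the combined result is written
    Continuation* parent;   // continuation to notify afterwards (may be null)
};

using Work = std::function<void()>;

// Signals one finished child; runs every continuation whose counter hits zero.
void child_done(Continuation* c) {
    while (c != nullptr && --c->pending == 0) {
        *c->out = c->node_val + c->x + c->y;
        Continuation* up = c->parent;
        delete c;
        c = up;
    }
}

// Task body: creates children and returns without waiting for them.
void sum_task(Node* n, int* out, Continuation* parent, std::vector<Work>& ready) {
    if (n->left == nullptr && n->right == nullptr) {   // leaf: no continuation needed
        *out = n->val;
        child_done(parent);
        return;
    }
    Continuation* c = new Continuation{0, 0, 0, n->val, out, parent};
    c->pending = (n->left != nullptr) + (n->right != nullptr);
    if (n->left)  ready.push_back([=, &ready] { sum_task(n->left,  &c->x, c, ready); });
    if (n->right) ready.push_back([=, &ready] { sum_task(n->right, &c->y, c, ready); });
}

int sum_tree(Node* root) {
    int result = 0;
    std::vector<Work> ready;
    ready.push_back([&] { sum_task(root, &result, nullptr, ready); });
    while (!ready.empty()) {               // flat loop, no nested waiting
        Work w = std::move(ready.back());
        ready.pop_back();
        w();
    }
    return result;
}
```

In a multithreaded scheduler the pending counter would have to be atomic; the sketch leaves that out to keep the control flow visible.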


Continuation-Passing
task* T::execute() {
    if (there is no further division possible) {
        /* sequential computation */
        return NULL;
    } else {
        set_ref_count(k);
        recycle_as_continuation();       // this task runs again after the children
        task& tk = *new(allocate_child()) T(...); spawn(tk);
        ...
        task& t2 = *new(allocate_child()) T(...); spawn(t2);
        task& t1 = *new(allocate_child()) T(...);
        return &t1;                      // t1 is not spawned; it is executed next
    }
}

Example (Continuation-Passing)
class SumContTask : public task {
public:
    int* sum;
    int x, y;
    SumContTask(int* _sum) : sum(_sum), x(0), y(0) {}
    // Runs after both children have finished; adds their results.
    task* execute() { *sum += x + y; return NULL; }
};

class SumTask : public task {
    int* sum;
    Tree* tree;
public:
    SumTask(Tree* _tree, int* _sum) : sum(_sum), tree(_tree) { *sum = tree->val; }

    task* execute() {
        // Leaf: the constructor already stored the value, no continuation needed.
        if (tree->left == NULL && tree->right == NULL) return NULL;

        SumTask *a = NULL, *b = NULL;
        int ref = 0;
        SumContTask* c = new(allocate_continuation()) SumContTask(sum);
        if (tree->right != NULL) {
            a = new(c->allocate_child()) SumTask(tree->right, &c->x);
            ref++;
        }
        if (tree->left != NULL) {
            b = new(c->allocate_child()) SumTask(tree->left, &c->y);
            ref++;
        }
        c->set_ref_count(ref);           // counter belongs to the continuation
        if (a != NULL) spawn(*a);
        if (b != NULL) spawn(*b);
        return NULL;
    }
};

In the example there is no further computation after the child tasks have been spawned.
From the algorithm's point of view there is no need for a continuation task.
The internals of TBB, however, require a continuation task.
recycle_as_continuation() marks the current task (the parent of the spawned children) as the continuation task.

Additional possibility: allocating a dedicated continuation task (see the Continuation-Passing example with SumContTask).


Important Methods of the Class Task

void wait_for_all();                             Waits for the children to finish
void spawn(task& child);                         Marks child for execution
void spawn(task_list& list);                     Marks a list of children for execution
void spawn_and_wait_for_all(task& child);        Marks child for execution and waits for the children
void spawn_and_wait_for_all(task_list& list);    Marks the children in list for execution and waits for the children
depth_type depth();                              Returns the task depth
void set_depth(depth_type new_depth);            Sets the task depth
void add_to_depth(int delta);                    Increments the task depth
int ref_count() const;                           Returns the reference counter
void set_ref_count(int count);                   Sets the reference counter
void recycle_as_continuation();                  Recycles the task as a continuation task
void recycle_as_child_of(task& parent);          Recycles the task as a child of parent
void recycle_to_reexecute();                     Recycles the task for re-execution

Initialization of the Task Scheduler

Example
    int ParallelSum(Tree* tree) {
        int sum = 0;
        SumTask& a = *new(task::allocate_root()) SumTask(tree, &sum);
        task::spawn_root_and_wait(a);
        return sum;
    }

The root task starts the task computation.
The root task has to be allocated with new(task::allocate_root()).
The result is stored in sum.
The static method task::spawn_root_and_wait(task& root) executes the root task and waits for its completion.
The static method task::spawn_root_and_wait(task_list& root) can be used for executing a list of root tasks.
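When experimenting with a task-based tree sum such as ParallelSum, a sequential reference is useful for checking the result. The following is a minimal sketch in plain C++ that reuses the Tree struct from the slides; SequentialSum is a hypothetical helper and not part of TBB.

```cpp
#include <cstddef>

struct Tree { int val; Tree *left, *right; };

// Hypothetical helper: sequential reference for the tree sum.
// An empty subtree contributes 0.
int SequentialSum(const Tree* tree) {
    if (tree == NULL) return 0;
    return tree->val + SequentialSum(tree->left) + SequentialSum(tree->right);
}
```

For any tree, the value returned by the task-based version should equal SequentialSum of the same root.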


Execution Orders

Depth-first execution: small memory footprint, good cache locality, no parallelism
Breadth-first execution: high memory footprint, poor cache locality, high parallelism

Ready-Pool

Each OS thread manages a ready pool.
Organization of the ready pool:
Per task depth there is a list of tasks that are ready to execute.
The lists are managed by an array; the task depth is the index.
New tasks are stored at the beginning of the list corresponding to their task depth and are removed from the beginning of that list (LIFO).
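The ready-pool layout described here can be sketched as an array of lists indexed by task depth. This is an illustrative model in standard C++ only, not the TBB implementation; ToyTask, ReadyPool, push and pop_deepest are hypothetical names, and the choice to serve the deepest list first is taken from the scheduler's execution order.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical task record: only what the pool needs.
struct ToyTask { int id; std::size_t depth; };

// Sketch of a per-thread ready pool: one LIFO list per task depth.
class ReadyPool {
    std::vector<std::list<ToyTask>> by_depth_;   // index == task depth
public:
    // New tasks go to the beginning of the list for their depth.
    void push(const ToyTask& t) {
        if (t.depth >= by_depth_.size()) by_depth_.resize(t.depth + 1);
        by_depth_[t.depth].push_front(t);
    }

    // Takes the task at the beginning of the deepest non-empty list (LIFO).
    // Returns false if the pool is empty.
    bool pop_deepest(ToyTask& out) {
        for (std::size_t d = by_depth_.size(); d-- > 0;) {
            if (!by_depth_[d].empty()) {
                out = by_depth_[d].front();
                by_depth_[d].pop_front();
                return true;
            }
        }
        return false;
    }
};
```

Taking the newest task of the deepest list keeps execution close to depth-first order, which matches the small memory footprint and good cache locality noted for that order.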


Operation of the task-scheduler

Tasks are executed in the following order:
1. The task returned by task::execute().
2. The task that is the parent of the last executed task.
3. A task from the list with the highest task depth.
4. A task with an affinity for that thread.
5. A task with the lowest depth from the ready pool of another thread (task stealing).
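The five rules can be read as a simple priority chain. The sketch below models that chain in standard C++ with integer task ids; it is an assumption-laden illustration, not the TBB scheduler. ToyWorker, next_task and kNone are hypothetical names, and which end of a list a thief takes from is a choice of this sketch that the slides do not specify.

```cpp
#include <cstddef>
#include <vector>

const int kNone = -1;                        // "no task" marker (assumption)

// Hypothetical per-thread scheduler state, tasks are plain ids.
struct ToyWorker {
    int returned_by_execute = kNone;         // rule 1
    int ready_parent = kNone;                // rule 2
    std::vector<std::vector<int>> pool;      // rules 3 and 5: pool[depth], newest at back
    std::vector<int> affinity_box;           // rule 4
};

// Reads a slot and clears it.
static int take_slot(int& slot) { int t = slot; slot = kNone; return t; }

// Removes and returns the newest entry of a non-empty list.
static int take_back(std::vector<int>& v) { int t = v.back(); v.pop_back(); return t; }

// Picks the next task for self, stealing from victim only as a last resort.
int next_task(ToyWorker& self, ToyWorker& victim) {
    if (self.returned_by_execute != kNone) return take_slot(self.returned_by_execute);
    if (self.ready_parent != kNone) return take_slot(self.ready_parent);
    for (std::size_t d = self.pool.size(); d-- > 0;)          // deepest list first
        if (!self.pool[d].empty()) return take_back(self.pool[d]);
    if (!self.affinity_box.empty()) return take_back(self.affinity_box);
    for (std::size_t d = 0; d < victim.pool.size(); ++d)      // shallowest list first
        if (!victim.pool[d].empty()) return take_back(victim.pool[d]);
    return kNone;                                             // nothing to do
}
```

Stealing from the shallowest list tends to hand a thief a large piece of work, while the owner keeps working depth-first on its own deepest tasks.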

