OpenMP
Application Programming
Interface
Examples
Version 5.0.0 – November 2019
Source codes for the OpenMP 5.0.0 Examples can be downloaded from GitHub.
Foreword
Introduction
Examples
1 Parallel Execution
1.1 A Simple Parallel Loop
1.2 The parallel Construct
1.3 teams Construct on Host
1.4 Controlling the Number of Threads on Multiple Nesting Levels
1.5 Interaction Between the num_threads Clause and omp_set_dynamic
1.6 Fortran Restrictions on the do Construct
1.7 The nowait Clause
1.8 The collapse Clause
1.9 linear Clause in Loop Constructs
1.10 The parallel sections Construct
1.11 The firstprivate Clause and the sections Construct
1.12 The single Construct
1.13 The workshare Construct
1.14 The master Construct
1.15 The loop Construct
1.16 Parallel Random Access Iterator Loop
1.17 The omp_set_dynamic and omp_set_num_threads Routines
1.18 The omp_get_num_threads Routine
2 OpenMP Affinity
2.1 The proc_bind Clause
2.1.1 Spread Affinity Policy
2.1.2 Close Affinity Policy
2.1.3 Master Affinity Policy
2.2 Task Affinity
2.3 Affinity Display
2.4 Affinity Query Functions
3 Tasking
3.1 The task and taskwait Constructs
3.2 Task Priority
3.3 Task Dependences
3.3.1 Flow Dependence
3.3.2 Anti-dependence
3.3.3 Output Dependence
3.3.4 Concurrent Execution with Dependences
3.3.5 Matrix multiplication
3.3.6 taskwait with Dependences
3.3.7 Mutually Exclusive Execution with Dependences
3.3.8 Multidependences Using Iterators
3.4 The taskgroup Construct
3.5 The taskyield Construct
3.6 The taskloop Construct
3.7 The parallel master taskloop Construct
4 Devices
4.1 target Construct
4.1.1 target Construct on parallel Construct
4.1.2 target Construct with map Clause
4.1.3 map Clause with to/from map-types
4.1.4 map Clause with Array Sections
4.1.5 target Construct with if Clause
4.1.6 target Reverse Offload
4.13 Device Routines
4.13.1 omp_is_initial_device Routine
4.13.2 omp_get_num_devices Routine
4.13.3 omp_set_default_device and omp_get_default_device Routines
4.13.4 Target Memory and Device Pointers Routines
5 SIMD
5.1 simd and declare simd Constructs
5.2 inbranch and notinbranch Clauses
5.3 Loop-Carried Lexical Forward Dependence
6 Synchronization
6.1 The critical Construct
6.2 Worksharing Constructs Inside a critical Construct
6.3 Binding of barrier Regions
6.4 The atomic Construct
6.5 Restrictions on the atomic Construct
6.6 The flush Construct without a List
6.7 Synchronization Based on Acquire/Release Semantics
6.8 The ordered Clause and the ordered Construct
6.9 The depobj Construct
6.10 Doacross Loop Nest
6.11 Lock Routines
6.11.1 The omp_init_lock Routine
6.11.2 The omp_init_lock_with_hint Routine
6.11.3 Ownership of Locks
6.11.4 Simple Lock Routines
6.11.5 Nestable Lock Routines
A.4 Changes from 4.0 to 4.0.1
A.5 Changes from 3.1 to 4.0
Foreword

The OpenMP Examples document has been updated with new features found in the OpenMP 5.0 Specification. The additional examples and updates are referenced in the Document Revision History of the Appendix, Section A.1.
Text describing an example with a 5.0 feature specifically states that the feature support begins in the OpenMP 5.0 Specification. Also, an omp_5.0 keyword has been added to the metadata in the source code. These distinctions are presented to remind readers that a 5.0 compliant OpenMP implementation is necessary to use these features in their codes.
Examples for most of the 5.0 features are included in this document, and incremental releases will become available as more feature examples and updates are submitted and approved by the OpenMP Examples Subcommittee.
Introduction

This collection of programming examples supplements the OpenMP API for Shared Memory Parallelization specifications, and is not part of the formal specifications. It assumes familiarity with the OpenMP specifications, and shares the typographical conventions used in that document.
The OpenMP API specification provides a model for parallel programming that is portable across shared memory architectures from different vendors. Compilers from numerous vendors support the OpenMP API.
The directives, library routines, and environment variables demonstrated in this document allow users to create and manage parallel programs while permitting portability. The directives extend the C, C++ and Fortran base languages with single program multiple data (SPMD) constructs, tasking constructs, device constructs, worksharing constructs, and synchronization constructs, and they provide support for sharing and privatizing data. The functionality to control the runtime environment is provided by library routines and environment variables. Compilers that support the OpenMP API often include a command-line option that activates the interpretation of all OpenMP directives.
The latest source codes for the OpenMP Examples can be downloaded from the sources directory at https://github.com/OpenMP/Examples. The codes for this OpenMP 5.0.0 Examples document have the tag v5.0.0.
Complete information about the OpenMP API and a list of the compilers that support the OpenMP API can be found at the OpenMP.org web site:
http://www.openmp.org
Examples

The following are examples of the OpenMP API directives, constructs, and routines.
C / C++
A statement following a directive is compound only when necessary, and a non-compound statement is indented with respect to a directive preceding it.
C / C++
Each example is labeled as ename.seqno.ext, where ename is the example name, seqno is the sequence number in a section, and ext is the source file extension to indicate the code type and source form. ext is one of the following:
• c – C code,
• cpp – C++ code,
• f – Fortran code in fixed form, and
• f90 – Fortran code in free form.
CHAPTER 1
Parallel Execution

A single thread, the initial thread, begins sequential execution of an OpenMP-enabled program, as if the whole program were in an implicit parallel region consisting of an implicit task executed by the initial thread.
A parallel construct encloses code, forming a parallel region. An initial thread encountering a parallel region forks (creates) a team of threads at the beginning of the parallel region, and joins them (removes from execution) at the end of the region. The initial thread becomes the master thread of the team in a parallel region, with a thread number equal to zero; the other threads are numbered from 1 to the number of threads minus 1. A team may comprise just a single thread.
Each thread of a team is assigned an implicit task consisting of the code within the parallel region. The task that creates a parallel region is suspended while the tasks of the team are executed. A thread is tied to its task; that is, only the thread assigned to the task can execute that task. After completion of the parallel region, the master thread resumes execution of the generating task.
Any task within a parallel region is allowed to encounter another parallel region to form a nested parallel region. The parallelism of a nested parallel region (whether it forks additional threads, or is executed serially by the encountering task) can be controlled by the OMP_NESTED environment variable or the omp_set_nested() API routine with arguments indicating true or false.
The number of threads of a parallel region can be set by the OMP_NUM_THREADS environment variable, the omp_set_num_threads() routine, or on the parallel directive with the num_threads clause. The routine overrides the environment variable, and the clause overrides all. Use the OMP_DYNAMIC environment variable or the omp_set_dynamic() function to specify that the OpenMP implementation may dynamically adjust the number of threads for parallel regions. The default setting for dynamic adjustment is implementation defined. When dynamic adjustment is on and the number of threads is specified, the specified number becomes an upper limit for the number of threads to be provided by the OpenMP runtime.
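As a minimal sketch of this precedence (not part of the numbered example set; the values are chosen for illustration), the num_threads clause below overrides the earlier omp_set_num_threads() call, which in turn overrides any OMP_NUM_THREADS setting:
C / C++
#include <stdio.h>
#include <omp.h>

int main(void)
{
   omp_set_num_threads(4);               // overrides OMP_NUM_THREADS
   #pragma omp parallel num_threads(2)   // clause overrides the routine
   {
      #pragma omp single
      printf("team size = %d\n", omp_get_num_threads()); // typically 2
   }
   return 0;
}
C / C++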
WORKSHARING CONSTRUCTS

A worksharing construct distributes the execution of the associated region among the members of the team that encounter it. There is an implied barrier at the end of the worksharing region (there is no barrier at the beginning). The worksharing constructs are:
• loop constructs: for and do
• sections
• single
• workshare
The for and do constructs (loop constructs) create a region consisting of a loop. A loop controlled by a loop construct is called an associated loop. Nested loops can form a single region when the collapse clause (with an integer argument) designates the number of associated loops to be executed in parallel, by forming a "single iteration space" for the specified number of nested loops. The ordered clause can also control multiple associated loops.
An associated loop must adhere to a "canonical form" (specified in the Canonical Loop Form section of the OpenMP Specifications document), which allows the iteration count (of all associated loops) to be computed before the (outermost) loop is executed. Most common loops comply with the canonical form, including C++ iterator loops.
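As an illustrative sketch (not one of the numbered examples; loop bounds are arbitrary), the collapse clause below forms a single 12-iteration space from two nested loops:
C / C++
#include <stdio.h>

int main(void)
{
   // collapse(2) combines the two associated loops into one
   // iteration space of 3*4 = 12 iterations, divided among threads
   #pragma omp parallel for collapse(2)
   for (int i = 0; i < 3; i++)
      for (int j = 0; j < 4; j++)
         printf("i=%d j=%d\n", i, j);
   return 0;
}
C / C++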
A single construct forms a region in which only one thread (any one of the team) executes the region. The other threads wait at the implied barrier at the end, unless the nowait clause is specified.
The sections construct forms a region that contains one or more structured blocks. Each block of a sections directive is constructed with a section construct, and executed once by one of the threads (any one) in the team. (If only one block is formed in the region, the section construct, which is used to separate blocks, is not required.) The other threads wait at the implied barrier at the end, unless the nowait clause is specified.
The workshare construct is a Fortran feature that consists of a region with a single structured block (section of code). Statements in the workshare region are divided into units of work, and executed (once) by threads of the team.

MASTER CONSTRUCT

The master construct is not a worksharing construct. The master region is executed only by the master thread. There is no implicit barrier (or flush) at the end of the master region; hence the other threads of the team continue execution past the code statements of the master region.
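A minimal sketch (not part of the numbered example set) of the missing barrier at the end of a master region, with an explicit barrier compensating for it:
C / C++
#include <stdio.h>
#include <omp.h>

int main(void)
{
   int flag = 0;
   #pragma omp parallel shared(flag)
   {
      #pragma omp master
      flag = 1;            // executed only by the master thread
      // no implied barrier here: other threads could read flag
      // before the master has written it
      #pragma omp barrier  // explicit barrier makes the write visible
      printf("thread %d sees flag = %d\n", omp_get_thread_num(), flag);
   }
   return 0;
}
C / C++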
The following example is non-conforming because the matching do directive for the end do does not precede the outermost loop:
Example fort_do.2.f
      SUBROUTINE WORK(I, J)
      INTEGER I,J
      END SUBROUTINE WORK

      SUBROUTINE DO_WRONG
      INTEGER I, J

      DO 100 I = 1,10
!$OMP DO
        DO 100 J = 1,10
          CALL WORK(I,J)
100   CONTINUE
!$OMP END DO
      END SUBROUTINE DO_WRONG
In the following example, the barrier at the end of the first workshare region is eliminated with a nowait clause. Threads doing CC = DD immediately begin work on EE = FF when they are done with CC = DD.
Example workshare.2.f
      SUBROUTINE WSHARE2(AA, BB, CC, DD, EE, FF, N)
      INTEGER N
      REAL AA(N,N), BB(N,N), CC(N,N)
      REAL DD(N,N), EE(N,N), FF(N,N)

!$OMP PARALLEL
!$OMP WORKSHARE
      AA = BB
      CC = DD
!$OMP END WORKSHARE NOWAIT
!$OMP WORKSHARE
      EE = FF
!$OMP END WORKSHARE
!$OMP END PARALLEL
      END SUBROUTINE WSHARE2
The following example shows the use of an atomic directive inside a workshare construct. The computation of SUM(AA) is workshared, but the update to R is atomic.
Example workshare.3.f
      SUBROUTINE WSHARE3(AA, BB, CC, DD, N)
      INTEGER N
      REAL AA(N,N), BB(N,N), CC(N,N), DD(N,N)
      REAL R
      R=0
!$OMP PARALLEL
!$OMP WORKSHARE
      AA = BB
!$OMP ATOMIC UPDATE
      R = R + SUM(AA)
      CC = DD
!$OMP END WORKSHARE
!$OMP END PARALLEL
      END SUBROUTINE WSHARE3
Fortran WHERE and FORALL statements are compound statements, made up of a control part and a statement part. When workshare is applied to one of these compound statements, both the control and the statement parts are workshared. The following example shows the use of a WHERE statement in a workshare construct.
Each task gets worked on in order by the threads:
AA = BB then
CC = DD then
EE .ne. 0 then
FF = 1 / EE then
GG = HH
Example workshare.4.f
      SUBROUTINE WSHARE4(AA, BB, CC, DD, EE, FF, GG, HH, N)
      INTEGER N
      REAL AA(N,N), BB(N,N), CC(N,N)
      REAL DD(N,N), EE(N,N), FF(N,N)
      REAL GG(N,N), HH(N,N)

!$OMP PARALLEL
!$OMP WORKSHARE
      AA = BB
      CC = DD
      WHERE (EE .ne. 0) FF = 1 / EE
      GG = HH
!$OMP END WORKSHARE
!$OMP END PARALLEL

      END SUBROUTINE WSHARE4
In the following example, an assignment to a shared scalar variable is performed by one thread in a workshare while all other threads in the team wait.
Example workshare.5.f
      SUBROUTINE WSHARE5(AA, BB, CC, DD, N)
      INTEGER N
      REAL AA(N,N), BB(N,N), CC(N,N), DD(N,N)

      INTEGER SHR

!$OMP PARALLEL SHARED(SHR)
!$OMP WORKSHARE
      AA = BB
      SHR = 1
      CC = DD * SHR
!$OMP END WORKSHARE
!$OMP END PARALLEL

      END SUBROUTINE WSHARE5
The following example contains an assignment to a private scalar variable, which is performed by one thread in a workshare while all other threads wait. It is non-conforming because the private scalar variable is undefined after the assignment statement.
Example workshare.6.f
      SUBROUTINE WSHARE6_WRONG(AA, BB, CC, DD, N)
      INTEGER N
      REAL AA(N,N), BB(N,N), CC(N,N), DD(N,N)

      INTEGER PRI

!$OMP PARALLEL PRIVATE(PRI)
!$OMP WORKSHARE
      AA = BB
      PRI = 1
      CC = DD * PRI
!$OMP END WORKSHARE
!$OMP END PARALLEL

      END SUBROUTINE WSHARE6_WRONG
CHAPTER 2
OpenMP Affinity

OpenMP Affinity consists of a proc_bind policy (thread affinity policy) and a specification of places ("location units" or processors that may be cores, hardware threads, sockets, etc.). OpenMP Affinity enables users to bind computations to specific places. The placement will hold for the duration of the parallel region. However, the runtime is free to migrate the OpenMP threads to different cores (hardware threads, sockets, etc.) within a given place, if two or more cores (hardware threads, sockets, etc.) have been assigned to that place.
Often the binding can be managed without resorting to explicitly setting places. Without the specification of places in the OMP_PLACES variable, the OpenMP runtime will distribute and bind threads using the entire range of processors for the OpenMP program, according to the OMP_PROC_BIND environment variable or the proc_bind clause. When places are specified, the OpenMP runtime binds threads to the places according to a default distribution policy, or to those specified in the OMP_PROC_BIND environment variable or the proc_bind clause.
In the OpenMP Specifications document a processor refers to an execution unit that is enabled for an OpenMP thread to use. A processor is a core when there is no SMT (Simultaneous Multi-Threading) support or SMT is disabled. When SMT is enabled, a processor is a hardware thread (HW-thread). (This is the usual case; but actually, the execution unit is implementation defined.) Processors are numbered sequentially from 0 to the number of cores less one (without SMT), or from 0 to the number of HW-threads less one (with SMT). OpenMP places use the processor number to designate binding locations (unless an "abstract name" is used).
The processors available to a process may be a subset of the system's processors. This restriction may be the result of a wrapper process controlling the execution (such as numactl on Linux systems), compiler options, library-specific environment variables, or default kernel settings. For instance, multiple MPI processes launched on a single compute node will each have a subset of processors, as determined by the MPI launcher or set by MPI affinity environment variables for the MPI library.
Threads of a team are positioned onto places in a compact manner, a scattered distribution, or onto the master's place, by setting the OMP_PROC_BIND environment variable or the proc_bind clause to close, spread, or master, respectively. When OMP_PROC_BIND is set to FALSE no binding is enforced; and when the value is TRUE, the binding is implementation defined to a set of places in the OMP_PLACES variable, or to places defined by the implementation if the OMP_PLACES variable is not set.
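A minimal sketch (not one of the numbered examples) of requesting a scattered distribution with the proc_bind clause and querying the resulting placement:
C / C++
#include <stdio.h>
#include <omp.h>

int main(void)
{
   // spread: distribute the threads over the available places
   #pragma omp parallel proc_bind(spread) num_threads(4)
   printf("thread %d runs on place %d\n",
          omp_get_thread_num(), omp_get_place_num());
   return 0;
}
C / C++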
The OMP_PLACES variable can also be set to an abstract name (threads, cores, sockets) to specify that a place is either a single hardware thread, a core, or a socket, respectively. This description of OMP_PLACES is most useful when the number of threads is equal to the number of hardware threads, cores, or sockets. It can also be used with a close or spread distribution policy when the equality doesn't hold.
The following equivalent place list declarations consist of eight places (which we designate as p0 to p7):
OMP_PLACES="{0,1},{2,3},{4,5},{6,7},{8,9},{10,11},{12,13},{14,15}"
or
OMP_PLACES="{0:2}:8:2"
CHAPTER 3
Tasking

Tasking constructs provide units of work to a thread for execution. Worksharing constructs do this, too (e.g. the for, do, sections, and single constructs); but the work units are tightly controlled by an iteration limit and limited scheduling, or a limited number of sections or single regions. Worksharing was designed with "data parallel" computing in mind. Tasking was designed for "task parallel" computing and often involves non-locality or irregularity in memory access.
The task construct can be used to execute work chunks: in a while loop; while traversing nodes in a list; at nodes in a tree graph; or in a normal loop (with a taskloop construct). Unlike the statically scheduled loop iterations of worksharing, a task is often enqueued, and then dequeued for execution by any of the threads of the team within a parallel region. The generation of tasks can be from a single generating thread (creating sibling tasks), or from multiple generators in a recursive graph or tree traversal. A taskloop construct bundles iterations of an associated loop into tasks, and provides controls similar to those found on the task construct.
Sibling tasks are synchronized by the taskwait construct, and tasks and their descendent tasks can be synchronized by containing them in a taskgroup region. Ordered execution is accomplished by specifying dependences with a depend clause. Also, priorities can be specified as hints to the scheduler through a priority clause.
Various clauses can be used to manage and optimize task generation, to reduce the overhead of execution, and to relinquish control of threads for work balance and forward progress.
Once a thread starts executing a task, it is the designated thread for executing the task to completion, even though it may leave the execution at a scheduling point and return later. The thread is tied to the task. Scheduling points can be introduced with the taskyield construct. With an untied clause any other thread is allowed to continue the task. An if clause with a false expression allows the generating thread to immediately execute the task as an undeferred task. By including the data environment of the generating task in the generated task with the mergeable and final clauses, task generation overhead can be reduced.
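As a small sketch (not one of the numbered examples; the loop count is arbitrary), the taskyield construct below introduces an explicit task scheduling point inside a task:
C / C++
#include <stdio.h>
#include <omp.h>

int main(void)
{
   #pragma omp parallel
   #pragma omp single
   for (int i = 0; i < 8; i++)
      #pragma omp task   // i is firstprivate by default
      {
         printf("task %d, part 1, thread %d\n", i, omp_get_thread_num());
         // explicit task scheduling point: the executing thread may
         // suspend this (tied) task here and run other tasks first
         #pragma omp taskyield
         printf("task %d, part 2, thread %d\n", i, omp_get_thread_num());
      }
   return 0;
}
C / C++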
A complete list of the tasking constructs and details of their clauses can be found in the Tasking Constructs chapter of the OpenMP Specifications, in the OpenMP Application Programming Interface section.
3.1 The task and taskwait Constructs

The following example shows how to traverse a tree-like structure using explicit tasks. Note that the traverse function should be called from within a parallel region for the different specified tasks to be executed in parallel. Also note that the tasks will be executed in no specified order because there are no synchronization directives. Thus, assuming that the traversal will be done in post order, as in the sequential code, is wrong.
C / C++
Example tasking.1.c
struct node {
   struct node *left;
   struct node *right;
};

extern void process(struct node *);

void traverse( struct node *p )
{
   if (p->left)
      #pragma omp task   // p is firstprivate by default
      traverse(p->left);
   if (p->right)
      #pragma omp task   // p is firstprivate by default
      traverse(p->right);
   process(p);
}
C / C++
Fortran
Example tasking.1.f90
RECURSIVE SUBROUTINE traverse ( P )
   TYPE Node
      TYPE(Node), POINTER :: left, right
   END TYPE Node
   TYPE(Node) :: P

   IF (associated(P%left)) THEN
!$OMP TASK   ! P is firstprivate by default
      CALL traverse(P%left)
!$OMP END TASK
   ENDIF
   IF (associated(P%right)) THEN
!$OMP TASK   ! P is firstprivate by default
      CALL traverse(P%right)
!$OMP END TASK
   ENDIF
   CALL process ( P )
END SUBROUTINE
Fortran
In the next example, we force a postorder traversal of the tree by adding a taskwait directive. Now, we can safely assume that the left and right subtrees have been traversed before we process the current node.
Fortran
Example tasking.2.f90
RECURSIVE SUBROUTINE traverse ( P )
   TYPE Node
      TYPE(Node), POINTER :: left, right
   END TYPE Node
   TYPE(Node) :: P
   IF (associated(P%left)) THEN
!$OMP TASK   ! P is firstprivate by default
      CALL traverse(P%left)
!$OMP END TASK
   ENDIF
   IF (associated(P%right)) THEN
!$OMP TASK   ! P is firstprivate by default
      CALL traverse(P%right)
!$OMP END TASK
   ENDIF
!$OMP TASKWAIT
   CALL process ( P )
END SUBROUTINE
Fortran
The following example demonstrates how to use the task construct to process elements of a linked list in parallel. The thread executing the single region generates all of the explicit tasks, which are then executed by the threads in the current team. The pointer p is firstprivate by default on the task construct so it is not necessary to specify it in a firstprivate clause.
C / C++
Example tasking.3.c
typedef struct node node;
struct node {
   int data;
   node * next;
};

void process(node * p)
{
   /* do work here */
}

void increment_list_items(node * head)
{
   #pragma omp parallel
   {
      #pragma omp single
      {
         node * p = head;
         while (p) {
            // p is firstprivate by default
            #pragma omp task
               process(p);
            p = p->next;
         }
      }
   }
}
C / C++
The fib() function should be called from within a parallel region for the different specified tasks to be executed in parallel. Also, only one thread of the parallel region should call fib(), unless multiple concurrent Fibonacci computations are desired.
C / C++
Example tasking.4.c
int fib(int n) {
   int i, j;
   if (n<2)
      return n;
   else {
      #pragma omp task shared(i)
      i=fib(n-1);
      #pragma omp task shared(j)
      j=fib(n-2);
      #pragma omp taskwait
      return i+j;
   }
}
C / C++
Fortran
Example tasking.4.f
      RECURSIVE INTEGER FUNCTION fib(n) RESULT(res)
      INTEGER n, i, j
      IF ( n .LT. 2) THEN
         res = n
      ELSE
!$OMP TASK SHARED(i)
         i = fib( n-1 )
!$OMP END TASK
!$OMP TASK SHARED(j)
         j = fib( n-2 )
!$OMP END TASK
!$OMP TASKWAIT
         res = i+j
      END IF
      END FUNCTION
Fortran
The following example demonstrates a way to generate a large number of tasks with one thread and execute them with the threads in the team.
Fortran
Example tasking.5.f
      real*8 item(10000000)
      integer i

!$omp parallel
!$omp single     ! loop iteration variable i is private
      do i=1,10000000
!$omp task
         ! i is firstprivate, item is shared
         call process(item(i))
!$omp end task
      end do
!$omp end single
!$omp end parallel
      end
Fortran
The following example is the same as the previous one, except that the tasks are generated in an untied task. While generating the tasks, the implementation may reach its limit on unassigned tasks. If it does, the implementation is allowed to cause the thread executing the task generating loop to suspend its task at the task scheduling point in the task directive, and start executing unassigned tasks. If that thread begins execution of a task that takes a long time to complete, the other threads may complete all the other tasks before it is finished.
In this case, since the loop is in an untied task, any other thread is eligible to resume the task generating loop. In the previous examples, the other threads would be forced to idle until the generating thread finishes its long task, since the task generating loop was in a tied task.
C / C++
Example tasking.6.c
#define LARGE_NUMBER 10000000
double item[LARGE_NUMBER];
extern void process(double);
int main() {
   #pragma omp parallel
   {
      #pragma omp single
      {
         int i;
         #pragma omp task untied
         // i is firstprivate, item is shared
         {
            for (i=0; i<LARGE_NUMBER; i++)
               #pragma omp task
                  process(item[i]);
         }
      }
   }
   return 0;
}
C / C++
The following two examples demonstrate that the value of a threadprivate variable cannot be assumed to be unchanged across a task scheduling point: another task executed by the same thread may modify the variable while the current task is suspended.
C / C++
Example tasking.7.c
int tp;
#pragma omp threadprivate(tp)
int var;
void work()
{
   #pragma omp task
   {
      /* do work here */
      #pragma omp task
      {
         tp = 1;
         /* do work here */
         #pragma omp task
         {
            /* no modification of tp */
         }
         var = tp;   //value of tp can be 1 or 2
      }
      tp = 2;
   }
}
C / C++
Fortran
Example tasking.7.f
      module example
      integer tp
!$omp threadprivate(tp)
      integer var
      contains
      subroutine work
!$omp task
      ! do work here
!$omp task
      tp = 1
      ! do work here
!$omp task
      ! no modification of tp
!$omp end task
      var = tp     ! value of var can be 1 or 2
!$omp end task
      tp = 2
!$omp end task
      end subroutine
      end module
Fortran
In the following example, scheduling constraints prohibit a thread in the team from executing a new task that modifies tp while another such task region tied to the same thread is suspended. Therefore, the value written will persist across the task scheduling point.
Fortran
Example tasking.8.f
      module example
      integer tp
!$omp threadprivate(tp)
      integer var
      contains
      subroutine work
!$omp parallel
      ! do work here
!$omp task
      tp = tp + 1
      ! do work here
!$omp task
      ! do work here but don't modify tp
!$omp end task
      var = tp     ! value does not change after write above
!$omp end task
!$omp end parallel
      end subroutine
      end module
Fortran
The following two examples demonstrate how the scheduling rules illustrated in Section 2.11.3 of the OpenMP 4.0 specification affect the usage of locks and critical sections in tasks. If a lock is held across a task scheduling point, no attempt should be made to acquire the same lock in any code that may be interleaved. Otherwise, a deadlock is possible.
In the example below, suppose the thread executing task 1 defers task 2. When it encounters the task scheduling point at task 3, it could suspend task 1 and begin task 2, which will result in a deadlock when it tries to enter critical region 1.
C / C++
Example tasking.9.c
void work()
{
   #pragma omp task
   {   //Task 1
      #pragma omp task
      {   //Task 2
         #pragma omp critical   //Critical region 1
         { /*do work here */ }
      }
      #pragma omp critical   //Critical Region 2
      {
         //Capture data for the following task
         #pragma omp task
         //Task Scheduling Point 1
         { /* do work here */ }
      }
   }
}
C / C++
In the following example, lock is held across a task scheduling point. However, according to the scheduling restrictions, the executing thread can't begin executing one of the non-descendant tasks that also acquires lock before the task region is complete. Therefore, no deadlock is possible.
C / C++
Example tasking.10.c
#include <omp.h>
void work() {
   omp_lock_t lock;
   omp_init_lock(&lock);
   #pragma omp parallel
   {
      int i;
      #pragma omp for
      for (i = 0; i < 100; i++) {
         #pragma omp task
         {
            // lock is shared by default in the task
            omp_set_lock(&lock);
            // Capture data for the following task
            #pragma omp task
            // Task Scheduling Point 1
            { /* do work here */ }
            omp_unset_lock(&lock);
         }
      }
   }
   omp_destroy_lock(&lock);
}
C / C++
Fortran
Example tasking.10.f90
module example
   include 'omp_lib.h'
   integer (kind=omp_lock_kind) lock
   integer i

contains

   subroutine work
      call omp_init_lock(lock)
!$omp parallel
!$omp do
      do i=1,100
!$omp task
         ! Outer task
         call omp_set_lock(lock)   ! lock is shared by
                                   ! default in the task
         ! Capture data for the following task
!$omp task                         ! Task Scheduling Point 1
         ! do work here
!$omp end task
         call omp_unset_lock(lock)
!$omp end task
      end do
!$omp end parallel
      call omp_destroy_lock(lock)
   end subroutine

end module
Fortran
The following two examples demonstrate the use of the mergeable clause. In this first example, the created task accesses the same variable x whether or not the task is merged, because x is shared; hence the program always prints 3.
Fortran
Example tasking.11.f90
subroutine foo()
   integer :: x
   x = 2
!$omp task shared(x) mergeable
   x = x + 1
!$omp end task
!$omp taskwait
   print *, x     ! prints 3
end subroutine
Fortran
This second example shows an incorrect use of the mergeable clause. In this example, the created task will access different instances of the variable x if the task is not merged, as x is firstprivate, but it will access the same variable x if the task is merged. As a result, the behavior of the program is unspecified, and it can print two different values for x depending on the decisions taken by the implementation.
C / C++
Example tasking.12.c
#include <stdio.h>
void foo ( )
{
   int x = 2;
   #pragma omp task mergeable
   {
      x++;
   }
   #pragma omp taskwait
   printf("%d\n",x);   // prints 2 or 3
}
C / C++
The following example shows the use of the final clause and the omp_in_final API call in a recursive binary search program. Once the recursion depth exceeds LIMIT, the final clause makes all deeper tasks included tasks, and the omp_in_final call lets the program skip copying the search state, which is only needed when tasks may execute concurrently.
C / C++
Example tasking.13.c
#include <string.h>
#include <omp.h>
#define LIMIT 3 /* arbitrary limit on recursion depth */
void check_solution(char *);
void bin_search (int pos, int n, char *state)
{
   if ( pos == n ) {
      check_solution(state);
      return;
   }
   #pragma omp task final( pos > LIMIT ) mergeable
   {
      char new_state[n];
      if (!omp_in_final() ) {
         memcpy(new_state, state, pos );
         state = new_state;
      }
      state[pos] = 0;
      bin_search(pos+1, n, state );
   }
   #pragma omp task final( pos > LIMIT ) mergeable
   {
      char new_state[n];
      if (! omp_in_final() ) {
         memcpy(new_state, state, pos );
         state = new_state;
      }
      state[pos] = 1;
      bin_search(pos+1, n, state );
   }
   #pragma omp taskwait
}
C / C++
Fortran
Example tasking.13.f90
recursive subroutine bin_search(pos, n, state)
   use omp_lib
   integer :: pos, n
   character, pointer :: state(:)
   character, target, dimension(n) :: new_state1, new_state2
   integer, parameter :: LIMIT = 3
   if (pos .eq. n) then
      call check_solution(state)
      return
   endif
!$omp task final(pos > LIMIT) mergeable
   if (.not. omp_in_final()) then
      new_state1(1:pos) = state(1:pos)
      state => new_state1
   endif
   state(pos+1) = 'z'
   call bin_search(pos+1, n, state)
!$omp end task
!$omp task final(pos > LIMIT) mergeable
   if (.not. omp_in_final()) then
      new_state2(1:pos) = state(1:pos)
      state => new_state2
   endif
   state(pos+1) = 'y'
   call bin_search(pos+1, n, state)
!$omp end task
!$omp taskwait
end subroutine
Fortran
The following example illustrates the difference between the if and the final clauses. The if clause has a local effect: in the first nest of tasks, the task that has the if clause is undeferred, but the tasks nested inside it are not affected by the clause and are created as usual. In contrast, the final clause affects all task constructs in the final task region: in the second nest of tasks, all tasks created inside the task that has the final clause are included tasks.
Fortran
Example tasking.14.f90
subroutine foo()
   integer i
!$omp task if(.FALSE.)      ! This task is undeferred
!$omp task                  ! This task is a regular task
   do i = 1, 3
!$omp task                  ! This task is a regular task
      call bar()
!$omp end task
   enddo
!$omp end task
!$omp end task
!$omp task final(.TRUE.)    ! This task is a regular task
!$omp task                  ! This task is included
   do i = 1, 3
!$omp task                  ! This task is also included
      call bar()
!$omp end task
   enddo
!$omp end task
!$omp end task
end subroutine
Fortran
3.2 Task Priority

In this example we compute arrays in a matrix through a compute_array routine. Each task has a priority value equal to the value of the loop variable i at the moment of its creation. A higher priority value implies that the task is a candidate to run earlier; the priority clause is only a hint to the task scheduler.
Fortran
Example task_priority.1.f90
subroutine compute_matrix(matrix, M, N)
   implicit none
   integer :: M, N
   real :: matrix(M, N)
   integer :: i
   external compute_array
!$omp parallel private(i)
!$omp single
   do i=1,N
!$omp task priority(i)
      call compute_array(matrix(:, i), M)
!$omp end task
   enddo
!$omp end single
!$omp end parallel
end subroutine compute_matrix
Fortran
3.3.2 Anti-dependence

This example shows an anti-dependence using the depend clause on the task construct.
C / C++
Example task_dep.2.c
#include <stdio.h>
int main()
{
   int x = 1;
   #pragma omp parallel
   #pragma omp single
   {
      #pragma omp task shared(x) depend(in: x)
      printf("x = %d\n", x);
      #pragma omp task shared(x) depend(out: x)
      x = 2;
   }
   return 0;
}
C / C++
Fortran
Example task_dep.2.f90
program example
   integer :: x
   x = 1
!$omp parallel
!$omp single
!$omp task shared(x) depend(in: x)
   print*, "x = ", x
!$omp end task
!$omp task shared(x) depend(out: x)
   x = 2
!$omp end task
!$omp end single
!$omp end parallel
end program
Fortran
The program will always print "x = 1", because the depend clauses enforce the ordering of the tasks. If the depend clauses had been omitted, then the tasks could execute in any order and the program would have a race condition.
3.3.4 Concurrent Execution with Dependences

In this example we show potentially concurrent execution of tasks using multiple flow dependences expressed using the depend clause on the task construct.
C / C++
Example task_dep.4.c
#include <stdio.h>
int main() {
   int x = 1;
   #pragma omp parallel
   #pragma omp single
   {
      #pragma omp task shared(x) depend(out: x)
      x = 2;
      #pragma omp task shared(x) depend(in: x)
      printf("x + 1 = %d. ", x+1);
      #pragma omp task shared(x) depend(in: x)
      printf("x + 2 = %d\n", x+2);
   }
   return 0;
}
C / C++
Fortran
Example task_dep.4.f90
program example
   integer :: x

   x = 1

!$omp parallel
!$omp single

!$omp task shared(x) depend(out: x)
   x = 2
!$omp end task

!$omp task shared(x) depend(in: x)
   print*, "x + 1 = ", x+1, "."
!$omp end task

!$omp task shared(x) depend(in: x)
   print*, "x + 2 = ", x+2, "."
!$omp end task

!$omp end single
!$omp end parallel

end program
Fortran

3.3.5 Matrix multiplication

This example shows a task-based blocked matrix multiplication. Matrices are of NxN elements, and the multiplication is implemented using blocks of BSxBS elements.
Fortran
Example task_dep.5.f90
! Assume BS divides N perfectly
subroutine matmul_depend (N, BS, A, B, C)
   implicit none
   integer :: N, BS, BM
   real, dimension(N, N) :: A, B, C
   integer :: i, j, k, ii, jj, kk
   BM = BS - 1
   do i = 1, N, BS
      do j = 1, N, BS
         do k = 1, N, BS
!$omp task shared(A,B,C) private(ii,jj,kk) &   ! I,J,K are firstprivate by default
!$omp      depend ( in: A(i:i+BM, k:k+BM), B(k:k+BM, j:j+BM) ) &
!$omp      depend ( inout: C(i:i+BM, j:j+BM) )
            do ii = i, i+BM
               do jj = j, j+BM
                  do kk = k, k+BM
                     C(jj,ii) = C(jj,ii) + A(kk,ii) * B(jj,kk)
                  end do
               end do
            end do
!$omp end task
         end do
      end do
   end do
end subroutine
Fortran
3.3.6 taskwait with Dependences

In this subsection, examples illustrate how the depend clause can be applied to a taskwait construct to make the generating task wait for the completion of specific child tasks.
Fortran
Example task_dep.6.f90
subroutine foo()
   implicit none
   integer :: x, y

   x = 0
   y = 2

!$omp task depend(inout: x) shared(x)
   x = x + 1                      !! 1st child task
!$omp end task

!$omp task shared(y)
   y = y - 1                      !! 2nd child task
!$omp end task

!$omp taskwait depend(in: x)      !! 1st taskwait

   print*, "x=", x

   !! Second task may not be finished.
   !! Accessing y here will create a race condition.

!$omp taskwait                    !! 2nd taskwait

   print*, "y=", y

end subroutine foo

program p
   implicit none
!$omp parallel
!$omp single
   call foo()
!$omp end single
!$omp end parallel
end program p
Fortran
In this example the first two tasks are serialized, because a dependence on the first child is produced by x with the in dependence type in the depend clause of the second task. However, the generating task at the first taskwait waits only on the first child task to complete, because a dependence on only the first child task is produced by x with an in dependence type within the depend clause of the taskwait construct. The second taskwait ensures the completion of the second task before y is accessed.
Fortran
Example task_dep.7.f90
subroutine foo()
   implicit none
   integer :: x, y

   x = 0
   y = 2

!$omp task depend(inout: x) shared(x)
   x = x + 1                      !! 1st child task
!$omp end task

!$omp task depend(in: x) depend(inout: y) shared(x, y)
   y = y - x                      !! 2nd child task
!$omp end task

!$omp taskwait depend(in: x)      !! 1st taskwait

   print*, "x=", x

   !! Second task may not be finished.
   !! Accessing y here would create a race condition.

!$omp taskwait                    !! 2nd taskwait

   print*, "y=", y

end subroutine foo

program p
   implicit none
!$omp parallel
!$omp single
   call foo()
!$omp end single
!$omp end parallel
end program p
Fortran
This example is similar to the previous one, except the generating task is directed to also wait for completion of the second task.
Fortran
Example task_dep.8.f90
subroutine foo()
   implicit none
   integer :: x, y

   x = 0
   y = 2

!$omp task depend(inout: x) shared(x)
   x = x + 1                      !! 1st child task
!$omp end task

!$omp task depend(in: x) depend(inout: y) shared(x, y)
   y = y - x                      !! 2nd child task
!$omp end task

!$omp taskwait depend(in: x,y)

   print*, "x=", x
   print*, "y=", y

end subroutine foo

program p
   implicit none
!$omp parallel
!$omp single
   call foo()
!$omp end single
!$omp end parallel
end program p
Fortran
CHAPTER 4
Devices

The target construct consists of a target directive and an execution region. The target region is executed on the default device or the device specified in the device clause.
In OpenMP version 4.0, by default, all variables within the lexical scope of the construct are copied to and from the device, unless the device is the host, or the data exists on the device from a previously executed data-type construct that has created space on the device and possibly copied host data to the device storage.
The constructs that explicitly create storage, transfer data, and free storage on the device are categorized as structured and unstructured. The target data construct is structured. It creates a data region around target constructs, and is convenient for providing persistent data throughout multiple target regions. The target enter data and target exit data constructs are unstructured, because they can occur anywhere and do not support a "structure" (a region) for enclosing target constructs, as does the target data construct.
The map clause is used on target constructs and the data-type constructs to map host data. It specifies the device storage and data movement to and from the device, and controls the storage duration.
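A minimal sketch (not one of the numbered examples; the array and its size are arbitrary) of the to and tofrom directions of the map clause:
C / C++
#include <stdio.h>

int main(void)
{
   int a[4] = {1, 2, 3, 4};
   int sum = 0;
   // "to" copies a to the device on entry; "tofrom" copies sum
   // on entry and copies it back to the host on exit
   #pragma omp target map(to: a) map(tofrom: sum)
   for (int i = 0; i < 4; i++)
      sum += a[i];
   printf("sum = %d\n", sum);   // prints sum = 10
   return 0;
}
C / C++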
There is an important change in the OpenMP 4.5 specification that alters the data model for scalar variables and C/C++ pointer variables. The default behavior for scalar variables and C/C++ pointer variables in a 4.5 compliant code is firstprivate. Example codes that have been updated to reflect this new behavior are annotated with a description of the changes required for correct execution. Often it is a simple matter of mapping the variable as tofrom to obtain the intended 4.0 behavior.
In OpenMP version 4.5 the mechanism for target execution is specified as occurring through a target task. When the target construct is encountered a new target task is generated. The target task completes after the target region has executed and all data transfers have finished.
This new specification does not affect the execution of pre-4.5 code; it is a necessary element for asynchronous execution of the target region when using the new nowait clause introduced in OpenMP 4.5.
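A minimal sketch (not one of the numbered examples) of the target task: with nowait the target region becomes a deferred child task that a later taskwait can wait on:
C / C++
#include <stdio.h>

int main(void)
{
   int x = 0;
   // the nowait clause defers the target task, so the host thread
   // may continue past the construct before the region finishes
   #pragma omp target map(tofrom: x) nowait
   x = 42;

   // independent host work could overlap with the target task here

   #pragma omp taskwait   // wait for the target task to complete
   printf("x = %d\n", x); // prints x = 42
   return 0;
}
C / C++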
4.1 target Construct

In the following example, the usual Fortran approach is used for dynamic memory. The p0, v1, and v2 arrays are allocated in the main program and passed as references from one routine to another. In vec_mult, p1, v3 and v4 are references to the p0, v1, and v2 arrays, respectively.
CHAPTER 5
SIMD

Single instruction, multiple data (SIMD) is a form of parallel execution in which the same operation is performed on multiple data elements independently in hardware vector processing units (VPU), also called SIMD units. The addition of two vectors to form a third vector is a SIMD operation. Many processors have SIMD (vector) units that can simultaneously perform 2, 4, 8 or more executions of the same operation (by a single SIMD unit).
Loops without loop-carried backward dependency (or with dependency preserved using ordered simd) are candidates for vectorization by the compiler for execution with SIMD units. In addition, with state-of-the-art vectorization technology and the declare simd construct extensions for function vectorization in the OpenMP 4.5 specification, loops with function calls can be vectorized as well. The basic idea is that a scalar function call in a loop can be replaced by a vector version of the function, and the loop can be vectorized simultaneously by combining a loop vectorization (simd directive on the loop) and a function vectorization (declare simd directive on the function).
A simd construct states that SIMD operations be performed on the data within the loop. A number of clauses are available to provide data-sharing attributes (private, linear, reduction and lastprivate). Other clauses provide vector length preference/restrictions (simdlen / safelen), loop fusion (collapse), and data alignment (aligned).
The declare simd directive designates that a vector version of the function should also be constructed for execution within loops that contain the function and have a simd directive. Clauses provide argument specifications (linear, uniform, and aligned), a requested vector length (simdlen), and designate whether the function is always/never called conditionally in a loop (inbranch/notinbranch). The latter is for optimizing performance.
Also, the simd construct has been combined with the worksharing loop constructs (for simd and do simd) to enable simultaneous thread execution in different SIMD units.
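A minimal sketch (not one of the numbered examples; the function and data are chosen for illustration) combining loop vectorization and function vectorization:
C / C++
#include <stdio.h>

// request a vector version of add1; fact is uniform (the same
// value is passed for all SIMD lanes)
#pragma omp declare simd uniform(fact)
double add1(double a, double b, double fact)
{
   return a + b + fact;
}

int main(void)
{
   double a[64], b[64], c[64];
   for (int i = 0; i < 64; i++) { a[i] = i; b[i] = 2*i; }

   // the call to add1 can be replaced by its vector version
   #pragma omp simd
   for (int i = 0; i < 64; i++)
      c[i] = add1(a[i], b[i], 1.0);

   printf("c[63] = %f\n", c[63]);
   return 0;
}
C / C++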
5.1 simd and declare simd Constructs

The following example illustrates the basic use of the simd construct to assure the compiler that the loop can be vectorized.
C / C++
Example SIMD.1.c
void star( double *a, double *b, double *c, int n, int *ioff )
{
   int i;
   #pragma omp simd
   for ( i = 0; i < n; i++ )
      a[i] *= b[i] * c[i+ *ioff];
}
C / C++
Fortran
Example SIMD.1.f90
subroutine star(a,b,c,n,ioff_ptr)
   implicit none
   double precision :: a(*),b(*),c(*)
   integer :: n, i
   integer, pointer :: ioff_ptr

!$omp simd
   do i = 1,n
      a(i) = a(i) * b(i) * c(i+ioff_ptr)
   end do

end subroutine
Fortran
CHAPTER 6
Synchronization

The barrier construct is a stand-alone directive that requires all threads of a team (within a contention group) to execute the barrier and complete execution of all tasks within the region, before continuing past the barrier.
The critical construct is a directive that contains a structured block. The construct allows only a single thread at a time to execute the structured block (region). Multiple critical regions may exist in a parallel region, and may act cooperatively (only one thread at a time in all critical regions), or separately (only one thread at a time in each critical region when a unique name is supplied on each critical construct). An optional (lock) hint clause may be specified on a named critical construct to provide the OpenMP runtime guidance in selecting a locking mechanism.
On a finer scale the atomic construct allows only a single thread at a time to have atomic access to a storage location involving a single read, write, update or capture statement, and a limited number of combinations when specifying the capture atomic-clause. The atomic-clause is required for some expression statements, but is not required for update statements. The memory-order clause can be used to specify the degree of memory ordering enforced by an atomic construct. From weakest to strongest, they are relaxed (the default), acquire and/or release clauses (specified with acquire, release, or acq_rel), and seq_cst. Please see the details in the atomic Construct subsection of the Directives chapter in the OpenMP Specifications document.
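A minimal sketch (not one of the numbered examples) of an atomic update protecting a shared counter:
C / C++
#include <stdio.h>

int main(void)
{
   int count = 0;
   #pragma omp parallel
   {
      // each thread's increment of the shared counter is atomic,
      // so no updates are lost ("update" is the default clause)
      #pragma omp atomic update
      count++;
   }
   printf("count = %d\n", count);   // equals the number of threads
   return 0;
}
C / C++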
The ordered construct specifies a structured block in a loop, simd, or loop SIMD region that will be executed in the order of the loop iterations. The ordered construct sequentializes and orders the execution of ordered regions while allowing code outside the region to run in parallel.
Since OpenMP 4.5 the ordered construct can also be a stand-alone directive that specifies cross-iteration dependences in a doacross loop nest. The depend clause uses a sink dependence-type, along with an iteration vector argument (vec), to indicate the iteration that satisfies the dependence. The depend clause with a source dependence-type specifies dependence satisfaction.
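A minimal sketch (not one of the numbered examples; the recurrence is chosen for illustration) of a doacross loop nest with sink and source dependence-types:
C / C++
#include <stdio.h>
#define N 16

int main(void)
{
   int a[N];
   a[0] = 1;
   #pragma omp parallel for ordered(1)
   for (int i = 1; i < N; i++) {
      // wait until iteration i-1 has signaled dependence satisfaction
      #pragma omp ordered depend(sink: i-1)
      a[i] = a[i-1] + 1;
      // signal that this iteration's result is available
      #pragma omp ordered depend(source)
   }
   printf("a[%d] = %d\n", N-1, a[N-1]);   // prints a[15] = 16
   return 0;
}
C / C++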
The flush directive is a stand-alone construct for enforcing consistency between a thread's view of memory and the view of memory for other threads (see the Memory Model chapter of this document for more details). When the construct is used with an explicit variable list, a strong flush that forces a thread's temporary view of memory to be consistent with the actual memory is applied to all listed variables. When the construct is used without an explicit variable list and without a memory-order clause, a strong flush is applied to all locally thread-visible data as defined by the base language, and additionally the construct provides both acquire and release memory ordering semantics. When an explicit variable list is not present and a memory-order clause is present, the construct provides acquire and/or release memory ordering semantics according to the memory-order clause, but no strong flush is performed. A resulting strong flush that applies to a set of variables effectively ensures that no memory (load or store) operation for the affected variables may be reordered across the flush directive.
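A minimal sketch (not one of the numbered examples) of acquire/release synchronization through atomic constructs, where the release write of flag makes the preceding write of data visible to the acquiring thread:
C / C++
#include <stdio.h>
#include <omp.h>

int main(void)
{
   int data = 0, flag = 0;
   #pragma omp parallel num_threads(2)
   {
      if (omp_get_thread_num() == 0) {
         data = 42;
         // release: orders the write of data before the flag write
         #pragma omp atomic write release
         flag = 1;
      } else {
         int f = 0;
         while (f == 0) {
            // acquire: synchronizes with the release write above
            #pragma omp atomic read acquire
            f = flag;
         }
         printf("data = %d\n", data);   // prints data = 42
      }
   }
   return 0;
}
C / C++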
General-purpose routines provide mutual exclusion semantics through locks, represented by lock variables. The semantics allow a task to set, and hence own, a lock until it is unset by the task that set it. A nestable lock can be set multiple times by a task, and is used when code requires nested control of locks. A simple lock can only be set once by the owning task. There are specific calls for the two types of locks, and the variable of a specific lock type cannot be used by the other lock type.
Any explicit task will observe the synchronization prescribed in a barrier construct and an implied barrier. Additional synchronizations are also available for tasks. A task encountering a taskwait will wait for all of its child tasks (siblings of each other) to complete. A taskgroup construct creates a region in which the current task is suspended at the end of the region until all sibling tasks, and their descendants, have completed. Scheduling constraints on task execution can be prescribed by the depend clause to enforce dependences on previously generated tasks. More details on controlling task execution can be found in the Tasking chapter of the OpenMP Specifications document.
Although the following example might work on some implementations, this is also non-conforming:
Fortran
Example atomic_restrict.3.f
      SUBROUTINE ATOMIC_WRONG3
      INTEGER:: I
      REAL:: R
      EQUIVALENCE(I,R)

!$OMP PARALLEL
!$OMP ATOMIC UPDATE
      I = I + 1
      ! incorrect because I and R reference the same location
      ! but have different types
!$OMP END PARALLEL

!$OMP PARALLEL
!$OMP ATOMIC UPDATE
      R = R + 1.0
      ! incorrect because I and R reference the same location
      ! but have different types
!$OMP END PARALLEL

      END SUBROUTINE ATOMIC_WRONG3
Fortran
6.11.1 The omp_init_lock Routine

The following example demonstrates how to initialize an array of locks in a parallel region by using omp_init_lock.
Fortran
Example init_lock.1.f
      FUNCTION NEW_LOCKS()
      USE OMP_LIB        ! or INCLUDE "omp_lib.h"
      INTEGER(OMP_LOCK_KIND), DIMENSION(1000) :: NEW_LOCKS
      INTEGER I

!$OMP PARALLEL DO PRIVATE(I)
      DO I=1,1000
         CALL OMP_INIT_LOCK(NEW_LOCKS(I))
      END DO
!$OMP END PARALLEL DO

      END FUNCTION NEW_LOCKS
Fortran
6.11.2 The omp_init_lock_with_hint Routine

The following example demonstrates how to initialize an array of locks, as in the previous example, but this time using omp_init_lock_with_hint to provide the runtime with a hint about the expected use of the locks. Hints are combined with the + (or |) operator in C/C++, and with the + operator in Fortran.
Fortran
Example init_lock_with_hint.1.f
      FUNCTION NEW_LOCKS()
      USE OMP_LIB        ! or INCLUDE "omp_lib.h"
      INTEGER(OMP_LOCK_KIND), DIMENSION(1000) :: NEW_LOCKS

      INTEGER I

!$OMP PARALLEL DO PRIVATE(I)
      DO I=1,1000
         CALL OMP_INIT_LOCK_WITH_HINT(NEW_LOCKS(I),
     &        OMP_LOCK_HINT_CONTENDED + OMP_LOCK_HINT_SPECULATIVE)
      END DO
!$OMP END PARALLEL DO

      END FUNCTION NEW_LOCKS
Fortran
CHAPTER 7
Data Environment

The OpenMP data environment contains data attributes of variables and objects. Many constructs (such as parallel, simd, task) accept clauses to control the data-sharing attributes of variables referenced in the construct, where data-sharing determines whether the attribute of the variable is shared, is private storage, or has special operational characteristics (as found in the firstprivate, lastprivate, linear, or reduction clause).
The data environment for a device (distinguished as a device data environment) is controlled on the host by data-mapping attributes, which determine the relationship between the data on the host (the original data) and the data on the device (the corresponding data).

DATA-SHARING ATTRIBUTES

Data-sharing attributes of variables can be classified as being predetermined, explicitly determined or implicitly determined.
Certain variables and objects have predetermined attributes. A commonly found case is the loop iteration variable in associated loops of a for or do construct. It has a private data-sharing attribute. Variables with predetermined data-sharing attributes cannot be listed in a data-sharing clause; but there are some exceptions (mainly concerning loop iteration variables).
Variables with explicitly determined data-sharing attributes are those that are referenced in a given construct and are listed in a data-sharing attribute clause on the construct. Some of the common data-sharing clauses are: shared, private, firstprivate, lastprivate, linear, and reduction.
Variables with implicitly determined data-sharing attributes are those that are referenced in a given construct, do not have predetermined data-sharing attributes, and are not listed in a data-sharing attribute clause of an enclosing construct. For a complete list of variables and objects with predetermined and implicitly determined attributes, please refer to the Data-sharing Attribute Rules for Variables Referenced in a Construct subsection of the OpenMP Specifications document.
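A minimal sketch (not one of the numbered examples) showing the three classes side by side:
C / C++
#include <stdio.h>

int main(void)
{
   int a = 0, b = 1;
   // a: explicitly determined (listed in the firstprivate clause)
   // b: implicitly determined (referenced but not listed -> shared)
   // i: predetermined (loop iteration variable -> private)
   #pragma omp parallel for firstprivate(a)
   for (int i = 0; i < 4; i++)
      printf("i=%d a=%d b=%d\n", i, a, b);
   return 0;
}
C / C++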
DATA-MAPPING ATTRIBUTES

The map clause on a device construct explicitly specifies how the list items in the clause are mapped from the encountering task's data environment (on the host) to the corresponding items in the device data environment (on the device). The common list items are arrays, array sections, scalars, pointers, and structure elements (members).
Procedures and global variables have predetermined data mapping if they appear within the list or block of a declare target directive. Also, a C/C++ pointer is mapped as a zero-length array section, as is a C++ variable that is a reference to a pointer.
Without explicit mapping, non-scalar and non-pointer variables within the scope of the target construct are implicitly mapped with a map-type of tofrom. Without explicit mapping, scalar variables within the scope of the target construct are not mapped, but have an implicit firstprivate data-sharing attribute. (That is, the value of the original variable is given to a private variable of the same name on the device.) This behavior can be changed with the defaultmap clause.
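A small sketch (not one of the numbered examples) of these defaults, assuming no defaultmap clause is present:
C / C++
#include <stdio.h>

int main(void)
{
   int s = 1;             // scalar: implicitly firstprivate on the device
   int a[3] = {1, 2, 3};  // array: implicitly mapped tofrom

   #pragma omp target
   {
      s = 2;      // modifies only the device's private copy
      a[0] = 42;  // copied back to the host at the end of the region
   }
   printf("s = %d, a[0] = %d\n", s, a[0]);   // prints s = 1, a[0] = 42
   return 0;
}
C / C++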
The map clause can appear on target, target data and target enter/exit data constructs. The operations of creation and removal of device storage, as well as assignment of the original list item values to the corresponding list items, may be complicated when the list item appears on multiple constructs or when the host and device storage is shared. In these cases the item's reference count, the number of times it has been referenced (+1 on entry and -1 on exit) in nested (structured) map regions and/or accumulative (unstructured) mappings, determines the operation. Details of the map clause and reference count operation are specified in the map Clause subsection of the OpenMP Specifications document.
The following is an example of the use of threadprivate for module variables:
Example threadprivate.6.f
      MODULE INC_MODULE_GOOD3
         REAL, POINTER :: WORK(:)
         SAVE WORK
!$OMP    THREADPRIVATE(WORK)
      END MODULE INC_MODULE_GOOD3

      SUBROUTINE SUB1(N)
      USE INC_MODULE_GOOD3
!$OMP PARALLEL PRIVATE(THE_SUM)
      ALLOCATE(WORK(N))
      CALL SUB2(THE_SUM)
      WRITE(*,*)THE_SUM
!$OMP END PARALLEL
      END SUBROUTINE SUB1

      SUBROUTINE SUB2(THE_SUM)
      USE INC_MODULE_GOOD3
      WORK(:) = 10
      THE_SUM=SUM(WORK)
      END SUBROUTINE SUB2

      PROGRAM INC_GOOD3
      N = 10
      CALL SUB1(N)
      END PROGRAM INC_GOOD3
The following example illustrates the use of threadprivate for static class members. The threadprivate directive for a static class member must be placed inside the class definition.
C++
Example threadprivate.5.cpp
class T {
public:
   static int i;
#pragma omp threadprivate(i)
};
C++
In exceptional cases, loop iteration variables can be made shared, as in the following example:
Fortran
Example fort_loopvar.2.f90
S-1 SUBROUTINE PLOOP_2(A,B,N,I1,I2)
S-2 REAL A(*), B(*)
S-3 INTEGER I1, I2, N
S-4
S-5 !$OMP PARALLEL SHARED(A,B,I1,I2)
S-6 !$OMP SECTIONS
S-7 !$OMP SECTION
S-8 DO I1 = I1, N
S-9 IF (A(I1).NE.0.0) EXIT
S-10 ENDDO
S-11 !$OMP SECTION
S-12 DO I2 = I2, N
S-13 IF (B(I2).NE.0.0) EXIT
S-14 ENDDO
S-15 !$OMP END SECTIONS
S-16 !$OMP SINGLE
S-17 IF (I1.LE.N) PRINT *, 'ITEMS IN A UP TO ', I1, 'ARE ALL ZERO.'
S-18 IF (I2.LE.N) PRINT *, 'ITEMS IN B UP TO ', I2, 'ARE ALL ZERO.'
S-19 !$OMP END SINGLE
S-20 !$OMP END PARALLEL
S-21 END SUBROUTINE PLOOP_2
1 Note, however, that the use of shared loop iteration variables can easily lead to race conditions.
Fortran
1 Example fort_sa_private.4.f
S-1 PROGRAM PRIV_RESTRICT4
S-2 INTEGER I, J
S-3 INTEGER A(100), B(100)
S-4 EQUIVALENCE (A(51), B(1))
S-5
S-6 !$OMP PARALLEL DO DEFAULT(PRIVATE) PRIVATE(I,J) LASTPRIVATE(A)
S-7 DO I=1,100
S-8 DO J=1,100
S-9 B(J) = J - 1
S-10 ENDDO
S-11
S-12 DO J=1,100
S-13 A(J) = J ! B becomes undefined at this point
S-14 ENDDO
S-15
S-16 DO J=1,50
S-17 B(J) = B(J) + 1 ! B is undefined
S-18 ! A becomes undefined at this point
S-19 ENDDO
S-20 ENDDO
S-21 !$OMP END PARALLEL DO ! The LASTPRIVATE write for A has
S-22 ! undefined results
S-23
S-24 PRINT *, B ! B is undefined since the LASTPRIVATE
S-25 ! write of A was not defined
S-26 END PROGRAM PRIV_RESTRICT4
2 Example fort_sa_private.5.f
S-1 SUBROUTINE SUB1(X)
S-2 DIMENSION X(10)
S-3
S-4 ! This use of X does not conform to the
1 The following program is non-conforming because the reduction is on the intrinsic procedure name
2 MAX but that name has been redefined to be the variable named MAX.
3 Example reduction.3.f90
S-1 PROGRAM REDUCTION_WRONG
S-2 MAX = HUGE(0)
S-3 M = 0
S-4
S-5 !$OMP PARALLEL DO REDUCTION(MAX: M)
S-6 ! MAX is no longer the intrinsic so this is non-conforming
S-7 DO I = 1, 100
S-8 CALL SUB(M,I)
S-9 END DO
S-10
S-11 END PROGRAM REDUCTION_WRONG
S-12
S-13 SUBROUTINE SUB(M,I)
S-14 M = MAX(M,I)
S-15 END SUBROUTINE SUB
4 The following conforming program performs the reduction using the intrinsic procedure name MAX
5 even though the intrinsic MAX has been renamed to REN.
6 Example reduction.4.f90
S-1 MODULE M
S-2 INTRINSIC MAX
S-3 END MODULE M
S-4
S-5 PROGRAM REDUCTION3
S-6 USE M, REN => MAX
S-7 N = 0
S-8 !$OMP PARALLEL DO REDUCTION(REN: N) ! still does MAX
S-9 DO I = 1, 100
S-10 N = MAX(N,I)
S-11 END DO
S-12 END PROGRAM REDUCTION3
7 The following conforming program performs the reduction using the intrinsic procedure name MAX
8 even though the intrinsic MAX has been renamed to MIN.
9 Example reduction.5.f90
2 The following example shows how user-defined reductions can be defined for some STL
3 containers. The first declare reduction defines the plus (+) operation for std::vector<int> by
4 making use of the std::transform algorithm. The second and third define the merge (or
5 concatenation) operation for std::vector<int> and std::list<int>. The example shows how a
6 user-defined reduction operation can be applied to specific data types of the STL.
C++
7 Example udr.6.cpp
S-1 #include <algorithm>
S-2 #include <list>
S-3 #include <vector>
S-4
S-5 #pragma omp declare reduction( + : std::vector<int> : \
S-6 std::transform (omp_out.begin(), omp_out.end(), \
S-7 omp_in.begin(), omp_out.begin(), std::plus<int>()))
S-8
S-9 #pragma omp declare reduction( merge : std::vector<int> : \
S-10 omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
S-11
S-12 #pragma omp declare reduction( merge : std::list<int> : \
S-13 omp_out.merge(omp_in))
C++
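As an illustrative follow-on (assumed usage, not shown in the original example), the merge reduction might be used in a worksharing loop as follows. Each implicit task starts from a default-constructed private vector, and the partial vectors are concatenated on completion. Note that the std::transform call above writes its result back through omp_out.begin(); writing through omp_in.end(), as in some older printings, would store past the end of omp_in.

    #include <cstdio>
    #include <vector>

    #pragma omp declare reduction( merge : std::vector<int> : \
        omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

    int main() {
        std::vector<int> v;               // identity: default-constructed vector
        #pragma omp parallel for reduction(merge: v)
        for (int i = 0; i < 100; i++)
            if (i % 7 == 0)
                v.push_back(i);           // fill this thread's private copy
        // note: the concatenation order across threads is unspecified
        printf("found %zu multiples of 7\n", v.size());
        return 0;
    }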
2 Note that the effect of the copyprivate clause on a variable with the allocatable attribute
3 differs from that on a variable with the pointer attribute. The value of A is copied (as if by
4 intrinsic assignment) and the pointer B is copied (as if by pointer assignment) to the corresponding
5 list items in the other implicit tasks belonging to the parallel region.
6 Example copyprivate.4.f
S-1 SUBROUTINE S(N)
S-2 INTEGER N
S-3
S-4 REAL, DIMENSION(:), ALLOCATABLE :: A
S-5 REAL, DIMENSION(:), POINTER :: B
S-6
S-7 ALLOCATE (A(N))
S-8 !$OMP SINGLE
S-9 ALLOCATE (B(N))
S-10 READ (11) A,B
S-11 !$OMP END SINGLE COPYPRIVATE(A,B)
S-12 ! Variable A is private and is
S-13 ! assigned the same value in each thread
S-14 ! Variable B is shared
S-15
S-16 !$OMP BARRIER
S-17 !$OMP SINGLE
S-18 DEALLOCATE (B)
S-19 !$OMP END SINGLE NOWAIT
S-20 END SUBROUTINE S
Fortran
9 In the next example, within the parallel construct, the association name thread_id is associated
10 with the private copy of i. The print statement should output the unique thread number.
11 Example associate.2.f
S-1 program example
S-2 use omp_lib
S-3 integer i
S-4 !$omp parallel private(i)
S-5 i = omp_get_thread_num()
S-6 associate(thread_id => i)
S-7 print *, thread_id ! print private i value
S-8 end associate
S-9 !$omp end parallel
S-10 end program
12 The following example illustrates the effect of specifying a selector name on a data-sharing
13 attribute clause. The associate name u is associated with v and the variable v is specified on the
14 private clause of the parallel construct. The construct association is established prior to the
15 parallel region. The association between u and the original v is retained (see the Data Sharing
16 Attribute Rules section in the OpenMP 4.0 API Specifications). Inside the parallel region, v
17 has the value of -1 and u has the value of the original v.
18 Example associate.3.f90
2 Memory Model
3 OpenMP provides a shared-memory model that allows all threads on a given device shared access
4 to memory. For a given OpenMP region that may be executed by more than one thread or SIMD
5 lane, variables in memory may be shared or private with respect to those threads or SIMD lanes. A
6 variable’s data-sharing attribute indicates whether it is shared (the shared attribute) or private (the
7 private, firstprivate, lastprivate, linear, and reduction attributes) in the data environment of an
8 OpenMP region. While private variables in an OpenMP region are new copies of the original
9 variable (with the same name) that may then be concurrently accessed or modified by their respective
10 threads or SIMD lanes, a shared variable in an OpenMP region is the same as the variable of the
11 same name in the enclosing region. Concurrent accesses or modifications to a shared variable may
12 therefore require synchronization to avoid data races.
13 OpenMP’s memory model also includes a temporary view of memory that is associated with each
14 thread. Two different threads may see different values for a given variable in their respective
15 temporary views. Threads may employ flush operations for the purposes of making their temporary
16 view of a variable consistent with the value of the variable in memory. The effect of a given flush
17 operation is characterized by its flush properties – some combination of strong, release, and
18 acquire – and, for strong flushes, a flush-set.
19 A strong flush will force consistency between the temporary view and the memory for all variables
20 in its flush-set. Furthermore, all strong flushes in a program that have intersecting flush-sets will
21 execute in some total order, and within a thread strong flushes may not be reordered with respect to
22 other memory operations on variables in their flush-set. Release and acquire flushes operate in pairs.
23 A release flush may “synchronize” with an acquire flush, and when it does so the local memory
24 operations that precede the release flush will appear to have been completed before the local
25 memory operations on the same variables that follow the acquire flush.
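A minimal C++ sketch of a release/acquire pairing (illustrative; it assumes the team actually has two threads) uses atomic operations with release and acquire clauses to hand a value from one thread to another:

    #include <cstdio>

    int main() {
        int data = 0;
        int flag = 0;
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            {                                   // producer
                data = 42;                      // plain write ...
                #pragma omp atomic write release
                flag = 1;                       // ... made visible by the release
            }
            #pragma omp section
            {                                   // consumer
                int f = 0;
                while (f == 0) {                // spin until the release is seen
                    #pragma omp atomic read acquire
                    f = flag;
                }
                printf("data = %d\n", data);    // prints 42: the acquire flush
                                                // orders this read after the
                                                // producer's write of data
            }
        }
        return 0;
    }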
26 Flush operations arise from explicit flush directives, implicit flush directives, and also from
27 the execution of atomic constructs. The flush directive forces a consistent view of local
28 variables of the thread executing the flush. When a list is supplied on the directive, only the items
29 (variables) in the list are guaranteed to be flushed. Implied flushes exist at prescribed locations of
1 certain constructs. For the complete list of these locations and associated constructs, please refer to
2 the flush Construct section of the OpenMP Specifications document.
3 In this chapter, examples illustrate how race conditions may arise for accesses to variables with a
4 shared data-sharing attribute when flush operations are not properly employed. A race condition
5 can exist when two or more threads are involved in accessing a variable in which not all of the
6 accesses are reads; that is, a WaR, RaW or WaW condition exists (R=read, a=after, W=write). A
7 RaR does not produce a race condition. In particular, a data race will arise when conflicting
8 accesses do not have a well-defined completion order. The existence of data races in OpenMP
9 programs results in undefined behavior, and so they should be avoided for programs to be
10 correct. The completion order of accesses to a shared variable is guaranteed in OpenMP through a
11 set of memory consistency rules that are described in the OpenMP Memory Consistency section of
12 the OpenMP Specifications document.
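As a small illustrative sketch (not from the Specifications document), the first parallel region below contains a RaW/WaW race on x, while the second resolves it with an atomic update:

    #include <cstdio>

    int main() {
        int x = 0;
        #pragma omp parallel num_threads(2)
        x++;                       // data race (RaW/WaW): undefined behavior

        x = 0;
        #pragma omp parallel num_threads(2)
        {
            #pragma omp atomic     // well-defined completion order
            x++;
        }
        printf("x = %d\n", x);     // 2 when two threads execute the region
        return 0;
    }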
2 Program Control
3 Some specific and elementary concepts of controlling program execution are illustrated in the
4 examples of this chapter. Control can be directly managed with conditional control code (#ifdef
5 blocks testing the _OPENMP macro, and the Fortran sentinel (!$) for conditional compilation). The if
6 clause on some constructs can direct the runtime to ignore or alter the behavior of the construct. Of
7 course, the base-language if statements can be used to control the "execution" of stand-alone
8 directives (such as flush, barrier, taskwait, and taskyield). However, the directives
9 must appear in a block structure, and not as a substatement as shown in examples 1 and 2 of this
10 chapter.
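A short C++ sketch of both mechanisms (illustrative; the threshold is arbitrary) combines the _OPENMP macro with an if clause that serializes small workloads:

    #include <cstdio>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int main() {
        int n = 100;                        // problem size (arbitrary)
        #pragma omp parallel if(n > 1000)   // tiny workloads run on one thread
        {
    #ifdef _OPENMP
            printf("thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
    #else
            printf("compiled without OpenMP\n");
    #endif
        }
        return 0;
    }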
11 CANCELLATION
12 Cancellation (termination) of the normal sequence of execution for the threads in an OpenMP
13 region can be accomplished with the cancel construct. The construct uses a
14 construct-type-clause to set the region-type to activate for the cancellation. That is, inclusion of one
15 of the construct-type-clause names parallel, for, do, sections or taskgroup on the
16 directive line activates the corresponding region. The cancel construct is activated by the first
17 encountering thread, and it continues execution at the end of the named region. The cancel
18 construct is also a cancellation point at which any other thread of the team continues execution at
19 the end of the named region.
20 Also, once the specified region has been activated for cancellation any thread that encounters a
21 cancellation point construct with the same named region (construct-type-clause),
22 continues execution at the end of the region.
23 For an activated cancel taskgroup construct, the tasks that belong to the taskgroup set of the
24 innermost enclosing taskgroup region will be canceled.
25 A task that encounters the cancel taskgroup construct continues execution at the end of its task
26 region. Any task of the taskgroup that has already begun execution will run to completion, unless it
27 encounters a cancellation point; tasks that have not begun execution "may" be discarded as
28 completed tasks.
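The following C++ sketch (a hypothetical search routine, not a numbered example; cancellation must be enabled at run time, for example with OMP_CANCELLATION=true) cancels the remaining tasks of a taskgroup once a match is found:

    #include <cstdio>

    int find_key(const int *a, int n, int key) {
        int pos = -1;
        #pragma omp parallel
        #pragma omp single
        #pragma omp taskgroup
        {
            for (int i = 0; i < n; i++) {
                #pragma omp task shared(pos) firstprivate(i)
                {
                    if (a[i] == key) {
                        #pragma omp atomic write
                        pos = i;
                        #pragma omp cancel taskgroup  // discard remaining tasks
                    }
                }
            }
        }
        return pos;
    }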
1 CONTROL VARIABLES
2 Internal control variables (ICVs) are used by implementations to hold values that control the
3 execution of OpenMP regions. Control (and hence the ICVs) may be set as implementation
4 defaults, or set and adjusted through environment variables, clauses, and API functions. Many of
5 the ICV control values are accessible through API function calls. Also, initial ICV values are
6 reported by the runtime if the OMP_DISPLAY_ENV environment variable has been set to TRUE.
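An illustrative C++ sketch (not from this document) touches the same nthreads-var ICV from two directions, the API and, implicitly, the environment:

    #include <cstdio>
    #include <omp.h>

    int main() {
        // OMP_NUM_THREADS (if set) initialized the nthreads-var ICV at
        // startup; the API call below adjusts the same ICV.
        omp_set_num_threads(4);
        printf("nthreads-var: %d\n", omp_get_max_threads());
        printf("dyn-var:      %d\n", omp_get_dynamic());
        // Setting OMP_DISPLAY_ENV=TRUE in the environment makes the
        // runtime report the initial ICV values at startup.
        return 0;
    }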
7 NESTED CONSTRUCTS
8 Certain combinations of nested constructs are permitted, giving rise to a combined construct
9 consisting of two or more constructs. These can be used when the two (or several) constructs would
10 be used immediately in succession (closely nested). A combined construct can use the clauses of
11 the component constructs without restrictions. A composite construct is a combined construct in
12 which one or more clauses or semantics have a modified or restricted meaning relative to
13 when the constructs are uncombined.
14 Certain nestings are forbidden, and often the reasoning is obvious. Worksharing constructs cannot
15 be nested, and the barrier construct cannot be nested inside a worksharing construct, or a
16 critical construct. Also, target constructs cannot be nested.
17 The parallel construct can be nested, as well as the task construct. The parallel execution in
18 the nested parallel construct(s) is controlled by the OMP_NESTED and
19 OMP_MAX_ACTIVE_LEVELS environment variables, and the omp_set_nested() and
20 omp_set_max_active_levels() functions.
21 More details on nesting can be found in the Nesting of Regions section of the Directives chapter in the
22 OpenMP Specifications document.
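An illustrative C++ sketch (using the API functions named above; note that omp_set_nested is deprecated in 5.0 in favor of the max-active-levels control) enables one level of nesting:

    #include <cstdio>
    #include <omp.h>

    int main() {
        omp_set_max_active_levels(2);            // allow two active levels
                                                 // (older runtimes may also
                                                 // need omp_set_nested(1))
        #pragma omp parallel num_threads(2)
        {
            int outer = omp_get_thread_num();
            #pragma omp parallel num_threads(2)  // nested region
            printf("outer %d, inner %d\n", outer, omp_get_thread_num());
        }
        return 0;
    }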
1 The following example illustrates the use of the cancel construct in error handling. If there is an
2 error condition from the allocate statement, the cancellation is activated. The encountering
3 thread sets the shared variable err and other threads of the binding thread set proceed to the end of
4 the worksharing construct after the cancellation has been activated.
Fortran
5 Example cancellation.1.f90
S-1 subroutine example(n, dim)
S-2 integer, intent(in) :: n, dim(n)
S-3 integer :: i, s, err
S-4 real, allocatable :: B(:)
S-5 err = 0
S-6 !$omp parallel shared(err)
S-7 ! ...
S-8 !$omp do private(s, B)
S-9 do i=1, n
S-10 !$omp cancellation point do
S-11 allocate(B(dim(i)), stat=s)
S-12 if (s .gt. 0) then
S-13 !$omp atomic write
S-14 err = s
S-15 !$omp cancel do
S-16 endif
S-17 ! ...
S-18 ! deallocate private array B
S-19 if (allocated(B)) then
S-20 deallocate(B)
S-21 endif
S-22 enddo
S-23 !$omp end parallel
S-24 end subroutine
Fortran
Fortran
1 Example requires.1.f90
S-1
S-2 module data
S-3 !$omp requires unified_shared_memory
S-4 type,public :: mypoints
S-5 double precision :: res
S-6 double precision :: data(500)
S-7 end type
S-8 end module
S-9
S-10 program main
S-11 use data
S-12 type(mypoints) :: p
S-13 integer :: q=0
S-14
S-15 !$omp target !! no map clauses needed
S-16 q = q + 1 !! q is firstprivate
S-17 call do_something_with_p(p,q)
S-18 !$omp end target
S-19
S-20 write(*,'(f5.0,i5)') p%res, q !! output 1. 0
S-21
S-22 end program
S-23
S-24 subroutine do_something_with_p(p,q)
S-25 use data
S-26 type(mypoints) :: p
S-27 integer :: q
S-28
S-29 p%res = q
S-30 do i=1,size(p%data)
S-31 p%data(i)=q*i
S-32 enddo
S-33
S-34 end subroutine
Fortran
1 – reduction clause for task construct (Section 7.9.2 on page 303)
2 – reduction clause for taskloop construct (Section 7.9.3 on page 306)
3 – reduction clause for taskloop simd construct (Section 7.9.3 on page 306)
4 – Memory Allocators for making OpenMP memory requests with traits (Section 8.2 on
5 page 341)
6 – requires directive specifies required features of implementation (Section 9.5 on page 360)
7 – declare variant directive - for function variants (Section 9.6 on page 362)
8 – metadirective directive - for directive variants (Section 9.7 on page 368)
9 • Included the following additional examples for the 4.x features:
10 – more taskloop examples (Section 3.6 on page 112)
11 – user-defined reduction (UDR) (Section 7.9.4 on page 313)