Parallel Programming and Optimization With Intel Xeon Phi Coprocessors
Andrey Vladimirov, PhD, is Head of HPC Research at Colfax International. His primary interest is the application of modern computing technologies to computationally demanding scientific problems. Prior to
research interests are in the area of physical modeling with HPC clusters,
The authors are sincerely grateful to James Reinders for supervising and directing the creation of this
book, to Albert Lee for his help with editing and error checking, to the specialists at Intel Corporation who
contributed their time and shared with the authors their expertise in programming for the MIC architecture:
Bob Davies, Shannon Cepeda, Pradeep Dubey, Ronald Green, James Jeffers, Taylor Kidd, Rakesh
Krishnaiyer, Chris (CJ) Newburn, Kevin O'Leary, Zhang Zhang, and to a great number of people,
mostly from Colfax International and Intel, who have ensured that gears were turning and bits were
churning during the production of the book, including Rajesh Agny, Mani Anandan, Joe Curley,
Roger Herrick, Richard Jackson, Mike Lafferty, Thomas Lee, Belinda Liviero, Gary Paek, Troy
Porter, Tim Puett, John Rinehimer, Gautam Shah, Manish Shah, Bruce Shiu, Jimmy Tran, Achim
Wengeler, and Desmond Yuen.
Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors

Handbook on the Development and Optimization of Parallel Applications for Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors
While best efforts have been used in preparing this book, the publisher makes no representations or warranties of any kind and assumes no liabilities of any kind with respect to the accuracy or completeness of the contents and specifically disclaims any implied warranties of merchantability or fitness of use for a particular purpose. The publisher shall not be held liable or responsible to any person or entity with respect to any loss or incidental or consequential damages caused, or alleged to have been caused, directly or indirectly, by the information or programs contained herein. No warranty may be created or extended by sales representatives or written sales materials.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance.

Because of the evolutionary nature of technology, knowledge and best practices described at the time of this writing may become outdated or simply inapplicable at a later date. Summaries, strategies, tips and tricks are only recommendations by the publisher, and reading this eBook does not guarantee that one's results will exactly mirror our own results. Every company is different, and the advice and strategies contained herein may not be suitable for your situation. References are provided for informational purposes only and do not constitute endorsement of any websites or other sources.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. All products, computer systems, dates, and figures specified are preliminary based
ISBN: 978-0-9885234-1-8
Contents
Foreword ix
Preface xi
List of Abbreviations xiii
1 Introduction 1
1.1 Intel® Xeon Phi™ Coprocessors and the MIC Architecture . . . . . . . . . . . . . . . . . . 1
1.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 A Drop-in Solution for a Novel Platform . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.6 Intel® Xeon® Processors versus Intel® Xeon Phi™ Coprocessors: Developer Experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Programming Models for Intel® Xeon Phi™ Applications 37
2.1 Native Applications and MPI on Intel® Xeon Phi™ Coprocessors . . . . . . . . . . . . . . 37
2.1.1 Using Compiler Argument -mmic to Compile Native Applications for Intel® Xeon Phi™ Coprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.1.2 Establishing SSH Sessions with Coprocessors . . . . . . . . . . . . . . . . . . . . . 38
2.1.3 Running Native Applications with micnativeloadex . . . . . . . . . . . . . . . 39
2.1.4 Monitoring the Coprocessor Activity with micsmc . . . . . . . . . . . . . . . . . . 40
2.1.5 Compiling and Running MPI Applications on the Coprocessor . . . . . . . . . . . . 42
2.2 Explicit Offload Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.2.1 “Hello World” Example in the Explicit Offload Model . . . . . . . . . . . . . . . . 45
2.2.2 Offloading Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.2.3 Proxy Console I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.2.4 Offload Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.2.5 Environment Variable Forwarding and MIC_ENV_PREFIX . . . . . . . . . . . . . 49
2.2.6 Target-Specific Code with the Preprocessor Macro __MIC__ . . . . . . . . . . . . 50
2.2.7 Fall-Back to Execution on the Host upon Unsuccessful Offload . . . . . . . . . . . . 50
2.2.8 Using Pragmas to Transfer Bitwise-Copyable Data to the Coprocessor . . . . . . . . 51
2.2.9 Data Traffic and Persistence between Offloads . . . . . . . . . . . . . . . . . . . . . 53
2.2.10 Asynchronous Offload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.2.11 Review: Core Language Constructs of the Explicit Offload Model . . . . . . . . . . 57
2.3 MYO (Virtual-Shared) Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.4 Multiple Intel® Xeon Phi™ Coprocessors in a System and Clusters with Intel® Xeon Phi™ Coprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3 Expressing Parallelism 77
3.2.4 Fork-Join Model of Parallel Execution: Tasks in OpenMP and Spawning in Intel® Cilk™ Plus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.2.5 Shared and Private Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.2.6 Synchronization: Avoiding Unpredictable Program Behavior . . . . . . . . . . . . . 110
3.2.7 Reduction: Avoiding Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . 114
3.2.8 Additional Resources on Shared Memory Parallelism . . . . . . . . . . . . . . . . . 121
3.3 Task Parallelism in Distributed Memory, MPI . . . . . . . . . . . . . . . . . . . . . . . . . 122
3.3.1 Parallel Computing in Clusters with Multi-Core and Many-Core Nodes . . . . . . . 122
3.3.2 Program Structure in Intel® MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
3.3.3 Point-to-Point Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.3.4 MPI Communication Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.3.5 Collective Communication and Reduction . . . . . . . . . . . . . . . . . . . . . . . 135
3.3.6 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.1 Roadmap to Optimal Code on Intel® Xeon Phi™ Coprocessors . . . . . . . . . . . . . . 139
4.1.1 Optimization Checklist . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.1.2 Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
4.2 Scalar Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.2.1 Assisting the Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.2.2 Eliminating Redundant Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.2.3 Controlling Precision and Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.2.4 Library Functions for Standard Tasks . . . . . . . . . . . . . . . . . . . . . . . . . 152
4.3 Data Parallelism: Vectorization the Right and Wrong Way . . . . . . . . . . . . . . . . . . 153
4.4.1 Too Much Synchronization. Solution: Avoiding True Sharing with Private Variables
4.4.2 False Sharing. Solution: Data Padding and Private Variables . . . . . . . . . . . . . 171
4.4.3 Load Imbalance. Solution: Load Scheduling and Grain Size Specification . . . . . . 175
4.4.4 Insufficient Parallelism. Solution: Strip-Mining and Collapsing Nested Loops . . . . 179
4.4.5 Wandering Threads. Improving OpenMP Performance by Setting Thread Affinity . . 189
4.4.6 Diagnosing Parallelization Problems with Scalability Tests . . . . . . . . . . . . . . 196
4.5 Memory Access: Computational Intensity and Cache Management . . . . . . . . . . . . . . 197
4.5.1 Cache Organization on Intel Xeon Processors and Intel Xeon Phi Coprocessors . . . 199
4.5.2 Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
4.5.3 Loop Interchange (Permuting Nested Loops) . . . . . . . . . . . . . . . . . . . . . 200
4.5.4 Loop Tiling (Blocking) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
4.5.5 Cache-Oblivious Recursive Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 213
4.5.6 Cross-Procedural Loop Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
4.5.7 Advanced Topic: Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
4.6 PCIe Traffic Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.6.1 Memory Retention Between Offloads . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.6.2 Data Persistence Between Offloads . . . . . . . . . . . . . . . . . . . . . . . . . . 222
4.6.3 Memory Alignment and TLB Page Size Control . . . . . . . . . . . . . . . . . . . 222
4.6.4 Offload Benchmark Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
5 Summary and Resources 257
5.1 Programming Intel® Xeon Phi™ Coprocessors is Not Trivial, but Offers Double Rewards . . 257
5.2 Practical Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
5.3 Additional Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
A.2.1 Compiling and Running Native Intel® Xeon Phi™ Applications . . . . . . . . . . . 267
B.2.7 Asynchronous Execution on One and Multiple Coprocessors . . . . . . . . . . . . . 354
B.2.8 Using MPI for Multiple Coprocessors . . . . . . . . . . . . . . . . . . . . . . . . . 362
B.3 Source Code for Chapter 3: Expressing Parallelism . . . . . . . . . . . . . . . . . . . . . . 364
B.3.1 Automatic Vectorization: Compiler Pragmas and Vectorization Report . . . . . . . . 364
B.3.2 Parallelism with OpenMP: Shared and Private Variables, Reduction . . . . . . . . . 374
B.3.3 Complex Algorithms with Cilk Plus: Recursive Divide-and-Conquer . . . . . . . . . 380
B.3.4 Data Traffic with MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
B.4.2 Using Intel® Trace Analyzer and Collector . . . . . . . . . . . . . . . . . . . . . . 394
B.4.9 Shared-Memory Optimization: Loop Collapse and Strip-Mining for Improved Parallel Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
B.4.10 Shared-Memory Optimization: Core Affinity Control . . . . . . . . . . . . . . . . . 443
B.4.11 Cache Optimization: Loop Interchange and Tiling . . . . . . . . . . . . . . . . . . 448
B.4.12 Memory Access: Cache-Oblivious Algorithms . . . . . . . . . . . . . . . . . . . . 455
B.4.13 Memory Access: Loop Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
B.4.14 Offload Traffic Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
B.4.15 MPI: Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
Bibliography 487
Index 495
Foreword
We live in exciting times; the amount of computing power available for sciences and engineering is
reaching enormous heights through parallel computing. Parallel computing is driving discovery in many
endeavors, but remains a relatively new area of computing. As such, software developers are part of an industry
that is still growing and evolving as parallel computing becomes more commonplace.
The added challenges involved in parallel programming are being eased by four key trends in the industry: emergence of better tools, wide-spread usage of better programming models, availability of significantly more hardware parallelism, and more teaching material promising to yield better-educated programmers. We have seen recent innovations in tools and programming models including OpenMP and Intel Threading Building Blocks. Now, the Intel® Xeon Phi™ coprocessor certainly provides a huge leap in hardware parallelism with its general purpose hardware thread counts being as high as 244 (up to 61 cores, 4 threads each).

This leaves the challenge of creating better-educated programmers. This handbook from Colfax, with a subtitle of "Handbook on the Development and Optimization of Parallel Applications for Intel Xeon Processors and Intel Xeon Phi Coprocessors", is an example-based course for the optimization of parallel applications for platforms with Intel Xeon processors and Intel Xeon Phi coprocessors.

This handbook serves as practical training covering understandable computing problems for C and C++ programmers. The authors at Colfax have developed sample problems to illustrate key challenges and offer their own guidelines to assist in optimization work. They provide easy-to-follow instructions that allow the reader to understand solutions to the problems posed, as well as inviting the reader to experiment further. Colfax's examples and guidelines complement those found in our recent book on programming the Intel Xeon Phi coprocessor by Jim Jeffers and myself by adding another perspective to the teaching materials available.

In the quest to learn, it takes multiple teaching methods to reach everyone. I applaud these authors in their efforts to bring forth more examples to enable either self-directed or classroom-oriented hands-on learning of the joys of parallel programming.

James R. Reinders
Co-author of "Intel® Xeon Phi™ Coprocessor High Performance Programming", © 2013, Morgan Kaufmann Publishers
Intel Corporation
March 2013
Preface
Welcome to the Colfax Developer Training! You are holding in your hands or browsing on your computer
screen a comprehensive set of training materials for this training program. This document will guide you to the
mastery of parallel programming with Intel® Xeon® family products: Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. The curriculum includes a detailed presentation of the programming paradigm for the Intel Xeon product family, optimization guidelines, and hands-on exercises on systems equipped with Intel Xeon Phi coprocessors, as well as instructions on using Intel® software development tools and libraries included in Intel® Parallel Studio XE.

These training materials are targeted toward developers familiar with C/C++ programming in Linux. Developers with little parallel programming experience will be able to grasp the core concepts of this subject from the detailed commentary in Chapter 3. For advanced developers familiar with multi-core and/or GPU programming, the training offers materials specific to the Intel compilers and Intel Xeon family products, as well as optimization advice pertinent to the Many Integrated Core (MIC) architecture.

We have written these materials relying on key elements for efficient learning: practice and repetition. As a consequence, the reader will find a large number of code listings in the main section of these materials. In the extended Appendix, we provide numerous hands-on exercises that one can complete either under an instructor's supervision or on one's own.

This document is different from a typical book on computer science, because we intended it to be used as a lecture plan in an intensive learning course. Speaking in programming terms, a typical book traverses material with a "depth-first algorithm", describing every detail of each method or concept before moving on to the next method. In contrast, this document traverses the scope of material with a "breadth-first" algorithm. First, we give an overview of multiple methods to address a certain issue. In the subsequent chapter, we re-visit these methods, this time in greater detail. We may go into even more depth down the line. In this way,
we expect that students will have enough time to absorb and comprehend the variety of programming and
optimization methods presented here. The course road map is outlined in the following list.
• Chapter 1 presents the Intel Xeon Phi architecture overview and the environment provided by the MIC
Platform Software Stack (MPSS) and Intel Cluster Studio XE on Many Integrated Core architecture
(MIC). The purpose of Chapter 1 is to outline what users may expect from Intel Xeon Phi coprocessors
(technical specifications, software stack, application domain).
• Chapter 2 allows the reader to experience the simplicity of Intel Xeon Phi usage early on in the program. It familiarizes the reader with the operating system running on the coprocessor, with the compilation of native applications, and with the language extensions that allow CPU-centric codes to utilize Intel Xeon Phi coprocessors: the offload and virtual-shared memory programming models. In a nutshell, Chapter 2 demonstrates how to write serial code that executes on Intel Xeon Phi coprocessors.
• Chapter 3 introduces Single Instruction Multiple Data (SIMD) parallelism and automatic vectorization,
thread parallelism with OpenMP and Intel Cilk Plus, and distributed-memory parallelization with MPI.
In brief, Chapter 3 shows how to write parallel code (vectorization, OpenMP, Intel Cilk Plus, MPI).
• Chapter 4 re-iterates the material of Chapter 3, this time delving deeper into the topics of parallel programming and providing example-based optimization advice, including the usage of the Intel Math Kernel Library. This chapter is the core of the training. The topics discussed in Chapter 4 include:
i) scalar optimizations;
ii) improving data structures for streaming, unit-stride, local memory access;
iii) guiding automatic vectorization with language constructs and compiler hints;
iv) reducing synchronization in task-parallel algorithms by the use of reduction;
v) avoiding false sharing;
vi) increasing arithmetic intensity and reducing cache misses by loop blocking and recursion;
vii) exposing the full scope of available parallelism;
viii) controlling process and thread affinity in OpenMP and MPI;
ix) reducing communication through data persistence on coprocessor;
x) scheduling practices for load balancing across cores and MPI processes;
xi) optimized usage of Intel Math Kernel Library functions, and others.
If Chapter 3 demonstrated how to write parallel code for Intel Xeon Phi coprocessors, then Chapter 4 shows how to make this parallel code run fast.
Throughout the training, we emphasize the concept of portable parallel code. Portable parallelism can be achieved by designing codes in a way that exposes the data and task parallelism of the underlying algorithm, and by using language extensions such as OpenMP pragmas and Intel Cilk Plus. The resulting code can be run on processors as well as on coprocessors, and can be ported with only recompilation to future generations of multi- and many-core processors with SIMD capabilities. Even though the Colfax Developer Training program touches on low-level programming using intrinsic functions, it focuses on achieving high performance by writing highly parallel code and utilizing the Intel compiler's automatic vectorization functionality and parallel frameworks.

The handbook of the Colfax Developer Training is an essential component of a comprehensive, hands-on course. While the handbook has value outside a training environment as a reference guide, the full utility of the training is greatly enhanced by students' access to individual computing systems equipped with Intel Xeon processors, Intel Xeon Phi coprocessors and Intel software development tools. Please check the Web page of the Colfax Developer Training for additional information: http://www.colfax-intl.com/xeonphi/
Welcome to the exciting world of parallel programming!
List of Abbreviations
AVX Advanced Vector Extensions (SIMD standard)
BLAS Basic Linear Algebra Subprograms
CAO Compiler Assisted Offload
CLI Command Line Interface
CPU Central Processing Unit, used interchangeably with the terms "processor" and "host" to indicate the
FLOP Floating-Point Operation. Refers to any floating-point operation, not just addition or multiplication.
FMA Fused Multiply-Add instruction
FP Floating-point
GCC GNU Compiler Collection
GDDR Graphics Double Data Rate memory
GEMM General Matrix Multiply
GPGPU General Purpose Graphics Processing Unit
GUI Graphical User Interface
HPC High Performance Computing
I/O Input/Output
"target" to indicate the Intel Xeon Phi coprocessor, as opposed to the Intel Xeon processor.
MKL Math Kernel Library
MMX Multimedia Extensions (SIMD standard)
MPI Message Passing Interface
OS Operating System
PCIe Peripheral Component Interconnect Express
PMU Performance Monitoring Unit
RAM Random Access Memory
RNG Random Number Generator
RSA Rivest-Shamir-Adleman cryptography algorithm
RTL Runtime Library
ScaLAPACK Scalable Linear Algebra Package
SGEMM Single-precision General Matrix Multiply
SIMD Single Instruction Multiple Data
Chapter 1
Introduction
This chapter introduces the Intel Many Integrated Core (MIC) architecture and positions Intel Xeon Phi
coprocessors in the context of parallel programming.

1.1 Intel® Xeon Phi™ Coprocessors and the MIC Architecture

1.1.1 Overview

Intel Xeon Phi coprocessors have been designed by Intel Corporation as a supplement to the Intel Xeon processor family. These computing accelerators feature the MIC (Many Integrated Core) architecture, which enables fast and energy-efficient execution of High Performance Computing (HPC) applications utilizing massive thread parallelism, vector arithmetic and streamlined memory access. The term "Many Integrated Core" serves to distinguish the Intel Xeon Phi product family from the "Multi-Core" family of Intel Xeon processors.

Intel Xeon Phi coprocessors derive their high performance from multiple cores, dedicated vector arithmetic units with wide vector registers, and cached on-board GDDR5 memory. High energy efficiency is achieved through the use of low clock speed x86 cores with a lightweight design suitable for parallel HPC applications.

Figure 1.1 illustrates the chip layout of an Intel Xeon processor and an Intel Xeon Phi coprocessor based on the KNC (Knights Corner) architecture. The most apparent difference conveyed by this image is the number and density of cores on the chip. This fact reminds the user that massive parallelism in applications is necessary in order to fully employ Intel Xeon Phi coprocessors.
Figure 1.1: Intel’s multi-core and many-core engines (not to scale). Image credit: Intel Corporation
Figure 1.2: Top: An Intel Xeon Phi coprocessor based on the KNC chip, with a passive-cooling solution and a PCIe x16 connector. Bottom: A server computing system featuring eight Intel Xeon Phi coprocessors with a passive-cooling solution. Relative sizes are not to scale.
and others.

Figure 1.3 demonstrates the variety of choices for thread and data parallelism implementations in the design of applications for Intel Xeon and Intel Xeon Phi platforms. Depending on the specificity and computing needs of the application, the depth of control over the execution may be chosen from high-level library function calls to low-level threading functionality and Single Instruction Multiple Data (SIMD) operations. This choice is available in both Multi-Core and Many-Core applications.
[Figure content: a spectrum from "Ease of use" to "Fine control": array notation in Intel® Cilk™ Plus, Intel® Threading Building Blocks, auto-vectorization, Intel Cilk Plus, semi-auto vectorization (OpenMP* and #pragma vector, ivdep, simd), OpenCL*.]

Figure 1.3: Implementation of thread and data parallelism in applications for Intel Xeon processors and Intel Xeon Phi coprocessors designed with Intel software development tools. Diagram based on materials designed by Intel.
[Figure content: Multi-Core Hosted (general serial and parallel computing), Offload (codes with highly-parallel phases), Symmetric (codes with balanced needs), Many Core Hosted (highly-parallel codes).]

Figure 1.4: Intel® Architecture benefit: wide range of development options. Breadth, depth, and familiar models meet varied needs.
Intel Xeon Phi coprocessors are Internet Protocol (IP)-addressable devices running a Linux operating system, which enables straightforward porting of code written for the Intel Xeon architecture to the MIC architecture. This, combined with code portability, makes Intel Xeon Phi coprocessors a compelling platform for heterogeneous clustering. In heterogeneous cluster applications, host processors and MIC coprocessors can be used on an equal basis as individual compute nodes.
• The KNC die is manufactured using the 22 nm process technology with 3-D Trigate transistors.
• Improved technology makes it possible to fit over 50 cores on a single die.
• KNC supports 64-bit instructions and 512-bit SIMD vector registers.
• The x86 logic of KNC (excluding the L2 caches) constitutes less than 2% of the die.
• Each x86 core on the Knights Corner chip has its own Performance Monitoring Unit (PMU).
Figure 1.5: Knights Corner die organization. The cores and GDDR5 memory controllers are connected via an Interproces-
sor Network (IPN) ring, which can be thought of as an independent bi-directional ring. The L2 caches are shown here
as slices per core, but can be thought of as a fully coherent cache of the aggregated slices. Information can be copied to
each core that uses it to provide the fastest possible local access, or a single copy can be present for all cores to provide
maximum cache capacity. This diagram is a conceptual drawing and does not imply actual distances, latencies, etc. Image
and description credit: Intel Corporation.
• Each core has a dedicated vector unit supporting 512-bit wide registers with support for the Initial
Many-Core Instructions (IMCI) instruction set.
• Scalar instructions are processed in a separate unit.
• The KNC core is an in-order processor with 4-way hyper-threading.
• Every hyper-thread issues instructions every other cycle, and therefore 2 hyper-threads per core are necessary to utilize all available cycles. The additional two hyper-threads may improve performance in the same situations as hyper-threading does on Intel Xeon processors.
Figure 1.6: The topology of a single Knights Corner core. Image credit: Intel Corporation.
The hierarchical cache structure is a significant contributor to KNC performance. The details of cache organization and properties are discussed in Section 1.2.3.
Parameter                                      Value
Cache line size                                64 B
L1 size                                        32 KB data, 32 KB code
L1 set conflict                                4 KB (Dcache), 8 KB (Icache)
L1 ways                                        8 (Dcache), 4 (Icache)
L1 latency                                     1 cycle
L2 to L1 prefetch (vprefetch0) buffer depth    8
L2 size                                        512 KB
L2 set conflict                                64 KB
L2 ways                                        8
L2 latency                                     15-30 cycles depending on load
Memory → L2 prefetch buffer depth              32
TLB coverage options (L1, data)                64 pages of size 4 KB (256 KB coverage), 8 pages of size 2 MB (16 MB coverage), 32 pages of size 64 KB (2 MB coverage)
TLB coverage (L1, instruction)                 32 pages of size 4 KB (shared amongst all threads running on the core)
TLB coverage (L2)                              64 entries, stores 2M backup PTEs, and PDEs (reduc-
TLB miss penalty for 4 KB and 64 KB TLBs       ≈30 cycles (assuming a PDE hit in the L2 TLB)
Associativity

Eight-way associativity strikes a balance between the low overhead of direct-mapped caches and the versatility of fully-associative caches. An 8-way set associative cache maps each memory address to one set of the cache, and the data at that address may be placed in any of the 8 ways (i.e., cache lines) of that set.
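For illustration only (this sketch is not taken from the book), the set to which an address maps in the L1 data cache follows directly from the parameters listed above: 32 KB of capacity divided by 64-byte lines and 8 ways gives 64 sets, so addresses that are 4 KB apart land in the same set.

#include <stdint.h>

/* Illustrative sketch: compute the L1 data cache set index for an address,
   assuming the KNC L1 parameters quoted above (32 KB, 8-way, 64-byte lines). */
static inline unsigned l1d_set_index(uintptr_t addr) {
  const unsigned line_bytes  = 64;
  const unsigned ways        = 8;
  const unsigned cache_bytes = 32 * 1024;
  const unsigned sets = cache_bytes / line_bytes / ways;  /* 64 sets */
  return (unsigned)((addr / line_bytes) % sets);          /* addresses 4 KB apart collide */
}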
Replacement Policy

Under the Least Recently Used (LRU) policy, when data must be evicted from the cache in order to load new data, the least recently used entry of the corresponding set is evicted. LRU is implemented by dedicated hardware units in the cache.
Set Conflicts

To the developer, an important property of multi-way associative caches with LRU is the possibility of a set conflict. A set conflict may occur when the code processes data with a certain stride in virtual memory. For KNC, the stride is 4 KB in the L1 cache and 64 KB in the L2 cache. With this stride, data from memory is mapped into the same set, and when more lines compete for the set than there are ways, some data is evicted prematurely, causing performance loss.
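As a hypothetical illustration (not a listing from the book), traversing a matrix column by column when the row pitch is a multiple of 4 KB sends every consecutive access to the same L1 set, so only 8 lines can be cached at a time; padding each row breaks the pattern.

#include <stdio.h>
#include <stdlib.h>

int main() {
  const int n = 1024;  /* 1024 floats * 4 bytes = 4 KB row pitch */
  float *a = malloc((size_t)n * n * sizeof(float));
  for (int i = 0; i < n * n; i++) a[i] = 1.0f;
  float sum = 0.0f;
  /* Column-wise traversal: consecutive accesses are 4 KB apart and therefore
     map to the same L1 set. Padding the rows to (n + 16) floats would spread
     the accesses over different sets and avoid premature evictions. */
  for (int j = 0; j < n; j++)
    for (int i = 0; i < n; i++)
      sum += a[(size_t)i * n + j];
  printf("%f\n", sum);
  free(a);
  return 0;
}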
Coherency
A coherent cache guarantees that when data is modified in one cache, copies of this data in all other
caches will be correspondingly updated before they are made available to the cores accessing these other
caches. In KNC, L2 caches are not truly shared between all cores; each core has its private slice of the
aggregate cache (see Figure 1.5). Therefore, the coherency of the L2 cache comes at the cost of potential
performance loss when data is transferred across the ring interconnect. Generally, data locality is the best way
to optimize cache operation. See Section 4.5 for more information on improving data locality.
Translation Lookaside Buffer (TLB)

The Translation Lookaside Buffer, or TLB, is a cache residing on each core that speeds up the lookup of the physical memory address corresponding to a virtual memory address. Entries, or pages, in the TLB can vary in the amount of memory that they map. The physical size of the TLB places restrictions on the number of pages and on the total address range stored in the TLB. When a memory address accessed by the code is not found in the TLB, the TLB entries must be re-built in order to look up that address. This causes a page walk operation, which is fairly expensive compared to misses in the L1 and L2 caches. Optimal TLB page properties depend on the memory access pattern of the application. As with other cache functions, TLB performance can generally be improved by increasing the locality of data access in time and space. See Section 4.5 for information on TLB performance tuning.
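For example, an application with a large working set may reduce TLB misses by backing its buffers with 2 MB pages, so that each TLB entry covers more memory. The sketch below is a generic Linux illustration that assumes huge pages are configured on the system; it is not the book's prescribed method.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main() {
  const size_t bytes = 64UL * 1024 * 1024;  /* 64 MB working set */
  /* Anonymous memory backed by 2 MB huge pages: 32 TLB entries cover the
     buffer, versus 16384 entries with 4 KB pages. */
  void *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (buf == MAP_FAILED) {
    perror("mmap with MAP_HUGETLB failed (are huge pages configured?)");
    return 1;
  }
  /* ... use buf ... */
  munmap(buf, bytes);
  return 0;
}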
Prefetching

Another important property of caches is prefetching. During program execution, it is possible to request that data be fetched into the cache before the core uses this data. This diminishes the impact of memory latency on performance. Two types of prefetching are available in KNC: software prefetching, when the prefetch instruction is issued by the code in advance of the data usage, and hardware prefetching, when a dedicated hardware unit in the cache learns the data access pattern and issues prefetch instructions automatically. The L2 cache in KNC has a hardware prefetcher, while the L1 cache does not. Normally, Intel compilers automatically introduce L1 prefetch instructions into the compiled code. However, in some cases it may be desirable to manually tune the prefetch distances or to disable software prefetching when it introduces undesirable TLB misses. See Section 4.5.7 for more information on this topic.
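A minimal sketch of manual prefetch tuning is shown below. It assumes the Intel compiler's prefetch pragma; the distances are illustrative placeholders rather than tuned values, and Section 4.5.7 treats the topic properly.

/* Request software prefetches for array b: hint 0 targets the L1 cache and
   hint 1 targets the L2 cache; the last number is the prefetch distance in
   loop iterations (illustrative values). */
void scale(float *a, const float *b, int n) {
#pragma prefetch b:0:16
#pragma prefetch b:1:64
  for (int i = 0; i < n; i++)
    a[i] = 2.0f * b[i];
}

The Intel compiler also exposes command-line controls for the same purpose (for example, -opt-prefetch-distance); whether to use pragmas or compiler options is a matter of how localized the tuning needs to be.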
Additional Reading
A comprehensive source on microprocessor caches is the book "Computer Architecture: A Quantitative Approach" by Hennessy and Patterson [1].
Characteristic            Value
Process                   22 nm
Peak SP FLOPS, 300 W      2340 GFLOPS (P1750, +/- 10%)
Peak DP FLOPS, 300 W      1170 GFLOPS (P1750, +/- 10%)
Peak SP FLOPS, 225 W      2020 GFLOPS (P1640, +/- 10%)
Peak DP FLOPS, 225 W      1010 GFLOPS (P1640, +/- 10%)
Reliability               Memory ECC
Max Memory Size           3/6/8 GB
Peak Memory BW            Up to 350 GB/sec
L2 Cache Per Core         512 KB
L2 Cache (all cores)      Up to 31 MB
PCIe Gen 2 I/O            Up to: KNC → Host: 6.5 GB/s; Host → KNC: 6.0 GB/s
Form Factor               Double-wide PCIe w/fan sink or passive; dense form factor
Transcendentals           Exp, Log, Sqrt, Rsqrt, Recip, Div
Power Management          C1, C3, C6
Micro OS                  Linux

Table 1.2: Technical specifications of Intel Xeon Phi coprocessors. Table credit: Intel Corporation.
Most properties listed in Table 1.2 have been discussed above. The form factor property demonstrates the options for the physical packaging of the product. Active (with fan) and passive (fanless) cooling solutions require more space inside the system than the dense form factor solution; however, the latter requires a custom, usually liquid-based, cooling solution. The dense form-factor option is shown in Figure 1.7 as a visual illustration.

Figure 1.7: KNC without a heat sink (dense form factor). A traditional active-cooling solution is shown in Figure 1.2.
• it runs a Secure Shell (SSH) server, which allows the user to log into the coprocessor and obtain a shell;
• and it is capable of running other services such as the Network File System (NFS).
On the operating system level, the above-mentioned functionality is provided by the MIC Platform Software Stack (MPSS), a suite of tools including drivers, daemons, command-line and graphical tools. The role of MPSS is to boot the coprocessor, load the Linux operating system, populate the virtual file system, and to enable the host system user to interact with the Intel Xeon Phi coprocessor in the same way as the user would interact with an independent compute node on the network.

[Figure 1.8 content: a Linux* host and an Intel® Xeon Phi™ coprocessor connected by a virtual terminal session over SSH.]
Figure 1.8 illustrates the role of MPSS in the operation of an Intel Xeon Phi coprocessor. User-level code
for the coprocessor runs in an environment that resembles a compute node. The network traffic is carried over
the PCIe bus instead of network interconnects.
User applications can be built in two ways. For high performance workloads, Intel compilers can be used to compile C, C++ and Fortran code for the MIC architecture. Compilers are not a part of the MPSS; they are distributed in additional software suites such as Intel Parallel Studio XE or Intel Cluster Studio XE (see Section 1.1.3 and Section 1.2.7). For the Linux operating system running on Intel Xeon Phi coprocessors, a specialized version of the GNU Compiler Collection (GCC) is available. This specialized GCC is not capable of the automatic vectorization and other optimizations that the Intel C Compiler provides.
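As a minimal illustration of the first route (the -mmic flag is covered in Section 2.1.1; the file name is arbitrary), a native coprocessor application is an ordinary C program cross-compiled on the host:

/* hello.c: a trivial native application for the MIC architecture.
   Compile on the host with the Intel compiler:  icc -mmic hello.c -o hello.mic
   then copy hello.mic to the coprocessor (e.g., over SSH) and run it there. */
#include <stdio.h>

int main() {
  printf("Hello from the MIC architecture\n");
  return 0;
}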
1.2.6 Intel® Xeon® Processors versus Intel® Xeon Phi™ Coprocessors: Developer Experience
The following is an excerpt from an article “Programming for the Intel Xeon family of products
(Intel Xeon processors and Intel Xeon Phi coprocessors)” by James Reinders, Intel’s Chief Evangelist and
Spokesperson for Software Tools and Parallel Programming [2].
Forgiving Nature: Easier to port, flexible enough for initial base performance
Because an Intel Xeon Phi coprocessor is an x86 SMP-on-a-chip, it is true that a port to an Intel Xeon Phi
coprocessor is often trivial. However, the high degree of parallelism of Intel Xeon Phi coprocessors requires
applications that are structured to use the parallelism. Almost all applications will benefit from some tuning beyond the initial base performance to achieve maximum performance. This can range from minor work to major restructuring to expose and exploit parallelism through multiple tasks and use of vectors. The experiences of users of Intel Xeon Phi coprocessors and the "forgiving nature" of this approach are generally promising but point out one challenge: the temptation to stop tuning before the best performance is reached. This can be a good thing if the return on investment of further tuning is insufficient and the results are good enough. It can be a bad thing if expectations were that working code would always be high performance. There is no free lunch! The hidden bonus is the "transforming-and-tuning" double advantage of programming investments for Intel Xeon Phi coprocessors that generally applies directly to any general-purpose processor as well. This greatly enhances the preservation of any investment to tune working code by applying to other processors as well.
There are a number of possible user-level optimizations that have been found effective for ultimate performance. These advanced techniques are not essential. They are possible ways to extract additional performance for your application. The "forgiving nature" of Intel Xeon Phi coprocessors makes transformations optional but they should be kept in mind when looking for the highest performance. It is unlikely that peak performance will be achieved without considering some of these techniques:
• Memory Access and Loop Transformations (e.g., cache blocking, loop unrolling, prefetching, tiling,
A detailed description of Intel Xeon Phi coprocessor programming models can be found in Chapter 2. A thorough exploration of the optimization techniques mentioned in this quote is undertaken in Chapter 4.
Figure 1.9: Intel software development tool suites for shared-memory and distributed-memory application design. Intel Xeon processors and Intel Xeon Phi coprocessors are supported by both suites.
parallelism. Moreover, coprocessor applications require a high level of parallelism: they should have good scalability up to at least 100 threads, considering the 4-way hyper-threading capability of KNC cores. Figure 1.10 illustrates the necessity of parallelism for Intel Xeon Phi utilization.
Examples
1. Compilation of a programming language is an example of a task more suited for Intel Xeon processors
than for Intel Xeon Phi coprocessors, because compilation involves inherently sequential algorithms.
2. Monte Carlo simulations are well-suited for Intel Xeon Phi coprocessors because of their inherent
massive parallelism. See, however, a comment in Section 1.3.2.
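To make the second example concrete, the sketch below is an illustration rather than a listing from this book (OpenMP is introduced in Chapter 3); it shows the kind of thread-parallel Monte Carlo kernel, here estimating pi, that can occupy the coprocessor's hundreds of hardware threads.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  const long trials = 100000000L;
  long hits = 0;
  /* Each thread processes a share of the trials with its own RNG stream;
     the reduction combines the per-thread hit counts without any
     synchronization inside the loop. */
#pragma omp parallel reduction(+ : hits)
  {
    unsigned int seed = 1234u + (unsigned int)omp_get_thread_num();
#pragma omp for
    for (long i = 0; i < trials; i++) {
      const double x = (double)rand_r(&seed) / RAND_MAX;
      const double y = (double)rand_r(&seed) / RAND_MAX;
      if (x * x + y * y < 1.0) hits++;
    }
  }
  printf("pi is approximately %f\n", 4.0 * (double)hits / (double)trials);
  return 0;
}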
performance on an Intel Xeon Phi coprocessor than on an Intel Xeon processor. See Section 3.1 for more information about vectorization.

Figure 1.11 is a visual pointer to the need for SIMD parallelism (vectorization) in KNC workloads.
Figure 1.11: Scalar/vector and single/multi-threaded performance. Diagram credit: James Reinders, Intel Corporation
Examples
1. Monte Carlo algorithms may be well-suited for execution on Intel Xeon Phi coprocessors if they either use vector instructions for calculations inside each Monte Carlo iteration, or perform multiple simultaneous Monte Carlo iterations using multiple SIMD lanes.
2. For common linear algebraic operations, there exist SIMD-friendly algorithms and implementations.
These calculations are well-suited for Intel Xeon Phi coprocessors.
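For instance, the unit-stride loop below (an illustrative sketch, not a listing from the book) is the kind of SIMD-friendly code that the compiler can auto-vectorize for the 512-bit vector unit, processing 16 single-precision values per instruction; vectorization is discussed in Section 3.1.

/* A SAXPY-style kernel: unit stride, independent iterations and no branches,
   which lets the auto-vectorizer map it onto 512-bit vector registers. */
void saxpy(const float a, const float *x, float *y, const int n) {
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}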
b) the memory access pattern is streamlined enough so that the application is limited by memory bandwidth and not memory latency (the bandwidth-bound case).

See also Section 4.5 for a discussion of memory and cache traffic tuning. Multi-threading is as important for bandwidth-bound applications as it is for compute-bound workloads, because all available memory controllers must be utilized. Figure 1.12 is a visual reminder of this fact.
Figure 1.12: Parallelism and bandwidth-bound application performance. Diagram credit: James Reinders, Intel.
Examples
1. Matrix transposition involves a memory access pattern with a stride equal to one of the matrix dimensions.
Therefore, this operation may be unable to fully utilize the coprocessor memory bandwidth;
2. Some stencil operations on dense data have a streamlined memory access pattern. These algorithms are
a good match for the Intel Xeon Phi architecture.
a) for compute-bound calculations, if the coprocessor performs many more than

    Na = (1000 GFLOP/s) / (6 GB/s / sizeof(double)) ≈ 1300        (1.1)

lightweight floating-point operations (additions and multiplications), then the data transport overhead is insignificant;

b) for compute-bound calculations with division and transcendental operations, the arithmetic intensity threshold at which the communication overhead is justified is lower than 1300 operations per transferred floating-point number;

c) for bandwidth-bound calculations, if the data in the coprocessor memory is read many more than

    Ns = (200 GB/s) / (6 GB/s) ≈ 30 times        (1.2)

in streaming fashion, then the data transport across the PCIe bus is likely not to be the bottleneck.
Arithmetic Complexity
In this context, it is informative to establish a link between the complexity of an algorithm and its ability to tolerate the data transport overhead:
• if the data size is n, and the arithmetic complexity (i.e., the number of arithmetic operations) of the
algorithm scales as O(n), such an algorithm may experience a bottleneck in the data transport. This is
because the coprocessor performs a fixed number of arithmetic operations on every number sent across
the PCIe bus. If this number is too small, the data transport overhead is not justifiable.
• for algorithms in which the arithmetic complexity scales faster than O(n) (e.g., O(n log n) or O(n2 )),
larger problems are likely to be less limited by PCIe traffic than smaller problems, as the arithmetic
intensity in this case increases with n. The stronger the arithmetic complexity scaling, the less important
is the communication overhead.
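As a back-of-the-envelope check (an illustrative helper, not from the book; the constants follow the rates used in Equations 1.1 and 1.2), one can compare an algorithm's arithmetic intensity with the ratio of the compute rate to the PCIe transfer rate:

#include <stdbool.h>

/* Rule-of-thumb check: is the work per byte moved over PCIe large enough
   that the transfer overhead should be insignificant? */
bool offload_worthwhile(double flop_count, double bytes_transferred) {
  const double peak_flops = 1000e9;  /* ~1000 GFLOP/s in double precision */
  const double pcie_bw    = 6e9;     /* ~6 GB/s sustained across PCIe     */
  return (flop_count / bytes_transferred) > (peak_flops / pcie_bw);
}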
Masking
Masking the data transfer time can potentially increase overall performance by up to a factor of two. In
order to mask data transfer, the asynchronous transfer and asynchronous execution capabilities of Intel Xeon
Phi coprocessors can be used. More details are provided in Section 2.2.9 and Section 2.3.1.
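A minimal sketch of this idea is shown below, assuming the Intel compiler's offload extensions (the signal and wait clauses are presented properly in Section 2.2.10; treat this as an illustration rather than a complete recipe).

#include <stdio.h>
#include <stdlib.h>

int main() {
  const int n = 1000000;
  float *a = malloc(n * sizeof(float));
  float *b = malloc(n * sizeof(float));
  for (int i = 0; i < n; i++) a[i] = (float)i;
  char sig;  /* tag identifying the asynchronous offload */

  /* Launch the offload asynchronously: the data transfer and the coprocessor
     computation proceed while the host continues past this statement. */
#pragma offload target(mic:0) signal(&sig) in(a : length(n)) out(b : length(n))
  for (int i = 0; i < n; i++)
    b[i] = 2.0f * a[i];

  /* ... independent host work here can mask the transfer and computation ... */

  /* Block until the offload tagged by 'sig' has completed. */
#pragma offload_wait target(mic:0) wait(&sig)

  printf("%f\n", b[n - 1]);
  free(a);
  free(b);
  return 0;
}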
Examples
1. Computing a vector dot-product, when one or both vectors need to be transferred to the coprocessor, is an
inefficient use of Intel Xeon Phi coprocessors, because the complexity of the algorithm is O(n), and only
2 arithmetic operations per transferred floating-point number are performed. The PCIe communication
overhead is expected to be too high, and such a calculation can be done more efficiently on the host;
2. Computing a matrix-vector product with a square matrix, when only the vector must be transferred to
the coprocessor, is expected to have a small PCIe communication overhead if the vector size is large
enough (so that the communication latency is unimportant). The algorithm complexity is O(n2 ), and
therefore each transferred floating-point number will be used n times.
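In terms of Equation 1.1, the dot product performs only about 2 operations per transferred number, far below Na ≈ 1300, while the matrix-vector product performs about n operations per transferred number and clears the threshold once n reaches the order of a few thousand.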
1.4 Installing the Intel® Xeon Phi™ Coprocessor and MPSS
This section overviews the process of installing an Intel Xeon Phi coprocessor and related software.
user@host% lspci | grep -i "co-processor"
82:00.0 Co-processor: Intel Corporation Device 2250
Listing 1.1: Using lspci to check whether an Intel Xeon Phi coprocessor is installed in the computing system.
Drivers and administrative tools required for Intel Xeon Phi operation are included in the MPSS (Intel MIC Platform Software Stack) package. The role of MPSS is to boot the Intel Xeon Phi coprocessor, populate its virtual file system and start the operating system on the coprocessor, to provide connectivity protocols and
As is the case with hardware installation, computing systems provisioned by Colfax International will
have the drivers already installed. For MPSS installation on other systems, instructions can be found in the
el
corresponding “readme” file included with the software stack. MPSS can be freely downloaded from the Intel
iv
After MPSS has been installed, initial configuration steps are be required, as shown in Listing 1.2
cl
Ex
That command will create configuration files /etc/sysconfig/mic/* and the system-wide configuration file /etc/modprobe.d/mic.conf. In addition, the hosts file /etc/hosts will be modified by
MPSS. The IP addresses and hostnames of Intel Xeon Phi coprocessors will be placed into that file.
The Intel Xeon Phi coprocessor driver is a part of MPSS, and it is available as the system service mpss.
It can be enabled at boot by running the following two commands:
In order to stop, start or restart the driver, the following command can be used:
In order to verify the installation and configuration of MPSS, the command miccheck can be used. For
more information on MPSS, refer to Section 1.5.
code compilation for Intel Xeon Phi coprocessors (see also the comment about GCC for the MIC architecture in Section 1.2.5). While a stand-alone Intel compiler of C, C++ or Fortran is sufficient for the development of applications for Intel Xeon Phi coprocessors, we recommend a suite of Intel software development tools called Intel Parallel Studio XE 2013. This suite includes valuable additional libraries and performance tuning tools that can accelerate the process of software development. For cluster environments, a better option is Intel Cluster Studio XE, which includes the Intel MPI library and Intel Trace Analyzer and Collector in addition to all of the components of Intel Parallel Studio XE. In order to choose the correct software for your needs, see
or from one of the authorized resellers. Colfax International is an authorized reseller offering discounts for bundling software licenses with hardware purchases, and academic discounts for eligible customers. For
[email protected].
Installation instructions are included with the downloadable software suites. After installation, it is important to set up the environment variables for Intel Parallel Studio XE as shown in Listing 1.5.

The setup of environment variables using the compilervars script can be automated. The automation process depends on the operating system. For example, on RedHat Linux, in order to automate loading the script for an individual user, place the command shown in Listing 1.5 into the file ~/.bashrc.
In order to verify that the compilers have been successfully installed, run the following commands:
user@host% icc -v
icc version 13.1.0 (gcc version 4.4.6 compatibility)
user@host% icpc -v
icpc version 13.1.0 (gcc version 4.4.6 compatibility)
user@host% ifort -v
ifort version 13.1.0
A) Reboot the system, enter the boot menu and choose to boot the old version of the Linux kernel. Alter-
natively, the choice of the old kernel may be made permanent by modifying the Grub configuration file
/boot/grub/grub.conf. This method is a workaround, and should only be used temporarily, until
method B can be applied.
B) Rebuild the MIC kernel module.
In order to rebuild the MIC kernel module, superuser access is required. The steps illustrated in Listing 1.7 describe the rebuild process of the kernel module.
KNC_gold-2.1.4346-16-rhel-6.3.gz
root@host% cd KNC_gold-2.1.4346-16-rhel-6.3/src
root@host% # Rebuild the MPSS kernel module:
# output skipped
root@host% cd ..
root@host% mkdir old_modules # store old modules
Listing 1.7: Restoring MPSS functionality after a Linux kernel update by rebuilding the MIC kernel module.
1.5 MPSS Tools and Linux Environment on Intel® Xeon Phi™ Coprocessors
This section lists the essential tools and configuration options for managing the Intel Xeon Phi coprocessor
operating environment.
micctrl a comprehensive configuration tool for the Intel Xeon Phi coprocessor operating system,
miccheck a set of diagnostic tests for the verification of the Intel Xeon Phi coprocessor configuration,
micrasd a host daemon logger of hardware errors reported by Intel Xeon Phi coprocessors,
micflash an Intel Xeon Phi flash memory agent.
Most of the administrative tools and utilities can be found in the /opt/intel/mic/bin directory. Some of these utilities require superuser privileges. In order to facilitate the path lookup for these tools, modify the
The information in this section is a brief overview of the above-mentioned tools. As usual, the usage and arguments of these tools can be obtained by running any of the tools with the argument --help or by
user@host% /opt/intel/mic/bin/micinfo
System Info
Host OS              : Linux
OS Version           : 2.6.32-71.el6.x86_64
Driver Version       : 3126-12
MPSS Version         : 2.1.3126-12
Host Physical Memory : 65922 MB
CPU Family           : GenuineIntel Family 6 Model 45 Stepping 5
CPU Speed            : 1200
...
Using the -listdevices option provides a list of the Intel Xeon Phi coprocessors present in the system.
VERSION: 3653-8
Listing 1.11: Listing available Intel Xeon Phi coprocessors with the micinfo utility.
To request detailed information about a specific device, the option -deviceinfo <number> should be used. Additionally, the information displayed by this command can be narrowed down by including the option -group <group name>, where valid group names are:
• Version
• Board
• Core
• Thermal
• GDDR
For instance, the following shell command returns the information about the total number of cores on the
first Intel Xeon Phi coprocessor, current voltage and frequency:
VERSION: 3653-8
MicInfo Utility Log
Created Fri Aug 31 13:36:03 2012
Device No: 0, Device Name: K1OM
Core
Listing 1.12: Printing out detailed information about cores from the first Intel Xeon Phi coprocessor.
-c or --cores returns the average and per-core utilization levels for each available board in the system.
-f or --freq returns the clock frequency and power levels for each available board in the system.
-i or --info returns general system info.
-m or --mem returns memory utilization data.
-t or --temp returns temperature levels for each available board in the system.
--pwrenable [cpufreq | corec6 | pc3 | pc6 | all] enables the specified power management features and disables the unspecified ones,
--pwrstatus returns the status of power management features for each coprocessor,
--turbo [status | enable | disable] returns or modifies the Turbo Mode status on all coprocessors,
--ecc [status | enable | disable] returns or modifies the ECC status on all coprocessors,
-a or --all results in the processing of all valid options, excluding --help, --turbo, and --ecc.
Card 1 (info):
Device Name: ............. KNC
Device ID: ............... 2250
Number of Cores: ......... 60
OS Version: .............. 2.6.34-ga914e40
Flash Version: ........... 2.1.02.0314
Driver Version: .......... DRIVERS_3126-12
Stepping: ................ 0
SubStepping: ............. 2500
Card 1 (temp):
Cpu Temp: ................ 59.00 C
Memory Temp: ............. 42.00 C
Fan-In Temp: ............. 31.00 C
Fan-Out Temp: ............ 42.00 C
Core Rail Temp: .......... 43.00 C
Uncore Rail Temp: ........ 44.00 C
Memory Rail Temp: ........ 44.00 C
Card 1 (freq):
Core Frequency: .......... 1.00 GHz
Card 1 (mem):
Card 1 (cores):
Listing 1.13: micsmc command-line (CLI) output of the Intel Xeon Phi coprocessor core utilization, temperature, memory usage, and power usage statistics.
Running micsmc without command-line arguments will open the application's GUI and display system physical characteristics with graphical primitives, as demonstrated in Figure 1.13 and Figure 2.1.

Figure 1.13: The GUI mode of the micsmc tool illustrating the execution of a workload on a system with two Intel Xeon Phi coprocessors.
--resetconfig used after changes are made to configuration files. It recreates all the default files based on the new configuration. The MPSS service must not be running.
--resetdefaults used to reset configuration files back to defaults if hand editing of files has created unknown situations. The MPSS service must not be running.
--cleanconfig [MIC list] removes: the filelist file and directories associated with the MicDir configuration parameter; the image file specified by the FileSystem parameter; and the /etc/sysconfig/mic/default.conf file.
modify the Intel Xeon Phi coprocessor filesystem to configure, add and delete Linux users and groups.
user@mic% micctrl -s
Listing 1.14: Using the micctrl utility to query the power status of coprocessors.
user@host% miccheck
MIC 0 Test 4 Ensure MAC address is unique : OK
MIC 0 Test 5 Check the POST code via PCI : OK
MIC 0 Test 6 Ping the MIC : OK
MIC 0 Test 7 Connect to the MIC : OK
MIC 0 Test 8 Check for normal mode : OK
MIC 0 Test 9 Check the POST code via SCIF : OK
MIC 0 Test 10 Send data to the MIC : OK
Listing 1.15: miccheck tests the Intel Xeon Phi coprocessor’s configuration and functionality.
micrasd
micrasd is an application running on the host system to handle and log the hardware errors reported by
Intel Xeon Phi coprocessors. It can be run as a service daemon. This tool requires administrative privileges.
The following command starts micrasd:
root@host% /opt/intel/mic/bin/micrasd
Listing 1.16: Starting micrasd, the Intel Xeon Phi coprocessor hardware error handler and logger.
In order to run the utility in the daemon mode, the command line argument -daemon should be used. In
the daemon mode, micrasd will run in the background and handle/log errors silently.
The usage of micrasd:
root@host% micrasd [-daemon] [-loglevel LEVELS] [-help]
Listing 1.17: micrasd usage.
-daemon to run it in daemon mode
-loglevel to set the logging level, from 1 to 7
The errors will be logged into the Linux system log /var/log/messages with the tag "micras".
micflash
The primary purpose of this tool is to update the firmware in Intel Xeon Phi coprocessor’s flash memory.
In addition, micflash can save and retrieve the current flash image version.
Prior to using the micflash utility, the Intel Xeon Phi coprocessor must be put in either the offline, or
the ready state. The utility will fail if the device is in the normal mode. The Intel Xeon Phi coprocessor can be
placed in the ready state with the following command (root privileges are required):
root@host% micctrl -r
root@host% micctrl -w
mic0: ready
Listing 1.18: micctrl provides information about the Intel Xeon Phi coprocessor’s current status.
• micflash -info -device 0 provides information about which sections of the flash are update-able on the hardware,
3 or Fboot0 version for fboot0 flash section version (valid only for KNF and earlier MIC versions)
Note: Flash version information can be retrieved from the micsmc -i output as well, and does not require
• additionally, we can use micflash -info <flash_image> to find the version information from
If a firmware update is performed, the host system must be rebooted prior to using Intel Xeon Phi
coprocessors. If any other flash operation besides update is performed, start the mpss service to ensure the
MPSS is fully functional.
If several Intel Xeon Phi coprocessors are present in the system, micflash will operate on the first device only (by default). To perform update/save/info operations on the other coprocessors, the device number must be specified with the -device <number> option (numbering starts from 0) or -device all (to run on all coprocessors).
WARNING: Multiple instances of micflash should never be allowed to access the same Intel Xeon Phi
coprocessor simultaneously!
1.5.2 Network Configuration of the uOS on Intel Xeon Phi Coprocessors
Communication with the uOS (embedded Linux* operating system on the Intel Xeon Phi coprocessor) is provided by virtual network interfaces mic0, mic1, etc. These interfaces are created by the MPSS. When data is sent to or from these virtual interfaces, it is physically transferred over the bus from the host system to the coprocessor or vice versa. Standard networking protocols, such as ping, TCP/IP and SSH, are supported.
Network properties of Intel Xeon Phi devices can be configured using the configuration file located at /etc/sysconfig/mic/default.conf. Listing 1.19 illustrates the format of that file:
# Static pair configurations will fill in the second 2 quads by default. The individual
# MIC configuration files can override the defaults with MicIPaddress and HostIPaddress.
Subnet 172.31
...
# Include all additional functionality configuration files by default
Include "conf.d/*.conf"
...
Listing 1.19: Network configuration in the default configuration /etc/sysconfig/mic/default.conf.
In order to apply any modifications made to default.conf, the following command must be executed:
Listing 1.20: Resetting the Intel Xeon Phi coprocessor configuration file with micctrl utility.
Listing 1.21: Resetting the configuration file of the specified Intel Xeon Phi coprocessor.
The rest of Section 1.5.2 discusses the networking parameters that can be controlled in these configuration
files. The configuration files can be modified by direct editing as well as using the tool micctrl. Run
micctrl -h for information on the latter method.
On a computing cluster, this means that Intel Xeon Phi coprocessors on one host can communicate over TCP/IP directly with coprocessors on another host.
Parameter BridgeName in /etc/sysconfig/mic/micN.conf defines the name of the static bridge to link to. The name specified is overloaded to provide three different sets of topologies:

1. If this parameter is commented out, it is assumed the network topology is a static pair.

2. If the bridge name starts with the string “mic”, then the static bridge is created with this name, binding the coprocessors’ virtual network interfaces together on the host.

3. Finally, if the bridge name does not start with the string “mic”, then the coprocessors will be attached to what is assumed to be an existing static bridge to one of the host’s networking interfaces.
Subnet

The parameter Subnet defines the leading two or three elements of the coprocessor’s IP address. The default value of Subnet is 172.31. This places the coprocessor in the private network range (see [4] for more information).

In the static pair network topology, the IP addresses of the host and coprocessors are constructed by appending to the two-element Subnet a third element specified by the Intel Xeon Phi coprocessor ID plus one, and a .1 for the coprocessor and a .254 for the host (see Table 1.3).
Table 1.3: Default IP address assignment for the static pair network topology.
If a static bridge is defined, the IP addresses are constructed by appending to the subnet a .1 and then the ID assigned to the Intel Xeon Phi coprocessor plus one; the host bridge is assigned 172.31.1.254. See Table 1.4 for this case.
Table 1.4: Default IP address assignment for the static bridge network topology.
It is also possible to use more than one static bridge. The BridgeName parameter needs to be specified in each individual Intel Xeon Phi coprocessor configuration file to assign the correct bridge ID. The files also need to assign the Subnet parameter a three-element value in each configuration file. For example, one set of files may assign the BridgeName parameter the string micbr0 and Subnet the value 172.31.1. Another set of files may assign the BridgeName parameter the string micbr1 and Subnet the value 172.31.2.
MicIPaddress and HostIPaddress

By default, host and coprocessor IP addresses are automatically generated from the Subnet parameter. In circumstances in which these automatically generated addresses are inadequate, MicIPaddress and HostIPaddress can be specified in each micN.conf. These values will override the automatically generated IP addresses.
Setting the Coprocessor’s Host Name

In addition to assigning IP addresses to the coprocessors, MPSS automatically defines host names for them. By default, mpssd sets the host names of coprocessors to a modified version of the host’s name. For example, if the host has the hostname compute1.mycluster.com, then the first Intel Xeon Phi coprocessor will get assigned the hostname compute1-mic0.mycluster.com. The Hostname parameter in micN.conf allows the user to override this assignment for each individual Intel Xeon Phi coprocessor.

The Hostname parameter will be added to the /etc/hosts file on the host. It will also be used to set the host name on the coprocessor itself.
MTUsize
The MTUsize parameter allows setting of the network Maximum Transmission Unit (MTU) size. The default is the maximum jumbo packet size of 64 kilobytes. This parameter should be set to the default network packet size used in the subnet that the coprocessor belongs to. With clusters this is often 9 kilobytes.
DHCP
IP address assignment through DHCP is not natively supported in the current MPSS release (Gold Update 2). However, it can still be enabled through external setup by setting IPADDR=dhcp in the files
/opt/intel/mic/filesystem/micN/etc/sysconfig/network/ifcfg-micN
prior to starting the MPSS.
1.5.3 SSH Access to Intel Xeon Phi Coprocessors
The Linux OS on the Intel Xeon Phi coprocessor supports SSH access for all users, including root, using public key authentication. The configuration phase of the MPSS stack creates users for each coprocessor using the file /etc/passwd on the host. For each user, the public SSH key files found in the user’s /home/user/.ssh directory are copied to the Intel Xeon Phi coprocessor filesystem.
Listing 1.22 demonstrates the generation of an SSH key pair and its inclusion into the coprocessor
filesystem.
user@host% ssh-keygen
# ... output omitted ...
user@host% sudo service mpss stop
user@host% sudo micctrl --resetconfig
user@host% sudo service mpss start
If MPSS and encryption keys are configured correctly, all Linux users should be able to log in from the host to Intel Xeon Phi coprocessors using the Linux SSH client ssh. No password is necessary if the SSH key is copied over to the coprocessor. Listing 1.23 demonstrates interaction with an Intel Xeon Phi coprocessor via a terminal shell.
user@mic0% ls /
user@mic0% uname -a
Linux mic0 2.6.34.11-g4af9302 #2 SMP Thu Aug 16 18:52:36 PDT 2012 k1om GNU/Linux

Listing 1.23: Logging in to an Intel Xeon Phi coprocessor using ssh access, querying the number of cores, and listing the files in the root directory.
On the client, an NFS share can be mounted either manually using the command mount, or automatically by modifying the file /etc/fstab.
Example
As an example, let us illustrate how to share the Intel MPI library located on the host at the path /opt/intel/impi with all Intel Xeon Phi coprocessors. The mount point on coprocessors will be the same as on the host, i.e., /opt/intel/impi. The instructions below assume that the reader is familiar with the text editor vi.

First of all, let us ensure that all the necessary services are started (you will need to install the package nfs-utils if some of these services are missing):
root@host% service rpcbind status
nfsd (pid 3927 3926 3925 3924 3923 3922 3921 3920) is running...
rpc.rquotad (pid 3860) is running...
root@host%
In order to enable sharing of /opt/intel/impi, the line shown in Listing 1.25 must be present in the host file /etc/exports.
The file /etc/hosts.allow on the host must have the line shown in Listing 1.26:
ALL: mic0,mic1
After the /etc/exports and /etc/hosts.allow files have been updated, the command sudo
exportfs -a should be executed in order to pass the modifications to the NFS server.
On the coprocessor side, the file /etc/fstab must be modified to allow mounting the exported NFS
file system. It must contain the line shown in Listing 1.27.
Finally, on the coprocessor, the mount point /opt/intel/impi must be created, and then the
command mount -a will mount the directory.
Listing 1.28: Enabling NFS mount on a coprocessor.
If the above steps succeed, the host contents of directory /opt/intel/impi will be available on all coprocessors. The next time MPSS is restarted or the host system is rebooted, the NFS share will vanish. In order to make these changes persistent between system or MPSS reboots, some files must be edited on the host in the directory /opt/intel/mic/filesystem, as shown in Listing 1.29. This procedure must be repeated for each coprocessor in the system.
user@host% sudo su
root@host% cd /opt/intel/mic/filesystem
root@host% vi mic0/etc/fstab
root@host% vi mic0.filelist
Listing 1.29: Making NFS mount on a coprocessor persistent by modifying the mic0 filesystem files on the host.
The procedure shown in Listing 1.29 can be performed automatically using the tool micctrl:
Listing 1.30: Automated procedure for making an NFS mount on a coprocessor persistent by modifying the mic0 filesystem files on the host (an alternative to the procedure shown in Listing 1.29).
Chapter 2
In Chapter 1, we introduced the MIC architecture without going into the details of how to program Intel Xeon Phi coprocessors. This chapter demonstrates the utilization of the Intel Xeon Phi coprocessor from user applications written in C/C++ and Fortran. It focuses on transferring data and executable code to the coprocessor. Parallelism will be discussed in Chapter 3, and performance optimization in Chapter 4.

Intel Xeon Phi coprocessors run a Linux operating system and support traditional Linux services, including SSH. This allows the user to run applications directly on an Intel Xeon Phi coprocessor by compiling a native executable for the MIC architecture and transferring it to the coprocessor’s virtual filesystem.
2.1.1 Using Compiler Argument -mmic to Compile Native Applications for Intel Xeon Phi Coprocessors
In order to compile a C/C++ code to an Intel Xeon Phi executable, Intel compilers must be given the
argument -mmic. A “Hello World” code for the coprocessor is shown in Listing 2.1. Compiling and running
this code on the host is trivial, and the compilation procedure and runtime output are shown in Listing 2.2.
The name of the executable is not specified, so the compiler sets it to the default name a.out.
1 #include <stdio.h>
2 #include <unistd.h>
3
4 int main(){
5 printf("Hello world! I have %ld logical cores.\n",
6 sysconf(_SC_NPROCESSORS_ONLN ));
7 }
Listing 2.1: This C code (“hello.c”) can be compiled for execution on the host as well as on an Intel Xeon Phi
coprocessor.
Listing 2.2: Compiling and running the “Hello World” code on the host.
Listing 2.3 shows how this code can be compiled for native execution on an Intel Xeon Phi coprocessor.
The code fails to run on the host, because it is not compiled for the Intel Xeon architecture. See Section 2.1.2
to see how this code can be transferred to an Intel Xeon Phi coprocessor and executed on it.
Listing 2.3: A native application for Intel Xeon Phi coprocessors cannot be run on the host system.
2.1.2 Establishing SSH Sessions with Coprocessors
Intel Xeon Phi coprocessors run a Linux operating system with an SSH server (see also Section 1.4.2) and, when the MPSS is loaded, the list of Linux users and their SSH keys on the host are transferred to the Intel Xeon Phi coprocessor. By default, the first Intel Xeon Phi coprocessor in the system is resolved to the hostname mic0, and the executable can be transferred to it with the scp tool, as shown in Listing 2.4. After that, the user can log into the coprocessor using ssh and use the shell to run the application on the coprocessor. Running this executable produces the expected “Hello world” output, reporting the number of logical cores available on the coprocessor.
/home/user
user@mic0% ls
a.out
user@mic0% ./a.out
Hello world! I have 240 logical cores.
user@mic0%
Listing 2.4: Transferring and running a native application on an Intel Xeon Phi coprocessor.
user@host% export SINK_LD_LIBRARY_PATH=/opt/intel/composerxe/compiler/lib/mic
user@host% micnativeloadex ./myapp -l

Dependency information for myapp

  Full path was resolved as
    /home/user/myapp

  Binary was built for Intel(R) Xeon Phi(TM) Coprocessor
  (codename: Knights Corner) architecture

  SINK_LD_LIBRARY_PATH = /opt/intel/composerxe/lib/mic

  Dependencies Found:
    /opt/intel/composer_xe_2013.2.146/compiler/lib/mic/libiomp5.so

  Dependencies Not Found Locally (but may exist already on the coprocessor):
    libm.so.6
    libpthread.so.0
    libc.so.6
    libdl.so.2
    libstdc++.so.6
    libgcc_s.so.1
Listing 2.5: Using micnativeloadex to find library dependencies of a native MIC application and launch it from the host.
Note: micnativeloadex may also be used for performance analysis of native MIC applications. See
Section A.4.1 for more information.
1 #include <stdio.h>
2 #include <unistd.h>
3 #include <pthread.h>
4
5 void* spin(void* arg) {
6   while(true);
7   return NULL;
8 }
9
10 int main(){
11   int n=sysconf(_SC_NPROCESSORS_ONLN);
12   printf("Spawning %d threads that do nothing. Press ^C to terminate.\n", n);
13   for (int i = 0; i < n-1; i++) {
14     pthread_t foo;
15     pthread_create(&foo, NULL, spin, NULL);
16   }
17   spin(NULL);
18 }

Listing 2.6: This C code (“spin.c”) illustrates how the pthreads library can be used to produce parallel codes on Intel Xeon Phi coprocessors. This workload is produced in order to illustrate monitoring tools for the coprocessor.
The output in Listing 2.7 illustrates the compilation and running of the code spin.c on a coprocessor. The code enters an infinite loop and never terminates, so the execution must be terminated by pressing Ctrl+C. However, while the program is running, we can monitor the Intel Xeon Phi coprocessor load.

Listing 2.7: Compiling and running the code in Listing 2.6 as a native workload for Intel Xeon Phi coprocessors.
The utility micsmc included with the MPSS is a graphical user interface that allows the user to monitor the load on the coprocessor and its temperature, to read logs and error messages, and to control some of the coprocessor’s settings. Figure 2.1 (top panel) illustrates how the load on the coprocessor increases for the duration of the execution of the workload code, and drops afterwards. Three panels in the bottom part of Figure 2.1 show some of the information and controls accessible via the micsmc interface. Much of this information and many of these controls are also available via the MPSS command line tools. See Section 1.5.1 for more information.
Figure 2.1: The interface of the micsmc utility. The top panel demonstrates the load on the Intel Xeon Phi coprocessor. Three panels in the bottom part of the screenshot demonstrate the controls and information available via micsmc.
About MPI
MPI, or Message Passing Interface, is a communication protocol for distributed memory high performance
applications. Intel’s proprietary implementation of MPI is available as Intel MPI Library. This library
implements version 2.2 of the MPI protocol. For information about using MPI to express distributed-memory
parallel algorithms, refer to Section 3.3. The Intel MPI Reference Guide [6] contains more detailed information
about using Intel MPI.
Prerequisites

Before compiling and running MPI applications, environment variables should be set by calling a script included in the Intel MPI distribution. Additionally, the Intel MPI binaries and libraries have to be available on the Intel Xeon Phi coprocessor. There are two ways to achieve that. A straightforward, but not recommended, method is to copy certain files from /opt/intel/impi to the coprocessor. A better method is to NFS-share the required files with the coprocessor or coprocessors. The procedure for doing so is described in Section 1.5.4. We will assume that the latter method is used, and that all the required files are already available on the coprocessor.
Usage
MPI applications must be compiled with special wrapper applications: mpiicc for C, mpiicpc for C++, or mpiifort for Fortran codes. In order to launch the resulting executable as a parallel MPI application, it should be run using a wrapper script called mpirun. MPI executables can also be executed as usual applications; however, in this case, parallelization does not occur.
1 #include "mpi.h"
2 #include <stdio.h>
3 #include <string.h>
4
5 int main (int argc, char *argv[]) {
6 int i, rank, size, namelen;
7 char name[MPI_MAX_PROCESSOR_NAME];
8
9 MPI_Init (&argc, &argv);
g
10
an
11 MPI_Comm_size (MPI_COMM_WORLD, &size);
W
12 MPI_Comm_rank (MPI_COMM_WORLD, &rank);
13 MPI_Get_processor_name (name, &namelen);
ng
14
e
15 printf ("Hello World from rank %d running on %s!\n", rank, name);
16
nh
if (rank == 0) printf("MPI World size = %d processes\n", size);
Yu
17
18 MPI_Finalize ();
r
19 }
fo
d
re
Listing 2.9: Source code HelloMPI.c of a “Hello world” program with MPI.
In order to compile and run the source file from Listing 2.9, we use the procedure demonstrated in Listing 2.10.
Listing 2.10: Compiling the “Hello World!” code with Intel MPI for the host system and running it using two processes.
Listing 2.11: Compiling and running a Hello World code with Intel MPI on an Intel Xeon Phi coprocessor.
The difference between this case and the case shown in Listing 2.10 is that we included the argument -host mic0 instead of -host localhost. The hostname mic0 resolves to the first coprocessor in the system, so the executable is launched on that coprocessor rather than on the host. The discussion of MPI will continue in Section 2.4.3, where we will demonstrate how to run MPI calculations on multiple coprocessors or on the host and coprocessors simultaneously. Subsequently, in Section 3.3, we will introduce message passing in order to effect cooperation between processes. Optimization of MPI applications is discussed in Chapter 4.
1 #include <stdio.h>
2
3 int main(int argc, char * argv[] ) {
4   printf("Hello World from host!\n");
5 #pragma offload target(mic)
6   {
7     printf("Hello World from coprocessor!\n");
8     fflush(0);
9   }
10   printf("Bye\n");
11 }

Listing 2.12: Source code of hello-offload.cpp example with the offload segment to be executed on an Intel Xeon Phi coprocessor.
Line 5 in Listing 2.12 — #pragma offload target(mic) — indicates that the following segment of the code should be executed on an Intel Xeon Phi coprocessor (i.e., “offloaded”). This application must be compiled as a usual host application: no additional compiler arguments are necessary in order to compile code containing offload pragmas.
1 #include <stdio.h>
2
3 __attribute__((target(mic))) void MyFunction() {
4   printf("Hello World from coprocessor!\n");
5   fflush(0);
6 }
7
8 int main(int argc, char * argv[] ) {
9   printf("Hello World from host!\n");
10 #pragma offload target(mic)
11   {
12     MyFunction();
13   }
14   printf("Bye\n");
15 }

Listing 2.14: Offloading a function to an Intel Xeon Phi coprocessor.
If multiple functions must be declared with this qualifier, there is a short-hand way to set and unset this qualifier inside a source file (see Listing 2.15). This is also useful when using #include to inline header files.
1 #include <stdio.h>
2
3 #pragma offload_attribute(push, target(mic))
4
5
6 void MyFunctionOne() { // This function has target(mic) set by the pragma above
7   printf("Hello World from coprocessor!\n");
8 }
9 void MyFunctionTwo() { // The target(mic) attribute is still active for this function
10   fflush(0);
11 }
12
13 #pragma offload_attribute(pop)
14 // The target(mic) attribute is unset after the above pragma
15
16 int main(int argc, char * argv[] ) {
17   printf("Hello World from host!\n");
18 #pragma offload target(mic)
19   {
20     MyFunctionOne();
21     MyFunctionTwo();
22   }
23   printf("Bye\n");
24 }
Listing 2.15: Declaring multiple functions with the target attribute qualifier.
Figure 2.2: Proxy console I/O diagram. Output to standard output and standard error streams on the coprocessor is buffered and passed on to the host terminal by the coi daemon running on the Intel Xeon Phi coprocessor. Image credit: Intel Corporation.
In the case of the “Hello World” code (Listing 2.12 and Listing 2.13), buffering delays caused the stream from the coprocessor to be printed out after the host had finished the last printf() function call (line 10 in Listing 2.12). The output buffer must be flushed using the fflush(0) function of the stdio library in order to ensure consistent operation of the console proxy. Without fflush(0) in the coprocessor code, the output of the printf function might be lost if the program is terminated prematurely.
The proxy console I/O service is enabled by default. It can be disabled by setting the environment
variable MIC_PROXY_IO=0.
Despite the name "Proxy console I/O", the coi service proxies only the standard output and standard
error streams. Proxy console input is not supported.
The environment variable OFFLOAD_REPORT controls the diagnostic output of applications performing offload:
a) When this variable is not set or has the value 0, no diagnostic output is produced.
b) Setting OFFLOAD_REPORT=1 produces output including the offload locations and times.
c) Setting OFFLOAD_REPORT=2, in addition, produces information regarding data traffic.
Hello World from coprocessor!
user@host%
user@host% export OFFLOAD_REPORT=1
user@host% ./hello_offload
Hello World from host!
Bye
Hello World from coprocessor!
user@host% ./hello_offload
Bye
Hello World from coprocessor!
user@host%
Listing 2.16: Using the environment variable OFFLOAD_REPORT to monitor the execution of an application performing
offload to an Intel Xeon Phi coprocessor.
1 #include <stdio.h>
2 #include <stdlib.h>
3
4 int main(){
5 #pragma offload target (mic)
6   {
7     char* myvar = getenv("MYVAR");
8     if (myvar) {
9       printf("MYVAR=%s on the coprocessor.\n", myvar);
10     } else {
11       printf("MYVAR is not set on the coprocessor.\n");
12     }
13   }
14 }

Listing 2.17: This code, environment.cc, prints the value of the environment variable MYVAR on the coprocessor.
Listing 2.18: With MIC_ENV_PREFIX undefined, environment variables are passed to the coprocessor without name
changes. With MIC_ENV_PREFIX=MIC, only variables starting with MIC_ are passed, with prefix dropped.
If an Intel Xeon Phi coprocessor is unavailable at run time, offloaded code may fall back to execution on the host. Handling this situation is demonstrated in Listing 2.19.
1 printf("Hello World from host!\n");
2 #pragma offload target(mic) e
{
nh
3
4 printf("Hello World ");
Yu
5 #ifdef __MIC__
6 printf("from coprocessor (offload succeeded).\n");
r
fo
7 #else
8 printf("from host (offload to coprocessor has failed, running on the host).\n");
ed
9 #endif
ar
10 fflush(0);
}
p
11
re
yP
Listing 2.19: Fragment of hello-fallback.cpp: handling the fall-back to host in cases when offload fails.
In Listing 2.20, the code hello-fallback.cpp is compiled and executed. In the first execution attempt, the coprocessor is available, and offload occurs. In the second attempt, the MIC driver is intentionally disabled, and offload fails. Execution proceeds nevertheless; the code simply runs on the host.
Listing 2.20: Compiling and running hello-fallback.cpp: fall-back to execution on the host when offload fails.
1 void MyFunction() {
2   const int N = 1000;
3   int data[N];
4 #pragma offload target(mic)
5   {
6     for (int i = 0; i < N; i++)
7       data[i] = 0;
8   }
9 }

Listing 2.21: Offload of local scalars and arrays of known size occurs automatically.
When data is stored in an array referenced by a pointer, and the array size is unknown at compile time, the programmer must indicate the array length in a clause of #pragma offload, as shown in Listing 2.22.
1 void MyFunction(const int N, int* data) {
2 #pragma offload target(mic) inout(data : length(N))
3   {
4     for (int i = 0; i < N; i++)
5       data[i] = 0;
6   }
7 }

The clauses indicating data transfer direction and amount are further discussed in Sections 2.2.9 and 2.2.10.
Data Transfer without Computation

If it is necessary to send data to the coprocessor without launching any processing of this data, either the body of the offloaded code can be left blank (i.e., use “{}” after #pragma offload), or a special #pragma offload_transfer can be used as shown in Listing 2.24.
1 void SendData(const int N, double* data) {
2 #pragma offload target(mic) in(data : length(N))
3   { }
4
5 #pragma offload_transfer target(mic) \
6     in(data : length(N))
7
8   // The above pragma does not have a body.
9
10 }
This pragma is especially useful when combined with the clause signal in order to start a data transfer without blocking (i.e., to initiate an asynchronous transfer). In this way, data transfer may be overlapped with some other work on the host or on the coprocessor. See Section 2.2.10 for a discussion of asynchronous transfer.
The data transfer clauses of the offload pragmas accept the following specifiers:
• Specifiers in/out indicate that the data must be sent to (“in”) or from (“out”) the coprocessor;
• inout indicates that data must be passed both to and from the target, and
• nocopy can be used to indicate that data should not be transferred in either direction.
These four specifiers only apply to bitwise-copyable data referenced by a pointer (e.g., an array of scalars or
an array of structs). Refer to [8] and [9] for complete information. The rest of this subsection demonstrates the
basic usage of language constructs for data transfer between the host and the coprocessor.
The following example shows how to initiate an offload instance, in which arrays p1 and p2 are sent to the Intel Xeon Phi coprocessor, and array sum is fetched back.
1 #include <stdlib.h>
2 #define N 1000
3
4 int main() {
5   double *p1=(double*)malloc(sizeof(double)*N);
6   double *p2=(double*)malloc(sizeof(double)*N);
7   p1[0:N]=1; p2[0:N]=2;
8   double sum[N];
9 #pragma offload target(mic) \
10   in(p1, p2 : length(N)) \
11   out(sum)
12   {
13     for (int i = 0; i < N; i++) {
14       sum[i] = p1[i] + p2[i];
15     }
16   }
17 }
In the example shown in Listing 2.25, arrays p1 and p2 referenced by pointers are passed to the coprocessor. The pointers must be declared in the scope of the offload pragma, and their length must be specified in the in clause. At the same time, the data returned from the coprocessor at the end of the offload is stored in sum, which is an array with size known at compile time. Its length does not have to be specified in the out clause. This type of data transfer is synchronous, because the execution of the host code is blocked until the offloaded code returns.
In order to preserve some coprocessor allocated memory, and, optionally, the data in it, between offloads,
clauses alloc_if and free_if may be used. These clauses are given arguments which, if evaluated to 1,
enforce the allocation and deallocation of data, respectively. By default, the argument of both alloc_if and
free_if evaluates to 1. Allocation takes place at the start of the offload, and freeing takes place at the end
of the offload.
The example in Listing 2.26 demonstrates how data can be transferred to the coprocessor and preserved there until the next offload.
1 SetupPersistentData(N, persistent);
2
3 #pragma offload_transfer target(mic:0) \
4   in(persistent : length(N) alloc_if(1) free_if(0) )
5
6 for (int iter = 0; iter < nIterations; iter++) {
7   SetupDataset(iter, dataset);
8 #pragma offload target(mic:0) \
9   in (dataset : length(N) alloc_if(iter==0) free_if(iter==nIterations-1) ) \
10   out (results : length(N) alloc_if(iter==0) free_if(iter==nIterations-1) ) \
11   nocopy (persistent : length(N) alloc_if(0) free_if(iter==nIterations-1) )
12   {
13     Compute(N, dataset, results, persistent);
14   }
15   ProcessResults(N, results);
16 }

Listing 2.26: Illustration of data transfer with persistence on the coprocessor between offload regions.
In Listing 2.26, the first pragma allocates array persistent on the coprocessor and initializes it by transferring some data into this array from the host. Then, inside the for-loop, the data in the array persistent is re-used, because the nocopy specifier is used in #pragma offload. In order for nocopy to work, the allocated memory must persist between offloads, and this behavior is requested by using alloc_if(iter==0) in order to limit data allocation to the first iteration. Finally, the clause free_if(iter==nIterations-1) makes sure that the memory on the coprocessor is eventually deallocated to prevent memory leaks. The deallocation occurs only in the last iteration.

Note how the character ‘\’ is used in order to make the specification of the pragma continue onto the next line.

In order to effect asynchronous data transfer, the specifiers signal and wait and #pragma offload_wait are used. Complete information about asynchronous transfer can be found in the Intel C++ Compiler reference [10].
1 #pragma offload_transfer target(mic:0) in(data : length(N)) signal(data)
2
3 // Execution will not block until the transfer is complete.
4 // The function called below will be executed concurrently with the data transfer.
5 SomeOtherFunction(otherData);
6
7 #pragma offload target(mic:0) wait(data) \
8   out(result : length(N))
9 {
10   Calculate(data, result);
11 }
The example in Listing 2.27 illustrates the use of asynchronous transfer pragmas. In this code, #pragma offload_transfer initiates the transfer, and the specifier signal indicates that it should be asynchronous. With asynchronous offload, SomeOtherFunction() will be executed concurrently with the data transport. The second pragma statement, #pragma offload, performs an offloaded calculation. The specifier wait(data) indicates that the offloaded calculation should not start until the data transport signaled by data has been completed. Any pointer variable can serve as the signal, not just the pointer to the array being transferred.

Besides including the wait clause in an offload pragma, the compiler supports the offload_wait pragma, which is illustrated in Listing 2.28.
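As a minimal sketch of such usage (data, N and DoOtherHostWork() are illustrative placeholders introduced here, not names from the original example):

void DoOtherHostWork();  // hypothetical host-side function used only for illustration

void WaitForTransfer(char* data, const int N) {
  // Start an asynchronous transfer to coprocessor 0, signaled by the pointer data
  #pragma offload_transfer target(mic:0) in(data : length(N)) signal(data)

  DoOtherHostWork();  // host work overlapping with the transfer

  // Block here until the transfer signaled by data has completed on coprocessor 0
  #pragma offload_wait target(mic:0) wait(data)
}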
Here, the host code execution will wait at this pragma until the transport signaled by data has finished.
This pragma is useful when it is not necessary to initiate another offload or data transfer at the synchronization
point.
Similarly to asynchronous data transfer, function offload can be done asynchronously, as shown in Listing 2.29.
1 char* offload0;
2 char* offload1;
3
4 #pragma offload target(mic:0) signal(offload0) \
5 in(data0 : length(N)) out(result0 : length(N))
6 { // Offload will not begin until data are transferred
7 Calculate(data0, result0);
8 }
9
10 #pragma offload target(mic:1) signal(offload1) \
11 in(data1 : length(N)) out(result1 : length(N))
12 { // Offload will not begin until data are transferred
13 Calculate(data1, result1);
14 }
15
16 #pragma offload_wait target(mic:0) wait(offload0)
17 #pragma offload_wait target(mic:1) wait(offload1)
Listing 2.29: Illustration of asynchronous offload to different coprocessors.

In this code, two coprocessors are employed simultaneously using asynchronous offloads. More information on managing multiple coprocessors in a system with the explicit offload model can be found in Section 2.4.1.
1 __attribute__((target(mic))) int data[1000];

1 int __attribute__((target(mic)))
2 CountNonzero(const int N, const int* arr) {
3   int nz=0;
4   for (int i = 0; i < N; i++) {
5     if (arr[i] != 0) nz++;
6   }
7   return nz;
8 }

Listing 2.30: Illustration of __attribute__((target(mic))) usage. The code in the top panel indicates that the global variable data may be used in coprocessor code. The code in the bottom panel indicates that the function CountNonzero() may be used in coprocessor code.

The examples in Listing 2.30 show how the non-scalar variable data can be made visible in the scope of the target code and how the function CountNonzero() can be compiled for the coprocessor.
1 #pragma offload_attribute(push, target(mic))
2
3 double* ptrdata;          // Apply the offload qualifier to a pointer-based array,
4 void MyFunction();        // a function
5 #include "myvariables.h"  // or even a whole file
6
7 #pragma offload_attribute(pop)

The example in Listing 2.31 specifies that several arrays and all of the variables and functions declared in the header file myvariables.h should be accessible to the coprocessor code. Note that these must be global variables.
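For data transfer with explicit allocation control, a pragma of the following form can be used (a sketch only; data is assumed to be a pointer-based array of 1000 elements, matching the description in the next paragraph):

#pragma offload_transfer target(mic:0) \
  in(data : length(1000) alloc_if(1) free_if(0))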
The single line of code in Listing 2.32 requests that the array data, which contains 1000 elements, be
transferred in, i.e., from the host to the coprocessor number 0, and that the memory on the coprocessor
must be allocated before the offload, but not freed afterwards. The symbol “\” is used to break the
pragma code into several lines. This is a blocking operation, which means that code execution will stop
until the transfer is complete. It is also possible to request a non-blocking (asynchronous) offload, using
the signal clause, as described in Section 2.2.9.
In this example, no operations will be applied to the transferred data. In order to request some processing
along with data transfer, #pragma offload should be used, as described below.
• #pragma offload target(mic) specifies that the code following this pragma must be executed on the coprocessor if possible. Note that code offloaded using this method blocks until the offloaded code returns. This pragma takes a number of clauses to specify data traffic. These clauses are described in Section 2.2.9.
1 #pragma offload target(mic) in(data)
2 {
3   ct = CountNonzero(N, data);
4 }

The function CountNonzero() in the example shown in Listing 2.33 will be offloaded to the coprocessor if possible. Code execution on the host blocks until the offloaded code returns from the coprocessor. Note that the scalar variables N and ct are lexically visible to both the host and the coprocessor, and are transferred automatically.
Note that the host and the Intel Xeon Phi coprocessor do not share memory in hardware. In order to use the virtual-shared memory model, the programmer has to specify what data should be shared and how it should be accessed by the target:

1. The programmer marks variables that need to be shared between the host system and the target.

2. The same shared variable can then be used in both host and coprocessor code.

3. The runtime automatically maintains coherence at the beginning and at the end of offload statements. Upon entry to the offload code, data modified on the host are automatically copied to the target, and upon exit from the offload call, data modified on the target are copied to the host.

4. Data is marked for sharing with the keyword _Cilk_shared, and offloaded function calls are marked with the keyword _Cilk_offload.

5. Note that, despite _Cilk being a part of these keywords, the programmer is not limited to using Intel Cilk Plus to parallelize the offloaded code. OpenMP, Pthreads, and other frameworks can be used within the offloaded segment. See Section 3.2 for more information about Intel Cilk Plus and OpenMP.
1 #include <stdio.h>
2 #define N 1000
3 _Cilk_shared int ar1[N];
4 _Cilk_shared int ar2[N];
5 _Cilk_shared int res[N];
6
7 void initialize() {
8   for (int i = 0; i < N; i++) {
9     ar1[i] = i;
10     ar2[i] = 1;
11   }
12 }
13
14 _Cilk_shared void add() {
15 #ifdef __MIC__
16   for (int i = 0; i < N; i++)
17     res[i] = ar1[i] + ar2[i];
18 #else
19   printf("Offload to coprocessor failed!\n");
20 #endif
21 }
22
23 void verify() {
24   bool errors = false;
25   for (int i = 0; i < N; i++)
26     errors |= (res[i] != ar1[i] + ar2[i]);
27   printf(errors ? "Errors detected!\n" : "No errors detected.\n");
28 }
29
30 int main() {
31   initialize();
32   _Cilk_offload add(); // Function call on coprocessor:
33                        // ar1, ar2 are copied in, res copied out
34   verify();
35 }
Listing 2.34: Example of using the virtual-shared memory and offloading calculations with _Cilk_shared and _Cilk_offload of the function call. Note that, even though data are not explicitly passed from the host to the coprocessor, the function add(), executed on the coprocessor, has access to data initialized on the host.
1 #include <stdio.h>
2 #define N 10000
3 int* _Cilk_shared data; // Shared pointer to shared data
4 int _Cilk_shared sum;
5
6 _Cilk_shared void ComputeSum() {
7 #ifdef __MIC__
8   printf("Address of data[0] on coprocessor: %p\n", &data[0]);
9   sum = 0;
10 #pragma omp parallel for reduction(+: sum)
11   for (int i = 0; i < N; ++i)
12     sum += data[i];
13 #else
14   printf("Offload to coprocessor failed!\n");
15 #endif
16 }
17
18 int main() {
19   data = (_Cilk_shared int*)_Offload_shared_malloc(N*sizeof(float));
20   for (int i = 0; i < N; i++)
21     data[i] = i%2;
22   printf("Address of data[0] on the host:        %p\n", &data[0]);
23   _Cilk_offload ComputeSum();
24   printf("Sum = %d\n", sum);
25   _Offload_shared_free(data);
26 }
Listing 2.35: Using the _Offload_shared_malloc for dynamic virtual-shared memory allocation in C/C++.
Note that in the above listing, data is a global variable declared as a pointer to shared memory, and that it
is allocated using a special _Offload_shared_malloc call. Variables marked with the _Cilk_shared
keyword will be placed at the same virtual addresses on both the host and the coprocessor, and will synchronize
their values at the beginning and end of offload function calls marked with the _Cilk_offload keyword.
1 #include <stdio.h>
2 #include <string.h>
3
4 typedef struct {
5   int i;
6   char c[10];
7 } person;
8
9 _Cilk_shared void SetPerson(_Cilk_shared person & p,
10                             _Cilk_shared const char* name, const int i) {
11 #ifdef __MIC__
12   p.i = i;
13   strcpy(p.c, name);
14   printf("On coprocessor: %d %s\n", p.i, p.c);
15 #else
16   printf("Offload to coprocessor failed.\n");
17 #endif
18 }
19
20 person _Cilk_shared someone;
21 char _Cilk_shared who[10];
22
23 int main(){
24   strcpy(who, "John");
25   _Cilk_offload SetPerson(someone, who, 1);
26   printf("On host: %d %s\n", someone.i, someone.c);
27 }

Note that in this example, the function SetPerson accepts an argument initialized on the host; however, it is executed on the coprocessor, and produces an object (someone), which is later used on the host.
The example in Listing 2.37 and Listing 2.38 could also be implemented in the explicit offload model
using pragmas. However, a more complex object, such as the class shown in Listing 2.39 and Listing 2.40, can
only be shared in the virtual-shared memory model.
1 #include <stdio.h>
2 #include <string.h>
3
4 class _Cilk_shared Person {
5 public:
6   int i;
7   char c[10];
8
9   Person() {
10     i=0; c[0]='\0';
11   }
12
13   void Set(_Cilk_shared const char* name, const int i0) {
14 #ifdef __MIC__
15     i = i0;
16     strcpy(c, name);
17     printf("On coprocessor: %d %s\n", i, c);
18 #else
19     printf("Offload to coprocessor failed.\n");
20 #endif
21   }
22 };
23
24 Person _Cilk_shared someone;
25 char _Cilk_shared who[10];
26
27 int main(){
28   strcpy(who, "Mary");
29   _Cilk_offload someone.Set(who, 2);
30   printf("On host: %d %s\n", someone.i, someone.c);
31 }

user@host % ./a.out
On host: 2 Mary
On coprocessor: 2 Mary
1 #include <stdio.h>
2 #include <stdlib.h>
3 #include <new>
4
5 class _Cilk_shared MyClass {
6   int i;
7 public:
8   MyClass(){ i = 0; };
9   void set(const int l) { i = l; }
10   void print(){
11 #ifdef __MIC__
12     printf("On coprocessor: ");
13 #else
14     printf("On host: ");
15 #endif
16     printf("%d\n", i);
17   }
18 };
19
20 MyClass _Cilk_shared * sharedObject;
21
22 int main()
23 {
24   // Allocate virtual-shared memory for the object
25   void* address = _Offload_shared_malloc(sizeof(MyClass));
26   // Construct the object in the allocated memory using placement new
27   sharedObject = new (address) MyClass;
28   sharedObject->set(1000);
29   sharedObject->print();               // executed on the host
30   _Cilk_offload sharedObject->print(); // executed on the coprocessor
31 }

Listing 2.41: Using the placement version of operator new to allocate a virtual-shared object of type MyClass.
The placement version of operator new is made available by including the header file <new>. The
presence of the argument address instructs the operator new that memory has already been allocated for
the object at that address, and that only the constructor of the class needs to be called.
The result of the code in Listing 2.41 is shown in Listing 2.42.
user@host % ./a.out
On host: 1000
On coprocessor: 1000
Function                               int _Cilk_shared f(int x) { return x+1; }   Executable code for both host and target; may be called from either side
File/Function static                   static _Cilk_shared int x;                  Visible on both sides, only to code within the file/function
Pointer to shared data                 int _Cilk_shared *p;                        p is local (not shared); can point to shared data
A shared pointer                       int *_Cilk_shared p;                        p is shared; should only point at shared data
Offload a function call asynchronously x = _Cilk_spawn _Cilk_offload func(y);      The offloaded call proceeds concurrently with the host code
Table 2.1: Keywords _Cilk_shared and _Cilk_offload usage for data and functions.
2.4 Multiple Intel Xeon Phi Coprocessors in a System and Clusters with Intel Xeon Phi Coprocessors
We have discussed in Sections 2.1, 2.2 and 2.3 how to use a single Intel Xeon Phi coprocessor using native (or MPI) applications, in the explicit offload model, or in the virtual-shared memory model. This section describes how multiple coprocessors can be used in these programming models.
There are several ways to employ several Intel Xeon Phi coprocessors simultaneously. The best method
depends on the structure and parallel algorithm of the application.
In distributed memory applications using MPI, there exists a multitude of methods for utilizing multiple
hosts and multiple devices (see Section 3.3.1). However, all of these methods can be placed into one of the
following two categories:
(1) MPI processes run only on hosts and perform offload to coprocessors, and
(2) MPI processes run as native applications on coprocessors (or on coprocessors as well as hosts).
For applications utilizing MPI in mode (1), and for offload applications using only a single host, multiple coprocessors per host can be utilized using a combination of approaches described in Section 2.4.1 and Section 2.4.2:

(1a) spawning multiple threads on the host, each performing offload to the respective coprocessor, and

(1b) issuing asynchronous offloads to multiple coprocessors from a single host thread.

For MPI applications in mode (2), scaling across multiple coprocessors occurs naturally.
We will start with the discussion of the offload model, in its explicit implementation and in the MYO variation, and then proceed to discussing the usage of MPI for heterogeneous applications with multiple Intel Xeon Phi coprocessors. Note that this section is not a tutorial on OpenMP, Intel Cilk Plus or MPI; refer to Chapter 3 for an introduction to these parallel frameworks.
1 #include <stdio.h>
2
3 _Cilk_shared int numDevices;
4
5 int main() {
6 numDevices = _Offload_number_of_devices();
7 printf("Number of available coprocessors: %d\n\n" ,numDevices);
8 }
Listing 2.43: _Offload_number_of_devices() will return the number of Intel Xeon Phi coprocessors in the system.
Note: at the time of the writing of this document, the Intel C Compiler version 13.0.1 recognizes the function _Offload_number_of_devices() only if (a) at least one _Cilk_shared variable or function is declared in the compiled code, or (b) the code is compiled with the argument -offload-build.
Specifying an Explicit Offload Target

With several Intel Xeon Phi coprocessors installed in a system, it is possible to request offload to a specific coprocessor. This has been demonstrated in Listing 2.28, where mic:0 indicates that the offload must be performed to the first coprocessor in the system. Another example is shown in Listing 2.44.
1 #pragma offload target(mic:0)
2 {
3   foo();
4 }

Listing 2.44: target(mic:0) directs the offload to the first Intel Xeon Phi coprocessor in the system.
Specifying a target number of 0 or greater indicates that the call applies to the coprocessor with the
corresponding zero-based number. For a target number greater than or equal to the coprocessor count, the
offload will be directed to the coprocessor equal to the target number modulo device count. For example, with
4 coprocessors in the system, mic:1, mic:5, mic:9, etc., direct offload to the second coprocessor.
Specifying mic:-1 instead will invite the runtime system to choose a coprocessor or fail if none are
found.
In applications using asynchronous offloads, specifying target numbers is critical, as waiting for a signal
from the wrong coprocessor can result in the code hanging. The same applies to applications that use data
persistence on the coprocessor. If a persistent array is allocated on a specific coprocessor, but an offload
pragma tries to re-use that array on a different coprocessor, a runtime error will occur.
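The following sketch illustrates this point (buf, N and UseBuffer() are hypothetical names introduced here only for illustration):

__attribute__((target(mic))) void UseBuffer(double* buf, int N); // hypothetical function

void PersistentOnDevice0(double* buf, const int N) {
  // Allocate and fill buf on coprocessor 0; keep the allocation after the transfer
  #pragma offload_transfer target(mic:0) in(buf : length(N) alloc_if(1) free_if(0))

  // Reuse the persistent buffer: the offload must target the same coprocessor (mic:0)
  #pragma offload target(mic:0) nocopy(buf : length(N) alloc_if(0) free_if(1))
  {
    UseBuffer(buf, N);
  }
  // Using target(mic:1) in the second pragma instead would cause a runtime error,
  // because buf was never allocated on that coprocessor.
}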
1 #include <stdlib.h>
2 #include <stdio.h>
3
4 __attribute__((target(mic))) int* response;
5
6 int main(){
7   int n_d = _Offload_number_of_devices();
8   if (n_d < 1) {
9     printf("No devices available!");
10     return 2;
11   }
12   response = (int*) malloc(n_d*sizeof(int));
13   response[0:n_d] = 0;
14 #pragma omp parallel for
15   for (int i = 0; i < n_d; i++) {
16     // The body of this loop is executed by n_d host threads concurrently
17 #pragma offload target(mic:i) inout(response[i:1])
18     {
19       // Each offloaded segment blocks the execution of the thread that launched it
20       response[i] = 1;
21     }
22   }
23   for (int i = 0; i < n_d; i++)
24     if (response[i] == 1) {
25       printf("OK: device %d responded\n", i);
26     } else {
27       printf("Error: device %d did not respond\n", i);
28     }
29 }

Listing 2.45: Illustration of employing several Intel Xeon Phi coprocessors simultaneously using multiple host threads.
This code must be compiled with the compiler argument -openmp in order to enable #pragma omp parallel for. The for-loop in line 15 is executed in parallel on the host. Today’s computing systems support a maximum of eight Intel Xeon Phi coprocessors, and the number of CPU cores in these systems is no less than eight. Therefore, the default behavior of this parallel loop is to launch all n_d host threads simultaneously. Each thread executes its own offloaded segment, and all offloaded segments will therefore run concurrently. See Section 3.2.3 for more information about parallel loops in OpenMP.
In high performance applications, algorithms generally distribute work across available coprocessors.
The example in Listing 2.45 illustrates one of the language constructs that may be used for work distribution.
The clause inout(response[i:1]) indicates that only a segment of array response should be sent in
and out of coprocessor mic:i, namely, the segment starting with the index i and having a length of 1. This
is an example of Intel Cilk Plus array notation further discussed in Section 3.1.7.
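As a brief illustration of the notation itself (a schematic sketch; array[start:length] denotes a section of length consecutive elements beginning at index start, and the syntax is accepted by the Intel C/C++ compiler):

double a[100], b[100], c[100];
a[0:100] = 1.0;                 // broadcast 1.0 to all 100 elements of a
b[0:100] = 2.0;
c[10:5]  = a[10:5] + b[10:5];   // element-wise sum of 5 elements starting at index 10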
1 #include <stdlib.h>
2 #include <stdio.h>
3
4 __attribute__((target(mic))) int* response;
5
6 int main(){
7   int n_d = _Offload_number_of_devices();
8   if (n_d < 1) {
9     printf("No devices available!");
10     return 2;
11   }
12   response = (int*) malloc(n_d*sizeof(int));
13   response[0:n_d] = 0;
14   for (int i = 0; i < n_d; i++) {
15 #pragma offload target(mic:i) inout(response[i:1]) signal(&response[i])
16     {
17       // The offloaded job does not block the execution on the host
18       response[i] = 1;
19     }
20   }
21
22   for (int i = 0; i < n_d; i++) {
23     // This loop waits for all asynchronous offloads to finish
24 #pragma offload_wait target(mic:i) wait(&response[i])
25   }
26
27   for (int i = 0; i < n_d; i++)
28     if (response[i] == 1) {
29       printf("OK: device %d responded\n", i);
30     } else {
31       printf("Error: device %d did not respond\n", i);
32     }
33 }

Listing 2.46: Illustration of employing several Intel Xeon Phi coprocessors simultaneously using asynchronous offloads.
The code sample shown above uses only one host thread, but this thread spawns multiple offloads in the for-loop in line 14. The asynchronous nature of the offload is requested by the clause signal. Any pointer can be chosen as a signal, as long as the pointer assigned to each coprocessor is unique. In this code, for simplicity, the signal is chosen as a pointer to the array element sent to the respective coprocessor. The loop in line 22 waits for the signals. The arrival of each signal indicates the end of the offload.
1 _Cilk_offload_to(i) func();

Listing 2.47: _Cilk_offload_to(i) will use Intel Xeon Phi coprocessor number i (counted from zero) for offloading. See also Section 2.4.1 for information about the rules of coprocessor specification.

In order to effect asynchronous offload in the MYO model, the keyword _Cilk_spawn should be placed in front of the offloaded call. The keyword _Cilk_spawn is a part of Intel Cilk Plus, and therefore synchronization between spawned offloads is done in the same way as with jobs spawned on the host, i.e., using _Cilk_sync. More information can be found in Section 3.2.4. Note that, even though the keyword _Cilk_spawn is a part of the parallel framework Intel Cilk Plus, the programmer is not restricted to using Intel Cilk Plus for parallelizing the offloaded code; OpenMP and other frameworks can be used as well.
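Schematically, combining the two keywords looks as follows (func() stands for an arbitrary _Cilk_shared function, and i for a coprocessor number):

_Cilk_spawn _Cilk_offload_to(i) func();  // returns immediately; the offload runs asynchronously
// ... other host work can proceed here ...
_Cilk_sync;                              // wait for all spawned offloads to complete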
1 #include <stdlib.h>
2 #include <stdio.h>
3
4 int _Cilk_shared *response;
5
6 void _Cilk_shared Respond(int _Cilk_shared & a) {
7   a = 1;
8 }
9
10 int main(){
11   int n_d = _Offload_number_of_devices();
12   if (n_d < 1) {
13     printf("No devices available!");
14     return 2;
15   }
16   response = (int _Cilk_shared *) _Offload_shared_malloc(n_d*sizeof(int));
17   response[0:n_d] = 0;
18   _Cilk_for (int i = 0; i < n_d; i++) {
19     // All iterations start simultaneously in n_d host threads
20     _Cilk_offload_to(i)
21       Respond(response[i]);
22   }
23   for (int i = 0; i < n_d; i++)
24     if (response[i] == 1) {
25       printf("OK: device %d responded\n", i);
26     } else {
27       printf("Error: device %d did not respond\n", i);
28     }
29 }

Listing 2.49: Illustration of employing several Intel Xeon Phi coprocessors simultaneously using multiple host threads.
In this case, the loop in line 18 is executed in parallel with the help of the Intel Cilk Plus library. It is expected that the number of available Cilk Plus workers is greater than the number of coprocessors in the system, and therefore, all offloads will start simultaneously. See Section 3.2.3 for more information about parallel loops in Intel Cilk Plus.
1 #include <stdlib.h>
2 #include <stdio.h>
3
4 int _Cilk_shared *response;
5
6 void _Cilk_shared Respond(int _Cilk_shared & a) {
7   a = 1;
8 }
9
10 int main(){
11   int n_d = _Offload_number_of_devices();
12   if (n_d < 1) {
13     printf("No devices available!");
14     return 2;
15   }
16   response = (int _Cilk_shared *) _Offload_shared_malloc(n_d*sizeof(int));
17   response[0:n_d] = 0;
18   for (int i = 0; i < n_d; i++) {
19     _Cilk_spawn _Cilk_offload_to(i)
20       Respond(response[i]);
21   }
22   _Cilk_sync;
23   for (int i = 0; i < n_d; i++)
24     if (response[i] == 1) {
25       printf("OK: device %d responded\n", i);
26     } else {
27       printf("Error: device %d did not respond\n", i);
28     }
29 }

Listing 2.50: Illustration of employing several Intel Xeon Phi coprocessors simultaneously using asynchronous offloads spawned with _Cilk_spawn from a single host thread.
(1) MPI processes run only on processors and perform offload to coprocessors attached to their respective
host. In this case, multiple coprocessors can be used as described in Section 2.4.1 and Section 2.4.2. It is
possible to employ this method with either the bridge, or the static pair network topology of coprocessors
(see Section 1.5.2).
(2) MPI processes run as native applications on coprocessors (or on coprocessors as well as processors). The
procedure employing multiple coprocessors with this approach is presented in this section. Note that in
order to use this approach with more than one host (i.e., on a cluster), the network connections of Intel
Xeon Phi coprocessors must be configured in the bridge topology, so that all coprocessors are directly
IP-addressable on the same private network as the hosts. However, in this section, we restrict the example
to a single host with multiple coprocessors, and therefore the network configuration is unimportant.
Code

Let us re-use the “Hello World” application for MPI shown in Listing 2.9. For convenience, this code is reproduced in Listing 2.51.
1 #include "mpi.h"
2 #include <stdio.h>
r
fo
3 #include <string.h>
4
d
7 char name[MPI_MAX_PROCESSOR_NAME];
re
8
MPI_Init (&argc, &argv);
yP
9
10
el
12
MPI_Get_processor_name (name, &namelen);
us
13
14
cl
Listing 2.51: Source code HelloMPI.c of a “Hello world!” program with MPI.
Note that we assume that the MPI library has been NFS-shared with both coprocessors as discussed in Section 1.5.4.

In order to run this code on two coprocessors attached to the machine, let us first see how we can launch an MPI job on the coprocessor from the host (see Listing 2.52). In this case, an additional environment variable, I_MPI_MIC, must be set on the host, and the argument -host mic0 must be passed to mpirun.

Listing 2.52: Launching an Intel MPI application from the host.
In order to start the application on two coprocessors, we can specify the list of hosts and their respective parameters using the separator ‘:’, as shown in Listing 2.53. In the same way, applications on remote coprocessors and remote hosts can be launched, if the bridged network topology makes these remote coprocessors or hosts directly addressable.

Listing 2.53: Launching an Intel MPI application on two coprocessors from the host.

In practice, in order to run jobs on hundreds or thousands of hosts and coprocessors, mpirun accepts a file with the list of machines instead of individual hosts separated with ‘:’, as demonstrated in Listing 2.54.
Listing 2.55: Launching an Intel MPI application on two coprocessors and the host itself. The symbol ‘\’ in the second
line indicates the continuation of the shell command onto the next line.
Peer to Peer Communication between Coprocessors
Note that in order for MPI jobs on two or more coprocessors to work, they must be able to communicate with each other via TCP/IP. In order to check whether IP packets can travel from one coprocessor to another, one can log in to a coprocessor and try to ping another coprocessor. If this test fails, the administrator must check that packet forwarding is enabled on the host. Enabling packet forwarding can be done by editing the file /etc/sysctl.conf and ensuring that the following line is present (or changing 0 to 1 otherwise):

net.ipv4.ip_forward = 1
Listing 2.56: Enabling packet forwarding in the host file /etc/sysctl.conf to facilitate peer to peer communication between coprocessors.
If the file /etc/sysctl.conf was edited, the settings will become effective after a system reboot. It is also possible to enable packet forwarding for the current session using the command shown in Listing 2.57.
Chapter 3
Expressing Parallelism
Chapter 2 discussed the methods of data sharing in applications employing Intel Xeon Phi coprocessors. This chapter introduces parallel programming language extensions of C/C++ supported by the Intel C++ Compiler for programming the Intel Xeon and Intel Xeon Phi architectures. It discusses data parallelism (SIMD instructions and automatic vectorization), shared-memory thread parallelism (OpenMP, Intel Cilk Plus) and distributed-memory process parallelism with message passing (MPI). The purpose of this chapter is to introduce parallel programming paradigms and language constructs, rather than to provide optimization advice. For optimization, refer to Chapter 4.
This section introduces vector instructions (SIMD parallelism) in Intel Xeon processors and Intel Xeon Phi coprocessors and outlines the Intel C++ Compiler support for these instructions. Vector operations illustrated in this section can be used in both serial and multi-threaded codes; however, examples are limited to serial codes for simplicity. This section introduces the language extensions and concepts of SIMD calculations.

Most processor architectures today include SIMD (Single Instruction Multiple Data) parallelism in the form of a vector instruction set. SIMD instructions are designed to apply the same mathematical operation to several integer or floating-point numbers. The following pseudocode illustrates SIMD instructions:

Listing 3.1: This pseudocode illustrates the concept of SIMD operations. The SIMD loop (right) performs 1/4 the number of iterations of the scalar loop (left), and each addition operator acts on 4 numbers at a time (i.e., addition here is a single instruction for multiple data elements).
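Schematically, in C-like pseudocode (vec4_load, vec4_add and vec4_store are purely illustrative placeholders for packed load, add and store operations, not actual intrinsics):

// Scalar loop: one addition per iteration
for (int i = 0; i < n; i++)
  c[i] = a[i] + b[i];

// SIMD loop (schematic): four additions per instruction, one quarter of the iterations
for (int i = 0; i < n; i += 4)
  vec4_store(&c[i], vec4_add(vec4_load(&a[i]), vec4_load(&b[i])));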
The maximum potential speedup of this SIMD-enabled calculation with respect to a scalar version is equal to the number of values held in the processor’s vector registers. In the example in Listing 3.1, this factor is equal to 4. The practical speedup with SIMD depends on the width of the vector registers, the type of scalar operands, the type of instruction, and the associated memory traffic. See Section 3.1.2 for more information about various SIMD instruction sets. Additional reading on code vectorization can be found in the book [13] by Aart Bik, former lead architect of automatic vectorization in the Intel compilers.
Instruction Set   Year and Intel Processor   Vector registers   Packed Data Types
MMX               1997, Pentium              64-bit             8-, 16- and 32-bit integers
SSE               1999, Pentium III          128-bit            32-bit single precision FP (floating-point)
SSE2              2001, Pentium 4            128-bit            8- to 64-bit integers; single & double prec. FP
SSE3–SSE4.2       2004 – 2009                128-bit            (additional instructions)
IMCI              2012, Knights Corner       512-bit            32- and 64-bit integers; single & double prec. FP

Table 3.1: History of SIMD instruction sets supported by Intel processors. Processors supporting modern instruction sets are backward-compatible with older instruction sets. The Intel Xeon Phi coprocessor is an exception to this trend, as it supports only the IMCI vector instruction set and is not backward-compatible with SSE or AVX.
Even if you did not know about SIMD instructions before, or did not make specific efforts to employ them in your code, your application may already be using SIMD parallelism.

Some high-level mathematics libraries, such as the Intel MKL, contain implementations of common operations for linear algebra, signal analysis, statistics, etc., which use SIMD instructions. In codes where performance-critical calculations call such library functions, vectorization is employed without burdening the programmer. Whenever your application performs an operation that can be expressed as an Intel MKL library function, the easiest way to vectorize this operation is to call the library implementation. This applies to workloads for the Intel Xeon and Intel Xeon Phi architectures alike.
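For instance, a vector update of the form y = a*x + y can be delegated to the BLAS routine cblas_daxpy from MKL instead of being vectorized by hand (a sketch; it assumes the code is compiled and linked against Intel MKL):

#include <mkl.h>

void scaled_add(const int n, const double a, const double* x, double* y) {
  // y[i] += a * x[i] for i = 0..n-1, computed by MKL's vectorized BLAS routine
  cblas_daxpy(n, a, x, 1, y, 1);
}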
In original high performance codes, SIMD operations may be implemented by the compiler through a
feature known as automatic vectorization. Automatic vectorization is enabled at the default optimization level
-O2. However, in order to gain the most from automatic vectorization, the programmer must organize data
and loops in a certain way, as described further in this section. Automatic vectorization is the most convenient
way to employ SIMD, because cross-platform porting is performed by the compiler.
Finally, SIMD instructions may be called explicitly via assembly code or vector intrinsics. This method
may sometimes yield better performance than automatic vectorization, but cross-platform porting is difficult.
The rest of this section explains how to ensure that user code is vectorized.
Compilers can use faster vector load and store instructions when the data is aligned on the vector register
boundary. A stack array can be aligned using the specifier __declspec(align(n)), as shown in Listing 3.2.
1 __declspec(align(64)) float A[n];
Listing 3.2: Allocating a stack array A aligned on a 64-byte boundary.
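One common alternative is the GCC-style aligned attribute, which the Intel C++ Compiler also accepts; the exact form used in the original listing is assumed here:

1 float A[n] __attribute__((aligned(64)));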
Listing 3.3: Alternative way of creating a stack array A aligned on a 64-byte boundary.
In both examples shown above, array A will be placed in memory in such a way that the address of A[0]
is a multiple of 64, i.e., aligned on a 64-byte boundary. Note that setting a very high alignment value may
lead to a significant fraction of virtual memory being wasted. Also remember that the boundary value must be a
power of two. See the Intel C++ Compiler Reference [16] for more information.
1 #include <malloc.h>
2 // ...
3 float *A = (float*)_mm_malloc(n*sizeof(float), 16);
4 // ...
5 _mm_free(A);
Listing 3.4: Allocating and freeing a memory block aligned on a 16-byte boundary with _mm_malloc/_mm_free.
An alternative way to achieve alignment when allocating memory on the heap is to use the malloc call
to allocate a block of memory slightly larger than needed, and then point a new pointer to an aligned address
within that block. The advantage of this method is that it can be used in compilers that do not support the
_mm_malloc/_mm_free calls. See Listing 3.5 for an example of this procedure.
1 #include <stdlib.h>
2 // ...
3 char *holder = (char*)malloc(bytes+32-1); // Not guaranteed to be aligned
4 size_t offset = (32-((size_t)holder)%32)%32; // From holder to nearest aligned addr.
5 float *ptr=(float*) ((char*)(holder) + offset); // ptr[0] aligned on 32-byte boundary
6 // ...
7 free(holder); // use original pointer to deallocate memory
Listing 3.5: Allocating and freeing a memory block aligned on a 32-byte boundary with malloc and free. Note that in
this case, the pointer ptr should be used to access data, but memory must be freed via holder.
Alignment of Objects Created with the Operator new
In C++, the operator new does not guarantee alignment. In order to align a C++ class on a boundary, the
programmer can allocate an aligned block of memory using one of the methods shown above, and then use the
placement version of the operator new as shown in Listing 3.6. Naturally, if this method is used for objects of
derived types (classes and structures), then the internal structure of these types must be designed in such a way
that the data members relying on alignment fall on properly aligned offsets within the object.
1 #include <new>
2 // ...
3 char *buf = (char*)_mm_malloc(sizeof(MyClass), 64); // aligned buffer for placement new
4 MyClass *ptr = new (buf) MyClass; // placing MyClass without allocating new memory
5 // ...
6 ptr->~MyClass();
7 _mm_free(buf);
Listing 3.6: Placing an object of type MyClass into a memory block aligned on a 64-byte boundary. Note that the
delete operator should not be called on ptr; instead, the destructor should be run explicitly, followed by freeing the
aligned block with _mm_free.
3.1.5 Using SIMD with Inline Assembly Code, Compiler Intrinsics and Class Libraries
SIMD instructions can be explicitly called from the user code using inline assembly, compiler intrinsics
and class libraries. Note that this method of using SIMD instructions is not recommended, as it limits the
portability of the code across different architectures. For example, porting a code that runs on Intel Xeon
processors and uses AVX intrinsics to run on Intel Xeon Phi coprocessors with IMCI intrinsics requires that
the portion of code with intrinsics is completely re-written. Instead of explicit SIMD calls, developers are
encouraged to employ automatic vectorization with methods described in Section 3.1.6 through Section 3.1.10.
However, this section provides information about intrinsics for reference.
Compared to inline assembly, compiler intrinsics provide the same level of control and performance, while keeping the code more
readable. The next section introduces the use of compiler intrinsics.
Intel R Compilers intrinsics
For every instruction set supported by the Intel compiler, there is a corresponding header file that declares
the corresponding short vector types and vector functions. Table 3.2 lists these header files.
Instruction Set   Header File
MMX               mmintrin.h
AVX               immintrin.h
AVX2              immintrin.h
IMCI              immintrin.h
Table 3.2: Header files for the Intel C++ Compiler intrinsics.
In order to perform a calculation with intrinsics, three steps are generally taken:
1. the data has to be loaded into variables representing the content of vector registers;
2. the desired operations have to be applied to these variables by calling the corresponding intrinsic functions;
3. the data in resultant vector register variables must be stored back in memory.
In addition, in some cases, the data loaded into vector registers must be aligned, i.e., placed at a memory
address which is a multiple of a certain number of bytes. See Section 3.1.4 for more information on data
alignment.
Codes in Listing 3.7 illustrate using the SSE2 and IMCI intrinsics for the addition of two arrays shown in
Listing 3.1. Note that the stride of the loop variable i is 4 for the SSE2 code and 16 for the IMCI code.
1 for (int i=0; i<n; i+=4) { 1 for (int i=0; i<n; i+=16) {
2 __m128 Avec=_mm_load_ps(A+i); 2 __m512 Avec=_mm512_load_ps(A+i);
3 __m128 Bvec=_mm_load_ps(B+i); 3 __m512 Bvec=_mm512_load_ps(B+i);
4 Avec=_mm_add_ps(Avec, Bvec); 4 Avec=_mm512_add_ps(Avec, Bvec);
5 _mm_store_ps(A+i, Avec); 5 _mm512_store_ps(A+i, Avec);
6 } 6 }
Listing 3.7: Addition of two arrays using SSE2 intrinsics (left) and IMCI intrinsics (right). These codes assume that
the arrays float A[n] and float B[n] are aligned on a 16-byte boundary for SSE2 and on a 64-byte boundary for IMCI,
and that n is a multiple of 4 for SSE2 and a multiple of 16 for IMCI. Variables Avec and Bvec are 128 bits
(4 × sizeof(float) bytes) in size for SSE2 and 512 bits (16 × sizeof(float) bytes) for the Intel Xeon Phi architecture.
The SSE2 code in Listing 3.7 will run only on Intel Xeon processors, and the IMCI code will run
only on Intel Xeon Phi coprocessors. The necessity to maintain a separate version of a SIMD code for each
target instruction set is generally undesirable, however, it cannot be avoided when code is vectorized with
intrinsics. A better approach to expressing SIMD parallelism is using the Intel Cilk Plus extensions for array
notation (see Section 3.1.7) or auto-vectorizable C loops and vectorization pragmas (see Section 3.1.6 through
Section 3.1.10).
Note that switching between different instruction sets in a code employing SIMD intrinsics should be
done with care. In some cases, in order to switch between different instruction sets supported by a processor,
registers have to be set to a certain state to avoid a performance penalty. See the Intel Compilers Reference [20]
for details.
Class Libraries
The C++ vector class library provided by the Intel Compilers ([21], [22]) defines short vectors as C++
classes, and operators acting on these vectors are defined in terms of SIMD instructions. A similar library was
recently released by Agner Fog [23]. Table 3.3 lists the header files that should be included in order to gain
access to these classes.
Instruction Set   Header File
MMX               ivec.h
SSE               fvec.h
SSE2              dvec.h
AVX               TBA
IMCI              TBA
Table 3.3: Header files for the Intel C++ Class Library.
Codes in Listing 3.8 demonstrate how the C++ vector class library included with the Intel C++ compiler
can be used to execute the SIMD loop shown in Listing 3.1.
1 for (int i=0; i<n; i+=4) { 1 for (int i=0; i<n; i+=16) {
2 F32vec4 *Avec=(F32vec4*)(A+i); 2 F32vec16 *Avec=(F32vec16*)(A+i);
3 F32vec4 *Bvec=(F32vec4*)(B+i); 3 F32vec16 *Bvec=(F32vec16*)(B+i);
4 *Avec = *Avec + *Bvec; 4 *Avec = *Avec + *Bvec;
5 } 5 }
Listing 3.8: Addition of two arrays using the Intel C++ vector class library with SSE2 (left) and IMCI instructions (right).
These codes assume that the arrays float A[n] and float B[n] are aligned on a 16-byte boundary for SSE2 and on a
64-byte boundary for IMCI, and that n is a multiple of 4 for SSE2 and a multiple of 16 for the Intel Xeon Phi architecture.
1 #include <stdio.h>
2
3 int main(){
4   const int n=8;
5   int i;
6   __declspec(align(64)) int A[n];
7   __declspec(align(64)) int B[n];
8
9   // Initialization
10  for (i=0; i<n; i++)
11    A[i]=B[i]=i;
12
13  // Addition of the two arrays
14  for (i=0; i<n; i++)
15    A[i]+=B[i];
16
17  // Output
18  for (i=0; i<n; i++)
19    printf("%d ", A[i]);
20 }
Listing 3.9: The source code file autovec.c (top panel) illustrates a regular C++ code that will be auto-vectorized by
the compiler. The only step the developer had to make in this example is allocating the arrays on a 64-byte boundary. The
bottom panel shows the compilation and runtime output of the code.
Let us focus on the source code in Listing 3.9 first. Unlike codes in Listing 3.7 and Listing 3.8, the code
in Listing 3.9 is oblivious of the architecture that it is compiled for. Indeed, this code can be compiled and
auto-vectorized for Pentium 4 processors with SSE2 instructions as well as for Intel Xeon Phi coprocessors.
The only place where architecture is implicitly assumed is the alignment boundary value of 64. This value is
greater than the SSE requirement (16) and is chosen to satisfy the IMCI alignment requirements.
Now let us take a look at the compilation and runtime output of the code shown in Listing 3.9.
• The code was compiled with the argument -vec-report3, which forces the compiler to print some
of the automatic vectorization status information.
• No special optimization arguments were used. Automatic vectorization is enabled for optimization level
-O2, which is the default optimization level, and higher.
• The first line of the compiler output indicates that the initialization loop in line 10 of the source code
was not vectorized: “vectorization possible but seems inefficient”. This happened because the array size
is known at compile time and is very small. The heuristics analyzed by the auto-vectorizer suggest that, given
the vectorization overhead, vectorizing this loop is not going to be beneficial for performance.
• The second line of the compiler output reads in capitals: “LOOP WAS VECTORIZED”. This is the
expected result, and it applies to the loop in line 14 that performs addition.
• The third line indicates that the loop in line 18 was not vectorized because of the “existence of vector
dependence”, because the printf statement in that loop cannot be expressed via vector instructions.
• Code output following the compilation report shows that the code works as expected.
As proof that this C code is indeed a portable solution, one can compile it for native execution on
Intel Xeon Phi coprocessors. Listing 3.10 illustrates the result, which is self-explanatory.
Listing 3.10: Compilation and runtime output of the code in Listing 3.9 for Intel Xeon Phi execution
In addition to portability across architectures, reliance on automatic vectorization provides other benefits.
For instance, auto-vectorizable code may release the programmer from the requirement that the number of
iterations should be a multiple of the number of data elements in the vector register. Indeed, the compiler will
peel off the last few iterations if necessary, and perform them with scalar instructions. It is also possible to
automatically vectorize loops working with data that are not aligned on a proper boundary. In this case, the
compiler will generate code to check the data alignment at runtime and, if necessary, peel off a few iterations
at the start of the loop in order to perform the bulk of the calculations with fast aligned instructions.
Generally, the only type of loop that the compiler will auto-vectorize is a for-loop with the number of
iterations known at runtime or, better yet, at compile time. Memory access in the loop must have a regular
pattern, ideally with unit stride.
Non-standard loops that cannot be automatically vectorized include: loops with irregular memory access
pattern, calculations with vector dependence, while-loops or for-loops in which the number of iterations
cannot be determined at the start of the loop, outer loops, loops with complex branches (i.e., if-conditions),
and anything else that cannot be, or is very difficult to vectorize.
Further information on automatic vectorization of loops can be found in Section 3.1.10 and Section 4.3,
and in the Intel C++ compiler reference [24].
3.1.7 Extensions for Array Notation in Intel R Cilk Plus
Automatic vectorization in the Intel C++ compiler is not limited to loops. The Intel Cilk Plus extension
provides additional tools that enable the programmer to indicate data parallelism so that the compiler can
automatically vectorize with SIMD operations.
Array notation is a method for specifying slices of arrays or whole arrays, and applying element-wise
operations to arrays of the same shape. The Intel C++ Compiler implements these operations using vector
code, mapping data-parallel constructs to the SIMD hardware.
In the example code in Listing 3.9, the addition loop in lines 14-15 can be replaced with the code shown
in Listing 3.11. When this code is compiled with the Intel C++ Compiler, the addition operation will be
automatically vectorized.
1 A[:] += B[:];
Listing 3.11: Intel Cilk Plus extensions for array notation example: to all elements of array A, add elements of array B.
It is also possible to specify a slice of arrays:
1 A[0:16] += B[32:16];
Listing 3.12: Intel Cilk Plus extensions for array notation example: to sixteen elements of array A (0 through 15),
add sixteen elements of array B (32 through 47).
1 A[0:16:2] += B[32:16:4];
Listing 3.13: Intel Cilk Plus extensions for array notation example: to sixteen elements of array A (0, 2, 4, . . . , 30),
add sixteen elements of array B (32, 36, 40, . . . , 92), using a stride of 2 in A and a stride of 4 in B.
The Intel Cilk Plus extensions are enabled by default, and therefore no additional modifications of the
code or compiler arguments are necessary. However, in order to enable compilation with non-Intel compilers,
the programmer must protect the expressions with array notation with preprocessor directives and provide an
alternative implementation of these expressions with loops that can be understood by other compilers. See
Listing 3.14 for an example.
1 #ifdef __INTEL_COMPILER
2 A[:] += B[:];
3 #else
4 for (int i = 0; i < 16; i++)
5 A[i] += B[i];
6 #endif
Listing 3.14: Protecting Intel Cilk Plus array notation in order to enable compilation with non-Intel compilers.
Array notation extensions also work with multidimensional arrays. Refer to the Intel C++ Compiler
documentation for more details on Intel Cilk Plus [25] and the array notation extensions of this library [26].
1 float my_simple_add(float x1, float x2){
2   return x1 + x2;
3 }
4 // ...
5 for (int i = 0; i < N; i++) {
6   // scalar function call in the loop body
7   output[i] = my_simple_add(inputa[i], inputb[i]);
8 }
Listing 3.15: Scalar function for addition in C.
If the code of the function and the call to the function are located in the same file, the compiler may
perform inter-procedural optimization (inline the function) and vectorize this loop. However, what if the
function is a part of a library? In that case it is impossible for the compiler to inline the function code and
replace scalar addition with SIMD operations. The solution to this situation is offered by elemental functions:
the vector attribute (__attribute__((vector)) with the Intel C++ Compiler on Linux)
must be added to the function declaration. And in order to force the vectorization of the loop using this
function, #pragma simd must be used. Listing 3.16 demonstrates this method.
1 __attribute__((vector)) float my_simple_add(float x1, float x2){
2   return x1 + x2;
3 }
4 #pragma simd
5 for (int i = 0; i < N; i++)
6   output[i] = my_simple_add(inputa[i], inputb[i]);
Listing 3.16: An elemental function for addition and a vectorized loop calling it.
The usage of elemental functions may be combined with array notation as shown in Listing 3.17.
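A sketch of such a combination, reusing the arrays and the elemental function from the examples above (the exact form of the original listing is assumed):

1 output[:] = my_simple_add(inputa[:], inputb[:]);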
For more information on elemental functions in Intel Cilk Plus, refer to the Intel C++ Compiler compiler
documentation [27].
1 float *a, *b;
2 // ...
3 for (int i = 1; i < n; i++)
4 a[i] += b[i]*a[i-1];
Listing 3.18: Vector dependence makes the vectorization of this loop impossible.
However, in some cases the compiler may not have sufficient information in order to determine whether
a true vector dependence is present in the loop. Such cases are referred to as assumed vector dependence.
Assumed Vector Dependence Example
Code in Listing 3.19 shows a case where it is impossible to determine whether a vector dependence exists.
If pointers a and b point to distinct, non-overlapping memory segments, then there is no vector dependence.
However, there is a possibility that the user will pass to the function a and b pointing to overlapping memory
addresses (e.g., a==b+1), in which case vector dependence will exist.
1 void copy(float* a, float* b, const int n) { // hypothetical signature: a and b may alias
2   for (int i = 0; i < n; i++)
3     a[i] = b[i];
4 }
Listing 3.19: Vector dependence may occur if memory regions referred to by a and b overlap. The Intel Compilers may
refuse to vectorize this loop because of the assumed vector dependence.
In order to illustrate what happens in situations with assumed vector dependence, we place the code from
Listing 3.19 into file vdep.cc and compile it. The compiler output shown in Listing 3.20 reports that the
loop is not vectorized. The reason for that is that an assumed vector dependence has been found.
Listing 3.20: Compiler argument -vec-report3 prints diagnostic information about automatic vectorization.
In cases when the developer knows that there will not be a true vector dependence situation, it is possible
to instruct the compiler to ignore assumed vector dependencies found in a loop. This can be done with
#pragma ivdep, as shown in Listing 3.21.
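A sketch of the loop from Listing 3.19 with the pragma applied (the hypothetical function signature is the same as above):

1 void copy(float* a, float* b, const int n) {
2   #pragma ivdep
3   for (int i = 0; i < n; i++)
4     a[i] = b[i];
5 }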
Listing 3.21: The pragma before this loop instructs the compiler to ignore assumed vector dependence.
Listing 3.22 shows the compilation output. This time, automatic vectorization has succeeded.
Listing 3.22: Automatic vectorization succeeds thanks to #pragma ivdep.
It must be noted that if the function compiled in this way is called with overlapping arrays a and b (i.e.,
with true vector dependence), the code may produce incorrect results or crash.
Pointer Disambiguation
A more fine-grained method to disambiguate the possibility of vector dependence is the restrict
keyword. This keyword applies to each pointer variable qualified with it, and instructs the compiler that
the object accessed by the pointer is only accessed by that pointer in the given scope. In order to enable
the keyword restrict, the compiler argument -restrict must be used. An example of the usage of
keyword restrict is shown in Listing 3.23. This time, automatic vectorization has succeeded as well. Note
that the compiler was given the argument -restrict, without which compilation would have failed.
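A sketch of how the restrict qualifier could be applied to the same hypothetical function (compiled with -restrict):

1 void copy(float* restrict a, float* restrict b, const int n) {
2   for (int i = 0; i < n; i++)
3     a[i] = b[i];
4 }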
Hint: sometimes it may be desirable to disable the restrict keyword. In order to avoid editing code to do
that, it is useful to define a compiler macro RESTRICT and set it to the value “restrict” or to an empty
value, depending on the purpose. In the code, the macro RESTRICT should be used instead of the word
restrict. This is illustrated in Listing 3.24.
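One possible form of such a macro (the guard macro name USE_RESTRICT is an assumption; the value could equally be supplied on the compiler command line with -D):

1 #ifdef USE_RESTRICT
2 #define RESTRICT restrict
3 #else
4 #define RESTRICT
5 #endif
6
7 void copy(float* RESTRICT a, float* RESTRICT b, const int n) {
8   for (int i = 0; i < n; i++)
9     a[i] = b[i];
10 }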
The following list contains some compiler pragmas that may be useful for tuning vectorized code
performance. Details can be found in Intel C++ compiler reference [28]. In the PDF version of this document,
the items in the list below are hyperlinks pointing to the corresponding articles in the compiler reference.
• #pragma simd
Used to guide the compiler to automatically vectorize more loops (e.g., some outer loops). Arguments
of this pragma can guide the compiler in cases when automatic vectorization is difficult. It is also
possible to improve vectorization efficiency by specifying expected runtime parameters of the loop in
the clauses of this pragma.
• #pragma vector always
Instructs the compiler to implement automatic vectorization of the loop following this pragma, even
if heuristic analysis suggests otherwise, or if non-unit stride or unaligned accesses make vectorization
inefficient.
• #pragma vector aligned | unaligned
Instructs the compiler to always use aligned or unaligned data movement instructions. Useful, for
instance, when the developer guarantees data alignment. In this case, placing #pragma vector
aligned before the loop eliminates unnecessary run-time checks for data alignment, which improves
performance.
• #pragma ivdep
Instructs the compiler to ignore vector dependence, which increases the likelihood of automatic loop
vectorization. See Section 3.1.9 for more information. The keyword restrict can often help to
achieve a similar result.
• restrict qualifier and -restrict command-line argument
This keyword qualifies a pointer as restricted, i.e., the developer using the restrict keyword guaran-
tees to the compiler that in the scope of this pointer’s visibility, it points to data which is not referenced
by any other pointer. Qualifying function arguments with the restrict keyword helps in the elimi-
nation of assumed vector dependencies. The restrict keyword in the code must be enabled by the
-restrict compiler argument. See Section 3.1.9 for more detail.
• #pragma loop count
Informs the compiler of the number of loop iterations anticipated at runtime. This helps the auto-
vectorizer to make more accurate predictions regarding the optimal vectorization strategy.
• __assume_aligned keyword
Helps to eliminate runtime alignment checks when data is guaranteed to be properly aligned. This
keyword produces an effect similar to that of #pragma vector aligned, but provides more
granular control, as __assume_aligned applies to an individual array that participates in the
calculation, and not to the whole loop.
• -vec-report[n]
This compiler argument indicates the level of verbosity of the automatic vectorizer. -vec-report3
provides the most verbose report, including vectorized and non-vectorized loops and any proven or
assumed vector dependencies.
• -O[n]
Optimization level, defaults to -O2. Automatic vectorization is enabled with -O2 and higher optimization levels.
• -x[code]
Instructs the compiler to target specific processor features, including instruction sets and optimizations.
For example, to generate AVX code, -xAVX can be used; for SSE2, -xSSE2. Using -xhost targets
the instruction set of the processor on which the compilation is performed.
Each 512-bit vector register of the Intel Xeon Phi coprocessor can hold up to eight 64-bit
elements (long integers or double precision floating-point numbers) or up to sixteen 32-bit elements (integers
or single precision floating-point numbers). For use with intrinsic functions, these registers are represented
by three data types declared in the header file immintrin.h: __m512 (single precision floating-point
vector), __m512i (32- or 64-bit integer vector) and __m512d (double precision floating-point vector).
Most instructions operate on three arguments: either two source registers with a separate destination register,
or three source registers, one of which is also a destination.
For each operation, two types of instructions are available: unmasked and masked. Unmasked instructions
apply the requested operation to all elements of the vector registers. Masked instructions apply the operation
to some of the elements and preserve the value of other elements in the output register. The set of elements
that must be modified in the output registers is controlled by an additional argument of type __mmask16 or
__mmask8. This is a short integer value, in which bits set to 1 or 0 indicate that the corresponding output
elements should be modified or preserved by the masked operation using this bitmask.
The classes of available IMCI instructions are outlined in the list below, illustrated with calls to the
respective intrinsic functions.
Initialization instructions are used to fill a 512-bit vector register with one or multiple values of scalar
elements. Example:
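A sketch of such an initialization using the broadcast intrinsic _mm512_set1_ps (the variable name is illustrative):

1 __m512 myvec = _mm512_set1_ps(3.14f);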
The above example creates a 512-bit short vector of sixteen SP floating-point numbers and initializes all
sixteen elements to a value of 3.14f.
Load and store instructions copy a contiguous 512-bits chunk of data from a memory location to the vector
register (load) or from the vector register to a memory location (store). The address from/to which the
copying takes place must be 64-byte aligned. Additional versions of these instructions operate only on
the high or low 64 bits of the vector. Example:
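A sketch of such a load (the names myarr and myvec are those used in the description below):

1 __m512 myvec = _mm512_load_ps(&myarr[32]);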
In this example, elements 32 through 47 of array myarr are loaded into the vector register assigned to
variable myvec.
Gather and scatter instructions are used to copy non-contiguous data from memory to vector registers
(gather), or from vector registers to memory (scatter). This type of instructions is unique to the Intel
Xeon Phi architecture and is not available in Intel Xeon processors. The memory access pattern must
have a power of 2 stride (1, 2, 4, 8, . . . elements). The copying of data can be done simultaneously
with type conversion. It is also possible to specify prefetching from memory to cache for this type of
operation. Example:
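One possible way to express the scatter described below; the index-vector form and the exact intrinsics are assumptions, as the original example may have been written differently:

1 __m512i index = _mm512_set_epi32(60, 56, 52, 48, 44, 40, 36, 32,
2                                  28, 24, 20, 16, 12,  8,  4,  0); // element offsets 0, 4, ..., 60
3 _mm512_i32scatter_epi32(myarr, index, myvec, sizeof(int));        // stride of 4 elements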
The above code scatters the values in the integer short vector myvec to array myarr, starting with index
0 and using a stride of 4. That is, elements 0, 1, 2, . . . , 15 of the short vector myvec will be copied to array
elements myarr[0], myarr[4], myarr[8], . . . , myarr[60], respectively.
Arithmetic instructions are the core of high performance calculations. The list below illustrates the scope of
these instructions.
a) Addition, subtraction and multiplication are available for all data types supported in the IMCI. It is
possible to specify the rounding method for floating-point operations. Example:
1 __m512 c = _mm512_mul_ps(a, b);
b) Fused Multiply-Add instruction (FMA) is the basis of several operations in linear algebra, including
matrix multiplication. A single FMA instruction can multiply the elements
of vectors v1 and v2 and add the result to vector v3. The FMA instruction is currently only
supported by the Intel Xeon Phi architecture, and there is no FMA support in today’s Intel Xeon
processors. The latency and throughput of FMA are comparable to those of an individual addition or an
individual multiplication instruction, and therefore it is always preferable to use FMA instead of
separate addition and multiplication where possible. It is possible to specify the rounding method
for floating-point operations. Example:
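A sketch using the fused multiply-add intrinsic (vector names follow the text; the exact intrinsic used in the original example is assumed):

1 __m512 v3 = _mm512_fmadd_ps(v1, v2, v3); // v3 = v1*v2 + v3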
c) Division and transcendental function implementations are available in the Intel Short Vector Math
Library (SVML). The following transcendental operations are supported:
- Division and reciprocal calculation;
- Error function;
- Inverse error function;
- Exponential functions (natural, base 2 and base 10) and the power function. Base 2 exponential
is the fastest implementation;
- Logarithms (natural, base 2 and base 10). Base 2 logarithm is the fastest implementation;
- Square root, inverse square root, hypotenuse value and cubic root;
- Trigonometric functions (sin, cos, tan, sinh, cosh, tanh, asin, acos, atan);
- Rounding functions
The following example calculates the reciprocal square root of each element of vector y:
1 __m512 x = _mm512_invsqrt_ps(y);
Swizzle and permute instructions rearrange (shuffle) scalar elements in a vector register. For these operations
it is convenient to think of a 512-bit register as a set of four 128-bit blocks. The swizzle operation
rearranges elements within each 128-bit block, and the permute operation rearranges the 128-bit
blocks in the register according to the pattern specified by the user. These instructions can be used in
combination with another intrinsic, which saves processor cycles. Example:
In this example, the swizzle operation with the pattern DCAB is applied to the 512-bit SP floating-point
vector myv2, and then this vector, swizzled, is added to another vector of the same type, myv1.
Comparison instructions perform element-wise comparison between two 512-bit vectors and return a bit-
mask value with bits set to 0 or 1 depending on the result of the respective comparison. Example:
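A sketch of a less-than comparison producing a 16-bit mask (names follow the description below; the intrinsic chosen here is an assumption):

1 __mmask16 result = _mm512_cmplt_ps_mask(x, y);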
The above code compares vectors x and y and returns the bitmask result where bits are set to 1 if the
corresponding element in x is less than the corresponding element in y.
Conversion and type cast instructions perform conversion from single to double precision and from double
to single precision floating-point numbers, from floating-point numbers to integers and from integers to
floating-point numbers.
Bitwise instructions perform bit-wise AND, ANDNOT, OR and XOR operations on elements in 512-bit short
vectors.
Reduction and minimum/maximum instructions allow the calculation of the sum of all elements in a vector,
the product of all elements in a vector, and the evaluation of the minimum or maximum of all elements
in a vector.
Vector mask instructions allow setting the values of type __mmask16 and __mmask8 and performing bitwise
operations on them. Masks can be used in all IMCI instructions to control which of the elements in the
resulting vector are modified, and which are preserved in an operation. Bitmasked operations are an
efficient way to implement conditional (branching) behavior in vectorized calculations.
3.2.1 About OpenMP and Intel R Cilk Plus
OpenMP and Cilk Plus have the same scope of application to parallel algorithms and similar functionality.
The choice between OpenMP and Cilk Plus as the parallelization method may be dictated either by convenience,
or by performance considerations. It is often easy enough to implement the code with both frameworks and
compare the performance. In general, trivial and highly parallel algorithms should run equally well in any of
these two parallel frameworks. For complex algorithms with nested parallelism and heterogeneous tasks,
the following considerations apply:
• Intel Cilk Plus generally provides good performance “out of the box”, but offers little freedom for
fine-tuning. With this framework, the programmer should focus on exposing the parallelism in the
application rather than optimizing low-level aspects such as thread creation, work distribution and data
sharing.
• OpenMP may require more tuning to perform well, however, it allows more control over scheduling and
work distribution.
Additionally, Intel OpenMP and Intel Cilk Plus libraries can be used side by side in the same code
without conflicts. In case of nested parallelism, it is preferable to use Cilk Plus parallel regions inside OpenMP
parallel regions rather than the other way around.

Program Structure with OpenMP

An OpenMP application starts as a single (master) thread. When a #pragma omp parallel construct is
encountered, the master thread forks a team of threads that execute the parallel region; at the end of the region,
only the master thread continues to execute the code following the parallel construct. The other threads in the
team enter a wait state until they are needed to form another team.
Listing 3.25 illustrates the structure of applications with OpenMP constructs, and provides comments
explaining each construct or section.
13 #pragma omp for nowait // Begin a work-sharing construct
14 for(...)
15 { // Each iteration chunk is a unit of work.
16   ... // Work is distributed among the team members.
17 } // End of work-sharing construct.
18 ... // nowait was specified, so threads proceed.
19 #pragma omp critical // Begin a critical section.
20 {...} // Only one thread executes at a time.
21 #pragma omp task // Execute in another thread without blocking this thread.
22 {...}
23 ... // This code is executed by each team member.
27 } // End of the parallel construct.
28 ... // Possibly more parallel constructs.
Listing 3.25: The following example illustrates the execution model of an application with OpenMP constructs. Credit:
1. Code outside #pragma omp parallel is serial, i.e., executed by only one thread
2. Code directly inside #pragma omp parallel is executed by each thread of the team
3. Code inside work-sharing construct #pragma omp for is distributed across the threads in the team
In order to compile a C++ program with OpenMP pragmas using the Intel C++ Compiler the programmer
must specify the compiler argument -openmp. Without this argument, the code will still compile, but all
code will be executed with only one thread. In order to make certain functions and variables of the OpenMP
library available, #include <omp.h> must be used at the beginning of the code.
Program Structure with Intel R Cilk Plus
Intel Cilk Plus is an emerging standard currently supported by GCC 4.7 and the Intel C++ Compiler.
Its functionality and scope of application are similar to those of OpenMP. There are only three keywords in
the Cilk Plus standard: _Cilk_for, _Cilk_spawn, and _Cilk_sync. Programming for Intel Xeon Phi
coprocessors may also require keywords _Cilk_shared and _Cilk_offload. However, these keywords
allow the implementation of a variety of parallel algorithms. Language extensions such as array notation, hyperobjects,
elemental functions and #pragma simd are also a part of Intel Cilk Plus. Unlike OpenMP, the Cilk Plus
standard guarantees that serialized code will produce the same results as parallel code, if the program has a
deterministic behavior. Last, but not least, Intel Cilk Plus is designed to seamlessly integrate vectorization and
thread-parallelism in applications using this framework.
1 void foo() {
2   ... // Executed by a single worker
3   _Cilk_spawn foo(...) { // Nested parallelism:
4     ... // Executed by a separate worker without blocking this function
5   }
6   _Cilk_sync; // Wait for all tasks spawned from this function to complete
7 }
8
9 void bar() {
10  _Cilk_for(...) { // May be nested inside another parallel region
11    ... // Distribute workload across all available workers
12  }
13 }
14
15 int main() {
16   ... // Only one worker executes
17   _Cilk_spawn foo(...); // Fork: foo() executes in another worker
18
19   ...
20
21   _Cilk_sync; // Wait until all jobs spawned from this function complete
22 }
In order to make certain functions of Intel Cilk Plus available, the programmer must use #include
<cilk/cilk.h>.
The nature of Intel Cilk Plus keywords and semantics preserves the serial nature of codes. The lack
of locks in the code is compensated by the availability of hyperobjects, which facilitate and motivate more
scalable parallel algorithms.
Intel Cilk Plus uses an efficient scheduling algorithm based on “work stealing”, which may be more
efficient than OpenMP in complex multi-program applications.
3.2.2 “Hello World” OpenMP and Intel R Cilk Plus Programs
A sample OpenMP program and its compilation are shown in Listing 3.27.
1 #include <omp.h>
2 #include <stdio.h>
3
4 int main(){
5 const int nt=omp_get_max_threads();
6 printf("OpenMP with %d threads\n", nt);
7
8 #pragma omp parallel
9 printf("Hello World from thread %d\n", omp_get_thread_num());
10 }
OpenMP with 5 threads
Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
Hello World from thread 2
Hello World from thread 4
user@host$
user@host$ icpc -openmp-stubs hello_omp.cc
hello_omp.cc(8): warning #161: unrecognized #pragma
user@host$ ./a.out
user@host$
Listing 3.27:
Top: Hello World program in OpenMP. Note the inclusion of the header file omp.h. Parallel execution is requested via
#pragma omp parallel.
Bottom: Compiling the Hello World program with OpenMP. Intel Compilers flag -openmp links the OpenMP runtime
library for parallel execution, -openmp-stubs serializes the program. Environment variable OMP_NUM_THREADS
controls the default number of threads spawned by #pragma omp parallel. By default, the number of threads is set
to the number of cores (or hyper-threads) in the system.
A sample Intel Cilk Plus program and its compilation are shown in Listing 3.28.
1 #include <cilk/cilk.h>
2 #include <stdio.h>
3
4 int main(){
5 const int nw=__cilkrts_get_nworkers();
6 printf("Cilk Plus with %d workers.\n", nw);
7
8 _Cilk_for (int i=0; i<nw; i++) // Light workload: gets serialized
9 printf("Hello World from worker %d\n", __cilkrts_get_worker_number());
10
11 _Cilk_for (int i=0; i<nw; i++) {
12 double f=1.0;
13 while (f<1.0e40) f*=2.0; // Extra workload: gets parallelized
14 printf("Hello Again from worker %d (%f)\n", __cilkrts_get_worker_number(), f);
15 }
16 }
user@host$ export CILK_NWORKERS=5
user@host$ icpc hello_cilk.cc
user@host$ ./a.out
Cilk Plus with 5 workers.
Hello World from worker 0
Hello World from worker 0
Listing 3.28:
Top: Hello World program in Intel Cilk Plus. Note the inclusion of the header file cilk.h to enable Intel Cilk Plus
constructs. Two parallel loops are included to demonstrate dynamic (i.e., determined at runtime) scheduling of loop
iterations.
Bottom: Compiling the Hello World program with Intel Cilk Plus. No compiler flags are necessary to enable Intel Cilk Plus;
however, the flag -cilk-serialize can be used to disable parallelism in Intel Cilk Plus constructs. Environment
variable CILK_NWORKERS controls the default number of Intel Cilk Plus workers.
3.2.3 Loop-Centric Parallelism: For-Loops in OpenMP and Intel R Cilk Plus
A significant number of HPC tasks are centered around for-loops with pre-determined loop bounds and a
constant increment of the loop iterator. Such loops can be easily parallelized in shared-memory systems using
#pragma omp parallel for in OpenMP or the statement _Cilk_for in Intel Cilk Plus. Additional
arguments control how loop iterations are distributed across available threads or workers.
Figure 3.1 illustrates the workflow of a loop parallelized in shared memory using OpenMP or Intel Cilk
Plus.
Figure 3.1: Workflow of a parallel loop (loop iterations distributed across threads or workers along the program flow).
As the above figure illustrates, the execution of a parallel loop is initiated by a single thread. When the
loop starts, multiple threads (in the case of OpenMP) or workers (for Intel Cilk Plus) are spawned, and each
thread gets a portion of the loop iteration space (called “chunk” in the terminology of OpenMP) to process.
When a thread (with OpenMP) or worker (with Intel Cilk Plus) has completed its initial task, it receives from
the scheduler (with OpenMP) or steals from another worker (with Intel Cilk Plus) another chunk to process.
It is possible to instruct parallelization libraries to choose the chunk size dynamically, starting with large and
progressing to smaller chunks as the job is nearing completion. This way, load balance across threads or
workers is maintained without a significant scheduling overhead.
Code samples illustrating the usage of OpenMP and Intel Cilk Plus language constructs to parallelize
loops follow.
For-Loops in OpenMP
With OpenMP, #pragma omp parallel for must be placed before the loop to request its parallelization, as shown in Listing 3.29.
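A minimal sketch of such a loop (the loop bounds and body are placeholders):

1 #pragma omp parallel for
2 for (int i = 0; i < n; i++) {
3   // ... iterations are distributed across the OpenMP threads
4 }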
Listing 3.29: The OpenMP library will distribute the iterations of the loop following the #pragma omp parallel
for across threads.
Alternatively, it is possible to call a parallelized loop by placing #pragma omp for nested inside a
#pragma omp parallel construct, as demonstrated in Listing 3.30.
1 #pragma omp parallel
2 {
3   // Code placed here will be executed by all threads.
4   // Stack variables declared here will be private to each thread.
5   int private_number=0;
6   #pragma omp for
7   for (int i = 0; i < n; i++)
8     private_number++; // iterations are distributed across the team
9 }
Listing 3.30: When placing #pragma omp for closely nested inside a #pragma omp parallel region, there
yP
should be no word “parallel” before the word “for”. Thread synchronization is implied at the beginning and end of the
for-loop.
Stack variables declared inside the parallel context or in the body of the loop will be available only on
the thread processing these variables. Variables visible in the scope in which the loop is launched are available
to all threads, and therefore must be protected from race conditions. In order to efficiently share data between
loop iterations with OpenMP, the reduction clause or locks must be used, as described in Section 3.2.5.
If a parallel loop has fewer iterations than the number of available OpenMP threads, then all iterations
will start immediately with one iteration per thread. For parallel loops with more iterations than OpenMP
threads, the run-time library will divide the iterations between threads. In each thread, iterations assigned to it
will be executed sequentially, i.e., the number of simultaneously processed iterations will never be greater than
the number of threads. By default, OpenMP sets the maximum number of threads to be equal to the number of
logical cores in the system.
Depending on the scheduling mode requested by the user, iteration assignment to threads can be either
done before the start of the loop, or it can be decided dynamically. It is possible to tune the performance of
for-loops in OpenMP by specifying the scheduling mode (static, dynamic or guided) and the granularity of
work distribution, known as chunk size.
static : with this scheduling mode, OpenMP evenly distributes loop iterations across threads before the loop
begins. This scheduling method has the smallest parallelization overhead, because no communication
between threads is performed at runtime for scheduling purposes. The downside of this method is that it
may result in load imbalance, if threads complete their iterations at different rates.
dynamic : with this scheduling mode, OpenMP will distribute some fraction of the iteration space across
threads before the loop begins. As threads complete their iterations, they are assigned more work, until
all the work is completed. This method has a greater overhead, but may improve load balance across
threads.
guided : this method is similar to dynamic, except that the granularity of work assignment to threads
decreases as the work nears completion. This method has even greater overhead than dynamic, but
may result in higher overall performance due to better load balancing.
The chunk size controls the minimum number of iterations that are assigned to each thread at any given
scheduling step (except the last one). With small chunk size, dynamic and guided have the potential to
achieve better load balance at the cost of performing more scheduling work. With greater chunk size, the
scheduling overhead is reduced, but load imbalance may be increased. Typically, the optimal chunk size must
be chosen by the programmer empirically.
There are two ways to request the scheduling method for a loop. The first is to set the
environment variable OMP_SCHEDULE in order to control the execution of the whole application:
user@host% export OMP_SCHEDULE="dynamic,4"
user@host% ./my_application
Listing 3.31: Controlling run-time scheduling of parallel loops with an environment variable. The format of the
value of OMP_SCHEDULE is “mode[,chunk_size]”, where mode is one of: static, dynamic, guided, and
chunk_size is an integer.
The second is to indicate the scheduling method in the clauses of #pragma omp for. This method
provides finer control, but less freedom to modify program behavior after compilation. Listing 3.32 illustrates
that method:
1 #pragma omp parallel for schedule(dynamic, 4)
2 for (int i = 0; i < n; i++) {
3   // ...
4 }
Listing 3.32: Controlling the run-time scheduling of a parallel loop with clauses of #pragma omp for.
For-Loops in Intel R Cilk Plus
In Intel Cilk Plus, a parallel for-loop is created as shown in Listing 3.33.
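A minimal sketch of such a loop (bounds and body are placeholders):

1 _Cilk_for (int i = 0; i < n; i++) {
2   // ... iterations are distributed across the available Cilk Plus workers
3 }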
Listing 3.33: The Intel Cilk Plus library will distribute the iterations of the loop following the _Cilk_for statement across the available workers.
Stack variables declared in the body of the loop will be available only on the worker processing these
variables. Variables visible in the scope in which the loop is launched are available to all strands, and therefore
must be protected from race conditions. In order to efficiently share data between Intel Cilk Plus workers,
hyperobjects must be used, as described in Section 3.2.5.
Just like with OpenMP, the run-time Intel Cilk Plus library will process loop iterations in parallel. The
total iteration space will be divided into chunks, each of which will be executed serially by one of the Intel
Cilk Plus workers. By default, the maximum number of workers in Intel Cilk Plus is equal to the number of
logical cores in the system. The number of workers actually used at runtime is dependent on the amount of
work in the loop, and may be smaller than the maximum. This behavior is different from OpenMP, as OpenMP
by default spawns a pre-determined number of threads, regardless of the amount of work in the loop.
Similarly to OpenMP, Intel Cilk Plus allows the user to control the work sharing algorithm in for-loops by
setting the granularity of work distribution. This is done with #pragma cilk grainsize, as illustrated
in Listing 3.34.

1 #pragma cilk grainsize = 4
2 _Cilk_for (int i = 0; i < n; i++) {
3   // ...
4 }

Listing 3.34: Setting the granularity of work distribution for a _Cilk_for loop with #pragma cilk grainsize (the grain size value is chosen arbitrarily for illustration).
The value of grainsize is the minimum number of iterations assigned to any worker in one scheduling
step. Like with OpenMP, the choice of grainsize is a compromise between load balancing and the overhead of
scheduling. The default value of grainsize chosen by Intel Cilk Plus works well enough in many cases.
Unlike OpenMP, Intel Cilk Plus has only one mode of scheduling, called work stealing. Work stealing,
depending on the nature of the calculation, may be more or less efficient than OpenMP scheduling methods.
Figure 3.2: The fork-join model of parallel execution (legend: elemental function, fork, join).
Figure 3.2 illustrates the progress of a parallel code employing the fork-join model. Note that the number
of parallel tasks can far exceed the physical number of cores in the computing platform. The order of execution
of parallel tasks on the available cores is generally not the same as the order in which the tasks were spawned.
A new feature of the OpenMP 3.0 standard, supported by the Intel OpenMP Library, is #pragma omp
task. This pragma allows the creation of a task that is executed in parallel with the current scope. A very
large number of tasks can be spawned; however, they will not oversubscribe the system, because the runtime
library will start task execution only when a thread becomes available.
Listing 3.35 illustrates the usage of the OpenMP task pragma to create a parallel recursive algorithm.
1 #include <omp.h>
2 #include <stdio.h>
3
4 void Recurse(const int task) {
5   if (task < 10) {
6     printf("Creating task %d...", task+1);
7     #pragma omp task
8     {
9       Recurse(task+1);
10    }
11    long foo=0; for (long i = 0; i < (1<<20); i++) foo+=i;
12    printf("result of task %d in thread %d is %ld\n", task, omp_get_thread_num(), foo);
13  }
14 }
15
16 int main() {
17  #pragma omp parallel
18  {
19    #pragma omp single
20    Recurse(0);
21  }
22 }
Listing 3.35: Source code omptask.cc demonstrating the use of #pragma omp task to effect the fork-join parallel
algorithm.
This code calls the function Recurse(), which forks off recursive calls to itself, requesting that those
recursive calls be run in parallel, without any synchronization of the caller function to its forks. The for-loop
in the code is used only to make the tasks perform some arithmetic work, so that we can see the pattern of task
creation and execution.
Note that #pragma omp task occurs inside of a parallel region, however, parallel execution is
initially restricted to only one thread with #pragma omp single. This is a necessary condition for
parallel tasking. Without #pragma omp parallel, all tasks will be executed by a single thread. Without
#pragma omp single, multiple threads will start task number 0, which is not the desired behavior.
Listing 3.36 demonstrates the execution pattern of this code with four threads.
One can see that the code forked off as many jobs as there were available threads (in this case, four), and
the creation of other jobs had to wait until one of the threads became free.
It is also informative to see the difference between the parallel execution pattern and the serial execution.
In order to run the code serially, we can set the maximum number of OpenMP threads to 1, as shown in
Listing 3.37.
user@host% export OMP_NUM_THREADS=1
user@host% ./omptask
Creating task 1...Creating task 2...Creating task 3...Creating task 4...Creating task 5.
..Creating task 6...Creating task 7...Creating task 8...Creating task 9...Creating task
user@host%
Listing 3.37: Running omptask.cc from Listing 3.35 with a single OpenMP thread.
Evidently, in the serial version, the execution recursed into the deepest level before returning to the
calling function. This is the behavior that one would expect from this code if it was stripped of all OpenMP
pragmas.
Fork-Join Model in Intel R Cilk Plus: Spawning
In Intel Cilk Plus, the fork-join model is effected by the keyword _Cilk_spawn. This keyword must
be placed before the function that is forked off, and the function will then be executed in parallel with the
current scope. Listing 3.38 demonstrates the same program as Listing 3.35, but now in the Intel Cilk Plus
framework.
1 #include <stdio.h>
2 #include <cilk/cilk.h>
3
4 void Recurse(const int task) {
5   if (task < 10) {
6     printf("Creating task %d...", task+1);
7     _Cilk_spawn Recurse(task+1);
8     long foo=0; for (long i = 0; i < (1L<<20L); i++) foo+=i;
9     printf("result of task %d in worker %d is %ld\n", task,
10      __cilkrts_get_worker_number(), foo);
11  }
12 }
13
14 int main() {
15  Recurse(0);
16 }
Listing 3.38: Source code cilkspawn.cc demonstrating the use of _Cilk_spawn to effect the fork-join parallel
algorithm.
No additional compiler arguments are required to compile cilkspawn.cc. Listing 3.39 demonstrates the result.
user@host% ./cilkspawn
Creating task 1...Creating task 2...Creating task 3...Creating task 4...Creating task 5.
..Creating task 6...Creating task 7...Creating task 8...Creating task 9...Creating task
10...result of task 9 in worker 0 is 549755289600
Listing 3.39: Compiling and running cilkspawn.cc from Listing 3.38 with four Intel Cilk Plus workers.
Unlike the OpenMP code omptask.cc, this code parallelized with Intel Cilk Plus spawned all tasks
and queued them for pick-up by workers. After that, the code proceeded to run the tasks, as workers employed
work stealing to balance the load.
In OpenMP parallel regions and loops, multiple threads have access to variables that had been declared
before the parallel region was started. Consider the example in Listing 3.40.
1 #include <omp.h>
2 #include <stdio.h>
3
4 int main() {
5 int someVariable = 5;
6 #pragma omp parallel
7 {
8 printf("For thread %d, someVariable=%d\n", omp_get_thread_num(), someVariable);
9 }
10 }
user@host% icpc -o omp-shared omp-shared.cc -openmp
user@host% export OMP_NUM_THREADS=4
user@host% ./omp-shared
For thread 0, someVariable=5
For thread 2, someVariable=5
For thread 1, someVariable=5
For thread 3, someVariable=5
user@host%
Listing 3.40: Code omp-shared.cc illustrating the use of shared variables in OpenMP.
In omp-shared.cc, all threads execute the code inside of #pragma omp parallel. All of these
threads have access to variable someVariable declared before the parallel region. By default, all variables
declared before the parallel region are shared between threads. This means that (a) all threads see the value of
shared variables, and (b) if one thread writes to the shared variable, all other threads see the modified value.
The latter case may lead to race conditions and unpredictable behavior, unless the write operation is protected
with synchronization.
In some cases, it is preferable to have a variable of private nature, i.e., have an independent copy of this
variable in each thread. In order to effect this behavior, the programmer may declare this variable inside the
parallel region as shown in Listing 3.41. Naturally, the programmer can initialize the value of this private
variable with the value of a shared variable.
1 int varPrivate = 3;
2 #pragma omp parallel
3 {
4 int varPrivateLocal = varPrivate; // Each thread will have a copy of varPrivateLocal
5 // ...
6 #pragma omp for
7 for (int i = 0; i < N; i++) {
8 int varTemporary = varPrivateLocal;
9 }
10 }
Listing 3.41: Variables declared outside the OpenMP parallel region are shared, variables declared inside are private.
In the code in Listing 3.41, an independent copy of varPrivateLocal is available in each thread.
This variable persists throughout the parallel region. Similarly, an independent copy of varTemporary will
exist in each thread. The value of this variable persists for the duration of a single loop iteration, but does not
persist across loop iterations.
There is an additional way to provide to each thread a private copy of some of the variables declared
before the parallel region. This can be done by using clauses private and firstprivate in #pragma
omp parallel as shown in Listing 3.42. With clause private,
a) the variable is private to each thread,
b) the initial value of a private variable is undefined, and
c) the value of the variable in the encompassing scope does not change at the end of the parallel region.
Clause firstprivate is similar to private, but the initial value is initialized with the value outside of
the parallel region.
1 #include <omp.h>
2 #include <stdio.h>
3
4 int main() {
5   int varShared = 5;
6   int varPrivate = 1;
7   int varFirstprivate = 2;
8   #pragma omp parallel private(varPrivate) firstprivate(varFirstprivate)
9   {
10    if (omp_get_thread_num() == 0) {
11      // varPrivate starts undefined in each thread; varFirstprivate is initialized to 2
12      // ...
13    }
14  }
15 }
Listing 3.42: Code omp-private.cc illustrating the use of the private and firstprivate clauses in OpenMP.
Note that in C++, clauses private and firstprivate duplicate the functionality of scope-local
variables demonstrated in Listing 3.41. However in Fortran, the user must declare all variables at the beginning
of the function, and therefore there is no way to avoid using the clauses private, firstprivate and
lastprivate.
Another type of private variable behavior in OpenMP is effected by clause lastprivate, which
applies only to #pragma omp parallel for. For lastprivate variables, the value computed in the last
iteration of the loop is copied to the variable in the enclosing scope when the loop completes.
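A sketch of the kind of construct that the caption and explanation below refer to (the clause lists and loop body are assumptions based on that explanation):

1 #pragma omp parallel for default(none) shared(a, b, c) lastprivate(d, e)
2 for (int i = 0; i < 1000; i++) {
3   d = a[i] * b[i]; // d and e keep the values from the last iteration after the loop
4   e = d + c[i];
5 }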
Listing 3.43: Using clause default to request that all variables declared outside the OpenMP parallel region are not
visible within the region.
In the above code, variables a, b, c and i will be shared, variables d and e will be lastprivate,
and all other variables will be none, thus not visible for the parallel region. With default(none), if the
programmer forgets to specify the sharing type for any of the variables used in the loop, the compilation will
fail — this behavior may be desirable in complex cases for explicit variable behavior check.
Variable Sharing in Intel R Cilk Plus
In Intel Cilk Plus, there is no additional pragma-like control over shared or private nature of variables.
All variables declared before _Cilk_for are shared, and all variables declared inside the loop are only
visible to the strand executing the iteration, and exist for the duration of the loop iteration. There is no
native way to declare a variable that persists in a given worker throughout the parallel loop, like variable
varPrivateLocal in Listing 3.41. The syntax of Intel Cilk Plus intentionally prohibits the user from
assigning a variable to a worker rather than to a chunk of data. Instead of doing this, developers must design
their algorithm to use hyperobjects such as reducers and holders, as discussed in Section 3.2.7.
1 #include <omp.h>
2 #include <stdio.h>
3
4 int main() {
5   const int n = 1000;
6   int sum = 0;
7   #pragma omp parallel for
8   for (int i = 0; i < n; i++) {
9     // Race condition
10    sum = sum + i;
11  }
12  printf("sum=%d (must be %d)\n", sum, ((n-1)*n)/2);
13 }
user@host% ./omp-race
sum=208112 (must be 499500)
user@host%
Listing 3.44: Code omp-race.cc has unpredictable behavior and produces incorrect results due to a race condition in line 10.
In line 10 of code omp-race.cc in Listing 3.44, a situation known as a race condition occurs. The
problem is that variable sum is shared between all threads, and therefore more than one thread may execute this
line concurrently. If two threads simultaneously execute line 10, both will have to read, increment and write
the updated value of sum. However, what if one thread updates sum while another thread is still incrementing
the old value of sum? This may, and will, lead to an incorrect calculation. Indeed, the output shows a value of
sum=208112 instead of 499500. Moreover, if we run this code multiple times, every time the result will be
different, because the pattern of races between threads will vary from run to run. The parallel program has a
non-predictable behavior! How does one resolve this problem?
The easiest, yet the most inefficient way to protect a portion of a parallel code from concurrent execution
in OpenMP is a critical section, as illustrated in Listing 3.45. #pragma omp critical used in this code
protects the code inside its scope from concurrent execution. The whole iteration space will still be executed
by the code in parallel, but only one thread at a time will be allowed to enter the critical section, while other
threads wait their turn. At this stage in the training we are not concerned with performance, but let us note that
this is a very inefficient way to resolve the race condition in the problem shown in Listing 3.44. We provide
this example because in some cases, a critical section is the only way to avoid unpredictable behavior.
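A sketch of the loop with the critical section applied (based on the race-condition example above):

1 #pragma omp parallel for
2 for (int i = 0; i < n; i++) {
3   #pragma omp critical
4   {
5     sum = sum + i; // only one thread at a time executes this block
6   }
7 }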
Listing 3.45: Parallel fragment of code omp-critical.cc has predictable behavior, because the race condition was
eliminated with a critical section.
Synchronization in OpenMP: Atomic Operations
A more efficient method of synchronization, albeit limited to certain operations, is the use of atomic operations. Atomic operations allow the program to safely update a scalar variable in a parallel context. These operations are effected with #pragma omp atomic, as shown in Listing 3.46.
#pragma omp parallel for
for (int i = 0; i < n; i++) {
  // Lightweight synchronization
  #pragma omp atomic
  sum += i;
}

Listing 3.46: This parallel fragment of code omp-critical.cc has predictable behavior, because the race condition was eliminated with an atomic operation. Note that for this specific example, atomic operations are not the most efficient solution.
Capture: operations of the form v = x++, v = x--, v = ++x, v = --x, v = x binop expr. Here x and v are scalar variables, and binop is one of +, *, -, /, &, ^, |, <<, >>. No “trickery” is allowed in atomic operations: no operator overloading, no non-scalar types, no complex expressions.
In many cases, atomic operations are an adequate solution for accessing and modifying shared data.
However, in this particular case, the parallel scalability of the algorithm may be further improved by using
reducers instead of atomic operations, as discussed in Section 3.2.7.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  const int N=1000;
  int* A = (int*)malloc(N*sizeof(int));
  for (int i = 0; i < N; i++) A[i]=i;
#pragma omp parallel
  {
#pragma omp single
    {
      // Compute the sum in two threads
      int sum1=0, sum2=0;
#pragma omp task shared(A, N, sum1)
      {
        for (int i = 0; i < N/2; i++)
          sum1 += A[i];
      }
#pragma omp task shared(A, N, sum2)
      {
        for (int i = N/2; i < N; i++)
          sum2 += A[i];
      }

      // Wait until both tasks have completed before using sum1 and sum2
#pragma omp taskwait

      printf("Result=%d (must be %d)\n", sum1+sum2, ((N-1)*N)/2);
    }
  }
  free(A);
}

user@host% ./omptaskwait
Result=499500 (must be 499500)
user@host%

Listing 3.47: Code omp-taskwait.cc illustrates the usage of #pragma omp taskwait.
The code in Listing 3.47 is an inefficient way to approach the problem, because it uses only two tasks. A better way to perform parallel reduction is described in Section 3.2.7. Nevertheless, for scalable fork-join parallel algorithms, #pragma omp taskwait is the native way in OpenMP to implement synchronization points.
#include <stdio.h>
#include <stdlib.h>

void Sum(const int* A, const int start, const int end, int & result) {
  for (int i = start; i < end; i++)
    result += A[i];
}

int main() {
  const int N=1000;
  int* A = (int*)malloc(N*sizeof(int));
  for (int i = 0; i < N; i++) A[i]=i;

  // Compute the sum with two tasks
  int sum1=0, sum2=0;

  _Cilk_spawn Sum(A, 0, N/2, sum1);
  _Cilk_spawn Sum(A, N/2, N, sum2);

  // Wait until both spawned tasks have completed
  _Cilk_sync;

  printf("Result=%d (must be %d)\n", sum1+sum2, ((N-1)*N)/2);

  free(A);
}
Just as with OpenMP, this is an inefficient way to implement parallel reduction, and a better method is described in Section 3.2.7.

Implicit Synchronization in OpenMP and Intel Cilk Plus
In addition to synchronization methods described above, OpenMP and Intel Cilk Plus contain implicit
synchronization points at the beginning and end of parallel loops and parallel regions (in OpenMP only). This
means that code execution does not proceed until all iterations of the parallel loop have been performed, or
until the last statement of the parallel region has been executed in every thread.
Some parallel algorithms that require synchronization only to modify a common quantity can be expressed
in terms of reduction. This possibility arises if the operation with which the common quantity is calculated
is associative (such as integer addition or multiplication) or approximately associative (such as floating-
point addition or multiplication), i.e., when the order of operations does not affect the result. OpenMP has
reduction clauses for parallel pragmas, and Intel Cilk Plus has specialized variables called reducers in
order to effect reduction. It is also possible to instrument a reduction algorithm using private variables and
minimal synchronization. Properly instrumented parallel reduction avoids excessive synchronization and
communication, which improves the parallel scalability and, therefore, the application performance.
In OpenMP, for-loops can automatically perform reduction for certain operations on scalar variables. Listing 3.49 illustrates the algorithm shown in Listing 3.44, Listing 3.45 and Listing 3.46 instrumented using the OpenMP reduction clause:
#include <omp.h>
#include <stdio.h>

int main() {
  const int n = 1000;
  int sum = 0;
#pragma omp parallel for reduction(+: sum)
  for (int i = 0; i < n; i++) {
    sum = sum + i;
  }
  printf("sum=%d (must be %d)\n", sum, ((n-1)*n)/2);
}

user@host% ./omp-reduction
sum=499500 (must be 499500)
user@host%

Listing 3.49: Code omp-reduction.cc has the race condition eliminated with a reduction clause.
The syntax of the reduction clause is reduction(operator:variables), where operator is one of +, *, -, &, |, ^, &&, ||, max or min, and variables is a comma-separated list of variables to which these operations are applied.
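For instance, the same clause can drive the other reduction operations from this list; a minimal sketch (the array vals and the variable maxVal are illustrative names, not from the book's code samples) of a maximum reduction:

#include <omp.h>
#include <stdio.h>

int main() {
  const int n = 1000;
  float vals[1000];
  for (int i = 0; i < n; i++) vals[i] = (float)((i*37) % n);

  float maxVal = -1.0f;  // initial value lower than any element
  // Each thread tracks its own maximum; OpenMP combines them at the end
  #pragma omp parallel for reduction(max: maxVal)
  for (int i = 0; i < n; i++)
    if (vals[i] > maxVal) maxVal = vals[i];

  printf("maxVal=%f\n", maxVal);
}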
It is possible to implement reduction for other operations and other types of variables using private variables and a critical section or an atomic operation after the loop. This, in fact, is what happens behind the scenes when the reduction clause is specified in OpenMP. With this method, each thread must have a private variable of the same type as the global reduction variable. In each thread, reduction is performed into that private variable without synchronization with other threads. At the end of the loop, a critical section or an atomic operation is used to reduce the private variables from each thread into the global variable. The principle of this method is shown in Listing 3.50.
#include <omp.h>
#include <stdio.h>

int main() {
  const int n = 1000;
  int sum = 0;
#pragma omp parallel
  {
    int sum_th = 0;
#pragma omp for
    for (int i = 0; i < n; i++) {
      sum_th = sum_th + i;
    }

#pragma omp atomic
    sum += sum_th;

  }
  printf("sum=%d (must be %d)\n", sum, ((n-1)*n)/2);
}

Listing 3.50: Code omp-reduction2.cc implements reduction using private variables and a minimum of synchronization.
The specific example in Listing 3.50 could also be implemented with a critical section instead of an atomic operation. The solution with a critical section is not optimal in this case; however, it may be necessary in other cases, when the reduction operation is not supported by atomics, or the data type of the reduction variable is not supported by the reduction clause. For example, a C++ container class as a reduction variable is not eligible for the OpenMP reduction clause or for atomic operations. However, reduction into a C++ container class can be done using a critical section.
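A minimal sketch of this idea (assuming the goal is to merge thread-local results into a shared std::vector; the variable names found and foundLocal are illustrative):

#include <cstdio>
#include <vector>

int main() {
  const int n = 1000;
  std::vector<int> found;          // shared container, not reducible by OpenMP clauses
  #pragma omp parallel
  {
    std::vector<int> foundLocal;   // private partial result of each thread
    #pragma omp for
    for (int i = 0; i < n; i++)
      if (i % 37 == 0) foundLocal.push_back(i);

    // Merge the private containers into the shared one, one thread at a time
    #pragma omp critical
    found.insert(found.end(), foundLocal.begin(), foundLocal.end());
  }
  printf("found %lu elements\n", (unsigned long)found.size());
}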
Reducers in Intel Cilk Plus
Intel Cilk Plus, compared to OpenMP, allows the user less fine-grained control over synchronization, but
makes up for it with versatile support for hyperobjects: reducers and holders. The restricted lexicon of the
Intel Cilk Plus framework encourages the programmer to employ efficient parallel algorithms, which avoid
excessive synchronization and exhibit high parallel scalability. In addition, lexical restrictions of Intel Cilk
Plus enforce serial semantics and ensure that the serialized version of the code will produce the same results as the
parallel version.
Reducers are variables that hold shared data, yet these variables can be safely used by multiple strands
of a parallel code. At runtime, each worker operates on its own private copy of the data, which reduces
synchronization and communication between workers.
Let us now implement the example shown in Listing 3.49 with Intel Cilk Plus. Listing 3.51 demonstrates the parallel sum reduction algorithm using a reducer.
#include <cilk/reducer_opadd.h>
#include <stdio.h>

int main() {
  const int n = 1000;
  cilk::reducer_opadd<int> sum;
  sum.set_value(0);
  _Cilk_for (int i = 0; i < n; i++) {
    sum += i;
  }
  printf("sum=%d (must be %d)\n", sum.get_value(), ((n-1)*n)/2);
}
Note the following about this code:
a) A header file corresponding to the specific reducer must be included. In this case, it is cilk/reducer_opadd.h.
b) The reducer sum is declared before the parallel loop as an object of the templated class cilk::reducer_opadd<int>.
c) Inside the parallel region, the reducer sum is used just like a regular variable of type int, except that only one operation with it is allowed: +=.
d) Outside the parallel region, the reducer can only be used via accessors and mutators (in this case, get_value() and set_value()).
The power of reducers in Intel Cilk Plus is greatly enhanced by support for user-defined reducers. This
procedure is described in the Intel C++ Compiler reference [29]. However, for many applications, the set of reducers provided with Intel Cilk Plus may be sufficient.
The list of reducers supported by Intel Cilk Plus is shown below. Reducer names are self-explanatory, and additional information can be found in the Intel C++ Compiler reference [30].
reducer_ostream in <cilk/reducer_ostream.h> supports the operation <<.
reducer_basic_string in <cilk/reducer_string.h> supports the operation += to create a string. reducer_string and reducer_wstring are shorthands for reducer_basic_string for the types char and wchar_t, respectively.
Holders in Intel Cilk Plus
Holders in Intel Cilk Plus are hyperobjects that allow thread-safe read/write accesses to common data. Holders are similar to reducers, with the exception that they do not support synchronization at the end of the parallel region. This makes it possible to implement holders with a single C++ template class called cilk::holder. The role of holders in Intel Cilk Plus is similar to the role of private variables in OpenMP declared in the same way as the variable sum_th in Listing 3.50. However, holders provide additional functionality in fork-join codes. Namely, the view of a holder upon the first spawned child of a function (or the first child spawned after a sync) is the same as upon the entry to the function, even if a different worker is executing the child. This functionality allows holders to be used as a replacement for argument passing. Unlike a truly shared variable, a holder has an undetermined state in some cases (in spawned children after the first one and in an arbitrary iteration of a _Cilk_for loop), because each strand manipulates its private view of the holder.
Listing 3.52 and Listing 3.54 demonstrate the use of a holder as a private variable. In Listing 3.52, in the _Cilk_for loop, a separate copy of the variable scratch is constructed for each iteration. If the cost of the constructor ScratchType() is too high, then using a holder as shown in Listing 3.54 may improve efficiency. When ScratchType is wrapped in the template class cilk::holder, the constructor of ScratchType() will be called only once for each worker. At the same time, the view of the variable scratch is undetermined in an arbitrary iteration of the loop.
#include <stdio.h>
#include <cilk/holder.h>
const int N = 100000;
class ScratchType {
public:
  int data[N];
  ScratchType(){printf("Constructor called by worker %d\n",
                       __cilkrts_get_worker_number());}
};
int main(){
  _Cilk_for (int i = 0; i < 10; i++) {
    ScratchType scratch;
    scratch.data[0:N] = i;
    int sum = 0;
    for (int j = 0; j < N; j++) sum += scratch.data[j];
    printf("i=%d, worker=%d, sum=%d\n", i, __cilkrts_get_worker_number(), sum);
  }
}

Listing 3.52: Source cilk-noholder.cc demonstrates using a private variable for intermediate calculations in a _Cilk_for loop.
user@host% icpc -o cilk-noholder cilk-noholder.cc
user@host% export CILK_NWORKERS=2
user@host% ./cilk-noholder

Listing 3.53: Compiling and running code cilk-noholder.cc from Listing 3.52. Note that the constructor of ScratchType() is called for every loop iteration.
#include <stdio.h>
#include <cilk/holder.h>
const int N = 100000;
class ScratchType {
public:
  int data[N];
  ScratchType(){printf("Constructor called by worker %d\n",
                       __cilkrts_get_worker_number());}
};
int main(){
  cilk::holder<ScratchType> scratch;
  _Cilk_for (int i = 0; i < 10; i++) {
    scratch().data[0:N] = i; // Operator () is an accessor to data in a holder
    int sum = 0;
    for (int j = 0; j < N; j++) sum += scratch().data[j];
    printf("i=%d, worker=%d, sum=%d\n", i, __cilkrts_get_worker_number(), sum);
  }
}

Listing 3.54: Source cilk-holder.cc demonstrates holder usage in Intel Cilk Plus. Listing 3.55 demonstrates that this code may yield better efficiency than the code without holders in Listing 3.52.
user@host% icpc -o cilk-holder cilk-holder.cc

Listing 3.55: Compiling and running cilk-holder.cc. Note that the constructor of ScratchType() is called once for every worker, but not once for every iteration. If the cost of the constructor is high, this code may provide better efficiency than the code in Listing 3.52.
1) A dry, but comprehensive, description of OpenMP can be found in the OpenMP specifications at the OpenMP Architecture Review Board Web site http://openmp.org/wp/openmp-specifications/ [31]
2) A detailed OpenMP tutorial from Blaise Barney of Lawrence Livermore National Laboratory is available at https://computing.llnl.gov/tutorials/openMP/ [32]
3) Intel Cilk Plus pages in the Intel C++ Compiler reference provide details and examples for programming with this parallel framework [25]
4) The Intel Threading Building Blocks (TBB) project is another powerful parallel framework and library: http://threadingbuildingblocks.org [33]. This product has an open-source implementation.
5) Intel Array Building Blocks (ArBB) is a high-level library for parallel data processing [34].
6) The book “Intel Xeon Phi Coprocessor High Performance Programming” by Jim Jeffers and James Reinders.
7) The book “Structured Parallel Programming: Patterns for Efficient Computation” by Michael McCool, Arch D. Robinson and James Reinders [37] is a developer’s guide to patterns for high-performance parallel programming (see also the Web site of the book [38]). The book discusses fundamental parallel algorithms.
8) The book “Parallel Programming in C with MPI and OpenMP” by Michael J. Quinn [39] is full of examples of high performance applications implemented in OpenMP and MPI.
At this point, the reader is familiar with data parallelism in Intel Xeon family processors (SIMD instructions) and with task parallelism in multi- and many-core systems (multiple threads operating in shared memory). The next level of parallelism is scaling an application across multiple compute nodes in distributed memory. The most commonly used framework for distributed-memory HPC calculations is the Message Passing Interface (MPI). This section discusses expressing parallelism with MPI.
MPI is a communication protocol. It allows multiple processes, which do not share common memory but reside on a common network, to perform parallel calculations, communicating with each other by passing messages. MPI messages are arrays of predefined and user-defined data types. The purposes of MPI messages range from task scheduling to exchanging large amounts of data necessary to perform the calculation. MPI guarantees that the order of sent messages is preserved on the receiver side. The MPI protocol also provides error control. However, the developer is responsible for communication fairness control, as well as for task scheduling and computational load balancing.
Originally, in the era of single-core compute nodes, the dominant MPI usage model in clusters was to run one MPI process per physical machine. With the advent of multi-core, multi-socket, and now heterogeneous many-core systems, the range of usage models of MPI has grown (see Figures 3.3 through 3.6):
a) It is possible to run one MPI process per compute node, exploiting parallelism in each machine with a shared-memory parallel framework, such as OpenMP or Intel Cilk Plus (see Section 3.2). Figure 3.3 illustrates this approach.
Figure 3.3: Hybrid MPI and OpenMP parallelism diagram. One multi-threaded MPI process per node.
b) Alternatively, one single-threaded MPI process can run on each physical or logical core of each machine in the cluster. In this case, MPI processes running on one compute node still do not share memory address space. However, message passing between these processes is more efficient, because fast virtual fabrics can be used for communication. This approach is illustrated in Figure 3.4.
Figure 3.4: Pure MPI parallelism diagram. One single-threaded MPI process per core.
c) Another option is to run multiple multi-threaded MPI processes per compute node, as shown in Figure 3.5. In this case, each process exploits parallelism in shared memory, and MPI communication between processes adds distributed-memory parallelism. This hybrid approach may yield optimum performance for applications with a high frequency or large volume of communication.
Figure 3.5: Hybrid MPI and OpenMP parallelism. Several multi-threaded MPI processes per node.
d) In heterogeneous clusters with Intel Xeon Phi coprocessors, MPI programmers have a choice of running MPI processes on hosts and coprocessors natively (as in cases a, b and c), or running MPI processes only on the hosts and offloading parts of the calculation to the coprocessors, as shown in Figure 3.6.
Figure 3.6: Hybrid MPI and OpenMP parallelism diagram with offload from hosts to coprocessors.
Multiple implementations of MPI have been developed since the protocol’s inception in 1991. In this training, we will be using the Intel MPI library version 4.1, which implements the MPI version 2.2 specification. Intel MPI has native support for Intel Xeon Phi coprocessors, integrates with Intel software development tools, and operates with a variety of interconnect fabrics.
Each process of an MPI application is assigned a unique identifier called the MPI rank. MPI ranks are integers that begin at 0 and increase contiguously. Using these ranks, processes can coordinate execution and identify their role in the application even before they exchange any messages. It is also possible to launch multiple executables on different hosts as a part of a single application. For complex applications, processes can be bundled into communicators and groups.
A “Hello World” MPI application was demonstrated in Chapter 2 in Section 2.1 and Section 2.4. Listing 3.56 schematically demonstrates the structure of all MPI applications in C++.
1 #include "mpi.h"
2
el
4
us
7
8 MyErrorLogger("...");
9 MPI_Abort(MPI_COMM_WORLD, ret);
10 }
11
12 int worldSize, myRank, myNameLength;
13 char myName[MPI_MAX_PROCESSOR_NAME];
14 MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
15 MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
16 MPI_Get_processor_name(myName, &myNameLength);
17
18 // Perform work
19 // Exchange messages with MPI_Send, MPI_Recv, etc.
20 // ...
21
22 // Terminate MPI environment
23 MPI_Finalize();
24 }
• Header file #include <mpi.h> is required for all programs that make Intel MPI library calls.
• Intel MPI calls begin with MPI_.
• The MPI portion of the program begins with a call to MPI_Init and ends with MPI_Finalize.
• Communicators can be used to address a substructure of MPI processes, and the default communicator MPI_COMM_WORLD includes all current MPI processes.
• Each process within a communicator identifies itself with a rank, which can be queried by calling the function MPI_Comm_rank.
• The number of processes in the given communicator can be queried with MPI_Comm_size.
• Using the ranks and the world size, it is possible to distribute roles between processes in an application even before any messages are exchanged (see the sketch after this list).
• Most MPI routines return an error code. The default MPI behavior is to abort program execution if there was an error.
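As a brief illustration of the last-but-one point (a minimal sketch, not one of the book's numbered listings; the placeholder work is a partial sum distributed round-robin by rank):

#include <mpi.h>
#include <stdio.h>

int main (int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int worldSize, myRank;
  MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
  MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

  const int n = 100;
  double partial = 0.0;
  // Each rank processes every worldSize-th work item; no messages are
  // needed to decide which process does what.
  for (int i = myRank; i < n; i += worldSize)
    partial += (double)i;

  printf("Rank %d of %d computed partial sum %f\n", myRank, worldSize, partial);
  MPI_Finalize();
}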
Example

Listing 3.57 demonstrates the use of blocking point-to-point communication routines. In this code, multiple “worker” processes report to the “master” process with rank equal to 0.
#include <mpi.h>
#include <stdio.h>
int main (int argc, char *argv[]) {
  int i, rank, size, namelen;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Status stat;
  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name (name, &namelen);

  if (rank == 0) {
    printf ("I am the master process, rank %d of %d running on %s\n", rank, size, name);
    // Blocking receive operations in the master process
    for (i = 1; i < size; i++) {
      MPI_Recv (&rank, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
      MPI_Recv (&namelen, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
      MPI_Recv (name, namelen+1, MPI_CHAR, i, 1, MPI_COMM_WORLD, &stat);
      printf ("Received a report from rank %d running on %s\n", rank, name);
    }
  } else {
    // Blocking send operations in all other processes
    MPI_Send (&rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    MPI_Send (&namelen, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    MPI_Send (name, namelen+1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
  }
  MPI_Finalize ();
}
The program shown in Listing 3.57 uses two functions that are new in our discussion: MPI_Send and MPI_Recv. These functions, respectively, send and receive a message. MPI_Send and MPI_Recv and their variations (read on) are the basis of MPI.
MPI_Recv (&buf, count, datatype, source, tag, comm, &status) is a basic blocking receive operation. It posts the intent to receive a message and blocks (i.e., waits) until the requested message is received into the receive buffer buf.
MPI_Send (&buf, count, datatype, dest, tag, comm) is a basic blocking send operation. It sends the message contained in the send buffer buf, and blocks until it is safe to re-use the send buffer.
Here and elsewhere, the meaning and type of the common parameters are:

void* buf — Address of the application (send or receive) buffer holding the message data.
int count — Number of data elements in the buffer.
MPI_Datatype datatype — Indicates the type of data elements in the buffer. Table 3.5 lists predefined MPI data types.
int dest — Rank of the process to which the message is sent.
int source — Rank of the process from which a message is received. The special wild card value MPI_ANY_SOURCE allows a message to be received from any sender.
int tag — Arbitrary non-negative integer used to match a send operation with a receive operation. The wild card MPI_ANY_TAG makes the matching wider (any tag is accepted).
MPI_Comm comm — Communicator to which the sender and the receiver belong, e.g., MPI_COMM_WORLD.
MPI_Status* status — Pointer to a structure containing the source, the tag and the length of the received message. In order to access the length from status, the function MPI_Get_count must be used.
The data types supported by MPI are shown in Table 3.5. Note that user-defined data types can also be created in MPI.
MPI data type          Size (bytes)
MPI_DOUBLE             8
MPI_LONG_DOUBLE        16
MPI_CHARACTER          1
MPI_LOGICAL            4
MPI_INTEGER            4
MPI_REAL               4
MPI_REAL4              4
MPI_REAL8              8
MPI_REAL16             16
MPI_DOUBLE_PRECISION   8
MPI_COMPLEX            2*4
MPI_DOUBLE_COMPLEX     2*8

Table 3.5: Data types required and recommended by the MPI standard. For a list of the types available in a specific MPI implementation, refer to that implementation’s documentation.
[Figure: point-to-point message passing, with sender tasks 0 and 1 passing data to receiver task 2.]
Functions MPI_Recv and MPI_Send are easy to use, and they provide message passing functionality that is sufficient for many real-world HPC applications. In that sense, the discussion of parallelism in MPI could be terminated at this point. However, we will discuss additional topics of MPI in the rest of Section 3.3, for the following reasons:
1. Buffering is a system-level functionality of MPI that enables significant optimization of communication efficiency. The use of buffering may require additional efforts to prevent errors.
2. Non-blocking send and receive operations can be used to overlap computation and communication.
3. Collective communication is helpful for certain parallel patterns, including reduction. Reduction is built into MPI in the form of dedicated functions.
Finally, in this chapter we do not touch on the issues of performance with MPI. This, along with other performance tuning questions, is left for Chapter 4 (Section 4.7).
Terminology: Application (Send) Buffer, System Buffer and User Space Buffer
Historically, the word “buffer” in the context of MPI is used in multiple terms with very different meanings. It is important to understand the difference between these terms for the future discussion.
a) Application buffer collectively refers to send buffers and receive buffers. This is a memory region in the user application which holds the data of the sent or received message. In Table 3.4, the variable void *buf represents either the send or the receive buffer. In the code in Listing 3.57, the role of the send and receive application buffers is played by the variables rank, namelen and name.
b) System buffer is a memory space managed by the MPI runtime library, which is used to hold messages that are pending for transmission on the sender side, or for reception by the application on the receiver side. The purpose of the system buffer is to enable asynchronous communication. The system buffer is not visible to the programmer. System buffers in MPI may exist on both the sender side and the receiver side. The standard functions MPI_Send and MPI_Recv typically use system-level buffers provided and managed by the MPI runtime library.
c) User space buffer plays the same role as the system buffer: it can temporarily store messages to enable asynchronous communication. However, this special buffer space is allocated and managed by the user and attached to (and detached from) the MPI library with the functions MPI_Buffer_attach and MPI_Buffer_detach.
In this discussion, we will be using the terms synchronous and asynchronous communication modes and the terms blocking and non-blocking operations. In MPI, these pairs of terms are not synonymous. It may further add to the confusion that the meaning of synchronous and asynchronous in MPI is different from that in the offload programming model for Intel Xeon Phi coprocessors. Let us clarify these terms before discussing specific MPI send and receive functions.
a) Synchronous communication means that the sender must wait until the corresponding receive request is posted by the receiver. After a “handshake” between the sender and receiver occurs, the message is passed without buffering. This mode is more deterministic and uses less memory than asynchronous communication, but at the cost of the time lost for waiting.
b) Asynchronous communication in the case of sending means that the sender does not have to wait for the receiver to be ready. The sender may put the message into the system buffer (either on the sender, or on the receiver side) or into the user space buffer, and return.
a) Blocking send functions pause execution until it is safe to modify the current send buffer. Blocking receive functions wait until the message is fetched into the receive buffer.
b) Non-blocking send functions return immediately and execute the transmission “in the background”. Non-blocking receive functions only post the intent to receive a message, but do not pause execution. It is not safe to re-use or modify the send buffer before ensuring that a non-blocking send operation has completed. Likewise, it is unsafe to read from the receive buffer before ensuring that a non-blocking receive operation has completed. In order to ensure that a non-blocking operation is complete, each non-blocking MPI function must have a corresponding MPI_Wait function call.
Blocking and non-blocking functions exist in synchronous as well as asynchronous flavors.
Explanation of Communication Modes
In order to better illustrate synchronous and asynchronous, blocking and non-blocking, and ready mode functions, consider this real-world analogy. Suppose the sender (let us call her Sierra) wants to communicate to the receiver (let us call him Romeo) the time and place of their lunch meeting. The following situations are equivalent to the various communication modes in MPI:
1) Blocking asynchronous send: Sierra dials Romeo’s telephone number and leaves a message on Romeo’s answering machine. Sierra does not return to her activities until she has left the message. This reflects the blocking nature of this transaction. At the same time, after the transaction is complete, there is no guarantee that Romeo has personally received the message. This reflects the asynchronous nature of the transaction. The answering machine plays the role of a receiver-side system buffer in this case.
2) Blocking synchronous send: Sierra keeps dialing Romeo’s telephone number until Romeo personally picks up the phone. This transaction is blocking, because Sierra cannot return to her other activities until she speaks to Romeo. This transaction is synchronous because at the end of the transaction, Romeo is guaranteed to have received the message personally.
3) Non-blocking asynchronous send: Sierra tells her assistant to call Romeo and leave a message on his answering machine. Sierra returns to her other activities immediately, so this transaction is non-blocking. Another property of non-blocking transactions: Sierra must wait for her assistant to finish with this task before assigning him another task. Her assistant does not have to reach Romeo personally; leaving the message on the answering machine is satisfactory in this case, so this transaction is asynchronous.
4) Non-blocking synchronous send: Sierra tells her assistant to call Romeo, and to make sure to talk to him personally, and not to his answering machine. Sierra can do other things while her assistant works on transmitting the message, so this is a non-blocking transaction. This non-blocking transaction is synchronous, because after the assistant has finished with this task, Romeo is sure to have received the message.
5) Blocking ready mode send: Romeo is already on hold on Sierra’s phone line when Sierra picks up the phone. Sierra returns to her other activities only after she has transmitted her message (blocking transaction).
6) Non-blocking ready mode send: Romeo is already on hold on Sierra’s phone line, but she re-directs him to her assistant. Sierra returns to her other activities immediately, so this is non-blocking communication.
MPI_Bsend, MPI_Buffer_attach, MPI_Buffer_detach. Blocking asynchronous send operation with a user space buffer; returns when it is safe to re-use the send buffer; the user space buffer must be allocated with MPI_Buffer_attach prior to calling MPI_Bsend. Typical use: asynchronous blocking communication when the system buffer is inefficient, prone to overflows, or is not used by MPI_Send.
MPI_Ssend. Blocking synchronous send operation; not buffered; returns when it is safe to re-use the send buffer. Typical use: (a) when the message must be received before the function returns, or (b) to eliminate the memory overhead of system or user space buffers.
MPI_Rsend. Blocking ready mode send operation; synchronous or asynchronous depending on the MPI implementation and runtime conditions; returns when it is safe to re-use the send buffer; assumes that the matching receive has already been posted, error otherwise. Typical use: codes with fine-grained communication, to improve performance. It is the programmer’s responsibility to ensure that matching receives post before MPI_Rsend.
MPI_Recv. Blocking receive operation. Can be paired with any send operation.
MPI_Isend. Non-blocking send operation; synchronous or asynchronous depending on the MPI implementation and runtime conditions; MPI_Wait must be called prior to re-using the send buffer. Typical use: the default non-blocking send operation, used to overlap communication and computation between MPI_Isend and MPI_Wait.
MPI_Ibsend, MPI_Buffer_attach, MPI_Buffer_detach. Non-blocking asynchronous send operation with a user space buffer; MPI_Wait must be called prior to re-using the send buffer; the user space buffer must be allocated with MPI_Buffer_attach prior to calling MPI_Ibsend. Typical use: potentially the most efficient send method; asynchronous and non-blocking, it allows computation and communication to overlap. See also the comment for MPI_Bsend.
MPI_Issend. Non-blocking synchronous send operation; not buffered; MPI_Wait must be called prior to re-using the send buffer. Typical use: overlapping communication and computation while eliminating the memory overhead of buffering.
MPI_Irsend. Non-blocking ready send operation; synchronous or asynchronous depending on the MPI implementation and runtime conditions; MPI_Wait must be called prior to re-using the send buffer; assumes that the matching receive has already been posted, error otherwise. Typical use: instead of MPI_Isend in codes with fine-grained communication, to improve performance by eliminating “handshakes”.
MPI_Irecv. Non-blocking receive operation; MPI_Wait must be called prior to using the receive buffer. Typical use: can be paired with any send operation; used to overlap computation and communication between MPI_Irecv and MPI_Wait.
MPI_Wait, MPI_Waitall, MPI_Waitany, MPI_Waitsome. Block execution until one or more matching non-blocking send or receive operations return; after that, it is safe to re-use the send buffer or to use the receive buffer. Every non-blocking operation must have a matching MPI_Wait.

Table 3.6: Basic functions for MPI communication. Details may be found at [40] or by clicking function names.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
  const int M = 100000, N = 200000;
  float data1[M]; data1[:]=1.0f;
  double data2[N]; data2[:]=2.0;
  int myRank, worldSize, size1, size2;

  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &worldSize);
  MPI_Comm_rank (MPI_COMM_WORLD, &myRank);

  if (worldSize > 1) {
    if (myRank == 0) {
      // Compute the required size of the user space buffer
      MPI_Pack_size(M, MPI_FLOAT, MPI_COMM_WORLD, &size1);
      MPI_Pack_size(N, MPI_DOUBLE, MPI_COMM_WORLD, &size2);
      int bufsize = size1 + size2 + 2*MPI_BSEND_OVERHEAD;
      void* buffer = malloc(bufsize);

      MPI_Buffer_attach(buffer, bufsize);
      MPI_Bsend(data1, M, MPI_FLOAT, 1, 1, MPI_COMM_WORLD);
      MPI_Bsend(data2, N, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
      MPI_Buffer_detach(&buffer, &bufsize);
      free(buffer);
    } else if (myRank == 1) {
      // Matching blocking receives on the second process
      MPI_Status stat;
      MPI_Recv(data1, M, MPI_FLOAT, 0, 1, MPI_COMM_WORLD, &stat);
      MPI_Recv(data2, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &stat);
    }
  }
  MPI_Finalize();
}

Listing 3.58: Source code mpi-buffered.cc illustrates blocking asynchronous transactions with a user space buffer.
Note that the required size of the buffer is calculated using the MPI function MPI_Pack_size, and a
constant MPI_BSEND_OVERHEAD is added to the buffer size.
In order to overlap communication and computation, MPI provides non-blocking send and receive
functions MPI_Isend and MPI_Irecv. Non-blocking functions return immediately, and the code can
perform other operations while communication proceeds. However, for non-blocking send operations, it is
unsafe to re-use the send buffer until the blocking function MPI_Wait is called. Likewise, for non-blocking
receive operations, it is unsafe to assume that the message was delivered to the receive buffer until MPI_Wait
is called.
Note that calling MPI_Wait immediately after MPI_Isend or MPI_Irecv is equivalent to using
MPI_Send or MPI_Recv. The purpose of non-blocking functions is to enable the code to perform some
additional operations while the message is in transit.
Listing 3.59 demonstrates the use of a non-blocking send operation.
#include <mpi.h>
#include <stdio.h>

int main (int argc, char *argv[]) {
  const int N = 100000, tag=1;
  float data1[N], data2[N]; data1[:]=0.0f;
  int myRank, worldSize;

  MPI_Request request;
  MPI_Status stat;

  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &worldSize);
  MPI_Comm_rank (MPI_COMM_WORLD, &myRank);

  if (worldSize > 1) {
    if (myRank == 0) {
      // Post a non-blocking send of data1 to rank 1...
      MPI_Isend(data1, N, MPI_FLOAT, 1, tag, MPI_COMM_WORLD, &request);
      // ...and overlap the transmission with computation on data2
      for (int i = 0; i < N; i++)
        data2[i] = 1.0f;
      // It is only safe to re-use data1 after MPI_Wait returns
      MPI_Wait(&request, &stat);
    } else if (myRank == 1) {
      MPI_Recv(data1, N, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &stat);
    }
  }
  MPI_Finalize();
}
Note that when the communicating processes use a network interconnect (e.g., Ethernet or InfiniBand) as the physical network fabric, communication does not stall calculations. In this case, non-blocking communication
may improve performance by masking communication time. However, if the sender and receiver are executing
on the same host, their communication is essentially a memory copy. In this case, non-blocking communication
may be detrimental to performance, because communication and computation will compete for resources,
resulting in undesirable contention.
[Figure: collective communication patterns in MPI: a broadcast from one sender to all receivers, scatter and gather of data between one task and the group, and a reduction collecting data into one receiver.]
Function              Effect
MPI_Barrier           Performs group barrier synchronization. Upon reaching the MPI_Barrier call, each task is blocked until all tasks in the group reach the same MPI_Barrier call.
MPI_Bcast             Broadcasts (i.e., sends) a message from one process to all other processes in the group.
MPI_Scatter           Distributes distinct messages from a single source task to each task in the group.
MPI_Gather            Gathers distinct messages from each task in the group into a single destination task.
MPI_Allgather         For each task, performs a one-to-all broadcasting operation within the group.
MPI_Reduce            Applies a reduction operation on all tasks in the group and places the result in one task. Predefined MPI reduction operations are summarized in Table 3.8.
MPI_Allreduce         Applies a reduction operation and places the result in all tasks in the group. This is equivalent to MPI_Reduce followed by MPI_Bcast.
MPI_Reduce_scatter    Performs an element-wise reduction on a vector across all tasks in the group. The resulting vector is split into disjoint segments and distributed across the tasks. This is equivalent to MPI_Reduce followed by an MPI_Scatter operation.
MPI_Alltoall          Each task in the group performs a scatter operation, sending a distinct message to all the tasks in the group, ordered by index.
MPI_Scan              Performs a scan operation with respect to a reduction operation across a task group.

Table 3.7: Collective communication functions in MPI. Details may be found at [40] or by clicking function names.
1 #include "mpi.h"
2 #include <stdio.h>
3 #define SIZE 4
4
5 int main(int argc, char *argv[]) {
6 int numtasks, rank, sendcount, recvcount, source;
7 float sendbuf[SIZE][SIZE] = {
8 {1.0, 2.0, 3.0, 4.0},
9 {5.0, 6.0, 7.0, 8.0},
10 {9.0, 10.0, 11.0, 12.0},
11 {13.0, 14.0, 15.0, 16.0}};
12 float recvbuf[SIZE];
13
14 MPI_Init(&argc,&argv);
g
15 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
an
16 MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
W
17
18 if (numtasks == SIZE) {
ng
19 source = 1;
e
20 sendcount = SIZE;
21 recvcount = SIZE;
nh
MPI_Scatter(sendbuf,sendcount,MPI_FLOAT,recvbuf,recvcount,
Yu
22
23 MPI_FLOAT,source,MPI_COMM_WORLD);
r
25 recvbuf[1],recvbuf[2],recvbuf[3]);
}
d
26
re
27 else
pa
30 MPI_Finalize();
yP
31 }
el
iv
Listing 3.60: Source code mpi-scatter.cc demonstrates one of Intel MPI collective communication operations: a
us
This code demonstrates how the function MPI_Scatter is used to distribute (i.e., scatter) the rows of a matrix from one source process to all other processes. When the application is executed with 4 processes, each rank prints the row of sendbuf that it received.
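Complementing the scatter example, the following minimal sketch (not one of the book's numbered listings; the partial values are illustrative) shows the reduction collective MPI_Reduce from Table 3.7 summing one value from every rank into rank 0:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int rank, numtasks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

  // Each rank contributes one partial value...
  float partial = (float)(rank + 1);
  float total = 0.0f;
  // ...and MPI_SUM combines them into 'total' on rank 0 only.
  MPI_Reduce(&partial, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("Sum over %d ranks: %f\n", numtasks, total);
  MPI_Finalize();
}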
1) The official source of all information on MPI is the MPI Forum Web page http://www.mpi-forum.org/ [41]
5) The ANL Web site also features a reference of all MPI routines, which includes syntax and specification. This reference can be found at http://www.mcs.anl.gov/research/projects/mpi/www/ [40]. All hyperlinks attached to MPI function names in this book point to that document.
6) A reference guide for Intel MPI version 4.1 can be found at [6]. This manual contains information about the specifics of Intel’s implementation of MPI.
7) A popular book on MPI and its interoperation with OpenMP is “Parallel Programming in C with MPI and OpenMP” by Michael J. Quinn [39]
Chapter 4 of this document contains more MPI examples and discusses the performance analysis and optimization of MPI applications.
Chapter 4
This chapter delves deeper into the issues of extracting performance from parallel applications on Intel Xeon processors and Intel Xeon Phi coprocessors and provides practical examples of high performance codes. It builds on the skills and methods introduced in Chapter 3.

4.1 Roadmap to Optimal Code on Intel Xeon Phi Coprocessors
The Intel Xeon Phi coprocessor is a massively parallel vector processor, and optimization strategies for it are qualitatively the same as those for Intel Xeon processors: SIMD support (vectorization), thread parallelism, and high arithmetic intensity or a streaming memory access pattern. However, these requirements are quantitatively more strict for Intel Xeon Phi calculations: the code must be able to utilize wider vectors, support a greater number of threads, and the penalty for non-local memory access is greater on the coprocessor.
In general, in order to expect better performance from an Intel Xeon Phi coprocessor than from the host system, the developer should be able to answer “yes” to the following questions:
Has scalar optimization been performed? Some applications can be improved by consistently employing single precision floating-point arithmetic instead of double precision or a mix of the two, using array notation instead of pointer arithmetic, removing unnecessary type conversions and eliminating redundant operations. Section 4.2 discusses these optimizations.
Does the code vectorize? Not only should the compiler report indicate that the automatic vectorization of performance-critical loops has succeeded; in addition, the programmer must enforce a unit-stride memory access pattern and proper data alignment, and eliminate type conversions in vector calculations. See Section 4.3 for details.
Does the application scale above 100 threads? Some applications designed for earlier generation processors may be serial or only support 2 to 4 threads. While this is sufficient to extract significant additional performance from an Intel Xeon processor, these applications will not show satisfactory performance on Intel Xeon Phi coprocessors. Massive parallelism is required to fully utilize the coprocessor, because it derives its performance from the concurrent work of over 50 cores with four-way hyper-threading and low clock speeds. Even if an application can utilize hundreds of threads, thread contention due to various effects can quench performance. Methods for improving the parallel scalability of task-parallel calculations are described in Section 4.4.
Is it arithmetically intensive or bandwidth-limited? Applications that are not optimized for data locality in space and time, and programs with complex memory access patterns, may exhibit better performance on the host system than on the Intel Xeon Phi coprocessor. If complex memory access is an inherent property of the algorithm, it may be possible to re-structure the data to pack memory accesses more closely. Some algorithms can be modified to better utilize the cache hierarchy (for compute-bound calculations) or to improve memory streaming capabilities (for bandwidth-bound calculations) with techniques such as loop tiling and cache-oblivious algorithms. These optimizations are described in Section 4.5.
Is cooperation between the host and the coprocessor(s) efficient? When an application utilizes more than one coprocessor or more than one compute node, load balancing across compute units and the efficiency of data communication between the host system(s) and coprocessor(s) become an issue. Load balancing across compute units can be tuned using specialized Intel software development tools. In addition, it may be possible to reduce the communication overhead by utilizing improved algorithms and/or optimizing data marshaling policies. These optimizations are described in Section 4.6 and Section 4.7.
4.1.2 Expectations
It is often the case that an application providing satisfactory performance on Intel Xeon processors initially performs poorly on Intel Xeon Phi coprocessors. This does not necessarily mean that the problem is not “MIC-friendly”. Intel Xeon processors have a resource-rich architecture with hardware prefetchers, branch predictors, deep pipelines and high clock speeds of the cores, which can compensate for sub-optimal aspects of a variety of workloads. At the same time, on Intel Xeon Phi coprocessors, the same sub-optimal behaviors of unoptimized codes are widely exposed by the resource-efficient MIC architecture. The good news is that, generally, optimization for Intel Xeon processors leads to performance benefits for Intel Xeon Phi coprocessors, and vice versa.
Optimization methods described in this section yield performance benefits for both the many-core and the multi-core architecture. In the best-case scenario, a single Intel Xeon Phi coprocessor is capable of outperforming a two-socket Intel Xeon processor system by a factor of 2x (for linear algebra and other arithmetics-bound codes) to approximately 5x. This estimate is based on the theoretical limits of the arithmetic capabilities and memory bandwidth of a single Intel Xeon Phi coprocessor compared to those of a two-way Intel Xeon processor-based host system with the Sandy Bridge microarchitecture. However, applications that do not achieve this speedup can still benefit from an Intel Xeon Phi coprocessor in the system, because the available programming models allow the host and the coprocessor to team up using asynchronous offload or heterogeneous MPI.
Note that the best-case speedup of 2x-5x is lower than the frequently quoted speedups of 20x-100x attributed to General Purpose Graphics Processing Units (GPGPUs). This does not mean that the performance of Intel Xeon Phi coprocessors is an order of magnitude lower than the performance of GPGPUs. The comparison between GPGPU and CPU performance is complicated by the fact that these systems feature completely different hardware architectures and use different codes. Therefore, it is easy to be misled regarding the performance of a system by comparing poorly optimized CPU codes with highly optimized GPGPU codes. On the other hand, an Intel Xeon Phi coprocessor runs applications compiled from the same code as the CPU, and features a similar architecture. Therefore, a highly optimized application for Intel Xeon Phi coprocessors is likely to show good performance on Intel Xeon processors as well.
In general, before optimizing an application for the MIC architecture, it is important to optimize it for the multi-core architecture first. Most of the methodology described in this chapter applies to both architectures.
Optimization Level

Performance-critical code should be compiled with the optimization level -O2 or -O3. A simple way to set a specific optimization level is to use the compiler argument -O2 or -O3 [45]. However, a more fine-grained approach is possible by including #pragma intel optimization_level in the code [46]. Listing 4.1 illustrates these methods.
user@host% icpc -O3 source.cc

#pragma intel optimization_level 3
void my_function() {
  // ...
}

Listing 4.1: Specifying the optimization level -O3 as a compiler argument (top) and as a pragma (bottom). The compiler argument applies the specified optimization level to the whole source file; the optimization level specified by the pragma applies only to the function that follows it.
The default optimization level is -O2, which optimizes the application for speed. At this level, the enabled optimizations include automatic vectorization, inlining, constant propagation, dead-code elimination, loop unrolling, and others. This is the generally recommended optimization level for most purposes.
The optimization level -O3 enables more aggressive optimization than -O2. It includes all of the features of -O2 and, in addition, performs loop fusion, block-unroll-and-jam, if-statement collapse, and others. While -O3 may improve performance with respect to -O2, it may sometimes result in worse performance. It is straightforward to empirically determine the fastest optimization level.
Whenever a local variable or a function argument is not supposed to change value in the code, it is
beneficial to declare it with the const qualifier. This enables more aggressive compiler optimizations,
including pre-computing commonly used combinations of constants. For example, the code in Listing 4.2
executes 4.5x faster when w and T are declared with the const qualifier.
Sub-optimal Optimized
1 #include <stdio.h> 1 #include <stdio.h>
2 2
3 int main() { 3 int main() {
4 const int N=1<<28; 4 const int N=1<<28;
5 double w = 0.5; 5 const double w = 0.5;
6 double T = (double)N; 6 const double T = (double)N;
7 double s = 0.0; 7 double s = 0.0;
8 for (int i = 0; i < N; i++) 8 for (int i = 0; i < N; i++)
9 s += w*(double)i/T; 9 s += w*(double)i/T;
10 printf("%e\n", s); 10 printf("%e\n", s);
11 } 11 }
user 0m0.460s                       user 0m0.094s
sys 0m0.001s                        sys 0m0.003s
user@host%                          user@host%

Listing 4.2: The sub-optimal code on the left takes 4.5x longer to compute than the optimized code on the right. The const qualifier on the variables w and T permits the compiler to pre-compute the ratio w/T instead of computing it in every iteration.
r
fo
The difference in the performance of these two codes is explained by the fact that the compiler is certain
ed
that it is safe to pre-compute the common expression w/T. The value of this expression is then used it in the
ar
Sub-optimal (array_pointer.cc):

#include <stdio.h>

int main() {
  const long N = 1000;
  float a[N*N], b[N*N], c[N*N];
  a[:] = b[:] = 1.0f;
  c[:] = 0.0f;

  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      float* cp = c + i*N + j;
      for (int k = 0; k < N; k++)
        *cp += a[i*N + k] * b[k*N + j];
    }

  printf("%f\n", c[0]);
}

Optimized (array_index.cc):

#include <stdio.h>

int main() {
  const long N = 1000;
  float a[N*N], b[N*N], c[N*N];
  a[:] = b[:] = 1.0f;
  c[:] = 0.0f;

  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        c[i*N + j] +=
          a[i*N + k] * b[k*N + j];

  printf("%f\n", c[0]);
}

user@host% icc array_pointer.cc          user@host% icpc array_index.cc
user@host% time ./a.out                  user@host% time ./a.out
1000.000000                              1000.000000

Listing 4.3: The intentionally crippled code (array_pointer.cc, top) takes 5x longer to compute than the optimized code (array_index.cc, bottom). The only difference between the codes is the reference to the array element c[i*N + j]. This example illustrates that the index notation works faster, as it enables the compiler to implement automatic vectorization (with reduction, in this case).
Sometimes the compiler can automatically detect when an expression is re-used multiple times or within a loop, and pre-compute it, as was shown in Listing 4.2. This procedure is known as common subexpression elimination. To guard against situations in which the compiler is unable to implement this optimization, it can be expressed explicitly in the code, as shown in Listing 4.4.
Sub-optimal                                  Optimized

1 for (int i = 0; i < n; i++)                1 for (int i = 0; i < n; i++) {
2 {                                          2   const double sin_A = sin(A[i]);
3   for (int j = 0; j < m; j++) {            3   for (int j = 0; j < m; j++) {
4     const double r =                       4     const double cos_B = cos(B[j]);
5       sin(A[i])*cos(B[j]);                 5     const double r = sin_A*cos_B;
6     // ...                                 6     // ...
7   }                                        7   }
8 }                                          8 }

Listing 4.4: Left: unoptimized code computes the value of sin(A[i]) for every iteration in j. Right: optimized code eliminates redundant calculations of sin(A[i]) by pre-computing it. Note that the assignment of the variable cos_B will be eliminated by the compiler at -O2 and higher in a procedure known as constant propagation.
The ternary operator ?: is a short-hand expression for the if...else statement. Sometimes the syntax of this operator can cause redundant calculations, as shown in Listing 4.5.
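A minimal sketch of the trap described in the caption below (my_function, a, and b are placeholder names consistent with the caption, not the book's original code):

#include <stdio.h>
#include <math.h>

// Placeholder for an expensive function
double my_function(double x) { return exp(x); }

int main() {
  const double a = 0.5, b = 1.5;
  // Three calls to my_function(): two to evaluate the condition (the "a<b"
  // of the caption), and one more to substitute the selected result.
  const double smaller = ( my_function(a) < my_function(b) ?
                           my_function(a) : my_function(b) );
  printf("%f\n", smaller);
}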
Listing 4.5: In this sub-optimal code, three calls to my_function() will be made: two calls to evaluate the condition
(a<b) and one more call to substitute the result.
In order to avoid this trap, the programmer may pre-compute the expressions used in the ternary operator.
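A corresponding sketch of the pre-computed form (again with placeholder names, not the book's original code):

#include <stdio.h>
#include <math.h>

double my_function(double x) { return exp(x); }

int main() {
  const double a = 0.5, b = 1.5;
  // Only two calls to my_function(): the results are re-used in the ternary.
  const double fa = my_function(a);
  const double fb = my_function(b);
  const double smaller = ( fa < fb ? fa : fb );
  printf("%f\n", smaller);
}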
Listing 4.6: In this optimized code, only two calls to my_function() will be made.
Overhead of Abstraction
Complex C++ classes may introduce a significant overhead for operations that are expected to be fast for simple objects like C arrays. For example, data containers may perform some operations to update their internal state with every read or write operation. It may be possible to reduce the computational expense of manipulations with complex classes by outsourcing a part of the calculation to simpler objects.
For example, the calculation shown in Listing 4.7 was found in a scientific code employing the CERN ROOT library [47]. The code performs binning (i.e., hashing) of events marked by energy values into a histogram, in which the values of the bins represent the number of events with energies between the bin boundaries. The histogram is represented by the ROOT library’s class TH1F, which offers additional histogram functionality required in the project. However, the binning process is a bottleneck of the project, because the method Fill, called over 10^9 times, involves unnecessary overhead.
// class TH1F contains a histogram with the number of bins equal to nBins,
// which span the values from energyMin to energyMax. nBins is of order 100
TH1F* energyHistogram = new TH1F("energyHistogram", "", nBins, energyMin, energyMax);

// Method TH1F::Fill adds an event with the value eventEnergy[i] to the histogram.
for (unsigned long i = 0; i < nEvents; i++) // nEvents is of order 1e9.
  energyHistogram->Fill( eventEnergy[i] );

Listing 4.7: This code employs the ROOT library to construct a histogram implemented in class TH1F. The generation of the histogram is a bottleneck of the project.
The code was optimized by pre-computing the values of the bins in the histogram using a more lightweight object, an array of long integers. The array was then loaded into the histogram using an overloaded Fill method, which takes the number of events in the bin, as opposed to adding a single event at a time. As a result, the expensive Fill method is called only nBins times instead of nEvents times.
// array tempHistogram is used to prepare the histogram for loading into class TH1F
long tempHistogram[nBins]; tempHistogram[:] = 0;
const float binWidth = (energyMax - energyMin)/(float)nBins;
for (unsigned long i = 0; i < nEvents; i++) {
  const int bin = (int) ( (eventEnergy[i] - energyMin)/binWidth ); // map energy to bin index
  tempHistogram[bin]++;
}
// Now the histogram class is filled, but this time
// only nBins=100 calls to TH1F::Fill are made, instead of nEvents=1e9 calls
TH1F* energyHistogram = new TH1F("energyHistogram", "", nBins, energyMin, energyMax);
for (int i = 0; i < nBins; i++)
  energyHistogram->Fill( ((float)i+0.5f)*binWidth, (float)tempHistogram[i]);

Listing 4.8: This code produces approximately the same results as the code in Listing 4.7, but works much faster, because the expensive method Fill is called fewer times.
The calculation of the histogram can be further accelerated through vectorization and thread parallelism.
An example of these optimizations is discussed in Section 4.4.1.
While the example above is specific to the ROOT library, the principle illustrated here applies to other
situations as well. When the overhead of operations in a library function or class is beyond the control of the
developer, it may be possible to eliminate that overhead by pre-computing some of the data in lightweight
objects like C arrays.
Sub-optimal Optimized
1 for (int i = 0; i < n; i++) { 1 const float rn = 1.0f/(float)n;
2 A[i] /= n; 2 for (int i = 0; i < n; i++)
3 } 3 A[i] *= rn;
Listing 4.9: Left: unoptimized code uses a division operation with implicit type conversion in a loop. Right: optimized code pre-computes the reciprocal of the denominator and replaces the division with a multiplication.
In some cases, the compiler can automatically precompute the reciprocal; however, doing it explicitly as in Listing 4.9 may improve the cross-platform and cross-compiler portability of the code. In other cases, simplifying expressions so as to reduce the number of divisions can help. Listing 4.10 demonstrates an example of such a case:
Sub-optimal                               Optimized

for (int i = 0; i < n; i++) {             for (int i = 0; i < n; i++) {
  A[i] = A[i]/B[i] + C[i]/D[i];             A[i] = (A[i]*D[i] + C[i]*B[i]) /
}                                                  (B[i]*D[i]);
                                          }

Listing 4.10: Left: unoptimized code uses two division operations in each iteration. Right: optimized code produces approximately the same results but eliminates one division operation in each iteration.
The same applies to arithmetic expressions with expensive transcendental operations. Simplifying expressions in order to perform fewer expensive operations, even at the cost of performing a greater number of less expensive operations, may result in a performance increase. This technique is known as strength reduction in the context of compiler optimization.
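As a brief illustration (a minimal sketch in the style of Listing 4.10, not one of the book's numbered listings; it assumes double arrays A, B and R of length n and #include <math.h>), two transcendental calls can sometimes be merged into one:

// Sub-optimal: two expensive exponentials per iteration
for (int i = 0; i < n; i++)
  R[i] = exp(A[i]) * exp(B[i]);

// Optimized: exp(a)*exp(b) equals exp(a+b), so one exponential and one
// cheap addition per iteration suffice
for (int i = 0; i < n; i++)
  R[i] = exp(A[i] + B[i]);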
1. Single precision floating-point numbers (float) should be used instead of double precision (double)
wherever possible. Similarly, signed 32-bit integers (int) should be preferred to unsigned and 64-bit
integers (long), including array indices.
2. Typecast operations should be avoided. Note the following conventions:
a) Integer constants 1, 0 and -1 are of type int. Constant -1L is long, and 1UL is unsigned long.
b) Floating-point constants 1.0 and 0.0 are of type double. Constant 1.0f is float;
c) In expressions with mixed types, implicit typecast of lower-precision numbers to higher-precision
numbers is assumed. For example, x+i*a where x is of type double, i is int, and a is float, is
equivalent to x + (double)i*(double)a.
g
an
3. The Intel Math library defines in math.h fast single precision versions of most arithmetic functions. The
W
names of single precision functions are derived from double precision function names by appending the
ng
suffix -f. For example,
e
nh
a) function sin(x) takes an argument of type double and returns a value of the same type;
Yu
b) function sinf(x) takes and returns float and executes faster;
r
fo
4. Instead of the exponential function exp()/expf() and the natural logarithm log()/logf(), use faster
re
base 2 versions offered by the Intel Math Library: exp2()/exp2f() and log2()/log2f().
yP
5. Note that single precision functions ending with -f (for example, sinf(), expf(), fabsf(), etc.) are guaranteed to work faster than their double precision counterparts (sin(), exp(), fabs(), etc.) with the Intel compilers. However, we have seen cases where, with other compilers, single precision functions are slower than double precision functions. The same applies to the base-2 exponential and logarithm.
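A minimal sketch of items 3 and 4 above, using only standard math.h functions and constants (the wrapper names below are illustrative, not part of the Intel Math Library):

#include <math.h>

// Single precision call (item 3): sinf() takes and returns float.
// Base-2 substitutions (item 4): e^x = 2^(x*log2(e)), ln(x) = log2(x)*ln(2).
float fast_expf(float x) { return exp2f(x * (float)M_LOG2E); }
float fast_logf(float x) { return log2f(x) * (float)M_LN2;   }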
Floating-Point Semantics
The Intel C++ Compiler may represent floating-point expressions in executable code differently, depend-
ing on the floating-point semantics, i.e., rules for finite-precision algebra allowed in the code. These rules are
controlled by an extensive set of command-line compiler arguments. The argument -fp-model controls
floating-point semantics at a high level.
Table 4.1 explains the usage of the argument -fp-model. For more information, see the Compiler
Reference [49] and the white paper “Consistency of Floating-Point Results using the Intel Compiler or Why
doesn’t my application always give the same answer?” by Dr. Martyn J. Corden and David Kreitzer [50].
-fp-model strict: Only value-safe optimizations; exception control is enabled (but may be disabled using -fp-model no-except); floating-point contractions (e.g., the fused multiply-add instruction) are disabled. This is the strictest floating-point model.

-fp-model precise: Only value-safe optimizations; exception control is disabled (but may be enabled using -fp-model except). Serial floating-point calculations are reproducible from run to run. Some parallel OpenMP calculations can be made reproducible by using the environment variable KMP_DETERMINISTIC_REDUCTION. The combination -fp-model precise -fp-model source produces floating-point results compliant with the IEEE-754 standard.

-fp-model fast=1: Value-unsafe optimizations are allowed, exceptions are not enforced, contractions are enabled. This is the default floating-point semantics model. The short-hand -fp-model fast is equivalent to -fp-model fast=1.

-fp-model fast=2: Enables more aggressive optimizations than fast=1, possibly leading to better performance at the cost of accuracy.

-fp-model source: Intermediate arithmetic results are rounded to the precision defined in the source code. Using source also assumes precise, unless overridden by strict or fast.

-fp-model double: Intermediate arithmetic results are rounded to 53-bit (double) precision. Using double also assumes precise, unless overridden by strict or fast.

-fp-model extended: Intermediate arithmetic results are rounded to 64-bit (extended) precision. Using extended also assumes precise, unless overridden by strict or fast.

-fp-model [no-]except: except enables, no-except disables the floating-point exception semantics.

Table 4.1: Command-line arguments for high-level floating-point semantics control with the Intel C++ Compiler.
In this context, "value-unsafe" optimizations refer to code transformations that produce only approximately the same result. For example, floating-point multiplication is generally non-associative in finite-precision arithmetic, i.e., a*(b*c) ≠ (a*b)*c. If value-unsafe optimizations are enabled, the compiler may replace an expression like bar=a*a*a*a with foo=a*a; bar=foo*foo. However, if only value-safe optimizations are allowed, then the expression will be computed from left to right, i.e., bar=((a*a)*a)*a. The two expressions produce approximately the same result, but the former employs one less multiplication operation.
1 #include <stdio.h>
2 #include <math.h>
3 int main() {
4 for (int i = 0; i < 100; i++) {
5 const int N=i*5000;
6 double A = 0.1;
7 for (int r = 0; r < N; r++)
8 A = sqrt(1.0-4.0*(A-0.5)*(A-0.5));
9 if (i<10) printf("After %5d iterations, A=%e\n", N, A);
10 }
11 }
Listing 4.12: Code fpmodel.cc used in Listing 4.13 to illustrate the effect of relaxed floating-point model. The loop
with the sqrt() function performs an iterative update of the value A.
Left panel:
user@host% icpc -o fpmodel-1 -mmic fpmodel.cc
user@host% scp fpmodel-1 mic0:~/
fpmodel-1                100%   11KB  11.6KB/s   00:00
user@host% ssh mic0 time ./fpmodel-1
After     0 iterations, A=0.100000
After  5000 iterations, A=0.178449

Right panel:
user@host% icpc -o fpmodel-2 -mmic fpmodel.cc -fp-model fast=2
user@host% scp fpmodel-2 mic0:~/
fpmodel-2                100%   11KB  11.2KB/s   00:00
user@host% ssh mic0 time ./fpmodel-2
After     0 iterations, A=0.100000
After  5000 iterations, A=0.178449

Listing 4.13: Compiling and running the code illustrated in Listing 4.12 on an Intel Xeon Phi coprocessor. The test case shown in the left panel uses the default transcendental function accuracy, -fp-model fast=1. The case shown in the other panel uses the relaxed transcendental function accuracy, -fp-model fast=2.
In the code shown in Listing 4.12, a single floating-point number is subjected to an iterative procedure. It can be demonstrated analytically that this procedure has a chaotic character, i.e., small perturbations in the initial conditions lead to large deviations in the result after several iterations. Listing 4.13 demonstrates that up to 20000 iterations, the codes compiled with -fp-model fast=1 and fast=2 produce identical results. However, by iteration 25000, the results are completely different. This occurs because at iteration 23431 (as tested on our hardware), the two codes produce slightly different results due to different numerical accuracy, and this subtle difference is amplified by the chaotic iteration in the subsequent steps. Note that, at the same time, the code compiled with -fp-model fast=2 performs almost twice as fast as the code compiled with the default floating-point model.
This example demonstrates how relaxing the floating-point model may lead to a significant performance increase on Intel Xeon Phi coprocessors. However, it is only safe to do so in well-behaved, numerically stable applications.
By default, the Intel C++ Compiler replaces calls to arithmetic operations, such as division, square root, sine, cosine, and others, with the respective calls to the Intel Math Library or Intel Short Vector Math Library functions, or with processor vector instructions. It is possible to instruct the compiler to use low-precision functions for some arithmetic operations in order to gain more performance. Naturally, this must be done with care and only in "well-behaved" applications that can tolerate the imprecise results.
Table 4.2 summarizes the Intel C++ Compiler command-line arguments for precision control. For more information on function precision control in the Intel C++ Compiler, see the Intel C++ Compiler Reference [51] and the white paper "Advanced Optimizations for Intel MIC Architecture, Low Precision Optimizations" by Wendy Doerner [52].
-fimf-precision=value[:funclist]: Defines the precision for math library functions. Here, value is one of high, medium or low, which correspond to progressively less accurate but more efficient math functions, and funclist is a comma-separated list of functions that this rule is applied to. Value high is equivalent to max-error=0.6; medium and low permit progressively larger errors. By default, this option is not specified, and the compiler uses default heuristics.

-fimf-max-error=ulps[:funclist]: The maximum allowable error expressed in ulps (units in the last place) [53]. A max error of 1 ulp corresponds to the last mantissa bit being uncertain; 4 ulps is three uncertain bits, etc. This is a more fine-grained method of setting accuracy than -fimf-precision.

-fimf-accuracy-bits=n[:funclist]: The number of correct bits required for mathematical function accuracy. The conversion formula between accuracy bits and ulps is ulps = 2^(p-1-bits), where p is 24 for single precision, 53 for double precision and 64 for long double precision (the number of mantissa bits). This is a more fine-grained method of setting accuracy than -fimf-precision.

-fimf-domain-exclusion=n[:funclist]: Defines a list of special-value numbers that do not need to be handled by the functions. Here, n is an integer derived by the bitwise OR of the following values: extremes: 1, NaNs: 2, infinities: 4, denormals: 8, zeroes: 16. For example, n=15 indicates that extremes, NaNs, infinities and denormals should be excluded from the domain of numbers that the mathematical functions must correctly process.

Table 4.2: Command-line arguments for mathematical function precision control with the Intel C++ Compiler.
Listing 4.14 and Listing 4.15 illustrate math function precision control.
1 #include <stdio.h>
2 #include <math.h>
3
4 int main() {
5 const int N = 1000000;
6 const int P = 10;
7 double A[N];
8 const double startValue = 1.0;
9 A[:] = startValue;
10 for (int i = 0; i < P; i++)
11 #pragma simd
12 for (int r = 0; r < N; r++)
13 A[r] = exp(-A[r]);
14
15 printf("Result=%.17e\n", A[0]);
16 }
Listing 4.14: Code precision.cc used in Listing 4.15 to illustrate the effect of relaxed transcendental function precision.
Left panel:
user@host% icpc -o precision-1 -mmic \
  -fimf-precision=low precision.cc
user@host% scp precision-1 mic0:~/
precision-1              100%   11KB  11.3KB/s   00:00
user@host% ssh mic0 time ./precision-1
Result=5.68428695201873779e-01
real 0m 0.08s
user@host%

Right panel:
user@host% icpc -o precision-2 -mmic \
  -fimf-precision=high precision.cc
user@host% scp precision-2 mic0:~/
precision-2              100%   19KB  19.4KB/s   00:00
user@host% ssh mic0 time ./precision-2
Result=5.68428725029060722e-01
real 0m 0.14s
user@host%

Listing 4.15: Compiling and running the code illustrated in Listing 4.14 on an Intel Xeon Phi coprocessor. The test case shown in the left panel uses low precision; the case on the right uses high precision.
The change of the precision of the exponential function from high to low results in almost a factor of 2 speedup.
Left-hand side (rand.cc, C Standard General Utilities Library):
    A[i] = (float)rand()/(float)RAND_MAX;
  }
  printf("Generated %ld random \
numbers.\nA[0]=%e\n", N, A[0]);
  free(A);
}

Right-hand side (rand-mkl.cc, Intel MKL):
    VSL_BRNG_MT19937, 1 );
  vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD,
    rnStream, N, A, 0.0f, 1.0f);
  printf("Generated %ld random \
numbers.\nA[0]=%e\n", N, A[0]);
  free(A);
}

user@host% icpc -mmic -o rand rand.cc
user@host%
A[0]=8.401877e-01

user@host% icpc -mkl -mmic -o rand-mkl \
  rand-mkl.cc
user@host% export SINK_LD_LIBRARY_PATH=\
A[0]=1.343642e-01

Listing 4.16: Comparison of random number generation on an Intel Xeon Phi coprocessor with the C Standard General Utilities Library (left-hand side) and Intel MKL (right-hand side).
Both codes in Listing 4.16 perform the same task: the generation of a set of random numbers. However, the code on the left-hand side of the listing is based on the C Standard General Utilities Library, while the code on the right-hand side uses the Intel MKL. The performance on the Intel Xeon Phi coprocessor is better with MKL by a factor of at least 7x. In addition, the MKL implementation is thread-safe and can be efficiently used by multiple threads in an application. That is not true of the random number generator implemented in the C Standard General Utilities Library.
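A minimal, self-contained sketch of the MKL approach used on the right-hand side is shown below; the problem size N and the seed value are illustrative assumptions rather than the exact parameters of the original rand-mkl.cc. It uses the standard MKL VSL calls vslNewStream(), vsRngUniform() and vslDeleteStream(), and is compiled with the -mkl flag:

#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main() {
  const int N = 1 << 26;                                   // illustrative problem size
  float* A = (float*) malloc(sizeof(float)*N);

  VSLStreamStatePtr rnStream;
  vslNewStream( &rnStream, VSL_BRNG_MT19937, 1 );          // MT19937 generator, seed 1
  vsRngUniform( VSL_RNG_METHOD_UNIFORM_STD,
                rnStream, N, A, 0.0f, 1.0f );              // N uniform floats in [0, 1)
  vslDeleteStream( &rnStream );

  printf("Generated %d random numbers.\nA[0]=%e\n", N, A[0]);
  free(A);
}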
Consider a set of m charged particles. Particle number i is characterized by three Cartesian coordinates \vec{r}_i \equiv (r_{i,x}, r_{i,y}, r_{i,z}) and charge q_i. We need to calculate the electric potential \Phi(\vec{R}) at multiple observation locations \vec{R}_j \equiv (R_{j,x}, R_{j,y}, R_{j,z}), where the index j denotes one of n locations. The expression for that potential is given by Coulomb's law:

\Phi\left(\vec{R}_j\right) = -\sum_{i=1}^{m} \frac{q_i}{\left|\vec{r}_i - \vec{R}_j\right|},    (4.1)

where \left|\vec{r}_i - \vec{R}_j\right| = \sqrt{(r_{i,x}-R_{j,x})^2 + (r_{i,y}-R_{j,y})^2 + (r_{i,z}-R_{j,z})^2}.
Figure 4.1 is a visual illustration of the problem. In the left panel, m = 512 charges are distributed in a lattice-like pattern. Each of these particles contributes to the electric potential at every point in space. The right panel of the figure shows the electric potential at n = 128 × 128 points in the xy-plane at z = 0, calculated using Equation (4.1).
Figure 4.1: Left panel: a set of charged particles. Right panel: the electric potential Φ in the z = 0 plane produced by the charged particles shown in the left panel. For every point in the xy-plane, Equation 4.1 was applied to calculate Φ(R), where the summation from i = 1 to i = m is taken over the m charged particles.
struct Charge {
  // An array of such structures produces strided access to x, y, z and q
  float x, y, z, q; // Coordinates and charge of one particle
};

Listing 4.17: Data organization as an array of structures is often inefficient for vectorization.
The code in Listing 4.18 demonstrates a function that calculates the electric potential at a point R defined by the quantities Rx, Ry and Rz in the code.
 1 // This version performs poorly, because data layout of class Charge
 2 // does not allow efficient vectorization
 3 void calculate_electric_potential(
 4     const int m,          // Number of charges
 5     const Charge* chg,    // Charge distribution (array of structures)
 6     const float Rx, const float Ry, const float Rz, // Observation point
 7     float & phi           // Output: electric potential
 8   ) {
 9   phi = 0.0f;
10   for (int i = 0; i < m; i++) {
11     // Stride-4 access to the members of class Charge
12     const float dx = chg[i].x - Rx;
13     const float dy = chg[i].y - Ry;
14     const float dz = chg[i].z - Rz;
15     phi -= chg[i].q / sqrtf(dx*dx+dy*dy+dz*dz); // Coulomb's law
16   }
17 }

Listing 4.18: Inefficient solution for Coulomb's law application: access to quantities x, y, z and q has a stride of 4 rather than 1.
A reference calculation time for this code on a two-way system with Intel Xeon E5-2680 processors with m = 2^11 and n = 2^22 is 0.90 seconds. On a single Intel Xeon Phi coprocessor (pre-production sample), the runtime is 0.73 seconds.
In order to understand why this result can be improved, consider the inner for-loop in line 10 of Listing 4.18. The variable chg[i].x in the i-th iteration is 4*sizeof(float)=16 bytes away in memory from chg[i+1].x used in the next iteration. This corresponds to a stride of 16/sizeof(float)=4 instead of 1, which will incur a performance hit when the data are loaded into the processor's vector registers. The same goes for members y, z and q of class Charge. Even though Intel Xeon Phi coprocessors support gather/scatter vector instructions, unit-stride access to vector data is almost always more efficient.
1 struct Charge_Distribution {
2   // This data layout permits effective vectorization of Coulomb's law application
3   const int m; // Number of charges
4   float * x;   // Array of x-coordinates of charges
5   float * y;   // ...y-coordinates...
6   float * z;   // ...etc.
7   float * q;   // These arrays are allocated in the constructor
8 };

Listing 4.19: Data storage as a structure of arrays is usually beneficial for vectorization.
With this new data structure, the function calculating the electric potential takes the form shown in Listing 4.20.
 1 // This version vectorizes well thanks to unit-stride data access
 2 void calculate_electric_potential(
 3     const int m,                     // Number of charges
 4     const Charge_Distribution & chg, // Charge distribution (structure of arrays)
 5     const float Rx, const float Ry, const float Rz, // Observation point
 6     float & phi                      // Output: electric potential
 7   ) {
 8   phi = 0.0f;
 9   for (int i = 0; i < m; i++) {
10     // Unit-stride access: chg.x[i+1] immediately follows chg.x[i] in memory
11     const float dx = chg.x[i] - Rx;
12     const float dy = chg.y[i] - Ry;
13     const float dz = chg.z[i] - Rz;
14     phi -= chg.q[i] / sqrtf(dx*dx+dy*dy+dz*dz);
15   }
16 }

Listing 4.20: Optimized version of the Coulomb's law calculation: the structure-of-arrays data layout enables unit-stride access and efficient vectorization.
Clearly, the inner for-loop in line 9 of Listing 4.20 has unit-stride data access, as chg.x[i] is immediately followed by chg.x[i+1] in memory, and the same is true for all other quantities accessed via the array index i. The code execution time for m = 2^11, n = 2^22 is now 0.51 seconds on the host system and 0.37 seconds on the coprocessor. Figure 4.2 summarizes the results.
Note that this performance can be further improved by excluding denormals from the special cases of floating-point numbers handled by the reciprocal square root function. This optimization is discussed in the Exercise Section A.4.4.
[Figure 4.2: bar chart comparing the non-unit stride (array of structures), unit-stride (structure of arrays), and unit-stride with relaxed precision cases; readable values include 0.51 s, 0.37 s and 0.22 s.]
Figure 4.2: Electric potential calculation with Coulomb's law discussed in Section 4.3.1. The non-unit stride case uses an array of structures to represent particles; the unit-stride case uses a structure of arrays. The relaxed arithmetic case is not discussed in the text, but is available in the Exercise Section A.4.4.
The example considered above demonstrates how converting the data layout from an array of structures to a structure of arrays can significantly improve vectorized code performance. Note that the optimal data layout, a structure of arrays, somewhat limits the opportunities for abstraction in C++ codes, reverting the code back to the old C-style loop- and array-centric paradigm. This limitation has been expressed as the statement "abstraction kills parallelism", which is attributed to Prof. Paul Kelly, Imperial College [56]. However, the necessity of unit-stride access is an inherent property of all computer architectures with a hierarchical memory structure. It is a consequence of a more general prerequisite of high performance: locality of reference of data in space. See also Section 4.5 for an illustration of how improving locality of reference in time can improve performance.
By default, when the compiler encounters a loop for which vectorization may be beneficial, multiple
versions of this loop will be generated. For instance, if the loop iteration count is not known at compile time,
the compiler may generate a scalar version for very short loops and a SIMD version for longer loops.
If the programmer expects a certain iteration count for the loops in the code, it may be beneficial to inform the compiler of that count. This may result in a more efficient execution path selection, because the compiler will be able to make better decisions regarding the vectorization strategy, loop unrolling, iteration peeling, etc. #pragma loop count is used to declare the expected loop iteration count. Note that the count specified in the pragma must be a constant known at compile time. Listing 4.21 shows a function performing the multiplication of a packed sparse matrix by a vector. The sparse matrix is stored as an array of contiguous chunks of non-zero elements. The average length of a chunk is 100.
void PackedSparseMatrix::MultiplyByVector(const float* inVector, float* outVector) {
  // This function computes the matrix-vector product Ma=b, where
  // M is a sparse matrix stored in a packed format in this->packedData;
  // a is the input vector of length this->nRows represented by inVector
  // b is the output vector of length this->nCols represented by outVector
#pragma omp parallel for schedule(dynamic,30)
  for (int i = 0; i < nRows; i++) {
    outVector[i] = 0.0f;
    for (int nb = 0; nb < blocksInRow[i]; nb++) {
      float sum = 0.0f;

      const int idx  = blockFirstIdxInRow[i]+nb; // Block index in storage
      const int offs = blockOffset[idx];         // Offset of the block data in packedData
      const int j0   = blockCol[idx];            // Column in the original matrix corresponding to this block

      // When the actual loop count blockLen[idx] is guessed correctly at compile time,
      // this pragma may boost performance.
#pragma loop count avg(100)
      for (int c = 0; c < blockLen[idx]; c++)
        sum += packedData[offs+c]*inVector[j0+c];

      outVector[i] += sum;
    }
  }
}
Listing 4.21: Multiplication of a sparse matrix in packed row format by a vector. Performance is improved with the help
of #pragma loop count, which informs the compiler of the expected for-loop count. This facilitates a better choice
of vectorization strategy.
Before including #pragma loop count, the multiplication of a sparse 20000 × 20000 matrix with
a filling factor of 10% and average contiguous block size of 100 by a vector takes 2.00 ± 0.08 ms on the
coprocessor. After the inclusion of #pragma loop count avg(100), this time drops to 1.79 ± 0.03 ms.
More details on this problem can be found in the Exercise Section A.4.5.
void PackedSparseMatrix::MultiplyByVector(const float* inVector, float* outVector) {
  // This function computes the matrix-vector product Ma=b, where
  // M is a sparse matrix stored in a packed format in this->packedData;
  // a is the input vector of length this->nRows represented by inVector
  // b is the output vector of length this->nCols represented by outVector
#pragma omp parallel for schedule(dynamic,30)
  for (int i = 0; i < nRows; i++) {
    outVector[i] = 0.0f;
    for (int nb = 0; nb < blocksInRow[i]; nb++) {
      // For each row i, the value blocksInRow[i] is the number of packed blocks in that row
      float sum = 0.0f;

      const int idx  = blockFirstIdxInRow[i]+nb; // Block index in storage
      const int offs = blockOffset[idx];         // Offset of the block data in packedData
      const int j0   = blockCol[idx];            // Column in the original matrix corresponding to this block

      // The programmer guarantees that packedData[offs] and inVector[j0], i.e., the
      // vectorized arrays used in the first iteration, are aligned on a 64-byte boundary
#pragma vector aligned
#pragma loop count avg(128) min(16)
      for (int c = 0; c < blockLen[idx]; c++) {
        // This computes the expression outVector[i] = sum( M[i,j]*inVector[j] )
        // using packed matrix data.
        sum += packedData[offs+c]*inVector[j0+c];
      }
      outVector[i] += sum;
    }
  }
}
Listing 4.22: Multiplication of a sparse matrix in packed row format by a vector. Here, #pragma vector aligned
informs the compiler that data alignment in memory is guaranteed by the programmer, and therefore alignment checks are
not necessary.
Note that in order for the code in Listing 4.22 to work, the arrays packedData and inVector must be aligned in such a way that for every instance of the for (int c = 0; ...) loop, the addresses of packedData[offs] and inVector[j0] are on a 64-byte aligned boundary. If the data are not aligned, but #pragma vector aligned is used, the code will crash. In this code, the alignment property is guaranteed by the following:
a) Arrays packedData and inVector are allocated using the function _mm_malloc() discussed in Section 3.1.4; a sketch of such an allocation is shown after this list.
b) The length of each block of contiguous non-zero elements in the matrix is padded to a value that is a multiple of 64. That is, for every idx, blockLen[idx] % 64 == 0, and therefore blockCol[idx] % 64 == 0 and blockOffset[idx] % 64 == 0. This is illustrated in the full listing included in Exercise Section A.4.5.
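A minimal sketch of the aligned allocation mentioned in item a); the array sizes nPacked and nColumns are illustrative placeholders for the actual sizes used by the class:

#include <malloc.h>   // _mm_malloc()/_mm_free() with the Intel compilers

void allocate_aligned_arrays(const int nPacked, const int nColumns) {
  // 64-byte alignment matches the vector register width of the coprocessor
  float* packedData = (float*) _mm_malloc(sizeof(float)*nPacked,  64);
  float* inVector   = (float*) _mm_malloc(sizeof(float)*nColumns, 64);
  // ... fill the arrays and call MultiplyByVector() ...
  _mm_free(packedData);
  _mm_free(inVector);
}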
Summary
Figure 4.3 summarizes the effect of user-guided automatic vectorization with #pragma loop count
and #pragma vector aligned.
[Figure 4.3: "Multiplication of a packed sparse matrix by vector", run times in seconds (lower is better) on the host system and the Intel Xeon Phi coprocessor; readable values: 4.60 s, 3.99 s, 3.97 s, 2.10 s, 1.95 s and 1.65 s.]
Figure 4.3: Results of the sparse matrix by vector multiplication benchmark after assisting the compiler with vectorization
pragmas.
Note that in the case of the #pragma loop count, no additional modifications in the code are
required to reap the benefits of increased performance. However, the performance of the code may be degraded
if the loop count used at runtime is significantly different from the value used at compile time in the pragma.
In the case of #pragma vector aligned, the programmer must ensure that the accessed data is
indeed aligned in memory in such a way that in every instance of the vectorized loop, the data accessed in the
first iteration resides on a 64-byte aligned address in memory. See Section 3.1.4 for more information on data
alignment.
Finally, one must note that on the coprocessor, data alignment together with the alignment pragma increased performance. However, on the host system, the performance slightly dropped when data alignment was used. This happened because:
a) AVX instructions used by the Intel Xeon E5 Sandy Bridge processor are not sensitive to alignment.
Therefore, the host version of the application was not slowed down by misaligned data.
b) In this particular application, data alignment could be guaranteed only by padding some of the data blocks
to a length that is a multiple of 64-bytes. Therefore, the total data set on which the application was operating
was increased when data alignment was implemented. This explains the performance drop on the host
system. At the same time, it illustrates that on the Intel Xeon Phi coprocessor, it may be more efficient to
process a larger amount of aligned data than a smaller amount of unaligned data.
See Exercise Section A.4.5 for the complete code used in this example.
Branches expressed with if-statements, while-loops and the ternary operator (?:) are common in
combinatorial problems. Combinatorial algorithms with numerous branches are generally non-vectorizable and
often inherently sequential. In order to optimize branch performance in scalar algorithms, one must consider
the behavior of branch predictors and the cost of pipeline flushes for mispredicted branches. Oftentimes,
algorithms benefit from the elimination of fine-grained branches and replacing them with coarse-grained
branches, because it reduces the frequency of branching and branch misprediction.
In this training, we do not discuss optimizations relevant to combinatorial algorithms, because combina-
torial problems are rarely suitable for execution on Intel Xeon Phi coprocessors. However, we will discuss a
special case of branches in vectorized loops that can occur in numerical problems: masked calculations.
Masked SIMD Calculations
In this section we consider masked calculations expressed as branches that are applied to every iteration in a vector loop. For instance, consider the code in Listing 4.23.
 1 void NonMaskedOperations(const int m, const int n, const int* flag, float* data) {
 2 #pragma omp parallel for schedule(dynamic)
 3   for (int i = 0; i < m; i++)
 4     for (int pass = 0; pass < 10; pass++)
 5       for (int j = 0; j < n; j++)
 6         data[i*n+j] = sqrtf(data[i*n+j]);
 7 }
 8
 9 void MaskedOperations(const int m, const int n, const int* flag, float* data) {
10 #pragma omp parallel for schedule(dynamic)
11   for (int i = 0; i < m; i++)
12     for (int pass = 0; pass < 10; pass++)
13       for (int j = 0; j < n; j++)
14         if (flag[j] == 1)
15           data[i*n+j] = sqrtf(data[i*n+j]);
16 }
Listing 4.23: Function NonMaskedOperations applies an expensive arithmetic operation to the elements of matrix
data. Function MaskedOperations applies the same operation only to elements for which the mask flag[] is set
to 1.
1 void MaskedOperationsHugeStride(const int m, const int n, const int* flag, float* data) {
2 #pragma omp parallel for schedule(dynamic)
3 for (int j = 0; j < n; j++) {
4 for (int pass = 0; pass < 10; pass++)
5 if (flag[j] == 1)
6 for (int i = 0; i < m; i++)
7 data[i*n+j] = sqrtf(data[i*n+j]);
8 }
9 }
Listing 4.24: Inefficient attempt at optimizing the code shown in Listing 4.23. In this code, the branch is taken outside the
loop, which reduces the number of branch condition evaluations. However, matrix data is accessed with a stride of n,
which is inefficient.
Let us return to the original code in Listing 4.23. Compilation with the flag -vec-report3 reveals that the compiler did vectorize the inner j-loop in both functions: NonMaskedOperations() and MaskedOperations(). In order to understand what exactly happened, we can benchmark the code with different masks:
(a) With the mask flag[0]=flag[1]=...=flag[n-1]=0, the branch if (flag[j]==1) is never taken, and all elements of the array data remain unchanged.
(b) With the mask flag[0]=...=flag[n-1]=1, the branch if (flag[j]==1) is always taken, and sqrtf() must be applied to every element.
(c) If the mask has alternating zeroes and ones (flag[0]=0, flag[1]=1, flag[2]=0, flag[3]=1, etc.), then the branch is taken every other time, and sqrtf() must be applied to every other element.
(d) Finally, we will consider a mask that contains 16-element blocks of zeroes alternating with 16-element blocks of ones.
We benchmarked this application on a matrix with m=2^13, n=2^15 on the host system and on the coprocessor (as a native application). The result can be seen in Figure 4.4.
In the case of mask (a) (all flags set to 0), the execution time is significantly lower than in all other cases.
This means that the code was able to recognize at runtime that none of the expensive sqrtf() operations
must be executed.
Between the cases masks (b) (all flags set to 1) and (c) (alternating 0 and 1 flags), the performance is
not significantly different. This is because the code took a vectorized path. In this path, each vector iteration
computes sqrtf() for several consecutive scalar iterations (8 on processor, 16 on coprocessor). After that,
the results for scalar iterations with the flag set to 1 are written to the destination array, and results for which
the flag is not set are discarded. Therefore, in cases (b) and (c), the sqrtf() function must be computed for
all elements, which explains similar performance.
For mask (d) (16 flags set to 1 alternating with 16 flags set to 0), the execution time is considerably lower.
This case deserves special attention. Indeed, the fraction of the flags set to 1 in this case is 50%, just like in
case (c), but the performance is almost twice as high. The difference is that with the pattern of flags as in mask
(d), the application and processor are able to recognize that every other SIMD vector gets discarded completely.
The application takes advantage of this and saves computing time by calculating only every other vector. Note
that data alignment is crucial for case (d) to be efficient on the Intel Xeon Phi coprocessor, because without
alignment, the pattern of saved and discarded vector lanes may not permit skipping some vector iterations.
 1 void NonMaskedOperations(const int m, const int n, const int* flag, float* data) {
 2 #pragma omp parallel for schedule(dynamic)
 3   for (int i = 0; i < m; i++)
 4     for (int pass = 0; pass < 10; pass++)
 5 #pragma simd
 6       for (int j = 0; j < n; j++)
 7         data[i*n+j] = sqrtf(data[i*n+j]);
 8 }
 9
10 void MaskedOperations(const int m, const int n, const int* flag, float* data) {
11 #pragma omp parallel for schedule(dynamic)
12   for (int i = 0; i < m; i++)
13     for (int pass = 0; pass < 10; pass++)
14 #pragma simd
15       for (int j = 0; j < n; j++)
16         if (flag[j] == 1)
17           data[i*n+j] = sqrtf(data[i*n+j]);
18 }

Listing 4.25: A modification of the code in Listing 4.23 with vectorization made mandatory using #pragma simd.
The benchmark of the masked vector operations with #pragma simd (Listing 4.25) is shown in Figure 4.5. The execution time now does not depend on the mask. Most notably, for masks (a) and (b), the execution time is identical, even though in (a) all flags are unset, and in (b) all flags are set. At the same time, on the host system, the execution time for masks (b), (c) and (d) is lower than in the case without #pragma simd.
This example demonstrates that whether mandatory vectorization with #pragma simd is beneficial depends on the pattern of branches in the vectorized code. Specifically, when the vectorized loop contains branches that are rarely taken (e.g., mask (a)), the default vectorization method produces better results than mandatory vectorization. Some codes with masked operations may benefit from tuning the vectorization method to the pattern of masks, or from changing the data layout in such a way that the mask has a predictable pattern convenient for vectorization.
[Figure 4.4: bar chart; readable values: 92.7 ms, 65.4 ms, 65.4 ms, 58.4 ms, 48.8 ms, 31.7 ms, 32.5 ms and 21.2 ms across masks (a) through (d) and the unmasked case.]

Figure 4.4: Benchmark of the code in Listing 4.23 performing masked operations on an [m x n] matrix with m=2^13 and n=2^15. Mask (a) has all elements unset (i.e., the branch leading to the computation of sqrtf() is never taken), mask (b) has all elements set (the branch is always taken), mask (c) has every other element unset, and mask (d) is made of blocks of 16 set elements alternating with blocks of 16 unset elements.
[Figure 4.5: bar chart, time in ms (lower is better), host system and coprocessor; readable values: 76.0 ms, 75.9 ms, 75.9 ms, 75.9 ms, 66.8 ms, 66.8 ms, 66.8 ms, 66.9 ms, 53.3 ms and 32.8 ms.]

Figure 4.5: Benchmark of the code in Listing 4.25, in which vectorization is made mandatory with #pragma simd. See the caption to Figure 4.4 for details.
2. It is also possible to use Intel VTune Amplifier XE in order to diagnose the number of scalar and vector instructions issued by the code.
3. Finally, there is a simple and practical way to test the effect of automatic vectorization on the application performance. First, compile and benchmark the code with all the default compiler options. Then, compile the code with the arguments -no-vec -no-simd and benchmark again. With these options, automatic vectorization is disabled. If the difference in performance is significant, it indicates that automatic vectorization already contributes to the performance. Note that this method works best when the application is benchmarked with only one thread. This reduces the impact of other factors (such as memory traffic and multithreading overhead) on the execution time of the code.
4.4.1 Too Much Synchronization. Solution: Avoiding True Sharing with Private
Variables and Reduction
When more than one thread accesses a memory location, and at least one of these accesses is a write, a
race condition occurs. In order to avoid unpredictable behavior of a program with a race condition, mutexes
must be used, which generally incurs a performance penalty. In some cases it is possible to avoid a mutex by
using private variables, atomic operations or reducers, as shown in this section.
Example Problem: Computing a Histogram
Consider the problem of computing a histogram. Let us assume that array float age[n] contains numbers from 0 to 100 representing the ages of n people. The task is to create an array hist[m], the elements of which contain the number of people in the age groups 0-20 years, 20-40 years, 40-60 years, etc. For simplicity, suppose const int m=5 (the number of age groups covering the range 0-100) and const float group_width=20.0f (how many years each group spans).
An unoptimized serial C code that performs this calculation is shown in Listing 4.26. This code is not protected from situations when one of the members of age[] is outside the range [0.0 . . . 100.0). We assume that the user of the function Histogram() is responsible for ensuring that the array age has only valid entries.
1 void Histogram(const float* age, int* const hist, const int n,
2                const float group_width, const int m) {
3   for (int i = 0; i < n; i++) {
4     const int j = (int) ( age[i] / group_width );
5     hist[j]++;
6   }
7 }
Listing 4.26: Serial, scalar code to compute the histogram of the number of people in age groups.
This code is not optimal, because it cannot be auto-vectorized. The problem with vectorization is a true vector dependence in the access to array hist. Indeed, consecutive iterations of the i-loop cause scattered writes to hist, which is not possible to express with vector instructions. However, the operation of computing the index j does not have a vector dependence, and therefore this part of the calculation can be vectorized.
Before we proceed with parallelizing this code, let us first ensure that it is vectorized. In order to facilitate automatic vectorization, we can apply a technique called "strip-mining". In order to "strip-mine" the i-loop, we express it as two nested loops: an outer loop with index ii and a stride of vecLen = 16, and an inner loop with index i that "mines" the strip of indexes between ii and ii+vecLen. After that, we can split the inner loop into two consecutive loops: a vector loop for computing the index j, and a scalar loop for incrementing the histogram value. The strip-mining technique is further discussed in Section 4.4.4. Note that the choice of the value vecLen=16 is dictated by the fact that the vector registers of Intel Xeon Phi coprocessors can fit 16 values of type int.
Listing 4.27 demonstrates a code that produces the same results as the code in Listing 4.26, but faster, thanks to automatic vectorization. In addition to vectorization, we also implemented a scalar optimization by replacing the division by group_width with a multiplication by its reciprocal value. We assume in this code that n is a multiple of vecLen, i.e., n%vecLen==0. This assumption is easy to lift, as shown in Exercise Section A.4.7.
 1 void Histogram(const float* age, int* const hist, const int n,
 2                const float group_width, const int m) {
 3
 4   const int vecLen = 16;   // A coprocessor vector register holds 16 values of type int
 5   const float invGroupWidth = 1.0f/group_width; // Replace division with multiplication
 6
 7   // Strip-mine the i-loop (n%vecLen==0 is assumed)
 8   for (int ii = 0; ii < n; ii += vecLen) {
 9
10     // Temporary storage for the group indices of one strip
11     int histIdx[vecLen] __attribute__((aligned(64)));
12
13     // Vectorize the multiplication and rounding
14 #pragma vector aligned
15     for (int i = ii; i < ii + vecLen; i++)
16       histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
17
18     // Scattered memory access, does not get vectorized
19     for (int c = 0; c < vecLen; c++)
20       hist[histIdx[c]]++;
21   }
22 }

Listing 4.27: Vectorizable serial code to compute the histogram of the number of people in age groups.
The performance of the codes in Listing 4.26 and Listing 4.27 can be found in Figure 4.6. The vector code performance is the baseline for this example. The function computes the histogram for n=2^30 in 1.27 s on the host system.
Now that scalar optimization and vectorization have been implemented, let us proceed with the parallelization.
 1 void Histogram(const float* age, int* const hist, const int n,
 2                const float group_width, const int m) {
 3
 4   const int vecLen = 16;
 5   const float invGroupWidth = 1.0f/group_width;
 6
 7   // Distribute work across threads (n%vecLen==0 is assumed)
 8 #pragma omp parallel for schedule(guided)
 9   for (int ii = 0; ii < n; ii += vecLen) {
10     int histIdx[vecLen] __attribute__((aligned(64)));
11
12 #pragma vector aligned
13     for (int i = ii; i < ii + vecLen; i++)
14       histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
15
16     for (int c = 0; c < vecLen; c++)
17       // Protect the update of the shared histogram with an atomic operation
18 #pragma omp atomic
19       hist[histIdx[c]]++;
20   }
21 }
Listing 4.28: Parallel code to compute the histogram with atomic operations to avoid race conditions.
The third set of bars in Figure 4.6 reports the performance result for this code. On the host system, the execution time is 24.0 s and on the coprocessor, it is 37.7 s. This result shows that the use of atomic operations in this code is not a scalable solution. The parallel performance is, in fact, worse than the performance of the serial code in Listing 4.27. Atomic operations may be a viable solution if they are not used very frequently in an application; however, they are used too often in the histogram calculation. Another approach must be taken to parallelize this code, as discussed in the next section.
 1 void Histogram(const float* age, int* const hist, const int n,
 2                const float group_width, const int m) {
 3
 4   const int vecLen = 16;
 5   const float invGroupWidth = 1.0f/group_width;
 6
 7   // Spawn threads; each thread maintains a private copy of the histogram
 8 #pragma omp parallel
 9   {
10     int hist_priv[m];
11     hist_priv[:] = 0;
12
13     int histIdx[vecLen] __attribute__((aligned(64)));
14
15     // Distribute work across threads
16     // (n%vecLen==0 is assumed)
17 #pragma omp for schedule(guided)
18     for (int ii = 0; ii < n; ii += vecLen) {
19
20       // Vectorize the multiplication and rounding
21 #pragma vector aligned
22       for (int i = ii; i < ii + vecLen; i++)
23         histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
24
25       // Scattered memory access, does not get vectorized
26       for (int c = 0; c < vecLen; c++)
27         hist_priv[histIdx[c]]++;
28     }
29
30     // Reduce private copies into the global histogram
31     for (int c = 0; c < m; c++) {
32 #pragma omp atomic
33       hist[c] += hist_priv[c];
34     }
35   }
36 }

Listing 4.29: Parallel code to compute the histogram with private copies of shared data. Execution time for n=2^30 is 0.35 s, which is 13x faster than the serial code.
In Listing 4.29, threads are spawned with #pragma omp parallel before the loop begins. The variable int hist_priv[m] declared within the scope of that pragma is automatically considered private to each thread. In the for-loop that follows in line 18, each thread writes to its own histogram copy, and no race conditions occur. Notice the absence of the word parallel in #pragma omp for in line 17: the loop is already inside a parallel region.
After the loop in line 18 is over, the reduction loop at the end of the listing is executed in each thread, accumulating the results of all calculations in the shared variable. Atomic operations must still be used here. However, they do not incur a significant overhead, because m is much smaller than n.
The optimized parallel code performs significantly faster than the serial code, completing the calculation for n=2^30 in 0.12 s on the host system, and in 0.07 s on the coprocessor. This benchmark produces optimal results on the host when hyper-threading is not used, i.e., the variable OMP_NUM_THREADS is set to 16 on the host system with two eight-core Intel Xeon processors.
Figure 4.6 summarizes the effect of optimization of the histogram calculation code discussed in this
section.
[Figure 4.6: bar chart, time in seconds (lower is better); readable values: 37.70 s, 24.00 s, 9.23 s, 5.06 s, 1.27 s, 0.12 s and 0.07 s for the four code versions on the host system and the coprocessor.]
Figure 4.6: The performance of the histogram calculation for n=2^30 and m=5 using the codes in Listing 4.26 ("Scalar Serial Code"), Listing 4.27 ("Vectorized Serial Code"), Listing 4.28 ("Vectorized Parallel Code (Atomic Operations)") and Listing 4.29 ("Vectorized Parallel Code (Private Variables)"). The third case (atomic operations) is severely bottlenecked by synchronization in the atomic mutex. The amount of synchronization is significantly reduced in the fourth case through the use of private variables.
Exercise Section A.4.7 contains working code samples and a more general version of the code.
Figure 4.7: Illustration of false sharing in parallel architectures with cache coherency.
 1 #include <omp.h>
 2
 3 void Histogram(const float* age, int* const hist, const int n,
 4                const float group_width, const int m) {
 5
 6   const int vecLen = 16;
 7   const float invGroupWidth = 1.0f/group_width;
 8   const int nThreads = omp_get_max_threads();
 9   // Shared histogram with a private section for each thread
10   int hist_thr[nThreads][m];
11   hist_thr[:][:] = 0;
12
13 #pragma omp parallel
14   {
15     // Get the number of this thread
16     const int iThread = omp_get_thread_num();
17
18     int histIdx[vecLen] __attribute__((aligned(64)));
19
20     // Distribute work across threads
21 #pragma omp for schedule(guided)
22     for (int ii = 0; ii < n; ii += vecLen) {
23
24 #pragma vector aligned
25       for (int i = ii; i < ii + vecLen; i++)
26         histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
27
28       for (int c = 0; c < vecLen; c++)
29         // Each thread writes only to its own section of hist_thr
30         hist_thr[iThread][histIdx[c]]++;
31     }
32   }
33
34   // Reducing results from all threads to the common histogram hist
35   for (int iThread = 0; iThread < nThreads; iThread++)
36     hist[0:m] += hist_thr[iThread][0:m];
37 }
Listing 4.30: This code computes a histogram for data in array age. This code is vectorized and it utilizes thread
parallelism, like the code in Listing 4.29. However, unlike the code in Listing 4.29, this code has no synchronization at all.
Race condition is avoided by introducing an array of histograms hist_thr, one histogram for each thread. However,
with m=5, false sharing occurs when threads access their regions of the shared array hist_thr.
The code in Listing 4.30 may look like it should work similarly to the code in Listing 4.29, because each thread accesses its own region of memory. At first glance, this method is even better than the method with private variables illustrated in Listing 4.29, because there are no mutexes at all in this implementation. However, in practice, the code in Listing 4.30 exhibits poor performance on the host system (see Figure 4.8) and on the coprocessor.
The cause of the performance degradation is that the value m=5 is rather small, and therefore the array elements hist_thr[0][:] are within m*sizeof(int)=20 bytes of the array elements hist_thr[1][:]. Therefore, when thread 0 and thread 1 access their elements simultaneously, there is a chance of hitting the same cache line or the same block of the coherent L1 cache, which results in one of the threads having to wait until the other thread releases that cache line.
 1 #include <omp.h>
 2
 3 void Histogram(const float* age, int* const hist, const int n,
 4                const float group_width, const int m) {
 5
 6   const int vecLen = 16;
 7   const float invGroupWidth = 1.0f/group_width;
 8   const int nThreads = omp_get_max_threads();
 9   // Padding for hist_thr[][] in order to avoid a situation
10   // where two (or more) rows share a cache line.
11   const int paddingBytes = 64;
12   const int paddingElements = paddingBytes / sizeof(int);
13   const int mPadded = m + (paddingElements-m%paddingElements);
14   // Shared histogram with a private section for each thread
15   int hist_thr[nThreads][mPadded];
16   hist_thr[:][:] = 0;
17
18 #pragma omp parallel
19   {
20     // Get the number of this thread
21     const int iThread = omp_get_thread_num();
22
23     int histIdx[vecLen] __attribute__((aligned(64)));
24
25     // Distribute work across threads
26 #pragma omp for schedule(guided)
27     for (int ii = 0; ii < n; ii += vecLen) {
28 #pragma vector aligned
29       for (int i = ii; i < ii + vecLen; i++)
30         histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
31
32       for (int c = 0; c < vecLen; c++)
33         hist_thr[iThread][histIdx[c]]++;
34     }
35   }
36
37   for (int iThread = 0; iThread < nThreads; iThread++)
38     hist[0:m] += hist_thr[iThread][0:m];
39
40 }
Listing 4.31: Increasing the size of the inner dimension of the array hist_thr eliminates false sharing.
Listing 4.31 shows how padding can be done in the case considered above. The only difference between
Listing 4.31 and Listing 4.30 is that the inner dimension of hist_thr is now mPadded instead of m.
Increasing the size of the inner dimension from m to mPadded separates the memory regions in which
different threads operate. This reduces the penalty paid for cache coherence, which the processor must enforce
even though it is not required in this case. As a result, false sharing is eliminated.
Figure 4.8 summarizes the performance results of the code in Listing 4.30 and Listing 4.31. The latter
code was compiled and benchmarked in three variations: with paddingBytes=64, 128 and 256. For the
last case, the performance of the code is restored to the performance of the baseline code.
[Figure 4.8: bar chart, time in seconds; readable values: 0.720 s, 0.369 s, 0.270 s, 0.116 s, 0.114 s, 0.073 s, 0.068 s, 0.067 s and 0.067 s.]
Figure 4.8: The performance of the histogram calculation for n=2^30 and m=5 using the codes in Listing 4.29 ("Baseline: Parallel Code"), Listing 4.30 ("Poor Performance: False Sharing") and Listing 4.31 ("Padding to 64/128/256 bytes").
More information on false sharing can be found in the article [57] by Nicholas Butler.
4.4.3 Load Imbalance. Solution: Load Scheduling and Grain Size Specification
In Section 3.2.3 we discussed how parallel loops can be executed in different scheduling modes. Specifi-
cally, in OpenMP, static, dynamic and guided modes are available, and scheduling granularity can also
be specified. Choosing a scheduling mode is often a trade-off. Lightweight, coarse-grained scheduling modes
incur little overhead, but may lead to load imbalance. On the other hand, complex, fine-grained scheduling
modes can improve load balance, but may introduce a significant scheduling overhead.
Consider a parallel for-loop that calls a thread-safe serial function in every iteration, as shown in Listing 4.32.
Listing 4.32: Sample parallel loop calling a serial function, the execution time of which varies from iteration to iteration.
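A minimal sketch of such a loop is shown below; DoUnevenWork() and the array result are illustrative stand-ins for the serial function and its output, not the exact code of Listing 4.32:

#pragma omp parallel for
for (int i = 0; i < n; i++)
  result[i] = DoUnevenWork(i);  // thread-safe serial function; run time varies per call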
Suppose that the execution time of the function varies significantly from call to call. Such an application is prone to load imbalance, because some of the parallel threads may be "lucky" to get a quick workload, while other threads may struggle with a more expensive task. "Lucky" threads will have to wait for all other threads, because the application cannot proceed further until all of the loop iterations are processed. In order to improve the performance, we can specify a scheduling mode and a grain size, as in Listing 4.33.
Listing 4.33: Sample parallel loop calling a serial function, the execution time of which varies from iteration to iteration.
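The corresponding sketch with the scheduling mode and grain size specified (using the same illustrative DoUnevenWork()):

#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < n; i++)
  result[i] = DoUnevenWork(i);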
Here, the dynamic scheduling mode indicates that the iteration space must be split into chunks of length 4, and these chunks must be distributed across the available threads. As threads finish their task, they receive another chunk of the problem to work on. Other scheduling modes are static, where iterations are distributed across threads before the calculations begin, and guided, which is analogous to dynamic, except that the chunk size starts large and is gradually reduced toward the end of the calculation.
The grain size of scheduling in Listing 4.33 is chosen as 4. Choosing the grain size is a trade-off: with too small a grain size, too much communication between the threads and the scheduler may occur, and the application may be slowed down by the task of scheduling; with too large a grain size, load balancing may be limited. In order to be effective, the grain size must be greater than 1 and smaller than n/T, where n is the number of loop iterations and T is the number of parallel threads.
20   int iterations = 0;
21   do {
22     iterations++;
23     // Jacobi method
24     for (int i = 0; i < n; i++) {
25       double c = 0.0;
26 #pragma vector aligned
30     }
31
32     // Verification
33     bTrial[:] = 0.0;
34
35 #pragma vector aligned
38   }
40
42   return iterations;
43 }

Listing 4.34: An iterative Jacobi solver for nonhomogeneous systems of linear algebraic equations. The number of iterations depends on the initial value of the vector x and on the requested accuracy minAccuracy.
Listing 4.35: Parallel loop that calls the Jacobi solver with different vectors b and requests a different accuracy for every call. This loop exhibits load imbalance if the scheduling mode is not specified.
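A minimal sketch of such a driver loop; the solver interface (IterativeSolver() and the arrays b, x, minAccuracy and iterations) is an assumption made for illustration and is not the exact code of Listing 4.35:

#pragma omp parallel for
for (int c = 0; c < nVectors; c++)   // one Jacobi solve per iteration
  iterations[c] = IterativeSolver(n, M, x[c], b[c], minAccuracy[c]);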
We benchmarked the Jacobi solver code with various settings for the loop scheduling mode. The results
can be found in Figure 4.9.
[Figure 4.9: bar chart, time in seconds (lower is better), comparing the default scheduling mode, static/dynamic/guided scheduling with various grain sizes, and Cilk Plus, on the host system and the coprocessor.]

Figure 4.9: Performance of the parallel loop executing the Jacobi solver (Listing 4.34 and Listing 4.35) for a set of vectors b with various OpenMP loop scheduling modes.
For the default (unspecified) scheduling mode, the performance on the host system is almost 2x worse than in the case of dynamic or guided scheduling, which exhibit the best performance. This is explained by the fact that we intentionally randomized the accuracy requirement for each call to the solver. With dynamic or guided scheduling, threads that were "lucky" to get low-accuracy calculations are loaded with additional work. On the coprocessor, the performance was optimal for the default scheduling mode.
The grain size for dynamic scheduling in this problem has a "sweet spot" at the value of 4. However, with guided scheduling, a grain size of 1 works as well as a grain size of 4. That is because guided scheduling reduces the scheduling overhead by gradually reducing the grain size from a large value down to the grain size specified by the user. A large grain size has a significant negative effect on performance, because it constitutes a significant fraction of the iteration space. This is true for both the host system and the coprocessor, but it is more pronounced on the latter, due to the greater number of threads.
The performance of the _Cilk_for loop on the coprocessor fluctuated greatly (almost by a factor of two) from one trial to another, and the average performance is reported on the plot.
Figure 4.10: Concurrency profile of the Jacobi solver (Listing 4.34 and Listing 4.35) on the host system in Intel VTune Amplifier XE. 10 instances of the parallel loop were run with default scheduling. Load imbalance can be seen at the end of each loop instance.
Principles: Strip-Mining
Strip-mining is a technique that applies to vectorized loops operating on one-dimensional arrays. This technique makes it possible to parallelize such loops while retaining vectorization. When this technique is applied, a single loop operating on a one-dimensional array is converted into two nested loops. The outer loop strides through "strips" of the array, and the inner loop operates on the data inside the strip ("mining" it). Sometimes, this technique is used by the compiler "behind the scenes" in order to apply thread parallelism to vectorized loops. However, in some cases, explicit application of strip-mining may be necessary. For example, when nested loops are collapsed (see Listing 4.40 and Listing 4.44), the compiler may be unable to automatically vectorize the loops. Listing 4.36 demonstrates the strip-mining transformation.
 1 // Compiler may be able to simultaneously parallelize and auto-vectorize this loop
 2 #pragma omp parallel for
 3 #pragma simd
 4 for (int i = 0; i < n; i++) {
 5   // ... do work
 6 }
 7
 8 // The strip-mining technique separates parallelization from vectorization
 9 const int STRIP=80;
10 #pragma omp parallel for
11 for (int ii = 0; ii < n; ii += STRIP) {
12 #pragma simd
13   for (int i = ii; i < ii + STRIP; i++) {
14     // ... do work
15   }
16 }

Listing 4.36: The strip-mining technique is usually implemented by the compiler "behind the scenes". However, it is easy to implement it manually, as shown in this listing.
The size of the strip must usually be chosen as a multiple of the SIMD vector length in order to facilitate
the vectorization of the inner loop. Furthermore, if the iteration count n is not a multiple of the strip size, then
the programmer must peel off n%STRIP iterations at the end of the loop.
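A minimal sketch of such peeling, assuming the same STRIP=80 as in Listing 4.36 and a trivial loop body standing in for "... do work":

void process_with_peel(const int n, float* A) {
  const int STRIP = 80;                       // as in Listing 4.36
  const int nFull = n - n % STRIP;            // iterations covered by complete strips
#pragma omp parallel for
  for (int ii = 0; ii < nFull; ii += STRIP)
#pragma simd
    for (int i = ii; i < ii + STRIP; i++)
      A[i] += 1.0f;                           // trivial stand-in for "... do work"
  // Peel loop: the remaining n%STRIP iterations
  for (int i = nFull; i < n; i++)
    A[i] += 1.0f;
}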
Loop collapse is a technique that converts two nested loops into a single loop. This technique can
be applied either automatically (for example, using the collapse clause of #pragma omp for), or
explicitly. Loop collapse is demonstrated in Listing 4.37.
 1 // The first example below does not use loop collapse
 2 #pragma omp parallel for
 3 for (int i = 0; i < m; i++) {
 4   for (int j = 0; j < n; j++) {
 5     // ... do work
 6   }
 7 }
 8
 9 // The second example relies on the collapse clause of OpenMP
10 #pragma omp parallel for collapse(2)
11 for (int i = 0; i < m; i++) {
12   for (int j = 0; j < n; j++) {
13     // ... do work
14   }
15 }
16
17 // The example below demonstrates explicit loop collapse.
18 #pragma omp parallel for
19 for (int c = 0; c < m*n; c++) {
20   // Recover the original loop indices from the collapsed index c
21   const int i = c / n;
22   const int j = c % n;
23   // ... do work
24 }

Listing 4.37: Loop collapse exposes more thread parallelism in nested loops, rowsum_unoptimized.cc. The first piece of code does not use loop collapse; the second relies on the automatic loop collapse functionality of OpenMP, and the third implements the loop collapse explicitly.
Consider the problem of performing a reduction (sum, average, or another cumulative characteristic) along the rows of a matrix M[m][n], as illustrated by the equation below:

S_i = \sum_{j=0}^{n} M_{ij}, \qquad i = 0 \ldots m.    (4.3)

Assume that m is small (smaller than the number of threads in the system), and n is large (large enough so that the matrix does not fit into cache). A straightforward implementation of summing the elements of each row is shown in Listing 4.38.
 1 #include <omp.h>
 2 #include <stdio.h>
 3 #include <math.h>
 4 #include <malloc.h>
 5
 6 void sum_unoptimized(const int m, const int n, long* M, long* s){
 7 #pragma omp parallel for
 8   for (int i=0; i<m; i++) {
 9     long sum=0;
10 #pragma simd
11 #pragma vector aligned
12     for (int j=0; j<n; j++)
13       sum+=M[i*n+j];
14     s[i]=sum;
15   }
16 }
17
18 int main(){
19   const int n=100000000, m=4; // problem size
20   long* M=(long*)_mm_malloc(sizeof(long)*m*n, 64);
21   long* s=(long*)_mm_malloc(sizeof(long)*m, 64);
22
23   printf("Problem size: %.3f GB, outer dimension: %d, threads: %d\n",
24          (double)(sizeof(long))*(double)(n)*(double)m/(double)(1<<30),
25          m, omp_get_max_threads());
26
27   const int nl=10;
28   double t=0, dt=0;
29
30   for (int i=0; i<m*n; i++) M[i]=0;
31
32   for (int l=0; l<nl; l++) {
33     const double t0=omp_get_wtime();
34     sum_unoptimized(m, n, M, s);
35     const double t1=omp_get_wtime();
36     if (l >= 2) { // exclude the first two (warm-up) trials from the statistics
37       t +=(t1-t0)/(double)(nl-2);
38       dt+=(t1-t0)*(t1-t0)/(double)(nl-2);
39     }
40   }
41   dt=sqrt(dt-t*t);
42
43   printf("Time: %.3f +/- %.3f seconds\n", t, dt); // report the benchmark statistics
44   _mm_free(M); _mm_free(s);
45 }
Listing 4.38: Function sum_unoptimized() calculates the sum of the elements in each row of matrix M. When the
number of rows, m, is smaller than the number of threads in the system, the performance of this loop suffers from a low
degree of parallelism.
This implementation suffers from insufficient parallelism, because m is too small to keep all cores
occupied. In fact, this is a bandwidth-bound problem, because memory access has a regular pattern, and the
arithmetic intensity is equal to 1. Therefore, the performance concern is utilizing all memory controllers,
rather than all cores. There are 16 memory controllers in the KNC architecture. The performance of this
code, measured on the host system of two Intel Xeon E5-2680 processors, and on an Intel Xeon Phi 5110P
coprocessor, is shown in Listing 4.39.
Listing 4.39: Baseline performance of the row-wise matrix reduction code shown in Listing 4.38.
Insufficient parallelism may be seen in the VTune profile of the application captured in Figure 4.12. The concurrency histogram (top panel) shows that for the bulk of the runtime, only 4 threads or fewer (out of the available 32) were running. The timeline (bottom panel) shows only 4 horizontal lines with dark green patches. This illustrates the same problem with insufficient parallelism as the concurrency histogram.
In order to improve the performance of this application, the amount of exploitable parallelism in the code must be expanded. In the remainder of this section, we will implement three optimization techniques:
1. First, we will try to move the parallel code into the inner loop, which has more iterations.
2. Second, we will collapse the nested loops in order to expose the whole iteration space to thread parallelism.
3. Third, we will strip-mine the inner loop and collapse the outer loops, which enables thread parallelism and automatic vectorization at the same time.
Figure 4.12: Thread concurrency profile of host implementation of the row-wise matrix reduction code with insufficient
parallelism (Listing 4.38). Top panel: concurrency histogram, bottom panel: timeline with the threads panel expanded.
Listing 4.40: rowsum_inner.cc improves the performance of rowsum_unoptimized.cc shown in Listing 4.38 by parallelizing the inner loop instead of the outer loop. This optimization improves the parallel scalability, but does not achieve the best performance.
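The body of rowsum_inner.cc is sketched below under the assumption that it mirrors Listing 4.38 with the OpenMP pragma moved to the inner loop and a scalar reduction; the function name and exact pragmas of the original may differ:

void sum_inner(const int m, const int n, long* M, long* s){
  for (int i=0; i<m; i++) {                   // the short outer loop is now serial
    long sum=0;
#pragma omp parallel for reduction(+: sum)    // the long inner loop is distributed across threads
    for (int j=0; j<n; j++)
      sum+=M[i*n+j];
    s[i]=sum;
  }
}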
Listing 4.41 demonstrates the performance of the code with the parallelized inner loop.
user@host$ ./rowsum_inner
Inner loop parallelized: 0.083 +/- 0.001 seconds (38.65 +/- 0.43 GB/s)
user@host$

Inner loop parallelized: 0.038 +/- 0.000 seconds (84.89 +/- 0.29 GB/s)
user@mic0$

Listing 4.41: Performance of the row-wise matrix reduction code with the parallelized inner loop.
With parallel inner loop, the performance on the coprocessor has improved, while the performance on the
host system dropped. Performance increase on the coprocessor is explained by the fact that with more parallel
threads operating, the memory controllers are not as severely under-utilized as in the unoptimized version.
However, the performance drop on the host indicates that the code is still not optimal. The host does not benefit
from additional parallelism as much as the coprocessor, because the host has fewer memory controllers, and
therefore, it takes fewer threads to utilize the memory bandwidth on the host. However, the host performance
is impeded in the new version of the code because OpenMP threads are spawned for every i-iteration, which
incurs parallelization overhead. Additionally, when the inner loop is parallelized, the OpenMP library does not
see the whole scope of the data processed by the problem, and has less freedom for optimal load scheduling.
Even though we observed a performance increase on the coprocessor, we will mark this method as
sub-optimal for this problem because of the problems stated above.
void sum_collapse(const int m, const int n, long* M, long* s){
  s[0:m]=0;
#pragma omp parallel
  {
    // Private partial sums for each thread
    long sum[m];
    sum[0:m]=0;
#pragma omp for collapse(2)
    for (int i=0; i<m; i++)
      for (int j=0; j<n; j++)
        sum[i]+=M[i*n+j];

    // Reduce the partial sums into the output array s.
    // Arrays cannot be declared as reducers in pragma omp,
    // and so the reduction must be programmed explicitly.
    for (int i=0; i<m; i++)
#pragma omp atomic
      s[i]+=sum[i];
  }
}
Listing 4.42: rowsum_collapse.cc attempts to expand the iteration space of the row-wise matrix reduction problem by collapsing nested loops. This gives OpenMP more freedom for load balancing, but precludes automatic vectorization.
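A sketch of the loop collapse approach (an illustration, not the exact contents of rowsum_collapse.cc) follows; M, s, m and n are assumed to have the same meaning as before, and the explicit reduction matches the fragment shown above:

  #pragma omp parallel
  {
    // Per-thread partial sums of each row (m is small in this problem)
    double* sum = new double[m];
    for (int i = 0; i < m; i++) sum[i] = 0.0;

    // The collapsed loop exposes m*n iterations to the thread scheduler
  #pragma omp for collapse(2)
    for (int i = 0; i < m; i++)
      for (int j = 0; j < n; j++)
        sum[i] += M[i*n + j];

    // Explicit reduction into s[], as in the fragment above
    for (int i = 0; i < m; i++)
  #pragma omp atomic
      s[i] += sum[i];

    delete[] sum;
  }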
Collapse nested loops: 0.113 +/- 0.000 seconds (28.33 +/- 0.01 GB/s)
user@host$
user@host$ icpc -openmp rowsum_collapse.cc -o rowsum_collapse_mic -mmic
Listing 4.43: Performance of the row-wise matrix reduction code with collapsed nested loops.
While the collapse(2) clause makes OpenMP expand the iteration space across the two loops, the code runs slowly on both the host system and the coprocessor. This happens because vectorization fails in this case, which can be verified by compiling the code with the argument -vec-report3. The inclusion of #pragma simd does not help to enforce vectorization.
Even though we did not achieve optimal performance with this optimization, we are on the right track, because we expose the most parallelism to the compiler. In the next optimization step, we will strip-mine the inner loop while applying the loop collapse pragma to the outer loops. This will enable automatic vectorization and, at the same time, expose the whole iteration space to thread parallelism.
12
13 #pragma simd
14 #pragma vector aligned
15       for (int j=jj; j<jj+STRIP; j++)
16         sum[i]+=M[i*n+j];
17
18   // Reduction
19   for (int i=0; i<m; i++)
20 #pragma omp atomic
21     s[i]+=sum[i];
22   }
23 }
Listing 4.44: rowsum_stripmine.cc collapses the outer loops of the row-wise matrix reduction and strip-mines the inner loop. This allows OpenMP to balance the load across available threads, while automatic vectorization succeeds in the strip-mined inner loop.
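Only the tail of rowsum_stripmine.cc is shown above; a sketch of the overall strip-mine-and-collapse structure (under the same assumptions about M, s, m and n, with a hypothetical STRIP constant that is a multiple of the SIMD vector length and divides n) could look like this:

  const int STRIP = 800;                 // hypothetical strip size; n%STRIP==0 assumed
  #pragma omp parallel
  {
    double* sum = new double[m];
    for (int i = 0; i < m; i++) sum[i] = 0.0;

    // The collapsed (i, jj) loop exposes m*(n/STRIP) iterations to the threads,
    // while the short j-loop over one strip is vectorized.
  #pragma omp for collapse(2)
    for (int i = 0; i < m; i++)
      for (int jj = 0; jj < n; jj += STRIP) {
  #pragma simd
  #pragma vector aligned
        for (int j = jj; j < jj+STRIP; j++)
          sum[i] += M[i*n + j];
      }

    // Explicit reduction, cf. the fragment above
    for (int i = 0; i < m; i++)
  #pragma omp atomic
      s[i] += sum[i];

    delete[] sum;
  }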
user@host$ ./rowsum_stripmine
Listing 4.45: Performance of the row-wise matrix reduction code with collapsed nested loops and strip-mined inner loop.
Evidently, this optimization is the best of all the cases considered so far, both for the host and for the coprocessor. The success of this version is explained by the fact that parallelism was exposed to the compiler at two levels: a vectorizable inner loop and parallelizable outer loops with a large iteration count.
Note that for optimal performance, the value of the parameter STRIP must be a multiple of the SIMD vector length, considerably greater than the SIMD vector length, and much smaller than the array length n. The code in Listing 4.44 assumes that n is a multiple of STRIP. It is easy to relax this condition with an additional peel loop; see, e.g., Listing 4.58.
Figure 4.13 contains a summary of the performance of all versions of the row-wise matrix reduction algorithm considered in this section. Note that the metric shown in this plot is the effective memory bandwidth achieved by the algorithm, in GB/s. In this plot, greater bandwidth means better performance.
Figure 4.13: Performance of all versions of the row-wise matrix reduction code (Listing 4.38, Listing 4.40, Listing 4.42 and Listing 4.44) on the host system and on the Intel Xeon Phi coprocessor.
It is also informative to consider the concurrency profile of the optimized code in VTune. The profiling results are shown in Figure 4.14. Comparing them to the profile of the unoptimized code shown in Figure 4.12, we see that all 32 threads were occupied on the host for the bulk of the execution time, which indicates a high degree of parallelism.
Figure 4.14: Thread concurrency profile of host implementation of the optimized row-wise matrix reduction code
(Listing 4.44). Top panel: concurrency histogram, bottom panel: timeline with the threads panel expanded.
Thread affinity, i.e., the binding of software threads to specific logical processors, is important in shared-memory applications for several reasons:

1. In HPC applications that utilize the whole system, OpenMP threads may migrate from one core to another according to the decisions of the OS. This migration leads to performance penalties; for example, a migrated thread may have to re-fetch the cache contents into the new core's L1 cache. Using thread affinity, the programmer can forbid thread migration and thus improve performance;

2. For memory bandwidth-bound codes, the optimum number of threads on Intel systems is usually equal to or smaller than the number of physical cores. In other words, hyper-threading is counter-productive for bandwidth-bound codes. However, for optimum performance, software threads must be distributed across different physical cores rather than share the logical cores of the same physical core. Placing the threads on different physical cores makes it possible to utilize all available memory controllers efficiently. Setting the corresponding thread affinity pattern helps to achieve such a thread distribution;

3. For compute-bound codes with hyper-threading, application performance may be improved by placing threads operating on adjacent data onto the same physical core, so that they may share the data in the local L2 cache slice on the Intel Xeon Phi coprocessor. This task may be accomplished by setting thread affinity and orchestrating the load distribution across OpenMP threads accordingly;

4. In applications using Intel Xeon Phi coprocessors in the offload mode, it is preferable to exclude cores 0-3 from the affinity mask of the calculation. These cores are used by the uOS for offload management, and thread contention on these cores may slow down the whole application;

5. When several independent processes are running on a Non-Uniform Memory Access (NUMA) system, sharing its resources, it is beneficial to keep each process assigned to a specific core or processor. This facilitates local data allocation and access in NUMA systems and benefits performance. Thread affinity can be used to effectively partition the system and bind each process to the respective local resources.
Thread affinity in OpenMP applications can be controlled at the application level by setting the environ-
ment variable KMP_AFFINITY [58]. The format of the variable is
KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]
modifier — Among other modifiers (such as granularity), the modifier proclist=<proc_list> explicitly maps OpenMP threads to OS procs. The format of <proc_list> is a comma-separated string containing the numerical identifiers of OS procs or their ranges, and float lists enclosed in brackets {}. Example: proclist=[7,4-6,{0,1,2,3}] maps OpenMP thread 0 to OS proc 7, threads 1, 2 and 3 to procs 4, 5 and 6, respectively, and thread 4 is allowed to float between procs 0, 1, 2 and 3.

type (default: none) — type=compact assigns each OpenMP thread to a thread context as close as possible to the context of the previously assigned thread. This type is beneficial for compute-intensive calculations. type=scatter is the opposite of compact: OpenMP threads are distributed as evenly as possible across the system. This type is beneficial for bandwidth-bound applications.

permute (default: 0) — For compact and scatter affinity maps, controls which levels are most significant when sorting the machine topology map. A value for permute forces the mappings to make the specified number of most significant levels of the sort the least significant, and it inverts the order of significance. The root node of the tree is not considered a separate level for the sort operations.

offset (default: 0) — Indicates the starting position (proc ID) for thread assignment.
Table 4.3: Arguments of the KMP_AFFINITY environment variable. This summary table is based on the complete
description in the Intel C++ Compiler Reference Guide [58].
Note that in order to use different affinity masks on the host and on coprocessors with offload applications,
the environment variable MIC_ENV_PREFIX may be set (see Section 2.2.5). For example, the setup in
Listing 4.46 results in affinity type compact on the host and balanced on the coprocessor.
Listing 4.46: Using MIC_ENV_PREFIX to set different affinity masks on the host and on the coprocessor.
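A minimal sketch of this kind of setup (an illustration rather than the exact contents of Listing 4.46), assuming the prefix value MIC so that host variables named MIC_* are forwarded to the coprocessor environment, could be:

user@host$ export MIC_ENV_PREFIX=MIC
user@host$ export KMP_AFFINITY=compact
user@host$ export MIC_KMP_AFFINITY=balanced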
In most cases, affinity types compact, scatter or balanced, possibly combined with an offset, make it possible to set up an efficient thread affinity mask. The examples below illustrate the usage of KMP_AFFINITY in commonly encountered cases.
In applications bound by memory bandwidth, it is usually beneficial to use one thread per core or fewer on the host system and two threads per core or fewer on the coprocessor. This reduces thread contention on memory controllers. Additionally, affinity type scatter can be used to improve the effective bandwidth, because all memory controllers are then utilized uniformly.
Consider the row-wise matrix reduction code from Section 4.4.4. The code shown in Listing 4.44 demonstrated optimal performance on the host and the coprocessor. However, running this code, one may notice that the performance fluctuates from run to run. It is possible to fix the performance at its maximum by setting the thread affinity, as demonstrated in Listing 4.47.
Strip-mine and collapse: 0.059 +/- 0.002 seconds (54.11 +/- 1.56 GB/s)
Strip-mine and collapse: 0.077 +/- 0.001 seconds (41.71 +/- 0.69 GB/s)
Strip-mine and collapse: 0.079 +/- 0.002 seconds (40.42 +/- 1.01 GB/s)
Strip-mine and collapse: 0.070 +/- 0.005 seconds (45.59 +/- 3.14 GB/s)
Strip-mine and collapse: 0.077 +/- 0.001 seconds (41.43 +/- 0.75 GB/s)
user@host$
user@host$ export OMP_NUM_THREADS=16
user@host$ export KMP_AFFINITY=scatter
user@host$ ./rowsum_stripmine; for i in {1..5}; do ./rowsum_stripmine | tail -1 ; done
Problem size: 2.980 GB, outer dimension: 4, threads: 16
Strip-mine and collapse: 0.059 +/- 0.004 seconds (54.47 +/- 3.25 GB/s)
Strip-mine and collapse: 0.059 +/- 0.002 seconds (54.01 +/- 1.81 GB/s)
Strip-mine and collapse: 0.061 +/- 0.004 seconds (52.30 +/- 3.30 GB/s)
Strip-mine and collapse: 0.062 +/- 0.005 seconds (51.37 +/- 4.29 GB/s)
Strip-mine and collapse: 0.060 +/- 0.002 seconds (53.59 +/- 2.13 GB/s)
Strip-mine and collapse: 0.058 +/- 0.001 seconds (55.48 +/- 1.27 GB/s)
user@host$
Listing 4.47: Setting a thread affinity of type scatter in order to improve the performance of bandwidth-bound applica-
tion. Notice how the bandwidth fluctuates without thread affinity, but remains high with KMP_AFFINITY=scatter.
An additional case study of affinity setting in bandwidth-bound applications on a four-way NUMA system can be found in the Colfax Research publication [59].
1 #include <mkl.h>
2 #include <stdio.h>
3 #include <omp.h>
4
5 int main() {
6   const int N = 10000; const int Nld = N+64;
7   const char tr='N'; const double v=1.0;
8   double* A = (double*)_mm_malloc(sizeof(double)*N*Nld, 64);
9   double* B = (double*)_mm_malloc(sizeof(double)*N*Nld, 64);
10   double* C = (double*)_mm_malloc(sizeof(double)*N*Nld, 64);
11   _Cilk_for (int i = 0; i < N*Nld; i++) A[i] = B[i] = C[i] = 0.0f;
12   int nIter = 10;
13   for(int k = 0; k < nIter; k++)
14   {
15     double t1 = omp_get_wtime();
16     dgemm(&tr, &tr, &N, &N, &N, &v, A, &Nld, B, &Nld, &v, C, &N);
17     double t2 = omp_get_wtime();
18     double flopsNow = (2.0*N*N*N+1.0*N*N)*1e-9/(t2-t1);
19     printf("Iteration %i: %.1f GFLOP/s\n", k+1, flopsNow);
20   }
21   _mm_free(A); _mm_free(B); _mm_free(C);
22 }
Listing 4.48: Code bench-dgemm.cc, a benchmark of the DGEMM function in Intel MKL.
In Listing 4.49, the code bench-dgemm.cc is compiled as a native application for Intel Xeon Phi coprocessors and executed on the coprocessor. First, the application is executed without an affinity mask. Then the affinity mask is specified by passing the environment variable KMP_AFFINITY=compact to the application.
Listing 4.49: Compiling the DGEMM benchmark bench-dgemm.cc as a native Intel Xeon Phi coprocessor application
and running it on the coprocessor in two modes: without an affinity mask, and with an affinity mask of type compact.
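A representative command sequence for this kind of experiment (an illustration with hypothetical file names, not necessarily the exact commands of Listing 4.49) is:

user@host$ icpc -openmp -mkl bench-dgemm.cc -o bench-dgemm-mic -mmic
user@host$ scp bench-dgemm-mic mic0:~/
# assuming the MKL and OpenMP runtime libraries for MIC are already available on the coprocessor
user@mic0$ ./bench-dgemm-mic
user@mic0$ KMP_AFFINITY=compact ./bench-dgemm-mic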
1 #include <omp.h>
2 #include <math.h>
3 #include <stdio.h>
4 #include "mkl_dfti.h"
5
6 int main(int argc, char** argv) {
7 const size_t n = 1L<<30L;
8   const char* def = "(single instance)";
9   const char* inst = (argc < 2 ? def : argv[1]);
10   const double flopsPerTransfer = 2.5*log2((double)n)*n;
11   float *x = (float*)malloc(sizeof(float)*n);
12   _Cilk_for(int i = 0; i < n; i++) x[i] = 1.0f;
13   DFTI_DESCRIPTOR_HANDLE fftHandle;
14   MKL_LONG size = n;
15   DftiCreateDescriptor ( &fftHandle, DFTI_SINGLE, DFTI_REAL, 1, size);
16   DftiCommitDescriptor ( fftHandle );
17   for (int t = 0; t < 5; t++) {
18     const double t1 = omp_get_wtime();
19
20     DftiComputeForward ( fftHandle, x );
21     const double t2 = omp_get_wtime();
22     const double gflops = flopsPerTransfer*1e-9/(t2-t1);
23     printf("Instance %s, iteration %d: %.3f ms (%.1f GFLOP/s)\n",
24            inst, t+1, 1e3*(t2-t1), gflops);
25   }
26   DftiFreeDescriptor ( &fftHandle );
27   free(x);
28 }
Listing 4.50: The code in bench-fft.cc computes a large one-dimensional Discrete Fast Fourier Transform (DFFT).
Listing 4.51: Running the FFT benchmark bench-fft.cc using all available logical cores.
In Listing 4.51, we benchmark this application using 32 threads, which is equal to the total number of
available logical cores on the system. The test platform is a system with two Intel Xeon processors, each
containing eight cores with two-way hyper-threading. We do not use thread affinity in this case.
Now suppose that we have to calculate a large number of such FFTs, and all calculations are completely
independent. We can speed up the calculation by using multiple processes with fewer threads, as shown in
Listing 4.52. This time, we are using a script called fftrun_noaffinity to launch two instances of the
application with 16 threads each.
export MKL_NUM_THREADS=16
export KMP_AFFINITY=
./bench-fft 1 &
./bench-fft 2 &
wait
user@host$ ./fftrun_noaffinity
Instance 2, iteration 1: 4347.238 ms (18.5 GFLOP/s)
Instance 1, iteration 1: 4703.092 ms (17.1 GFLOP/s)
Listing 4.52: Running the FFT benchmark bench-fft.cc as two processes with 16 threads each.
Note that the average performance of the code in the case of two 16-threaded processes is 18.5 GFLOP/s per process, which amounts to 37 GFLOP/s for the whole system. This is 25% better than the performance of a single 32-threaded process reported in Listing 4.51. The cause of the performance difference is the multi-threading and NUMA overhead, which is greater in a single process with 32 threads than in each of the 16-threaded processes.
In Listing 4.52, we did not restrict the affinity of the threads. This means that some threads were accessing
non-local NUMA memory, incurring a performance hit. In order to optimize the execution of the benchmark,
we can set the environment variable KMP_AFFINITY as shown in Listing 4.53.
The average performance now (Listing 4.53) is 19.4 GFLOP/s, which is 5% better than without the
affinity mask (Listing 4.52). The affinity mask requested here is of type compact, which places the threads
as close to one another as possible. With MKL_NUM_THREADS=16, all threads will end up on the same
CPU socket, because the CPU supports 16 hyper-thread contexts. The setting granularity=fine forbids
threads from moving across hyper-thread contexts. The numbers 0,0 and 0,16 tell the OpenMP library
to start placing threads from OS proc 0 (in the first case) or from OS proc 16 (in the second case). These
two numbers are the permute and offset arguments of KMP_AFFINITY. Note that we must specify
permute in order to set the value of offset, because if we specified only a single number, it would have
been interpreted as permute.
Instance 2, iteration 3: 4144.214 ms (19.4 GFLOP/s)
Instance 1, iteration 4: 4159.326 ms (19.4 GFLOP/s)
Instance 2, iteration 4: 4228.130 ms (19.0 GFLOP/s)
Instance 1, iteration 5: 4221.576 ms (19.1 GFLOP/s)
Instance 2, iteration 5: 4162.325 ms (19.3 GFLOP/s)
Listing 4.53: Running the FFT benchmark bench-fft.cc as two processes with 16 threads each, using the KMP_AFFINITY variable in order to bind each process to the respective CPU socket.
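A sketch of the corresponding launch script (a hypothetical fftrun_affinity; the affinity strings combine the modifier granularity=fine, the type compact, and the permute,offset pairs 0,0 and 0,16 discussed above) could look like:

export MKL_NUM_THREADS=16
KMP_AFFINITY=granularity=fine,compact,0,0  ./bench-fft 1 &
KMP_AFFINITY=granularity=fine,compact,0,16 ./bench-fft 2 &
wait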
If the available RAM size permits, it may be possible to further improve the performance of this problem by increasing the number of processes while proportionally reducing the number of threads per process. While doing so, it is beneficial to set the affinity mask of each process in order to reserve a certain partition of the system for that process.
This example and optimization method are motivated by an astrophysical research project reported in [60]. Note that the optimization demonstrated above is not specific to the computational problem of FFT; it applies to any other application on a NUMA system which can be run as multiple independent instances.
In the process of porting and optimizing applications on Intel Xeon processors and Intel Xeon Phi coprocessors, the programmer must ensure good parallel scalability of the application. On the host, applications must efficiently scale to 16-32 threads in order to harness the task parallelism of Intel Xeon processors. On a 60-core coprocessor, the application must scale well up to 120-240 threads in order to reap the benefits of parallelism. Excessive synchronization, insufficient exposed parallelism and false sharing may limit the parallel scalability and prevent performance gains on the Intel Xeon Phi architecture.
The following simple test may help to assess the need for shared-memory algorithm optimization. This test works in most OpenMP applications without any special tools. First, the application must be benchmarked with a single thread by setting OMP_NUM_THREADS=1. Then, the benchmark must be run with 32 OpenMP threads on the host and 240 threads (or 236 for offload applications) on the coprocessor. For a compute-bound application, the performance difference between the single-threaded and multi-threaded calculation should be a factor of 16 or more on the host, and a factor of 120 or more on the coprocessor. If the performance difference is not significant, it is worthwhile investigating the common shared-memory pitfalls discussed in Section 4.4.
The peak double-precision arithmetic performance of a 60-core Intel Xeon Phi coprocessor clocked at approximately 1.0 GHz can be estimated as

Arithmetic Performance = 60 cores × 1.0 GHz × (512/64) × 2 = 960 GFLOP/s.
Here, the factor 2 assumes that the fused multiply-add operation is employed, performing two floating-point
operations per cycle. At the same time, the peak memory bandwidth of this system performing 6.0 GT/s using
8 memory controllers with 2 channels each, working with 4 bytes per channel, is

Memory Bandwidth = η × 6.0 GT/s × 8 × 2 × 4 bytes = η × 384 GB/s,
where η is the practical efficiency of bandwidth accessibility, η ≈ 0.5. This amounts to 0.5 × 384/8 = 24
billion floating-point numbers per second (in double precision). Therefore, in order to sustain optimal load
on the arithmetic units of an Intel Xeon Phi coprocessor, the code must be tuned to perform no less than
960/24 = 40 floating-point operations on every number fetched from the main memory. In comparison, a
system based on two eight-core Intel Xeon E5 processors clocked at 3.0 GHz delivers up to

Arithmetic Performance = 2 sockets × 8 × 3.0 × (256/64) × 2 = 384 GFLOP/s, (4.6)

and a lower peak memory bandwidth. The additional factor of 2 in the estimate of performance (4.6) reflects the presence of two ALUs (Arithmetic Logic Units) in each Sandy Bridge processor. Even though this processor does not have an FMA instruction, xAXPY-like algorithms may favorably utilize the processor's pipeline and employ both ALUs.
Generally, the more arithmetic operations per memory access a code performs, the easier it is to fully utilize the arithmetic capabilities of the processor. That is, high arithmetic intensity applications tend to be compute-bound. In contrast, low arithmetic intensity applications are bandwidth-bound if they access memory in a streaming manner, or latency-bound if their memory access pattern is irregular.
This relationship between arithmetic intensity and attainable performance is conveniently visualized with the roofline model of Williams, Waterman & Patterson [61]. In order to build the model, the arithmetic intensity is plotted along the horizontal axis, and the attainable performance, in GFLOP/s, along the vertical axis; the peak memory bandwidth and the peak arithmetic performance of the system form the sloping and horizontal parts of the "roof", respectively. An application is then represented by a vertical column placed at the arithmetic intensity of that application and extending upwards until it hits the "roof" represented by the model. If the column hits the
sloping part of the roof (the bandwidth line), then the application is bandwidth-bound. Such an application may
be optimized by improving the memory access pattern to boost the bandwidth or by increasing the arithmetic
intensity. If the column hits the horizontal part of the roof (the performance line), then the application is
compute-bound. Therefore, it may be optimized by improving the arithmetic performance by vectorization,
utilization of specialized arithmetic instructions, or other arithmetics-related methods.
The roofline model can be extended by adding ceilings to the model. Figure 4.16 demonstrates an extended roofline model for the host system with two Intel Xeon E5 processors and for a single Intel Xeon Phi coprocessor. In this figure, an additional model is produced by introducing a realistic memory bandwidth efficiency η=50%. Additionally, we introduced a ceiling "without FMA" for the coprocessor and "one ALU" for the host. These ceilings correspond to applications that do not employ the fused multiply-add operation on the coprocessor, or do not fill the host processor pipeline in a fashion that utilizes both arithmetic logic units (ALUs) of Sandy Bridge processors. This assumption reduces the maximum arithmetic performance by approximately a factor of 2. Another ceiling additionally assumes that the application is scalar, i.e., does not use vector instructions. In double precision, this reduces the theoretical peak performance on the host by a factor of 4 and on the coprocessor by a factor of 8 (see Section 1.3.2 for additional discussion on this subject).
Figure 4.16: Extended roofline model for a host with two Intel Xeon E5 processors and for an Intel Xeon Phi coprocessor
with a realistic bandwidth efficiency factor and additional ceilings.
The information in the roofline model plot can be used in order to predict which optimizations are likely to benefit a given application. It also indicates the threshold arithmetic intensity at which the workload transitions from bandwidth-bound to compute-bound. The arithmetic intensity is a property of the numerical algorithm and can be varied for algorithms more expensive than O(N). Code optimizations that improve the memory access performance and increase the arithmetic intensity are presented in this section.
4.5.1 Cache Organization on Intel Xeon Processors and Intel Xeon Phi Coprocessors
Intel Xeon processors and Intel Xeon Phi coprocessors have a similar memory hierarchy: a large but relatively slow RAM is cached by a smaller, faster L2 cache, which, in turn, is cached by an even smaller and even faster L1 cache, which resides in direct proximity to the core registers. See Figure 1.6 for the Knights Corner core topology, Table 1.2 for technical specifications, and Table 1.1 for cache properties.
One aspect of cache organization distinguishes Intel Xeon processors from Intel Xeon Phi coprocessors.
Intel Xeon processors have the L2 cache symmetrically shared between all cores, while in Intel Xeon Phi
coprocessors, the L2 cache can be viewed as slices local to every core and connected via an IPN (Figure 1.5
illustrates the die layout).
Performance penalties occur when the same data is repeatedly evicted from cache and fetched from RAM or a lower-level cache multiple times. Every cache miss on a read operation makes the core stall until the data requested by the core is fetched from memory. A cache miss on a write does not necessarily stall the core, because the core may not need to wait until the data is written.
The latencies of communication with caches can be masked (i.e., overlapped with calculations). In both Intel Xeon processors and Intel Xeon Phi coprocessors, hyper-threading is used in order to allow one parallel thread to use the core while another thread is waiting for data to be read or written. Additionally, Intel Xeon cores are out-of-order processors, and, depending on the algorithm, they can execute other instructions (sometimes speculatively) while a cache miss is being processed. This is not true of Intel Xeon Phi cores, which execute instructions in order.
It is usually feasible to estimate the theoretical minimum of cache misses for any given algorithm. Furthermore, it is often possible to reduce the occurrence of cache misses in an algorithm by changing the order of operations. The techniques for such optimizations include:
1) loop interchange (permutation of nested loops) in order to achieve unit-stride memory access;
2) loop tiling (also known as loop blocking) for algorithms with nested loops acting on multi-dimensional arrays;
3) recursive, cache-oblivious algorithms;
4) loop fusion of adjacent loops traversing the same data.
1 #include <omp.h>
2 #include <stdio.h>
3 #include <stdlib.h>
4 void loop1(int n, double* M, double* a, double* b){
5   // More optimized: unit-stride access to matrix M
6   for (int i=0; i<n; i++)
7     for (int j=0; j<n; j++)
8       b[i]+=M[i*n+j]*a[j];
9 }
10
11 void loop2(int n, double* M, double* a, double* b){
12   // Less optimized: stride n access to matrix M
13   for (int j=0; j<n; j++)
14     for (int i=0; i<n; i++)
15       b[i]+=M[i*n+j]*a[j];
16 }
17
18 int main(){
19   const int n=10000, nl=10; // n is the problem size
20   double t0, t;
21   double* M=(double*)malloc(sizeof(double)*n*n);
22   double* a=(double*)malloc(sizeof(double)*n);
23   double* b=(double*)malloc(sizeof(double)*n);
24   M[0:n*n]=0; a[0:n]=0; b[0:n]=0;
25
26   t=0; // Benchmarking loop 1 (nl runs for statistics)
27   for (int l=0; l<nl; l++) {
28     t0=omp_get_wtime();
29     loop1(n, M, a, b);
30     t+=(omp_get_wtime()-t0)/(double)nl;
31   }
32   printf("Loop 1 (stride 1): %.3f s (%.2f GFLOP/s)\n",
33          t, (double)(2*n*n)/t*1e-9);
34
35   t=0; // Benchmarking loop 2 (nl runs for statistics)
36   for (int l=0; l<nl; l++) {
37     t0=omp_get_wtime();
38     loop2(n, M, a, b);
39     t+=(omp_get_wtime()-t0)/(double)nl;
40   }
41   printf("Loop 2 (stride n): %.3f s (%.2f GFLOP/s)\n",
42          t, (double)(2*n*n)/t*1e-9);
43
44   free(b); free(a); free(M);
45 }
Listing 4.54: matvec-miss.cc executes and times a serial matrix-vector multiplication. Two implementations are
tested: in loop1(), the j loop is nested inside the i loop, and in loop2(), the i-loop is nested inside the j-loop.
First, let us compile this code with the arguments -O1 and -openmp (the latter is needed only to enable the convenient OpenMP timing function omp_get_wtime()). The resulting code will have a low level of optimization, but it will allow us to illustrate the difference between loops 1 and 2. The result is shown in Listing 4.55.
Loop 1 (stride 1): 2.312 s (0.09 GFLOP/s)
Loop 2 (stride n): 71.567 s (0.00 GFLOP/s)
user@mic0$
Listing 4.55: Running the code in Listing 4.54 compiled with the optimization level -O1.
If the code in Listing 4.54 is compiled with the optimization level -O1, then on the host system, the function loop2() performs almost three times slower than loop1() due to the cache misses that it incurs. On the coprocessor, loop2() is more than 30x slower than loop1(). The penalty of unoptimized cache traffic is therefore far greater on the coprocessor than on the host.
The performance of loop2() is poor due to the scattered memory access pattern in the nested loops. Indeed, the matrix M is the largest data container in the problem, and the inner i-loop in loop2() accesses this matrix with a stride of n. Such an access pattern is inefficient for two reasons:

a) Every access to memory fetches not one floating-point number, but a whole cache line containing this number. Cache lines are 64 bytes long in Intel Xeon processors and Intel Xeon Phi coprocessors, and they map to contiguous 64 bytes in the main memory. In loop2(), the inner i-loop uses only one floating-point number from the fetched cache line and moves on to fetch another cache line in the next i-iteration. In contrast, in loop1(), the inner loop is the j-loop, which continues to use the cache line that is already in cache, reading 64/sizeof(double)=8 consecutive floating-point numbers. This means that for 8 consecutive iterations of the j-loop, no further cache misses on matrix M are incurred.

b) By the time the i-loop in loop2() finishes and returns to the same value of i in the next j-iteration, the cache line containing M[i*n+j] and M[i*n+j+1] will likely have been evicted from cache, and a new cache miss will occur.
Figure 4.17 shows the result of the General Exploration analysis for the Sandy Bridge architecture in Intel VTune Amplifier XE. The important metric indicating the need for cache optimization is the last-level cache (LLC) hit ratio of 1.0 (see the Summary view screenshot in the top panel). Additionally, the bottom-up view, shown in the bottom screenshot, indicates that the function loop2() suffers from a high ratio of memory replacements. In contrast, loop1() has far fewer memory replacements and takes almost 3 times fewer clock cycles.
Figure 4.17: VTune General Exploration for the code in Listing 4.54. Notice how function loop2() incurs far more
memory replacements, LLC hits and DTLB overhead than loop1().
The compiler is often able to determine sub-optimal memory access patterns and optimize them. For
instance, at the default optimization level, -O2, the compiler may permute (interchange) nested for-loops in
loop2(), so that memory traffic is optimized. Listing 4.56 shows the result of compiler optimization with
the argument -O2. The argument -vec-report is used in order to make it visible that the loop in line 38
was permuted. This information is also included in the optimization report, which can be requested by adding
the -opt-report compiler argument.
user@host$ ./a.out
Loop 1 (stride 1): 0.076 s (2.64 GFLOP/s)
Loop 2 (stride n): 0.077 s (2.59 GFLOP/s)
user@host$
user@host$ icpc -openmp -O2 matvec-miss.cc -vec-report -mmic
matvec-miss.cc(24): (col. 7) remark: LOOP WAS VECTORIZED.
matvec-miss.cc(24): (col. 20) remark: LOOP WAS VECTORIZED.
matvec-miss.cc(24): (col. 31) remark: LOOP WAS VECTORIZED.
matvec-miss.cc(29): (col. 5) remark: LOOP WAS VECTORIZED.
user@mic0$ ./a.out
user@mic0$

Listing 4.56: At optimization levels -O2 and higher, the compiler optimizes the code by permuting (interchanging) lines 13 and 14 in loop2() in order to reduce cache misses. Both loops run more efficiently and equally fast. Compiler argument -g is necessary in order to allow VTune to resolve the symbols in the C code.
With loop permutation automatically performed at optimization level -O2, the performance of both loops is identical. In addition, both loops gain additional speed due to automatic vectorization, which is enabled at -O2. The speedup is observed both on the host system and on the Intel Xeon Phi coprocessor, and the performance boost is more pronounced on the coprocessor.
Note that profiling of this optimized code with VTune yields higher values of unfavorable cache performance metrics, such as memory replacements and LLC misses (see Figure 4.18). This may seem counter-intuitive, because the wall clock performance was actually improved. However, high values of these metrics do not mean that the code performs worse. These metrics are relative to the number of instructions retired, and vectorization reduces the number of retired instructions almost by a factor of 4. The fact that LLC misses and memory replacements are still high indicates the need for further improvement, as discussed in the next section.
While the compiler can detect and alleviate cache performance issues such as the one shown here, additional techniques can yield even better results. Section 4.5.4 and Section 4.5.5 discuss these techniques.
Figure 4.18: VTune General Exploration for the code in Listing 4.54. The compiler argument -O2 makes functions
loop1() and loop2() equally fast. It also enables automatic vectorization, which improves the wall clock performance
by a factor of 2. Cache performance metrics indicate opportunities for further cache optimization.
1 // Original nested loops
2 for (int i = 0; i < m; i++)
3   for (int j = 0; j < n; j++)
4     compute(a[i], b[j]); // Memory access is unit-stride in j
5
6 // Tiled nested loops
7 for (int ii = 0; ii < m; ii += TILE)
8   for (int j = 0; j < n; j++)
9     for (int i = ii; i < ii + TILE; i++) // Re-use data for each j with several i
10      compute(a[i], b[j]); // Memory access is unit-stride in j
11
12 // Doubly tiled nested loops
13 for (int ii = 0; ii < m; ii += TILE)
14   for (int jj = 0; jj < n; jj += TILE)
15     for (int i = ii; i < ii + TILE; i++) // Re-use data for each j with several i
16       for (int j = jj; j < jj + TILE; j++)
17         compute(a[i], b[j]); // Memory access is unit-stride in j
Listing 4.57: Schematic organization of loop tiling. If the array b[0:n] does not fit in cache, then tiling the outer ii-loop and adding the innermost i-loop increases the locality of data access and thus improves cache traffic. Tiling the j-loop in a similar manner (lines 12-17) further improves data locality.
In order to analyze this optimization, let us assume that the array b does not fit in cache. Then, in the unoptimized version (lines 1 through 4 in Listing 4.57), for every iteration in i, all the data of b will have to be read from memory into cache, evicted from cache, and then fetched again in the next i-iteration. Re-organization of the loops with tiling (lines 6-10 in Listing 4.57) ensures that every value of b[j] is used several times before b[j+1] is fetched. With loop tiling, the j-loop may be vectorized; then every value of b[j]...b[j+V-1] is used several times (here V is the vector register length). Re-use of the data reduces the number of times that array b has to be loaded into cache and thus improves performance. The loop in j can be tiled in a similar manner (lines 12-17 in Listing 4.57).
Ideally, the data spanned by the innermost loops should utilize the whole cache. This means that the size of the tile must be tuned to the specifics of the computer architecture.
Special precautions should be taken with loop tiling:
1) When the values of m and n are not multiples of the tile size TILE, some iterations must be peeled off, as shown in Listing 4.58.
2) If, in the original algorithm, automatic vectorization was used in the inner j-loop, then in the tiled version, automatic vectorization must be applied to the j-loop, which is no longer the innermost loop. That can be achieved either by unrolling the inner i-loop (see Listing 4.59), or by using #pragma simd and making the value of TILE a constant known at compile time (see Listing 4.60).
3) It is best to make loop termination conditions as simple as possible and to use constants for tile sizes, in order to facilitate automatic vectorization. Additionally, multiversioned, redundant code such as that shown in Listing 4.58 is instrumental for compiler-assisted optimization, even if it does not look elegant.
Listing 4.58: Peeling off the last several iterations of a tiled loop when m is not a multiple of TILE.
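A peel loop of this kind can be sketched as follows (a minimal illustration, not the exact code of Listing 4.58; mTiled is a hypothetical helper variable):

  // Process full tiles; mTiled is the largest multiple of TILE not exceeding m
  const int mTiled = m - m%TILE;
  for (int ii = 0; ii < mTiled; ii += TILE)
    for (int j = 0; j < n; j++)
      for (int i = ii; i < ii + TILE; i++)
        compute(a[i], b[j]);

  // Peel loop: the remaining (m % TILE) iterations are processed without tiling
  for (int j = 0; j < n; j++)
    for (int i = mTiled; i < m; i++)
      compute(a[i], b[j]);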
1 const int TILE = 4;
2 for (int ii = 0; ii < m; ii += TILE)
3   for (int j = 0; j < n; j++) {
4     compute(a[ii + 0], b[j]);
5     compute(a[ii + 1], b[j]);
6     compute(a[ii + 2], b[j]);
7     compute(a[ii + 3], b[j]);
8   }
Listing 4.59: Explicit unrolling of the inner loop in order to retain vectorization in j.
1 const int TILE = 4; // TILE must be a constant in order to unroll the i-loop
2 for (int ii = 0; ii < m; ii += TILE)
3 #pragma simd
4   for (int j = 0; j < n; j++)
5 #pragma unroll(TILE)
6     for (int i = ii; i < ii + TILE; i++) // Compiler can unroll this loop
7       compute(a[i], b[j]);
Listing 4.60: Automatic unrolling of the inner loop in order to retain vectorization in j. Note the #pragma simd instruction necessary to vectorize the outer loop. #pragma unroll may not be necessary if the loop body is not too complex.
11     const double crossSection = absorption[j*wlBins + i];
12     const double product = gsd*crossSection;
13     double result = 0;
14 #pragma vector aligned
15     for (int k = 0; k < tempBins; k++)
16       result += planckFunc[i*tempBins + k]*distribution[j*tempBins + k];
17     sum += result*product;
18   }
19   emissivity[i] = sum*wavelength[i];
20 }
21 }

Listing 4.61: Fragment of the unoptimized emissivity calculation in the function ComputeEmissivity(), with the j-loop nested inside the i-loop.
One does not have to understand the physical meaning of the calculation in Listing 4.61 in order to optimize this code. However, the important aspects of this function are how it is executed and how big the data set is.

• This function is a thread-safe routine. Multiple instances of this function are called from a parallel loop.
• In the parallel loop, the input array distribution and the output array emissivity are private to each parallel instance of the function ComputeEmissivity().
• All other input arrays are shared between all instances of the function.
• The input data of this function is guaranteed to remain intact for the duration of the execution.
• The values of the parameters in this code are in the neighborhood of wlBins≈gsMax≈tempBins≈128. It is also safe to assume that tempBins is a power of 2.

The size of the most used 2-dimensional arrays in this code, planckFunc and distribution, can be estimated at 64 kB each, and therefore these arrays fit neither into the L1 cache of the host processor nor into that of the coprocessor. Furthermore, with hyper-threading, there may be insufficient L2 cache in each core to hold the arrays belonging to all hyper-threads. Therefore, cache optimizations must be employed in this application.
First, one may consider opportunities for loop interchange. The code of the function has three nested
loops in variables i, j and k, respectively. The inner loop in k is executed the greatest number of times,
and therefore, optimizing memory access in this loop promises the greatest benefits. Visual inspection of the
loop in variable k shows that the loop has unit stride and operates with aligned vector instructions. These are
desirable properties, which we would like to preserve. If we cannot touch the loop in k, we can attempt to
interchange loops in i and j. This interchange leads to a significant improvement in performance, and we
explain this optimization below. Listing 4.62 shows the code of the function with interchanged i and j loops.
13   for (int i = 0; i < wlBins; i++) {
14     double result = 0;
15 #pragma vector aligned
16     for (int k = 0; k < tempBins; k++)
17       result += planckFunc[i*tempBins + k]*distribution[j*tempBins + k];
18
19     const double crossSection = absorption[j*wlBins + i];
20     const double product = gsd*crossSection;
21     emissivity[i] += result*product*wavelength[i];
22   }
23 }
24 }
Listing 4.62: Emissivity calculation with interchanged i- and j-loops. Note that we moved the access to grainSizeD[j] into the outer j-loop.
The performance of the new code is significantly better than that of the unoptimized version, as shown in Figure 4.19. This is explained by the change in the locality of access to the array distribution:

• In the original code, the array distribution was read from front to back in every iteration of the outer loop, the i-loop. By the end of an i-iteration, the beginning of the array distribution may have been evicted from the L2 cache, and in the subsequent i-iteration, this array must be fetched into cache again. Therefore, the array distribution was read from memory into the L2 cache up to wlBins≈128 times.

• In contrast, in the optimized code, for every iteration of the outer loop (the j-loop in the optimized version), only the j-th row of the array distribution is used. A single row is only sizeof(float) × tempBins ≈ 512 bytes long, which is small enough to fit in the L1 cache. Therefore, the array distribution is fetched from memory into cache only once in the function ComputeEmissivity().

At first glance, it may seem counter-intuitive that this loop interchange results in a performance increase. Indeed, we improved the locality of access to distribution, but ruined the locality of access to the equally large array planckFunc. In order to understand why this optimization was effective, recall that the function ComputeEmissivity() is called from a thread-parallel region, and distribution is private to each thread in the region, while planckFunc is shared between all threads. This explains why it is more important for performance to maintain the locality of access to distribution. Indeed, for four hyper-threads working on a single core of the Intel Xeon Phi coprocessor, only one copy of planckFunc needs to reside in the L2 cache, but four instances of distribution must also be there. Therefore, non-local access to distribution leads to a greater number of cache evictions and fetches in each core than non-local access to planckFunc, and this explains why the loop interchange successfully improved performance.
Let us continue with the optimization of this code and apply loop tiling in order to further improve the efficiency of memory access to the array distribution. This optimization is demonstrated in Listing 4.63. The size of the tile must be chosen empirically, i.e., we must use the tile size that produces the best performance. Oftentimes, the optimal tile size is a power of 2 not exceeding 16; this upper limit is tied to the number of vector registers in processor and coprocessor cores. In the case of this code, the optimal tile size turned out to be equal to 8.
7  // In this version, loop tiling is implemented in the i-loop
8  const int iTile = 8; // Found empirically
9  assert(wlBins%iTile==0);
10 emissivity[0:wlBins] = 0.0;
11 for (int j = 0; j < gsMax; j++) {
12   const double gsd = grainSizeD[j];
13   for (int ii = 0; ii < wlBins; ii+=iTile) { // i-loop tiling
14     double result[iTile]; result[:] = 0.0;
15 #pragma vector aligned
16 #pragma simd
17     for (int k = 0; k < tempBins; k++)
18 #pragma novector
19       for (int i = ii; i < ii + iTile; i++)
20         result[i-ii] += planckFunc[i*tempBins + k]*distribution[j*tempBins + k];
21
22     for (int i = ii; i < ii + iTile; i++) {
23       const double crossSection = absorption[j*wlBins + i];
24       const double product = gsd*crossSection;
25       emissivity[i] += result[i-ii]*product*wavelength[i];
26     }
27   }
28 }}
Listing 4.63: Emissivity calculation with tiled i-loop. Note that the variable result must now be an array of size equal
to the tile size, and that #pragma simd must be used in order to vectorize the k-loop, because now it has another loop
nested in its body.
In the optimized code in Listing 4.63, additional performance is gained because the locality of access to distribution is improved at a higher level of the cache hierarchy. Indeed, in the previous version (Listing 4.62), for every i-iteration, the j-th row of distribution was read from front to back. In the subsequent i-iteration, this row must be fetched again from the L1 cache into registers and, potentially, from the L2 cache into the L1 cache. With tiling (Listing 4.63), every time a vector register is filled with the data of distribution, this register is used iTile=8 times before it is discarded. It can be seen in Figure 4.19 that the effect of this optimization is a factor of 1.6x improvement on the coprocessor and 1.4x on the host system.
Figure 4.19: Performance of the emissivity calculation (Listing 4.61, Listing 4.62 and Listing 4.63). The last case, "tiled j- and i-loops", can be found in Exercise Section A.4.11.
It is possible to further optimize this code by improving the locality of access to the array planckFunc. This can be done by tiling the j-loop. We will not discuss this optimization in the main text and refer the reader to Exercise Section A.4.11, where this optimization can be found along with the full code of this example.
Listing 4.64: Cache-ignorant (i.e., unoptimized) algorithm of parallel in-place square matrix transposition.
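As a rough illustration of such a cache-ignorant transposition (a sketch, not necessarily the exact code of Listing 4.64), assuming a square n×n matrix A stored in row-major order and Intel Cilk Plus for threading, as used later in this example:

  _Cilk_for (int i = 0; i < n; i++)
    for (int j = 0; j < i; j++) {
      // Swap the (i,j) and (j,i) elements; only the lower triangle is traversed
      const float c = A[i*n + j];
      A[i*n + j] = A[j*n + i];
      A[j*n + i] = c;
    }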
Running on the host system with two Intel Xeon processors and natively on the Intel Xeon Phi coprocessor with n=28000, the code in Listing 4.64 performs as shown in Figure 4.20 for the case "Unoptimized".
The performance of the code can be improved by loop tiling. In Listing 4.65 we show an optimized implementation of the transposition function with double loop tiling.
1 #include <cassert>
2
5 #ifdef __MIC__
6 const int TILE = 16;
7 #else
8 const int TILE = 32;
9 #endif
10 // The below restriction can be lifted using an additional peel loop
11 assert(n%TILE == 0);

Listing 4.65: Beginning of the tiled implementation of the parallel in-place square matrix transposition.
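A sketch of how the tiled transposition loops could be organized (an illustration consistent with the discussion below, not the exact remainder of Listing 4.65; TILE and the matrix A are as above):

  // Traverse the lower triangle in TILE x TILE blocks; the (ii, jj) block is
  // swapped with the (jj, ii) block, so each block of cache lines stays in the
  // L1 cache while it is used and different workers write to disjoint blocks.
  _Cilk_for (int ii = 0; ii < n; ii += TILE)
    for (int jj = 0; jj <= ii; jj += TILE)
      for (int i = ii; i < ii + TILE; i++) {
        const int jMax = (jj + TILE < i ? jj + TILE : i);
  #pragma simd
        for (int j = jj; j < jMax; j++) {
          const float c = A[i*n + j];
          A[i*n + j] = A[j*n + i];
          A[j*n + i] = c;
        }
      }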
Performance results of the tiled algorithm are shown in Figure 4.20, labelled as "Loop tiling". Loop tiling increases the performance of the code on both the host system and the Intel Xeon Phi coprocessor by a factor of 5x-6x. There are two reasons why tiling is beneficial in this case:
1) First, similarly to the emissivity calculation code in Example 1 (Listing 4.63), tiling improves the locality of data access at the highest level of caching. For instance, the inner j-loop performs a scattered write to elements A[j*n+i]. In the subsequent i-iteration, the j-loop modifies elements A[j*n+i+1]. If the loop is not tiled, then every i-iteration modifies n cache lines in this way, and for a large n, these cache lines must be evicted to the higher-level cache or to the main memory. On the other hand, in a tiled loop, every i-iteration modifies only TILE ≤ 32 cache lines, and for iteration i+1, these cache lines are still in the L1 cache or even in the registers.
2) Another benefit of tiling in this case is that it reduces false sharing. Without tiling, one Intel Cilk Plus worker may be operating on a certain i and writing to A[j*n+i]; at the same time, another worker is processing i+1 and writing to A[j*n+i+1]. There will be collisions in which A[j*n+i] and A[j*n+i+1] are in the same cache line. As we have seen in Section 4.4.2, such situations lead to false sharing, where the processor locks the "dirty" cache line until its modification is propagated across all caches accessing it. This incurs significant performance penalties. With tiling, on the other hand, each worker processes a block of iterations in the ij-space, and therefore false sharing does not occur upon writing to A[i*n+j] or A[j*n+i].
nh
Note that for optimum performance, we use different tile sizes for the host system and for the coprocessor.
Yu
That the values TILE=32 on the host and TILE=16 on the coprocessor were obtained empirically. We
r
only had to test multiples of 16 and 8 for the tile size (on coprocessor and processor, respectively), because
fo
we intended to have the inner loop automatically vectorized. Additionally, the coprocessor performance
ed
is improved in this code with the help of #pragma simd to enforce vectorization, and #pragma loop
ar
count to tune the loop to the most frequently used number of iterations. More information about these
p
Loop tiling is most efficient when the tile size is precisely tuned to the cache size of the system. There
exists another approach to cache traffic optimization, known as cache-oblivious algorithms, in which tuning
does not have to be as precise as with tiling. This approach may provide a solution that is more portable across
different computer architectures.
Cache-oblivious algorithms exploit recursion in order to work efficiently for any size of the cache and of the problem. These methods were introduced by Prokop [64] and subsequently elaborated by Frigo et al. [65]. The principle of cache-oblivious algorithms is to recursively divide the data set into smaller and smaller chunks. Regardless of the cache size of the system, the recursion will eventually reach a small enough data subset that fits into the cache. This property of cache-oblivious algorithms provides portability across various architectures. Listing 4.66 illustrates this approach.
1 // Unoptimized algorithm
2 void CalculationUnoptimized(void* data, const int size) {
3   for (int i = 0; i < size; i++) {
4     // ... perform work;
5   }
6 }
7
8 // Optimized recursive cache-oblivious algorithm
9 void CalculationOptimized(void* data, const int size) {
10   // Initiate recursion
11   CalculationRecurse(data, 0, size);
12 }
13
14 void CalculationRecurse(void* data, const int iStart, const int iEnd) {
15   if (iEnd - iStart < recursionThreshold) {
16     // Small enough data subset: process it directly
17     for (int i = iStart; i < iEnd; i++) {
18       // ... perform work
19     }
20   } else {
21     // Recursively split the iteration range into two halves
22     const int iSplit = iStart + (iEnd - iStart)/2;
23     CalculationRecurse(data, iStart, iSplit);
24     CalculationRecurse(data, iSplit, iEnd);
25   }
26 }

Listing 4.66: Schematic organization of a recursive cache-oblivious algorithm.
In practice, decomposing the problem into size 1 operations is not optimal, as the overhead of thread
creation and function calls may outweigh the benefit of cache-efficient data handling. In addition, size 1
operations prevent vectorization. Therefore, it is beneficial to introduce a threshold of the problem size, at
which the recursion stops, and another algorithm is applied to the reduced problem. This is the meaning of the
condition (iEnd - iStart < recursionThreshold). The value of the recursion threshold must
be chosen empirically; however, it usually does not have to be tuned to the architecture as precisely as the tile
size in tiled algorithms.
Parallelization of the divide-and-conquer approach of cache-oblivious algorithms is straightforward in
the fork-join model of parallelism. Indeed, with the spawning functionality of Intel Cilk Plus it is easy to
spawn off the first of the two recursive calls, as illustrated in the following example.
Example
A recursive cache-oblivious algorithm for matrix transposition was proposed and evaluated by Tsifakis, Rendell & Strazdins [66]. An implementation of this algorithm employing Intel Cilk Plus is shown in Listing 4.67. Note that this algorithm may be further improved by eliminating redundant forks and the unnecessary transposition of diagonal elements, by introducing a blocked, multiversioned code for the inner loop in the non-recursive transposition engine, etc. However, the implementation presented here is intentionally kept simple in order to convey the underlying principle.
1 void transpose_cache_oblivious_thread(
2     const int iStart, const int iEnd,
3     const int jStart, const int jEnd,
4     float* A, const int n){
5 #ifdef __MIC__
6   const int RT = 64; // Recursion threshold on coprocessor
7 #else
8   const int RT = 32; // Recursion threshold on host
9 #endif
10   if ( ((iEnd - iStart) <= RT) && ((jEnd - jStart) <= RT) ) {
11     for (int i = iStart; i < iEnd; i++) {
12       int je = (jEnd < i ? jEnd : i);
13 #pragma simd
14       for (int j = jStart; j < je; j++) {
15         const float c = A[i*n + j];
16         A[i*n + j] = A[j*n + i];
17         A[j*n + i] = c;
18       }
19     }
20     return;
21   }
22
30   } else {
Listing 4.67: Cache-oblivious recursive algorithm of parallel in-place square matrix transposition.
In order to effect parallel recursion, _Cilk_spawn is used to asynchronously execute the first of
the recursive functions. The problem splitting direction alternates between horizontal (i-wise) and vertical
(j-wise) sectioning of the matrix. Race conditions are avoided by ensuring that jMax<=iSplit.
The following fine tuning aspects of the code in Listing 4.67 are crucial for performance:
1) Recursion stops when the subset of the problem is reduced to the threshold set by the variable RT.
2) The threshold RT is chosen empirically and has different optimal values on the host and on the coprocessor.
3) This smallest problem partition is processed serially in each Intel Cilk Plus strand, utilizing vector
instructions for streaming reads and scattered writes.
4) Just like with the tiled algorithm, we found #pragma simd and #pragma loop count to be benefi-
cial for the performance on the coprocessor.
5) Additionally, we facilitate aligned memory accesses on the coprocessor by ensuring that (a) the matrix begins at an aligned address, and (b) the row length and the problem split points are multiples of the SIMD vector length. This length is equal to 16 for single-precision floating-point numbers on the coprocessor.
Figure 4.20 shows that the performance of this code exceeds that of the tiled code shown in Listing 4.65 by 1.3x on the coprocessor and 1.5x on the host system. The techniques of loop tiling and cache-oblivious recursion benefit the performance of codes both on the host system with Intel Xeon processors and on Intel Xeon Phi coprocessors.
Figure 4.20: Execution time of the square matrix transposition algorithm on the host system and on the coprocessor.
Matrix size in this benchmark is n×n with n=28000. See Listing 4.64, Listing 4.65 and Listing 4.67 for the corresponding
source code.
The complete code of the parallel matrix transpose with blocking and recursion, along with step-by-step
optimization instructions, can be found in Exercise Section A.4.11.
Listing 4.68: Schematic illustration of loop fusion.
Loop fusion is beneficial for cache performance because, in the case of two disjoint loops, by the time the first loop has finished, the beginning of the data set may have been evicted from the caches. In a fused loop, by contrast, all stages of data processing occur while the data is still in the caches. In addition, loop fusion may help to reduce the memory footprint of temporary storage, if such storage was needed in order to carry intermediate data from the first loop to the second.
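As a rough illustration of the idea (a sketch, not the contents of Listing 4.68), two disjoint loops over the same data can be merged as follows, with hypothetical stage1() and stage2() processing functions:

  // Disjoint loops: by the time the second loop runs, data[i] for small i
  // may already have been evicted from cache and must be re-fetched
  for (int i = 0; i < n; i++)
    temp[i] = stage1(data[i]);
  for (int i = 0; i < n; i++)
    result[i] = stage2(temp[i]);

  // Fused loop: each element passes through both stages while it is in cache,
  // and the temporary array temp[] is no longer needed
  for (int i = 0; i < n; i++)
    result[i] = stage2(stage1(data[i]));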
If two loops that are candidates for fusion are located within the same lexical scope, the Intel C++ Compiler may fuse them automatically. The compiler is also capable of some inter-procedural optimization. However, automatic loop fusion may fail if the compiler does not see both loops at compile time (e.g., the loops are located in separate source files), or if additional measures must be taken for value-safe fusion. In cases when automatic loop fusion fails, the programmer may need to implement it explicitly.
In the code in Listing 4.69, Intel MKL is used to generate random arrays, and in each array, the mean and
standard deviation of generated numbers is computed. This code is representative of pipelined applications in
which some temporary data are generated and processed in subsequent functions.
1 #include <omp.h>
2 #include <malloc.h>
3 #include <mkl_vsl.h>
4 #include <math.h>
5
6 void GenerateRandomNumbers(const int m, const int n, float* const data) {
7   // Filling arrays with normally distributed random numbers
8 #pragma omp parallel
9   {
10     VSLStreamStatePtr rng; const int seed = omp_get_thread_num();
11     int status = vslNewStream(&rng, VSL_BRNG_MT19937, omp_get_thread_num());
12
13 #pragma omp for schedule(guided)
14     for (int i = 0; i < m; i++) {
15       const float mean = (float)i; const float stdev = 1.0f;
16       status = vsRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER,
17                              rng, n, &data[i*n], mean, stdev);
18     }
19     vslDeleteStream(&rng);
20   }
21 }
22
23 void ComputeMeanAndStdev(const int m, const int n, const float* const data,
24                          float* const resultMean, float* const resultStdev) {
25   // Computing the mean and the standard deviation of each array
26 #pragma omp parallel for schedule(guided)
27   for (int i = 0; i < m; i++) {
28     float sumx  = 0.0f;
29     float sumx2 = 0.0f;
30     for (int j = 0; j < n; j++) {
31       sumx  += data[i*n + j];
32       sumx2 += data[i*n + j]*data[i*n + j];
33     }
34     resultMean[i] = sumx/(float)n;
35     resultStdev[i] = sqrtf(sumx2/(float)n-resultMean[i]*resultMean[i]);
36   }
37
38 }
39
40 void RunStatistics(const int m, const int n,
41                    float* const resultMean, float* const resultStdev) {
42   // Allocating memory for scratch space for the whole problem
43   // m*n elements on heap (does not fit on stack)
44   float* data = (float*) _mm_malloc((size_t)m*(size_t)n*sizeof(float), 64);
45
46   GenerateRandomNumbers(m, n, data);
47   ComputeMeanAndStdev(m, n, data, resultMean, resultStdev);
48
49   // Deallocating scratch space
50   _mm_free(data);
51 }
Listing 4.69: Generation and processing of pseudo-random data in a function with disjoint parallel for-loops.
In the above code, function RunStatistics() is the interface function called from an external code.
This function allocates some temporary data to store random numbers and then calls two functions: one
to generate the pseudo-random data, and another to analyze it statistically. In this example, the compiler
does not fuse the i-loops in GenerateRandomNumbers() and ComputeMeanAndStdev(). With
data generation isolated from data processing, this code is logically structured. However, if array data does
not fit into caches, the performance may suffer, because
a) during the execution of GenerateRandomNumbers(), array data will be fetched into caches, but
then most of it will be evicted by the time the function terminates;
b) the function ComputeMeanAndStdev() must fetch data from memory once again,
c) the amount of memory temporarily allocated for data may be excessively large.
In order to improve the code, we can apply loop fusion, as shown in Listing 4.70.
1 #include <omp.h>
2 #include <mkl_vsl.h>
3 #include <math.h>
4
5 void RunStatistics(const int m, const int n,
6                    float* const resultMean, float* const resultStdev) {
7 #pragma omp parallel
8   {
18       const float seedMean = (float)i; const float seedStdev = 1.0f;
19       status = vsRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER,
20                              rng, n, data, seedMean, seedStdev);
21       // Processing data to compute the mean and standard deviation
Listing 4.70: Generation and processing of pseudo-random data in a function with fused parallel for-loops.
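Since only a fragment of the fused version is shown above, the following sketch illustrates the idea in full (a hypothetical reconstruction, not the exact listing): each thread allocates an n-element scratch buffer, and generation and statistics are fused into one parallel i-loop.

  #include <omp.h>
  #include <malloc.h>
  #include <mkl_vsl.h>
  #include <math.h>

  void RunStatistics(const int m, const int n,
                     float* const resultMean, float* const resultStdev) {
  #pragma omp parallel
    {
      // Per-thread scratch space: n elements instead of m*n for the whole problem
      float* data = (float*) _mm_malloc((size_t)n*sizeof(float), 64);
      VSLStreamStatePtr rng;
      int status = vslNewStream(&rng, VSL_BRNG_MT19937, omp_get_thread_num());

  #pragma omp for schedule(guided)
      for (int i = 0; i < m; i++) {
        // Generation stage (formerly in GenerateRandomNumbers())
        const float seedMean = (float)i; const float seedStdev = 1.0f;
        status = vsRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER,
                               rng, n, data, seedMean, seedStdev);

        // Processing stage (formerly in ComputeMeanAndStdev()), fused into the
        // same iteration while the generated data is still in cache
        float sumx = 0.0f, sumx2 = 0.0f;
        for (int j = 0; j < n; j++) {
          sumx  += data[j];
          sumx2 += data[j]*data[j];
        }
        resultMean[i]  = sumx/(float)n;
        resultStdev[i] = sqrtf(sumx2/(float)n - resultMean[i]*resultMean[i]);
      }

      vslDeleteStream(&rng);
      _mm_free(data);
    }
  }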
With loop fusion, both operations, the generation of random numbers and the computation of cumulative statistical information, have to reside in the same function. While this may be "poor style" from the structured programming point of view, this optimization improves cache traffic and allows us to get rid of the huge array data. Indeed, the data for every iteration in i can be generated, used, and discarded before proceeding to the next iteration. Therefore, we can allocate only sizeof(float)*T*n bytes of scratch space as opposed to sizeof(float)*m*n bytes, where T is the number of threads in the system and m is the number of test arrays. We are assuming here that m ≫ T, because otherwise the application would be troubled by insufficient parallelism (see Section 4.4.4).
The effect of loop fusion, along with the reduced scratch memory footprint, is shown in Figure 4.21.
Figure 4.21: Performance of the code generating and analyzing pseudo-random data on the host system and on the Intel Xeon Phi coprocessor. The unoptimized case (disjoint loops) is shown in Listing 4.69, and the optimized case (fused loops) in Listing 4.70.
The complete working code for this example can be found in Exercise Section A.4.13.
where the address can be predicted in advance (e.g., *(A+i*n+j)). However, the compiler does not issue prefetch instructions for accesses of the form A[B[i]]. It is possible to see the report on compiler prefetching using the compiler arguments -opt-report3 -opt-report-phase hlo.
In order to fine-tune the application performance, the programmer may wish to control software prefetching. If this approach is taken, it is advisable to first turn off automatic compiler prefetching using the compiler argument -no-opt-prefetch (to disable prefetching in the whole source file) or by placing #pragma noprefetch before a function or a loop (for more fine-grained control). After that, prefetching can be modified with the argument -opt-prefetch-distance (to affect the whole file) or effected with the prefetch pragma and prefetch intrinsics placed in the code for individual loops.
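As a concrete illustration (a minimal sketch, not one of the book's listings), the fragment below disables automatic prefetching for a loop with indirect accesses and issues software prefetches by hand. It assumes the Intel C++ Compiler's noprefetch pragma and the _mm_prefetch() intrinsic; the function name SumGather and the prefetch distance DIST are arbitrary choices made for this example.

#include <xmmintrin.h>  // _mm_prefetch(), _MM_HINT_T0

enum { DIST = 64 };  // illustrative prefetch distance, in loop iterations

float SumGather(const float* A, const int* B, const int n) {
  float s = 0.0f;
#pragma noprefetch
  for (int i = 0; i < n; i++) {
    if (i + 2*DIST < n) {
      _mm_prefetch((const char*)&B[i + 2*DIST], _MM_HINT_T0);   // future indices
      _mm_prefetch((const char*)&A[B[i + DIST]], _MM_HINT_T0);  // future gathered elements
    }
    s += A[B[i]];  // indirect access that the compiler does not prefetch automatically
  }
  return s;
}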
The following general considerations may be helpful in the planning of prefetch optimization:
a) It is possible to diagnose whether the performance of a particular application can be improved with software prefetching. The simplest test is to turn off prefetching in the whole application or in a particular loop or function. If the performance drops significantly, then prefetching plays an important role, and fine-tuning it may yield additional gains.
b) Prefetching is more important for Intel Xeon Phi coprocessors than for Intel Xeon processors. This is in part explained by the fact that Intel Xeon cores are out-of-order processors, while Intel Xeon Phi cores are in-order. Out-of-order execution allows Intel Xeon processors to overlap computation with memory latency. Additionally, the lack of a hardware L1 prefetcher on Intel Xeon Phi coprocessors makes software prefetching necessary at the lowest level of the cache hierarchy.
c) Loop tiling and recursive cache-oblivious algorithms (see Section 4.5.4 and Section 4.5.5) improve application performance by reducing cache traffic, and, therefore, prefetching becomes less important for algorithms optimized with these techniques.
d) If software prefetching maintains good cache traffic, hardware prefetching does not come into effect.
Additional information on prefetching on Intel Xeon Phi coprocessors can be found in this presentation
by Rakesh Krishnaiyer [67]. The Intel C++ Compiler Reference Guide has detailed information about pragmas
prefetch and noprefetch [68] and compiler argument -opt-prefetch [69].
6   {
7     data[0] = 0; // touch array to avoid dead code elimination
8   }
9 }
Listing 4.71: Function performing offload of data to the coprocessor in the default offload mode. Memory for transferred
data is allocated and deallocated at the beginning and end of each offload. Data is transferred fully in each offload.
This function transfers the array data to the coprocessor in the default mode. At the beginning of each offload, the offload Runtime Library (RTL) will allocate memory for the respective array on the coprocessor, then the data will be transferred over the PCIe bus, calculations will be performed, and memory will be deallocated.
Consider the case when a function performing offload is called multiple times, and all or some of the pointer-based arrays sent to the coprocessor have the same size. Then the offload can be optimized using the clauses alloc_if and free_if in order to preserve the memory allocated for the array on the coprocessor (see also Section 2.2.9). This optimization is shown in Listing 4.72.
Listing 4.72: Function performing offload with data transfer, optimized with memory retention on the coprocessor. Data is transferred fully in each offload; however, memory for the data on the coprocessor is allocated only during the first offload and deallocated during the last offload.
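A minimal sketch of this memory-retention pattern is given below. It reuses the alloc_if/free_if clause syntax that appears in Listing 4.92 later in this chapter; the function name and its arguments are illustrative and not necessarily identical to Listing 4.72.

// Sketch: memory for 'data' is allocated on the coprocessor only in the first
// offload (k == 0) and freed only in the last one (k == nTrials-1); the data
// itself is still transferred in every offload.
void OffloadWithMemoryRetention(float* data, const long size,
                                const int k, const int nTrials) {
  const int firstOffload = (k == 0);
  const int lastOffload  = (k == nTrials - 1);
#pragma offload target(mic:0) \
  in(data : length(size) alloc_if(firstOffload) free_if(lastOffload))
  {
    data[0] = 0; // touch array to avoid dead code elimination
  }
}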
The effect of memory retention on the offload performance is very significant, because the memory allocation operation is essentially serial and therefore slow. For large arrays, memory retention reduces the latency associated with data offload by almost a factor of 10. For smaller arrays, the effect is even more dramatic, because the latency of the memory allocation operation comes into play. As a rule of thumb, expect to transfer data across the PCIe bus at a rate of 6 GB/s, but to allocate memory on the coprocessor at a rate of only 0.5 GB/s.
 2                          const int k, const int nTrials) {
 3
 4   // Transfer data during the first iteration;
 5   // skip transfer for subsequent iterations
 6   const size_t transferSize = ( k == 0 ? size : 0);
 7
 8 #pragma offload target(mic:1) \
12   }
13 }
Listing 4.73: Function performing offload of a calculation to the coprocessor with data persistence. After the first offload, data is not transferred again, and memory is not allocated or deallocated until the last offload.
In this case, we limited data transport with the help of the clause length(0) in all offloads but the first one. The array data will still be available in full on the coprocessor; however, its contents will not be transferred again.
MIC_USE_2MB_BUFFERS=1K.
The effective bandwidth of the PCIe traffic shown in Figure 4.23 is derived by dividing the transferred array size by the offload latency with memory retention. The word “effective” indicates that this metric includes the communication time with the offload RTL and the data transfer time.
Results show that
a) In the default offload mode, the offload latency is dominated by memory allocation. This is manifested in the fact that retaining memory allocated on the coprocessor reduces the offload latency by more than a factor of 10.
b) When memory retention is used, the latency consists of two components: the data transfer time (proportional to the array size) and the time of communication with the offload RTL (the latter is independent of the array size). For arrays greater than 256 kB, the data transfer time dominates the latency.
c) The effective bandwidth increases with the array size. It plateaus at 6.0-6.4 GB/s for arrays greater than 16 MB.
d) The maximum bandwidth is slightly greater for arrays over 16 MB when large TLB pages are set via MIC_USE_2MB_BUFFERS.
e) Arrays of size 4 MB have an anomalously low effective bandwidth. The bandwidth is also slightly below the general trend for arrays of size 2 MB and 8 MB.
f) The effective bandwidth for arrays smaller than 32 kB is less than 10% of the maximum bandwidth.
The results and optimization techniques discussed above may help to optimize applications in which the computation-to-communication ratio is not very high. The complete code of the benchmark can be found in Exercise Section A.4.14.
[Figure 4.22 (plot): “Offload latencies” — latency in ms versus array size (1 kB to 1 GB) for three offload modes: default offload (memory allocation + data transfer), offload with memory retention (data transfer only), and offload with data persistence; each mode is shown with standard TLB pages and with 2 MB buffers.]
Figure 4.22: Latencies associated with data and function offload to the coprocessor. Three sets of curves (dash-dotted, solid and dotted) reflect different offload modes. For each of the offload modes, two settings for the TLB pages are tested: default 4 KB TLB pages (red curves with filled circles) and huge 2 MB pages (blue curves with filled triangles). See Section 4.6.4 for details.
[Figure 4.23 (plot): effective bandwidth in GB/s versus array size (1 kB to 1 GB), for default TLB pages and for 2 MB buffers.]
Figure 4.23: The effective bandwidth of data transfer to the coprocessor, calculated by dividing the transferred data size by the offload latency for the “memory retention” case in Figure 4.22. Two settings for the TLB pages are tested: default 4 KB TLB pages (red curves with filled circles) and huge 2 MB pages (blue curves with filled triangles). See Section 4.6.4 for details.
2. The total number of MPI processes can easily exceed 240 even on a single compute node. This is at least an order of magnitude greater than in traditional systems. The consequence of this large number of processes is a large amount of MPI communication. If communication throttles the performance of the algorithm, the programmer must consider communication-efficient algorithms. Additionally, hybrid OpenMP/MPI programming can be employed in order to reduce the number of MPI processes while utilizing multithreading within each process.
In this section, we discuss and provide examples of MPI application optimization with the help of load balancing and inter-operation with OpenMP. For load balancing, we utilize “boss-worker” scheduling with algorithms also adopted in parallel OpenMP loops:
a) static scheduling, where the work distribution is known before the beginning of the parallel loop;
b) dynamic scheduling, where the scheduler assigns work to the available workers in chunks, and when a worker finishes its chunk of work, it reports to the “boss” to receive the next chunk; and
c) guided scheduling, which is similar to dynamic scheduling, except that at the beginning, chunks are large, and towards the end of the calculation, the chunk size is gradually reduced.
We demonstrate how using OpenMP in MPI processes lowers the number of workers, which may decrease the amount of communication, benefiting performance. We also discuss the subject of process pinning in MPI.
Naturally, the optimization of MPI applications does not end at the techniques discussed above. An efficient MPI application must minimize communication and overlap communication with computation, utilize SIMD instructions, partition problems for local data access, and use efficient computational microkernels. These topics have been discussed earlier in this chapter.
4.7.1 Example Problem: the Monte Carlo Method of Computing the Number π
We illustrate load balancing in MPI for an application that uses the Monte Carlo method to compute the
number π = 3.141592653589793 . . . This problem does not require intensive data transfer, and every Monte
Carlo trial is independent from other trials. Therefore, this method can be categorized as an embarrassingly
parallel algorithm.
The Monte Carlo method of computing the number π can be illustrated with a geometrical model. Consider a quarter of a circle of radius R = 1 inscribed in a square with the side length L = R (see Figure 4.24). The surface area of the quarter circle is
\[ A_{\mathrm{quarter\ circle}} = \frac{1}{4}\pi R^2 \tag{4.8} \]
and the surface area of the square is
\[ A_{\mathrm{square}} = L^2. \tag{4.9} \]
Let us uniformly distribute a set of N random points over the surface area of the square. The mathematical expectation of the number of these points inside the quarter circle is
\[ \langle N_{\mathrm{quarter\ circle}} \rangle = \frac{A_{\mathrm{quarter\ circle}}}{A_{\mathrm{square}}}\,N. \tag{4.10} \]
Therefore,
\[ 4\,\frac{\langle N_{\mathrm{quarter\ circle}} \rangle}{N} = 4\,\frac{\pi R^2}{4L^2} = \pi. \tag{4.11} \]
[Figure 4.24: Monte Carlo calculation of the number π — a quarter circle of radius R = 1 inscribed in a unit square (L = 1); random points (Monte Carlo trials) are distributed over the unit square area, and the points inside the quarter circle area are counted.]
In a computer code, we can generate N uniformly distributed points and compute N_quarter circle. Using this relation, the estimate of π is
\[ \pi \approx 4\,\frac{N_{\mathrm{quarter\ circle}}}{N}. \tag{4.12} \]
The core algorithm for the number π calculation with this Monte Carlo method is expressed in Listing 4.74.
Listing 4.74: Serial algorithm of the number π calculation with Monte Carlo method (C source code)
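As the body of Listing 4.74 is not shown here, the following sketch illustrates the straightforward serial algorithm; it uses rand_r() purely for illustration (the optimized version based on MKL random number generators follows in Listing 4.75), and the function name and seed handling are assumptions of this example.

#include <stdlib.h>

// Naive serial Monte Carlo estimate of pi: count random points of the unit
// square that fall inside the quarter circle of radius 1 (see equation 4.12).
double EstimatePiSketch(const long iter) {
  long dUnderCurve = 0;
  unsigned int seed = 2375041;  // seed value borrowed from Listing 4.75
  for (long i = 0; i < iter; i++) {
    const float x = (float)rand_r(&seed) / (float)RAND_MAX;
    const float y = (float)rand_r(&seed) / (float)RAND_MAX;
    if (x*x + y*y < 1.0f) dUnderCurve++;  // point is inside the quarter circle
  }
  return 4.0 * (double)dUnderCurve / (double)iter;
}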
Before we proceed to the MPI implementation of this code, let us optimize it using the methods described
earlier in this chapter. This code is bottlenecked by the generation of random numbers. We can improve the
performance of random number generation using one of the SIMD-capable random number generators (RNGs)
from the Intel MKL. These RNGs perform best when they generate an array of random numbers, as opposed
to one number at a time. In order to accommodate this requirement in the code, we must divide the iteration
space into sufficiently large blocks and transform the for-loop into two nested loops: one iterating over blocks,
another operating within a block. This optimization is nothing more than the strip-mining technique discussed
in Section 4.4.4. The resulting optimized code is shown in Listing 4.75.
 1 #include <mkl_vsl.h>
 2
 3 //...
 4
 5 const long iter=1L<<32L;
 6 const long BLOCK_SIZE=4096;
 7 const long nBlocks=iter/BLOCK_SIZE;
 8 const int seed = 2375041;
 9
10 // Random number generator from MKL
11 VSLStreamStatePtr stream;
12 vslNewStream( &stream, VSL_BRNG_MT19937, seed );
13
14 for (long j = 0; j < nBlocks; j++) {
15
16   vsRngUniform( 0, stream, BLOCK_SIZE*2, r, 0.0f, 1.0f );
17   for (i = 0; i < BLOCK_SIZE; i++) {
18     const float x = r[i];
19     const float y = r[i+BLOCK_SIZE];
20     // Count points inside quarter circle
21     if (x*x + y*y < 1.0f) dUnderCurve++;
22   }
23 }
Listing 4.75: Serial core algorithm of the number π calculation with Monte Carlo method (C source code)
Note that the random coordinates of a point (x and y) are obtained from the array of random numbers r in such a way that unit-stride access to r is ensured. This is done in order to enable automatic vectorization of the inner loop.
• MPI is used to distribute the work across multiple processes, and MPI initialization and termination functions are called at the beginning and the end of the calculation.
• Each process computes only a fraction of the work, and this fraction is the same for all processes. The variable blocksPerProc is the average number of blocks for which each process must run the Monte Carlo simulation. This variable is declared as a floating-point number in order to deal with situations where the number of blocks is not a multiple of the number of MPI processes.
• The variables myFirstBlock and myLastBlock contain the range of blocks for each process. These are integer variables computed in such a way that on average, each MPI process works on a total of blocksPerProc blocks. Note that computing the beginning and ending block is not necessary in the calculation of the number π, because blocks are not associated with any data. However, we use these quantities in order to make our approach applicable to other loop-centric problems.
• Every MPI process initializes the random number generator with a different random seed.
• An MPI_SUM reduction collects the total number of points within the quarter circle area from all MPI processes into the variable UnderCurveSum. Only the MPI process with rank=0 performs the final computation of the number π.
• For a better timing estimate, the calculation is run multiple times (nTrials=10).
 1 #include <mpi.h>
 2 #include <stdio.h>
 3 #include <stdlib.h>
 4 #include <mkl_vsl.h>
 5
 6 const long iter=1L<<32L, BLOCK_SIZE=4096L, nBlocks=iter/BLOCK_SIZE, nTrials = 10;
 7
 8 void RunMonteCarlo(const long firstBlock, const long lastBlock,
 9                    VSLStreamStatePtr & stream, long & dUnderCurve) {
10   // Performs the Monte Carlo calculation for blocks in range [firstBlock; lastBlock)
11   // to count the number of random points inside of the quarter circle
12   float r[BLOCK_SIZE*2] __attribute__((align(64)));
13   for (long j = firstBlock; j < lastBlock; j++) {
14     vsRngUniform( 0, stream, BLOCK_SIZE*2, r, 0.0f, 1.0f );
15     for (long i = 0; i < BLOCK_SIZE; i++) {
16       const float x = r[i];
17       const float y = r[i+BLOCK_SIZE];
18       if (x*x + y*y < 1.0f) dUnderCurve++; // Count points inside quarter circle
19     }
20   }
21 }
22
23 int main(int argc, char *argv[]){
24   int rank, nRanks, trial;
25   MPI_Init(&argc, &argv);
26   MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
27   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
28
29
30   const double blocksPerProc = (double)nBlocks / (double)nRanks;
31
34
35   const long myFirstBlock = (long)(blocksPerProc*rank);
38
40   VSLStreamStatePtr stream;
First, let us establish the baseline of the performance of this code by running two benchmarks:
1. on the host system with two eight-core Intel Xeon E5-2670 processors with hyper-threading, using 32
MPI processes, and
2. on a 60-core Intel Xeon Phi B1PRQ-5110P coprocessor using 240 MPI processes.
The procedure for compiling and running the code is shown in Listing 4.77.
3.14160449    3.8e-06   0.844
3.14159184   -2.6e-07   0.846
3.14155282   -1.3e-05   0.846
3.14154737   -1.4e-05   0.839
3.14159142   -3.9e-07   0.846
user@host%
user@host% mpiicpc -mmic -mkl -o pi_mpi.mic pi_mpi.cc -vec-report
# pi          Rel.err   Time, s
user@host%
Listing 4.77: Compiling and running the π calculation on the host system and on an Intel Xeon Phi coprocessor.
The performance of the code on a single coprocessor is in this case 1.9x better than the performance on the host system. On the host, the calculation takes on average 0.84 seconds, and on the coprocessor it takes 0.44 seconds. The number of blocks processed in the course of this time is represented by the variable nBlocks = 2³²/4096 = 2²⁰. This translates to a performance of 2²⁰/0.84 s ≈ 1.2·10⁶ blocks per second on the host and 2²⁰/0.44 s ≈ 2.4·10⁶ blocks per second on the coprocessor.
Next, we will benchmark the code using both the host processor and the coprocessor. The theoretical maximum performance that we can expect from this combination is 1.2·10⁶ + 2.4·10⁶ = 3.6·10⁶ blocks per second. Accordingly, the theoretical minimum runtime is 2²⁰/(3.6·10⁶) ≈ 0.29 s. However, as we will see shortly, achieving this performance requires some additional measures in order to improve load balancing. See Figure 4.27 for a summary of these performance results.
Load balancing is not an issue for a homogeneous system such as an Intel Xeon processor or an Intel Xeon Phi coprocessor. However, problems begin when we try to utilize the coprocessor (or several coprocessors) simultaneously with the host. As we estimated above, the theoretical minimum runtime for this code when it runs on the host together with the coprocessor is 0.29 s, which is 50% better than on the coprocessor alone. However, as shown in Listing 4.78, when we split the work between the host and the coprocessor, the runtime is 0.37 seconds, which is only 20% better than on the coprocessor alone.
user@host% mpirun -np 32 -host localhost ./pi_mpi : -np 240 -host mic0 ~/pi_mpi.mic
# pi          Rel.err   Time, s
3.14156523   -8.7e-06   0.885
3.14156675   -8.2e-06   0.376
3.14159230   -1.1e-07   0.360
3.14161923    8.5e-06   0.378
3.14161673    7.7e-06   0.379
3.14157532   -5.5e-06   0.359
3.14156212   -9.7e-06   0.364
3.14158508   -2.4e-06   0.359
3.14153097   -2.0e-05   0.359
3.14160180    2.9e-06   0.359
user@host%
Listing 4.78: Heterogeneous calculation of the number π on the host system plus an Intel Xeon Phi coprocessor.
In order to understand the cause of the lower than expected performance, we will visualize the load balance using the Intel Trace Analyzer and Collector Event Timeline Charts. Listing 4.79 demonstrates how to collect data for the Intel Trace Analyzer and Collector during the run of the application.
Listing 4.79: Intel MPI execution of pi_mpi with data collection for Intel Trace Analyzer and Collector.
The event timeline chart corresponding to this run is shown in Figure 4.25. In this chart, blue regions correspond to computation, and red regions correspond to waiting for MPI communication. The two bands shown in the figure group the information for the host processes (top) and the coprocessor processes (bottom).
Figure 4.25: Intel Trace Analyzer and Collector Event Timeline Chart for heterogeneous Intel MPI run with 32 processes
on the host and 240 on the coprocessor, grouped by the host name.
The red regions in the top band in Figure 4.25 are the key to understanding the cause of the lower than expected performance of the heterogeneous calculation. These regions indicate that the host was waiting for the coprocessor for the majority of the elapsed time. This waiting is caused by the fact that the source code in Listing 4.76 does not differentiate between MPI processes running on the host and on the coprocessor. Therefore, the total number of iterations iter is evenly distributed between all 240 + 32 = 272 processes of the MPI_COMM_WORLD communicator. Because the host processor is slower than the coprocessor by a factor of 0.84 s/0.44 s = 1.9, but receives only a fraction of 32/272 = 0.12 of the total work, the MPI processes on the host finish their share of the work early and then idle, waiting for the coprocessor processes to complete theirs.
Load imbalance in this problem is a consequence of the heterogeneous nature of the computing system comprised of an Intel Xeon processor and an Intel Xeon Phi coprocessor. In the next section we discuss how to correct this imbalance.
 1 // ...
 2
 3 // Count the number of processes running on the host and on coprocessors
 4 int nProcsOnMIC, nProcsOnHost, thisProcOnMIC=0, thisProcOnHost=0;
 5 #ifdef __MIC__
 6 thisProcOnMIC++;  // This process is running on an Intel Xeon Phi coprocessor
 7 #else
 8 thisProcOnHost++; // This process is running on an Intel Xeon processor
 9 #endif
10 MPI_Allreduce(&thisProcOnMIC,  &nProcsOnMIC,  1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
11 MPI_Allreduce(&thisProcOnHost, &nProcsOnHost, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
12
13 // Work sharing calculation
14 const char* alphast = getenv("ALPHA"); // Load balance parameter
15 if (alphast==NULL) { printf("ALPHA is undefined\n"); exit(1); }
16 const double alpha = atof(alphast);
17 #ifndef __MIC__
18 // Blocks per rank on host
19 const double blocksPerRank =
20   ( nProcsOnMIC > 0 ? alpha*nBlocks/(alpha*nProcsOnHost+nProcsOnMIC) :
21     (double)nBlocks/nProcsOnHost );
22
23 const int rankOnDevice = rank;
24 #else
28
29 #endif
30
31 // Range of blocks processed by this process
32 const long myFirstBlock = blockOffset + (long)(blocksPerRank*rankOnDevice);
35
36 // Create and initialize a random number generator from MKL
37 VSLStreamStatePtr stream;
38 vslNewStream(&stream, VSL_BRNG_MT19937, rank*nTrials + t);
39 RunMonteCarlo(myFirstBlock, myLastBlock, stream, dUnderCurve);
40 vslDeleteStream( &stream );
41
42 // Reduction to collect the results of the Monte Carlo calculation
43 MPI_Reduce(&dUnderCurve, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
44
45 // Compute pi
46 if (rank==0) {
47   const double pi = (double) UnderCurveSum / (double) iter * 4.0 ;
48   // ...
49 }
Listing 4.80: Monte Carlo code modified to statically balance the load according to a user-defined parameter α.
In the code in Listing 4.80, the parameter α is read from the environment variable ALPHA. This parameter is quantitatively defined as
\[ \alpha = \frac{b_{\mathrm{host}}}{b_{\mathrm{MIC}}}, \tag{4.13} \]
where b_host and b_MIC are the numbers of blocks per process on the host and on the coprocessor, respectively. These values are represented by the variable blocksPerRank in the code. A reminder: in our implementation, a block is a minimal partition of the problem, equal to a set of 4096 random points.
The value α = 1.0 reproduces the case of Listing 4.76, where every MPI process is assigned the same amount of work. Values α > 1.0 assign more work to each process running on the host than to each process running on the coprocessor. Correspondingly, for α < 1.0, each host process receives less work than each coprocessor process. The optimal value of α depends on the specific problem and the computing system components (the number of coprocessors, the clock frequency of the host processors, etc.).
In order to compute b_host and b_MIC for a given value of α, the following relationship must be used:
\[ B_{\mathrm{total}} = b_{\mathrm{host}} P_{\mathrm{host}} + b_{\mathrm{MIC}} P_{\mathrm{MIC}}, \tag{4.14} \]
where P_host is the number of MPI processes running on host processors, P_MIC is the number of processes running on coprocessors, and B_total is the total number of blocks in the problem. In the code, these quantities are represented by the variables nProcsOnHost, nProcsOnMIC and nBlocks, respectively. From these relations, the amounts of work per process on the host and on the coprocessor are expressed as
\[ b_{\mathrm{host}} = \frac{\alpha B_{\mathrm{total}}}{\alpha P_{\mathrm{host}} + P_{\mathrm{MIC}}}, \tag{4.15} \]
\[ b_{\mathrm{MIC}} = \frac{B_{\mathrm{total}}}{\alpha P_{\mathrm{host}} + P_{\mathrm{MIC}}}. \tag{4.16} \]
Note that in order to count the number of MPI processes running on the host and on coprocessors, target-specific code is used to increment the respective process counter in lines 5-9 of Listing 4.80, and then an all-to-all reduction with MPI_Allreduce is performed (in lines 10-11). Target-specific code is discussed in Section 2.2.6.
As mentioned above, α = 1.0 reproduces the work sharing scheme of the unoptimized code in Listing 4.76. Generally, in a heterogeneous system comprised of host processors and Intel Xeon Phi coprocessors, better load balancing can be achieved by using a value of α other than 1.0. The optimal value depends on the performance of the calculation on each of the platforms and on the system configuration. It is possible to estimate the optimal value of α from the event timeline chart in ITAC shown in Figure 4.25. In this figure, two numbers are highlighted with green ellipses. These numbers are the average times of the main loop calculation on the host and on the coprocessor. The ratio of those two numbers is 3.18/0.893 ≈ 3.56. Therefore, if the original code is modified so that MPI processes on the host execute 3.56 times more iterations, there will be less time wasted by host processes on waiting. This corresponds to the value α = 3.56.
It is also possible to determine the optimal value of α with a calibration study that tests the performance for multiple values of this parameter. Figure 4.26 shows the execution time of the Monte Carlo code from Listing 4.80 as a function of the parameter α. This empirical approach yields an optimal α between 3.3 and 3.7, which is close to the estimate that we obtained using the Intel Trace Analyzer and Collector.
In practical HPC applications, the determination of the optimal α can be performed either by running a small calibration workload prior to starting a production calculation, or “on the fly”, by measuring the performance of the computing kernels on all devices of the heterogeneous system.
[Figure 4.26 (plot): “Effect of load balancing between host and coprocessor in the Monte Carlo calculation of π” — run time in s (lower is better) versus parameter α (0 to 8), with the regions of load imbalance on the host and on the coprocessor and the baseline (no load balancing) indicated.]
Figure 4.26: Load balance: execution time (in seconds) of the number π calculation with the Monte Carlo method as a function of the parameter α (the ratio of the amount of work per process on the host to the amount of work per process on the coprocessor).
In Figure 4.27, we summarize the performance results for the Monte Carlo calculation of π on the host with two Intel Xeon E5-2670 processors, on an Intel Xeon Phi B1PRQ-5110P coprocessor, and on both devices simultaneously.
[Figure 4.27 (bar chart): time in s (lower is better) — Xeon only (32 processes): 0.839 s; Xeon Phi only (240 processes): 0.449 s; Xeon + Xeon Phi, α = 1.0: 0.366 s; Xeon + Xeon Phi, α = 3.4: 0.283 s.]
Figure 4.27: Load balance: execution time (in seconds) of the Monte Carlo calculation of π with 32 MPI processes on an Intel Xeon host, with 240 processes on an Intel Xeon Phi coprocessor, with unbalanced execution using both the host and the coprocessor, and, finally, using both devices with static load balancing (α = 3.4).
g
The drawback of dynamic load balancing schemes is that from one run to another, the distribution of
an
work across MPI processes may vary depending on the runtime conditions and MPI message arrival times.
W
As a consequence, the exact result of a calculation is not reproducible if the calculation involves random
exclusively on integers.
e ng
numbers or not precisely associative operations. This is true of almost all applications except those operating
nh
In this section we will show how to implement dynamic load balancing in the boss-worker model using Intel MPI communication between ranks. The boss (one of the MPI processes) is dedicated to assigning parts of the problem (called “chunks” in this context) to the workers (the rest of the MPI processes in the run). When a worker finishes its assigned chunk of work, it reports back to the boss to receive either another chunk, or a message indicating the end of the calculation.
In order to illustrate dynamic load balancing in MPI, we will use the same example problem as in Section 4.7.3, the calculation of π with a Monte Carlo method. The core of the calculation remains the same; however, additional communication is included in the code of each process. Before a worker begins calculation, it waits for a message from the boss (rank 0) with the values of begin and end, indicating the first and last block that the worker must process. When all of the blocks are processed, the worker waits for more messages, until the received message contains a negative value for begin and end, which indicates the end of the calculation. This scheme is similar to the dynamic scheduling mode for OpenMP loops (see Section 4.4.3). The code that implements this scheduling algorithm is shown in Listing 4.81.
 1 // ...
 2 long dUnderCurve = 0, UnderCurveSum = 0;
 3
 4 if (rank == 0) {
 5
 6   // Boss assigns work
 7   const char* grainSizeSt = getenv("GRAIN_SIZE");
 8   if (grainSizeSt == NULL) { printf("GRAIN_SIZE undefined\n"); exit(1); }
 9   grainSize = atof(grainSizeSt);
10   long currentBlock = 0;
11   while (currentBlock < nBlocks) {
12     msg[0] = currentBlock;                   // First block for next worker
13     msg[1] = currentBlock + grainSize;       // Last block
14     if (msg[1] > nBlocks) msg[1] = nBlocks;  // Stay in range
15     currentBlock = msg[1];                   // Update position
16
17     // Wait for next worker
18     MPI_Recv(&worker, 1, MPI_INT,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD, &stat);
19
20     // Assign work to next worker
21     MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
22   }
23
24   // Terminate workers
25   msg[0] = -1; msg[1] = -2;
26   for (int i = 1; i < nRanks; i++) {
27     MPI_Recv(&worker, 1, MPI_INT,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD, &stat);
28     MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
29   }
30 } else {
31
34
35   float r[BLOCK_SIZE*2] __attribute__((align(64)));
36
37   // Range of blocks processed by this worker
38   msg[0] = 0;
Listing 4.81: Boss-worker model used for dynamic load balancing of the Monte Carlo simulation.
The tuning parameter for this code is the size of the chunk, determined by the environment variable GRAIN_SIZE. The chunk size must be small enough that the number of chunks is much greater than the number of workers. This enables an even load distribution across the workers. At the same time, the chunk size must be large enough that the workers do not have to communicate with the boss too often. Therefore, we expect that there is a window of values of the chunk size in which the performance is optimal.
For our example, BLOCK_SIZE = 4096, and the total number of iterations is 2³², so the number of blocks in the problem is 2³²/4096 = 2²⁰ ≈ 10⁶. So, for GRAIN_SIZE=1, the number of MPI point-to-point communications between the boss and the workers is of order 10⁶. For GRAIN_SIZE = 10⁶/272 ≈ 4·10³, the number of chunks is equal to the number of workers in the system comprised of dual host processors and a single 60-core Intel Xeon Phi coprocessor. Therefore, we expect that the performance will be optimal for GRAIN_SIZE somewhere between 1 and 4·10³.
Running the Monte Carlo calculation in Intel MPI requires some care with the placement of the boss process. The boss is the process with rank equal to 0, so this process must be the first one specified in the command line or in the machine file. Furthermore, in an optimized calculation, the boss process will not use much CPU time and does not have to occupy a whole logical core. Therefore, on the host (two eight-core processors with two-way hyper-threading), we can launch 2 × 8 × 2 = 32 workers and one boss.
Important: by default, Intel MPI pins processes to certain parts of the compute devices (sockets, cores, threads). This pinning generally improves performance and should be used for performance-critical calculations. However, the boss process must be unpinned. Otherwise, one of the worker processes on the host will be pinned to the same logical core as the boss, which may throttle the scheduling workflow. Listing 4.82 demonstrates how to achieve that in a heterogeneous calculation that employs the host and the coprocessor.
user@host% mpirun \
Listing 4.82: Compiling and running the Monte Carlo calculation of π with dynamic load scheduling. Starting one
unpinned boss process on the host, 32 worker processes on the host and 240 workers on the coprocessor.
We measured the performance of the code with dynamic scheduling for a range of values of GRAIN_SIZE. The results are shown in Figure 4.28. The plot shows the total calculation time along with additional performance metrics: (i) the MPI communication time is the average time in worker processes between the start of sending a request to the boss process and receiving a chunk of work to process; (ii) the unbalanced time per process is the average time in worker processes between the end of the work and the reduction of results to the boss (a barrier was used before the reduction). The latter metric reflects the time that processes that received too little work spend waiting for processes that received too much work.
[Figure 4.28 (plot): “Effect of Grain Size on Dynamic Scheduling in Heterogeneous Monte Carlo Calculation of π” — time in s (lower is better) versus grain size, showing the total computation time, the MPI communication time and the unbalanced time (average per process) on the host and on the coprocessor; the theoretical best time is indicated.]
Figure 4.28
As expected, for low values of GRAIN_SIZE, the runtime is poor due to excessive MPI communication for scheduling. This is illustrated in the Intel Trace Analyzer and Collector timeline chart in Figure 4.29, obtained for a run with GRAIN_SIZE=1 and only 32 workers on the host. In fact, according to this figure, due to message contention on the boss process, some of the workers never receive any work and remain idle for the duration of the calculation. This wrecks performance by reducing the amount of hardware parallelism available to this application.
For large grain sizes, performance is poor due to unbalanced load. The boss quickly hands out large chunks of work without regard for the number of workers expecting assignments. As a result, all work may be handed out before the last worker receives a chunk. In a less severe case of imbalance, some workers will receive two or three chunks while others receive only one or two.
The “sweet spot” for performance appears to be for GRAIN_SIZE between 1000 and 3000. However, the best performance achieved in this parameter window is 0.40 s per run, which is worse than the 0.29 s per run achieved with static load balancing.
The ability of dynamic balancing to evenly distribute work across the processor and the coprocessor is limited by the number and computational cost of iterations and by the MPI communication latency. However, the situation may be remedied with the help of OpenMP. Indeed, if we run fewer MPI processes, then a larger GRAIN_SIZE can be used, for which the MPI communication throughput is not saturated, but the load is balanced. If OpenMP is used in each of these processes, then all available cores can still be occupied for parallel computing. We discuss this optimization in the next section.
Figure 4.29: Intel Trace Analyzer and Collector screenshot of dynamic load balancing MPI communication. In this
example, GRAIN_SIZE is too small for the available MPI communication throughput. As a result, due to message
contention on the boss process, some workers never receive work and remain idle for the duration of the calculation.
 1 if (rank != 0) {
 2   // Worker performs the Monte Carlo calculation
 3
 4   // Create and initialize a random number generator from MKL
 5   VSLStreamStatePtr stream[omp_get_max_threads()];
 6 #pragma omp parallel
 7   {
 8     // Each thread gets its own random seed
 9     const int randomSeed = nTrials*omp_get_thread_num()*nRanks + nTrials*rank + t;
10     vslNewStream(&stream[omp_get_thread_num()], VSL_BRNG_MT19937, randomSeed);
11   }
12
13   msg[0] = 0;
22   {
25
26     for (long j = firstBlock; j < lastBlock; j++) {
27       vsRngUniform( 0, stream[myThread], BLOCK_SIZE*2, r, 0.0f, 1.0f );
28       for (long i = 0; i < BLOCK_SIZE; i++) {
29         const float x = r[i];
30         const float y = r[i+BLOCK_SIZE];
31         // Count points inside quarter circle
32         if (x*x + y*y < 1.0f) dUnderCurve++;
33       }
34     }
35   }
36 }
37
38 #pragma omp parallel
39 {
40   vslDeleteStream( &stream[omp_get_thread_num()] );
41 }
42 }
Listing 4.83: Worker code for hybrid MPI/OpenMP Monte Carlo calculation of the number π.
Note that in the thread-parallel implementation of the worker code, we make sure to assign an individual RNG to every thread. We also re-use the RNG for all chunks of work that the worker processes. This is preferable to initializing a new RNG for every chunk, because the overhead of initialization and the potential cross-correlation between random streams with different seeds are undesirable.
Listing 4.84 demonstrates how to compile and run the hybrid code using 16 threads per process on
the host and on the coprocessor. We could choose any other number of threads per process, but to avoid
over-subscribing the system, the product of the number of processes and the number of threads should be
equal to the number of logical cores on the respective device. Therefore, we launch two 16-threaded workers
on the host (2 × 16 = 32) and fifteen 16-threaded workers on the coprocessor (15 × 16 = 240).
user@host% mpirun \
 -host localhost -np 1 -env I_MPI_PIN 0 ./pi-dynamic-hybrid-host : \
 -host localhost -np 2 -env OMP_NUM_THREADS 16 ./pi-dynamic-hybrid-host : \
 -host mic0 -np 15 -env OMP_NUM_THREADS 16 ~/pi-dynamic-hybrid-mic
# pi          Rel.err   Time, s   GrainSize
3.14156822   -7.8e-06   0.781     16384
3.14159674    1.3e-06   0.312     16384
user@host%
Listing 4.84: Compiling and running the Monte Carlo calculation of π with dynamic load scheduling and OpenMP in the worker code. Starting one unpinned boss process on the host, two 16-threaded worker processes on the host and fifteen 16-threaded worker processes on the coprocessor.
The execution time in this hybrid OpenMP/MPI code is 0.31 s, which is significantly better than the
best case of 0.40 s with single-threaded MPI processes. The runtime achieved here is close to the theoretical
minimum runtime of 0.29 s.
In order to assess the range of the values of GRAIN_SIZE in which the performance is near the optimal
value, we performed benchmarks with multiple values of this parameter. For each value of GRAIN_SIZE, we
ran the code in four configurations:
1. Single-threaded MPI processes. In this case, 32 worker processes are launched on the host and 240
processes on the coprocessor. This is equivalent to the case discussed in Section 4.7.4.
2. MPI processes with 4 threads in each. In this setup, 8 worker processes run on the host and 60 worker
processes on the coprocessor.
3. MPI processes with 16 threads in each. This corresponds to 2 workers on the host and 15 workers on
the coprocessor.
4. A single MPI worker on the host with 32 threads and a single worker on the coprocessor with 240
threads.
[Figure 4.30 (plot): “Performance of Heterogeneous Hybrid Monte Carlo Calculation of π with Dynamic Scheduling” — time in s (lower is better) versus grain size (2¹ to 2¹⁹) for single-threaded MPI processes, 4 and 16 OpenMP threads per MPI process, and one 32-threaded MPI process on the host with one 240-threaded process on the coprocessor; the theoretical best time is indicated.]
Figure 4.30
As expected, using multiple OpenMP threads in worker processes improves performance. With 4 threads per worker, the performance is close to the theoretical maximum in the range of GRAIN_SIZE from 2⁹ to 2¹¹. As the number of threads per worker increases, the window of optimal performance expands and shifts towards greater values of GRAIN_SIZE.
1. First of all, the thread-safe version of Intel MPI Library must be linked by using the compiler flag
-mt_mpi.
2. MPI must be initialized with the call MPI_Init_thread(), as shown in Listing 4.85.
 1 int required=MPI_THREAD_SERIALIZED;
 2 int provided;
 3
 4 MPI_Init_thread(&argc, &argv, required, &provided);
 5
 6 if (provided < required){
 7   if (rank == 0)
 8     printf("Warning: MPI implementation provides insufficient threading support.\n");
 9   omp_set_num_threads(1);
10 }
MPI_THREAD_FUNNELED The process may be multi-threaded, but the application must ensure that only the main thread makes MPI calls.
MPI_THREAD_SERIALIZED The process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time.
The call to MPI_Init_thread() will set the value of the parameter provided to the granted level of threading support.
Listing 4.86 demonstrates the implementation of the boss process that performs load balancing with the guided scheduling algorithm.
 1 if (rank == 0) {
 2   // Boss assigns work
 3   const char* grainSizeSt = getenv("GRAIN_SIZE");
 4   if (grainSizeSt == NULL) { printf("GRAIN_SIZE undefined\n"); exit(1); }
 5   grainSize = atof(grainSizeSt);
 6   long currentBlock = 0;
16
18
19
20     // Assign work to next worker
21     MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
22
23     currentBlock = msg[1]; // Update position
24   }
25
26   // Terminate workers
27   msg[0] = -1; msg[1] = -2;
28   for (int i = 1; i < nRanks; i++) {
29     MPI_Recv(&worker, 1, MPI_INT,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD, &stat);
30     MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
31   }
32
33 } else {
34   // ...Worker code...
35 }
Quantitatively, the chunk size for any interaction with a worker is chosen using the following equation:
\[ B_{\mathrm{chunk}} = \max\!\left(\mathrm{GRAIN\_SIZE},\ \eta\,\frac{B_{\mathrm{total}} - B_{\mathrm{scheduled}}}{P_{\mathrm{total}}}\right), \tag{4.17} \]
where B_chunk is the number of blocks in the chunk assigned to the next worker, B_total is the total number of blocks in the iteration space, B_scheduled is the number of blocks already scheduled for processing by other workers, and P_total is the number of workers. The coefficient η is another parameter of the algorithm. If the ratio of the performance of the fastest worker to the performance of the slowest worker does not exceed the number of workers, then this coefficient can be set to η = 1. For a greater difference between the performance of individual workers, this parameter can be set to a value between 0 and 1. For the Monte Carlo calculation of π, we chose η = 0.5. With this value of η, even with only two workers, load balancing is achieved.
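Since the chunk-computation lines of Listing 4.86 are elided above, the fragment below sketches how the boss could evaluate equation (4.17) before each assignment. The variable names nBlocks, grainSize, currentBlock and msg follow Listing 4.86; nWorkers and eta are assumed names introduced for this example.

// Sketch of the guided chunk-size choice of equation (4.17).
const double eta = 0.5;                       // coefficient eta chosen in the text
const int  nWorkers  = nRanks - 1;            // all ranks except the boss
long remaining = nBlocks - currentBlock;      // B_total - B_scheduled
long guided    = (long)(eta * (double)remaining / (double)nWorkers);
long chunk     = (guided > grainSize ? guided : grainSize);  // max(GRAIN_SIZE, ...)
if (chunk > remaining) chunk = remaining;     // stay in range

msg[0] = currentBlock;          // first block for the next worker
msg[1] = currentBlock + chunk;  // end of the assigned range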
We benchmarked the Monte Carlo calculation of π with the guided work scheduling scheme shown in Listing 4.86 for a range of values of GRAIN_SIZE. For each value of GRAIN_SIZE, four cases were studied, just as in Section 4.7.5: single-threaded workers, 4-threaded workers, 16-threaded workers and the case with one 32-threaded worker on the host and one 240-threaded worker on the coprocessor. The results of the benchmark are shown in Figure 4.31.
[Figure 4.31 (plot): “Performance of Heterogeneous Hybrid Monte Carlo Calculation of π with Guided Scheduling” — time in s (lower is better) versus grain size (2¹ to 2¹⁹) for the same four configurations as in Figure 4.30; the theoretical best time is indicated.]
Figure 4.31
As expected, guided scheduling allows the heterogeneous application to utilize resources optimally in a
wide range of values of GRAIN_SIZE. In fact, the amount of MPI communication is reduced so much that
even GRAIN_SIZE= 1 achieves the theoretical best performance if a sufficient number of threads is used.
b) Choosing how much work must be stolen in each transaction.
c) Tuning the criterion based on which workers decide to stop contacting other workers and exit the calculation. This involves designing an algorithm that propagates the information about work completion across the MPI world.
d) Optimizing the selection of the worker to steal work from. In a heterogeneous architecture, it may be more efficient to contact workers from another platform with a greater probability, because load imbalance is more likely to occur because of the performance differences between the different platforms.
e) Instrumenting learning or another type of dynamic feedback to adjust the parameters of the work stealing scheme.
[Diagram: overview of Intel® MKL function domains — linear algebra (BLAS, Sparse BLAS and sparse solvers, LAPACK, ScaLAPACK, PBLAS); Fourier transforms (multidimensional FFTs up to 7D, FFTW interfaces, Cluster FFT); vector math (trigonometric, hyperbolic, exponential, logarithmic, power/root, rounding); random number generators (congruential, recursive, Wichmann-Hill, Mersenne Twister, Sobol, Niederreiter, non-deterministic); summary statistics (kurtosis, variation coefficient, quantiles and order statistics, min/max, variance-covariance); data fitting (splines, interpolation, cell search). The same source code using Intel MKL targets multicore CPUs, Intel Xeon Phi coprocessors, and clusters of multicore nodes.]
Earlier in our discussion we have given examples of workloads that use the Intel MKL (see Sections 4.2.4,
4.4.5 and 4.7.1). In this section, we outline the MKL usage models and provide general usage and optimization
advice. Complete documentation on the Intel MKL can be found in the Reference Manual [54].
We discuss the Intel MKL version 11.0 for Linux* OS. It supports computation on Intel Xeon Phi coprocessors in three modes of operation, discussed in detail in this section:
1. Automatic Offload (AO)
• No code change is required in order to offload calculations to an Intel Xeon Phi coprocessor;
• Automatically uses both the host and the Intel Xeon Phi coprocessor;
• The library takes care of data transfer and execution management.
2. Compiler Assisted Offload (CAO)
• Programmer maintains explicit control of data transfer and remote execution, using compiler offload pragmas and directives;
• Can be used together with Automatic Offload.
3. Native Execution
• Uses an Intel Xeon Phi coprocessor as an independent compute node.
• Data is initialized and processed on the coprocessor or communicated via MPI.
The operation modes discussed above enable heterogeneous computing, which takes advantage of both the multi-core host system and many-core Intel Xeon Phi coprocessors. The choice of operation mode makes it possible to execute previously developed legacy code employing the Intel MKL without modification, or with fine control over the compute devices, if such an approach is required by the problem.
• Sparse BLAS Level 1, 2, and 3 (basic operations on sparse vectors and matrices)
• LAPACK routines for solving systems of linear equations
• LAPACK routines for solving least squares problems, eigenvalue and singular value problems, and Sylvester's equations
• Auxiliary and utility LAPACK routines
• ScaLAPACK computational, driver and auxiliary routines (only in Intel MKL for Linux* and Windows* operating systems)
• PBLAS routines for distributed vector, matrix-vector, and matrix-matrix operations
• Vector Mathematical Library (VML) functions for computing core mathematical functions on vector arguments (with Fortran and C interfaces)
• Vector Statistical Library (VSL) functions for generating vectors of pseudorandom numbers with different types of statistical distributions and for performing convolution and correlation computations
• General Fast Fourier Transform (FFT) functions, providing fast computation of the Discrete Fourier Transform via the FFT algorithms, with Fortran and C interfaces
• Cluster FFT functions (only in Intel MKL for Linux* and Windows* operating systems)
• Tools for solving partial differential equations: trigonometric transform routines and a Poisson solver
• Optimization solver routines for solving nonlinear least squares problems through trust region algorithms and computing the Jacobi matrix by central differences
• Basic Linear Algebra Communication Subprograms (BLACS) that are used to support a linear algebra oriented message passing interface
• Data fitting functions for spline-based approximation of functions, derivatives and integrals of functions, and search
• GNU Multiple Precision (GMP) arithmetic functions
Figure 4.33: Web interface of the Intel MKL Link Line Advisor.
1 http://software.intel.com/sites/products/mkl/MKL_Link_Line_Advisor.html
Listing 4.87: Two ways to enable the Intel MKL Automatic Offload (AO).
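Since the body of Listing 4.87 is not shown above, a minimal sketch of the two approaches it describes is given below, mirroring the disable calls of Listing 4.88; treat the exact header and function availability as assumptions to verify against the MKL Reference Manual [54].

#include <mkl.h>

int main() {
  // Way 1: enable Automatic Offload from the code
  // (the counterpart of mkl_mic_disable() shown in Listing 4.88).
  mkl_mic_enable();

  // Way 2 (alternative): enable AO from the environment before the run:
  //   user@host% export MKL_MIC_ENABLE=1

  // ... MKL calls such as sgemm() may now be offloaded automatically ...
  return 0;
}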
In order for AO to work, the application must be compiled using the Intel C++ Compiler with support for the Intel Xeon Phi architecture. Nothing else needs to be done to use the coprocessor. The library will automatically detect available coprocessors, decide when it is beneficial to offload calculations to the coprocessor, transfer the data over the PCIe bus, and initiate the offloaded computation on the coprocessor.
In order to disable AO after it was previously enabled, use the corresponding support function call or environment variable, as shown in Listing 4.88.
user@host% export MKL_MIC_ENABLE=0

1 mkl_mic_disable();
Listing 4.88: Two ways to disable the Intel MKL Automatic Offload (AO).
For some functions, users can control the amount of work that must be performed on the host and on the coprocessor in the AO mode. This can be done by setting an environment variable, or calling the respective function, as shown in Listing 4.89.
Listing 4.89: Offload 50% of computation to the first Intel Xeon Phi coprocessor. Note: The support function calls take
precedence over environment variables.
The third argument of the function mkl_mic_set_workdivision() is the fraction of the work to be performed on the coprocessor (from 0 to 1), and the value of the environment variable MKL_MIC<card_number>_WORKDIVISION is the percentage (from 0 to 100). Work is measured in floating-point operations.
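As the body of Listing 4.89 is not reproduced above, the fragment below sketches both approaches under the assumption that the support function takes the target type, the coprocessor number, and the work fraction as its three arguments (consistent with the description of the third argument); check the exact signature and constants against the MKL Reference Manual [54].

#include <mkl.h>

// Sketch: ask MKL Automatic Offload to perform 50% of the work on coprocessor 0.
void SetHalfWorkOnCoprocessor() {
  mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5);

  // Environment-variable alternative (percentage, 0..100), set before the run:
  //   user@host% export MKL_MIC_0_WORKDIVISION=50
  // Note: the support function call takes precedence over the variable.
}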
Example
Listing 4.90 demonstrates the usage of automatic offload in an application using the Intel MKL.
15         A, &newLda, B, &newLda, &beta, C, &SIZE);
16   double time_start_AO = dsecnd();
17   for( k = 0; k < LOOP_COUNT; k++)
18   {
19     sgemm(&transa, &transb, &SIZE, &SIZE, &SIZE, &alpha,
20           A, &newLda, B, &newLda, &beta, C, &SIZE);
21   }
22   double time_end_AO = dsecnd();
23   double time_avg_AO = ( time_end_AO - time_start_AO )/LOOP_COUNT;
26
27 }
Listing 4.90: Fragment of Automatic Offload code with the sgemm function call from Intel MKL with corresponding performance calculations.
7
8 }
Listing 4.91: C/C++ example of Intel MKL Compiler Assisted Offload usage.
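Only the tail of Listing 4.91 appears above, so the fragment below sketches the basic CAO pattern it illustrates, following the offload clause syntax of Listing 4.92; the function name, matrix dimensions and variable names here are illustrative.

#include <mkl.h>

// Sketch: offload one SGEMM call to the coprocessor with CAO.
// All inputs are sent in and the result is brought back within a single offload.
void OffloadSgemmSketch(int N, float* A, float* B, float* C) {
  char transa = 'N', transb = 'N';
  float alpha = 1.0f, beta = 0.0f;
#pragma offload target(mic) \
  in(A : length(N*N)) in(B : length(N*N)) \
  out(C : length(N*N))
  {
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
  }
}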
 2 // Allocate matrices
 3
13 }
14
15 // Re-use the data of matrix A on the coprocessor (data persistence)
16 // and re-use the memory allocated for B and C (memory retention)
17 #pragma offload target(mic) \
18   in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
19   nocopy(A: length(NCOLA * LDA) alloc_if(0) free_if(0)) \
20   in(B: length(NCOLB * LDB) alloc_if(0) free_if(0)) \
21   out(C: length(N * LDC1) into(C1) alloc_if(0) free_if(0))
22 {
23   sgemm(&transa1, &transb1, &M, &N, &K, &alpha1,
24         A, &LDA, B, &LDB, &beta1, C, &LDC1);
25 }
26
27 // Deallocate A, B and C on the coprocessor
28 #pragma offload target(mic) \
29   nocopy(A:length(NCOLA * LDA) free_if(1)) \
30   nocopy(B:length(NCOLB * LDB) free_if(1)) \
31   nocopy(C:length(NCOLC * LDC) free_if(1))
32 { }
Listing 4.92: C/C++ example of Intel MKL Data Persistence at Compiler Assisted Offload.
... regular pattern.
For instance, on a 60-core coprocessor, the procedure shown in Listing 4.93 may produce better results for an SGEMM calculation than the default execution in the AO mode.
user@host% export MIC_OMP_NUM_THREADS=236
user@host% export MIC_ENV_PREFIX=MIC
user@host% export MIC_KMP_AFFINITY=compact,granularity=fine
user@host% export MIC_PLACE_THREADS=59C,4t
user@host% ./my-sgemm
Listing 4.93: Optimizing execution parameters for an AO application with SGEMM calculation on a coprocessor.
yP
For native applications, the prefix MIC_ENV_PREFIX does not affect the environment variable sharing.
el
In order to set environment variables for native applications, use one of the following methods:
iv
us
a) for native applications launched from a shell on the coprocessor (i.e., when you SSH into the µOS), use the
cl
b) for applications launched with the tool micnativeloadex, use the argument -e to pass an environment
variable to the coprocessor. Example:
user@host% micnativeloadex ./my-native-application -e "OMP_NUM_THREADS=120"
c) for MPI applications launched with micrun, use the argument -env to pass environment variables
user@host% micrun \
-host mic0 -env OMP_NUM_THREADS 120 -np 1 ./my-native-application : \
-host mic1 -env OMP_NUM_THREADS 120 -np 1 ./my-native-application
Chapter 5
Thank you for learning about Intel Xeon Phi coprocessor programming with “Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors” by Colfax International! We hope that, whatever scope of information you were looking for, you were able to find answers or pointers in this book. In this last brief chapter, we will summarize the key findings of our experience with Intel Xeon Phi coprocessor programming, and provide references for future learning.
5.1 Programming Intel® Xeon Phi™ Coprocessors is Not Trivial, but
Computing accelerators such as GPGPUs and Intel Xeon Phi coprocessors will be extremely important in the future on all levels of high performance computing, from workstation to exascale. The launch of the Intel Xeon Phi product family changed the landscape of computing accelerators by offering developers something new that GPGPUs cannot match. However, it is important to realize that this novelty is not the ease of programming. It is not trivial to achieve good performance with Intel Xeon Phi coprocessors, especially when one compares it to the performance of modern Intel Xeon processors with the Sandy Bridge architecture. The new truth that HPC programmers must learn is: if a parallel code does not perform fast on Intel Xeon Phi coprocessors, it probably is not doing very well on Intel Xeon processors, either. The flip side of this truth is that when developers invest time and effort into optimizing for the many-core architecture, they also reap performance benefits on multi-core processors. In this sense, optimization for the Intel MIC platform yields “double rewards” by also tapping more performance from the host system. That said, we concur with Intel's James Reinders, who expresses the “double advantage” in this way [2]:
The single most important lesson from working with Intel Xeon Phi coprocessors is this: the best way to prepare for
Intel Xeon Phi coprocessors is to fully exploit the performance that an application can get on Intel Xeon processors
first. Trying to use an Intel Xeon Phi coprocessor, without having maximized the use of parallelism on Intel Xeon
processor, will almost certainly be a disappointment.
...
The experiences of users of Intel Xeon Phi coprocessors . . . point out one challenge: the temptation to stop tuning before
the best performance is reached. . . . There ain’t no such thing as a free lunch! The hidden bonus is the “transforming-
and-tuning” double advantage of programming investments for Intel Xeon Phi coprocessors that generally applies
directly to any general-purpose processor as well. This greatly enhances the preservation of any investment to tune
working code by applying to other processors and offering more forward scaling to future systems.
In the programming and optimization examples presented throughout this book, we strived to convey
two important messages:
1) Optimization methods that benefit applications for Intel Xeon Phi coprocessors usually also improve
performance on Intel Xeon processors, and vice-versa. Consequently, an attractive feature of Intel Xeon
Phi coprocessors as accelerators is that the developer may write and optimize the computational kernel
code only once to run on the host system as well as on the target coprocessor.
2) High performance can be achieved by relying on automatic vectorization in the Intel C++ Compiler and
traditional parallel frameworks such as OpenMP and MPI. This means that
a) “Ninja programming” (i.e., low-level optimization that may involve assembly coding or the usage of
intrinsics) is not necessary for high performance with Intel Xeon Phi coprocessors [72]. In fact, tradi-
tional programming methods can lead to good performance, if the programmer follows the guidelines
for developing vectorizable, scalable code with data locality and infrequent synchronization.
b) A single source code can be used on today’s Intel Xeon processors, Intel Xeon Phi coprocessors, and future technologies based on x86-like architecture. In this sense, applications designed for the MIC architecture using common programming methods are “future-proof”.
We do hope that your experience with the adoption of Intel Xeon Phi coprocessors is as intellectually stimulating and enjoyable as it has been for us.
Colfax International is ready to offer you the opportunity to try using Intel Xeon Phi coprocessors and Intel software development tools. You can get access to computing systems equipped with Intel Xeon processors and Intel Xeon Phi coprocessors, and Intel software development tools by participating in the Colfax Developer Training, for which this book was written. The training is available in two formats:
1) Self-study course with remote access to computing systems hosted by Colfax International, and
2) Instructor-led classes, which can be taken at Colfax’s location, or brought to your company’s offices.
1) Another perspective on programming for the MIC architecture, more examples of high performance codes, and best practices advice from Intel’s senior engineers Jim Jeffers and James Reinders can be found in “Intel Xeon Phi Coprocessor High Performance Programming” [35]. The book has a Web site at http://www.lotsofcores.com/ [36].
2) For a solid foundation of traditional parallel programming methods with OpenMP and MPI, refer to “Parallel Programming in C with MPI and OpenMP” by Michael J. Quinn [39].
3) A new look at parallel algorithms and novel parallel frameworks are presented in “Structured Parallel Programming: Patterns for Efficient Computation” by Michael McCool, Arch D. Robison and James Reinders [37]. The Web site of the book is http://parallelbook.com/ [38].
4) In order to gain a better understanding of computer architecture in general, and specifically the architecture of Intel Xeon Phi coprocessors, refer to “Computer Architecture: A Quantitative Approach” by John L. Hennessy and David A. Patterson [1] and “Intel Xeon Phi Coprocessor System Software Developer’s Guide”, a publication by Intel [73].
Reference Guides
The following list is a collection of URLs for software development tool and programming framework reference guides.
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/cpp-lin/index.htm
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/lin/ug_docs/index.htm
http://software.intel.com/sites/products/documentation/hpc/ics/itac/81/ITA_Reference_Guide/index.htm
http://software.intel.com/sites/products/documentation/hpc/ics/itac/81/ITC_Reference_Guide/index.htm
Online Resources
1) The Intel Developer Zone has a portal for Intel Xeon Phi coprocessor developers with white papers, links
to products, forums and case studies, and other essential information [77]:
http://software.intel.com/mic-developer
2) This online article submitted by Intel’s Technical Consulting Engineer Wendy Doerner contains a wealth of
information on optimization for Intel Xeon Phi coprocessors in the form of blog posts, white papers and
presentation slides [78]:
http://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture
3) The YouTube channel Intel Software TV has published a set of video tutorials on Intel Xeon Phi coprocessor
programming [79]:
http://www.youtube.com/playlist?list=PLg-UKERBljNwuVuid_rhZ1yVUrTjC3gzx
Community Support
1) The forum “Intel Many Integrated Core Architecture” in the Intel Developer Zone is a great place to ask questions and exchange ideas [80]:
http://software.intel.com/en-us/forums/intel-many-integrated-core
This forum gets contributions from developers working with Intel Xeon Phi coprocessors, and it is also monitored by Intel’s engineers involved in the development of the MIC architecture.
2) Another forum in the Intel Developer Zone, “Threading on Intel Parallel Architectures” [81]
http://software.intel.com/en-us/forums/threading-on-intel-parallel-architectures
is a good place to communicate with peers about parallel programming, not necessarily in the context of the MIC architecture.
3) Find connections and stay updated on the latest news related to the MIC technology by joining the LinkedIn group:
http://www.linkedin.com/groups/Parallel-Computing-Intel-Xeon-Phi-4722265/about
Contact Us
If you have questions, ideas, suggestions, corrections, or need information about purchasing or test-driving computing systems with Intel Xeon Phi coprocessors, please refer to:
Appendix A
Practical Exercises
A.1 Exercises for Chapter 1
A.1.1 Power Management and Resource Monitoring
The following practical exercises and questions correspond to the material covered in Chapter 1, Section 1.4 – Section 1.5 (pages 18 – 31).
Goal
The following practical exercises will demonstrate some of the tools studied in Chapter 1 for getting information about the Intel Xeon Phi coprocessors and for monitoring what resources are being used.
Instructions
1. Most of the administrative tools and utilities can be found in the /opt/intel/mic/bin directory. Check if this location was added to your $PATH environment variable already.
Some of these utilities require superuser privileges. If you have not already modified the PATH environment variable, you should do so now to facilitate the path lookup for these tools:
user@host% export PATH=$PATH:/opt/intel/mic/bin
The above command will modify the PATH environment variable only for the current terminal. To apply these changes to all users, we need to create a script which will do it automatically. Use su or sudo to access the system folders.
pathmunge is a function from /etc/profile¹, which will add the path to the environment at startup. Thus, for the changes to take effect, we need to log out or reload the current profile.
¹ applicable to RHEL*/CentOS/Fedora Linux distributions
user@host% . /etc/profile
Question 1.1.a. What does the abbreviation MIC in the Intel MIC architecture stand for?
2. Let’s check if the MPSS service is running. Root privileges should be used.
Question 1.1.b. What command should be used to start the MPSS service, if it is not running?
3. While testing Intel Xeon Phi coprocessors, it is important to check the temperature regularly.
Question 1.1.c. What utility should be used to find out the current MIC core rail temperature?
4. After installing the MPSS and starting all the corresponding services, we can manually connect to the Intel Xeon Phi coprocessors, or we can run some automated tests.
Question 1.1.d. What command should be used to run diagnostics on the coprocessors?
5. A memory swapping mechanism has not been implemented for the Intel Xeon Phi coprocessors yet. Therefore, you should avoid situations that would cause the memory to overflow; otherwise, your application will crash with a runtime error.
Question 1.1.e. How can I tell how much and what type of memory is installed on the Intel Xeon Phi coprocessor(s)?
6. Intel will provide new versions of the software stack as they are developed and optimized for performance.
Question 1.1.f. What utility should be used to display the MIC Flash version?
7. Currently, up to eight Intel Xeon Phi coprocessors can be installed in one computational node.
Question 1.1.g. What should be done to reboot the second MIC card in a system (if several Intel Xeon Phi coprocessors are installed)?
8. For highly parallel applications it is useful to know the load of the individual cores for diagnostic, testing, and debugging purposes.
Question 1.1.h. Is there a way I can display the utilization per core on my Intel Xeon Phi coprocessors?
9. Every new version of MPSS usually requires re-flashing the Intel Xeon Phi coprocessors.
Question 1.1.i. If I want to re-flash my Intel Xeon Phi coprocessors with a new flash image, what utility
would I use?
10. Intel Xeon Phi coprocessors can be reconfigured for a specific network configuration, power management
policy, etc.
Question 1.1.j. Once MPSS has been installed, what is the utility that is used to create the coprocessor
configuration files? And where are these configuration files created?
Answers
Answer 1.1.c.
user@host% watch -n 1 micsmc -t
The -n 1 parameter tells watch to run the micsmc -t command every second, thus providing a convenient temperature monitoring tool.
Answer 1.1.d. miccheck
To check if the Intel Xeon Phi coprocessor is running properly, we can run miccheck, which will test the standard unit operations.
user@host% miccheck
Answer 1.1.e. micinfo provides memory information.
root@host% micctrl -r
root@host% micctrl -w
root@host% micflash -GetVersion
Answer 1.1.h. micsmc -cores, or ssh mic0; top and press “1”
Answer 1.1.j. micctrl -initdefaults and micctrl -resetconfig to remove and recreate a default configuration from the current MIC configuration files, which are stored at /etc/sysconfig/mic/mic<N>.cfg
A.1.2 Networking on Intel® Xeon Phi™ Coprocessors
The following practical exercises and questions correspond to the material covered in Section 1.5.2 –
Section 1.5.4 (pages 31 – 37).
Goal
The following practical exercises will show how to communicate with the Intel Xeon Phi coprocessors over SSH, and will provide detailed instructions on using an NFS-shared mount of the Intel MPI folder on the coprocessors.
Instructions
1. Generate SSH RSA and DSA keys and copy them to the Intel Xeon Phi coprocessor by reinitializing the
configuration files. Before we can do anything with the configuration files, however, we need to stop the
MPSS service, and restart it after we are done.
user@host% ssh-keygen
... follow the instructions ...
user@host% ssh-keygen -t dsa
... follow the instructions ...
user@host% sudo service mpss stop
Shutting down MPSS Stack: [ OK ]
user@host% sudo micctrl --resetconfig
user@host% sudo service mpss start
Starting MPSS Stack: [ OK ]
mic0: online (mode: linux image: /lib/firmware/mic/uos.img)
mic1: online (mode: linux image: /lib/firmware/mic/uos.img)
These actions are required for each newly created user. Once the SSH keys have been created, they will be picked up and copied to the coprocessors’ file system whenever the MPSS configuration is reinitialized.
2. Login to the Intel Xeon Phi coprocessor with the ssh command and run the following commands to
find the specifications of the device(s).
...
user@mic0% uname -a
3. Next we will use an NFS mount to access /opt/intel/impi, the Intel MPI folder on the host. It will be needed later on for the Intel MPI labs.
(a) If iptables is enabled, allow traffic on ports 111 and 2049. Otherwise disable it, if it will not
compromise the security of the host.
user@host% sudo service iptables stop
(b) Check the status of services and install/start them, if any of them are stopped or missing.
root@host% sudo yum install nfs-utils
root@host% service rpcbind start
root@host% service nfslock start
root@host% service nfs start
(c) On the host system, modify the /etc/exports file. We assume you have two Intel Xeon Phi
coprocessors, otherwise, use only mic0 settings. Add the following line to the file:
# add this to the /etc/exports file
/opt/intel/impi mic0(rw,no_root_squash) mic1(rw,no_root_squash)
This can be done with your favorite text editor, for instance vi, or in the following way:
root@host% \
% echo ’/opt/intel/impi mic0(rw,no_root_squash) mic1(rw,no_root_squash)’ \
% >> /etc/exports
Warning: Be sure to use two “greater than” signs to append the line. If only one “greater than” sign is used, you will overwrite the contents of the file!
(d) Add the line ALL: mic0,mic1 to the /etc/hosts.allow file:
root@host% echo ’ALL: mic0,mic1’ >> /etc/hosts.allow
(e) Re-export the NFS shares:
root@host% exportfs -a
(f) Everything is ready on the host for the export. Now we need to configure the coprocessor side to mount the NFS share.
(h) As root, create the mount folder on the coprocessor and mount the NFS share.
root@host% ssh mic0
root@mic0% mkdir -p /opt/intel/impi
root@mic0% mount -a
root@mic0% ls /opt/intel/impi
4. If we succeeded, you should see the Intel MPI folders mounted from the host. However, the mount will disappear the next time MPSS restarts. Thus, we need to change the MPSS files on the host to apply these mount settings automatically.
root@host% cd /opt/intel/mic/filesystem
root@host% \
% echo ’host:/opt/intel/impi /opt/intel/impi nfs rsize=8192,wsize=8192,nolock,intr 0 0’ \
% >> mic0/etc/fstab
root@host% \
% echo ’host:/opt/intel/impi /opt/intel/impi nfs rsize=8192,wsize=8192,nolock,intr 0 0’ \
% >> mic1/etc/fstab
root@host% mkdir -p mic0/opt/intel/impi
root@host% mkdir -p mic1/opt/intel/impi
root@host% echo ’dir opt 755 0 0’ >> mic0.filelist
root@host% echo ’dir opt/intel/ 755 0 0’ >> mic0.filelist
root@host% echo ’dir opt/intel/impi 755 0 0’ >> mic0.filelist
root@host% echo ’dir opt 755 0 0’ >> mic1.filelist
root@host% echo ’dir opt/intel/ 755 0 0’ >> mic1.filelist
root@host% echo ’dir opt/intel/impi 755 0 0’ >> mic1.filelist
Goal
This practical exercise will show how to compile and link simple source code for native Intel Xeon Phi coprocessor execution, how to use the micnativeloadex tool for automatic native execution and resolution of library dependencies, and how to monitor activity on the Intel Xeon Phi coprocessor.
Preparation
Before linking and compiling any source code, we need to be sure that the compiler is installed in the system and the environment is set up properly.
1. In the terminal, execute the following command to check if the Intel C Compiler is installed:
user@host% which icc
/opt/intel/composer_xe_2013.1.117/bin/intel64/icc
2. As was previously described in Section 1.4.3, it is essential to properly set up the environment variables for the Intel C Compiler and Intel C++ Compiler with the ia32 or intel64 option, which should be passed as an argument to the environment setup script (compilervars.sh). This script will export the environment variables of the compilers, Intel Threading Building Blocks (TBB), Intel MKL, and others. For convenience’s sake, this command line can be added to the .profile or another shell startup script, if it is not already present.
Instructions
1. Link and compile the source code hello.c (code Lab B.2.1.1), which can be found in the labs folder:
labs/2/2.1-native/hello.c
user@host% cd ~/labs/2/2.1-native/
user@host% icc -o hello hello.c
user@host% ./hello
Hello world! I have 32 logical cores.
2. Next, compile it with the -mmic flag to make it natively executable for the Intel Xeon Phi coprocessor. Note that the resultant binary file can only be executed on the Intel Xeon Phi coprocessor; it cannot be executed on the host system, as shown in the listing above.
3. The Intel Xeon Phi coprocessor is an IP-addressable PCIe device running an independent Linux-based µOS with an SSH server daemon installed. So, we can use scp to copy the hello.MIC file to the home folder on the card. Aliases and IP addresses for the devices can be found in the /etc/hosts file on the host.
Connect to the Intel Xeon Phi coprocessor through SSH, and execute the binary file locally (native
execution model):
4. We will use the micnativeloadex tool next, which can be used to upload a native application and
related dependent libraries to the Intel Xeon Phi coprocessor upon execution.
5. It also can be used to find detailed information about the binary target and library dependencies:
/home/user/labs/2/2.1-native/hello.MIC
SINK_LD_LIBRARY_PATH = /opt/intel/composer_xe_2013.0.079/compiler/lib/mic/:
/opt/intel/mic/filesystem/:/opt/intel/impi/4.1.0/mic/lib/lib:
/opt/intel/impi/4.1.0/mic/bin/
Dependencies Found:
(none found)
Dependencies Not Found Locally (but may exist already on the coprocessor):
libm.so.6
libgcc_s.so.1
libc.so.6
libdl.so.2
Note: If the binary file cannot be executed due to missing dependencies, micnativeloadex will
inform you about it. Those missing libraries can be found with locate and added to the libraries path
environment variable (SINK_LD_LIBRARY_PATH). Then micnativeloadex can upload them
automatically.
Check the list of dependencies below to see which one is missing and update the SINK_LD_LIBRARY_PATH environment variable to include the missing library.
Question 2.1.a. How would you resolve the missing dependencies by adding the path to those libraries
to the environment variable (SINK_LD_LIBRARY_PATH)?
(a) Run micnativeloadex with the source code compiled for the Intel Xeon Phi coprocessor and
Intel MKL (it should have been compiled with the -mmic and -mkl flags).
You should see the following error:
user@host% micnativeloadex hello.MIC
The remote process indicated that the following libraries could not be loaded:
libmkl_intel_lp64.so libmkl_intel_thread.so libmkl_core.so libiomp5.so
Error creating remote process, at least one library dependency is missing.
Please check the list of dependencies below to see which
one is missing and update the SINK_LD_LIBRARY_PATH
environment variable to include the missing library.
...
(b) When we compiled our source code with the -mkl flag, we explicitly told the compiler to link our binary with the Intel MKL libraries:
• libmkl_intel_lp64.so
• libmkl_intel_thread.so
• libmkl_core.so
• libiomp5.so
(c) To find the location of the corresponding libraries, we can use the following commands:
user@host% locate libmkl_core.so|grep mic
/opt/intel/composer_xe_2013.1.117/mkl/lib/mic/libmkl_core.so
user@host% locate libiomp5.so|grep mic
/opt/intel/composer_xe_2013.1.117/compiler/lib/mic/libiomp5.so
(d) Those locations should be added to the SINK_LD_LIBRARY_PATH environment variable.
7. Next we will monitor activity on the Intel Xeon Phi coprocessor with the micsmc tool.
Consider the following source code, which uses pthreads for parallelism (B.2.1.2).
Note: the threads load the CPU with a series of infinite loops; thus, the user will have to terminate the process manually.
fflush(0) on line 9 ensures that all printf() (line 8) function calls on the Intel Xeon Phi coprocessor are printed out by flushing the I/O buffers.
This code will spawn as many threads as there are logical cores in the system, which is found using sysconf(_SC_NPROCESSORS_ONLN). The code was written with the C99 standard in mind to keep local variables within local scopes, like the for loop at line 18; thus, the -std=c99 flag should be used during the compilation:
user@host% icc -pthread -std=c99 -o donothinger donothinger.c
user@host% ./donothinger
Spawning 2 threads that do nothing, press ^C to terminate.
Hello World from thread #0!
Hello World from thread #1!
...
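For reference, a minimal sketch of what such a pthread-based “do-nothing” code may look like is given below. This is only an illustration under the conventions described above (one thread per logical core, fflush(0) after printf(), infinite busy loops); it is not the exact Lab B.2.1.2 listing, so the line numbers mentioned in the text will differ.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

/* Each thread prints a greeting and then spins forever,
   keeping one logical core busy. */
void* Spin(void* arg) {
  const long id = (long)arg;
  printf("Hello World from thread #%ld!\n", id);
  fflush(0);                  /* flush I/O buffers so the output appears immediately */
  for (;;) ;                  /* infinite loop: terminate the process with ^C */
  return NULL;
}

int main() {
  /* One thread per logical core reported by the OS */
  const long n = sysconf(_SC_NPROCESSORS_ONLN);
  printf("Spawning %ld threads that do nothing, press ^C to terminate.\n", n);
  pthread_t* threads = (pthread_t*)malloc(n*sizeof(pthread_t));
  for (long i = 0; i < n; i++)
    pthread_create(&threads[i], NULL, Spin, (void*)i);
  for (long i = 0; i < n; i++)
    pthread_join(threads[i], NULL);   /* never returns; the threads spin forever */
  free(threads);
  return 0;
}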
Question 2.1.b. How do you compile the donothinger.c source code for native Intel Xeon Phi
coprocessor execution and run it with the micnativeloadex tool?
8. Use micsmc to monitor the activity of the Intel Xeon Phi coprocessors.
Initially, only combined statistics are available, with the average temperature and the aggregate utilization of all cores and memory. To see individual statistics per card, press the “Cards” button, highlighted in green in Figure A.1. The areas highlighted in red indicate additional buttons, which will change the view of the individual card statistics.
9. Run the donothinger.c code compiled for native execution on the Intel Xeon Phi coprocessor with
micnativeloadex and monitor the activity through the micsmc statistics output.
Question 2.1.c. What should you use to execute the code on the second Intel Xeon Phi coprocessor
with the micnativeloadex tool?
10. Advanced: Study the offloading code B.2.1.3. Compile it and execute it on the host, offloading to different Intel Xeon Phi coprocessors.
Answers
Answer 2.1.a. Write it all on one line, substituting your current version of the compiler:
export SINK_LD_LIBRARY_PATH=
$SINK_LD_LIBRARY_PATH:/opt/intel/composerxe/mkl/lib/mic/:
/opt/intel/composer_xe_2013.1.117/compiler/lib/mic/
Answer 2.1.b.
user@host% icc -mmic -pthread -std=c99 -o donothinger donothinger.c
user@host% micnativeloadex donothinger
Answer 2.1.c. If you have several cards available, use micnativeloadex with the -d 1 flag for native
execution on the 2nd coprocessor.
A.2.2 Explicit Offload: Sharing Arrays and Structures
The following practical exercises correspond to the material covered in Section 2.2.1 – Section 2.2.9 (pages 45 – 55).
Goal
The explicit offload execution model differs from the native execution model. The main function is executed on the host, and part of the code or specified function calls are offloaded to the Intel Xeon Phi coprocessor. It is a blocking operation; therefore, the host processor will wait until the offloaded code finishes.
Instructions
1. Go to the corresponding lab folder and study the source code of offload.cpp (see B.2.2.2), which simply counts the number of non-zero elements in an array randomly generated by the host processor.
user@host% cd ~/labs/2/2.2-explicit-offload/step-00
user@host% make
icpc -c -o "offload.o" "offload.cpp"
icpc -o runme offload.o
user@host% ./runme
There are 893 non-zero elements in the array.
You can use the make command (see B.2.2.1) to compile the object file .o (use the -c flag with the
compiler) and link it to the ./runme executable. You can modify the source code to experiment with
the program. To clean the folder (e.g., delete the executable and the object files produced by the Intel C++ Compiler), just type:
user@host% make clean
2. Our final goal is to offload the CountNonZero function to the Intel Xeon Phi coprocessor, but let us start with something simple.
Question 2.2.a. How would you add the printf function call with the message, “Hello World from
MIC!" and offload it to the Intel Xeon Phi coprocessor?
To print out the string from the offloaded printf() function call, proxy console I/O will be used.
Question 2.2.b. What can we do to guarantee that the text will be printed?
Corresponding changes were made to the files in the step-01 subfolder (see B.2.2.3).
3. To get some information about the offloading process, use the OFFLOAD_REPORT environment variable. Assign “1” for basic information, and “2” for a more detailed report.
Makefile offload.cpp offloadMIC.o offload.o runme
user@host% ./runme
There are 893 non-zero elements in the array.
Hello from MIC![Offload] [MIC 0] [File] offload.cpp
[Offload] [MIC 0] [Line] 24
[Offload] [MIC 0] [CPU Time] 0.522934 (seconds)
[Offload] [MIC 0] [CPU->MIC Data] 0 (bytes)
[Offload] [MIC 0] [MIC Time] 0.000174 (seconds)
Note: There are two object files (<name>.o and <name>MIC.o) created. The object file with the MIC.o ending contains objects for the Intel Xeon Phi coprocessor architecture.
4. Use the __MIC__ preprocessor macro within an #ifdef conditional directive to check whether the binary executed on the Intel Xeon Phi coprocessor or fell back and executed on the host processor. Print out a corresponding message.
Compare your solution with the source code from the subfolder step-02 (see B.2.2.4).
To check the program behavior in the fall-back scenario, we can explicitly ask the compiler to ignore all the offload pragmas with the -no-offload compiler flag:
user@host% ./runme
Offload has failed miserably!
There are 893 non-zero elements in the array.
5. Finally, modify the original code: define the size variable and the data array as local variables within
the main function, and put this segment of code within the #pragma offload directive section,
together with the CountNonZero function call.
Question 2.2.d. What should be specified at the declaration of the offloaded CountNonZero function?
6. Modify the code to initialize all variables on the host and offload only the CountNonZero function
call. Use the #pragma offload_attribute push/pop directive to select all the variables and
the function for offloading.
Compare your solution with the source code from the subfolder step-04 (see B.2.2.6).
Answers
Answer 2.2.b. fflush(0) will flush the Intel Xeon Phi coprocessor’s output buffer.
Answer 2.2.c. Use the following construction:
#ifdef __MIC__
printf("Offload is successful!");
#else
printf("Offload has failed miserably!");
#endif
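To see how the offload pragma, proxy console I/O and the __MIC__ check fit together, here is a minimal self-contained sketch (illustrative only, not one of the lab listings):

#include <cstdio>

int main() {
  // Offload the block below to coprocessor 0; proxy console I/O
  // forwards the printf output back to the host terminal.
  #pragma offload target(mic:0)
  {
#ifdef __MIC__
    printf("Offload is successful!\n");
#else
    printf("Offload has failed miserably!\n");
#endif
    fflush(0);   // guarantee that the buffered output is actually printed
  }
  return 0;
}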
The following practical exercises correspond to the material covered in Section 2.2.9 – Section 2.2.10 (pages 53 – 59).
Goal
Additional performance can be gained and memory can be saved through the reuse of transferred data and asynchronous function calls. The following practical exercises will cover those topics.
Instructions
1. In the previous lab we considered a simple case of the synchronous offload model, with variables from the local scope (on the stack) transferred from the host memory to the coprocessor and back after the calculation is done.
Next we will write source code in which a globally defined double variable sum will contain the result of summing an array that is dynamically allocated on the heap, initialized on the host, and passed to the summation function call on the coprocessor (see B.2.3.1 and B.2.3.2).
Pass those variables with the in/out/inout/nocopy clauses, calculate the sum, and print it out.
If you have any difficulties, you can compare your result with the step-01 subfolder’s source file offload.cpp (see B.2.3.3).
2. Using the previously written source code, modify the offload pragma so that the sum variable is not freed at the end of the pragma. Print out its value. Modify its value on the host by
incrementing it by one, and print out the value again. Use offload_transfer pragma to restore
the sum variable value from the coprocessor, and free the allocated memory. (step-02 or B.2.3.4)
Note: Do not forget to specify the number of the Intel Xeon Phi coprocessor card in the target
(mic:N) clause. If you do not, the variable sum might be copied from a different coprocessor, if you
have more than one.
3. Use the signal(p) and wait(p) clauses to implement asynchronous offload execution on the target
and synchronization at the offload_transfer pragma. (See step-03 subfolder or B.2.3.5)
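As a reference point for these steps, the clauses involved can be sketched as follows. This is an illustration only (the array length, the variable names and the use of a char signal tag are assumptions), not the step-02 or step-03 source code:

#include <cstdio>
#include <cstdlib>

__attribute__((target(mic))) double sum;   // global visible on host and coprocessor

int main() {
  const int n = 1000;
  double* data = (double*)malloc(n*sizeof(double));
  for (int i = 0; i < n; i++) data[i] = 1.0;

  char signalVar;   // the address of any variable can serve as the signal tag

  // Start the offload asynchronously: the host continues immediately.
  #pragma offload target(mic:0) in(data:length(n)) nocopy(sum) signal(&signalVar)
  {
    sum = 0.0;
    for (int i = 0; i < n; i++) sum += data[i];
  }

  // ... the host could do other work here ...

  // Wait for the offload to complete and pull sum back from coprocessor 0.
  #pragma offload_transfer target(mic:0) wait(&signalVar) out(sum)
  printf("sum = %f\n", sum);

  free(data);
  return 0;
}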
Goal
The matrix-vector multiplication example is studied in the context of data persistence. The matrix content is transferred beforehand and used for multiplication with multiple vectors asynchronously.
Instructions
1. Develop and run a code that performs serial matrix-vector multiplication on the host. Matrix-vector multiplication is defined as A·b = c, where A is an [m × n] matrix of double precision numbers, b is a vector of length n, and c is the resultant vector of length m. Do not worry about performance at this point; just design a serial code. The suggested C code of matrix-vector multiplication is shown in Listing A.1 (matrix A is stored as a one-dimensional array). Allocate all quantities on the stack.
1 A[:]=1.0/(double)n;
2 b[:]=1.0;
3 c[:]=0;
Or use the source code from the step-00 subfolder of the corresponding lab folder (see B.2.4.1 and B.2.4.2).
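If you prefer to start from scratch, a minimal serial version might look like the sketch below. It follows the conventions stated above (A stored row-major as a one-dimensional array, all quantities on the stack), but it is not the Listing A.1 text itself:

#include <cstdio>

int main() {
  const int m = 100, n = 100;
  double A[m*n], b[n], c[m];   // all quantities on the stack

  // Initialization as in Listing A.1 (Intel Cilk Plus array notation)
  A[:] = 1.0/(double)n;
  b[:] = 1.0;
  c[:] = 0.0;

  // Serial matrix-vector multiplication: c = A*b
  for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
      c[i] += A[i*n + j]*b[j];

  printf("c[0] = %f (expected 1.0)\n", c[0]);
  return 0;
}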
2. Modify this code so that matrix A and vector b are initialized on the host, but the calculation is offloaded
to the Intel Xeon Phi coprocessor. Vector c should be returned back to the host and verified against the
expected result. (step-01 or B.2.4.3).
3. Test the maximum value of m*n for which matrix A can be allocated on the stack.
4. Change data allocation: keep vector b on the stack, but matrix A on the heap, and test the maximum
problem size. (step-02 or B.2.4.4)
5. Improve the code so that it performs matrix-vector multiplication for multiple vectors b, but the same
matrix A. Take care to avoid unnecessary transfer of the data of matrix A.
Use the OFFLOAD_REPORT=2 environment variable to see the amount of data being transferred during
the offload:
user@host% export OFFLOAD_REPORT=2
6. Modify the code so that the data of matrix A is transferred to the coprocessor beforehand (as in the previous step), and the matrix-vector multiplication is executed in offload mode asynchronously, while the same calculations are performed on the host, to be compared later with the coprocessor’s results.
Take a look at our implementation in the step-05 subfolder (see B.2.4.7).
A.2.5 Virtual-Shared Memory Model: Sharing Complex Objects
The following practical exercises correspond to the material covered in Section 2.3 – Section 2.3.5 (pages 59 – 65).
Goal
The MYO (virtual-shared memory) model allows sharing complex objects (not only bit-wise copyable ones) between the host system and Intel Xeon Phi coprocessors. You will be asked to do so for dynamically allocated data, structures, classes, and objects created with the new operator.
Instructions
1. Start from a simple serial program which initializes two arrays of a predefined size, adds the corresponding elements, and saves the resulting sums in a third array res of the same size. Modify it so that the summation is offloaded to the Intel Xeon Phi coprocessor using the _Cilk_shared and _Cilk_offload keywords.
2. Compare your version of the modified code with the implementation in the folder step-01 or the
source code at B.2.5.3.
If you have several Intel Xeon Phi coprocessors available, instead of using _Cilk_offload, you can
use _Cilk_offload_to to specify which coprocessors should be used for the offload.
3. Dynamically allocated data can be synchronized before offloading and after in a similar manner. Take
a look at the example in the following location:
labs/2/2.5-sharing-complex-objects/step-03/dynamic-alloc.cpp, (see B.2.5.4)
where the pointer to the dynamically allocated data should point to memory allocated on both the host and the coprocessor in the synchronized (virtual-shared) memory area. Therefore, _Offload_shared_malloc should be used instead of the regular malloc. And since the pointer itself will be shared as well, it should be declared with the _Cilk_shared keyword.
4. For extra credit, you can try to figure out how to recode the summation using parallel processing in the
ComputeSum() function call, which will be offloaded to the Intel Xeon Phi coprocessor for execution.
You can use the OpenMP reduction mechanism.
Compare your code with one possible solution at step-03/dynamic-alloc.cpp or the source
code at B.2.5.5.
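The pattern described in steps 3 and 4 might be sketched as follows. The names and sizes are illustrative, and it is assumed that _Offload_shared_malloc and _Offload_shared_free are declared in offload.h; this is not the dynamic-alloc.cpp listing itself:

#include <cstdio>
#include <offload.h>   // _Offload_shared_malloc, _Offload_shared_free (assumed header)
#include <omp.h>

// Pointer in virtual-shared memory, visible to host and coprocessor
double* _Cilk_shared data;

// Offloadable function: sums the shared array in parallel with OpenMP
_Cilk_shared double ComputeSum(int n) {
  double sum = 0.0;
  #pragma omp parallel for reduction(+: sum)
  for (int i = 0; i < n; i++)
    sum += data[i];
  return sum;
}

int main() {
  const int n = 1000;
  // Allocate the array itself in the virtual-shared memory region
  data = (double*)_Offload_shared_malloc(n*sizeof(double));
  for (int i = 0; i < n; i++) data[i] = 1.0;

  // Offload the call; MYO synchronizes the shared data automatically
  double sum = _Cilk_offload ComputeSum(n);
  printf("sum = %f (expected %d)\n", sum, n);

  _Offload_shared_free(data);
  return 0;
}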
5. Structures can be virtually shared between the host and the coprocessor as well. Take a look at the next example at step-04/structures.cpp (B.2.5.6):
1 typedef struct {
2 int i;
3 char c[10];
4 } person;
Write code in which the structure above is virtually shared, and an offloaded function call SetPerson() changes the members of this structure, which are then printed out on the host.
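One possible shape of the solution is sketched below. It is illustrative only, assuming a plain shared global instance of the structure; the member values are arbitrary and the listing is not the lab solution:

#include <cstdio>
#include <cstring>

typedef struct {
  int i;
  char c[10];
} person;

// The structure instance lives in virtual-shared memory
_Cilk_shared person p;

// Offloadable function that modifies the shared structure on the coprocessor
_Cilk_shared void SetPerson(const int i) {
  p.i = i;
  strcpy(p.c, "Alice");   // the string literal is compiled into the target binary
}

int main() {
  _Cilk_offload SetPerson(1);               // runs on the coprocessor (or falls back)
  printf("person: i=%d c=%s\n", p.i, p.c);  // modified values are visible on the host
  return 0;
}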
6. A serial implementation of the class Person is presented in the source code step-06/classes.cpp (see B.2.5.8).
Use the _Cilk_shared keyword to share an object of this class in the virtual-shared memory. Make an offloaded call of the class method on the Intel Xeon Phi coprocessor. The function’s arguments should be in the virtual-shared memory as well; therefore, the method declaration will carry the _Cilk_shared keyword, too.
Compare your result with the one in the step-07 folder, source code B.2.5.9.
7. To use the standard new operator to create an object in the virtual-shared memory, we need to use the placement version of this operator from the <new> header (#include <new>).
We also need to allocate the corresponding amount of memory in the virtual-shared memory region first, and pass the pointer as the placement-new parameter.
Based on the serial example from step-08/new.cpp (or source code B.2.5.10), try to modify it
for virtual-shared memory model, dynamic object creation with the new operator, and method call
offloaded to the Intel Xeon Phi coprocessor.
To check your result you can use the source code from the step-09 folder, or B.2.5.11.
Goal
Many parallel algorithms may utilize several computing devices and show almost linear or even super-
linear performance gain, if communication overhead is not significant. This practical exercise will focus on
using multiple Intel Xeon Phi coprocessors within a single computational node.
Instructions
1. We will use several common techniques of offloading a function to multiple Intel Xeon Phi coprocessors with Intel Cilk Plus.
Compile and execute the simple C/C++ code shown below, which will print out the number of Intel Xeon Phi coprocessors available in the system (see B.2.6.1 and B.2.6.2):
1 #include <stdio.h>
2 int _Cilk_shared numDevices;
3 int main(int argc, char *argv[]) {
4 numDevices = _Offload_number_of_devices();
5 printf("Number of Target devices installed: %d\n\n" ,numDevices);}
This code can be found at the following location:
labs/2/2.6-multiple-coprocessors/step-00/multiple.cpp
We used the _Cilk_shared keyword here to make the _Offload_number_of_devices() function available at the linking stage. Try to delete the keyword _Cilk_shared from the source code and recompile it again. You will see an error message stating, “undefined reference to ’_Offload_number_of_devices’”. Intel Cilk Plus is a language extension, and the compiler utilizes it only if Intel Cilk Plus keywords are present in the source code.
This very simple program prints out the number of available Intel Xeon Phi coprocessors, which we will use in the second step to print out the current device number.
2. If you have two or more coprocessors installed in your system (the program run in the previous step returned two or more devices), then write an offloaded function call within a for loop. This function should print out the device number of the coprocessor currently running the offloaded function (use _Offload_get_device_number()). To make the offloaded function call, use the Intel Cilk Plus compiler extension with the corresponding offload keyword.
We can save some data transfer operations and pass only the one corresponding element of the response array to each individual card. This can be done by specifying a slice of the array (Intel Cilk Plus array notation) – the first element to be passed and the length of the slice, which is only one element in this particular case. And since the array will be shared between the host system and the Intel Xeon Phi coprocessors, the target attribute should be specified for the response pointer.
Note: if you get an error at this point, it might be due to zero-code elimination (one of the optimization techniques of the Intel C++ Compiler) within the #ifdef __MIC__ statement. The source code within this statement will not be visible to the host at compilation, and thus the compiler will assume that the response array was not manipulated. Therefore, this variable will be eliminated completely.
To fix this issue, we can add an #else branch to the #ifdef condition and assign a zero value instead if the code is executed on the host system. In this case, the variable and the manipulations with it will be visible to the compiler (the host part), and we will not get this error.
Compare your result with the code B.2.6.5 at:
labs/2/2.6-multiple-coprocessors/step-03/multiple.cpp
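One way the pieces above can fit together is sketched below. This is an illustration, not the step-03 listing; the response array name and the bound of eight coprocessors are taken from the surrounding text:

#include <stdio.h>
#include <offload.h>

// The array is accessed inside offloaded code, so it carries the target attribute.
__attribute__((target(mic))) int response[8];

int main() {
  const int numDevices = _Offload_number_of_devices();
  printf("Number of Target devices installed: %d\n", numDevices);

  for (int i = 0; i < numDevices; i++) {
    // Offload to device i; transfer back only the one element this card fills in.
    #pragma offload target(mic:i) out(response[i:1])
    {
#ifdef __MIC__
      response[i] = _Offload_get_device_number();
#else
      response[i] = 0;   // host fall-back keeps the variable visible to the compiler
#endif
    }
    printf("Device %d responded: %d\n", i, response[i]);
  }
  return 0;
}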
The following practical exercises correspond to the material covered in Section 2.4 (pages 66 – 73).
Goal
Asynchronous execution allows running computations in parallel on multiple Intel Xeon Phi coprocessors.
Instructions
1. Make the source code (B.2.7.1 and B.2.7.2) offload the whatIsMyNumber() function call to the Intel Xeon Phi coprocessor:
labs/2/2.7-asynchronous-offload/step-00/async.cpp
2. If the _Cilk_spawn keyword is used, offloaded function calls will be submitted asynchronously, without blocking (see the source code B.2.7.3 in the step-01 folder).
We are not using synchronization (_Cilk_sync) here, since it is not needed: it will happen implicitly at the end of the main() function call.
Another note is that, instead of the _Cilk_* keywords, we can use the header file <cilk/cilk.h>, which defines macros that provide names with simpler conventions (cilk_spawn, cilk_sync and cilk_for), described in the Intel C++ Compiler reference manual. It is your choice which ones to use.
3. Create an OpenMP parallel region with as many threads as there are Intel Xeon Phi coprocessors in the system. Therefore, each thread will run in parallel and will offload the code, which we need to specify with the offload pragma and in/out/inout/nocopy data manipulation clauses, as well as the target(mic:i) clause, to submit the offloads to different target devices.
Your result can be compared with B.2.7.5 from the step-03 subfolder.
Compare your result with B.2.7.7 from step-05/async.cpp.
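The OpenMP-based approach of step 3 might look roughly like the following sketch (illustrative only; whatIsMyNumber is the function named in step 1, and the data clauses are reduced to a single out scalar for brevity):

#include <stdio.h>
#include <omp.h>
#include <offload.h>

// Function compiled for both the host and the coprocessor
__attribute__((target(mic))) int whatIsMyNumber() {
#ifdef __MIC__
  return _Offload_get_device_number();
#else
  return -1;   // executed on the host (fall-back)
#endif
}

int main() {
  const int numDevices = _Offload_number_of_devices();
  if (numDevices == 0) { printf("No coprocessors found.\n"); return 0; }
  int response[8];

  // One host thread per coprocessor; each thread offloads to its own device.
  #pragma omp parallel for num_threads(numDevices)
  for (int i = 0; i < numDevices; i++) {
    int r;
    #pragma offload target(mic:i) out(r)
    r = whatIsMyNumber();
    response[i] = r;
  }

  for (int i = 0; i < numDevices; i++)
    printf("Coprocessor %d reported device number %d\n", i, response[i]);
  return 0;
}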
5. Intel Cilk Plus _Cilk_for and _Cilk_offload
Previously we used an OpenMP parallel for loop to offload the function calls to the Intel Xeon Phi coprocessors simultaneously. The same approach can be used with the Intel Cilk Plus extension of the Intel C++ Compiler.
Use the _Cilk_shared and _Cilk_offload keywords where needed. Memory allocation for shared arrays should be done with the _Offload_shared_malloc() function call. Instead of a regular for loop or an OpenMP parallel for, _Cilk_for can be used, which will start all the iterations simultaneously.
Use your previous code, or B.2.7.8 at step-06/async.cpp. Afterwards, compare your results with the provided solution.
An asynchronous offload with Intel Cilk Plus will be covered next. You can start with the code B.2.7.10
in step-08/async.cpp and add the _Cilk_offload keyword for the Respond() function
call. In addition, add the _Cilk_spawn keyword to make this offload asynchronous, and add the
_Cilk_sync keyword for synchronization between the threads.
Don’t forget to compare your result with B.2.7.11 in step-09/async.cpp.
Goal
The Message Passing Interface (MPI) makes it possible to organize heterogeneous parallelism between several Intel Xeon Phi coprocessors and the host within a single computational node, as well as between several computers and multiple coprocessors.
Preparation
1. To use Intel MPI on the Intel Xeon Phi coprocessors, the corresponding libraries and binary files need to be copied, or otherwise made available, to the target devices. You can copy the files onto the target devices, or you can NFS-mount the host Intel MPI folder to allow the coprocessors to access it.
To NFS-mount the Intel MPI folder, follow the instructions from Section 1.5.4 on page 35, or Lab A.1.2, Instruction step 3. Modify the corresponding files to use the NFS share on the appropriate number of Intel Xeon Phi coprocessors.
2. Later in this practical exercise we will assign MPI jobs to multiple Intel Xeon Phi coprocessors. To enable communication between those devices, we also need to enable peer-to-peer communication between them. Follow the instructions in Section 2.4.3 on page 76 to do so.
Instructions
1. Study the simple MPI “Hello World!” makefile B.2.8.1 and the source code B.2.8.2 at the following location:
labs/2/2.8-MPI/step-00/HelloMPI.c
To compile it, we will need to use the mpiicc or mpiicpc wrapper scripts installed with Intel MPI. We will also use the -mmic Intel C++ Compiler flag to compile the Intel Xeon Phi binary HelloMPI.MIC for native execution.
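For orientation, a minimal MPI “Hello World!” of the kind used in this lab might look like the sketch below (an illustration, not the exact HelloMPI.c listing; the output format mirrors the answers at the end of this section):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
  int rank, size, len;
  char name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // rank of this process
  MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes
  MPI_Get_processor_name(name, &len);     // hostname (host, mic0, mic1, ...)

  printf("Hello World from rank %d running on %s!\n", rank, name);
  if (rank == 0) printf("MPI World size = %d processes\n", size);

  MPI_Finalize();
  return 0;
}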
2. First, we will execute this “Hello World!” application on the host system with mpirun.
3. Copy the HelloMPI.MIC file to the home folder on the Intel Xeon Phi coprocessors. Since the coprocessors are IP-addressable devices and have SSH servers running on them, you can use scp to copy the files.
Question 2.8.a. What environment variable will enable Intel Xeon Phi coprocessor support in Intel MPI applications?
4. In a similar manner as we executed the code on the host, we can run it on an Intel Xeon Phi coprocessor by specifying the -host flag.
Question 2.8.b. What command should we use on the host to execute the Intel MPI code on one of the Intel Xeon Phi coprocessors?
5. Intel MPI processes can be assigned to multiple Intel Xeon Phi coprocessors, and even the host system
(heterogeneous model).
Question 2.8.c. What is the format of explicit multiple hosts assignment for an Intel MPI program?
6. For large clusters with many hosts and coprocessors, it might not be convenient to specify all the hostnames separated by the “:” symbol. Instead, we can put all the hostnames in a text file and use that file.
Question 2.8.d. What parameters can we use to specify the hostnames and the mapping for an Intel MPI program?
Answers
Answer 2.8.a.
user@host% export I_MPI_MIC=1
Answer 2.8.b.
user@host% mpirun -host mic0 -n 2 ~/HelloMPI.MIC
Hello World from rank 1 running on mic0!
Hello World from rank 0 running on mic0!
MPI World size = 2 processes
Answer 2.8.c.
user@host% mpirun -host host -n 2 ./HelloMPI : -host mic0 -n 2 \
% ~/HelloMPI.MIC : -host mic1 -n 2 ~/HelloMPI.MIC
Hello World from rank 0 running on host!
MPI World size = 6 processes
Hello World from rank 1 running on host!
Hello World from rank 4 running on mic1!
Hello World from rank 2 running on mic0!
Hello World from rank 3 running on mic0!
Hello World from rank 5 running on mic1!
Note: Spaces around the colon symbol “:” are very important. Without them, the colon would be treated as part of the adjacent argument rather than as a separator between the argument sets.
Answer 2.8.d. The hostnames and the mapping can be placed in a text file that is passed to mpirun with the -machinefile option.
Goal
In the following practical exercises, we will use the Intel C++ Compiler automatic vectorization feature.
Instructions
1. The automatic vectorization feature of the Intel C++ Compiler allows it to recognize operations that can be applied to multiple data elements simultaneously, and thus to speed up the computations by exploiting the vector registers of Intel Xeon processors or Intel Xeon Phi coprocessors.
Question 3.1.a. What compiler flag allows you to turn on the output of the Intel C++ Compiler automatic vectorization log?
2. Starting from the serial code that sums two arrays together (B.3.1.1), located at labs/3/3.1-vectorization/step-00/vectorization.cpp, see if the code will be automatically vectorized by the Intel C++ Compiler.
Question 3.1.b. How can we find out whether the Intel C++ Compiler automatically vectorized a specific loop successfully?
Additional instructions and pragmas will let the compiler auto-vectorize the code more effectively. In the code shown above, use the align attribute for the arrays’ alignment.
In the main summation loop, use explicit and implicit Intel Cilk Plus array notation. Check if the loop is still auto-vectorized.
Compare your result with the source code B.3.1.2 in the step-01 subfolder.
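As an illustration of what such a modification can look like (a sketch, not the step-01 listing; the array size is arbitrary):

#include <cstdio>

int main() {
  const int n = 1024;
  // Request 64-byte alignment, matching the coprocessor's 512-bit vector width
  __attribute__((aligned(64))) double A[n], B[n];

  A[:]   = 1.0;     // implicit array notation: whole-array section
  B[0:n] = 2.0;     // explicit array notation: start index and length

  A[:] += B[:];     // the summation loop expressed in array notation

  printf("A[0] = %f (expected 3.0)\n", A[0]);
  return 0;
}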
3. Next, use dynamic memory allocation for the arrays A[:] and B[:], and explicit Intel Cilk Plus array notation in the summation loop.
Question 3.1.c. Why do you think implicit Intel Cilk Plus array notation will raise compilation errors for dynamically allocated arrays?
4. If the align attribute is not specified, the array is placed at an arbitrary address in memory and can have a random offset. Using the source code from the previous steps, add the calculation of the offsets of the arrays A[:] and B[:] relative to some alignment constant, for instance, const int al=64;
Question 3.1.d. Can you express mathematically, and implement in C/C++, the offset calculation for some pointer address relative to some constant al – the size of the memory block?
For the dynamically allocated arrays A[:] and B[:] with standard malloc() function calls, write a
program to calculate the offset for some alignment constant al, and print out those values.
Compare your implementation with B.3.1.4 located in the step-03 subfolder.
5. In the previous step, we calculated the offset relative to some alignment constant al. If the align
attribute is not implemented in the compiler, we can align an array or a variable manually, by allocating
slightly more memory (sizeof(array)+al-1), and shifting the address by the offset to get the
alignment.
Implement this algorithm of dynamic memory allocation and shifting by the offset to get the alignment, and remember to free the initial pointer rather than the aligned one. Compare your result with B.3.1.5 from the subfolder step-04.
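A compact sketch of steps 4 and 5 combined is shown below; it is illustrative only and uses the offset formula given in Answer 3.1.d:

#include <cstdio>
#include <cstdlib>

int main() {
  const int n = 1024;
  const int al = 64;   // desired alignment in bytes

  // Over-allocate by al-1 bytes so that an aligned address always fits inside
  char* raw = (char*)malloc(n*sizeof(double) + al - 1);

  // Offset needed to reach the next al-byte boundary
  const size_t offset = (al - ((size_t)raw % al)) % al;
  printf("original address %p, offset to alignment: %zu bytes\n", (void*)raw, offset);

  double* A = (double*)(raw + offset);   // aligned pointer used for computation
  A[0:n] = 1.0;                          // explicit Cilk Plus array notation

  free(raw);   // free the original pointer, not the shifted (aligned) one
  return 0;
}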
6. To simplify the alignment of dynamically allocated memory, we can use the Intel C++ Compiler’s
intrinsic functions to allocate and free aligned blocks of memory:
Note: Memory that is allocated using _mm_malloc must be freed using _mm_free. Calling free
on memory allocated with _mm_malloc or calling _mm_free on memory allocated with malloc
will cause unpredictable behavior.
Use these intrinsic function calls to allocate memory blocks for two arrays, as in the previous steps.
Compare your results with B.3.1.6 from step-05.
As an additional exercise, combine the offset calculation with the intrinsic memory allocation to show that the allocated memory is indeed aligned.
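For reference, a sketch of the aligned allocation pattern with the intrinsics named in the note above (illustrative sizes):

#include <cstdio>
#include <immintrin.h>   // declares _mm_malloc / _mm_free

int main() {
  const int n = 1024;
  const int al = 64;

  // Allocate 64-byte aligned blocks for the two arrays
  double* A = (double*)_mm_malloc(n*sizeof(double), al);
  double* B = (double*)_mm_malloc(n*sizeof(double), al);

  A[0:n] = 1.0;
  B[0:n] = 2.0;
  A[0:n] += B[0:n];

  // Verify the alignment: both offsets should print as 0
  printf("A offset: %zu, B offset: %zu\n", (size_t)A % al, (size_t)B % al);

  _mm_free(A);   // memory from _mm_malloc must be released with _mm_free
  _mm_free(B);
  return 0;
}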
7. Intrinsics, as described in the reference manual:
Intrinsics are functions coded in assembly language that allow you to use C++ function calls and variables in place of assembly instructions.
Intrinsics are expanded inline, eliminating function call overhead. Providing the same benefit as using inline assembly, intrinsics improve code readability, assist instruction scheduling, and help reduce debugging.
Intrinsics provide access to instructions that cannot be generated using the standard constructs of the C and C++ languages.
Intrinsic function calls provide fine-tuned, direct access to the vector registers and similar hardware resources. However, this hardwires the code to a specific architecture and its feature set, which in general is a bad idea: the preferable approach is to let the compiler take care of those details.
For educational purposes, try to implement a code that sums two arrays by using intrinsic functions for the vector summation. Compile the code for native execution on the Intel Xeon Phi coprocessor.
Compare it with the B.3.1.7 source code from the step-06 subfolder.
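Purely as an illustration of the flavor of such code, here is a sketch that assumes the 512-bit _mm512_* double-precision intrinsics available when compiling natively for the coprocessor with -mmic; it is not the B.3.1.7 listing:

#include <cstdio>
#include <immintrin.h>   // 512-bit vector intrinsics and _mm_malloc

int main() {
  const int n = 1024;     // a multiple of 8 doubles per 512-bit register
  double* A = (double*)_mm_malloc(n*sizeof(double), 64);
  double* B = (double*)_mm_malloc(n*sizeof(double), 64);
  for (int i = 0; i < n; i++) { A[i] = 1.0; B[i] = 2.0; }

  // Process 8 double-precision elements per iteration
  for (int i = 0; i < n; i += 8) {
    __m512d va = _mm512_load_pd(&A[i]);   // aligned vector load
    __m512d vb = _mm512_load_pd(&B[i]);
    __m512d vc = _mm512_add_pd(va, vb);   // element-wise vector addition
    _mm512_store_pd(&A[i], vc);           // aligned vector store
  }

  printf("A[0] = %f (expected 3.0)\n", A[0]);
  _mm_free(A);
  _mm_free(B);
  return 0;
}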
8. Scalar function calls can be vectorized automatically by the compiler through code inlining at the compilation stage, if the function body is in the same source file as the loop that calls it.
Write a scalar summation of two integers, int my_simple_add(int x1, int x2){ return x1+x2; }, and call it within a for loop iterating over the elements of arrays A and B. Compile it and make sure that the for loop was vectorized (see B.3.1.8 from step-07).
9. The next step is to move the my_simple_add function to a separate file (worker.cpp) and leave the rest of the code in main.cpp. This code will not be vectorized, since at compilation time the Intel C++ Compiler creates the object files separately and will not inline the function calls (see B.3.1.9 and B.3.1.10 from step-08).
10. Elemental functions are a general language construct for expressing a data-parallel algorithm. If (as in the previous step) a function body is located in a separate file, or the function lives in an external library, calls to it can still be automatically vectorized by applying __attribute__((vector)) to the function.
Adding #pragma simd before the for loop assures the Intel C++ Compiler that the loop can be safely vectorized.
Apply those changes to the previous example and compile it to make sure that the compiler indeed auto-vectorized the corresponding regions of code in worker.cpp and main.cpp. Compare your source code with B.3.1.11 and B.3.1.12 from the step-09 subfolder.
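A sketch of the two-file arrangement being described is shown below; the file split mirrors the text, but the listing is illustrative rather than the exact lab code:

// worker.cpp - the elemental function, vectorizable across translation units
__attribute__((vector)) int my_simple_add(const int x1, const int x2) {
  return x1 + x2;
}

// main.cpp - the calling loop, annotated so that the calls are vectorized
#include <cstdio>
__attribute__((vector)) int my_simple_add(const int x1, const int x2);

int main() {
  const int n = 1024;
  int A[n], B[n];
  for (int i = 0; i < n; i++) { A[i] = i; B[i] = 2*i; }

  #pragma simd
  for (int i = 0; i < n; i++)
    A[i] = my_simple_add(A[i], B[i]);

  printf("A[10] = %d (expected 30)\n", A[10]);
  return 0;
}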
11. In many cases the program developer knows more about the data organization and access patterns than the compiler. Therefore, additional instructions and pragmas passed to the compiler will help it to optimize the code much better. But it is the developer’s responsibility to provide correct information.
In the next example we will show what might happen if the ignore-vector-dependency pragma #pragma ivdep is used where it should not be, i.e., where an actual vector dependency exists.
Take a look at the source code files B.3.1.13 and B.3.1.14. #pragma ivdep in worker.cpp tells the compiler that we guarantee that the references to the integers passed as arguments of the function my_simple_add and used within the for loop are independent. But in main.cpp we call this function for n − 1 elements with references pointing to B and B + 1 – the next element of the array.
The result is unpredictable and most likely wrong:
user@host% ./runme
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 4
6 6 5
7 7 6
while the whole array B should contain only zeros: the first element is zero, and each next element is assigned the value of the previous one.
12. The keyword restrict can be used in a similar manner as #pragma ivdep, but for individual pointers. For those pointers the developer guarantees mutual independence, and therefore for loops operating on those pointers can be safely vectorized.
During the compilation, the flag -restrict should be used to enable the keyword.
Using the previous example, modify the code to use the restrict keyword. Compare your results with B.3.1.16 and B.3.1.17 from the step-0b subfolder and Makefile B.3.1.15.
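A minimal sketch of restrict-qualified pointers in a vectorizable helper function (illustrative; compile with the -restrict flag as noted above):

#include <cstdio>
#include <cstdlib>

// The restrict qualifiers promise that the three arrays do not overlap,
// so the compiler is free to vectorize the loop.
void add_arrays(int* restrict c, const int* restrict a, const int* restrict b, const int n) {
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}

int main() {
  const int n = 1024;
  int *a = (int*)malloc(n*sizeof(int));
  int *b = (int*)malloc(n*sizeof(int));
  int *c = (int*)malloc(n*sizeof(int));
  for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2*i; }

  add_arrays(c, a, b, n);
  printf("c[10] = %d (expected 30)\n", c[10]);

  free(a); free(b); free(c);
  return 0;
}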
13. The _Cilk_for keyword (and the cilk_for macro from cilk/cilk.h) indicates that a for loop’s iterations can be executed independently in parallel; moreover, the loop will be considered a candidate for auto-vectorization. Therefore, double parallelism can be applied to a for loop marked with the Intel Cilk Plus keyword: data and thread parallelism.
Implement source code with an Intel Cilk Plus for loop over 256 elements adding the array elements B[:] to the array elements A[:]. Compare your results with B.3.1.18 from the step-0c subfolder.
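One way to write it is sketched below (illustrative, with the 256-element size stated in the exercise):

#include <cstdio>

int main() {
  const int n = 256;
  __attribute__((aligned(64))) double A[n], B[n];
  A[:] = 1.0;
  B[:] = 2.0;

  // Thread parallelism across iterations (_Cilk_for) combined with
  // data parallelism inside the loop body (auto-vectorization).
  _Cilk_for (int i = 0; i < n; i++)
    A[i] += B[i];

  printf("A[0] = %f (expected 3.0)\n", A[0]);
  return 0;
}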
Answers
Answer 3.1.a. -vec-report3 and -vec-report6 (for more verbose output) will show the automatic
vectorization log of the compiler.
Answer 3.1.b. With the -vec-report3 flag, the Intel C++ Compiler will include the following line
among the rest of its output:
vectorization.cpp(14): (col. 3) remark: LOOP WAS VECTORIZED.
Answer 3.1.c. The compiler does not know the size of the array in dynamically allocated memory at compilation time. Therefore, you will get the following error:
array section length must be specified for incomplete array types.
Answer 3.1.d.
1 const int offset = (al - ( (size_t) A % al)) % al;
A.3.2 Parallelism with OpenMP: Shared and Private Variables, Reduction
The following practical exercises correspond to the material covered in Section 3.2 (pages 94 – 122).
Goal
OpenMP (Open Multi-Processing) is one of the most widely used parallel programming models, implemented as an Application Programming Interface (API) for shared-memory parallelism.
Instructions
1. Write C++ source code for a simple OpenMP program which prints out the total number of OpenMP threads and, in each fork-join branch, prints out “Hello World from thread %d” with a printf() function call.
Question 3.2.a. What OpenMP function will return the total number of available threads?
Question 3.2.b. What environment variable can control the total number of OpenMP threads?
Question 3.2.c. Multiple OpenMP threads can be used to run a code region in parallel. What pragma
can we use to do that?
Question 3.2.d. What OpenMP function will return the current thread number?
2. Using OpenMP, write a program which runs an OpenMP parallel for loop with the total number of iterations equal to the maximum number of OpenMP threads available on the system, and prints out the iteration number and the current thread number.
Question 3.2.e. What pragma should be used for the OpenMP for loop?
Compare your result with the B.3.2.2 source code from the step-01 subfolder.
3. Variable visibility in an OpenMP program depends on the location where the variables are defined. Write an OpenMP program where a constant variable nt is initialized with the maximum number of OpenMP threads and is available to all of those threads. A private integer private_number should be independent in each parallel region. Using an OpenMP for loop, print out the current thread number and the value of the private variable after incrementing it by one, to make sure that it is private to each thread.
Compare your result with B.3.2.3 from the step-02 subfolder.
Question 3.2.f. The OpenMP parallel region pragma will create the available number of threads. What pragma will distribute the iterations of a for loop between those threads, without creating nested parallelism?
user@host$ ./runme
OpenMP with 2 threads
Hello World from thread 0 (private_number = 1)
Hello World from thread 1 (private_number = 1)
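A sketch consistent with the output shown above (illustrative variable names; not the B.3.2.3 listing):

#include <cstdio>
#include <omp.h>

int main() {
  const int nt = omp_get_max_threads();   // shared: visible to all threads
  printf("OpenMP with %d threads\n", nt);

  int private_number = 0;
  // The private clause gives every thread its own uninitialized copy,
  // so each thread assigns and increments it independently.
  #pragma omp parallel private(private_number)
  {
    private_number = 0;
    private_number++;
    printf("Hello World from thread %d (private_number = %d)\n",
           omp_get_thread_num(), private_number);
  }
  return 0;
}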
5. The number of threads in any OpenMP parallel region can also be controlled with the corresponding clause parameter. The OpenMP for loop’s scheduling mechanism can be specified through clauses as well.
Modify the source code from the previous step to specify the guided scheduling mechanism for the for loop, and also specify the number of threads needed for the OpenMP parallel region and for loop.
6. Recursive algorithms can be parallelized as well by using the OpenMP task pragma. Take a look at the corresponding example.
Question 3.2.g. Why did we use #pragma omp parallel and #pragma omp single for the initial recursive function call?
7. Control over the variables’ scope can be exercised with the OpenMP parallel clauses private/shared/firstprivate.
Create a program with three variables and control their behavior with the clauses mentioned above. Check what values will be assigned to them within the parallel region and how they will react to modifications of their values. Source code B.3.2.6 from step-05 shows a race condition for the shared variable varShared and the use of private and firstprivate variables.
8. Probably the most common mistake in implementing parallel algorithms is creating race conditions, when shared variables are accessed for reading and writing by different threads at the same time. Write code with an OpenMP parallel for loop over the first N = 1000 numbers, added together in a shared variable sum. The correct value should be sum_{i=0}^{N-1} i = N(N−1)/2. Note: the upper boundary is N − 1, since the for loop has the “i < N;” exit condition. Print out the resulting value of sum and the expected value of the summation. Compare your code with B.3.2.7 from step-06.
9. There are several ways to fix race conditions in parallel codes. One of them is applying #pragma omp critical to the region of code where the race condition occurs. Modify your code to fix the summation problem.
It should be noted that only one thread at a time will execute a region of code marked with the critical pragma. Therefore, the parallel code we created technically becomes serial, since only one thread executes it at a time.
Compare your result with B.3.2.8 from the step-07 subfolder.
10. Some scalar operations can be marked with the atomic pragma, which ensures that a specific memory location is updated atomically, preventing the possibility of multiple threads simultaneously reading and writing it.
To the previous code, add #pragma omp atomic before the summation inside the OpenMP for loop. Execute the compiled program and compare the result with the expected value.
B.3.2.9 shows how the atomic pragma can be implemented (step-08).
11. Another common approach is to have private variables collect partial sums, and then add them together to get the final answer.
Implement this idea by using two OpenMP task pragmas and shared variables sum1 and sum2, each accessed only by the corresponding task. Use the taskwait pragma to synchronize the tasks.
You can compare your implementation with B.3.2.10 from the step-09 subfolder.
12. For highly parallel systems it is better to write the parallel region in such a way that OpenMP will split the work between the available threads automatically. Use a similar approach as in the previous step – collect the temporary summation results in private variables. After the OpenMP for loop (but still within the OpenMP parallel region), collect the values from those private variables into the shared variable sum, and to avoid race conditions use #pragma omp critical (B.3.2.11 from the step-0a subfolder).
13. Reduction is a clause of the OpenMP for loop which indicates what operation will be applied to which reduction variable. OpenMP will automatically take care of avoiding race conditions and producing the correct result.
Implement the summation over the array by specifying the reduction clause for sum. Compare your result with the provided solution.
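For reference, the reduction form of the summation from step 13 can be sketched as follows (with N = 1000, as in step 8):

#include <cstdio>

int main() {
  const int N = 1000;
  long sum = 0;

  // The reduction clause gives every thread a private partial sum and
  // combines them safely at the end of the loop, avoiding the race condition.
  #pragma omp parallel for reduction(+: sum)
  for (int i = 0; i < N; i++)
    sum += i;

  printf("sum = %ld, expected = %ld\n", sum, (long)N*(N - 1)/2);
  return 0;
}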
Answers
Ex
Answer 3.2.a. int omp_get_max_threads(); from omp.h returns the maximum number of OpenMP threads available.
Answer 3.2.b. OMP_NUM_THREADS defines how many OpenMP threads will be created, if this number is not specified with the corresponding OpenMP function calls.
Answer 3.2.c. #pragma omp parallel runs the code following it in the maximum number of OpenMP threads available on the system.
Answer 3.2.d. int omp_get_thread_num(); from omp.h returns the number of the current thread.
Answer 3.2.e.
Answer 3.2.f.
#pragma omp parallel
#pragma omp for
for (int i=0; i<N; i++) {...}
Note: If #pragma omp parallel for is used instead, this would create nested parallelism, which is not desired in our case.
Answer 3.2.g. Without #pragma omp single, the maximum number of OpenMP threads will be created, and all of them will execute the initial recursive function call Recurse(0);, which is not the desired behavior.
A.3.3 Complex Algorithms with Intel Cilk Plus: Recursive Divide-and-Conquer
The following practical exercises correspond to the material covered in Section 3.2 (pages 94 – 122).

Goal
The instructions listed below will help you get familiar with Intel Cilk Plus, an extension to the C and C++ languages for task and data parallelism.
Instructions
1. In the following practical exercise you will be required to write code which uses the Intel Cilk Plus parallelism model. Print out the total number of Intel Cilk Plus workers available on the system. Use _Cilk_for iterating over the number of available workers and print out the current worker number. Since the workload is very light, all iterations may be done by only one worker (the for loop gets serialized). Therefore, we need to add extra workload to the for loop to see Intel Cilk Plus parallelism. Within the for loop write an additional while loop adding or multiplying some numbers into a private variable. At the end print out the current worker number doing those calculations, and the final result to avoid dead-code elimination.
Question 3.3.a. What environment variable controls the number of Intel Cilk Plus workers?
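A minimal sketch of such a program, assuming a compiler with Intel Cilk Plus support (not the book's reference listing):

#include <cilk/cilk.h>
#include <cilk/cilk_api.h>
#include <stdio.h>

int main() {
  const int nw = __cilkrts_get_nworkers();           /* total number of Cilk Plus workers */
  printf("Total workers: %d\n", nw);
  _Cilk_for (int i = 0; i < nw; i++) {
    /* Artificial workload so that the loop is not executed by a single worker */
    double s = 0.0;
    long j = 0;
    while (j < 100000000L) { s += (double)j; j++; }
    printf("iteration %d executed by worker %d, s=%g\n",
           i, __cilkrts_get_worker_number(), s);
  }
  return 0;
}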
2. In the previous step we used a _Cilk_for loop with the total number of iterations equal to the total number of Intel Cilk Plus workers. If the workload is significant for each worker, then each of them should have been involved in the calculations only once. Intel Cilk Plus operates on a "hungry workers" (work-stealing) workload distribution model, but the number of iterations given to each worker can be controlled with the #pragma cilk grainsize N pragma.
Modify the previous code to grant 4 iteration steps to each worker. Make sure that only a quarter of the total number of workers were involved in the calculations this time.
You can compare your result with B.3.3.2 from step-01.
3. Using _Cilk_spawn for asynchronous parallelism we can run recursive tasks in parallel.
Write a program which recursively calls some function Recurse(const int task); and prints out the number of the current worker doing some calculations within the function.
Compare your results with B.3.3.3 from step-02.
4. Quite often we need synchronization between parallel tasks. Intel Cilk Plus has the _Cilk_sync keyword for this.
Write a program where 1000 dynamically allocated consecutive integer elements of an array are summed up by two parallel (_Cilk_spawn) calls of a function Sum() over the two halves of the array; synchronize the tasks and print the result with a printf() statement.
Compare your result with B.3.3.4 from step-03.
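A minimal sketch of this exercise (illustrative; see B.3.3.4 for the book's version):

#include <cilk/cilk.h>
#include <stdio.h>
#include <stdlib.h>

/* Sums n consecutive integers starting at a[0] */
long Sum(const int *a, const int n) {
  long s = 0;
  for (int i = 0; i < n; i++) s += a[i];
  return s;
}

int main() {
  const int n = 1000;
  int *a = (int*)malloc(n * sizeof(int));
  for (int i = 0; i < n; i++) a[i] = i;          /* consecutive integers */
  long s1, s2;
  s1 = _Cilk_spawn Sum(a, n/2);                  /* first half, asynchronously */
  s2 = Sum(a + n/2, n - n/2);                    /* second half in the current strand */
  _Cilk_sync;                                    /* wait for the spawned call */
  printf("sum = %ld\n", s1 + s2);
  free(a);
  return 0;
}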
5. A more elegant way to organize parallelism is to avoid hardwiring the number of parallel tasks and let Intel Cilk Plus take care of this automatically.
To prevent race conditions we will need to use reducers in Intel Cilk Plus, defined as cilk::reducer_opadd<int> sum from <cilk/reducer_opadd.h>. Access to the reducer sum is done through the sum.set_value(N) and sum.get_value() calls.
Write a program which uses the reducer sum to store the summation result of adding the consecutive 20 integers iterated over with a _Cilk_for loop, and prints out the final result. Compare your code with B.3.3.5 from the step-04 subfolder.
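A minimal sketch using the reducer (illustrative):

#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>
#include <cstdio>

int main() {
  cilk::reducer_opadd<int> sum;          // addition reducer
  sum.set_value(0);
  _Cilk_for (int i = 1; i <= 20; i++)
    sum += i;                            // each worker updates its own view; no race
  std::printf("sum = %d\n", sum.get_value());
  return 0;
}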
6. The maximum number of Intel Cilk Plus workers can be controlled not only by an environment variable, but also programmatically from within the code.
Question 3.3.b. What function can we use to change the maximum number of Intel Cilk Plus workers?
Consider the following source code B.3.3.6 in the step-05 subfolder. Class Scratch has a public attribute, the array data, with many elements, which makes it quite expensive to construct objects of this class. With the current implementation the object scratch of class Scratch will be constructed by every Intel Cilk Plus worker, as illustrated by the sample output:
user@host% ./runme
Constructor called by worker 0
Constructor called by worker 1
i=5, worker=1, sum=5000000
i=0, worker=0, sum=0
i=6, worker=1, sum=6000000
i=1, worker=0, sum=1000000
i=7, worker=1, sum=7000000
i=2, worker=0, sum=2000000
i=8, worker=1, sum=8000000
i=3, worker=0, sum=3000000
i=9, worker=1, sum=9000000
i=4, worker=0, sum=4000000
Modify the source code to use Intel Cilk Plus holders. Compare your result with B.3.3.7 from the step-06 subfolder.
Answers
Answer 3.3.a. CILK_NWORKERS controls the number of Intel Cilk Plus workers.
The next practical exercise corresponds to the material covered in Section 3.3 (pages 122 – 138).

Goal
This exercise will show you practical aspects of heterogeneous execution of parallel applications in distributed memory with MPI and provide the basis for clustering Intel Xeon Phi coprocessors.
Instructions
1. Write a simple Intel MPI “Hello World!" program: find the rank, world size, and name of the host
running the code. Print out this information with only rank 0 printing out the total number of MPI
processes (world size).
The source code B.3.4.2 and corresponding Makefile B.3.4.1 can be found at the
labs/3/3.4-MPI/step-00/ folder.
Initialize MPI support for Intel Xeon Phi coprocessors.
Question 3.4.a. What command would you use to run compiled code manually on the host and two
Intel Xeon Phi coprocessors, with two MPI processes per host?
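A minimal sketch of such an MPI "Hello World!" program (compare with B.3.4.2 for the book's version):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, size, len;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Get_processor_name(name, &len);
  printf("Hello World from rank %d on %s\n", rank, name);
  if (rank == 0) printf("Total number of MPI processes: %d\n", size);
  MPI_Finalize();
  return 0;
}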
2. Communication between MPI processes can be organized with MPI_Send and MPI_Recv function calls.
Write a program based on the source code from the previous step, where all ranks except the master process send their rank and node name. This information should be collected by the master process and printed out.
To control proper communication between MPI processes we can specify from which rank we expect the message. But we can also specify the tag number, which can be used for ordering control, etc.
A message can be received by a receive operation only if it is addressed to the receiving process, and
if its source, tag, and communicator (comm) values match the source, tag, and comm values specified
by the receive operation. The receive operation may specify a wildcard value for source and/or tag,
indicating that any source and/or tag are acceptable. The wildcard value for source is source =
MPI_ANY_SOURCE. The wildcard value for tag is tag = MPI_ANY_TAG. There is no wildcard
value for comm. The scope of these wildcards is limited to the processes in the group of the specified
communicator.
Note the asymmetry between send and receive operations: a receive operation may accept messages from an arbitrary sender; on the other hand, a send operation must specify a unique receiver. This matches a "push" communication mechanism, where data transfer is effected by the sender (rather than a "pull" mechanism, where data transfer is effected by the receiver).
If you specified the source as MPI_ANY_SOURCE and controlled the message sequencing by the tag, then change your code to specify the rank number of the sender/receiver; and vice versa if you used the rank to control the order.
Compare your result with the source code B.3.4.3 from the step-01 subfolder.
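A minimal sketch of the gather-by-master pattern, receiving from any source and ordering by tag (illustrative, not listing B.3.4.3):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, size, len;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Get_processor_name(name, &len);
  if (rank != 0) {
    /* Every worker sends its host name to the master; the rank is carried in the tag */
    MPI_Send(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, rank, MPI_COMM_WORLD);
  } else {
    char buf[MPI_MAX_PROCESSOR_NAME];
    MPI_Status st;
    for (int i = 1; i < size; i++) {
      /* Accept any sender, but enforce ordering through the tag value */
      MPI_Recv(buf, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_ANY_SOURCE, i,
               MPI_COMM_WORLD, &st);
      printf("Rank %d (tag %d) runs on %s\n", st.MPI_SOURCE, i, buf);
    }
  }
  MPI_Finalize();
  return 0;
}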
3. Write a program with user-provided buffered communication: use MPI_Bsend and regular MPI receive calls.
Use even ranks as senders and odd ranks as receivers. For each sender/receiver pair use a unique tag number, for instance:
ranks – tag
0, 1 – 0
2, 3 – 1
4, 5 – 2
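A minimal sketch of the buffered-send exercise; MPI_Bsend requires a user buffer attached with MPI_Buffer_attach (illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  if (rank % 2 == 0 && rank + 1 < size) {
    /* Even ranks send; MPI_Bsend uses a user-provided buffer */
    int bufsize = MPI_BSEND_OVERHEAD + sizeof(int);
    char *buf = (char*)malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);
    int payload = rank;
    MPI_Bsend(&payload, 1, MPI_INT, rank + 1, rank / 2, MPI_COMM_WORLD);
    MPI_Buffer_detach(&buf, &bufsize);
    free(buf);
  } else if (rank % 2 == 1) {
    int payload;
    MPI_Recv(&payload, 1, MPI_INT, rank - 1, rank / 2, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    printf("Rank %d received %d with tag %d\n", rank, payload, rank / 2);
  }
  MPI_Finalize();
  return 0;
}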
Question 3.4.b. What should we use to check whether an MPI process is currently running on an Intel Xeon Phi coprocessor?
Answers
Answer 3.4.a.
user@host% export I_MPI_MIC=1
user@host% mpirun -host host -n 2 ./runme-mpi : \
> -host mic0 -n 2 ~/runme-mpi.MIC : \
> -host mic1 -n 2 ~/runme-mpi.MIC
Answer 3.4.b.
#ifdef __MIC__
  mic++;
#else
  host++;
#endif
A.4.1 Using Intel VTune Amplifier XE

Goal
Intel VTune Amplifier XE is a commercial application for software performance analysis of 32-bit and 64-bit x86-based machines, with advanced hardware-based sampling of Intel-manufactured CPUs and coprocessors.
Instructions
In this lab, we will walk through the workflow for application performance analysis in the Intel VTune Amplifier XE tool. VTune is an application performance profiling tool that relies on hardware event sampling. Some optimization examples in Chapter 4 demonstrate analysis in Intel VTune Amplifier XE, relying on the procedures described in this lab.
1. First, let us compile the applications that will be used for profiling in VTune. Navigate to directory labs/4/4.1-vtune, enter each subdirectory in it, and run make. As you could guess from the names of the source files, we will have one application that runs on the host system, one that performs offload to an Intel Xeon Phi coprocessor, and one that runs natively on a coprocessor.
2. Before we start VTune, some preparation may be needed as shown in Figure A.2.
In order to use VTune, environment variables must be set by sourcing the script file located at the
following path: /opt/intel/vtune_amplifier_xe/amplxe-vars.sh. Additionally, the
user of VTune must belong to user group vtune, the sampling driver must be loaded, and the NMI
watchdog must be disabled.
When the preparation work is done, launch VTune with the command amplxe-gui.
3. After you launch the VTune graphical user interface with command amplxe-gui, you will see a
window inviting you to create a new Project or open an existing one. Projects are containers for analysis
settings and results. Create and configure a new project named “Host-Workload” as shown in Figure A.4.
This project will contain the host-only application in step-00-xeon.
Figure A.4: Creating and configuring a new project in Intel VTune Amplifier XE.
4. Now we are ready to profile the application. Click the orange triangle in the toolbar and choose “Sandy
Bridge” / “General Exploration” in the sidebar menu as the Analysis Type as shown (see the top panel
of Figure A.5). As the name suggests, this is a general-purpose analysis. Click the large button “Start”
at the right-hand side of the VTune window. VTune will launch your application. You can monitor the
progress of the application by switching to the terminal window (see the bottom panel of Figure A.5). In
order to switch to the terminal window, you can press Alt+Tab or mouse-click the Terminal window at
the bottom of the Gnome desktop.
Figure A.5: Launching and monitoring the General Exploration analysis for a host application.
5. Once VTune processes the analysis results, we can view them. Let us navigate the VTune interface to
get accustomed to the information that it displays.
Figure A.6: Viewing the General Exploration analysis results for a host application.
Initially, you will see the “Summary” tab (top panel of Figure A.6). It contains cumulative metrics such
as the elapsed time, CPI rate and platform information. You can mouse over the question marks on this
page, and VTune will display help information on the respective metric in a pop-up window.
You can also see a breakdown of the sampled events by switching to the “Bottom-Up” or “Top-Down”
tab (shown in the bottom panel of Figure A.6). There, you see functions and modules and the number of
events measured in these modules. Event counts that appear sub-optimal are automatically highlighted
in pink. The “Bottom-Up” and “Top-Down” tabs are helpful in identifying which part or parts of a code
are responsible for certain negative metrics.
6. Information collected by VTune can be presented in different viewpoints. In order to switch to a different
viewpoint, click the word “change” in the header of the window (top panel of Figure A.7). The viewpoint
“Hotspots” is particularly helpful for optimizing applications. In this viewpoint, the primary metric
shown in the “Bottom-Up” and “Top-Down” tabs is the CPU time. This allows you to find the bottlenecks (hotspots) of the application (see the bottom panel of Figure A.7).
7. A very powerful feature of Intel VTune Amplifier XE is the ability to narrow down the hotspots to individual lines of code or even individual assembly instructions. In order to get to that view, double-click any function in the “Bottom-Up” view. The result is shown in Figure A.8. In order to see the assembly listing corresponding to the C/C++ code, click the “Assembly” button above the code listing.
Note that in order to enable source code viewing, the application must be compiled with the compiler argument -g. It is advisable to also use -O3 to avoid slowing down the calculation.
8. Now that we have learned how to analyze a host application, let us profile an application that uses an
Intel Xeon Phi coprocessor in the offload mode. In order to do that, create a new project by clicking the
button with the “+” symbol in the toolbar (top panel of Figure A.9). Then configure a new project called
“Offload-Workload” with the executable step-01-offload/offload-workload.
In fact, there is no difference between configuring a project for a host-only application and one with
offload. However, now is a good time to learn how to control the sampling interval.
Enter value “5” into the box “Automatically resume collection after”. With this setting, VTune will begin
sampling 5 seconds after the launch of your application. This allows you to exclude the initialization of
the application from the analysis. For offload applications, this is especially important, because while
the application and dependent libraries are being transferred to the coprocessor at the beginning of the
run, nothing worth profiling usually happens.
Enter value “12” into the box “Automatically stop collection after”. This setting makes VTune terminate
sampling 12 seconds after the launch of the application. This allows you to exclude finalization stages
from the analysis. You can also manually stop profiling any time using the buttons in the right-hand side
of the VTune window.
9. Now we will run analysis on the coprocessor. Click the orange triangle in the toolbar and choose analysis
type “Knights Corner Platform Analysis” / “General Exploration” (you can also choose “Lightweight
Hotspots” if you wish) as shown in the top panel of Figure A.10. This will run the analysis on the
coprocessor. If you wish to profile the host part of an offload application, then choose “Sandy Bridge...”
/ “General Exploration” (we will not do it in this case).
Click the button “Start Paused” to launch the application and start profiling. The button “Start” is greyed
out because in the previous step we chose to start sampling 5 seconds after the launch of the application.
Figure A.10: Profiling an application with offload to an Intel Xeon Phi coprocessor.
When the application terminates, or if you terminate sampling manually, you will see cumulative
sampling information (bottom panel of Figure A.10). The metrics here are different from the metrics
that you saw in the Sandy Bridge architecture analysis. However, you can still get information about the
metrics by placing the mouse cursor over the question mark symbols.
10. Finally, let us analyze an application compiled for native execution on an Intel Xeon Phi coprocessor.
The configuration of a VTune project in this case is slightly different from the configuration for a
host application. You must set micnativeloadex as the application to run. The name of the
executable, native-workload, must be placed in the line “Application Parameters”. You must also
specify the working directory so that the micnativeloadex tool can find the executable. If the
application uses any external libraries, such as the Intel OpenMP library used in this application, you
must also set the value of the environment variable SINK_LD_LIBRARY_PATH. This variable points
to the directories where micnativeloadex searches for libraries that must be transferred to the
coprocessor. See Section 2.1.3 for more information about using micnativeloadex to run native
coprocessor applications. Figure A.11 shows the project configuration window for a native application.
Figure A.11: Configuring a VTune project for a native application for Intel Xeon Phi coprocessors.
11. Create a new project called “Native-Workload” with the application native-workload from the
directory step-02-native, as shown in the previous step. Run the analysis of type “Knights Corner Platform Analysis” / “Lightweight Hotspots” for this application. Ensure that the run was successful by
switching to the terminal window and monitoring the output of the application.
12. At this point, you should be able to analyze applications that run on the host, use the offload model,
or run on the coprocessor. You can find hotspots and determine which application modules incur
negative performance metrics. We have not discussed how to use these metrics in order to improve the
application performance, because the rest of Chapter 4 is dedicated to this subject. However, when you
see references to profiling of an application using VTune in the main text, you will be able to reproduce
those steps.
13. Before concluding, we would like to show you some additional useful techniques in Intel VTune
Amplifier XE. When you view the “Bottom-Up” or the “Top-Down” tab, you can zoom in on a time
interval to study the events in it. In order to zoom in, click on the timeline and drag the mouse to the left
or to the right. Then choose “Zoom In on Selection” from the context menu that appears. This is shown
in Figure A.12.
14. It is possible to create a custom analysis with events that you want to study specifically. In order to do
that, use one of the buttons at the top of the sidebar menu. You will be given the opportunity to select
the events that you wish to collect for your custom analysis. Once this is done, your custom analysis
type will appear at the bottom of the sidebar menu. See Figure A.13.
15. You can start analysis from the command line. In order to see what command line VTune uses to launch
the profiling that you configured, click the button “Command Line...” at the bottom right corner of the
VTune window. This is shown in Figure A.14. When you have collected profiling information for an
application, you can then use amplxe-gui to load and view the results.
Figure A.14: Obtaining the shell command to start the configured analysis.
16. Optional: use the techniques discussed in Section 3.2.7 to optimize one of the workloads used in this lab. Use VTune to perform the General Exploration analysis. Compare the results. You can just look at both results, or use the “Compare Results” function available via a button in the tool bar (below the menu bar) that looks like two halves of a circle (see Figure A.15).
17. Optional: you can also study the tutorial included in Intel VTune Amplifier XE. This tutorial can be
found by pointing the web browser to the following local URL on a machine with installed Intel VTune
Amplifier XE:
file:///opt/intel/vtune_amplifier_xe/documentation/en/tutorials/
The official documentation for VTune can be found in [74].
Question 4.1.a. What is the difference between the configuration of a VTune project for a host-only application
and the configuration of a project for a native application for Intel Xeon Phi coprocessors?
Question 4.1.b. What happens if you analyze an application with offload to the coprocessor using the “Sandy
Bridge” / “General Exploration” analysis type?
Question 4.1.c. If you want to identify hotspots on the level of individual lines of source code, what compiler
argument must you use when you compile the application?
Answers
Answer 4.1.a. For host-only workloads, one must specify the executable file of the workload as the
application to analyze. For native workloads for coprocessors, one must specify micnativeloadex as the
application and specify the executable as the application parameter.
Answer 4.1.b. You will obtain the performance metrics of the host portion of the application.

Answer 4.1.c. Use -g to include symbols in the executable, and -O3 to avoid slowing down the application during the analysis.
A.4.2 Using Intel Trace Analyzer and Collector
The following practical exercises correspond to the material covered in Chapter 4 (pages 139 – 257).

Goal
Intel Trace Analyzer and Collector is a powerful tool for understanding MPI application behavior, quickly finding bottlenecks, and achieving high performance for parallel cluster applications.
The following instructions cover the basics of the Intel Trace Analyzer and Collector interface and functionality. Let us refer to the previous problem of calculating the number π, as presented in Section 4.7.3. You can experiment on the source code and see how it affects the results in the Intel Trace Analyzer and Collector. Use the Makefile B.4.2.1 and the source code B.4.2.2, which are located in the corresponding labs subfolder.
Preparation
The Intel Trace Analyzer and Collector should be installed on the host computer. Its current version (as
of this writing) is 8.1.0.024. If your system has a different version, use it instead of the one presented in the
instructions.
The Intel Trace Analyzer and Collector requires that libVT.so be located on the Intel Xeon Phi coprocessor for collecting trace data, and that the other environment variables be set up:
user@host% sudo scp /opt/intel/itac/8.1.0.024/mic/slib/libVT.so mic0:/lib64
user@host% . /opt/intel/itac/8.1.0.024/intel64/bin/itacvars.sh impi4
The parameter impi4 indicates which version of Intel MPI will be used with the Intel Trace Analyzer and Collector. We can save some space by using the NFS sharing described in Lab A.1.2. The whole /opt/intel folder can be mounted in order to include the required libraries from all Intel products installed on the system.
To use the Intel Trace Analyzer and Collector libraries and traces from Intel Xeon Phi coprocessors we need to run the following script:
Troubleshooting
In the following section you will see standard error messages from MPI runs, the reasons causing them, and the ways to fix those problems.
MPI run halts while running on several Intel Xeon Phi coprocessors. Caused by a missing connection between the Intel Xeon Phi coprocessors. Enable IP packet forwarding on the host: in /etc/sysctl.conf set net.ipv4.ip_forward = 1.
MPI run halts on any two devices. Communication between the devices is blocked. Turn off iptables and see if it helps: sudo service iptables stop.

Instructions
1. Using the files B.4.2.1 and B.4.2.2 from the labs/4/4.2-itac/step-00 folder, compile and execute the program:
user@host% cd ~/labs/4/4.2-itac/step-00
user@host% make
user@host% export I_MPI_MIC=1 # turn on MIC architecture support for Intel MPI
The first message is produced by the Intel Trace Analyzer and Collector on the Intel Xeon Phi coprocessors, and can be fixed by setting up the proper environment for the MIC architecture:
The second message is about the Intel MKL library location on the Intel Xeon Phi coprocessors. If you mounted the NFS share and set up the Intel C++ Compiler environment, then the following small trick will resolve the issue:
This will combine the library paths for the host architecture with the MIC architecture files.
2. The procedure above should result in the creation of the runme-mpi.single.stf trace log file. This file can be visualized with the Intel Trace Analyzer and Collector:
user@host% traceanalyzer runme-mpi.single.stf
This command should be executed in the terminal of the remote desktop client. Otherwise, X11 forwarding should be enabled to display the GUI of the Intel Trace Analyzer and Collector.
You will see the main interface of the Intel Trace Analyzer and Collector, similar to the one shown in Figure A.16.
Figure A.16: Initial view of Intel Trace Analyzer and Collector application.
This window provides general information about the traced MPI run with a summary of the times spent on MPI communication. The timeline provides more information about the application run. To open the timeline visualization for all MPI ranks, click the “Charts" menu of the internal window and choose “Event Timeline", as shown in Figure A.17.
Figure A.17: Choosing the “Event Timeline" chart from Intel Trace Analyzer and Collector.
The default color codes are the following: red and blue zones correspond to the MPI and Application groups; black and blue lines show point-to-point and collective operations. This can be modified by right-clicking on the chart and choosing “Event Timeline Settings...". Since collective operations use the same color as application blocks, it is recommended to change the color by clicking the “Collective Operations Color" button and also checking the “Use thick Lines for Collective Operations" checkbox.
Figure A.18: Event Timeline chart with highlighted broadcast messages (green).
3. Statistics about point-to-point MPI communication between ranks can be activated by choosing “Message Profile" from the Charts menu. The color coding corresponds to the latency of the MPI communication.
The default view of the Event Timeline chart shows individual timelines per rank. It might be helpful to use hostname grouping instead. Choosing “Process Aggregation" from the “Advanced" menu allows you to group processes by node. Select “All_Nodes" from the list and click the “Apply" button.
Figure A.19: Groups of MPI process tracks by host name, and communication statistics between them.
In the “Advanced" menu we can select “Function Aggregation" and change “Major Function Groups" to “All Functions", which will change the captions of the blocks and provide more information on which MPI functions were used within the MPI groups.
Zooming to a specific area can be done with mouse selection of the horizontal area, with the menu selection, or with keyboard shortcuts.
Filtering, tagging and explicit frame limit specification can be changed through the buttons at the bottom of the Intel Trace Analyzer and Collector window.
4. Advanced exercise: Intel Trace Analyzer and Collector can visualize function calls within the user application. To do this, the application should be compiled with the -tcollect flag and the corresponding path to the Intel Trace Analyzer and Collector libraries specified in the Makefile.
Figure A.20: Timeline chart of MPI processes with traces of function calls.
Change the Function Aggregation view to get the desired function name scheme.
Goal
In the following practical exercise, you will be asked to optimize the source code calculating the error function (aka the Gauss error function):

erf(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, dt     (A.1)

with the rational approximation [83]:

erf(x) \approx 1 - (a_1 t + a_2 t^2 + \cdots + a_5 t^5) e^{-x^2}, \qquad t = \frac{1}{1 + px}     (A.2)

p = 0.3275911,  a_1 = 0.254829592,  a_2 = -0.284496736,  a_3 = 1.421413741,  a_4 = -1.453152027,  a_5 = 1.061405429,

which accurately (error ≤ 1.5 × 10^{-7}) represents the non-negative part of the error function. And since the error function is odd,

erf(-x) = -erf(x),     (A.3)

the values for negative arguments can be obtained from the non-negative part.
Figure A.21: The erf(x) function A.1 is in red (solid) and the rational approximation A.2 is in blue (dashed).
Instructions
1. Consider the following unoptimized source code B.4.3.3 from
labs/4/4.3-serial-optimization/step-00/erf.cpp of the error function erf(x) and
corresponding main.cpp (B.4.3.2) and Makefile (B.4.3.1).
Compile and run the code.
Note: Write down the calculation times at each step of the optimization.
The myerff() function is implemented in a separate file (erf.cpp), and you will be asked to modify
it for optimization purposes. To use this scalar function within a vector environment, the #pragma
simd and __attribute__((vector)) constructions were used. During the compilation, the
-vec-report3 flag is used to display the vectorization report. Since we plan to run this code
on the Intel Xeon Phi coprocessors, we will use intrinsic memory allocation and free for fIn and
fOut input/output arrays. Use printf() to output: the minimum/maximum argument values and the corresponding function values; the number of seconds spent on the calculation of 2^28 points on the grid; and the relative error in comparison with the library implementation of erf().
Question 4.3.a. What optimization techniques can be applied to the source code to speed up the execution?
2. Common subexpression elimination. Using the original unoptimized code B.4.3.3 from the step-00 subfolder, modify the method of calculating the powers of the variable t. Try to implement two different approaches. For the first method, try using the pow() function from the math.h library (see Listing B.4.3.4). For the second method, try using multiplication by the previously computed power values (see the corresponding reference listing).
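A minimal sketch of the second approach, in which each power of t re-uses the previous one (illustrative; the coefficients are those of approximation (A.2)):

/* Polynomial part of the approximation (A.2). */
static float poly5(const float t) {
  const float a1 =  0.254829592f, a2 = -0.284496736f, a3 = 1.421413741f,
              a4 = -1.453152027f, a5 =  1.061405429f;
  /* Each power is obtained by one multiplication with the previous power,
     instead of repeated pow(t, k) calls. */
  const float t2 = t * t;
  const float t3 = t2 * t;
  const float t4 = t3 * t;
  const float t5 = t4 * t;
  return a1 * t + a2 * t2 + a3 * t3 + a4 * t4 + a5 * t5;
}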
3. Explicit specification of literal constant types. Since we are using float type variables as the input and output parameters of the function myerff(), all the literal constants should be specified as floats as well. This will avoid unnecessary implicit type casting. Floating-point constants have double type by default, and thus we need to use the “f" specifier to make them float type, e.g. 1.0f.
Question 4.3.b. What specifiers should be used to explicitly specify the constant “1" as long and unsigned long?
4. Precision control and optimized functions. In our code, most of the computational resources are spent on calculating the exp(-x*x) multiplier of the resultant value. The double exp(double); function call typecasts our (float) -x*x into double, and the extra precision is discarded again when the result is converted back to the float type. This can be avoided by using the float expf(float); function call instead.
Another approach is to use binary mathematical functions, which, due to the system architecture, achieve better performance. The following mathematical property can be used:

e^a = 2^{a \log_2 e}     (A.4)

In Equation A.4, log_2 e = 1.442695040 is a constant, which can be specified before the result calculation line; in this case it will be inlined by the compiler into the expression. The exp2f() function call can be used to calculate powers of 2.
Implement the method described above and compare your results with the source code B.4.3.7 from the step-04 subfolder.
5. Branch elimination. In general, branching code is not good for efficient auto-vectorization, and it significantly slows down execution when used within a function which is called multiple times. Therefore, branching should be avoided as much as possible.
In our previous implementations we used an explicit comparison check twice. The argument is supposed to be a non-negative number; this branching corresponds to the oddness check of the function (see Equation A.3).
Using bit-wise operations we can speed up the execution. On the downside, we make the code architecture dependent: it will work on 64-bit systems with little-endian storage (Windows, Linux, Intel-based Mac OS, etc.), but other systems may have big-endian storage (e.g. PowerPC) and thus will calculate the wrong result. Use this technique with caution.
For the Intel Xeon and Intel Xeon Phi coprocessor architectures, float numbers have the upper bit corresponding to the sign, 8 bits representing the exponent, and 23 bits containing the fractional part of the number. We will use a bit-wise AND operation to change only the upper bit. This gives the absolute value of the argument.
To get the sign of the argument, a bit-wise AND with the 0x80000000 mask constant should be used. Afterwards it should be applied to the result value with a bit-wise OR operation.
Implement your code and compare it with B.4.3.8 from the step-05 subfolder.
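A hedged sketch of the bit-wise idea, expressed with portable memcpy type punning rather than the book's exact code; it assumes IEEE 754 single-precision floats:

#include <stdint.h>
#include <string.h>

static float split_abs_sign(const float x, uint32_t *signBit) {
  uint32_t bits;
  memcpy(&bits, &x, sizeof(bits));   /* reinterpret the float as 32 bits */
  *signBit = bits & 0x80000000u;     /* isolate the sign bit             */
  bits &= 0x7FFFFFFFu;               /* clear the sign bit: |x|          */
  float ax;
  memcpy(&ax, &bits, sizeof(ax));
  return ax;
}
/* After r = approximation(|x|) is computed, OR-ing the saved sign bit back into
   the bit pattern of r implements the odd symmetry erf(-x) = -erf(x). */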
6. Explicitly specify which vector instructions should be used. This may speed up the resulting code. Since we plan to run the compiled binary code on an Intel Xeon processor, which supports the AVX instruction set, the -xAVX compiler flag should be used. Changing the value of the -fp-model flag affects the final performance as well.
Advanced Exercise
Modify the main.cpp source code file to implement parallelism and vectorization using OpenMP. Compare the performance with the previous steps.
[Bar chart: calculation times at each optimization step – unoptimized, common subexpression elimination, constant types, optimized functions, branch elimination, -xAVX flag, parallelized and vectorized – decreasing from about 2.8 s to 0.03–0.07 s.]
Answers
Answer 4.3.a.
• Common subexpression elimination
• Explicit specification of literal constant types
• Precision control
• Eliminating branches
• Compiler switches
• Using parallelism and vectorization
Answer 4.3.b.
long lvar = 1L;
unsigned long luvar = 1UL;
The next practical exercise corresponds to the material covered in Section 4.3.1 (pages 153 – 157).

Goal
Optimize the code for automatic vectorization. Apply the technique to the problem of calculating electric potential on a grid.

Instructions
1. Compile and execute source code B.4.4.2 and corresponding Makefile B.4.4.1 from
labs/4/4.4-vectorization-data-structure/step-00
This code calculates the values of the potential on a grid, produced by charged points. Those points are described by struct Charge structures, which contain the corresponding coordinates and charge values. The coordinates and the charge value have float types; therefore, each structure takes 4×4 = 16 bytes of memory.
Modify this source code to apply unit-stride data access in order to speed up the program by utilizing vector instructions more efficiently.
Compare your result with the source code B.4.4.3 from step-01 subfolder.
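A hedged sketch of the layout change (the names are illustrative, not the lab's actual code): separate coordinate and charge arrays (structure of arrays) give the inner loop unit-stride access.

#include <cmath>

struct ChargeArrays {                  // structure of arrays instead of an array of struct Charge
  float *x, *y, *z, *q;
};

float PotentialAt(const ChargeArrays &c, const int m,
                  const float px, const float py, const float pz) {
  float phi = 0.0f;
  for (int i = 0; i < m; i++) {        // every operand stream is now contiguous (unit stride)
    const float dx = c.x[i] - px, dy = c.y[i] - py, dz = c.z[i] - pz;
    phi += c.q[i] / std::sqrt(dx*dx + dy*dy + dz*dz);
  }
  return phi;
}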
2. Additional performance can be achieved by using special compiler flags to control the precision of floating-point operations.
For instance, using the -fimf-domain-exclusion flag we can exclude some special, computationally expensive classes of floating-point values:
The integer value of the flag is calculated as a bitwise OR of the corresponding bit flags of the excluded classes. Exclusion means that the code generated by the compiler does not have to handle that category of values, thus providing additional speedup.
-fimf-accuracy-bits defines the relative error for the results of math library functions.
In this step, try using these compiler arguments and monitor their effect on performance. Compare your result with B.4.4.5 and B.4.4.4 from step-02.
A.4.5 Vector Optimization: Assisting the Compiler
The next practical exercise corresponds to the material covered in Section 4.3.2 (pages 157 – 161).

Goal
Consider the problem of multiplying a sparse matrix M by a vector A: M × A = B. To save memory space and calculation time, the original matrix M can be stored as a packed array of contiguous chunks of non-zero elements, plus additional arrays with information about those non-zero elements.
An example of a small 16 × 16 sparse random matrix, a vector, and the multiplication result is presented below:
Preparation
Note: For the application studied in this lab, hyper-threading is counter-productive. Set the number of
OpenMP threads as follows (assuming 16 physical cores on the host and 60 physical cores on the coprocessor):
Instructions
1. Look at the following list of files located in the labs/4/4.5-vectorization-compiler-hints/step-00 folder:
Makefile (see B.4.5.1) compiles our source code for the host and for Intel Xeon Phi coprocessors, with activated flags for OpenMP and the auto-vectorization report.
main.cc (see B.4.5.2) demonstrates a simple example with a small 16 × 16 sparse matrix multiplied by a vector, and then performs benchmark testing for bigger 20000×20000 matrices with rows of 100 non-zero elements on average; the initialization and testing functions are also implemented here.
worker.h (see B.4.5.3) contains the declaration of the PackedSparseMatrix class and a detailed description of the variables used in this class.
worker.cc (see B.4.5.4) has the class implementation. The constructor of the class creates the packed version of the sparse matrix provided to it. The MultiplyByVector method implements the sparse matrix-vector multiplication.
Compile and execute the code for the host system and for Intel Xeon Phi coprocessors.
2. #pragma loop_count can be used to help the Intel C++ Compiler optimize the executable for the expected number of loop iterations by choosing the optimal execution path. It only leads to an increase in performance when the actual loop count in the program is in agreement with the prediction value.
Question 4.5.a. Where do you think #pragma loop_count should be used and with what parameter value?
Answers
Answer 4.5.a.
#pragma loop_count avg(100)
This pragma can be added before the summation loop within the MultiplyByVector method implementation of the PackedSparseMatrix class, since we know a priori that this loop will have 100 iterations on average.
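A hedged sketch of where the hint could go (illustrative names, not the actual PackedSparseMatrix layout):

// Multiply one packed sparse row by a vector, with a loop count hint.
static float RowTimesVector(const float *packedData, const int *column,
                            const int rowStart, const int rowEnd, const float *x) {
  float sum = 0.0f;
#pragma loop_count avg(100)   // rows contain ~100 non-zero elements on average
  for (int j = rowStart; j < rowEnd; j++)
    sum += packedData[j] * x[column[j]];
  return sum;
}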
Goal
In the following practical exercise we will examine conditional branching within the innermost auto-vectorized loop. The intrinsic instructions of the MIC architecture (the Intel Xeon Phi coprocessor architecture) include masked versions of most vector instructions, which apply the specified operation only if the mask bit for the corresponding element is set to a non-zero value. With #pragma simd we can force the compiler to use those masked instructions, if there are no other issues preventing auto-vectorization (for instance, a vector dependency).
Preparation
Note: For the application studied in this lab, hyper-threading is counter-productive. Set the number of OpenMP threads as follows (assuming 16 physical cores on the host and 60 physical cores on the coprocessor):

Instructions
Compile and execute this program. Compare your result with the one provided in Section 4.3.3.
2. Modify the source code above to explicitly vectorize the internal loop with #pragma simd.
Check the performance change of those modifications.
Compare your source code with B.4.6.4 from step-01.
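A minimal sketch of a conditional inside an explicitly vectorized loop; with #pragma simd the compiler can turn the branch into a vector mask (illustrative, not the lab's code):

void ClampAccumulate(const float *in, float *out, const int n, const float threshold) {
#pragma simd
  for (int i = 0; i < n; i++) {
    if (in[i] > threshold)        // becomes a vector mask rather than a scalar branch
      out[i] += in[i];
  }
}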
Goal
Make a parallel implementation and optimize a histogram creation algorithm for caching, without false sharing and cache line stealing.

Instructions
1. The files for this practical exercise (Makefile B.4.7.1, main.cc B.4.7.2, and worker.cc B.4.7.3) are located in the labs/4/4.7-optimize-shared-mutexes/step-00 folder. main.cc initializes a random array of ages, which will be used for histogram creation (the Histogram() function call from worker.cc). The calculated histogram occupancy is compared with the result of the serial implementation for correctness, and performance statistics are printed out.
The function Histogram() from the worker.cc source code file is the serial, unoptimized version of the histogram calculation function. Your task is to optimize this code. From our previous practical exercises we know that division is slower than multiplication. Thus, you need to modify the code to use multiplication wherever possible: pre-compute the reciprocal. Also use the strip-mining technique to split the loop into two nested loops. This will allow the inner loop to be vectorized; use a #pragma vector hint on it.
Try to implement an additional loop which will take care of the tail iterations, if the total number of elements is not divisible by the strip length.
Compare your result with the source code B.4.7.4 from the step-01 subfolder. Make sure that the inner loop is vectorized (check the vectorization report).
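A hedged sketch of the optimization (illustrative names and strip length; it assumes the age array is 64-byte aligned and that every computed bin index is valid):

void Histogram(const float *age, int *hist, const int n, const float groupWidth) {
  const int vecLen = 16;                            // strip length (assumed)
  const float recGroupWidth = 1.0f / groupWidth;    // pre-computed reciprocal
  int index[vecLen];
  for (int ii = 0; ii < n - n % vecLen; ii += vecLen) {
#pragma vector aligned
    for (int i = 0; i < vecLen; i++)                // vectorizable inner loop
      index[i] = (int)(age[ii + i] * recGroupWidth);
    for (int i = 0; i < vecLen; i++)                // scatter into the histogram bins
      hist[index[i]]++;
  }
  for (int i = n - n % vecLen; i < n; i++)          // tail loop for the remainder
    hist[(int)(age[i] * recGroupWidth)]++;
}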
2. In the previous step we applied data parallelism – vectorization of the code. The next step is to use thread parallelism, which will be implemented with OpenMP.
Apply an OpenMP pragma to the external for loop so that it runs in parallel. To avoid a race condition, use the #pragma omp atomic mutex to protect the modification of the hist[] array. Although this operation is highly inefficient and is presented here only for educational purposes, this approach can still be used for lightly-loaded operations.
The parallel version with the OpenMP atomic mutex pragma can be found in the step-02 subfolder, source code B.4.7.5.
3. Optimize the previous parallel code by using private variables to hold a copy of the histogram in each thread. To do so, use #pragma omp parallel and #pragma omp for separately. Use an aligned array for storing the temporary histogram values. You should collect the total number of elements from each corresponding cell of all the private histogram arrays. Use #pragma omp atomic to avoid race conditions.
Compare your result with B.4.7.6 from the step-03 subfolder.
4. False sharing and cache line stealing. Using the previous example, create a shared two-dimensional array which will keep the values of the histogram entries (the first dimension) for each individual thread (the second dimension).
If your code is not automatically vectorized, use a private variable to collect the histogram indices and use a second loop to accumulate the counts of those values into the two-dimensional array.
Compile and run your code. Compare your source code with B.4.7.7 from the step-04 subfolder.
Since there are only 5 histogram groups with counters of integer type, we will notice performance issues due to false sharing.
5. To prevent false cache line sharing we can increase the distance between the accessed elements by increasing the number of elements in the array.
Rewrite the code to calculate the new size of the array with the provided paddingBytes variable. The new implementation of the array should have this new, larger size.
Compare your source code with B.4.7.8 from step-05.
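A hedged sketch of the padding idea (illustrative names): each thread's histogram row is padded to a whole number of 64-byte cache lines so that threads never write to the same line.

#include <omp.h>

void HistogramPadded(const int *binIndex, const int n, int *hist, const int nBins) {
  const int paddingBytes = 64;                               // one cache line
  const int rowSize = ((nBins * (int)sizeof(int) + paddingBytes - 1)
                       / paddingBytes) * paddingBytes / (int)sizeof(int);
  const int nThreads = omp_get_max_threads();
  int *threadHist = new int[(size_t)nThreads * rowSize]();   // zero-initialized
#pragma omp parallel
  {
    int *myHist = &threadHist[omp_get_thread_num() * rowSize];
#pragma omp for
    for (int i = 0; i < n; i++)
      myHist[binIndex[i]]++;                                 // no sharing between threads
  }
  for (int t = 0; t < nThreads; t++)                         // combine per-thread rows
    for (int b = 0; b < nBins; b++)
      hist[b] += threadHist[t * rowSize + b];
  delete[] threadHist;
}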
A.4.8 Shared-Memory Optimization: Load Balancing
The next practical exercise corresponds to the material covered in Section 4.4.3 (pages 175 – 179).

Goal
In the following practical exercise you will be asked to write a solver for the linear system M x = b using the Jacobi method.
To show load imbalance and methods of preventing it, we will not use a single accuracy (threshold) number, but rather a vector of length nBVectors of accuracy values with a large spread, and those values will be assigned to individual OpenMP threads in parallel. The solution vectors x and the right-hand-side vectors b are stored in corresponding arrays. An example of the matrix M is shown below:
90.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
10.0 268.0 12.0 13.0 14.0 15.0 16.0 17.0 18.0 19.0
20.0 21.0 446.0 23.0 24.0 25.0 26.0 27.0 28.0 29.0
30.0 31.0 32.0 624.0 34.0 35.0 36.0 37.0 38.0 39.0
40.0 41.0 42.0 43.0 802.0 45.0 46.0 47.0 48.0 49.0
50.0 51.0 52.0 53.0 54.0 980.0 56.0 57.0 58.0 59.0
60.0 61.0 62.0 63.0 64.0 65.0 1158.0 67.0 68.0 69.0
70.0 71.0 72.0 73.0 74.0 75.0 76.0 1336.0 78.0 79.0
80.0 81.0 82.0 83.0 84.0 85.0 86.0 87.0 1514.0 89.0
90.0 91.0 92.0 93.0 94.0 95.0 96.0 97.0 98.0 1692.0
Vector b is initialized uniformly with random numbers by using Intel MKL streams.
Instructions
1. Compile and execute source code files from labs/4/4.8-optimize-scheduling/step-00:
Makefile B.4.8.1, main.cc B.4.8.2, and worker.cc B.4.8.3.
In the main.cc file, the for loop is iterated nTrials times, calling the IterativeSolver() function from worker.cc. The average number of iterations is returned by this function and printed out as part of the execution statistics, as well as the execution time.
#pragma omp parallel for has a reduction clause and a schedule clause. By changing the parameter of the schedule clause we can optimize the workload distribution between the parallel threads, and thus avoid load imbalance.
2. Modify the source code above to use Intel Cilk Plus as a parallelism engine.
Compare your result with the source code B.4.8.4 from step-01 subfolder.
3. To compare the performance difference between scheduling parameters, write code which will call the IterativeSolver() function within different OpenMP
#pragma omp parallel for schedule(...) pragma environments. The second integer parameter passed to the schedule clause indicates the grain size.
We suggest you try the following scheduling modes:
• Intel Cilk Plus
• without the schedule clause
• schedule(static, 1)
• schedule(static, 4)
• schedule(static, 32)
• schedule(static, 256)
• schedule(dynamic, 1)
• schedule(dynamic, 4)
• schedule(dynamic, 32)
• schedule(dynamic, 256)
• schedule(guided, 1)
• schedule(guided, 4)
• schedule(guided, 32)
• schedule(guided, 256)
Compare your results with the source code B.4.8.5 from step-02 subfolder.
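A minimal self-contained sketch for experimenting with the schedule clause; FakeSolve() merely stands in for IterativeSolver() and reproduces the uneven amount of work per iteration:

#include <omp.h>
#include <cmath>
#include <cstdio>

// The work depends on the requested accuracy, which creates the load imbalance.
static int FakeSolve(const double accuracy) {
  int iters = 0;
  for (double err = 1.0; err > accuracy; err *= 0.999) iters++;
  return iters;
}

int main() {
  const int nBVectors = 64;
  double accuracy[nBVectors];
  for (int v = 0; v < nBVectors; v++)
    accuracy[v] = std::pow(10.0, -1.0 - 10.0 * v / nBVectors);  // widely spread values
  long totalIters = 0;
  const double t0 = omp_get_wtime();
  // Swap in schedule(static,1), schedule(guided,32), ... to compare the modes
#pragma omp parallel for schedule(dynamic, 4) reduction(+: totalIters)
  for (int v = 0; v < nBVectors; v++)
    totalIters += FakeSolve(accuracy[v]);
  printf("iterations: %ld, time: %.3f s\n", totalIters, omp_get_wtime() - t0);
  return 0;
}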
4. Use Intel VTune Amplifier XE to visualize concurrency between threads for different scheduling modes.
(a) Use “Concurrency" from “Algorithm Analysis" for the program running on the host Intel Xeon processor.
(b) For Intel Xeon Phi coprocessors you will have to use “Lightweight Hotspots" from “Knights Corner Platform Analysis", and filter out the IterativeSolver function calls. See the illustration below:
Can you explain the waiting areas at different scheduling modes and grain sizes?
Scalability
The next practical exercise corresponds to the material covered in Section 4.4.4 and Section 4.4.5.

Goal
Here, we demonstrate several methods for optimizing code with insufficient parallelism, in which the parallelized loop has too few iterations to occupy all available threads.

Instructions
1. Source files B.4.9.1, B.4.9.2, and B.4.9.3 can be found at the following location:
user@host% cd ~/labs/4/4.9-insufficient-parallelism
user@host% cd step-00
What was the bandwidth (in GB/s) on the host? What was it on the coprocessor? Can you explain the
poor performance of the coprocessor in this case?
3. Diagnose performance problems.
Use Intel VTune Amplifier XE to run the analysis of type “Concurrency” on the host version of the
application. What is the analysis telling you? Refer to Section 4.4.4 for additional information.
4. Optimization attempt: inner loop optimization.
You should now be in directory step-00. Modify the file worker.cc so that instead of the outer
loop with few iterations, parallelization is applied to the inner loop with multiple iterations. Do you
expect to get an improvement?
For solution, go to the next step (B.4.9.4 at step-01).
user@host% cd ../step-01
user@host% emacs worker.cc
user@host% make
user@host% ./runme
% record new results on host
user@host% micnativeloadex runmeMIC
% record the results on coprocessor
Use Intel VTune Amplifier XE to run the analysis of type “Concurrency” on the host version of the application. What is the analysis telling you? Refer to Section 4.4.4 for additional information.
Explain the difference between the effect of this optimization on the performance of the host and of the coprocessor.
You should now be in directory step-01. Now we will attempt to increase the iteration space by using the loop collapse technique; among other changes, you will need to
c) decide how to perform the reduction: now it is not possible to use the variable sum.
user@host% cd ../step-02
user@host% emacs worker.cc
user@host% make
user@host% ./runme
% record new results on host
user@host% micnativeloadex runmeMIC
% record the results on coprocessor
Can you explain the results? Hint: try to compile worker.cc with the argument -vec-report3.
7. Optimization attempt: loop collapse + strip-mine.
The reason for the failure of the optimization in the previous step is that the compiler does not know how to automatically vectorize the reduction when the loop collapse technique is used. Let us assist the compiler by strip-mining the j-loop. You should now be in directory step-02. Use your previous solution or the file worker.cc in this directory. Strip-mine the j-loop, so that the inner loop along the strip can be automatically vectorized.
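A hedged sketch of the combined idea (not the book's worker.cc): the row loop and the strip loop are collapsed to create enough parallel work, while the innermost loop over one strip remains auto-vectorizable. The strip length is assumed to divide n.

#include <omp.h>

void RowSums(const float *A, double *sum, const int m, const int n) {
  const int STRIP = 1024;                       // strip length (assumed to divide n)
  for (int i = 0; i < m; i++) sum[i] = 0.0;
#pragma omp parallel for collapse(2)
  for (int i = 0; i < m; i++)
    for (int jj = 0; jj < n; jj += STRIP) {
      double partial = 0.0;
      for (int j = jj; j < jj + STRIP; j++)     // vectorizable strip
        partial += A[(size_t)i * n + j];
#pragma omp atomic
      sum[i] += partial;                        // combine strip results per row
    }
}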
Look up the solution in step-03 and benchmark it:
user@host% cd ../step-03
user@host% emacs worker.cc
user@host% make
user@host% ./runme
% record new results on host
user@host% micnativeloadex runmeMIC
% record the results on coprocessor
Goal
Affinity control allows you to get an additional performance gain by optimizing the distribution of resources.

Instructions
1. To show core affinity control, for this first step we will use the source code from the last practical exercise – summation of the column elements of a matrix (aka the row-wise matrix reduction); see the corresponding Makefile and source files.
Memory bandwidth-intensive calculations like this one are best run when hyper-threading is not used and KMP_AFFINITY=scatter is set. This is because the processor or the coprocessor can employ all available memory controllers, and at the same time, there is no thread contention on the memory controllers.
user@host% make
2. Compute-bound calculations, for instance the DGEMM function from Intel MKL – matrix-matrix multiplication and summation of the type αA·B + βC – are highly arithmetically intensive problems. An example implementation can be found in the Makefile B.4.10.4 and the affinity.cpp source code file B.4.10.5. Compile and execute those files from the step-01 subfolder.
Use micnativeloadex to run this program on Intel Xeon Phi coprocessors. Use the flag -e "KMP_AFFINITY=compact" to specify the thread affinity mode. Compare the performance of the two cases.
3. For some problems running on Intel Xeon processors, KMP_AFFINITY can provide additional performance as well. Consider the problem of a one-dimensional Discrete Fast Fourier Transform (DFFT) of a large 4 GB array. The Makefile B.4.10.6 and the corresponding affinity.cpp source code file B.4.10.7 are located in the step-02 subfolder. Compile and execute this program. Note the performance.
In the same folder two additional shell scripts are provided (run1_noaffinity.sh and run2_affinity.sh). Each of them will execute the compiled runme program with modified environment variables, changing the number of threads used by Intel MKL and the affinity mode.
Run those scripts and compare the performance.
4. Modify the Makefile to specify the -par-affinity compiler flag. This defines the affinity mode at compile time, which will be used at run time.
The next practical exercise corresponds to the material covered in Chapter 4 (pages 200 – 213).

Goal
Study cache optimization techniques, and compare the performance gain from loop interchange and tiling.

Instructions
1. The following practical exercise is based on a program calculating the transient emissivity of cosmic dust. Study the source code B.4.11.2, B.4.11.3, and the corresponding Makefile B.4.11.1. There are three nested loops in worker.cc, iterated with i, j, and k.
Question 4.11.a. Interchanging the i-loop and the j-loop will increase the performance. Can you explain why?
Modify the source code to interchange the nested loops iterated over i and j. The k-loop provides unit-stride access for vectorization, and thus changing its order would only decrease the performance. Compare your result with B.4.11.4 from the step-01 subfolder.
2. In the next step you will be asked to tile the i-loop. Define an additional constant iTile and split the i-loop into two nested loops, with the internal one making iTile iterations.
Try different values of iTile and find the optimal one. The iTile loop will increase the performance, since several vector registers will keep the planckFunc[] array, thus reducing the time needed to copy those chunks from the L1 cache.
Compare your source code with B.4.11.5 from the step-02 folder.
3. Tiling can be used for both the i- and j-loops, providing data locality for both planckFunc[] and distribution[].
Note: Two nested loops within a third one prevent it from being auto-vectorized by the compiler. Thus, you will need to explicitly unroll one of them.
Use the __MIC__ macro and find the optimal parameters iTile and jTile for the Intel Xeon processor and the Intel Xeon Phi coprocessor.
Compare your results with B.4.11.6 from step-03 folder.
4. Combine all the steps above together in one file, for instance as shown in B.4.11.7. Use Intel VTune Amplifier XE to compare cache replacements in the different cache levels. You should get something similar to the following plot:
This plot indicates that at every optimization step the number of L1 and L2 data cache replacements decreased, since we optimized data locality by using tiling.
Answers
Answer 4.11.a. The distribution[] array is a private array for each thread, while planckFunc[] is shared between the threads in the parallel environment. Therefore, it is better to keep data locality for the distribution[] array; otherwise portions of it will be copied from the L2 to the L1 cache several times more often (proportionally to the number of threads accessing the same cache) than planckFunc[].
Goal
Study the cache optimization technique of cache-oblivious algorithms.
Instructions
1. The folder labs/4/4.c-cache-oblivious-recursion/step-00 contains several files: Makefile B.4.12.1, main.cc B.4.12.2, and worker.cc B.4.12.3 – the source code of a parallel 28000 × 28000 matrix transposition.
The file main.cc implements the matrix initialization, the verification of correct transposition, and timed calls of the Transpose() function from the worker.cc source code file. This function uses an Intel Cilk Plus parallel for loop to iterate over the external loop. The function is not optimized: it exchanges elements of the matrix from below the matrix's diagonal with elements above it. The Intel C++ Compiler suspects a vector dependence, and thus does not vectorize the inner loop. Using #pragma ivdep we can force the compiler to auto-vectorize it, but this actually does not increase the performance.
We can increase the performance by applying a tiling algorithm, which improves data locality by re-using data already in the cache. Try to implement this optimization technique. Compare your result with the source code B.4.12.4 from the step-01 folder.
2. The program from the previous step can be additionally optimized by providing the #pragma loop_count avg(TILE) and #pragma simd pragmas, as shown in the B.4.12.5 source code file.
3. A cache-oblivious algorithm shows even better performance for this problem. Try to implement a recursive function for the matrix transposition, using different values of the recursion threshold constant RT for the Intel Xeon processor and the Intel Xeon Phi coprocessor. Compare your result with B.4.12.6 from the step-03 folder.
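A hedged, serial sketch of the cache-oblivious recursion (not listing B.4.12.6): the index ranges are split recursively until a block is smaller than the threshold RT, and only elements above the diagonal are swapped with their mirrors.

#include <algorithm>
#include <cstddef>

static const int RT = 64;   // recursion threshold (tuned separately for CPU and coprocessor)

// Transpose the block with rows [r0,r1) and columns [c0,c1) of an n x n matrix,
// swapping A[i][j] with A[j][i] only when j > i (so every pair is swapped once).
void Transpose(float *A, const int n, const int r0, const int r1,
               const int c0, const int c1) {
  const int dr = r1 - r0, dc = c1 - c0;
  if (dr <= RT && dc <= RT) {
    for (int i = r0; i < r1; i++)
      for (int j = (c0 > i + 1 ? c0 : i + 1); j < c1; j++)
        std::swap(A[(size_t)i * n + j], A[(size_t)j * n + i]);
  } else if (dr >= dc) {
    const int rm = r0 + dr / 2;                 // split the longer dimension
    Transpose(A, n, r0, rm, c0, c1);
    Transpose(A, n, rm, r1, c0, c1);
  } else {
    const int cm = c0 + dc / 2;
    Transpose(A, n, r0, r1, c0, cm);
    Transpose(A, n, r0, r1, cm, c1);
  }
}
// Initial call: Transpose(A, n, 0, n, 0, n);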
4. Vector operations will benefit significantly if the data split points are multiples of the SIMD vector length. Using the modulo operation, implement recursive splitting at those points. Compare your result with the reference solution.
The next practical exercise corresponds to the material covered in Section 4.5.6 (pages 216 – 220).

Goal
Study the cache optimization technique based on loop fusion.
Instructions
1. In the following practical exercise we will calculate the mean value and the standard deviation of 10000 randomly distributed arrays (generated with Intel MKL) of 50000 elements each.
Initially (see labs/4/4.d-cache-loop-fusion/step-00, B.4.13.1, B.4.13.2, and B.4.13.3), the initialization function and the calculation of the mean and standard deviation values are called separately. However, those functions can be combined together, providing additional speed-up due to loop fusion and avoiding additional overhead by using only one OpenMP parallel region.
Combine those functions and compare your result with the source code B.4.13.4 from the step-01 subfolder.
2. For this particular problem we do not need to keep all the randomly generated data on the heap; rather, we can generate it on the stack within the parallel OpenMP region for each individual thread. Compare your result with B.4.13.5 from the step-02 subfolder.
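A hedged sketch of the fused version (illustrative names, assuming the data is stored as nVectors contiguous vectors of vecSize floats):

#include <cmath>

void MeanAndStdev(const float *data, const int nVectors, const int vecSize,
                  float *mean, float *stdev) {
#pragma omp parallel for
  for (int v = 0; v < nVectors; v++) {
    double s = 0.0, s2 = 0.0;
    const float *x = &data[(size_t)v * vecSize];
    for (int i = 0; i < vecSize; i++) {       // one fused pass over the vector
      s  += x[i];
      s2 += (double)x[i] * x[i];
    }
    mean[v]  = (float)(s / vecSize);
    stdev[v] = (float)std::sqrt(s2 / vecSize - (s / vecSize) * (s / vecSize));
  }
}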
Goal
Offloading function calls can be optimized through precise control of data manipulation.
Instructions
1. Using the source code files from labs/4/4.e-offload/step-00: main.cc B.4.14.2, worker.cc B.4.14.3, and Makefile B.4.14.1, compare the performance of different offload implementations. The default offload procedure contains the following steps: allocating memory on the coprocessor, transferring data, performing the offload calculation, and deallocating memory on the coprocessor. For the offload with memory retention, please write the body of the function so that the memory container on the coprocessor is allocated only once and retained between offloads.
2. Implement the offload function with data persistence next. In the body of the function, the data is transferred to the coprocessor during the first iteration, the allocated memory is retained afterwards, and the data is reused by subsequent offloads.
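A hedged sketch of offload with memory retention using the alloc_if/free_if clauses of the Intel compiler offload pragma (names illustrative): the coprocessor buffer is allocated on the first offload, kept between offloads, and freed only on the last one.

void ProcessMany(float *data, const int n, const int nIterations) {
  for (int it = 0; it < nIterations; it++) {
    const int first = (it == 0), last = (it == nIterations - 1);
#pragma offload target(mic:0) \
        in(data : length(n) alloc_if(first) free_if(last))
    {
      // ... computation on the coprocessor using data[0..n-1] ...
      for (int i = 0; i < n; i++) data[i] *= 2.0f;
    }
    // Results would be copied back with a corresponding out(...) clause.
  }
}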
The next practical exercise corresponds to the material covered in Section 4.7 (pages 225 – 248).

Goal
Heterogeneous parallel computing requires proper load balancing, which will be demonstrated next.
Instructions
1. Reproduce the code for calculating the number π with a simple Monte Carlo simulation, in which points with random coordinates, uniformly distributed in the unit square, may also fall inside the quarter of a unit circle inscribed in that square. A detailed description of the problem can be found in the corresponding section of the main text (see Section 4.7.1).
The total number of iterations, iter = 2^32, should be fixed and distributed between the available MPI processes. Use the fixed constant blockSize = 2^12 as the number of iterations of the innermost vectorized loop. A quick random number generator can be taken from the Intel MKL library, and since the problem is two-dimensional, you will need 2*blockSize random numbers per block. Try different access patterns for choosing the x and y coordinates; see which one is the most efficient and explain why.
The computations should be evenly distributed between all MPI processes. The final number of points falling inside the quarter circle should be collected from all processes with an MPI_Reduce() function call, and the final answer printed out by a single MPI process (rank 0).
Write, compile and execute your code. To check your implementation you can use B.4.15.1 and B.4.15.2 from labs/4/4.f-MPI-load-balance/step-00. You can use the following commands to run the MPI jobs automatically:
user@host% make
user@host% make run
user@host% make runhost
user@host% make runmic
user@host% make runboth
These commands compile the program, copy the corresponding version of the binary executable to the Intel Xeon Phi coprocessors, and launch the MPI run. If the host CPU and the coprocessors have a different number of cores than assumed, modify the Makefile with the correct values. A sketch of the core computation is shown below.
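A minimal sketch of the core computation follows; seeding one MKL stream per rank and distributing the blocks cyclically across ranks are simplifying assumptions, and error checking is omitted:

#include <mpi.h>
#include <mkl_vsl.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
  const long totalIterations = 1L << 32;     // total number of random points
  const int  blockSize       = 1 << 12;      // points per vectorized block
  MPI_Init(&argc, &argv);
  int rank, ranks;
  MPI_Comm_size(MPI_COMM_WORLD, &ranks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  VSLStreamStatePtr rng;                     // one RNG stream per MPI process
  vslNewStream(&rng, VSL_BRNG_MT19937, 2013 + rank);

  const long nBlocks = totalIterations / blockSize;
  long underCurve = 0;
  float r[2*blockSize];                      // x in r[0..blockSize), y in r[blockSize..)
  for (long b = rank; b < nBlocks; b += ranks) {   // even split of blocks across ranks
    vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, rng, 2*blockSize, r, 0.0f, 1.0f);
    int count = 0;
#pragma simd reduction(+:count)
    for (int i = 0; i < blockSize; i++)
      if (r[i]*r[i] + r[blockSize+i]*r[blockSize+i] < 1.0f) count++;
    underCurve += count;
  }

  long total = 0;
  MPI_Reduce(&underCurve, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("pi ~ %.8f\n", 4.0*(double)total/(double)totalIterations);
  vslDeleteStream(&rng);
  MPI_Finalize();
}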
2. Static load balancing. For heterogeneous MPI calculations on the host and on Intel Xeon Phi coprocessors, the workload must be distributed proportionally to the performance of the compute devices in order to guarantee load balance and optimal use of the available resources. The ALPHA environment variable, specified by the user, defines the ratio in which the workload is split between the host system and the coprocessors. Calculate the number of ranks running on the host system and on the Intel Xeon Phi coprocessors with the help of the __MIC__ macro, and divide the workload accordingly (see the sketch below).
Compare your result with the Makefile B.4.15.3 and the B.4.15.4 source code from the step-01 subfolder. Change the environment variable ALPHA and see how it affects the performance. Try to plot this dependence and calculate the theoretical value of the optimal workload proportion between the host and the coprocessors.
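A minimal sketch of the per-rank share calculation, with a hypothetical helper name and the assumption that ALPHA expresses the performance of one host rank relative to one coprocessor rank (error handling omitted):

#include <mpi.h>
#include <stdlib.h>

// Hypothetical helper: how many of the nBlocks should this rank process?
long my_share(const long nBlocks) {
  int rank, ranks;
  MPI_Comm_size(MPI_COMM_WORLD, &ranks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int onMIC = 0;
#ifdef __MIC__
  onMIC = 1;                                 // this rank runs natively on a coprocessor
#endif
  int mics = 0;
  MPI_Allreduce(&onMIC, &mics, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  const int hosts = ranks - mics;

  const char* a = getenv("ALPHA");
  const double alpha = (a ? atof(a) : 1.0);  // relative performance of a host rank
  const double hostFraction = alpha*hosts / (alpha*hosts + mics);
  const long hostBlocks = (long)(hostFraction * nBlocks);
  return onMIC ? (nBlocks - hostBlocks)/mics : hostBlocks/hosts;
}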
3. Boss–worker model: dynamic workload distribution. Implement the Monte Carlo calculation of the number π using this model. Dedicate a special rank (rank 0) as the boss that assigns work to the rest of the ranks, the workers. Each worker should request a new portion of work when it finishes the previous one, and the boss process should respond with the number of Monte Carlo blocks the worker is to execute. The same amount of work should be distributed per request, specified by an environment variable. A sketch of the request/response protocol is shown below.
Compare your results with the source code B.4.15.5 from the step-02 subfolder.
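A minimal sketch of the request/response protocol; the fixed chunk size, the tag values, and the processBlocks() callback are illustrative assumptions rather than the lab's actual interface:

#include <mpi.h>

// Hypothetical helper: rank 0 acts as the boss, the other ranks request
// [begin, end) block ranges until a negative range tells them to stop.
long boss_worker(const long nBlocks, const long chunk,
                 long (*processBlocks)(long, long)) {
  int rank, ranks;
  MPI_Comm_size(MPI_COMM_WORLD, &ranks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  long local = 0;
  if (rank == 0) {                           // boss: hand out work on request
    long next = 0;
    int active = ranks - 1;
    while (active > 0) {
      int workerRank;
      MPI_Status st;
      MPI_Recv(&workerRank, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
      long range[2];
      if (next < nBlocks) {
        range[0] = next;
        range[1] = (next + chunk < nBlocks ? next + chunk : nBlocks);
        next = range[1];
      } else {
        range[0] = range[1] = -1;            // stop signal
        active--;
      }
      MPI_Send(range, 2, MPI_LONG, workerRank, 0, MPI_COMM_WORLD);
    }
  } else {                                   // worker: ask, compute, repeat
    while (1) {
      MPI_Status st;
      MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
      long range[2];
      MPI_Recv(range, 2, MPI_LONG, 0, 0, MPI_COMM_WORLD, &st);
      if (range[0] < 0) break;
      local += processBlocks(range[0], range[1]);
    }
  }
  return local;
}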
4. Hybrid MPI and OpenMP. Modify the source code from the previous step to use a combination of MPI and OpenMP, the hybrid model. Each worker MPI process receives a portion of the workload and spreads it between OpenMP threads (see the sketch below). Optimize the code for performance and try different scheduling mechanisms for #pragma omp for.
Compare your result with the files B.4.15.6 and B.4.15.7 from the step-03 subfolder.
Using different combinations of MPI processes and OpenMP threads on the host system and on the Intel Xeon Phi coprocessors, find the optimal hybrid mode parameters.
Additional exercise: measure the amount of MPI communication and the load imbalance of your program. How do different combinations of MPI processes and OpenMP threads affect these characteristics?
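Inside each worker, the received range of blocks can be spread over OpenMP threads; a minimal sketch follows, where countBlock() is an assumed per-block routine:

// Sketch: hybrid worker, one MPI rank spreading its [begin, end) range over threads.
long process_range_hybrid(const long begin, const long end,
                          long (*countBlock)(long)) {
  long local = 0;
#pragma omp parallel for schedule(guided) reduction(+:local)
  for (long b = begin; b < end; b++)
    local += countBlock(b);
  return local;
}

Different schedule clauses (static, dynamic, guided) can be compared directly in this loop.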
5. Guided workload distribution: avoiding MPI communication. Using the code above, modify the workload distribution algorithm used by the boss process. Instead of assigning the same amount of computation per request, the boss should spread a portion of the remaining workload between the workers, calculated dynamically in the chunkSize variable. Thereafter, the workload amount should be decreased by half, and so on. To avoid massive MPI communication for the small chunks at the end, use a threshold value for the smallest chunk, defined in the environment variable GRAIN_SIZE (see the sketch below).
Compare your results with the source code B.4.15.8 from the step-04 subfolder.
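A minimal sketch of the boss's chunk-size schedule; chunkSize and GRAIN_SIZE come from the exercise, while halving on every call and the default grain value are assumptions:

#include <stdlib.h>

// Sketch: geometrically decreasing chunks, never smaller than GRAIN_SIZE.
long next_chunk(const long remaining, long* chunkSize) {
  const char* g = getenv("GRAIN_SIZE");
  const long grain = (g ? atol(g) : 1024);   // smallest chunk the boss hands out
  if (*chunkSize < grain) *chunkSize = grain;
  const long assigned = (*chunkSize < remaining ? *chunkSize : remaining);
  *chunkSize /= 2;                           // decrease the workload amount by half
  return assigned;
}

Because the chunks shrink geometrically down to GRAIN_SIZE, the number of MPI messages stays small compared with fixed-size chunks of the same granularity.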
Appendix B
B.2 Source Code for Chapter 2: Programming Models
B.2.1 Compiling and Running Native Applications on Intel Xeon Phi Coprocessors
Back to Lab A.2.1.
B.2.1.1 labs/2/2.1-native/hello.c
1
2 file hello.c, located at 2/2.1-native
5
6 Redistribution or commercial usage without a written permission
7
B.2.1.2 labs/2/2.1-native/donothinger.c
9
10 #include <stdio.h>
11 #include <unistd.h>
12 #include <pthread.h>
13
14 void *Spin(void *threadid){
15 long tid;
16 tid = (long)threadid;
17 printf("Hello World from thread #%ld!\n", tid);
18 fflush(0);
19 while(1);
20 pthread_exit(NULL);
21 }
22
23 int main (int argc, char *argv[]){
24 int numThreads=sysconf(_SC_NPROCESSORS_ONLN);
25 pthread_t threads[numThreads];
26 printf("Spawning %d threads that do nothing, press ^C to terminate.\n", numThreads);
27 if (numThreads > 0){
28 for (int i = 1; i < numThreads; i++){
29 int rc = pthread_create(&threads[i], NULL, Spin, (void *)i);
30 if (rc){
31 printf("ERROR; return code from pthread_create() is %d\n", rc);
32 return -1;
33 }
34 }
35 }
36 Spin(NULL);
37 pthread_exit(NULL);
38 }
B.2.1.3 labs/2/2.1-native/donothinger-offload.c
26 int numThreads=sysconf(_SC_NPROCESSORS_ONLN);
27 pthread_t threads[numThreads];
28 printf("Spawning %d threads that do nothing, press ^C to terminate.\n", numThreads);
29 if (numThreads > 0){
30 for (int i = 1; i < numThreads; i++){
31 int rc = pthread_create(&threads[i], NULL, Spin, (void *)i);
32 if (rc){
33 printf("ERROR; return code from pthread_create() is %d\n", rc);
34 //return -1;
35 }
36 }
37 }
38 Spin(NULL);
39 pthread_exit(NULL);
40 }
41 }
B.2.2.1 labs/2/2.2-explicit-offload/step-00/Makefile
CXX = icpc
CXXFLAGS =
OBJECTS = offload.o
.SUFFIXES: .o .cpp
.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
all: runme
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
clean:
rm -f *.o runme
B.2.2.2 labs/2/2.2-explicit-offload/step-00/offload.cpp
34
35 }
Back to Lab A.2.2.
B.2.2.3 labs/2/2.2-explicit-offload/step-01/offload.cpp
6
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 const int size = 1000;
14 int data[size];
15
16 int CountNonZero(const int N, const int* arr){
17 int nz=0;
18 for ( int i = 0 ; i < N ; i++ ){
19 if ( arr[i] != 0 ) nz++;
20 }
21 return nz;
22 }
23
24 int main( int argc, const char* argv[] ){
25
26 int numberOfNonZeroElements;
27
28 // initialize array of integers
29 for ( int i = 0; i < size ; i++) {
30 data[i] = rand() % 10;
31 }
32
B.2.2.4 labs/2/2.2-explicit-offload/step-02/offload.cpp
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
15
16 int CountNonZero(const int N, const int* arr){
17 int nz=0;
19 if ( arr[i] != 0 ) nz++;
20 }
21 return nz;
22 }
23
24 int main( int argc, const char* argv[] ){
25
26 int numberOfNonZeroElements;
27
28 // initialize array of integers
29 for ( int i = 0; i < size ; i++) {
30 data[i] = rand() % 10;
31 }
32
33 #pragma offload target(mic)
34 {
35 #ifdef __MIC__
36 printf("Offload is successful!\n");
37 fflush(0);
38 #else
39 printf("Offload has failed miserably!\n");
40 #endif
41 }
42
43 numberOfNonZeroElements = CountNonZero(size, data);
44 printf("There are %d non-zero elements in the array.\n", numberOfNonZeroElements);
45 }
B.2.2.5 labs/2/2.2-explicit-offload/step-03/offload.cpp
17 }
18 return nz;
19 }
20
22
23 int numberOfNonZeroElements;
24
26 {
28 int data[size];
29
31
34
B.2.2.6 labs/2/2.2-explicit-offload/step-04/offload.cpp
12
13 #pragma offload_attribute(push, target(mic))
14 const int size = 1000;
15 int data[size];
16
17 int CountNonZero(const int N, const int* arr){
18 int nz=0;
19 for ( int i = 0 ; i < N ; i++ ){
20 if ( arr[i] != 0 ) nz++;
21 }
22 return nz;
23 }
24 #pragma offload_attribute(pop)
25
26 int main( int argc, const char* argv[] ){
27
28 // initialize array of integers
29 for ( int i = 0; i < size ; i++) {
30 data[i] = rand() % 10;
31 }
32
33 int numberOfNonZeroElements;
34 #pragma offload target(mic)
35 numberOfNonZeroElements = CountNonZero(size, data);
36
37 printf("There are %d non-zero elements in the array.\n", numberOfNonZeroElements);
38 }
B.2.3.1 labs/2/2.3-explicit-offload-persistence/step-00/Makefile
CXX = icpc
CXXFLAGS =
OBJECTS = offload.o
.SUFFIXES: .o .cpp
.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
all: runme
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
clean:
rm -f *.o runme
B.2.3.2 labs/2/2.3-explicit-offload-persistence/step-00/offload.cpp
20
21 for ( long i = 0 ; i < N ; i++ ) {
22 sum += p[i];
23 }
24
25 printf("\nsum = %f\n", sum);
26 }
B.2.3.3 labs/2/2.3-explicit-offload-persistence/step-01/offload.cpp
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 __attribute__((target(mic))) double sum = 0;
14
15 int main(){
16
17 const long N=10000;
18 double *p = (double*) malloc(N*sizeof(double));
19 p[0:N] = 1.0; // Cilk Plus array notation
20
21 #pragma offload target (mic) in(p : length(N)) inout(sum)
22 {
23 for ( long i = 0 ; i < N ; i++ ) {
24 sum += p[i];
25 }
26 }
27 printf("\nsum = %f\n", sum);
28 }
B.2.3.4 labs/2/2.3-explicit-offload-persistence/step-02/offload.cpp
17 double sumHost = 0;
18 const long N=10000;
19 double *p = (double*) malloc(N*sizeof(double));
20 p[0:N] = 1.0; // Cilk Plus array notation
21
22 #pragma offload target (mic:0) in(p : length(N)) inout(sum:free_if(0))
23 {
25 sum += p[i];
26 }
27 }
28
29 printf("After the offload: sum = %f \n", sum);
30 sum += 1.0;
31
32
33 #pragma offload_transfer target (mic:0) out(sum:alloc_if(0) free_if(1))
34
B.2.3.5 labs/2/2.3-explicit-offload-persistence/step-03/offload.cpp
15 int main(){
16
17 double sumHost = 0;
18 const long N=10000;
19 double *p = (double*) malloc(N*sizeof(double));
20 p[0:N] = 1.0; // Cilk Plus array notation
21
22 #pragma offload target (mic:0) in(p : length(N)) signal(p)
23 {
24 for ( long i = 0 ; i < N ; i++ ) {
25 sum += p[i];
26 }
27 }
28
29 printf("After the offload: sum = %f \n", sum);
30 sum += 1.0;
31 printf("Data change on the host: sum = %f \n", sum);
32
33 #pragma offload_transfer target (mic:0) out(sum) wait(p)
34
35 printf("Copy data back from coprocessor: sum = %f \n", sum);
36 }
B.2.4 Explicit Offload: Putting it All Together
B.2.4.1 labs/2/2.4-explicit-offload-matrix/step-00/Makefile
CXX = icpc
CXXFLAGS = -vec-report -g
OBJECTS = matrix.o
.SUFFIXES: .o .cpp
.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
all: runme
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
clean:
rm -f *.o runme
B.2.4.2 labs/2/2.4-explicit-offload-matrix/step-00/matrix.cpp
28
29
30 for ( int i = 0 ; i < m ; i++)
31 printf("%f\t", c[i]);
32 printf("\n");
33 }
B.2.4.3 labs/2/2.4-explicit-offload-matrix/step-01/matrix.cpp
2
B.2.4.4 labs/2/2.4-explicit-offload-matrix/step-02/matrix.cpp
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 // use `ulimit -s unlimited` to increase the stack size for the process
14 // Otherwise, code will be stopped with the "segmentation fault" error
15
16 int main(){
17
20
21
22 // Cilk Plus array notation
23 A[0:n*m]=1.0/(double)n;
24 b[:]=1.0;
25 c[:]=0;
26
27 #pragma offload target(mic) in (A:length(n*m))
28 for ( int i = 0 ; i < m ; i++)
29 for ( int j = 0 ; j < n ; j++)
30 c[i] += A[i*n+j] * b[j];
31
32 double norm = 0.0;
33 for ( int i = 0 ; i < m ; i++)
34 norm += (c[i] - 1.0)*(c[i] - 1.0);
35 if (norm > 1e-10)
36 printf("Norm is equal to %f\n", norm);
37 else
38 printf("Yep, we’re good!\n");
39 }
B.2.4.5 labs/2/2.4-explicit-offload-matrix/step-03/matrix.cpp
20 double b[n], c[m];
21 double * A = (double*) malloc(sizeof(double)*n*m);
22
23 // Cilk Plus array notation
24 A[0:n*m]=1.0/(double)n;
25 b[:]=1.0;
26 c[:]=0;
27
29
31
35
37 {
B.2.4.6 labs/2/2.4-explicit-offload-matrix/step-04/matrix.cpp
29
30 printf("Iteration %d of %d...\n", iter+1, maxIter);
31
32 b[:] = (double) iter;
33 int size = iter == 0 ? n*m : 1;
38
39
40 }
B.2.4.7 labs/2/2.4-explicit-offload-matrix/step-05/matrix.cpp
44
45 c_host[i] += A[i*n+j] * b[j];
46
47 // sync before proceeding
48 #pragma offload_transfer target(mic:1) wait(A)
49
50 double norm = 0.0;
53
54 if (norm > 1e-10)
55 printf("ERROR!\n");
56 else
58 }
59 }
B.2.5.1 labs/2/2.5-sharing-complex-objects/step-00/Makefile
CXX = icpc
CXXFLAGS =
OBJECTS = cilk-shared-offload.o
.SUFFIXES: .o .cpp
.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
all: runme
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
clean:
rm -f *.o runme
B.2.5.2 labs/2/2.5-sharing-complex-objects/step-00/cilk-shared-offload.cpp
12
13 int ar2[N];
14 int res[N];
15
16 void initialize() {
17 for (int i = 0; i < N; i++) {
18 ar1[i] = i;
19 ar2[i] = 1;
20 }
21 }
22
24 void add() {
27
28 }
29
30 void verify() {
31 bool errors = false;
32 for (int i = 0; i < N; i++)
33 errors |= (res[i] != (ar1[i] + ar2[i]));
34 printf("%s\n", (errors ? "ERROR" : "CORRECT"));
35 }
36
37 int main(int argc, char *argv[]) {
38 initialize();
39 add(); // Make function call on coprocessor:
40 // ar1, ar2 should be copied in, res copied out
41 verify();
42 }
B.2.5.3 labs/2/2.5-sharing-complex-objects/step-01/cilk-shared-offload.cpp
24
25 for (int i = 0; i < N; i++)
26 res[i] = ar1[i] + ar2[i];
27 #else
28 printf("Offload to coprocessor failed!\n");
29 #endif
30 }
31
32 void verify() {
33 bool errors = false;
34 for (int i = 0; i < N; i++)
37 }
38
40 initialize();
41
42 // // ar1, ar2 are copied in, res copied out
43 verify();
44 }
B.2.5.4 labs/2/2.5-sharing-complex-objects/step-02/dynamic-alloc.cpp
35
36 printf("%s\n", (sum==N/2 ? "CORRECT" : "ERROR"));
37 free(data);
38 }
B.2.5.5 labs/2/2.5-sharing-complex-objects/step-03/dynamic-alloc.cpp
4
B.2.5.6 labs/2/2.5-sharing-complex-objects/step-04/structures.cpp
10 #include <stdio.h>
11 #include <string.h>
12
13 // share the structure between the host and the coprocessor
14 typedef struct {
15 int i;
16 char c[10];
17 } person;
18
20
21
22 p.i = i;
23 strcpy(p.c, name);
25
26 //printf("Offload to coprocessor failed.\n");
27
28 }
29
30 person someone;
31 char who[10];
32
33 int main(){
34 strcpy(who, "John");
35 SetPerson(someone, who, 1);
36 printf("On host: %d %s\n", someone.i, someone.c);
37 }
B.2.5.7 labs/2/2.5-sharing-complex-objects/step-05/structures.cpp
28
29 person _Cilk_shared someone;
30 char _Cilk_shared who[10];
31
32 int main(){
33 strcpy(who, "John");
36 }
B.2.5.8 labs/2/2.5-sharing-complex-objects/step-06/classes.cpp
26 strcpy(c, name);
27 printf("On coprocessor: %d %s\n", i, c);
28
29 //printf("Offload to coprocessor failed.\n");
30
31 }
32 };
33
34 Person someone;
35 char who[10];
36
37 int main(){
38 strcpy(who, "Mary");
39 someone.Set(who, 2); // make offload function call
40 printf("On host: %d %s\n", someone.i, someone.c);
41 }
B.2.5.9 labs/2/2.5-sharing-complex-objects/step-07/classes.cpp
1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file classes.cpp, located at 2/2.5-sharing-complex-objects/step-07
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
9
10 #include <stdio.h>
11 #include <string.h>
12
13
14 public:
15 int i;
16 char c[10];
17
18 Person() {
19 i=0; c[0]=’\0’;
20 }
21
22 void Set(_Cilk_shared const char* name, const int i0) {
23 #ifdef __MIC__
24 i = i0;
25 strcpy(c, name);
26 printf("On coprocessor: %d %s\n", i, c);
27 #else
28 printf("Offload to coprocessor failed.\n");
29 #endif
30 }
31 };
32
33 Person _Cilk_shared someone;
34 char _Cilk_shared who[10];
35
36 int main(){
37 strcpy(who, "Mary");
38 _Cilk_offload someone.Set(who, 2);
39 printf("On host: %d %s\n", someone.i, someone.c);
40 }
B.2.5.10 labs/2/2.5-sharing-complex-objects/step-08/new.cpp
14
15 class MyClass {
16 int i;
17
18 public:
19 MyClass(){ i = 0; };
20
22
23 void print(){
24 #ifdef __MIC__
26 #else
29 printf("%d\n", i);
30 }
31 };
32
33 MyClass* sharedData;
34
35 int main()
36 {
37 const int size = sizeof(MyClass);
38 // allocate the memory and pass the pointer to it to the new operator
39 MyClass* address = (MyClass*) malloc(size);
40 sharedData=new MyClass;
41
42 sharedData->set(1000); // Shared data initialized on host
43 //sharedData->print(); // Shared data used on coprocessor
44 sharedData->print(); // Shared data used on host
45 }
B.2.5.11 labs/2/2.5-sharing-complex-objects/step-09/new.cpp
24
25 #else
26 printf("On host: ");
27 #endif
28 printf("%d\n", i);
29 }
30 };
31
33
34 int main()
35 {
39
41
42 sharedData->print(); // Shared data used on host
43 }
B.2.6.1 labs/2/2.6-multiple-coprocessors/step-03/Makefile
CXX = icpc
CXXFLAGS = -openmp
OBJECTS = multiple.o
.SUFFIXES: .o .cpp
.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
all: runme
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
clean:
rm -f *.o runme
B.2.6.2 labs/2/2.6-multiple-coprocessors/step-00/multiple.cpp
9
10 #include <stdio.h>
11
12 // Write offloaded function, which will print out the device number, using
13 // _Offload_get_device_number() function call.
14
16
18
20 }
B.2.6.3 labs/2/2.6-multiple-coprocessors/step-01/multiple.cpp
23
24 for(int i=0; i<numDevices; i++){
25 _Cilk_offload_to(i) whatIsMyNumber(numDevices);
26 }
27 }
B.2.6.4 labs/2/2.6-multiple-coprocessors/step-02/multiple.cpp
10 #include <stdlib.h>
11 #include <stdio.h>
12
13 int* response;
14 int _Cilk_shared n_d;
15
16 int main(){
17 n_d = _Offload_number_of_devices();
18 if (n_d < 1) {
20
21 }
22
24 response[0:n_d] = 0;
25
26 for (int i = 0; i < n_d; i++) {
28 {
29 #ifdef __MIC__
30 response[i] = 1;
31 #endif
32 }
33 }
34
35 for (int i = 0; i < n_d; i++)
36 if (response[i] == 1) {
37 printf("OK: device %d responded\n", i);
38 } else {
39 printf("Error: device %d did not respond\n", i);
40 }
41 }
B.2.6.5 labs/2/2.6-multiple-coprocessors/step-03/multiple.cpp
24
25
26 for (int i = 0; i < n_d; i++) {
27 #pragma offload target(mic:i) inout(response[i:1])
28 {
29 #ifdef __MIC__
30 response[i] = 1;
31 #else
32 response[i] = 0;
33 #endif
34 }
35 }
36
40 } else {
41
42 }
43 }
B.2.7.1 labs/2/2.7-asynchronous-offload/step-03/Makefile
CXX = icpc
CXXFLAGS = -openmp
OBJECTS = async.o
.SUFFIXES: .o .cpp
.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
all: runme
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
clean:
rm -f *.o runme
B.2.7.2 labs/2/2.7-asynchronous-offload/step-00/async.cpp
9
10 #include <stdio.h>
11
12 void _Cilk_shared whatIsMyNumber(int numDevices){
13 int currentDevice = _Offload_get_device_number();
14 printf("Hello from %d coprocessor out of %d.\n", currentDevice, numDevices);
15 fflush(0);
16 }
17
18 int _Cilk_shared numDevices;
19
20 int main(int argc, char *argv[]) {
21 numDevices = _Offload_number_of_devices();
23
24
25 _Cilk_offload_to(i) whatIsMyNumber(numDevices);
26 }
27 }
B.2.7.3 labs/2/2.7-asynchronous-offload/step-01/async.cpp
16 }
17
18 int _Cilk_shared numDevices;
19
20 int main(int argc, char *argv[]) {
21 numDevices = _Offload_number_of_devices();
22 printf("Number of Target devices installed: %d\n\n" ,numDevices);
23
24 for(int i=0; i<numDevices; i++){
25 _Cilk_spawn _Cilk_offload_to(i) whatIsMyNumber(numDevices);
26 }
27 }
B.2.7.4 labs/2/2.7-asynchronous-offload/step-02/async.cpp
1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file async.cpp, located at 2/2.7-asynchronous-offload/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdlib.h>
11 #include <stdio.h>
12
13 int* response;
15
16 int main(){
17 n_d = _Offload_number_of_devices();
18 if (n_d < 1) {
20 return 2;
21 }
22
24 response[0:n_d] = 0;
25
26 // Make the following loop run in parallel with OpenMP
27 for (int i = 0; i < n_d; i++) {
28 // The body of this loop is executed by n_d host threads concurrently
29 //
30 // Use pragma to specify the targets and data manipulation clauses
31 {
32 // Each offloaded segment blocks the execution of the thread that launched it
33 response[i] = 1;
34 }
35 }
36
37 for (int i = 0; i < n_d; i++)
38 if (response[i] == 1) {
39 printf("OK: device %d responded\n", i);
40 } else {
41 printf("Error: device %d did not respond\n", i);
42 }
43 }
B.2.7.5 labs/2/2.7-asynchronous-offload/step-03/async.cpp
17 if (n_d < 1) {
18 printf("No devices available!");
19 return 2;
20 }
21
22 response = (int*) malloc(n_d*sizeof(int));
23 response[0:n_d] = 0;
24
29
30 // Each offloaded segment blocks the execution of the thread that launched it
31 response[i] = 1;
32 }
33 }
34
B.2.7.6 labs/2/2.7-asynchronous-offload/step-04/async.cpp
10 #include <stdlib.h>
11 #include <stdio.h>
12
13 int* response;
14 int _Cilk_shared n_d;
15
16 int main(){
17
18 n_d = _Offload_number_of_devices();
19
20 if (n_d < 1) {
21 printf("No devices available!");
22 return 2;
23 }
24
25 response = (int*) malloc(n_d*sizeof(int));
26 response[0:n_d] = 0;
27
28 for (int i = 0; i < n_d; i++) {
29 //use offload pragma with target, data manipulation clause and signal
30 {
31 // The offloaded job does not block the execution on the host
32 response[i] = 1;
33 }
34 }
35
39 }
40
42 if (response[i] == 1) {
44 } else {
45 printf("Error: device %d did not respond\n", i);
46 }
47 }
B.2.7.7 labs/2/2.7-asynchronous-offload/step-05/async.cpp
39
40 printf("Error: device %d did not respond\n", i);
41 }
42 }
Back to Lab A.2.7.
B.2.7.8 labs/2/2.7-asynchronous-offload/step-06/async.cpp
2
3 is a part of the practical supplement to the handbook
4
B.2.7.9 labs/2/2.7-asynchronous-offload/step-07/async.cpp
2 file async.cpp, located at 2/2.7-asynchronous-offload/step-07
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
7
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdlib.h>
11 #include <stdio.h>
12
14
16
17 }
18
19 int main(){
Ex
20
21 int n_d = _Offload_number_of_devices();
22
23 if (n_d < 1) {
24 printf("No devices available!");
25 return 2;
26 }
27
28 response = (int _Cilk_shared *) _Offload_shared_malloc(n_d*sizeof(int));
29 response[0:n_d] = 0;
30
31 _Cilk_for (int i = 0; i < n_d; i++) {
32 // All iterations start simultaneously in n_d host threads
33 _Cilk_offload_to(i)
34 Respond(response[i]);
35 }
36
37 for (int i = 0; i < n_d; i++)
38 if (response[i] == 1) {
39 printf("OK: device %d responded\n", i);
40 } else {
41 printf("Error: device %d did not respond\n", i);
42 }
43 }
B.2.7.10 labs/2/2.7-asynchronous-offload/step-08/async.cpp
14
15 void _Cilk_shared Respond(int _Cilk_shared & a) {
16 a = 1;
17 }
18
19 int main(){
20
22
23 if (n_d < 1) {
24 printf("No devices available!");
25 return 2;
26 }
27
29 response[0:n_d] = 0;
30
B.2.7.11 labs/2/2.7-asynchronous-offload/step-09/async.cpp
25
26 }
27
28 response = (int _Cilk_shared *) _Offload_shared_malloc(n_d*sizeof(int));
29 response[0:n_d] = 0;
30
32 _Cilk_spawn _Cilk_offload_to(i)
33 Respond(response[i]);
34 }
35
36 _Cilk_sync;
37
41 } else {
42
43 }
44 }
B.2.8.1 labs/2/2.8-MPI/step-00/Makefile
CXX = mpiicpc
CXXFLAGS =
OBJECTS = HelloMPI.o
MICOBJECTS = HelloMPI.oMIC
.c.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
.c.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
HelloMPI: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o HelloMPI $(OBJECTS)
HelloMPI.MIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o HelloMPI.MIC $(MICOBJECTS)
clean:
rm -f $(OBJECTS) $(MICOBJECTS) HelloMPI HelloMPI.MIC
Back to Lab A.2.8.
B.2.8.2 labs/2/2.8-MPI/step-00/HelloMPI.c
9
10 #include "mpi.h"
11 #include <stdio.h>
12 #include <string.h>
13
14 int main (int argc, char *argv[]) {
15 int i, rank, size, namelen;
16 char name[MPI_MAX_PROCESSOR_NAME];
17
18 MPI_Init (&argc, &argv);
19
20 MPI_Comm_size (MPI_COMM_WORLD, &size);
21 MPI_Comm_rank (MPI_COMM_WORLD, &rank);
22 MPI_Get_processor_name (name, &namelen);
23
24 printf ("Hello World from rank %d running on %s!\n", rank, name);
25 if (rank == 0) printf("MPI World size = %d processes\n", size);
26
27 MPI_Finalize ();
28 }
B.2.8.3 labs/2/2.8-MPI/step-00/hosts
1 mic0
2 mic1
B.3.1.1 labs/3/3.1-vectorization/step-00/vectorization.cpp
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11
12 int main(){
13 const int n=8;
14 int i;
15 int A[n];
16 int B[n];
17
18 // Initialization
20 A[i]=B[i]=i;
21
22
23 for (i=0; i<n; i++)
24 A[i]+=B[i];
25
26 // Output
27 for (i=0; i<n; i++)
28 printf("%2d %2d %2d\n", i, A[i], B[i]);
29 }
B.3.1.2 labs/3/3.1-vectorization/step-01/vectorization.cpp
9
10 #include <stdio.h>
11
12 int main(){
13 const int n=8;
14 int i;
15 __attribute__((align(64))) int A[n];
16 __attribute__((align(64))) int B[n];
17
18 // Initialization
19 for (i=0; i<n; i++)
20 A[i]=B[i]=i;
21
22 // This loop will be auto-vectorized
23 A[:]+=B[:];
24
25 // Output
26 for (i=0; i<n; i++)
27 printf("%2d %2d %2d\n", i, A[i], B[i]);
28 }
Back to Lab A.3.1.
B.3.1.3 labs/3/3.1-vectorization/step-02/vectorization.cpp
1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file vectorization.cpp, located at 3/3.1-vectorization/step-02
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 int main(){
14 const int n=8;
15 int i;
16 int* A = (int*) malloc(n*sizeof(int));
17 int* B = (int*) malloc(n*sizeof(int));
18
19 // Initialization
20 for (i=0; i<n; i++)
21 A[i]=B[i]=i;
22
23 // This loop will be auto-vectorized
24 A[0:n]+=B[0:n];
25
26 // Output
27 for (i=0; i<n; i++)
28 printf("%2d %2d %2d\n", i, A[i], B[i]);
29
30 free(A);
31 free(B);
32 }
B.3.1.4 labs/3/3.1-vectorization/step-03/vectorization.cpp
20 // Alignment check
21 printf("Offset for A is: %lu bytes\n",
22 (al - ( (size_t) A % al ) )%al );
23 printf("Offset for B is: %lu bytes\n",
25
26 // Initialization
27 for (i=0; i<n; i++)
28 A[i]=B[i]=i;
29
30 // This loop will be auto-vectorized
31 A[0:n]+=B[0:n];
32
33 // Output
36
37 free(A);
38 free(B);
39 }
B.3.1.5 labs/3/3.1-vectorization/step-04/vectorization.cpp
35
36 free(Bspace);
37 }
Back to Lab A.3.1.
B.3.1.6 labs/3/3.1-vectorization/step-05/vectorization.cpp
3
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5
9
10 #include <stdio.h>
11
12 int main(){
13 const int n=8;
14 const int al=64; //alignment
15 int i;
16 int* A = (int*) _mm_malloc(n*sizeof(int), al);
17 int* B = (int*) _mm_malloc(n*sizeof(int), al);
18
19 // Initialization
20 for (i=0; i<n; i++)
21 A[i]=B[i]=i;
22
23 // This loop will be auto-vectorized
24 A[0:n]+=B[0:n];
25
26 // Output
27 for (i=0; i<n; i++)
28 printf("%2d %2d %2d\n", i, A[i], B[i]);
29
30 _mm_free(A);
31 _mm_free(B);
32 }
B.3.1.7 labs/3/3.1-vectorization/step-06/vectorization.cpp
14 const int n=16;
15 const int al=64; //alignment
16 int i;
17 int* A = (int*) _mm_malloc(n*sizeof(int), al);
18 int* B = (int*) _mm_malloc(n*sizeof(int), al);
19
20 // Initialization
22 A[i]=B[i]=i;
23
29 _mm512_store_epi32(A+i, Avec);
30 }
31
32 // Output
33 for (i=0; i<n; i++)
34 printf("%2d %2d %2d\n", i, A[i], B[i]);
35
36 _mm_free(A);
37 _mm_free(B);
38 }
B.3.1.8 labs/3/3.1-vectorization/step-07/vectorization.cpp
10 #include <stdio.h>
11
12 int my_simple_add(int x1, int x2){
13 return x1+x2;
14 }
15
16 int main(){
17 const int n=8;
18 const int al=64; //alignment
19 int i;
20 int* A = (int*) _mm_malloc(n*sizeof(int), al);
21 int* B = (int*) _mm_malloc(n*sizeof(int), al);
22
23 // Initialization
24 for (i=0; i<n; i++)
25 A[i]=B[i]=i;
26
27 // This loop will be auto-vectorized
28 for (i=0; i<n; i++)
29 A[i] = my_simple_add(A[i], B[i]);
30
31 // Output
32 for (i=0; i<n; i++)
33 printf("%2d %2d %2d\n", i, A[i], B[i]);
34
35 _mm_free(A);
36 _mm_free(B);
37 }
B.3.1.9 labs/3/3.1-vectorization/step-08/main.cpp
1
B.3.1.10 labs/3/3.1-vectorization/step-08/worker.cpp
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 int my_simple_add(int x1, int x2){
11 return x1+x2;
12 }
B.3.1.11 labs/3/3.1-vectorization/step-09/main.cpp
1
B.3.1.12 labs/3/3.1-vectorization/step-09/worker.cpp
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 __attribute__((vector)) int my_simple_add(int x1, int x2){
11 return x1+x2;
12 }
B.3.1.13 labs/3/3.1-vectorization/step-0a/main.cpp
1
B.3.1.14 labs/3/3.1-vectorization/step-0a/worker.cpp
12 for (int i=0; i<n; i++)
13 a[i] = b[i];
14 }
B.3.1.15 labs/3/3.1-vectorization/step-0b/Makefile
CXX = icpc
.SUFFIXES: .o .cpp
.cpp.o:
all: runme
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
clean:
rm -f *.o runme
B.3.1.16 labs/3/3.1-vectorization/step-0b/main.cpp
29
30 }
Back to Lab A.3.1.
B.3.1.17 labs/3/3.1-vectorization/step-0b/worker.cpp
2
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
7
8 Contact information can be found at http://colfax-intl.com/ */
9
B.3.1.18 labs/3/3.1-vectorization/step-0c/vectorization.cpp
13 return x1+x2;
14 }
15
16 int main(){
17 const int n=256;
18 int i;
19 int A[n];
20 int B[n];
21
22 // Initialization
23 for (i=0; i<n; i++)
24 A[i]=B[i]=i;
25
26 // This loop will be auto-vectorized
27 _Cilk_for(int j=0; j<n; j++)
28 A[j] = my_simple_add(A[j], B[j]);
29
30 // Output
31 for (i=0; i<n; i++)
32 printf("%2d %2d %2d\n", i, A[i], B[i]);
33 }
B.3.2 Parallelism with OpenMP: Shared and Private Variables, Reduction
B.3.2.1 labs/3/3.2-OpenMP/step-00/openmp.cpp
8
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
14 const int nt=omp_get_max_threads();
15 printf("OpenMP with %d threads\n", nt);
16
17 #pragma omp parallel
18 printf("Hello World from thread %d\n", omp_get_thread_num());
19 }
B.3.2.2 labs/3/3.2-OpenMP/step-01/openmp.cpp
B.3.2.3 labs/3/3.2-OpenMP/step-02/openmp.cpp
1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file openmp.cpp, located at 3/3.2-OpenMP/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
14
17
B.3.2.4 labs/3/3.2-OpenMP/step-03/openmp.cpp
22
23
24 #pragma omp for schedule(guided, 4)
25 for (int i=0; i<nt; i++){
26 // iteration will be distributed across available threads
27 private_number += 1;
29 omp_get_thread_num(), private_number);
30 }
31 // code placed here will be executed my all threads
32 }
33 }
B.3.2.5 labs/3/3.2-OpenMP/step-04/openmp.cpp
23 }
24
25 int main(){
26 #pragma omp parallel
27 {
28 #pragma omp single
29 Recurse(0);
30 }
31 }
B.3.2.6 labs/3/3.2-OpenMP/step-05/openmp.cpp
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
14 int varShared = 5;
15 int varPrivate = 4;
16 int varFirstPrivate = 2;
17
18 #pragma omp parallel private(varPrivate) firstprivate(varFirstPrivate)
19 {
21
22 varShared += 1; // Race condition, undefined behavior!
23 varPrivate += 1;
Ex
24 varFirstPrivate += 1;
25 printf("For thread %d,\t varShared=%d\t varPrivate=%d\t varFirstPrivate=%d\n",
26 omp_get_thread_num(), varShared, varPrivate, varFirstPrivate);
27 }
28 printf("After parallel region, varShared=%d\t varPrivate=%d\t varFirstPrivate=%d\n",
29 varShared, varPrivate, varFirstPrivate);
30 }
B.3.2.7 labs/3/3.2-OpenMP/step-06/openmp.cpp
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
14 const int n = 1000;
15 int sum = 0;
16 #pragma omp parallel for
17 for (int i=0; i<n; i++){
18 // Race condition
19 sum = sum + i;
20 }
21 printf("sum = %d (must be %d)\n", sum, ((n-1)*n)/2);
22 }
B.3.2.8 labs/3/3.2-OpenMP/step-07/openmp.cpp
2 file openmp.cpp, located at 3/3.2-OpenMP/step-07
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
15 int sum = 0;
16 #pragma omp parallel for
17
18 #pragma omp critical
19 sum = sum + i; // only one thread at a time can execute this section
20 }
21 printf("sum = %d (must be %d)\n", sum, ((n-1)*n)/2);
22 }
B.3.2.9 labs/3/3.2-OpenMP/step-08/openmp.cpp
B.3.2.10 labs/3/3.2-OpenMP/step-09/openmp.cpp
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
16
17 #pragma omp parallel
18 {
20 {
21
22 int sum1=0, sum2=0;
B.3.2.11 labs/3/3.2-OpenMP/step-0a/openmp.cpp
B.3.2.12 labs/3/3.2-OpenMP/step-0b/openmp.cpp
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
B.3.3.1 labs/3/3.3-Cilk-Plus/step-00/cilk.cpp
Back to Lab A.3.3.
B.3.3.2 labs/3/3.3-Cilk-Plus/step-01/cilk.cpp
6
7 from Colfax International is prohibited.
8
9
10 #include <stdio.h>
11 #include <cilk/cilk.h>
12 #include <cilk/cilk_api_linux.h>
13
14 int main(){
15 const int nw=__cilkrts_get_nworkers();
16 printf("Cilk Plus with %d workers\n", nw);
17
18 _Cilk_for (int i=0; i<nw; i++) // Light workload: gets serialized
19 printf("Hello World from worker %d\n", __cilkrts_get_worker_number());
20
21 #pragma cilk grainsize = 4
22 _Cilk_for (int i=0; i<nw; i++){
23 double f=1.0;
24 while (f<1.0e40) f*=2.0; // Extra workload: gets parallelized
25 printf("Hello again from worker %d (%f)\n", __cilkrts_get_worker_number(), f);
26 }
27 }
B.3.3.3 labs/3/3.3-Cilk-Plus/step-02/cilk.cpp
22
23 int main(){
24 Recurse(0);
25 }
B.3.3.4 labs/3/3.3-Cilk-Plus/step-03/cilk.cpp
4
B.3.3.5 labs/3/3.3-Cilk-Plus/step-04/cilk.cpp
11 #include <cilk/reducer_opadd.h>
12 #include <cilk/cilk_api_linux.h>
13
14 int main(){
15 const int N=20;
16 cilk::reducer_opadd<int> sum;
17 sum.set_value(0);
20 sum = sum + i;
21 }
22 printf("Result=%d (must be %d)\n", sum.get_value(), ((N-1)*N)/2);
23 }
B.3.3.6 labs/3/3.3-Cilk-Plus/step-05/cilk.cpp
22
23 int main(){
24 if (0 == __cilkrts_set_param("nworkers","2")){
25 _Cilk_for( int i=0; i<10; i++){
26 Scratch scratch;
27 scratch.data[0:N] = i;
28 int sum = 0;
29 for (int j=0; j<N; j++) sum += scratch.data[j];
30 printf("i=%d, worker=%d, sum=%d\n", i, __cilkrts_get_worker_number(), sum);
31 }
32 } else {
33 printf("Failed to set workers count!\n");
34 return 1;
35 }
36 }
B.3.3.7 labs/3/3.3-Cilk-Plus/step-06/cilk.cpp
1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
3
9
10 #include <stdio.h>
11 #include <cilk/holder.h>
12 #include <cilk/cilk_api.h>
13 #include <cilk/cilk_api_linux.h>
14
15 const int N=1000000;
16
17 class Scratch {
18 public:
19 int data[N];
20 Scratch(){ printf("Constructor called by worker %d\n", __cilkrts_get_worker_number());}
21 };
22
23 int main(){
24 if (0 == __cilkrts_set_param("nworkers","2")){
25 cilk::holder<Scratch> scratch;
26 _Cilk_for( int i=0; i<10; i++){
27 scratch().data[0:N] = i; // Operator () is used for data access
28 int sum = 0;
29 for (int j=0; j<N; j++) sum += scratch().data[j];
30 printf("i=%d, worker=%d, sum=%d\n", i, __cilkrts_get_worker_number(), sum);
31 }
32 } else {
33 printf("Failed to set workers count!\n");
34 return 1;
35 }
36 }
B.3.4.1 labs/3/3.4-MPI/step-00/Makefile
CXX = mpiicpc
CXXFLAGS =
OBJECTS = mpi.o
MICOBJECTS = mpi.oMIC
.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
.cpp.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
all: runme-mpi runme-mpi.MIC
runme-mpi: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme-mpi $(OBJECTS)
runme-mpi.MIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runme-mpi.MIC $(MICOBJECTS)
clean:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun \
B.3.4.2 labs/3/3.4-MPI/step-00/mpi.cpp
Back to Lab A.3.4.
B.3.4.3 labs/3/3.4-MPI/step-01/mpi.cpp
8
9
10 #include <mpi.h>
11 #include <stdio.h>
12
13 int main(int argc, char** argv) {
14
15 // Set up MPI environment
16 int ret = MPI_Init(&argc,&argv);
17 if (ret != MPI_SUCCESS) {
18 printf("error: could not initialize MPI\n");
19 MPI_Abort(MPI_COMM_WORLD, ret);
20 }
21
22 int worldSize, rank, irank, namelen;
23 char name[MPI_MAX_PROCESSOR_NAME];
24 MPI_Status stat;
25 MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
26 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
27 MPI_Get_processor_name(name, &namelen);
28
29 if (rank == 0) {
30 printf("I am the master process, rank %d of %d running on %s\n",
31 rank, worldSize, name);
32 for (int i = 1; i < worldSize; i++){
33 // Blocking receive operations in the master process
34 MPI_Recv (&irank, 1, MPI_INT, MPI_ANY_SOURCE, i, MPI_COMM_WORLD, &stat);
B.3.4.4 labs/3/3.4-MPI/step-02/mpi.cpp
2 file mpi.cpp, located at 3/3.4-MPI/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <mpi.h>
11 #include <stdio.h>
12 #include <stdlib.h>
13
14 int main(int argc, char** argv) {
15
16
20 MPI_Abort(MPI_COMM_WORLD, ret);
21 }
22
23 const int M = 100000, N = 200000;
24 float data1[M]; data1[:]=1.0f;
25 double data2[N]; data2[:]=2.0;
26 int size1, size2;
27 int worldSize, rank, namelen;
28 char name[MPI_MAX_PROCESSOR_NAME];
29 MPI_Status stat;
30 MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
31 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
32
33 if ((worldSize > 1) && ((worldSize % 2) == 0)){
34 if (rank % 2 == 0) {
35 // Sender side: allocate user-space buffer for asynchronous
36 // communication
37 MPI_Pack_size(M, MPI_FLOAT, MPI_COMM_WORLD, &size1);
38 MPI_Pack_size(N, MPI_DOUBLE, MPI_COMM_WORLD, &size2);
39 int bufsize = size1 + size2 + 2*MPI_BSEND_OVERHEAD;
40 printf("size1 = %d, size2 = %d, MPI_BSEND_OVERHEAD = %d, allocating %d bytes\n",
41 size1, size2, MPI_BSEND_OVERHEAD, bufsize);
42 void* buffer = malloc(bufsize);
43 MPI_Buffer_attach(buffer, bufsize);
44 MPI_Bsend(data1, M, MPI_FLOAT, rank+1, rank>>1, MPI_COMM_WORLD);
45 MPI_Bsend(data2, N, MPI_DOUBLE, rank+1, rank>>1, MPI_COMM_WORLD);
46 MPI_Buffer_detach(&buffer, &bufsize);
47 free(buffer);
48 } else {
49 // Receiver size does not have to do anything special
50 MPI_Recv(data1, M, MPI_FLOAT, rank-1, rank>>1, MPI_COMM_WORLD, &stat);
51 MPI_Recv(data2, N, MPI_DOUBLE, rank-1, rank>>1, MPI_COMM_WORLD, &stat);
52 }
53 } else
54 if (rank == 0) printf("Use even number of ranks.\n");
55
56 // Terminate MPI environment
57 MPI_Finalize();
58 }
B.3.4.5 labs/3/3.4-MPI/step-03/mpi.cpp
1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file mpi.cpp, located at 3/3.4-MPI/step-03
4
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
9
10 #include <mpi.h>
11 #include <stdio.h>
12 #include <stdlib.h>
13
40 MPI_Wait(&request, &stat);
41 } else if (rank == 1){
42 // Receiver side: blocking receive of data1
43 MPI_Recv(data1, N, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &stat);
44 // At the end of blocking MPI_Recv, it is safe to use data1
45 }
46 }
47 // Terminate MPI environment
48 MPI_Finalize();
49 }
B.3.4.6 labs/3/3.4-MPI/step-04/mpi.cpp
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include "mpi.h"
11 #include <stdio.h>
12 #define SIZE 6
13
16 float sendbuf[SIZE][SIZE] = {
21
23 float recvbuf[SIZE];
24
25 MPI_Init(&argc,&argv);
26 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
27 MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
28
29 if (numtasks == SIZE) {
30 source = 1;
31 sendcount = SIZE;
32 recvcount = SIZE;
33 MPI_Scatter(sendbuf,sendcount,MPI_FLOAT,recvbuf,recvcount,
34 MPI_FLOAT,source,MPI_COMM_WORLD);
35 printf("rank= %d Results: %f\t%f\t%f\t%f\t%f\t%f\n",rank,recvbuf[0],
36 recvbuf[1],recvbuf[2],recvbuf[3],recvbuf[4],recvbuf[5]);
37 }
38 else
39 printf("Must use %d processes, using %d. Terminating.\n", SIZE, numtasks);
40
41 MPI_Finalize();
42 }
B.3.4.7 labs/3/3.4-MPI/step-05/mpi.cpp
20 }
21
22 int worldSize, rank, irank;
23 int mics, hosts, mic=0, host=0;
24 MPI_Status stat;
25
26 #ifdef __MIC__
27 mic++;
28 #else
29 host++;
30 #endif
31
32 MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
33 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
34
37
38 if (rank == 0)
39 printf("Of %d MPI processes, we have %d running on Xeon Phis and %d on\
40 the host.\n", worldSize, mics, hosts);
41
42 // Terminate MPI environment
43 MPI_Finalize();
44 }
B.4.1.1 labs/4/4.1-vtune/step-00-xeon/Makefile
CXX = icpc
CXXFLAGS = -openmp -g -O3
all:
$(CXX) $(CXXFLAGS) -o host-workload host-workload.cpp
clean:
rm -f host-workload
B.4.1.2 labs/4/4.1-vtune/step-00-xeon/host-workload.cpp
10 #include <stdio.h>
11 #include <omp.h>
12
13 long MyCalculation(const int n) {
14
15 long sum = 0;
16
17 #pragma omp parallel for
18 for (long i = 0; i < n; i++){
19
22 sum = sum + i; // only one thread at a time can execute this section
23
24 }
25
26 return sum;
27 }
28
29 int main(){
30
31 const long n = 1L<<20L;
32 for (int trial = 0; trial < 5; trial++) {
33
34 const double t0 = omp_get_wtime();
35 const long sum = MyCalculation(n);
36 const double t1 = omp_get_wtime();
37
38 printf("sum = %ld (must be %ld)\n", sum, ((n-1L)*n)/2L);
39 printf("Run time: %.3f seconds\n", t1-t0);
40 fflush(0);
41
42 }
43 }
B.4.1.3 labs/4/4.1-vtune/step-01-offload/Makefile
CXX = icpc
CXXFLAGS = -openmp -g -O3
all:
$(CXX) $(CXXFLAGS) -o offload-workload offload-workload.cpp
clean:
rm -f offload-workload
B.4.1.4 labs/4/4.1-vtune/step-01-offload/offload-workload.cpp
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 __attribute__((target(mic)))
14 long MyCalculation(const int n) {
15
16 long sum = 0;
17
20
21 // A terrible way to do reduction
24
25 }
26
27 return sum;
28 }
29
30 int main(){
31
32 printf("Please be patient, it may take ~20 seconds before you get output...\n");
33
34 const long n = 1L<<20L;
35 for (int trial = 0; trial < 5; trial++) {
36
37 const double t0 = omp_get_wtime();
38
39 long sum;
40 #pragma offload target(mic)
41 {
42 sum = MyCalculation(n);
43 }
44
45 const double t1 = omp_get_wtime();
46
47 printf("sum = %ld (must be %ld)\n", sum, ((n-1L)*n)/2L);
B.4.1.5 labs/4/4.1-vtune/step-02-native/Makefile
CXX = icpc
CXXFLAGS = -openmp -g -O3 -mmic
all:
$(CXX) $(CXXFLAGS) -o native-workload native-workload.cpp
clean:
rm -f native-workload
B.4.1.6 labs/4/4.1-vtune/step-02-native/native-workload.cpp
1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file native-workload.cpp, located at 4/4.1-vtune/step-02-native
3 is a part of the practical supplement to the handbook
8
9
10 #include <stdio.h>
11 #include <omp.h>
12
14
15 long sum = 0;
16
17 #pragma omp parallel for
18 for (long i = 0; i < n; i++){
19
20 // A terrible way to do reduction
21 #pragma omp critical
22 sum = sum + i; // only one thread at a time can execute this section
23
24 }
25
26 return sum;
27 }
28
29 int main(){
30
31 printf("Please be patient, it may take ~20 seconds before you get output...\n");
32 fflush(0);
33
34 const long n = 1L<<20L;
35 for (int trial = 0; trial < 5; trial++) {
36
B.4.2 Using Intel Trace Analyzer and Collector
Back to Lab A.4.2.
B.4.2.1 labs/4/4.2-itac/step-00/Makefile
CXX = mpiicpc
CXXFLAGS = -mkl -vec-report3 -openmp
OBJECTS = mpi-pi.o
MICOBJECTS = mpi-pi.oMIC
.cpp.o:
.cpp.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
runme-mpi: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme-mpi $(OBJECTS)
runme-mpi.MIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runme-mpi.MIC $(MICOBJECTS)
clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme-mpi runme-mpi.MIC
B.4.2.2 labs/4/4.2-itac/step-00/mpi-pi.cpp
19 #define BRNG VSL_BRNG_MT19937
20 #define METHOD 0
21 #define ALIGNED __attribute__((align(64)))
22
23
24 const int trials = 2; // How many times to compute pi
25 const long totalIterations=1L<<32L; // How many random number pairs to generate
26 // for the calculation of pi
27 const long blockSize = 1<<12; // A block of this many numbers is processes with SIMD
29
30
31 void PerformCalculationOfPi(const int begin, const int end, long & dUnderCurveComputed) {
32
33
36 long dUnderCurve = 0;
60 dUnderCurve += localCount;
61
62 }
63
64 dUnderCurveComputed += dUnderCurve;
65 }
66
67
68 int main(int argc, char *argv[]){
69
70 assert(totalIterations%blockSize == 0);
71
72 // Who am I in the MPI world
73 int rank, ranks, namelen;
74 MPI_Status stat;
75 MPI_Init(&argc, &argv);
76 MPI_Comm_size(MPI_COMM_WORLD, &ranks);
77 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
78
79 // Scheduling policy
80 const float portion = 0.5;
81
82 double totalTime = 0.0; // Timing statistics
83 double totalWaitTime = 0.0; // Timing statistics
84 long communication[4] = {0L}; // Number of MPI messages
85
87
89
90 const double start_time = MPI_Wtime();
91
94
95 if (rank == 0) {
96
97 // "Boss"
98
99 long iter = portion*blocks/(ranks-1);
100 long i = 0;
101 while (i < blocks){
102 // Assign work to workers
103 for(long r = 0; r < ranks - 1; r++){
104 MPI_Recv(&msgInt, 2, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
105 MPI_COMM_WORLD, &stat);
106 workerRank = msgInt[0];
107 nthreads = msgInt[1];
108 communication[workerRank] ++;
109
110 long begin = i;
111 i += iter;
112 long end = i;
113 if (begin>blocks) begin = blocks;
114 if (end>blocks) end = blocks;
115
116 msgLong[0] = begin;
117 msgLong[1] = end;
118 MPI_Send(&msgLong, 2, MPI_LONG, workerRank, workerRank, MPI_COMM_WORLD);
119 }
120 iter *= portion;
121 if (iter < 32*nthreads) iter = 32*nthreads;
122 }
123 // Tell workers to stop
124 for(int i = 1; i < ranks; i++) {
125 MPI_Recv(&msgInt, 2, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
126 MPI_COMM_WORLD, &stat);
127 workerRank = msgInt[0];
128 nthreads = msgInt[1];
129 msgLong[0] = -1L;
130 MPI_Send(&msgLong, 2, MPI_LONG, workerRank, workerRank, MPI_COMM_WORLD);
131 }
132
133 } else {
134
135 // "Worker"
136
137 // Pi calculation counters
138 long begin=0, end;
139 int nthreads = omp_get_max_threads();
140 printf("Rank %d uses %d thread%s.\n", rank, nthreads, (nthreads==1?"":"s"));
141 while(begin>=0){
142
143 // Ask boss for work
144 msgInt[0] = rank;
145 msgInt[1] = nthreads;
146 MPI_Send(&msgInt, 2, MPI_INT, 0, rank, MPI_COMM_WORLD);
147
148 MPI_Recv(&msgLong, 2, MPI_LONG, 0, rank, MPI_COMM_WORLD, &stat);
149 begin = msgLong[0];
151
152 if (begin>=0){
153
156 }
157 }
158 }
160 double stopTime = 0;
161
162 // Get results from all MPI processes using reduction
163 MPI_Reduce(&dUnderCurve, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
164 MPI_Barrier(MPI_COMM_WORLD);
165
166 // Get timing statistics from all MPI processes
167 cTime = MPI_Wtime() - cTime;
168 MPI_Reduce(&cTime, &stopTime, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
169
170 if (rank == 0){
171 // Output results
172 const double pi = (double)UnderCurveSum / (double) totalIterations * 4.0 ;
173 cTime = MPI_Wtime();
174 const double workTime = cTime-start_time;
175 const double pi_exact=3.141592653589793;
176 printf ("pi = %10.9f, rel. error = %12.9f, time = %8.6fs, load unbalance time = \
177 %8.6fs\n", pi, (pi-pi_exact)/pi_exact, workTime, stopTime);
178 switch (t){
179 case 0 : break;
180 case trials-1 : totalTime += workTime;
181 totalWaitTime += stopTime;
182 printf("Average time (s): %f\n", totalTime/(trials-1));break;
183 printf("%.4f\t%f\t%f\t", portion, totalTime/(trials-1),
184 totalWaitTime/(trials-1));
185 for(int i=1; i<ranks; i++) printf("%ld\t", communication[i]);
186 printf("\n");
187 break;
188 default: totalTime += workTime;
189 totalWaitTime += stopTime; break;
190 }
191 }
192 MPI_Barrier(MPI_COMM_WORLD);
193 }
194
195 MPI_Finalize();
196 return 0;
197 }
B.4.3.1 labs/4/4.3-serial-optimization/step-00/Makefile
CXX = icpc
CXXFLAGS = -vec-report3 -openmp
.cpp.o:
.cpp.oMIC:
all: runme
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
mic: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runme $(MICOBJECTS)
micnativeloadex ./runme
clean:
rm -f *.o* runme
B.4.3.2 labs/4/4.3-serial-optimization/step-00/main.cpp
28
29 fOut[i] = myerff(fIn[i]);
30 const double stop = omp_get_wtime();
31
32 double err = 0.0;
33 for (int i = 0; i < lTotal; i++) {
34 const float dif = fOut[i] - erff(fIn[i]);
35 err += dif*dif;
36 }
37 err = sqrt(err/(double)lTotal);
38
43 fflush(0);
44 }
45 _mm_free(fIn);
46 _mm_free(fOut);
47 }
B.4.3.3 labs/4/4.3-serial-optimization/step-00/erf.cpp
B.4.3.4 labs/4/4.3-serial-optimization/step-01/erf.cpp
1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file erf.cpp, located at 4/4.3-serial-optimization/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
9
10 #include <math.h>
11
13
16
17 const float p = 0.3275911;
B.4.3.5 labs/4/4.3-serial-optimization/step-02/erf.cpp
28
Back to Lab A.4.3.
B.4.3.6 labs/4/4.3-serial-optimization/step-03/erf.cpp
9
10 #include <math.h>
11
12 __attribute__((vector)) float myerff(const float inx){
13
14 //const float x = fabsf(inx);
15 float x = (inx < 0.0f ? -inx : inx);
16
17 const float p = 0.3275911f;
18 const float t1 = 1.0f/(1.0f+p*x);
19 const float t2 = t1*t1;
20 const float t3 = t2*t1;
21 const float t4 = t3*t1;
22 const float t5 = t4*t1;
23
24 float res = 1.0f - (0.254829592f*t1 - 0.284496736f*t2 + 1.421413741f*t3 -
25 1.453152027f*t4 + 1.061405429f*t5) * exp(-x*x);
26
27 return (inx<0.0f ? -res : res);
28 }
B.4.3.7 labs/4/4.3-serial-optimization/step-04/erf.cpp
20 const float t3 = t2*t1;
21 const float t4 = t3*t1;
22 const float t5 = t4*t1;
23
29 }
B.4.3.8 labs/4/4.3-serial-optimization/step-05/erf.cpp
B.4.3.9 labs/4/4.3-serial-optimization/step-0p/main.cpp
2 file main.cpp, located at 4/4.3-serial-optimization/step-0p
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12 #include <math.h>
13
14 __attribute__((vector)) float myerff(const float);
15
16
B.4.4.1 labs/4/4.4-vectorization-data-structure/step-00/Makefile
CXX = icpc
CXXFLAGS = -openmp -vec-report3
OBJECTS = main.o
MICOBJECTS = main.oMIC
.SUFFIXES: .o .cc .oMIC
.cc.o:
.cc.oMIC:
runme: $(OBJECTS)
runmeMIC: $(MICOBJECTS)
clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC
B.4.4.2 labs/4/4.4-vectorization-data-structure/step-00/main.cc
13 #include <math.h>
14
15 struct Charge { // Elegant, but ineffective data layout
16 float x, y, z, q; // Coordinates and value of this charge
17 };
18
19 // This version performs poorly, because data layout of class Charge
20 // does not allow efficient vectorization
21 void calculate_electric_potential(
22 const int m, // Number of charges
23 const Charge* chg, // Charge distribution (array of classes)
24 const float Rx, const float Ry, const float Rz, // Observation point
25 float & phi // Output: electric potential
26 ) {
27 phi=0.0f;
28 for (int i=0; i<m; i++) { // This loop will be auto-vectorized
29 // Non-unit stride: (&chg[i+1].x - &chg[i].x) == sizeof(Charge)
30 const float dx=chg[i].x - Rx;
31 const float dy=chg[i].y - Ry;
32 const float dz=chg[i].z - Rz;
33 phi -= chg[i].q / sqrtf(dx*dx+dy*dy+dz*dz); // Coulomb’s law
34 }
35 }
36
37 int main(int argv, char* argc[]){
38 const size_t n=1<<11;
39 const size_t m=1<<11;
40 const int nTrials=10;
41
42 Charge chg[m];
43 float* potential = (float*) malloc(sizeof(float)*n*n);
44
47 chg[i].x = (float)rand()/(float)RAND_MAX;
48 chg[i].y = (float)rand()/(float)RAND_MAX;
49 chg[i].z = (float)rand()/(float)RAND_MAX;
50 chg[i].q = (float)rand()/(float)RAND_MAX;
51 }
52 printf("Initialization complete.\n");
53
54 for (int t=0; t<nTrials; t++){
55 potential[0:n*n]=0.0f;
56 const double t0 = omp_get_wtime();
57 #pragma omp parallel for schedule(dynamic)
58 for (int j = 0; j < n*n; j++) {
59 const float Rx = (float)(j % n);
60 const float Ry = (float)(j / n);
61 const float Rz = 0.0f;
62 calculate_electric_potential(m, chg, Rx, Ry, Rz, potential[j]);
63 }
64 const double t1 = omp_get_wtime();
65
66 if ( t>= 2) { // First two iterations are slow on Xeon Phi; exclude them
67 printf("time: %.6f\n", t1-t0);
68 }
69 }
70 free(potential);
71 }
B.4.4.3 labs/4/4.4-vectorization-data-structure/step-01/main.cc
19 float * y; // ...y-coordinates...
20 float * z; // ...etc.
21 float * q; // These arrays are allocated in the constructor
22 };
23
24 // This version vectorizes better thanks to unit-stride data access
25 void calculate_electric_potential(
26 const int m, // Number of charges
28 const float Rx, const float Ry, const float Rz, // Observation point
30 ) {
31 phi=0.0f;
33
60 printf("Initialization complete.\n");
61
62 for (int t=0; t<nTrials; t++){
63 potential[0:n*n]=0.0f;
64 const double t0 = omp_get_wtime();
65 #pragma omp parallel for schedule(dynamic)
66 for (int j = 0; j < n*n; j++) {
67 const float Rx = (float)(j % n);
68 const float Ry = (float)(j / n);
69 const float Rz = 0.0f;
70 calculate_electric_potential(m, chg, Rx, Ry, Rz, potential[j]);
71 }
72 const double t1 = omp_get_wtime();
73
74 if ( t>= 2) { // First two iterations are slow on Xeon Phi; exclude them
75 printf("time: %.6f\n", t1-t0);
76 }
77 }
78 free(chg.x);
79 free(chg.y);
80 free(chg.z);
81 free(potential);
82 }
Back to Lab A.4.4.

Note: In this lab, Makefile for step 01 is identical to the Makefile for step 00. However, Makefile for step 02 is different.
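The transition from step 00 to steps 01 and 02 is a data-layout transformation: the array of Charge structures (array of structures) is replaced by a structure of arrays, so that the loop over charges reads each coordinate with unit stride. Because the step-01 listing above is excerpted, the following sketch reconstructs the idea; the member names follow the excerpt, but the exact qualifiers and the constructor mentioned in the comment may differ in the lab source:

// Array of structures (step 00): the stride between chg[i].x and chg[i+1].x
// is sizeof(Charge), which prevents efficient unit-stride vector loads.
struct Charge {
  float x, y, z, q; // Coordinates and value of this charge
};

// Structure of arrays (steps 01 and 02): each component occupies its own
// contiguous array, so x[i], y[i], z[i] and q[i] are accessed with unit stride.
struct Charge_Distribution {
  int m;      // Number of charges
  float * x;  // Array of x-coordinates...
  float * y;  // ...y-coordinates...
  float * z;  // ...etc.
  float * q;
};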
B.4.4.4 labs/4/4.4-vectorization-data-structure/step-02/Makefile
CXX = icpc
OBJECTS = main.o
MICOBJECTS = main.oMIC
.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)
clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC
B.4.4.5 labs/4/4.4-vectorization-data-structure/step-02/main.cc
19 float * z;
20 float * q;
21 };
22
23 void calculate_electric_potential(
24 const int m,
25 const Charge_Distribution & chg,
26 const float Rx, const float Ry, const float Rz,
28 ) {
29 phi=0.0f;
33
36 }
37
38
39 int main(int argv, char* argc[]){
40 const size_t n=1<<11;
41 const size_t m=1<<11;
42 const int nTrials=10;
43
44 Charge_Distribution chg = { .m = m };
45 chg.x = (float*)malloc(sizeof(float)*m);
46 chg.y = (float*)malloc(sizeof(float)*m);
47 chg.z = (float*)malloc(sizeof(float)*m);
48 chg.q = (float*)malloc(sizeof(float)*m);
49 float* potential = (float*) malloc(sizeof(float)*n*n);
50
51 for (size_t i=0; i<n; i++) {
52 chg.x[i] = (float)rand()/(float)RAND_MAX;
53 chg.y[i] = (float)rand()/(float)RAND_MAX;
54 chg.z[i] = (float)rand()/(float)RAND_MAX;
55 chg.q[i] = (float)rand()/(float)RAND_MAX;
56 }
57 printf("Initialization complete.\n");
58
59 for (int t=0; t<nTrials; t++){
60 potential[0:n*n]=0.0f;
61 const double t0 = omp_get_wtime();
62 #pragma omp parallel for schedule(dynamic)
63 for (int j = 0; j < n*n; j++) {
64 const float Rx = (float)(j % n);
65 const float Ry = (float)(j / n);
66 const float Rz = 0.0f;
67 calculate_electric_potential(m, chg, Rx, Ry, Rz, potential[j]);
68 }
69 const double t1 = omp_get_wtime();
70
71 if ( t>= 2) {
72 printf("time: %.6f\n", t1-t0);
73 }
74 }
75 free(chg.x);
76 free(chg.y);
77 free(chg.z);
78 free(potential);
79 }
Back to Lab A.4.4.
B.4.5 Vector Optimization: Assisting the Compiler
Back to Lab A.4.5.
B.4.5.1 labs/4/4.5-vectorization-compiler-hints/step-00/Makefile
CXX = icpc
.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)
clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC
B.4.5.2 labs/4/4.5-vectorization-compiler-hints/step-00/main.cc
22
23 double err = 0.0;
24 for (int i = 0; i < m; i++)
25 err += (C[i]-B[i])*(C[i]-B[i])/(double)m;
26
27 if (m*n<=256L) {
30 printf("(");
31 for (int j = 0; j < n; j++)
33 printf(") ");
38 }
39 }
40
41 return sqrt(err);
42 }
43
44 void FillSparseMatrix(const int m, const int n, const int nb, const int bl,
45 float* const M) {
46 M[0:n*m] = 0.0f;
47 for (int b = 0; b < nb; b++) {
48 // Initializing a random sparse matrix
49 int i = rand()%m;
50 int blockStart = rand()%n;
51 // This expression gives a probability distribution that peaks at bl
52 int blockLength = 1 + (int)( (8.0f * (0.125f +
53 powf((float)rand()/(float)RAND_MAX - 0.5f, 3)))*(float)bl);
54 if (blockStart + blockLength > m-1)
55 blockLength = m-1 - blockStart;
56 for (int j = blockStart; j < blockStart + blockLength; j++)
57 M[i*n + j] = (float)rand()/(float)RAND_MAX;
58 }
59 }
60
61 void TestSparseMatrixTimesVector (const int nTrials, const int skip_it, const int m,
62 const int n, const float* M, PackedSparseMatrix* pM, float* A, float* B, float* C) {
63
64 double tAvg=0.0, dt=0.0;
65 for (int t=0; t<nTrials; t++){
66 for (int i = 0; i < n; i++)
67 A[i] = (float)rand()/(float)RAND_MAX;
68 const double t0 = omp_get_wtime();
69 pM->MultiplyByVector(A, B);
70 const double t1 = omp_get_wtime();
71 if (t==0) {
72 printf("iteration %d: time=%.2f ms, error=%f\n",
73 t, (t1-t0)*1e3, CheckResult(M, A, B, C, m, n));
74 } else {
75 printf("iteration %d: time=%.2f ms\n",
76 t, (t1-t0)*1e3);
77 }
78 if (t>=skip_it) {
79 tAvg += (t1-t0);
80 dt += (t1-t0)*(t1-t0);
81 }
82 }
83 tAvg /= (double)(nTrials-skip_it);
84 dt /= (double)(nTrials-skip_it);
85 dt = sqrt(dt-tAvg*tAvg);
86 printf("Average: %.2f +- %.2f ms per iteration.\n", tAvg*1e3, dt*1e3);
87 fflush(0);
88
89 }
90
91 int main(int argv, char* argc[]){
92
93 // Generating output to illustrate the algorithm
94 printf("Demonstration:\n");
101
102 free(M); free(A); free(B); free(C);
103
104 printf("\nPreparing for a benchmark...\n"); fflush(0);
105 const size_t n=20000;
106 const size_t m=20000;
107 const int nTrials=50;
108 M = (float*)malloc(sizeof(float)*n*m);
109 A = (float*)malloc(sizeof(float)*n);
110 B = (float*)malloc(sizeof(float)*n);
111 C = (float*)malloc(sizeof(float)*n);
112 FillSparseMatrix(m, n, (n*m/1000), 100, M);
113 PackedSparseMatrix pM2(m, n, M, true);
114
115 printf("\nBenchmark:\n"); fflush(stdout);
116 const int skip_it = 10; // First few iterations on the coprocessor are warm-up; skip them
117 TestSparseMatrixTimesVector(nTrials, skip_it, m, n, M, &pM2, A, B, C);
118
119 free(M); free(A); free(B); free(C);
120 }
B.4.5.3 labs/4/4.5-vectorization-compiler-hints/step-00/worker.h
1 #ifndef __INCLUDE_WORKER_H__
2 #define __INCLUDE_WORKER_H__
3
4 class PackedSparseMatrix {
5 const int nRows; // Number of matrix rows
6 const int nCols; // Number of matrix columns
7 int nBlocks; // Number of contiguous non-zero blocks
8
9 float* packedData; // Non-zero elements of the matrix in packed form
10 int* blocksInRow; // The number of non-zero blocks in the respective row
11 int* blockFirstIdxInRow; // The index of the first non-zero blocks in the respective row
12 int* blockOffset; // Indices in the packedData array of the respective blocks
13 int* blockLen; // Lengths of the respective blocks
14 int* blockCol; // The column number of the first element in the respective block
15
16 public:
17
18 PackedSparseMatrix(const int m, const int n, const float* M, const bool verbose);
19 ~PackedSparseMatrix();
20 void MultiplyByVector(const float* inVector, float* outVector);
21
22 };
23
24 #endif
B.4.5.4 labs/4/4.5-vectorization-compiler-hints/step-00/worker.cc
30 if (inBlock) inBlock=false;
31 }
32 j++;
33 }
34 }
35
36 // Allocating data for packed storage
37 packedData = (float*)malloc(sizeof(float)*nData);
38 blocksInRow = (int*) malloc(sizeof(float)*nRows);
39 blockFirstIdxInRow = (int*) malloc(sizeof(float)*nRows);
40 blockOffset = (int*) malloc(sizeof(float)*nBlocks);
41 blockLen = (int*) malloc(sizeof(float)*nBlocks);
42 blockCol = (int*) malloc(sizeof(float)*nBlocks);
43
44 int pos = 0;
45 int idx = -1;
46 for (int i = 0; i < nRows; i++) {
47 blocksInRow[i] = 0;
48 int j = 0;
49 bool inBlock = false;
50 bool firstBlock = true;
51 while (j < nCols) {
52 if (M[i*nCols + j] != 0) {
53 if (!inBlock) {
54 // Begin block
55 idx++;
56 inBlock=true;
57 blocksInRow[i]++;
58 if (firstBlock) {
59 blockFirstIdxInRow[i] = idx;
60 firstBlock = false;
61 }
62 blockOffset[idx] = pos;
63 blockLen[idx] = 1;
64 blockCol[idx] = j;
65 } else {
66 // Continue block
67 blockLen[idx]++;
68 }
69 packedData[pos++] = M[i*nCols + j];
70 } else {
71 // End block
72 if (inBlock)
73 inBlock=false;
74 }
75 // Continue parsing
76 j++;
77 }
78 }
79
80 if (verbose) {
81 printf("Results of packing a sparse %d x %d matrix:\nContains %d non-zero blocks,\
82 a total of %d non-zero elements.\n", nRows, nCols, nBlocks, nData);
83 printf("Average number of non-zero blocks per row: %d\n",
84 (int)((float)nBlocks/(float)nRows));
85 printf("Average length of non-zero blocks: %d\n", (int)((float)(nData)/(float)(nBlocks)));
86 printf("Matrix fill factor: %.2f%%\n", (float)nData/(float)(nRows*nCols)*100.0f);
87 }
88
89 }
90
91 PackedSparseMatrix::~PackedSparseMatrix() {
92 free(packedData);
93 free(blocksInRow);
94 free(blockFirstIdxInRow);
95 free(blockOffset);
96 free(blockLen);
97 free(blockCol);
98 }
99
100 void PackedSparseMatrix::MultiplyByVector(const float* inVector, float* outVector) {
101 #pragma omp parallel for schedule(dynamic,30)
102 for (int i = 0; i < nRows; i++) {
103 outVector[i] = 0.0f;
104 for (int nb = 0; nb < blocksInRow[i]; nb++) {
105 const int idx = blockFirstIdxInRow[i]+nb;
106 const int offs = blockOffset[idx];
107 const int j0 = blockCol[idx];
108 // Variable sum is needed for more efficient automatic vectorization of reduction.
109 float sum = 0.0f;
110 for (int c = 0; c < blockLen[idx]; c++) {
111 sum += packedData[offs+c]*inVector[j0+c];
112 }
outVector[i] += sum;
g
113
an
114 }
W
115 }
116 }

Note: in this lab, between steps 00 and 01, only the file worker.cc is changed.
B.4.5.5 labs/4/4.5-vectorization-compiler-hints/step-01/worker.cc
30 if (inBlock) inBlock=false;
31 }
32 j++;
33 }
34 }
35
36 // Allocating data for packed storage
37 packedData = (float*)malloc(sizeof(float)*nData);
38 blocksInRow = (int*) malloc(sizeof(float)*nRows);
39 blockFirstIdxInRow = (int*) malloc(sizeof(float)*nRows);
40 blockOffset = (int*) malloc(sizeof(float)*nBlocks);
41 blockLen = (int*) malloc(sizeof(float)*nBlocks);
42 blockCol = (int*) malloc(sizeof(float)*nBlocks);
43
44 int pos = 0;
45 int idx = -1;
46 for (int i = 0; i < nRows; i++) {
47 blocksInRow[i] = 0;
48 int j = 0;
49 bool inBlock = false;
50 bool firstBlock = true;
51 while (j < nCols) {
52 if (M[i*nCols + j] != 0) {
53 if (!inBlock) {
54 // Begin block
55 idx++;
56 inBlock=true;
57 blocksInRow[i]++;
58 if (firstBlock) {
59 blockFirstIdxInRow[i] = idx;
60 firstBlock = false;
61 }
62 blockOffset[idx] = pos;
63 blockLen[idx] = 1;
64 blockCol[idx] = j;
65 } else {
66 // Continue block
67 blockLen[idx]++;
68 }
69 packedData[pos++] = M[i*nCols + j];
70 } else {
71 // End block
72 if (inBlock)
73 inBlock=false;
74 }
75 // Continue parsing
76 j++;
77 }
78 }
79
80 if (verbose) {
81 printf("Results of packing a sparse %d x %d matrix:\nContains %d non-zero blocks,\
82 a total of %d non-zero elements.\n", nRows, nCols, nBlocks, nData);
83 printf("Average number of non-zero blocks per row: %d\n",
84 (int)((float)nBlocks/(float)nRows));
85 printf("Average length of non-zero blocks: %d\n", (int)((float)(nData)/(float)(nBlocks)));
86 printf("Matrix fill factor: %.2f%%\n", (float)nData/(float)(nRows*nCols)*100.0f);
87 }
88
89 }
90
91 PackedSparseMatrix::~PackedSparseMatrix() {
92 free(packedData);
93 free(blocksInRow);
94 free(blockFirstIdxInRow);
95 free(blockOffset);
96 free(blockLen);
97 free(blockCol);
98 }
99
100 void PackedSparseMatrix::MultiplyByVector(const float* inVector, float* outVector) {
101 #pragma omp parallel for schedule(dynamic,30)
102 for (int i = 0; i < nRows; i++) {
103 outVector[i] = 0.0f;
104 for (int nb = 0; nb < blocksInRow[i]; nb++) {
105 const int idx = blockFirstIdxInRow[i]+nb;
106 const int offs = blockOffset[idx];
107 const int j0 = blockCol[idx];
108 // Variable sum is needed for more efficient automatic vectorization of reduction.
109 float sum = 0.0f;
110 // Pragma loop count assists the application at runtime in choosing
111 // the optimal execution path. It only leads to an increase in performance
112 // when the actual loop count in the problem is in agreement
113 // with this compile-time prediction.
114 #pragma loop_count avg(100)
115 for (int c = 0; c < blockLen[idx]; c++) {
116 sum += packedData[offs+c]*inVector[j0+c];
117 }
118 outVector[i] += sum;
119 }
120 }
121 }
Note: in this lab, between steps 01 and 02, files main.cc, worker.h and worker.cc are changed.
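The step-02 changes revolve around data alignment: buffers are allocated on 64-byte boundaries with _mm_malloc, block lengths are padded to multiples of ALIGN_FLOATS, and the inner loop carries #pragma vector aligned. The following minimal sketch shows the allocation-plus-pragma idiom in isolation; it is not the lab's code, and the array names and size are arbitrary:

#include <malloc.h> // _mm_malloc / _mm_free

#define ALIGN_BYTES 64 // cache line size and vector register width on the coprocessor

int main() {
  const int n = 1 << 20;
  // Allocate on a 64-byte boundary so that aligned vector loads are legal
  float* a = (float*) _mm_malloc(sizeof(float)*n, ALIGN_BYTES);
  float* b = (float*) _mm_malloc(sizeof(float)*n, ALIGN_BYTES);
  a[0:n] = 1.0f; // Intel Cilk Plus array notation, as used throughout the labs
  b[0:n] = 2.0f;

  // The pragma promises the compiler that the arrays accessed in the first
  // iteration are aligned, so it can omit the runtime alignment check and peel loop.
#pragma vector aligned
  for (int i = 0; i < n; i++)
    a[i] += b[i];

  _mm_free(a);
  _mm_free(b);
  return 0;
}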
B.4.5.6 labs/4/4.5-vectorization-compiler-hints/step-02/main.cc
25 err += (C[i]-B[i])*(C[i]-B[i])/(double)m;
26
27 if (m*n<=256L) {
28 // For small matrix, output elements on the screen
29 for (int i = 0; i < m; i++) {
30 printf("(");
31 for (int j = 0; j < n; j++)
32 printf(" %5.3f", M[i*n+j]);
33 printf(") ");
34 if (i == m/2) { printf("x"); } else { printf(" "); }
35 printf(" (%5.3f)", A[i]);
36 if (i == m/2) { printf(" ="); } else { printf(" "); }
37 printf(" (%5.3f) (correct=%5.3f)\n", B[i], C[i]);
38 }
39 }
40
41 return sqrt(err);
42 }
43
44 void FillSparseMatrix(const int m, const int n, const int nb, const int bl,
45 float* const M) {
46 M[0:n*m] = 0.0f;
47 for (int b = 0; b < nb; b++) {
48 // Initializing a random sparse matrix
49 int i = rand()%m;
50 int blockStart = rand()%n;
51 // This expression gives a probability distribution that peaks at bl
52 int blockLength = 1 + (int)( (8.0f * (0.125f +
53 powf((float)rand()/(float)RAND_MAX - 0.5f, 3)))*(float)bl);
54 if (blockStart + blockLength > m-1)
55 blockLength = m-1 - blockStart;
56 for (int j = blockStart; j < blockStart + blockLength; j++)
57 M[i*n + j] = (float)rand()/(float)RAND_MAX;
58 }
59 }
60
61 void TestSparseMatrixTimesVector (const int nTrials, const int skip_it, const int m,
62 const int n, const float* M, PackedSparseMatrix* pM, float* A, float* B, float* C) {
63
64 double tAvg=0.0, dt=0.0;
65 for (int t=0; t<nTrials; t++){
66 for (int i = 0; i < n; i++)
67 A[i] = (float)rand()/(float)RAND_MAX;
68 const double t0 = omp_get_wtime();
69 pM->MultiplyByVector(A, B);
70 const double t1 = omp_get_wtime();
71 if (t==0) {
72 printf("iteration %d: time=%.2f ms, error=%f\n",
73 t, (t1-t0)*1e3, CheckResult(M, A, B, C, m, n));
74 } else {
75 printf("iteration %d: time=%.2f ms\n",
76 t, (t1-t0)*1e3);
77 }
78 if (t>=skip_it) {
79 tAvg += (t1-t0);
80 dt += (t1-t0)*(t1-t0);
81 }
82 }
83 tAvg /= (double)(nTrials-skip_it);
84 dt /= (double)(nTrials-skip_it);
85 dt = sqrt(dt-tAvg*tAvg);
86 printf("Average: %.2f +- %.2f ms per iteration.\n", tAvg*1e3, dt*1e3);
87 fflush(0);
88
89 }
90
91 int main(int argv, char* argc[]){
92
93 // Generating output to illustrate the algorithm
94 printf("Demonstration:\n");
95 float* M = (float*) _mm_malloc(sizeof(float)*16*16, ALIGN_BYTES);
96 float* A = (float*) _mm_malloc(sizeof(float)*16, ALIGN_BYTES);
97 float* B = (float*) malloc(sizeof(float)*16);
98 float* C = (float*) malloc(sizeof(float)*16);
99 FillSparseMatrix(16, 16, 10, 3, M);
100 PackedSparseMatrix pM(16, 16, M, true);
101 TestSparseMatrixTimesVector(1, 0, 16, 16, M, &pM, A, B, C);
102 _mm_free(M); _mm_free(A); free(B); free(C);
103
104 printf("\nPreparing for a benchmark...\n"); fflush(0);
105 const size_t n=20000;
106 const size_t m=20000;
107 const int nTrials=50;
108 M = (float*)_mm_malloc(sizeof(float)*n*m, ALIGN_BYTES);
109 A = (float*)_mm_malloc(sizeof(float)*n, ALIGN_BYTES);
110 B = (float*)malloc(sizeof(float)*n);
111 C = (float*)malloc(sizeof(float)*n);
112 FillSparseMatrix(m, n, (n*m/1000), 100, M);
114
116 const int skip_it = 10; // First few iterations on the coprocessor are warm-up; skip them
117 TestSparseMatrixTimesVector(nTrials, skip_it, m, n, M, &pM2, A, B, C);
118
120 }
B.4.5.7 labs/4/4.5-vectorization-compiler-hints/step-02/worker.h
1 #ifndef __INCLUDE_WORKER_H__
2 #define __INCLUDE_WORKER_H__
3
4 // The size of the cache line and also the size of the vector register on coprocessor
5 #define ALIGN_BYTES 64
6 // The number of 32-bit floats that fit in ALIGN_BYTES
7 #define ALIGN_FLOATS 16
8
9 class PackedSparseMatrix {
10 const int nRows; // Number of matrix rows
11 const int nCols; // Number of matrix columns
12 int nBlocks; // Number of contiguous non-zero blocks
13
14 float* packedData; // Non-zero elements of the matrix in packed form
15 int* blocksInRow; // The number of non-zero blocks in the respective row
16 int* blockFirstIdxInRow; // The index of the first non-zero blocks in the respective row
17 int* blockOffset; // Indices in the packedData array of the respective blocks
18 int* blockLen; // Lengths of the respective blocks
19 int* blockCol; // The column number of the first element in the respective block
20
21 public:
22
23 PackedSparseMatrix(const int m, const int n, const float* M, const bool verbose);
24 ~PackedSparseMatrix();
25 void MultiplyByVector(const float* inVector, float* outVector);
26
27 };
28
29 #endif
B.4.5.8 labs/4/4.5-vectorization-compiler-hints/step-02/worker.cc
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include "worker.h"
11 #include <stdlib.h>
12 #include <stdio.h>
13 #include <assert.h>
14
19
20 assert(nCols % ALIGN_FLOATS == 0);
21 nBlocks = 0;
22 int nData = 0;
23 for (int i = 0; i < nRows; i++) {
24 int j = 0;
69
70 inBlock=true;
71 blocksInRow[i]++;
72 if (firstBlock) {
73 blockFirstIdxInRow[i] = idx;
74 firstBlock = false;
75 }
76 blockOffset[idx] = pos;
77 blockLen[idx] = 16;
78 blockCol[idx] = j;
79 } else {
80 // Continue block
81 blockLen[idx] += 16;
82 }
83 for (int jj = j; jj < j+ALIGN_FLOATS; jj++)
85 } else {
86 // End block
87 if (inBlock)
88 inBlock=false;
89 }
90 // Continue parsing
91 j+=16;
92 }
93 }
94
95 if (verbose) {
96 printf("Results of packing a sparse %d x %d matrix:\nContains %d non-zero blocks,\
97 a total of %d non-zero elements.\n",
98 nRows, nCols, nBlocks, nData);
99 printf("Average number of non-zero blocks per row: %d\n",
100 (int)((float)nBlocks/(float)nRows));
101 printf("Average length of non-zero blocks: %d\n",
102 (int)((float)(nData)/(float)(nBlocks)));
103 printf("Matrix fill factor: %f%%\n",
104 (float)nData/(float)(nRows*nCols)*100.0f);
105 }
106
107 }
108
109 PackedSparseMatrix::~PackedSparseMatrix() {
110 _mm_free(packedData);
111 free(blocksInRow);
112 free(blockFirstIdxInRow);
113 free(blockOffset);
114 free(blockLen);
115 free(blockCol);
116 }
117
118 void PackedSparseMatrix::MultiplyByVector(const float* inVector, float* outVector) {
119 #pragma omp parallel for schedule(dynamic,30)
120 for (int i = 0; i < nRows; i++) {
121 outVector[i] = 0.0f;
122 for (int nb = 0; nb < blocksInRow[i]; nb++) {
123 const int idx = blockFirstIdxInRow[i]+nb;
124 const int offs = blockOffset[idx];
125 const int j0 = blockCol[idx];
126 // Variable sum is needed for more efficient automatic vectorization of
127 // reduction.
128 float sum = 0.0f;
129 // Pragma vector aligned makes a promise to the compiler that the elements of
130 // vectorized arrays accessed in the first iteration are aligned on a 64-byte
131 // boundary. Pragma loop count assists the application at runtime in choosing
132 // the optimal execution path. It only leads to an increase in performance
133 // when the actual loop count in the problem is in agreement
134 // with this compile-time prediction.
135 #pragma vector aligned
136 #pragma loop count avg(128) min(16)
137 for (int c = 0; c < blockLen[idx]; c++) {
138 sum += packedData[offs+c]*inVector[j0+c];
139 }
140 outVector[i] += sum;
141 }
142 }
143 }
B.4.6.1 labs/4/4.6-vectorization-branches/step-00/Makefile
CXX = icpc
CXXFLAGS = -openmp -vec-report3 -g -O3
.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)
clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC
B.4.6.2 labs/4/4.6-vectorization-branches/step-00/main.cc
9
10 #include <stdio.h>
11 #include <cstdlib>
12 #include <omp.h>
13 #include <math.h>
14
15 typedef void (*FunctionPtrType)(const int, const int, const int*, float*);
16
17 void MaskedOperations(const int m, const int n, const int* flag, float* data);
18 void NonMaskedOperations(const int m, const int n, const int* flag, float* data);
19
23
50 if (c==0) {
51 flag[0:n] = 0;
52 fp = &MaskedOperations;
53 printf("Masked calculation, all branches not taken, none of the elements \
54 computed:\n");
55 } else if (c==1) {
56 flag[0:n] = 1;
57 fp = &MaskedOperations;
58 printf("Masked calculation, all branches taken, all elements computed:\n");
59 } else if (c==2) {
60 flag[0:n/2:2] = 0;
61 flag[1:n/2:2] = 1;
62 fp = &MaskedOperations;
63 printf("Masked calculation, half of the branches taken, half of the elements\
64 computed (stride 2):\n");
65 } else if (c==3) {
66 for (int k = 0; k < 16; k++)
67 flag[k:n/32:32] = 0;
68 for (int k = 16; k < 32; k++)
69 flag[k:n/32:32] = 1;
70 fp = &MaskedOperations;
71 printf("Masked calculation, half of the branches taken, half of the elements\
72 computed (stride 16):\n");
73 } else if (c==4) {
74 flag[0:n] = 0;
75 fp = &NonMaskedOperations;
76 printf("Unmasked calculation, all elements computed:\n");
77 }
79 }
80
81 _mm_free(data);
82 _mm_free(flag);
83
84 }

B.4.6.3 labs/4/4.6-vectorization-branches/step-00/worker.cc
B.4.6.4 labs/4/4.6-vectorization-branches/step-01/worker.cc
12
13 #pragma omp parallel for schedule(dynamic)
14 for (int i = 0; i < m; i++)
15 #pragma simd
16 for (int j = 0; j < n; j++) {
17 if (flag[j] == 1)
18 data[i*n+j] = sqrtf(data[i*n+j]);
19 }
20 }
21
22 void NonMaskedOperations(const int m, const int n, const int* flag, float* data) {
25 #pragma simd
26 for (int j = 0; j < n; j++) {
27 data[i*n+j] = sqrtf(data[i*n+j]);
28 }
29 }
B.4.7.1 labs/4/4.7-optimize-shared-mutexes/step-00/Makefile
CXX = icpc
CXXFLAGS = -openmp -mkl -vec-report2
.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)
clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC
B.4.7.2 labs/4/4.7-optimize-shared-mutexes/step-00/main.cc
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <cstdlib>
12 #include <omp.h>
13 #include <math.h>
14 #include <mkl_vsl.h>
15
16 void HistogramReference(const float* age, int* const group, const int n,
17 const float group_width){
18
21 group[j]++;
22 }
23 }
24
25 void Histogram(const float* age, int* const group, const int n, const float group_width,
26 const int m);
27
28 int main(int argv, char* argc[]){
29 const size_t n=1L<<30L;
30 const float max_age=99.999f;
31 const float group_width=20.0f;
32 const size_t m = (size_t) floorf(max_age/group_width + 0.1f);
33 const int nTrials=10;
34
35 float* age = (float*) _mm_malloc(sizeof(int)*n, 64);
36 int group[m];
37 int ref_group[m];
38
39 // Initializing array of ages
40 printf("Initialization..."); fflush(0);
41 VSLStreamStatePtr rnStream;
42 vslNewStream( &rnStream, VSL_BRNG_MT19937, 1 );
43 vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, rnStream, n, age, 0.0f, max_age);
44 for (int i = 0; i < n; i++)
45 age[i] = age[i]*age[i]/max_age;
46
47 // Computing the "correct" answer
48 ref_group[:]=0;
49 HistogramReference(age, ref_group, n, group_width);
50 printf("complete.\n"); fflush(0);
51
52 for (int t=0; t<nTrials; t++){
53 group[:] = 0;
54
55 printf("Iteration %d...", t); fflush(0);
56 const double t0 = omp_get_wtime();
57 Histogram(age, group, n, group_width, m);
58 const double t1 = omp_get_wtime();
59
60 for (int i=0; i<m; i++) {
61 if (fabs((double)(ref_group[i]-group[i])) > 1e-4*fabs((double)(ref_group[i]
62 +group[i]))) {
63 printf("Result is incorrect!\n");
64 for (int i=0; i<m; i++) printf(" (%d vs %d)", group[i], ref_group[i]);
65 }
66 }
67 printf(" time: %.3f sec\n", t1-t0);
68
69 printf("Result: ");
70 for (int i=0; i<m; i++) printf("\t%d", group[i]);
71 printf("\n");
72 fflush(0);
73 }
74
75 _mm_free(age);
76 }
B.4.7.3 labs/4/4.7-optimize-shared-mutexes/step-00/worker.cc
B.4.7.4 labs/4/4.7-optimize-shared-mutexes/step-01/worker.cc
22
23 // Vectorize the multiplication and rounding
24 #pragma vector aligned
25 for (int i = ii; i < ii+vecLen; i++)
26 histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
27
28 // Scattered memory access, does not get vectorized
30 hist[histIdx[c]]++;
31 }
32
33 // Finish with the tail of the data (if n is not a multiple of vecLen)
36 }
B.4.7.5 labs/4/4.7-optimize-shared-mutexes/step-02/worker.cc
Back to Lab A.4.7.
B.4.7.6 labs/4/4.7-optimize-shared-mutexes/step-03/worker.cc
8
9
10 void Histogram(const float* age, int* const hist, const int n, const float group_width,
11 const int m) {
12
13 const int vecLen = 16; // Length of vectorized loop (lower is better,
14 // but a multiple of 64/sizeof(int))
15 const float invGroupWidth = 1.0f/group_width; // Pre-compute the reciprocal
16 const int nPrime = n - n%vecLen; // nPrime is a multiple of vecLen
17
18 #pragma omp parallel
19 {
20 // Private variable to hold a copy of histogram in each thread
21 int hist_priv[m];
22 hist_priv[:] = 0;
23
24 // Temporary storage for vecLen indices. Necessary for vectorization
25 int histIdx[vecLen] __attribute__((aligned(64)));
26
27 // Distribute work across threads
28 // Strip-mining the loop in order to vectorize the inner short loop
29 #pragma omp for schedule(guided)
30 for (int ii = 0; ii < nPrime; ii+=vecLen) {
31 // Vectorize the multiplication and rounding
32 #pragma vector aligned
33 for (int i = ii; i < ii+vecLen; i++)
34 histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
35
36 // Scattered memory access, does not get vectorized
37 for (int c = 0; c < vecLen; c++)
38 hist_priv[histIdx[c]]++;
39 }
40
41 // Finish with the tail of the data (if n is not a multiple of vecLen)
42 #pragma omp single
43 for (int i = nPrime; i < n; i++)
44 hist_priv[(int) ( age[i] * invGroupWidth )]++;
45
46 // Reduce private copies into global histogram
47 for (int c = 0; c < m; c++) {
48 // Protect the += operation with the lightweight atomic mutex
49 #pragma omp atomic
50 hist[c] += hist_priv[c];
51 }
52 }
53 }
B.4.7.7 labs/4/4.7-optimize-shared-mutexes/step-04/worker.cc
1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file worker.cc, located at 4/4.7-optimize-shared-mutexes/step-04
3 is a part of the practical supplement to the handbook
8
9
10 #include <omp.h>
11
12 void Histogram(const float* age, int* const hist, const int n, const float group_width,
13 const int m) {
14
15 const int vecLen = 16; // Length of vectorized loop (lower is better,
16 // but a multiple of 64/sizeof(int))
17 const float invGroupWidth = 1.0f/group_width; // Pre-compute the reciprocal
18 const int nPrime = n - n%vecLen; // nPrime is a multiple of vecLen
19 const int nThreads = omp_get_max_threads();
20 // Shared histogram with a private section for each thread
21 int hist_thr[nThreads][m];
22 hist_thr[:][:] = 0;
23
24 // Strip-mining the loop in order to vectorize the inner short loop
25 #pragma omp parallel
26 {
27 // Get the number of this thread
28 const int iThread = omp_get_thread_num();
29 // Temporary storage for vecLen indices. Necessary for vectorization
30 int histIdx[vecLen] __attribute__((aligned(64)));
31
32 #pragma omp for schedule(guided)
33 for (int ii = 0; ii < nPrime; ii+=vecLen) {
34
35 // Vectorize the multiplication and rounding
36 #pragma vector aligned
Back to Lab A.4.7.
B.4.7.8 labs/4/4.7-optimize-shared-mutexes/step-05/worker.cc
1
2 file worker.cc, located at 4/4.7-optimize-shared-mutexes/step-05
9
10 #include <omp.h>
11
12 void Histogram(const float* age, int* const hist, const int n, const float group_width,
Ex
13 const int m) {
14
15 const int vecLen = 16; // Length of vectorized loop (lower is better,
16 // but a multiple of 64/sizeof(int))
17 const float invGroupWidth = 1.0f/group_width; // Pre-compute the reciprocal
18 const int nPrime = n - n%vecLen; // nPrime is a multiple of vecLen
19 const int nThreads = omp_get_max_threads();
20 // Padding for hist_thr[][] in order to avoid a situation
21 // where two (or more) rows share a cache line.
22 const int paddingBytes = 64;
23 const int paddingElements = paddingBytes / sizeof(int);
24 const int mPadded = m + (paddingElements-m%paddingElements);
25 // Shared histogram with a private section for each thread
26 int hist_thr[nThreads][mPadded] __attribute__((aligned(64)));
27 hist_thr[:][:] = 0;
28
29 // Strip-mining the loop in order to vectorize the inner short loop
30 #pragma omp parallel
31 {
32 // Get the number of this thread
33 const int iThread = omp_get_thread_num();
34 // Temporary storage for vecLen indices. Necessary for vectorization
35 int histIdx[vecLen] __attribute__((aligned(64)));
36
37 #pragma omp for schedule(guided)
38 for (int ii = 0; ii < nPrime; ii+=vecLen) {
39
40 // Vectorize the multiplication and rounding
41 #pragma vector aligned
42 for (int i = ii; i < ii+vecLen; i++)
43 histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
44
45 // Scattered memory access, does not get vectorized.
46 // There is no synchronization in this for-loop,
47 // however, false sharing occurs here and ruins the performance
48 for (int c = 0; c < vecLen; c++)
49 hist_thr[iThread][histIdx[c]]++;
50 }
51 }
52
53 // Finish with the tail of the data (if n is not a multiple of vecLen)
54 for (int i = nPrime; i < n; i++)
55 hist[(int) ( age[i] * invGroupWidth )]++;
56
57 // Reducing results from all threads to the common histogram hist
58 for (int iThread = 0; iThread < nThreads; iThread++)
59 hist[0:m] += hist_thr[iThread][0:m];
60
61 }

Back to Lab A.4.7.
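As a quick check of the padding arithmetic in the step-05 listing above: the lab's main.cc yields m = floorf(99.999f/20.0f + 0.1f) = 5 age groups, and with 4-byte int elements paddingElements = 64/4 = 16, so mPadded = 5 + (16 - 5%16) = 16. Each thread's private row of hist_thr therefore occupies exactly one 64-byte cache line, which is what prevents two threads from updating (and repeatedly invalidating) the same line.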
B.4.8.1 labs/4/4.8-optimize-scheduling/step-00/Makefile
CXX = icpc
.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)
clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC
B.4.8.2 labs/4/4.8-optimize-scheduling/step-00/main.cc
17 const double minAccuracy);
18
19 void InitializeMatrix(const int n, double* M) {
20 // "Good" matrix for Jacobi method
22
23 for (int j = 0; j < n; j++) {
24 M[i*n+j] = (double)(i*n+j);
25 sum += M[i*n+j];
26 }
27 sum -= M[i*n+i];
28 M[i*n+i] = 2.0*sum;
29 }
30 }
31
79
80
81 _mm_free(M);
82 _mm_free(x);
83 _mm_free(b);
84 _mm_free(accuracy);
85 }
B.4.8.3 labs/4/4.8-optimize-scheduling/step-00/worker.cc
49
50 for (int j = 0; j < n; j++)
51 bTrial[i] += M[i*n+j]*x[j];
52 }
53 accuracy = RelativeNormOfDifference(n, b, bTrial);
54
56 return iterations;
57 }
B.4.8.4 labs/4/4.8-optimize-scheduling/step-01/main.cc
26 sum += M[i*n+j];
27 }
28 sum -= M[i*n+i];
29 M[i*n+i] = 2.0*sum;
30 }
31 }
32
33 int main(int argv, char* argc[]){
34 printf("Initialization..."); fflush(0);
35 const int n = 128;
36 const int nBVectors = 1<<14; // The number of b-vectors
37 double* M = (double*) _mm_malloc(sizeof(double)*n*n, 64);
38 double* x = (double*) _mm_malloc(sizeof(double)*n*nBVectors, 64);
39 double* b = (double*) _mm_malloc(sizeof(double)*n*nBVectors, 64);
40 double* accuracy = (double*) _mm_malloc(sizeof(double)*nBVectors, 64);
41 InitializeMatrix(n, M);
42 VSLStreamStatePtr rnStream;
43 vslNewStream( &rnStream, VSL_BRNG_MT19937, 1234 );
44 vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, rnStream, n*nBVectors, b, 0.0, 1.0);
45 vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, rnStream, nBVectors, accuracy, 0.0, 1.0);
46 accuracy[0:nBVectors]=exp(-28.0+26.0*accuracy[0:nBVectors]);
47 printf(" initialized %d vectors and a [%d x %d] matrix\n",
48 nBVectors, n, n); fflush(0);
49
50 const int nTrials=10;
51 const int itSkip=1;
52 double tAvg = 0.0;
53 double dtAvg = 0.0;
55 cilk::reducer_opadd<double> itAvg;
56 cilk::reducer_opadd<double> dItAvg;
61 itAvg += (double)it;
62 dItAvg += (double)it*it;
63 }
66
67 mdItAvg = sqrt(mdItAvg - mitAvg*mitAvg);
68 printf(" time: %.3f sec, Jacobi iterations per vector: %.1f +- %.1f\n", t1-t0,
69 mitAvg, mdItAvg);
70 if (t >= itSkip) {
71 tAvg += (t1-t0);
72 dtAvg += (t1-t0)*(t1-t0);
73 }
74 fflush(0);
75 }
76 tAvg /= (double)(nTrials-itSkip);
77 dtAvg /= (double)(nTrials-itSkip);
78 dtAvg = sqrt(dtAvg - tAvg*tAvg);
79 printf("Average: %.3f +- %.3f sec\n\n", tAvg, dtAvg); fflush(0);
80
81 _mm_free(M);
82 _mm_free(x);
83 _mm_free(b);
84 _mm_free(accuracy);
85 }
B.4.8.5 labs/4/4.8-optimize-scheduling/step-02/main.cc
17 const double minAccuracy);
18
19 void InitializeMatrix(const int n, double* M) {
20 // "Good" matrix for Jacobi method
22
23 for (int j = 0; j < n; j++) {
24 M[i*n+j] = (double)(i*n+j);
25 sum += M[i*n+j];
26 }
27 sum -= M[i*n+i];
28 M[i*n+i] = 2.0*sum;
29 }
30 }
31
58 if (iMethod == 0) {
59 printf(" (_Cilk_for)..."); fflush(0);
60 //__cilkrts_set_param("nworkers","50");
61 _Cilk_for (int c = 0; c < nBVectors; c++)
62 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
63 }
64 if (iMethod == 1) {
65 printf(" (no scheduling)..."); fflush(0);
66 #pragma omp parallel for
67 for (int c = 0; c < nBVectors; c++)
68 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
69 }
70 if (iMethod == 2) {
71 printf(" (static, 1)..."); fflush(0);
72 #pragma omp parallel for schedule(static, 1)
73 for (int c = 0; c < nBVectors; c++)
74 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
75 }
76 if (iMethod == 3) {
77 printf(" (static, 4)..."); fflush(0);
78 #pragma omp parallel for schedule(static, 4)
79 for (int c = 0; c < nBVectors; c++)
80 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
81 }
82 if (iMethod == 4) {
83 printf(" (static, 32)..."); fflush(0);
84 #pragma omp parallel for schedule(static, 32)
85 for (int c = 0; c < nBVectors; c++)
87 }
88 if (iMethod == 5) {
89 printf(" (static, 256)..."); fflush(0);
93 }
94 if (iMethod == 6) {
96
97 for (int c = 0; c < nBVectors; c++)
98 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
99 }
100 if (iMethod == 7) {
101 printf(" (dynamic, 4)..."); fflush(0);
102 #pragma omp parallel for schedule(dynamic, 4)
103 for (int c = 0; c < nBVectors; c++)
104 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
105 }
106 if (iMethod == 8) {
107 printf(" (dynamic, 32)..."); fflush(0);
108 #pragma omp parallel for schedule(dynamic, 32)
109 for (int c = 0; c < nBVectors; c++)
110 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
111 }
112 if (iMethod == 9) {
113 printf(" (dynamic, 256)..."); fflush(0);
114 #pragma omp parallel for schedule(dynamic, 256)
115 for (int c = 0; c < nBVectors; c++)
116 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
117 }
118 if (iMethod == 10) {
119 printf(" (guided, 1)..."); fflush(0);
141
142 const double t1 = omp_get_wtime();
143 printf(" time: %.3f sec\n", t1-t0);
144 if (t >= itSkip) {
145 tAvg += (t1-t0);
147 }
148 fflush(0);
149 }
150 tAvg /= (double)(nTrials-itSkip);
154 }
155 _mm_free(M);
156 _mm_free(x);
157 _mm_free(b);
158 _mm_free(accuracy);
159 }
B.4.9.1 labs/4/4.9-insufficient-parallelism/step-00/Makefile
CXX = icpc
CXXFLAGS = -openmp
.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)
clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC
B.4.9.2 labs/4/4.9-insufficient-parallelism/step-00/main.cc
1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file main.cc, located at 4/4.9-insufficient-parallelism/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
9
10 #include <malloc.h>
11 #include <math.h>
12 #include <omp.h>
13 #include <stdio.h>
14
17 int main(){
Back to Lab A.4.9.

B.4.9.3 labs/4/4.9-insufficient-parallelism/step-00/worker.cc
3
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
7
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <omp.h>
11 #include <cstring>
12
14
15 // Distribute rows across threads
16 #pragma omp parallel for
17 for (int i = 0; i < m; i++) {
18 long sum = 0;
19
20 // In each row, use vectorized reduction
21 // to compute the sum of all columns
22 #pragma simd reduction(+: sum)
23 #pragma vector aligned
24 for (int j = 0; j < n; j++)
25 sum += M[i*n+j];
26
27 s[i] = sum;
28
29 }
30
31 strcpy(method, "Unoptimized");
32
33 }
B.4.9.4 labs/4/4.9-insufficient-parallelism/step-01/worker.cc
20 // to compute the sum of all columns
21 #pragma omp parallel for schedule(guided) reduction(+: sum)
22 #pragma simd
23 #pragma vector aligned
24 for (int j = 0; j < n; j++)
25 sum += M[i*n+j];
26
27 s[i] = sum;
28
29 }
30
32
33 }
B.4.9.5 labs/4/4.9-insufficient-parallelism/step-02/worker.cc
20 s_thread[0:m] = 0;
21
22 // Note the absence of "parallel" in #pragma omp for, because it is already
23 // in a parallel region
24 #pragma omp for collapse(2) schedule(guided)
25 #pragma simd
26 #pragma vector aligned
27 for (int i = 0; i < m; i++) // Loop i will be collapsed with loop j
28 for (int j = 0; j < n; j++) // to form a single, greater iteration space
29 s_thread[i] += M[i*n+j];
30
31 // Arrays cannot be declared as reducers in pragma omp,
32 // and so the reduction must be programmed explicitly
33 for (int i = 0; i < m; i++)
34 #pragma omp atomic
35 s[i] += s_thread[i];
36 }
37
38 strcpy(method, "Collapse nested loops");
39
40 }
Back to Lab A.4.9.
B.4.9.6 labs/4/4.9-insufficient-parallelism/step-03/worker.cc
8
9
10 #include <omp.h>
11 #include <assert.h>
12 #include <cstring>
13
14 void SumColumns(const int m, const int n, long* M, long* s, char* method){
15
16 // stripSize works best if it is
17 // (a) a multiple of the SIMD vector length, and
18 // (b) be much greater than the SIMD vector length
19 // (c) much smaller than n
20 const int stripSize = 10000;
21
22 // It is trivial to avoid this limitation by peeling off n%stripSize iterations
23 // at the end of the j-loop, and adding a second loop to process these elements.
24 assert(n % stripSize == 0);
25
26 s[0:m] = 0;
27
28 #pragma omp parallel
29 {
30 long s_thread[m]; // Private reduction array to avoid false sharing
31 s_thread[0:m] = 0;
32
33 // Note the absence of "parallel" in #pragma omp for, because already in a
34 // parallel region
B.4.10 Shared-Memory Optimization: Core Affinity Control
Back to Lab A.4.10.
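The affinity.cpp listings in this lab are excerpted. As a minimal sketch that is not part of the lab sources, the following program makes thread placement visible: it reports the logical CPU on which each OpenMP thread runs, and can be launched under different KMP_AFFINITY settings of the Intel OpenMP runtime (for example, KMP_AFFINITY=compact versus KMP_AFFINITY=scatter) to compare the resulting bindings:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE // for sched_getcpu() on Linux
#endif
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main() {
#pragma omp parallel
  {
    // sched_getcpu() returns the logical CPU that the calling thread
    // is currently executing on.
    printf("Thread %2d of %2d is running on logical CPU %d\n",
           omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
  }
  return 0;
}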
B.4.10.1 labs/4/4.a-affinity/step-00/Makefile
CXX = icpc
CXXFLAGS = -openmp
.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)
clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC
B.4.10.2 labs/4/4.a-affinity/step-00/main.cc
20 long* matrix = (long*)_mm_malloc(sizeof(long)*m*n, 64);
21 long* sums = (long*)_mm_malloc(sizeof(long)*m, 64); // will contain sum of matrix rows
22 char method[100];
23
24 const int nTrials=10;
25 double t=0, dt=0;
26
27 printf("Problem size: %.3f GB, outer dimension: %d, threads: %d\n",
28 (double)(sizeof(long))*(double)(n)*(double)m/(double)(1<<30),
29 m, omp_get_max_threads());
30
31 // Initializing data
32 #pragma omp parallel for
35 matrix[i*n + j] = (long)i;
36
37 // Benchmarking SumColumns(...)
41
42
43 if ( l>= 2) { // First two iterations are slow on Xeon Phi; exclude them
44 t += (t1-t0)/(double)(nTrials-2);
45 dt += (t1-t0)*(t1-t0)/(double)(nTrials-2);
46 }
47
48 // Verifying that the result is correct
49 for (int i = 0; i < m; i++)
50 if (sums[i] != i*n)
51 printf("Results are incorrect!");
52
53 }
54 dt = sqrt(dt-t*t);
55 const float GBps = (double)(sizeof(long)*(size_t)n*(size_t)m)/t*1e-9;
56 printf("%s: %.3f +/- %.3f seconds (%.2f +/- %.2f GB/s)\n",
57 method, t, dt, GBps, GBps*(dt/t));
58
59 _mm_free(sums); _mm_free(matrix);
60 }
B.4.10.3 labs/4/4.a-affinity/step-00/worker.cc
16
17 // (a) a multiple of the SIMD vector length, and
18 // (b) be much greater than the SIMD vector length
19 // (c) much smaller than n
20 const int stripSize = 10000;
21
22 // It is trivial to avoid this limitation by peeling off n%stripSize iterations
23 // at the end of the j-loop, and adding a second loop to process these elements.
25
26 s[0:m] = 0;
27
29 {
31 s_thread[0:m] = 0;
32
33
34 // in a parallel region
35 #pragma omp for collapse(2) schedule(guided)
36 for (int i = 0; i < m; i++) // Loop i will be collapsed with loop jj
37 for (int jj = 0; jj < n; jj += stripSize) // to form a single, greater
38 // iteration space
39 #pragma simd reduction(+:s_thread[i])
40 #pragma vector aligned
41 for (int j = jj; j < jj + stripSize; j++) // This loop is auto-vectorized
42 s_thread[i] += M[i*n+j];
43
44 // Arrays cannot be declared as reducers in pragma omp,
45 // and so the reduction must be programmed explicitly
46 for (int i = 0; i < m; i++)
47 #pragma omp atomic
48 s[i] += s_thread[i];
49 }
50
51 strcpy(method, "Strip-mine and collapse");
52
53 }
B.4.10.4 labs/4/4.a-affinity/step-01/Makefile
CXX = icpc
CXXFLAGS = -openmp -mkl
OBJECTS = affinity.o
MICOBJECTS = affinity.oMIC
.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
.cpp.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)
clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC
B.4.10.5 labs/4/4.a-affinity/step-01/affinity.cpp
29 fflush(0);
30 }
31 _mm_free(A); _mm_free(B); _mm_free(C);
32 }
B.4.10.6 labs/4/4.a-affinity/step-02/Makefile
CXX = icpc
CXXFLAGS = -openmp -mkl
OBJECTS = affinity.o
MICOBJECTS = affinity.oMIC
.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
.cpp.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
all: runme runmeMIC
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)
clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC

B.4.10.7 labs/4/4.a-affinity/step-02/affinity.cpp
B.4.11 Cache Optimization: Loop Interchange and Tiling
Back to Lab A.4.11.
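The worker.cc listings below are excerpted, so here is a minimal, generic sketch of the loop-tiling idea they apply; the array names and sizes are illustrative and not taken from the lab. The i-loop is split into tiles of iTile consecutive iterations, and the k-loop is placed outside the short intra-tile loop, so data streamed by the k-loop is reused for a whole tile instead of being re-read for every i:

#include <stdio.h>

int main() {
  const int n = 64, m = 64, iTile = 16; // n is assumed to be divisible by iTile
  static double a[n*m], b[m], out[n];
  for (int i = 0; i < n*m; i++) a[i] = 1.0;
  for (int k = 0; k < m; k++) b[k] = 1.0;
  for (int i = 0; i < n; i++) out[i] = 0.0;

  // Tiled computation of out[i] = sum over k of a[i][k]*b[k].
  // With the plain loop order (i outer, k inner), b[] is streamed once per row,
  // i.e. n times; with the i-loop tiled, each b[k] is loaded once per tile and
  // reused for iTile rows, reducing the traffic on b[] by a factor of iTile.
  for (int ii = 0; ii < n; ii += iTile)     // loop over tiles of the i-range
    for (int k = 0; k < m; k++)             // b[k] is reused for the whole tile
      for (int i = ii; i < ii + iTile; i++)
        out[i] += a[i*m + k] * b[k];

  printf("out[0] = %.1f (expected %.1f)\n", out[0], (double)m);
  return 0;
}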
B.4.11.1 labs/4/4.b-tiling/step-00/Makefile
CXX = icpc
CXXFLAGS = -openmp

.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)
clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC
B.4.11.2 labs/4/4.b-tiling/step-00/main.cc
22
23 );
24
25 int main(){
26 const int wlBins = 128;
27 const int tempBins = 128;
28 const int gsMax = 200;
30
31 double wavelength[wlBins], grainSizeD[gsMax], absorption[wlBins*gsMax],
32 planckFunc[tempBins*wlBins] __attribute__((aligned(64))), report[wlBins];
33
36
39 wavelength[i] = exp(0.1*i);
40 for (int j = 0; j < gsMax; j++) {
41 absorption[j*wlBins + i] = (double)(i*j);
42 grainSizeD[j] = (double)j;
43 }
44 for (int k = 0; k < tempBins; k++)
45 planckFunc[i*tempBins + k] = (double)i*exp(-0.1*k);
46 }
47
48 for (int tr = 0; tr < nTrials; tr++) {
49 const double t0=omp_get_wtime();
50
51 #pragma omp parallel for schedule(guided)
52 for (int cell = 0; cell < nCells; cell++) {
53 double emissivity[wlBins];
54 double distribution[tempBins*gsMax] __attribute__((aligned(64)));
55 // In the practical application, this quantity is computed for every cell,
56 // but for benchmarking, we omit this calculation and use the same distribution
57 // for all cells
58 distribution[:] = 1.0;
59 ComputeEmissivity(wlBins,
60 emissivity,
61 wavelength,
62 gsMax,
63 grainSizeD,
64 absorption,
65 tempBins,
66 planckFunc,
67 distribution
68 );
69 if ((tr == nTrials-1) && (cell == nCells-1))
70 report[:] = emissivity[:];
71 }
72 const double t1=omp_get_wtime();
73
74 if (tr >= 2) { // First two iterations are slow on Xeon Phi; exclude them
75 t += (t1-t0)/(double)(nTrials-2);
76 dt += (t1-t0)*(t1-t0)/(double)(nTrials-2);
77 }
78 printf("Trial %d: %.3f seconds\n", tr+1, t1-t0); fflush(0);
79 }
80 dt = sqrt(dt-t*t);
81 printf("Average: %.3f +- %.3f seconds.\nResult (for verification): ", t, dt);
82 for (int i = 0; i < wlBins; i++)
83 printf(" %.2e", report[i]);
84 printf("\n"); fflush(0);
85 }

Back to Lab A.4.11.
B.4.11.3 labs/4/4.b-tiling/step-00/worker.cc
33 emissivity[i] = sum*wavelength[i];
34 }
35 }
B.4.11.4 labs/4/4.b-tiling/step-01/worker.cc
12 const double* wavelength,
13 const int gsMax,
14 const double* grainSizeD,
15 const double* absorption,
16 const int tempBins,
17 const double* planckFunc,
18 const double* distribution
19 ) {
22
23 // and improve the locality of access to absorption[]
24 emissivity[0:wlBins] = 0.0;
27
28 double result = 0;
B.4.11.5 labs/4/4.b-tiling/step-02/worker.cc
29
30 double result[iTile]; result[:] = 0.0;
31 #pragma vector aligned
32 #pragma simd
33 for (int k = 0; k < tempBins; k++)
34 #pragma novector
37
38 for (int i = ii; i < ii+iTile; i++) {
41 emissivity[i] += result[i-ii]*product*wavelength[i];
42 }
43 }
44 }
45 }
B.4.11.6 labs/4/4.b-tiling/step-03/worker.cc
39
40 #pragma vector aligned
41 #pragma simd
42 for (int k = 0; k < tempBins; k++) {
43
44 // In an ideal world, the following code would be the body of the k-loop:
45 // for (int j = jj; j < jj+jTile; j++)
48
49 // However, #pragma simd fails to vectorize the k-loop when its body
53
56
57 distribution[(jj+0)*tempBins + k];
58 result[(1)*iTile + (i-ii)] += planckFunc[i*tempBins + k]*
59 distribution[(jj+1)*tempBins + k];
60 // on the host, the j-loop tile is 2, so the host code ends here
61 #ifdef __MIC__
62 // on the coprocessor, the j-loop tile is 4, so two more iterations
63 result[(2)*iTile + (i-ii)] += planckFunc[i*tempBins + k]*
64 distribution[(jj+2)*tempBins + k];
65 result[(3)*iTile + (i-ii)] += planckFunc[i*tempBins + k]*
66 distribution[(jj+3)*tempBins + k];
67 // end of coprocessor-only code
68 #endif
69 // End of j-loop unrolling
70 }
71 }
72
73 for (int j = jj; j < jj+jTile; j++) {
74 const double gsd = grainSizeD[j];
75 for (int i = ii; i < ii+iTile; i++) {
76 const double crossSection = absorption[j*wlBins + i];
77 const double product = gsd*crossSection;
78 emissivity[i] += result[(j-jj)*iTile + (i-ii)]*product*wavelength[i];
79 }
80 }
81 }
82 }
83 }
B.4.11.7 labs/4/4.b-tiling/step-04/main.cc
11 #include <omp.h>
12 #include <stdio.h>
13
14 void ComputeEmissivity00(const int, double*, const double*, const int, const double*,
15 const double*, const int, const double*, const double* );
16
17 void ComputeEmissivity01(const int, double*, const double*, const int, const double*,
18 const double*, const int, const double*, const double* );
19
20 void ComputeEmissivity02(const int, double*, const double*, const int, const double*,
21 const double*, const int, const double*, const double* );
22
23 void ComputeEmissivity03(const int, double*, const double*, const int, const double*,
24 const double*, const int, const double*, const double* );
25
26 typedef void (*CompFunc)(const int, double*, const double*, const int, const double*,
27 const double*, const int, const double*, const double* );
28
52 planckFunc[i*tempBins + k] = (double)i*exp(-0.1*k);
53 }
54
55 for (int tr = 0; tr < nTrials; tr++) {
56 const double t0=omp_get_wtime();
57
58 #pragma omp parallel for schedule(guided)
59 for (int cell = 0; cell < nCells; cell++) {
60 double emissivity[wlBins];
61 double distribution[tempBins*gsMax] __attribute__((aligned(64)));
62 // In the practical application, this quantity is computed for every cell, but
63 // for benchmarking, we omit this calculation and use the same distribution
64 // for all cells.
65 distribution[:] = 1.0;
66 Func[tr](wlBins,
67 emissivity,
68 wavelength,
69 gsMax,
70 grainSizeD,
71 absorption,
72 tempBins,
73 planckFunc,
74 distribution
75 );
76 if ((tr == nTrials-1) && (cell == nCells-1))
77 report[:] = emissivity[:];
78 }
79 const double t1=omp_get_wtime();
80
81 if (tr >= 2) { // First two iterations are slow on Xeon Phi; exclude them
82 t += (t1-t0)/(double)(nTrials-2);
83 dt += (t1-t0)*(t1-t0)/(double)(nTrials-2);
84 }
86 }
87 printf("\n"); fflush(0);
88 }
B.4.12.1 labs/4/4.c-cache-oblivious-recursion/step-00/Makefile
CXX = icpc
CXXFLAGS = -openmp -vec-report3
.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)
runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)
clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC
B.4.12.2 labs/4/4.c-cache-oblivious-recursion/step-00/main.cc
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
9
10 #include <stdio.h>
11 #include <omp.h>
12 #include <malloc.h>
13 #include <math.h>
14 #include <cilk/reducer_opadd.h>
15
18
19 A[i*n+j] = (float)(i*n+j);
20 }
21
Back to Lab A.4.12.

B.4.12.3 labs/4/4.c-cache-oblivious-recursion/step-00/worker.cc
4
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
9
10 void Transpose(float* const A, const int n){
B.4.12.4 labs/4/4.c-cache-oblivious-recursion/step-01/worker.cc
10 #include <cassert>
11
12 void Transpose(float* const A, const int n){
13 // Tiled algorithm improves data locality by re-using data already in cache
14 #ifdef __MIC__
15 const int TILE = 16;
16 #else
17 const int TILE = 32;
18 #endif
19 assert(n%TILE == 0);
20 _Cilk_for (int ii = 0; ii < n; ii += TILE) {
21 const int iMax = (n < ii+TILE ? n : ii+TILE);
22 for (int jj = 0; jj <= ii; jj += TILE) {
23 for (int i = ii; i < iMax; i++) {
24 const int jMax = (i < jj+TILE ? i : jj+TILE);
25 for (int j = jj; j<jMax; j++) {
26 const int c = A[i*n + j];
27 A[i*n + j] = A[j*n + i];
28 A[j*n + i] = c;
29 }
30 }
31 }
32 }
33 }
B.4.12.5 labs/4/4.c-cache-oblivious-recursion/step-02/worker.cc
9
10 #include <cassert>
11
12 void Transpose(float* const A, const int n){
13 // Tiled algorithm improves data locality by re-using data already in cache
14 #ifdef __MIC__
15 const int TILE = 16;
16 #else
17 const int TILE = 32;
18 #endif
19 assert(n%TILE == 0);
20 _Cilk_for (int ii = 0; ii < n; ii += TILE) {
21 const int iMax = (n < ii+TILE ? n : ii+TILE);
22 for (int jj = 0; jj <= ii; jj += TILE) {
23 for (int i = ii; i < iMax; i++) {
24 const int jMax = (i < jj+TILE ? i : jj+TILE);
25 #pragma loop count avg(TILE)
26 #pragma simd
27 for (int j = jj; j<jMax; j++) {
28 const int c = A[i*n + j];
29 A[i*n + j] = A[j*n + i];
30 A[j*n + i] = c;
31 }
32 }
33 }
34 }
35 }
B.4.12.6 labs/4/4.c-cache-oblivious-recursion/step-03/worker.cc
12 const int jStart, const int jEnd,
13 float* A, const int n){
14 #ifdef __MIC__
15 const int RT = 64; // Recursion threshold on coprocessor
16 #else
17 const int RT = 32; // Recursion threshold on host
18 #endif
19 if ( ((iEnd - iStart) <= RT) && ((jEnd - jStart) <= RT) ) {
20 for (int i = iStart; i < iEnd; i++) {
22 #pragma simd
25
27 A[j*n + i] = c;
28 }
29 }
30 return;
31 }
32
33 if ((jEnd - jStart) > (iEnd - iStart)) {
34 // Split into subtasks j-wise
35 const int jSplit = jStart + (jEnd - jStart)/2;
36 _Cilk_spawn transpose_cache_oblivious_thread(iStart, iEnd, jStart, jSplit, A, n);
37 transpose_cache_oblivious_thread(iStart, iEnd, jSplit, jEnd, A, n);
38 } else {
39 // Split into subtasks i-wise
40 const int iSplit = iStart + (iEnd - iStart)/2;
41 const int jMax = (jEnd < iSplit ? jEnd : iSplit);
42 _Cilk_spawn transpose_cache_oblivious_thread(iStart, iSplit, jStart, jMax, A, n);
43 transpose_cache_oblivious_thread(iSplit, iEnd, jStart, jEnd, A, n);
44 }
45 }
46
47 void Transpose(float* const A, const int n){
48 transpose_cache_oblivious_thread(0, n, 0, n, A, n);
49 }
B.4.12.7 labs/4/4.c-cache-oblivious-recursion/step-04/worker.cc
17 const int RT = 32; // Recursion threshold on host
18 #endif
19 if ( ((iEnd - iStart) <= RT) && ((jEnd - jStart) <= RT) ) {
20 for (int i = iStart; i < iEnd; i++) {
21 int je = (jEnd < i ? jEnd : i);
22 #pragma simd
23 #pragma loop_count avg(RT)
24 for (int j = jStart; j < je; j++) {
25 const float c = A[i*n + j];
27 A[j*n + i] = c;
28 }
29 }
30 return;
31 }
32
37 // boundaries
38 if (jSplit - jSplit%16 > jStart) jSplit -= jSplit%16;
39 _Cilk_spawn transpose_cache_oblivious_thread(iStart, iEnd, jStart, jSplit, A, n);
40 transpose_cache_oblivious_thread(iStart, iEnd, jSplit, jEnd, A, n);
41 } else {
42 // Split into subtasks i-wise
43 int iSplit = iStart + (iEnd - iStart)/2;
44 // The following line slightly improves performance by splitting on aligned
45 // boundaries
46 if (iSplit - iSplit%16 > iStart) iSplit -= iSplit%16;
47 const int jMax = (jEnd < iSplit ? jEnd : iSplit);
48 _Cilk_spawn transpose_cache_oblivious_thread(iStart, iSplit, jStart, jMax, A, n);
49 transpose_cache_oblivious_thread(iSplit, iEnd, jStart, jEnd, A, n);
50 }
51 }
52
53 void Transpose(float* const A, const int n){
54 transpose_cache_oblivious_thread(0, n, 0, n, A, n);
55 }
B.4.13.1 labs/4/4.d-cache-loop-fusion/step-00/Makefile
CXX = icpc
CXXFLAGS = -openmp
LINKFLAGS = -openmp -mkl
.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
all: runme runmeMIC

runme: $(OBJECTS)
$(CXX) $(LINKFLAGS) -o runme $(OBJECTS)

runmeMIC: $(MICOBJECTS)
$(CXX) $(LINKFLAGS) -mmic -o runmeMIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC

B.4.13.2 labs/4/4.d-cache-loop-fusion/step-00/main.cc
24
25 const int nt = 8;
26 double t = 0.0, dt = 0.0;
27 for (int k = 0; k < nt; k++) {
28
29 const double t0 = omp_get_wtime();
30 RunStatistics(m, n, resultMean, resultStdev);
31 const double t1 = omp_get_wtime();
32
33 if (k >= 2) {
34 t += (t1-t0);
35 dt += (t1-t0)*(t1-t0);
36 }
37
38 printf("Iteration %d: %.3f ms\n", k+1, 1e3*(t1-t0)); fflush(0);
39 }
40 t /= (double)(nt-2);
41 dt = sqrt(dt/(double)(nt-2) - t*t);
42
43 printf("Some of the results:\n ...\n");
44 for (int i = 10; i < 14; i++)
45 printf(" i=%d: x = %.1f+-%.1f (expected = %.1f+-%.1f)\n",
46 i, resultMean[i], resultStdev[i], (float)i, 1.0f);
47 printf(" ...\n"); fflush(stdout);
48
49
52 }
B.4.13.3 labs/4/4.d-cache-loop-fusion/step-00/worker.cc
27 status = vsRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER,
28 rng, n, &data[i*n], mean, stdev);
29 }
30
31 vslDeleteStream(&rng);
32 }
33 }
34
35 void ComputeMeanAndStdev(const int m, const int n, const float* data,
36 float* const resultMean, float* const resultStdev) {
37
38 // Processing data to compute the mean and standard deviation
39 #pragma omp parallel for schedule(guided)
40 for (int i = 0; i < m; i++) {
41 float sumx=0.0f, sumx2=0.0f;
42 #pragma vector aligned
43 for (int j = 0; j < n; j++) {
44 sumx += data[i*n + j];
45 sumx2 += data[i*n + j]*data[i*n + j];
46 }
47 resultMean[i] = sumx/(float)n;
48 resultStdev[i] = sqrtf(sumx2/(float)n-resultMean[i]*resultMean[i]);
49 }
50 }
51
52
53 void RunStatistics(const int m, const int n,
54 float* const resultMean, float* const resultStdev) {
55
57
58 float* data = (float*) _mm_malloc((size_t)m*(size_t)n*sizeof(float), 64);
59
60 GenerateRandomNumbers(m, n, data);
62
64 _mm_free(data);
65
66 }
B.4.13.4 labs/4/4.d-cache-loop-fusion/step-01/worker.cc
16
17 // Allocating memory for scratch space for the whole problem
18 // m*n elements on heap (does not fit on stack)
19 float* data = (float*) _mm_malloc((size_t)m*(size_t)n*sizeof(float), 64);
20
21 #pragma omp parallel
22 {
23 // Initializing a random number generator in each thread
24 VSLStreamStatePtr rng;
25 const int seed = omp_get_thread_num();
26 int status = vslNewStream(&rng, VSL_BRNG_MT19937, omp_get_thread_num());
27
28 #pragma omp for schedule(guided)
29 for (int i = 0; i < m; i++) {
30
31 // Filling arrays with normally distributed random numbers
32 const float seedMean = (float)i;
33 const float seedStdev = 1.0f;
34 status = vsRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER,
35 rng, n, &data[i*n], seedMean, seedStdev);
36
37 // Processing data to compute the mean and standard deviation
38 float sumx=0.0f, sumx2=0.0f;
39 #pragma vector aligned
40 for (int j = 0; j < n; j++) {
41 sumx += data[i*n + j];
42 sumx2 += data[i*n + j]*data[i*n + j];
43 }
44 resultMean[i] = sumx/(float)n;
45 resultStdev[i] = sqrtf(sumx2/(float)n-resultMean[i]*resultMean[i]);
46 }
47
48 vslDeleteStream(&rng);
49 }
50
51 // Deallocating scratch space
52 _mm_free(data);
53 }
B.4.13.5 labs/4/4.d-cache-loop-fusion/step-02/worker.cc
18 {
19 // Allocating scratch data, n elements on stack for each thread
20 float data[n] __attribute__((aligned(64)));
21
22 // Initializing a random number generator in each thread
23 VSLStreamStatePtr rng;
24 const int seed = omp_get_thread_num();
25 int status = vslNewStream(&rng, VSL_BRNG_MT19937, omp_get_thread_num());
26
27 #pragma omp for schedule(guided)
28 for (int i = 0; i < m; i++) {
29
30 // Filling arrays with normally distributed random numbers
31 const float seedMean = (float)i;
32 const float seedStdev = 1.0f;
33 status = vsRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER,
34 rng, n, data, seedMean, seedStdev);
35
36 // Processing data to compute the mean and standard deviation
37 float sumx=0.0f, sumx2=0.0f;
38 #pragma vector aligned
39 for (int j = 0; j < n; j++) {
40 sumx += data[j];
41 sumx2 += data[j]*data[j];
42 }
43 resultMean[i] = sumx/(float)n;
44 resultStdev[i] = sqrtf(sumx2/(float)n-resultMean[i]*resultMean[i]);
45 }
46
47 vslDeleteStream(&rng);
48 }
49
50 }
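All three versions of worker.cc above compute the mean and the standard deviation in a single pass over each row, keeping running sums of x and x*x and using the identity stdev = sqrt(<x*x> - <x>*<x>). A tiny stand-alone illustration of the same arithmetic, with sample values chosen only for this example:

#include <cmath>
#include <cstdio>

int main() {
  const float x[5] = {2.0f, 4.0f, 4.0f, 4.0f, 6.0f};
  float sumx = 0.0f, sumx2 = 0.0f;
  for (int j = 0; j < 5; j++) {
    sumx  += x[j];        // running sum of x
    sumx2 += x[j]*x[j];   // running sum of x*x
  }
  const float mean  = sumx/5.0f;
  const float stdev = sqrtf(sumx2/5.0f - mean*mean);
  printf("mean=%.1f stdev=%.2f\n", mean, stdev); // prints mean=4.0 stdev=1.26
  return 0;
}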
B.4.14.1 labs/4/4.e-offload/step-00/Makefile
CXX = icpc
CXXFLAGS = -openmp -vec-report
LINKFLAGS = -openmp
.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"
.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"
all: runme
runme: $(OBJECTS)
$(CXX) $(LINKFLAGS) -o runme $(OBJECTS)
runmeMIC: $(MICOBJECTS)
$(CXX) $(LINKFLAGS) -mmic -o runmeMIC $(MICOBJECTS)
clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC
B.4.14.2 labs/4/4.e-offload/step-00/main.cc
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <omp.h>
11 #include <cstdlib>
12 #include <cmath>
13 #include <cstdio>
14
16
21
22
23 void PerformOffloadTransfer(const size_t size, double & t, double & dt);
24
25 int main(){
26 const size_t sizeMin = 1L<<10L;
27 const size_t sizeMax = 1L<<30L;
28 const size_t sizeFactor = 2;
29 const int skipTrials = 2;
30 const int dropTrials = 1;
31
32 size_t size = sizeMin;
33
34 printf("#%11s%19s%19s%19s%19s%19s\n", "Array", "Default offload",
35 "With memory", "With data", "Allocate+free", "Effective");
36 printf("#%11s%19s%19s%19s%19s%19s\n", "Size, kB", "time, ms",
37 "retention, ms", "persistence, ms", "overhead, ms", "bandwidth, GB/s");
38 fflush(stdout);
39 while (size <= sizeMax) {
40
41 const size_t nTrials = 8L*sqrtf(sqrtf((float)(1L<<30L)/size));
42
43 // Array to be transferred
44 data = (char*) _mm_malloc(size, 64);
45 data[0:size] = 0;
46
47 // Timing the default offload
69
70
71 // printf("t=%.2f ms\n", (t1-t0)*1e3);
72 }
73 tMemR /= (double)(nTrials-skipTrials-1);
74 dtMemR = sqrt(dtMemR/(double)(nTrials-skipTrials-1) - tMemR*tMemR);
75
78
79 const double t0 = omp_get_wtime();
82
85 }
86 tDataP /= (double)(nTrials-skipTrials-1);
87 dtDataP = sqrt(dtDataP/(double)(nTrials-skipTrials-1) - tDataP*tDataP);
88
89 // Bandwidth is the transfer time with memory retention
90 const double bandwidth = (double)(size)/(double)(1L<<30L) / tMemR;
91 const double dBandwidth = bandwidth*(dtMemR/tMemR);
92
93 // The memory allocation latency is the default offload time
94 // minus the offload time with memory retention.
95 const double mallocLat = (tDefault - tMemR);
96 const double dMallocLat = sqrtf(dtDefault*dtDefault + dtMemR*dtMemR);
97
98 printf("%12ld %8.2f +/- %5.2f %8.2f +/- %5.2f %8.2f +/- %5.2f %8.2f +/- %5.2f\
99 %8.2f +/- %5.2f\n",
100 size/(1L<<10L),
101 tDefault*1e3, dtDefault*1e3,
102 tMemR*1e3, dtMemR*1e3,
103 tDataP*1e3, dtDataP*1e3,
104 mallocLat*1e3, dMallocLat*1e3,
105 bandwidth, dBandwidth);
106 fflush(stdout);
107
108 _mm_free(data);
109
B.4.14.3 labs/4/4.e-offload/step-00/worker.cc
8 */
9
10 #include <cstdlib>
11
12 void DefaultOffload(const size_t size, char* data) {
13
16 // transfer data,
17 // perform offload calculations
19
23 data[0] = 0;
24 }
25 }
26
27 void OffloadWithMemoryRetention(const size_t size, char* data, const int k,
28 const int nTrials) {
29
30 // Write the body of this function so that
31 // the memory container for the data is allocated during the first iteration,
32 // but this allocated memory is retained in the subsequent iterations
33 // and deallocated during the last iteration.
34
35 }
36
37 void OffloadWithDataPersistence(const size_t size, char* data, const int k,
38 const int nTrials) {
39
40 // Write the body of this function so that
41 // the data is transferred to the coprocessor during the first iteration,
42 // allocated memory is retained afterwards, and
43 // data is not transferred in subsequent iterations.
44
45 }
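One way the two exercise functions above could be filled in, using the alloc_if/free_if and length clauses that appear in the step-01 and step-02 listings below; this is a sketch consistent with those listings, not the book's reference solution:

void OffloadWithMemoryRetention(const size_t size, char* data, const int k,
                                const int nTrials) {
  // Allocate the coprocessor buffer only on the first iteration, keep it
  // resident between iterations, free it only on the last one;
  // the payload is still re-sent on every call.
#pragma offload target(mic:1) \
  in(data: length(size) alloc_if(k==0) free_if(k==nTrials-1) align(64))
  {
    data[0] = 0;
  }
}

void OffloadWithDataPersistence(const size_t size, char* data, const int k,
                                const int nTrials) {
  // Transfer the payload only on the first iteration (length(size));
  // afterwards reuse the copy already resident on the coprocessor (length(0)),
  // retaining the allocation until the last iteration.
  const size_t transferSize = (k == 0 ? size : 0);
#pragma offload target(mic:1) \
  in(data: length(transferSize) alloc_if(k==0) free_if(k==nTrials-1) align(64))
  {
    data[0] = 0;
  }
}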
B.4.14.4 labs/4/4.e-offload/step-01/worker.cc
20 #pragma offload target(mic:1) \
21 in(data: length(size) align(64))
22 {
23 data[0] = 0;
24 }
25 }
26
27 void OffloadWithMemoryRetention(const size_t size, char* data, const int k,
28 const int nTrials) {
29
34 {
35 data[0] = 0;
36 }
37 }
38
39 void OffloadWithDataPersistence(const size_t size, char* data, const int k,
40 const int nTrials) {
41
42 // Write the body of this function so that
43 // the data is transferred to the coprocessor during the first iteration,
44 // allocated memory is retained afterwards, and
45 // data is not transferred in subsequent iterations.
46
47 }
B.4.14.5 labs/4/4.e-offload/step-02/worker.cc
27 void OffloadWithMemoryRetention(const size_t size, char* data, const int k,
28 const int nTrials) {
29
30 // Allocate arrays on coprocessor during the first iteration;
31 // retain allocated memory for subsequent iterations
34 {
35 data[0] = 0;
36 }
37 }
38
44
45 #pragma offload target(mic:1) \
46 in(data: length(transferSize) alloc_if(k==0) free_if(k==nTrials-1) align(64))
47 {
48 data[0] = 0;
49 }
50 }
B.4.15.1 labs/4/4.f-MPI-load-balance/step-00/Makefile
all:
mpiicpc -mkl -o pi-host pi.cc
mpiicpc -mkl -o pi-mic -mmic pi.cc
scp pi-mic mic0:~/
runhost:
mpirun -host localhost -np 32 ./pi-host
runmic:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host mic0 -np 240 ~/pi-mic
runboth:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 32 ./pi-host : -host mic0 -np 240 ~/pi-mic
clean:
rm -f pi-host pi-mic
B.4.15.2 labs/4/4.f-MPI-load-balance/step-00/pi.cc
1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file pi.cc, located at 4/4.f-MPI-load-balance/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
9
10 #include <mpi.h>
11 #include <stdio.h>
12 #include <stdlib.h>
13 #include <mkl_vsl.h>
14
16
17 void RunMonteCarlo(const long firstBlock, const long lastBlock,
18 VSLStreamStatePtr & stream, long & dUnderCurve) {
19 // Performs the Monte Carlo calculation for blocks in the range [firstBlock; lastBlock)
20 // to count the number of random points inside of the quarter circle
21
22 long j, i;
23 float r[BLOCK_SIZE*2] __attribute__((align(64)));
24
25 for (j = firstBlock; j < lastBlock; j++) {
26
27 vsRngUniform( 0, stream, BLOCK_SIZE*2, r, 0.0f, 1.0f );
28 for (i = 0; i < BLOCK_SIZE; i++) {
29 const float x = r[i];
30 const float y = r[i+BLOCK_SIZE];
31 // Count points inside quarter circle
32 if (x*x + y*y < 1.0f) dUnderCurve++;
33 }
34 }
35
36 }
37
38 int main(int argc, char *argv[]){
39
40 int rank, nRanks, trial;
41
42 MPI_Init(&argc, &argv);
43 MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
44 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
45
46 // Work sharing: equal amount of work in each process
47 const double blocksPerProc = (double)nBlocks / (double)nRanks;
48
49 for (trial = 0; trial < nTrials; trial++) { // Multiple trials
50
51 const double start_time = MPI_Wtime();
52 long dUnderCurve=0, UnderCurveSum=0;
53
54 // Create and initialize a random number generator from MKL
55 VSLStreamStatePtr stream;
56 vslNewStream( &stream, VSL_BRNG_MT19937, trial*nRanks + rank );
57
58 // Range of blocks processed by this process
59 const long myFirstBlock = (long)(blocksPerProc*rank);
60 const long myLastBlock = (long)(blocksPerProc*(rank+1));
61
62 RunMonteCarlo(myFirstBlock, myLastBlock, stream, dUnderCurve);
63
64 vslDeleteStream( &stream );
65
66 // Compute pi
67 MPI_Reduce(&dUnderCurve, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
68 if (rank==0) {
69 const double pi = (double)UnderCurveSum / (double) iter * 4.0 ;
75 fflush(0);
76 }
77
78 MPI_Barrier(MPI_COMM_WORLD);
79 }
80 MPI_Finalize();
81 }
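The estimate printed by rank 0 follows from the fact that a quarter circle of radius 1 covers pi/4 of the unit square, so pi is approximately 4 * UnderCurveSum / iter. A minimal serial sketch of the same estimator; rand() is used here only for brevity, whereas the lab code uses the MKL VSL generator:

#include <cstdio>
#include <cstdlib>

int main() {
  const long n = 1000000L;
  long underCurve = 0;
  for (long i = 0; i < n; i++) {
    const float x = (float)rand()/(float)RAND_MAX;
    const float y = (float)rand()/(float)RAND_MAX;
    if (x*x + y*y < 1.0f) underCurve++;   // point falls inside the quarter circle
  }
  printf("pi is approximately %f\n", 4.0*(double)underCurve/(double)n);
  return 0;
}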
B.4.15.3 labs/4/4.f-MPI-load-balance/step-01/Makefile
all:
mpiicpc -mkl -o pi-static-host pi-static.cc
mpiicpc -mkl -o pi-static-mic -mmic pi-static.cc
scp pi-static-mic mic0:~/
scp pi-static-mic mic1:~/
runhost:
mpirun -host localhost -np 32 ./pi-static-host
runmic:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host mic0 -np 240 ~/pi-static-mic
runboth:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 32 ./pi-static-host : -host mic0 -np 240 ~/pi-static-mic
runall:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 32 ./pi-static-host : \
-host mic0 -np 240 ~/pi-static-mic \
-host mic1 -np 240 ~/pi-static-mic
clean:
rm -f pi-static-host pi-static-mic
B.4.15.4 labs/4/4.f-MPI-load-balance/step-01/pi-static.cc
1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file pi-static.cc, located at 4/4.f-MPI-load-balance/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
9
10 #include <mpi.h>
11 #include <stdio.h>
12 #include <stdlib.h>
13 #include <mkl_vsl.h>
14
59
60 ( nProcsOnMIC > 0 ? alpha*nBlocks/(alpha*nProcsOnHost+nProcsOnMIC) :
61 (double)nBlocks/nProcsOnHost );
62 const long blockOffset = 0;
63 const int rankOnDevice = rank;
64 #else
68
69 #endif
70
75
76
77 VSLStreamStatePtr stream;
78 vslNewStream(&stream, VSL_BRNG_MT19937, rank*nTrials + t);
79
80 // Range of blocks processed by this process
81 const long myFirstBlock = blockOffset + (long)(blocksPerRank*rankOnDevice);
82 const long myLastBlock = blockOffset + (long)(blocksPerRank*(rankOnDevice+1));
83
84 RunMonteCarlo(myFirstBlock, myLastBlock, stream, dUnderCurve);
85
86 vslDeleteStream( &stream );
87
88 // Reduction to collect the results of the Monte Carlo calculation
89 MPI_Reduce(&dUnderCurve, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
90
91 double unbalancedTime = MPI_Wtime();
92 MPI_Barrier(MPI_COMM_WORLD);
93 unbalancedTime = MPI_Wtime() - unbalancedTime;
94
95 // Timing collection
96 double hostUnbalancedTime = 0.0, MICUnbalancedTime = 0.0;
97 #ifdef __MIC__
98 MICUnbalancedTime += unbalancedTime;
99 #else
121
122 fflush(0);
123 }
124
125 MPI_Barrier(MPI_COMM_WORLD);
126 }
127 MPI_Finalize();
128 }
B.4.15.5 labs/4/4.f-MPI-load-balance/step-02/pi-dynamic.cc
48
49 #ifdef __MIC__
50 thisProcOnMIC++; // This process is running on an Intel Xeon Phi coprocessor
51 #else
52 thisProcOnHost++; // This process is running on an Intel Xeon processor
53 #endif
54 }
57
59
63
64 if (rank == 0) {
65
66 const char* grainSizeSt = getenv("GRAIN_SIZE");
67 if (grainSizeSt == NULL) { printf("GRAIN_SIZE undefined\n"); exit(1); }
68 grainSize = atof(grainSizeSt);
69 long currentBlock = 0;
70 while (currentBlock < nBlocks) {
71 msg[0] = currentBlock; // First block for next worker
72 msg[1] = currentBlock + grainSize; // Last block
73 if (msg[1] > nBlocks) msg[1] = nBlocks; // Stay in range
74
75 // Wait for next worker
76 MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
77 &stat);
78
79 // Assign work to next worker
80 MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
81
82 currentBlock = msg[1]; // Update position
83 }
84
85 // Terminate workers
86 msg[0] = -1; msg[1] = -2;
87 for (int i = 1; i < nRanks; i++) {
88 MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
89 &stat);
90 MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
91 }
92
93 } else {
94 // Worker performs the Monte Carlo calculation
95 VSLStreamStatePtr stream; // Create & initialize a random number generator from MKL
96 vslNewStream(&stream, VSL_BRNG_MT19937, rank*nTrials + t);
97 float r[BLOCK_SIZE*2] __attribute__((align(64)));
98
99 // Range of blocks processed by this worker
100 msg[0] = 0;
101 while (msg[0] >= 0) {
102 double waitTime = MPI_Wtime();
103 MPI_Send(&rank, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
104 MPI_Recv(&msg, 2, MPI_LONG, 0, rank, MPI_COMM_WORLD, &stat);
105 waitTime = MPI_Wtime() - waitTime;
106 #ifdef __MIC__
107 MICSchedulingWait += waitTime;
108 #else
109 hostSchedulingWait += waitTime;
110 #endif
111 const long myFirstBlock = msg[0];
112 const long myLastBlock = msg[1];
113
114 RunMonteCarlo(myFirstBlock, myLastBlock, r, stream, dUnderCurve);
115 }
116 vslDeleteStream( &stream );
117 }
118
119 // Reduction to collect the results of the Monte Carlo calculation
120 MPI_Reduce(&dUnderCurve, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
121
123 MPI_Barrier(MPI_COMM_WORLD);
125
127
128 #ifdef __MIC__
129 MICUnbalancedTime += unbalancedTime;
130 #else
131 hostUnbalancedTime += unbalancedTime;
132 #endif
133 double averageHostUnbalancedTime = 0.0, averageMICUnbalancedTime = 0.0;
134 double averageHostSchedulingWait = 0.0, averageMICSchedulingWait = 0.0;
135 MPI_Reduce(&hostUnbalancedTime, &averageHostUnbalancedTime, 1, MPI_DOUBLE, MPI_SUM,
136 0, MPI_COMM_WORLD);
137 MPI_Reduce(&MICUnbalancedTime, &averageMICUnbalancedTime, 1, MPI_DOUBLE, MPI_SUM,
138 0, MPI_COMM_WORLD);
139 MPI_Reduce(&hostSchedulingWait, &averageHostSchedulingWait, 1, MPI_DOUBLE, MPI_SUM,
140 0, MPI_COMM_WORLD);
141 MPI_Reduce(&MICSchedulingWait, &averageMICSchedulingWait, 1, MPI_DOUBLE, MPI_SUM,
142 0, MPI_COMM_WORLD);
143 if (nProcsOnHost > 0) {
144 averageHostUnbalancedTime /= nProcsOnHost;
145 averageHostSchedulingWait /= nProcsOnHost;
146 }
147 if (nProcsOnMIC > 0) {
148 averageMICUnbalancedTime /= nProcsOnMIC;
149 averageMICSchedulingWait /= nProcsOnMIC;
150 }
151
152 // Compute pi
153 if (rank==0) {
154 const double pi = (double) UnderCurveSum / (double) iter * 4.0 ;
155 const double end_time = MPI_Wtime();
156 const double pi_exact=3.141592653589793;
157 if (t==0) printf("#%9s %8s %7s %9s %14s %14s %14s %14s\n",
158 "pi", "Rel.err", "Time, s", "GrainSize", "Host unbal., s", "MIC unbal., s",
159 "Host sched, s.", "MIC sched, s.");
160 printf ("%10.8f %8.1e %7.3f %9ld %14.3f %14.3f %14.3f %14.3f\n",
161 pi, (pi-pi_exact)/pi_exact, end_time-start_time, grainSize,
162 averageHostUnbalancedTime, averageMICUnbalancedTime, averageHostSchedulingWait,
163 averageMICSchedulingWait);
164 fflush(0);
165 }
166 }
167 MPI_Finalize();
168 }
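The boss-worker exchange in pi-dynamic.cc reduces to a two-message protocol: an idle worker sends its rank to rank 0, and rank 0 answers with a [firstBlock, lastBlock) pair, using a negative firstBlock as the termination signal. A stripped-down sketch of just that protocol; the work items here are placeholders, while the lab code processes Monte Carlo blocks:

#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[]) {
  int rank, nRanks, worker;
  long msg[2];
  MPI_Status stat;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
  if (rank == 0) {                       // boss: hand out ranges of 100 blocks
    long current = 0; const long total = 1000, grain = 100;
    while (current < total) {
      MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
      msg[0] = current;
      msg[1] = (current + grain < total ? current + grain : total);
      MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
      current = msg[1];
    }
    msg[0] = -1; msg[1] = -2;            // termination message for every worker
    for (int i = 1; i < nRanks; i++) {
      MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
      MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
    }
  } else {                               // worker: request work until told to stop
    msg[0] = 0;
    while (msg[0] >= 0) {
      MPI_Send(&rank, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
      MPI_Recv(&msg, 2, MPI_LONG, 0, rank, MPI_COMM_WORLD, &stat);
      if (msg[0] >= 0)
        printf("Rank %d got blocks [%ld, %ld)\n", rank, msg[0], msg[1]);
    }
  }
  MPI_Finalize();
  return 0;
}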
B.4.15.6 labs/4/4.f-MPI-load-balance/step-03/Makefile
all:
mpiicpc -mkl -openmp -o pi-boss-dynamic pi-dynamic-hybrid.cc
runhost1:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
runhost4:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
-host localhost -np 8 -env OMP_NUM_THREADS 4 ./pi-worker-hybrid-host
runhost16:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
-host localhost -np 2 -env OMP_NUM_THREADS 16 ./pi-worker-hybrid-host
runhostall:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
-host localhost -np 1 -env OMP_NUM_THREADS 32 ./pi-worker-hybrid-host
runmic1:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
-host mic0 -np 240 -env OMP_NUM_THREADS 1 ~/pi-worker-hybrid-mic
runmic4:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
-host mic0 -np 60 -env OMP_NUM_THREADS 4 ~/pi-worker-hybrid-mic
runmic16:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
-host mic0 -np 15 -env OMP_NUM_THREADS 16 ~/pi-worker-hybrid-mic
runmicall:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
-host mic0 -np 1 -env OMP_NUM_THREADS 240 ~/pi-worker-hybrid-mic
runboth: runboth1 runboth4 runboth16 runbothall
runboth1:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
runboth4:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
runboth16:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
-host localhost -np 2 -env OMP_NUM_THREADS 16 ./pi-worker-hybrid-host : \
-host mic0 -np 15 -env OMP_NUM_THREADS 16 ~/pi-worker-hybrid-mic
runbothall:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
-host localhost -np 1 -env OMP_NUM_THREADS 32 ./pi-worker-hybrid-host : \
-host mic0 -np 1 -env OMP_NUM_THREADS 240 ~/pi-worker-hybrid-mic
clean:
rm -f pi-boss-dynamic pi-worker-hybrid-host pi-worker-hybrid-mic
B.4.15.7 labs/4/4.f-MPI-load-balance/step-03/pi-dynamic-hybrid.cc
22
23
24 long j, i;
25 long dUnderCurve = 0;
26 #pragma omp parallel
27 {
31
38 }
39 }
40 }
41
42 dUnderCurveExt += dUnderCurve;
43
44 }
45
46 int main(int argc, char *argv[]){
47
48 int rank, nRanks, worker;
49 long grainSize, msg[2]; // MPI message; msg[0] is blockStart, msg[1] is blockEnd
50 MPI_Status stat;
51 MPI_Init(&argc, &argv);
52 MPI_Comm_size(MPI_COMM_WORLD, &nRanks); MPI_Comm_rank(MPI_COMM_WORLD, &rank);
53
54 // Count the number of processes running on the host and on coprocessors
55 int nProcsOnMIC, nProcsOnHost, thisProcOnMIC=0, thisProcOnHost=0;
56 if (rank != 0) {
57 #ifdef __MIC__
58 thisProcOnMIC++; // This process is running on an Intel Xeon Phi coprocessor
59 #else
60 thisProcOnHost++; // This process is running on an Intel Xeon processor
61 #endif
62 }
84
85
86 // Assign work to next worker
87 MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
88
89 currentBlock = msg[1]; // Update position
90 }
91
92 // Terminate workers
93 msg[0] = -1; msg[1] = -2;
94 for (int i = 1; i < nRanks; i++) {
95 MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
96 &stat);
97 MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
98 }
99
100 } else {
101
102
103 // Create and initialize a random number generator from MKL
104 VSLStreamStatePtr stream[omp_get_max_threads()];
105 #pragma omp parallel
106 {
107 // Each thread gets its own random seed
108 const int randomSeed = nTrials*omp_get_thread_num()*nRanks + nTrials*rank + t;
109 vslNewStream(&stream[omp_get_thread_num()], VSL_BRNG_MT19937, randomSeed);
110 }
111
112 msg[0] = 0;
113 while (msg[0] >= 0) {
114 // Receive from boss the range of blocks to process
115 double waitTime = MPI_Wtime();
116 MPI_Send(&rank, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
117 MPI_Recv(&msg, 2, MPI_LONG, 0, rank, MPI_COMM_WORLD, &stat);
118 waitTime = MPI_Wtime() - waitTime;
119 #ifdef __MIC__
120 MICSchedulingWait += waitTime;
121 #else
122 hostSchedulingWait += waitTime;
123 #endif
124 const long myFirstBlock = msg[0];
146
147 MICUnbalancedTime += unbalancedTime;
148 #else
149 hostUnbalancedTime += unbalancedTime;
150 #endif
154 0, MPI_COMM_WORLD);
155 MPI_Reduce(&MICUnbalancedTime, &averageMICUnbalancedTime, 1, MPI_DOUBLE, MPI_SUM,
156 0, MPI_COMM_WORLD);
157 MPI_Reduce(&hostSchedulingWait, &averageHostSchedulingWait, 1, MPI_DOUBLE, MPI_SUM,
158 0, MPI_COMM_WORLD);
163 averageHostSchedulingWait /= nProcsOnHost;
164 }
165 if (nProcsOnMIC > 0) {
166 averageMICUnbalancedTime /= nProcsOnMIC;
167 averageMICSchedulingWait /= nProcsOnMIC;
168 }
169
170 // Compute pi
171 if (rank==0) {
172 const double pi = (double) UnderCurveSum / (double) iter * 4.0 ;
173 const double end_time = MPI_Wtime();
174 const double pi_exact=3.141592653589793;
175 if (t==0) printf("#%9s %8s %7s %9s %14s %14s %14s %14s\n",
176 "pi", "Rel.err", "Time, s", "GrainSize", "Host unbal., s", "MIC unbal., s",
177 "Host sched, s.", "MIC sched, s.");
178 printf ("%10.8f %8.1e %7.3f %9ld %14.3f %14.3f %14.3f %14.3f\n",
179 pi, (pi-pi_exact)/pi_exact, end_time-start_time, grainSize,
180 averageHostUnbalancedTime, averageMICUnbalancedTime,
181 averageHostSchedulingWait, averageMICSchedulingWait);
182 fflush(0);
183 }
184 }
185 MPI_Finalize();
186 }
B.4.15.8 labs/4/4.f-MPI-load-balance/step-04/pi-guided-hybrid.cc
15
16 const long iter=1L<<32L, BLOCK_SIZE=4096L, nBlocks=iter/BLOCK_SIZE, nTrials = 10;
17
18 void RunMonteCarlo(const long firstBlock, const long lastBlock,
19 VSLStreamStatePtr *stream, long & dUnderCurveExt) {
20
21 // Performs the Monte Carlo calculation for blocks in the range [firstBlock; lastBlock)
22 // to count the number of random points inside of the quarter circle
23
24 long j, i;
25 long dUnderCurve = 0;
27
28 float r[BLOCK_SIZE*2] __attribute__((align(64)));
29
56 if (rank != 0) {
57 #ifdef __MIC__
58 thisProcOnMIC++; // This process is running on an Intel Xeon Phi coprocessor
59 #else
60 thisProcOnHost++; // This process is running on an Intel Xeon processor
61 #endif
62 }
63 MPI_Allreduce(&thisProcOnMIC, &nProcsOnMIC, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
64 MPI_Allreduce(&thisProcOnHost, &nProcsOnHost, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
65 for (int t = 0; t < nTrials; t++) { // Multiple trials
66
67 long dUnderCurve = 0, UnderCurveSum = 0;
68 double hostSchedulingWait = 0.0, MICSchedulingWait = 0.0;
69 const double start_time = MPI_Wtime();
70
71 if (rank == 0) {
72 // Boss assigns work
73 const char* grainSizeSt = getenv("GRAIN_SIZE");
74 if (grainSizeSt == NULL) { printf("GRAIN_SIZE undefined\n"); exit(1); }
75 grainSize = atof(grainSizeSt);
76 long currentBlock = 0;
77 while (currentBlock < nBlocks) {
78 // Chunk size is proportional to the number of unassigned blocks
79 // divided by the number of workers...
80 long chunkSize = ((nBlocks-currentBlock)/(nRanks-1))/2;
81 // ...but never smaller than GRAIN_SIZE
86
88 MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
89 &stat);
90
91 // Assign work to next worker
92 MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
93
94
95 }
96
97 // Terminate workers
98 msg[0] = -1; msg[1] = -2;
99 for (int i = 1; i < nRanks; i++) {
100 MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
101 &stat);
102 MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
103 }
104
105 } else {
106 // Worker performs the Monte Carlo calculation
107
108 // Create and initialize a random number generator from MKL
109 VSLStreamStatePtr stream[omp_get_max_threads()];
110 #pragma omp parallel
111 {
112 // Each thread gets its own random seed
113 const int randomSeed = nTrials*omp_get_thread_num()*nRanks + nTrials*rank + t;
114 vslNewStream(&stream[omp_get_thread_num()], VSL_BRNG_MT19937, randomSeed);
115 }
116
117 msg[0] = 0;
139
140 }
141
142 // Reduction to collect the results of the Monte Carlo calculation
143 MPI_Reduce(&dUnderCurve, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
144
145 double unbalancedTime = MPI_Wtime();
146 MPI_Barrier(MPI_COMM_WORLD);
147 unbalancedTime = MPI_Wtime() - unbalancedTime;
148
149 // Timing collection
153 #else
154 hostUnbalancedTime += unbalancedTime;
155 #endif
156
157 double averageHostSchedulingWait = 0.0, averageMICSchedulingWait = 0.0;
158 MPI_Reduce(&hostUnbalancedTime, &averageHostUnbalancedTime, 1, MPI_DOUBLE, MPI_SUM,
159 0, MPI_COMM_WORLD);
160 MPI_Reduce(&MICUnbalancedTime, &averageMICUnbalancedTime, 1, MPI_DOUBLE, MPI_SUM,
161 0, MPI_COMM_WORLD);
162 MPI_Reduce(&hostSchedulingWait, &averageHostSchedulingWait, 1, MPI_DOUBLE, MPI_SUM,
163 0, MPI_COMM_WORLD);
164 MPI_Reduce(&MICSchedulingWait, &averageMICSchedulingWait, 1, MPI_DOUBLE, MPI_SUM,
165 0, MPI_COMM_WORLD);
166 if (nProcsOnHost > 0) {
167 averageHostUnbalancedTime /= nProcsOnHost;
168 averageHostSchedulingWait /= nProcsOnHost;
169 }
170 if (nProcsOnMIC > 0) {
171 averageMICUnbalancedTime /= nProcsOnMIC;
172 averageMICSchedulingWait /= nProcsOnMIC;
173 }
174
175 // Compute pi
176 if (rank==0) {
177 const double pi = (double) UnderCurveSum / (double) iter * 4.0 ;
178 const double end_time = MPI_Wtime();
179 const double pi_exact=3.141592653589793;
180 if (t==0) printf("#%9s %8s %7s %9s %14s %14s %14s %14s\n",
181 "pi", "Rel.err", "Time, s", "GrainSize", "Host unbal., s", "MIC unbal., s",
182 "Host sched, s.", "MIC sched, s.");
183 printf ("%10.8f %8.1e %7.3f %9ld %14.3f %14.3f %14.3f %14.3f\n",
184 pi, (pi-pi_exact)/pi_exact, end_time-start_time, grainSize,
185 averageHostUnbalancedTime, averageMICUnbalancedTime,
186 averageHostSchedulingWait, averageMICSchedulingWait);
187 fflush(0);
188 }
189 }
190 MPI_Finalize();
191 }
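In the guided scheme above, each chunk handed out by the boss is half of the still-unassigned blocks divided by the number of workers, so chunk sizes shrink geometrically as the work runs out. A small stand-alone sketch of that chunk-size sequence; the clamp to grainSize is an assumption consistent with the comments at listing lines 78-81, and the numeric values are illustrative only:

#include <cstdio>

int main() {
  const long nBlocks = 1000000L, grainSize = 1000L;
  const int nWorkers = 16;
  long currentBlock = 0;
  int nChunks = 0;
  while (currentBlock < nBlocks) {
    // Half of the remaining work per worker, as in pi-guided-hybrid.cc ...
    long chunkSize = ((nBlocks - currentBlock)/nWorkers)/2;
    // ... but never smaller than GRAIN_SIZE (assumed clamp)
    if (chunkSize < grainSize) chunkSize = grainSize;
    // Stay in range, as the boss does when it caps msg[1] at nBlocks
    if (currentBlock + chunkSize > nBlocks) chunkSize = nBlocks - currentBlock;
    if (nChunks < 5) printf("chunk %d: %ld blocks\n", nChunks, chunkSize);
    currentBlock += chunkSize;
    nChunks++;
  }
  printf("total chunks: %d\n", nChunks);
  return 0;
}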
Index
_Cilk_sync, 72, 95, 113
162, 222, 282, 283 __cilkrts_get_nworkers(), 98
offset __cilkrts_get_worker_number(), 98,
calculation, 282
106
arithmetic compilation, 97, 98
complexity, 16
elemental functions, 86
array extension for array notation, 85
notation, 85 fork-join, 106
-cilk-serialize, 98
flag FLOPS, 9, 16
-cilk-serialize, 98 FMA, 92
-mkl, 269 fork-join, 103
-mmic, 3, 37, 44, 192, 230, 231, 238, 242, form factor, 9
255, 267–269, 280 fused multiply-add, 92
-no-offload, 272
-O, 78, 84, 90 GCC, 11
-openmp, 95, 97, 105 GDDR bandwidth, 7
-openmp-stubs, 97 GDDR5, 1, 5, 15
-vec-report, 83, 84, 90
-x, 90 hardware
optimization, 78 installation, 18
compute-bound, 14–16, 140, 189, 192, 196–198, 255, header file
322 stdio.h, 49, 50
computing stdlib.h, 49
header files
model
heterogeneous, 4, 122 cilk/cilk.h, 96, 98
hybrid, 4, 122 dvec.h, 82
emmintrin.h, 81
configuration
file, 31 fvec.h, 82
cpuinfo, 34 ia32intrin.h, 81
immintrin.h, 81
data ivec.h, 82
alignment, 11, 79–81, 84, 89, 90, 139, 153, 158, malloc.h, 79
159, 162, 222, 282, 283 mmintrin.h, 81
heap, 79 mpi.h, 43
transfer pthread.h, 40
asynchronous, 55 smmintrin.h, 81
synchronous, 53 stdio.h, 37, 40, 43, 45, 46, 67–69, 71, 72, 83, 98,
267
__declspec(align()), 79, 83
stdlib.h, 68, 71, 72, 80
string.h, 43
environment tmmintrin.h, 81
variable, 49 unistd.h, 37, 40, 267
CILK_NWORKERS, 98, 106, 119, 120, 290 xmmintrin.h, 81
OMP_NUM_THREADS, 97, 105, 170, 196, 286, heterogeneous
287 computing, 4, 122
OMP_SCHEDULE, 101 MPI, 76, 122
explicit offload, 45, 271, 273 hostname, 32, 38
exportfs -a, 35 HPC, 1
hybrid
false sharing, 171 computing, 4
fflush(0), 47, 269 hyper-threads, 5, 6, 13, 97, 140, 179, 189, 194, 196,
file 199, 207, 209, 230, 238
configuration, 31 counter-productive, 315, 316, 322
firmware
update, 30 I_MPI_MIC, 44, 74, 306
firstprivate, 108 I/O, 47
#ifdef pattern, 15
__INTEL_COMPILER, 85 bandwidth, 7, 15
__MIC__, 50, 63, 272, 278, 324, 327 hierarchy, 7
IMCI, 6, 14, 78, 81, 82, 91, 93 MIC, 1
alignment, 84 vector
installation, 18 features, 91
Intel __MIC__, 50, 63, 272, 278, 324, 327
Cluster Studio XE, 12 MIC_ENV_PREFIX, 49
Parallel Studio XE, 12 MIC_ENV_PREFIX, 49
Xeon, 11 MIC_LD_LIBRARY_PATH, 49
__INTEL_COMPILER, 85 MIC_PROXY_IO, 47
IP address, 32, 33 miccheck, 21, 28, 262
IP-addressable, 5 micctrl, 18, 21, 27, 31, 263
iptables, 35 micflash, 21, 30, 262
ITAC, 3 micinfo, 21, 22, 262
micnativeloadex, 39, 267, 268
Jacobi micnativeloadex, 39
method, 318 micrasd, 21, 29
micsmc, 21, 24, 26, 40, 262
KNC, 1, 6
MKL, 3, 248
KNC chip, 5
mkl_mic_set_workdivision(), 252
L1 cache, 7 Automatic Offload, 252
µLinux, 10, 31
LD_LIBRARY_PATH, 49
_mm_free, 79, 80, 158, 192, 218, 222, 283
Linux, 10
_mm_malloc, 79, 80, 158, 192, 218, 222, 283
linux
-mmic, 44, 267
kernel, 20
load -mmic flag, 3, 37, 192, 230, 231, 238, 242, 255, 267,
mount -a, 36
MTU, 33 _Offload_number_of_devices, 67
multi-Core, 1 OMP_NUM_THREADS, 97, 105, 170, 196, 286, 287
multi-core, 1, 3, 4 OpenCL, 3
multi-threaded, 14 OpenMP, 3, 39, 95, 97, 99
multiple atomic operations, 110
asynchronous compilation, 97
offload, 72 critical section, 110
coprocessors, 66 firstprivate, 108
offload, 67 fork-join, 103, 104
MYO, 45 lastprivate, 108
_Cilk_offload, 59, 60 loops, 99
_Cilk_offload, 65 private, 108
_Cilk_shared, 59, 60, 71 private variables, 114, 166
_Cilk_shared, 65 reduction, 114
schedule
native
dynamic, 100
compilation, 38 guided, 101
execution, 37, 39, 264, 267 static, 100
mode, 264, 267
shared, 109
MPI, 44
tasks, 103
networking
thread, 99
host IP, 32
variables
bridge, 31, 32
private, 107
multiple, 33
shared, 107
DHCP, 33
IP, 32
-openmp-stubs, 97
MTUsize, 33
optimization
SSH, 34
-On, 141
subnetwork, 32
pragma, 141
new, 64
parallel
MPI, 35, 73
application, 13
offload performance, 13
coherence, 59 processor, 13
diagnostics, 48 parallel vs. serial, 13
explicit, 45, 51, 271, 273 parallelism, 13
function, 46 data, 77
model, 45 distributed memory, 122
multiple, 67 SIMD, 77
asynchronous, 72 task, 94
unsuccessful, 50 thread, 94
_Offload_number_of_devices(), 67 PCIe, 2, 31
OFFLOAD_REPORT, 48, 272, 275 bandwidth, 16
_Offload_shared_aligned_free, 61, 64 traffic, 17
_Offload_shared_aligned_malloc, 61, 64 peak
_Offload_shared_free, 61, 64 bandwidth, 9
_Offload_shared_malloc, 61, 64 FLOPS, 9
performance, 9 always, 89
PMU, 5 nontemporal, 89
portability, 3, 81 temporal, 89
power unaligned, 89
management, 261 prefetch, 7
power consumption, 2 prefetching, 8, 11, 15, 92, 140, 220
#pragma private, 108
cilk proxy console I/O, 47
grainsize, 102 pthreads, 3, 40, 94
ivdep, 87–89
loop count, 90 race condition, 166
novector, 89 resource
offload, 51–53 monitoring, 261
in, 53 restrict, 87, 88, 90
inout, 53
schedule
nocopy, 53
dynamic, 100
out, 53
guided, 101
signal, 69
static, 100
target(mic), 45, 50, 58
scp, 38
target(mic:0), 55, 56, 67
target(mic:i), 68, 69
serial
application, 13
wait, 55
performance, 13
offload_attribute
processor, 13
pop, 46, 57, 272
service mpss, 19, 20, 262, 264
shared, 109
offload_transfer, 52, 53, 275
SIMD, 77
alloc_if, 54
inout, 53
SSE, 78, 81, 82
nocopy, 53, 54
signal, 55
micinfo, 22
micnativeloadex, 39
micrasd, 29
micsmc, 24, 41, 262
variables
private, 107
shared, 107
vector
register, 5, 14, 78
vectorization
automatic, 83
IMCI, 91
virtual
shared
class, 62
object, 60
VTune, 3