

About the Authors

Andrey Vladimirov, PhD, is Head of HPC Research at Colfax International. His primary interest is the application of modern computing technologies to computationally demanding scientific problems. Prior to joining Colfax, A. Vladimirov was involved in computational astrophysics research at Stanford University, North Carolina State University, and the Ioffe Institute in Russia, where he studied cosmic rays, collisionless plasmas and the interstellar medium using computer simulations. He is an author or co-author of over 10 peer-reviewed publications in the fields of theoretical astrophysics and scientific computing.

Vadim Karpusenko, PhD, is Principal HPC Research Engineer at Colfax International involved in training and consultancy projects on data mining, software development and statistical analysis of complex systems. His research interests are in the area of physical modeling with HPC clusters, highly parallel architectures, and code optimization. Vadim holds a PhD in Computational Biophysics from North Carolina State University for his research on the free energy and stability of helical secondary structures of proteins.

The authors are sincerely grateful to James Reinders for supervising and directing the creation of this book, to Albert Lee for his help with editing and error checking, to the specialists at Intel Corporation who contributed their time and shared with the authors their expertise on MIC architecture programming: Bob Davies, Shannon Cepeda, Pradeep Dubey, Ronald Green, James Jeffers, Taylor Kidd, Rakesh Krishnaiyer, Chris (CJ) Newburn, Kevin O’Leary and Zhang Zhang, and to a great number of people, mostly from Colfax International and Intel, who ensured that gears were turning and bits were churning during the production of the book, including Rajesh Agny, Mani Anandan, Joe Curley, Roger Herrick, Richard Jackson, Mike Lafferty, Thomas Lee, Belinda Liviero, Gary Paek, Troy Porter, Tim Puett, John Rinehimer, Gautam Shah, Manish Shah, Bruce Shiu, Jimmy Tran, Achim Wengeler, and Desmond Yuen.
Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors

Handbook on the Development and Optimization of Parallel Applications
for Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors

© Colfax International, 2013

Electronic book built: October 16, 2014
Last revision date: October 16, 2014

Copyrighted Material
Copyright © 2013, Colfax International. All rights reserved.
Cover image Copyright © pio3, 2013. Used under license from Shutterstock.com.
Published by Colfax International, 750 Palomar Ave, Sunnyvale, CA 94085, USA.
All Rights Reserved.
No part of this book (or publication) may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without written permission from the publisher, except for the inclusion of brief quotations in a review.
Intel, Xeon and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.
All trademarks and registered trademarks appearing in this publication are the property of their respective owners.

Disclaimer and Legal Notices

While best efforts have been used in preparing this book, the publisher makes no representations or warranties of any kind and assumes no liabilities of any kind with respect to the accuracy or completeness of the contents and specifically disclaims any implied warranties of merchantability or fitness of use for a particular purpose. The publisher shall not be held liable or responsible to any person or entity with respect to any loss or incidental or consequential damages caused, or alleged to have been caused, directly or indirectly, by the information or programs contained herein. No warranty may be created or extended by sales representatives or written sales materials.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance.

Because of the evolutionary nature of technology, knowledge and best practices described at the time of this writing may become outdated or simply inapplicable at a later date. Summaries, strategies, tips and tricks are only recommendations by the publisher, and reading this eBook does not guarantee that one’s results will exactly mirror our own results. Every company is different and the advice and strategies contained herein may not be suitable for your situation. References are provided for informational purposes only and do not constitute endorsement of any websites or other sources.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

ISBN: 978-0-9885234-1-8
Contents

Foreword  ix

Preface  xi

List of Abbreviations  xiii

1 Introduction  1
  1.1 Intel Xeon Phi Coprocessors and the MIC Architecture  1
    1.1.1 Overview  1
    1.1.2 A Drop-in Solution for a Novel Platform  2
    1.1.3 Code Portability  3
    1.1.4 Heterogeneous Computing  4
  1.2 MIC Architecture: Developer's Perspective  5
    1.2.1 Knights Corner Die Organization  5
    1.2.2 Core Topology  6
    1.2.3 Memory Hierarchy and Cache Properties  7
    1.2.4 Technical Specifications  9
    1.2.5 Integration into the Host System  10
    1.2.6 Intel Xeon Processors versus Intel Xeon Phi Coprocessors: Developer Experience  11
    1.2.7 Development Tools  12
  1.3 Identifying Algorithms Appropriate for Execution on Intel Xeon Phi Coprocessors  13
    1.3.1 Task-Parallel Scalability  13
    1.3.2 Data-Parallel Component  14
    1.3.3 Memory Access Pattern  15
    1.3.4 PCIe Bandwidth Considerations  16
  1.4 Installing the Intel Xeon Phi Coprocessor and MPSS  18
    1.4.1 Hardware Installation  18
    1.4.2 Installing MPSS  18
    1.4.3 Intel Compilers and Miscellaneous Tools  19
    1.4.4 Restoring MPSS Functionality after Linux Kernel Updates  20
  1.5 MPSS Tools and Linux Environment on Intel Xeon Phi Coprocessors  21
    1.5.1 mic* Utilities  21
    1.5.2 Network Configuration of the uOS on Intel Xeon Phi Coprocessors  31
    1.5.3 SSH Access to Intel Xeon Phi Coprocessors  34
    1.5.4 NFS Mounting a Host Export  35
2 Programming Models for Intel Xeon Phi Applications  37
  2.1 Native Applications and MPI on Intel Xeon Phi Coprocessors  37
    2.1.1 Using Compiler Argument -mmic to Compile Native Applications for Intel Xeon Phi Coprocessors  37
    2.1.2 Establishing SSH Sessions with Coprocessors  38
    2.1.3 Running Native Applications with micnativeloadex  39
    2.1.4 Monitoring the Coprocessor Activity with micsmc  40
    2.1.5 Compiling and Running MPI Applications on the Coprocessor  42
  2.2 Explicit Offload Model  45
    2.2.1 "Hello World" Example in the Explicit Offload Model  45
    2.2.2 Offloading Functions  46
    2.2.3 Proxy Console I/O  47
    2.2.4 Offload Diagnostics  48
    2.2.5 Environment Variable Forwarding and MIC_ENV_PREFIX  49
    2.2.6 Target-Specific Code with the Preprocessor Macro __MIC__  50
    2.2.7 Fall-Back to Execution on the Host upon Unsuccessful Offload  50
    2.2.8 Using Pragmas to Transfer Bitwise-Copyable Data to the Coprocessor  51
    2.2.9 Data Traffic and Persistence between Offloads  53
    2.2.10 Asynchronous Offload  55
    2.2.11 Review: Core Language Constructs of the Explicit Offload Model  57
  2.3 MYO (Virtual-Shared) Memory Model  59
    2.3.1 Sharing Objects with _Cilk_shared and _Cilk_offload Keywords  60
    2.3.2 Dynamically Allocating Virtual-Shared Objects  61
    2.3.3 Virtual-Shared Classes  62
    2.3.4 Placement Version of Operator new for Shared Classes  64
    2.3.5 Summary of Language Extensions for the MYO Model  65
  2.4 Multiple Intel Xeon Phi Coprocessors in a System and Clusters with Intel Xeon Phi Coprocessors  66
    2.4.1 Using Several Coprocessors in the Explicit Offload Model  67
    2.4.2 Using Multiple Coprocessors in the MYO Model  70
    2.4.3 Running MPI Applications on Multiple Intel Xeon Phi Coprocessors  73

3 Expressing Parallelism  77
  3.1 Data Parallelism in Serial Codes  77
    3.1.1 SIMD Operations: Concept and History  77
    3.1.2 MMX, SSE, AVX and IMCI Instruction Sets  78
    3.1.3 Is Your Code Using SIMD Instructions?  78
    3.1.4 Data Alignment  79
    3.1.5 Using SIMD with Inline Assembly Code, Compiler Intrinsics and Class Libraries  81
    3.1.6 Automatic Vectorization of Loops  83
    3.1.7 Extensions for Array Notation in Intel Cilk Plus  85
    3.1.8 Elemental Functions  86
    3.1.9 Assumed Vector Dependence. The restrict Keyword.  87
    3.1.10 Summary of Vectorization Pragmas, Keywords and Compiler Arguments  89
    3.1.11 Exclusive Features of the IMCI Instruction Set  91
  3.2 Task Parallelism in Shared Memory  94
    3.2.1 About OpenMP and Intel Cilk Plus  94
    3.2.2 "Hello World" OpenMP and Intel Cilk Plus Programs  97
    3.2.3 Loop-Centric Parallelism: For-Loops in OpenMP and Intel Cilk Plus  99
    3.2.4 Fork-Join Model of Parallel Execution: Tasks in OpenMP and Spawning in Intel Cilk Plus  103
    3.2.5 Shared and Private Variables  107
    3.2.6 Synchronization: Avoiding Unpredictable Program Behavior  110
    3.2.7 Reduction: Avoiding Synchronization  114
    3.2.8 Additional Resources on Shared Memory Parallelism  121
  3.3 Task Parallelism in Distributed Memory, MPI  122
    3.3.1 Parallel Computing in Clusters with Multi-Core and Many-Core Nodes  122
    3.3.2 Program Structure in Intel MPI  124
    3.3.3 Point-to-Point Communication  126
    3.3.4 MPI Communication Modes  130
    3.3.5 Collective Communication and Reduction  135
    3.3.6 Further Reading  138

4 Optimizing Applications for Intel Xeon Product Family  139
  4.1 Roadmap to Optimal Code on Intel Xeon Phi Coprocessors  139
    4.1.1 Optimization Checklist  139
    4.1.2 Expectations  140
  4.2 Scalar Optimizations  141
    4.2.1 Assisting the Compiler  141
    4.2.2 Eliminating Redundant Operations  144
    4.2.3 Controlling Precision and Accuracy  147
    4.2.4 Library Functions for Standard Tasks  152
  4.3 Data Parallelism: Vectorization the Right and Wrong Way  153
    4.3.1 Unit-Stride Access and Spatial Locality of Reference  153
    4.3.2 Guiding Automatic Vectorization with Compiler Hints  157
    4.3.3 Branches in Automatically Vectorized Loops  161
    4.3.4 Diagnosing the Utilization of Vector Instructions  165
  4.4 Task Parallelism: Common Pitfalls in Shared-Memory Parallel Code  166
    4.4.1 Too Much Synchronization. Solution: Avoiding True Sharing with Private Variables and Reduction  166
    4.4.2 False Sharing. Solution: Data Padding and Private Variables  171
    4.4.3 Load Imbalance. Solution: Load Scheduling and Grain Size Specification  175
    4.4.4 Insufficient Parallelism. Solution: Strip-Mining and Collapsing Nested Loops  179
    4.4.5 Wandering Threads. Improving OpenMP Performance by Setting Thread Affinity  189
    4.4.6 Diagnosing Parallelization Problems with Scalability Tests  196
  4.5 Memory Access: Computational Intensity and Cache Management  197
    4.5.1 Cache Organization on Intel Xeon Processors and Intel Xeon Phi Coprocessors  199
    4.5.2 Cache Misses  199
    4.5.3 Loop Interchange (Permuting Nested Loops)  200
    4.5.4 Loop Tiling (Blocking)  205
    4.5.5 Cache-Oblivious Recursive Methods  213
    4.5.6 Cross-Procedural Loop Fusion  216
    4.5.7 Advanced Topic: Prefetching  220
  4.6 PCIe Traffic Control  221
    4.6.1 Memory Retention Between Offloads  221
    4.6.2 Data Persistence Between Offloads  222
    4.6.3 Memory Alignment and TLB Page Size Control  222
    4.6.4 Offload Benchmark Results  223
  4.7 Process Parallelism: MPI Optimization Strategies  225
    4.7.1 Example Problem: the Monte Carlo Method of Computing the Number π  226
    4.7.2 MPI Implementation without Load Balancing  228
    4.7.3 Load Balancing with Static Scheduling  233
    4.7.4 Load Balancing with Dynamic Scheduling  236
    4.7.5 Multi-threading within MPI Processes  241
    4.7.6 Load Balancing with Guided Scheduling  245
    4.7.7 Load Balancing with Work Stealing  247
  4.8 Using the Intel MKL  248
    4.8.1 Functions Offered by MKL  250
    4.8.2 Linking Applications with Intel MKL. Link Line Advisor  251
    4.8.3 Automatic Offload  252
    4.8.4 Compiler-Assisted Offload  254
    4.8.5 Native Execution  255
    4.8.6 General Performance Considerations for Applications Using Intel MKL  255

5 Summary and Resources  257
  5.1 Programming Intel Xeon Phi Coprocessors is Not Trivial, but Offers Double Rewards  257
  5.2 Practical Training  258
  5.3 Additional Resources  259

A Practical Exercises  261
  A.1 Exercises for Chapter 1  261
    A.1.1 Power Management and Resource Monitoring  261
    A.1.2 Networking on Intel Xeon Phi Coprocessors  263
  A.2 Exercises for Chapter 2: Programming Models  267
    A.2.1 Compiling and Running Native Intel Xeon Phi Applications  267
    A.2.2 Explicit Offload: Sharing Arrays and Structures  271
    A.2.3 Explicit Offload: Data Traffic and Asynchronous Offload  273
    A.2.4 Explicit Offload: Putting It All Together  274
    A.2.5 Virtual-Shared Memory Model: Sharing Complex Objects  275
    A.2.6 Using Multiple Coprocessors  276
    A.2.7 Asynchronous Execution on One and Multiple Coprocessors  278
    A.2.8 Using MPI for Multiple Coprocessors  279
  A.3 Exercises for Chapter 3: Expressing Parallelism  282
    A.3.1 Automatic Vectorization: Compiler Pragmas and Vectorization Report  282
    A.3.2 Parallelism with OpenMP: Shared and Private Variables, Reduction  285
    A.3.3 Complex Algorithms with Intel Cilk Plus: Recursive Divide-and-Conquer  288
    A.3.4 Data Traffic with MPI  290
  A.4 Exercises for Chapter 4: Optimizing Applications  293
    A.4.1 Using Intel VTune Amplifier XE  293
    A.4.2 Using Intel Trace Analyzer and Collector  305
    A.4.3 Serial Optimization: Precision Control, Eliminating Redundant Operations  310
    A.4.4 Vector Optimization: Unit-Stride Access, Data Alignment  313
    A.4.5 Vector Optimization: Assisting the Compiler  314
    A.4.6 Vector Optimization: Branches in Auto-Vectorized Loops  316
    A.4.7 Shared-Memory Optimization: Reducing the Synchronization Cost  317
    A.4.8 Shared-Memory Optimization: Load Balancing  318
    A.4.9 Shared-Memory Optimization: Loop Collapse and Strip-Mining for Parallel Scalability  320
    A.4.10 Shared-Memory Optimization: Core Affinity Control  322
    A.4.11 Cache Optimization: Loop Interchange and Tiling  323
    A.4.12 Memory Access: Cache-Oblivious Algorithms  324
    A.4.13 Memory Access: Loop Fusion  325
    A.4.14 Offload Traffic Optimization  326
    A.4.15 MPI: Load Balancing  326

B Source Code for Practical Exercises  329
  B.1 Source Code for Chapter 1  329
  B.2 Source Code for Chapter 2: Programming Models  329
    B.2.1 Compiling and Running Native Applications on Intel Xeon Phi Coprocessors  329
    B.2.2 Explicit Offload: Sharing Arrays and Structures  331
    B.2.3 Explicit Offload: Data Traffic and Asynchronous Offload  335
    B.2.4 Explicit Offload: Putting It All Together  338
    B.2.5 Virtual-Shared Memory Model: Sharing Complex Objects  343
    B.2.6 Using Multiple Coprocessors  351
    B.2.7 Asynchronous Execution on One and Multiple Coprocessors  354
    B.2.8 Using MPI for Multiple Coprocessors  362
  B.3 Source Code for Chapter 3: Expressing Parallelism  364
    B.3.1 Automatic Vectorization: Compiler Pragmas and Vectorization Report  364
    B.3.2 Parallelism with OpenMP: Shared and Private Variables, Reduction  374
    B.3.3 Complex Algorithms with Cilk Plus: Recursive Divide-and-Conquer  380
    B.3.4 Data Traffic with MPI  385
  B.4 Source Code for Chapters 4 and 5: Optimizing Applications  390
    B.4.1 Using Intel VTune Amplifier XE  390
    B.4.2 Using Intel Trace Analyzer and Collector  394
    B.4.3 Serial Optimization: Precision Control, Eliminating Redundant Operations  398
    B.4.4 Vector Optimization: Unit-Stride Access, Data Alignment  404
    B.4.5 Vector Optimization: Assisting the Compiler  409
    B.4.6 Vector Optimization: Branches in Auto-Vectorized Loops  421
    B.4.7 Shared-Memory Optimization: Reducing the Synchronization Cost  424
    B.4.8 Shared-Memory Optimization: Resolving Load Imbalance  431
    B.4.9 Shared-Memory Optimization: Loop Collapse and Strip-Mining for Improved Parallel Scalability  438
    B.4.10 Shared-Memory Optimization: Core Affinity Control  443
    B.4.11 Cache Optimization: Loop Interchange and Tiling  448
    B.4.12 Memory Access: Cache-Oblivious Algorithms  455
    B.4.13 Memory Access: Loop Fusion  461
    B.4.14 Offload Traffic Optimization  465
    B.4.15 MPI: Load Balancing  470

Bibliography  487

Index  495



Foreword

We live in exciting times; the amount of computing power available for sciences and engineering is reaching enormous heights through parallel computing. Parallel computing is driving discovery in many endeavors, but remains a relatively new area of computing. As such, software developers are part of an industry that is still growing and evolving as parallel computing becomes more commonplace.

The added challenges involved in parallel programming are being eased by four key trends in the industry: emergence of better tools, wide-spread usage of better programming models, availability of significantly more hardware parallelism, and more teaching material promising to yield better-educated programmers. We have seen recent innovations in tools and programming models including OpenMP and Intel Threading Building Blocks. Now, the Intel® Xeon Phi™ coprocessor certainly provides a huge leap in hardware parallelism with its general purpose hardware thread counts being as high as 244 (up to 61 cores, 4 threads each).

This leaves the challenge of creating better-educated programmers. This handbook from Colfax, with a subtitle of "Handbook on the Development and Optimization of Parallel Applications for Intel Xeon Processors and Intel Xeon Phi Coprocessors", is an example-based course for the optimization of parallel applications for platforms with Intel Xeon processors and Intel Xeon Phi coprocessors.

This handbook serves as practical training covering understandable computing problems for C and C++ programmers. The authors at Colfax have developed sample problems to illustrate key challenges and offer their own guidelines to assist in optimization work. They provide easy to follow instructions that allow the reader to understand solutions to the problems posed as well as inviting the reader to experiment further. Colfax's examples and guidelines complement those found in our recent book on programming the Intel Xeon Phi Coprocessor by Jim Jeffers and myself by adding another perspective to the teaching materials available from which to learn.

In the quest to learn, it takes multiple teaching methods to reach everyone. I applaud these authors in their efforts to bring forth more examples to enable either self-directed or classroom oriented hands-on learning of the joys of parallel programming.

James R. Reinders
Co-author of "Intel® Xeon Phi™ Coprocessor High Performance Programming"
© 2013, Morgan Kaufmann Publishers
Intel Corporation
March 2013


Preface

Welcome to the Colfax Developer Training! You are holding in your hands or browsing on your computer screen a comprehensive set of training materials for this training program. This document will guide you to the mastery of parallel programming with Intel® Xeon® family products: Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. The curriculum includes a detailed presentation of the programming paradigm for the Intel Xeon product family, optimization guidelines, and hands-on exercises on systems equipped with Intel Xeon Phi coprocessors, as well as instructions on using Intel® software development tools and libraries included in Intel® Parallel Studio XE.

These training materials are targeted toward developers familiar with C/C++ programming in Linux. Developers with little parallel programming experience will be able to grasp the core concepts of this subject from the detailed commentary in Chapter 3. For advanced developers familiar with multi-core and/or GPU programming, the training offers materials specific to the Intel compilers and Intel Xeon family products, as well as optimization advice pertinent to the Many Integrated Core (MIC) architecture.

We have written these materials relying on key elements for efficient learning: practice and repetition. As a consequence, the reader will find a large number of code listings in the main section of these materials. In the extended Appendix, we provide numerous hands-on exercises that one can complete either under an instructor's supervision, or autonomously in a self-study training.

This document is different from a typical book on computer science, because we intended it to be used as a lecture plan in an intensive learning course. Speaking in programming terms, a typical book traverses material with a "depth-first algorithm", describing every detail of each method or concept before moving on to the next method. In contrast, this document traverses the scope of material with a "breadth-first" algorithm. First, we give an overview of multiple methods to address a certain issue. In the subsequent chapter, we re-visit these methods, this time in greater detail. We may go into even more depth down the line. In this way, we expect that students will have enough time to absorb and comprehend the variety of programming and optimization methods presented here. The course road map is outlined in the following list.

• Chapter 1 presents the Intel Xeon Phi architecture overview and the environment provided by the MIC Platform Software Stack (MPSS) and Intel Cluster Studio XE on the Many Integrated Core (MIC) architecture. The purpose of Chapter 1 is to outline what users may expect from Intel Xeon Phi coprocessors (technical specifications, software stack, application domain).

• Chapter 2 allows the reader to experience the simplicity of Intel Xeon Phi usage early on in the program. It describes the operating system running on the coprocessor, the compilation of native applications, and the language extensions that let CPU-centric codes utilize Intel Xeon Phi coprocessors: the offload and virtual-shared memory programming models. In a nutshell, Chapter 2 demonstrates how to write serial code that executes on Intel Xeon Phi coprocessors.

• Chapter 3 introduces Single Instruction Multiple Data (SIMD) parallelism and automatic vectorization,
thread parallelism with OpenMP and Intel Cilk Plus, and distributed-memory parallelization with MPI.
In brief, Chapter 3 shows how to write parallel code (vectorization, OpenMP, Intel Cilk Plus, MPI).

• Chapter 4 re-iterates the material of Chapter 3, this time delving deeper into the topics of parallel
programming and providing example-based optimization advice, including the usage of the Intel Math
Kernel Library. This chapter is the core of the training. The topics discussed in this Chapter 4 include:
i) scalar optimizations;
ii) improving data structures for streaming, unit-stride, local memory access;
iii) guiding automatic vectorization with language constructs and compiler hints;
iv) reducing synchronization in task-parallel algorithms by the use of reduction;
v) avoiding false sharing;
vi) increasing arithmetic intensity and reducing cache misses by loop blocking and recursion;
vii) exposing the full scope of available parallelism;
viii) controlling process and thread affinity in OpenMP and MPI;
ix) reducing communication through data persistence on the coprocessor;
x) scheduling practices for load balancing across cores and MPI processes;
xi) optimized Intel Math Kernel Library function usage, and others.
If Chapter 3 demonstrated how to write parallel code for Intel Xeon Phi coprocessors, then Chapter 4 shows how to make this parallel code run fast.
• Chapter 5 summarizes the course and provides pointers to additional resources.

Throughout the training, we emphasize the concept of portable parallel code. Portable parallelism can be achieved by designing codes in a way that exposes the data and task parallelism of the underlying algorithm, and by using language extensions such as OpenMP pragmas and Intel Cilk Plus. The resulting code can be run on processors as well as on coprocessors, and can be ported with only recompilation to future generations of multi- and many-core processors with SIMD capabilities. Even though the Colfax Developer Training program touches on low-level programming using intrinsic functions, it focuses on achieving high performance by writing highly parallel code and utilizing the Intel compiler's automatic vectorization functionality and parallel frameworks.
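As a minimal illustration of this idea (our own sketch, not one of the listings from the chapters), the same operation is expressed below with an OpenMP pragma and with Intel Cilk Plus array notation; both versions compile with the Intel compilers unchanged for Intel Xeon processors and, with the -mmic argument, for Intel Xeon Phi coprocessors. The function names, arguments and the constant are arbitrary.

  // The same data-parallel operation, written in the two language extensions named above.
  void scale_openmp(float* a, const float* b, int n) {
  #pragma omp parallel for        // thread parallelism; the loop body is a candidate for auto-vectorization
    for (int i = 0; i < n; i++)
      a[i] = 2.0f * b[i];
  }

  void scale_cilk(float* a, const float* b, int n) {
    a[0:n] = 2.0f * b[0:n];       // Intel Cilk Plus array notation expresses the data parallelism in one line
  }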
The handbook of the Colfax Developer Training is an essential component of a comprehensive, hands-on course. While the handbook has value outside a training environment as a reference guide, the full utility of the training is greatly enhanced by students' access to individual computing systems equipped with Intel Xeon processors, Intel Xeon Phi coprocessors and Intel software development tools. Please check the Web page of the Colfax Developer Training for additional information: http://www.colfax-intl.com/xeonphi/

Welcome to the exciting world of parallel programming!


List of Abbreviations

ALU Arithmetic Logic Unit
AO Automatic Offload
API Application Programming Interface
AVX Advanced Vector Extensions (SIMD standard)
BLAS Basic Linear Algebra Subprograms
CAO Compiler Assisted Offload
CLI Command Line Interface
CPU Central Processing Unit, used interchangeably with the terms "processor" and "host" to indicate the Intel Xeon processor, as opposed to the Intel Xeon Phi coprocessor
DFFT Discrete Fast Fourier Transform
DGEMM Double-precision General Matrix Multiply
DSA Digital Signature Algorithm
ECC Error Correction Code
FFT Fast Fourier Transform
FLOP Floating-Point Operation. Refers to any floating-point operation, not just addition or multiplication.
FMA Fused Multiply-Add instruction
FP Floating-point
GCC GNU Compiler Collection
GDDR Graphics Double Data Rate memory
GEMM General Matrix Multiply
GPGPU General Purpose Graphics Processing Unit
GUI Graphical User Interface
HPC High Performance Computing
I/O Input/Output
IMCI Initial Many-Core Instructions
IP Internet Protocol
IPN Interprocessor Network
ITAC Intel Trace Analyzer and Collector
KNC Knights Corner architecture
LAPACK Linear Algebra Package
LLC Last level cache
LRU Least Recently Used, a cache replacement policy
MAC Media Access Control (address)
MIC Many Integrated Core architecture, used interchangeably with the terms "coprocessor", "device" and "target" to indicate the Intel Xeon Phi coprocessor, as opposed to the Intel Xeon processor
MKL Math Kernel Library
MMX Multimedia Extensions (SIMD standard)
MPI Message Passing Interface
MPSS MIC Platform Software Stack
MTU Maximum Transmission Unit
MYO Mine, Yours, Ours (memory sharing model)
NFS Network File Sharing Protocol
NIC Network Interface Controller
NUMA Non-Uniform Memory Access
OpenMP Open Multi-Processing
OS Operating System
PCIe Peripheral Component Interconnect Express
PMU Performance Monitoring Unit
RAM Random Access Memory
RNG Random Number Generator
RSA Rivest-Shamir-Adleman cryptography algorithm
RTL Runtime Library
ScaLAPACK Scalable Linear Algebra Package
SGEMM Single-precision General Matrix Multiply
SIMD Single Instruction Multiple Data
SMP Symmetric Multiprocessor
SSE Streaming SIMD Extensions (SIMD standard)
SSH Secure Shell protocol
SVML Short Vector Math Library
TCP Transmission Control Protocol
TLB Translation Lookaside Buffer
uOS or µOS, Linux operating system for Intel Xeon Phi coprocessors
VML Vector Mathematical Library
VSL Vector Statistical Library


Chapter 1

Introduction

This chapter introduces the Intel Many Integrated Core (MIC) architecture and positions Intel Xeon Phi coprocessors in the context of parallel programming.

1.1 Intel® Xeon Phi™ Coprocessors and the MIC Architecture

1.1.1 Overview
Intel Xeon Phi coprocessors have been designed by Intel Corporation as a supplement to the Intel Xeon processor family. These computing accelerators feature the MIC (Many Integrated Core) architecture, which enables fast and energy-efficient execution of High Performance Computing (HPC) applications utilizing massive thread parallelism, vector arithmetics and streamlined memory access. The term "Many Integrated Core" serves to distinguish the Intel Xeon Phi product family from the "Multi-Core" family of Intel Xeon processors.
Intel Xeon Phi coprocessors derive their high performance from multiple cores, dedicated vector arithmetic units with wide vector registers, and cached on-board GDDR5. High energy efficiency is achieved through the use of low clock speed x86 cores with lightweight design suitable for parallel HPC applications.
Figure 1.1 illustrates the chip layout of an Intel Xeon processor and an Intel Xeon Phi coprocessor based on the KNC (Knights Corner) architecture. The most apparent difference conveyed by this image is the number and density of cores on the chip. This fact reminds the user that massive parallelism in applications is necessary in order to fully employ Intel Xeon Phi coprocessors.

Figure 1.1: Intel's multi-core and many-core engines (not to scale): a multi-core Intel Xeon processor and a many-core Intel Xeon Phi coprocessor. Image credit: Intel Corporation.


1.1.2 A Drop-in Solution for a Novel Platform


Intel Xeon Phi coprocessors based on the KNC chip are installed in MIC-ready computing systems, including workstations and servers, and are connected to the host through the Peripheral Component Interconnect Express (PCIe) interface. Solutions featuring 1, 2, 4 and 8 Intel Xeon Phi coprocessors are offered by Colfax International.
Existing applications can be ported to scale across heterogeneous Multi-Core/Many-Core architectures, as described in Section 1.1.3, in order to enhance performance and improve power efficiency. Figure 1.2 demonstrates a system based on Intel Xeon E5 processors and enhanced with eight Intel Xeon Phi coprocessors, boasting a theoretical limit of over 8 TFLOP/s of double precision performance at 3 kW power consumption.


Figure 1.2: Top: an Intel Xeon Phi coprocessor based on the KNC chip, with a passive-cooling solution and a PCIe x16 connector. Bottom: a server computing system featuring eight Intel Xeon Phi coprocessors with a passive-cooling solution. Relative sizes not to scale.


1.1.3 Code Portability


As an x86 architecture processor, an Intel Xeon Phi coprocessor can execute applications compiled from the same C/C++ or Fortran code as an Intel Xeon processor. Indeed, Intel Xeon processors and Intel Xeon Phi coprocessors support the same parallelization models and benefit from similar optimization methods. The compatibility of host and coprocessor programming models and execution environments is such that it is usually possible to port a code designed for multi-core systems to the MIC architecture by specifying the compiler argument -mmic.
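As an illustration of this portability (a sketch of ours, not a listing from the book; the file name, problem size and the build command lines in the comments are illustrative), the program below compiles for the host with the Intel compiler and, by adding only the -mmic argument, as a native coprocessor application:

  // saxpy.cc (hypothetical example file)
  // Host build:               icpc -openmp -O2 saxpy.cc -o saxpy.host
  // Native coprocessor build: icpc -openmp -O2 -mmic saxpy.cc -o saxpy.mic
  #include <cstdio>
  #include <vector>

  int main() {
    const int n = 1000000;                 // arbitrary problem size
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    const float a = 3.0f;
  #pragma omp parallel for                 // the same source runs on the processor and on the coprocessor
    for (int i = 0; i < n; i++)
      y[i] += a * x[i];
    std::printf("y[0] = %.1f\n", y[0]);
    return 0;
  }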
Facilitating the Multi-Core and Many-Core portability, Intel provides common software development and runtime tools for processors and coprocessors:

Compilers: Intel C Compiler, Intel C++ Compiler and Intel Fortran Compiler;
Optimization tools: Intel VTune Amplifier XE and Intel Trace Analyzer and Collector (ITAC);
Mathematics support: Intel Math Kernel Library (MKL);
Runtime parallelization libraries: Intel MPI and Intel OpenMP;
and others.

Figure 1.3 demonstrates the variety of choices for thread and data parallelism implementations in the design of applications for Intel Xeon and Intel Xeon Phi platforms. Depending on the specificity and computing needs of the application, the depth of control over the execution may be chosen from high-level library function calls to low-level threading functionality and Single Instruction Multiple Data (SIMD) operations. This choice is available in both Multi-Core and Many-Core applications.
[Figure 1.3 (diagram): parallel programming options for Intel Xeon processors and Intel Xeon Phi coprocessors, ordered from ease of use to fine control. The threading and vector options shown include the Intel Math Kernel Library, MPI*, Intel Threading Building Blocks, Intel Cilk Plus (including array notation), OpenMP*, OpenCL*, Pthreads*, automatic and semi-automatic vectorization (#pragma vector, ivdep, simd), and C/C++ vector classes (F32vec16, F64vec8).]

Figure 1.3: Implementation of thread and data parallelism in applications for Intel Xeon processors and Intel Xeon Phi coprocessors designed with Intel software development tools. Diagram based on materials designed by Intel.


1.1.4 Heterogeneous Computing


Programming models for Intel Xeon Phi coprocessors include native execution and offload-based approaches. These approaches enable developers to design a spectrum of hybrid computing models, ranging from Multi-Core hosted (i.e., only employing the CPU) to Multi-Core-centric (i.e., executing on the host system with some operations performed on the coprocessor) to symmetric (i.e., employing the host and the coprocessor) and Many-Core hosted (i.e., executing exclusively on a set of coprocessors).
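For orientation (this fragment is our own sketch, not the book's "Hello World" example from Chapter 2; the loop and variable are arbitrary), the offload-based approach marks a region of a host program for execution on the coprocessor with an Intel compiler pragma:

  #include <cstdio>

  int main() {
    long sum = 0;
  #pragma offload target(mic)              // Intel compiler extension: run the next block on a coprocessor
    {
      for (int i = 1; i <= 1000; i++)      // scalar variables such as sum are copied in and out automatically
        sum += i;
    }
    std::printf("sum = %ld\n", sum);
    return 0;
  }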
The choice of work division between the host and the coprocessor is dictated by the nature of the
application. Highly parallel, vectorized workloads (e.g., linear algebraic calculations) can be executed on the
coprocessor as well as on the host. Serial segments of an application perform significantly better on Intel Xeon
processors, and so do applications with stochastic memory access patterns. The overhead of data transport
over the PCIe bus should also be taken into consideration.
Figure 1.4 summarizes the development options for systems enabled with Intel Xeon Phi coprocessors.

[Figure 1.4 (diagram): the range of development options from "Xeon: Multi-Core Centric" to "MIC: Many-Core Centric": Multi-Core Hosted (general serial and parallel computing), Offload (code with highly-parallel phases), Symmetric (codes with balanced needs), and Many-Core Hosted (highly-parallel codes).]

Figure 1.4: Intel® Architecture benefit: wide range of development options. Breadth, depth, familiar models meet varied application needs. Diagram based on materials designed by Intel Corporation.


Intel Xeon Phi coprocessors are Internet Protocol (IP)-addressable devices running a Linux operating system, which enables straightforward porting of code written for the Intel Xeon architecture to the MIC architecture. This, combined with code portability, makes Intel Xeon Phi coprocessors a compelling platform for heterogeneous clustering. In heterogeneous cluster applications, host processors and MIC coprocessors can be used on an equal basis as individual compute nodes.


1.2 MIC Architecture: Developer’s Perspective


Programming applications for Intel Xeon Phi coprocessors is not significantly different from programming for Intel Xeon processors. Indeed, with both devices featuring the x86 architecture, support for C, C++ and Fortran, and common parallelization libraries, only familiarity with processor programming is required. However, in order to optimize applications, it is helpful to know some of the architectural properties of the coprocessor. Relevant properties are described in this section.

1.2.1 Knights Corner Die Organization


The KNC chip contains 50+ cores and 6+ GB of cached GDDR5 memory onboard. It is connected to
the host system via the PCIe bus, and operates as an IP-addressable device running its own Linux operating
system.
The cores of KNC are successors of the Intel Pentium processor cores. Numerous improved features,
however, differentiate KNC from Pentium:

• The KNC die is manufactured using the 22 nm process technology with 3-D Trigate transistors.

• Improved technology makes it possible to fit over 50 cores on a single die.

• KNC supports 64-bit instructions and 512-bit SIMD vector registers.

• The x86 logic of KNC (excluding the L2 caches) constitutes less than 2% of the die.

• Each x86 core on the Knights Corner chip has its own Performance Monitoring Unit (PMU).

• Each KNC core is capable of 4-way hyper-threading.

Figure 1.5 illustrates the organization of the KNC die.



Figure 1.5: Knights Corner die organization. The cores and GDDR5 memory controllers are connected via an Interproces-
sor Network (IPN) ring, which can be thought of as an independent bi-directional ring. The L2 caches are shown here
as slices per core, but can be thought of as a fully coherent cache of the aggregated slices. Information can be copied to
each core that uses it to provide the fastest possible local access, or a single copy can be present for all cores to provide
maximum cache capacity. This diagram is a conceptual drawing and does not imply actual distances, latencies, etc. Image
and description credit: Intel Corporation.


1.2.2 Core Topology


Figure 1.6 schematically describes the topology of a single KNC core.

• Each core has a dedicated vector unit supporting 512-bit wide registers with support for the Initial
Many-Core Instructions (IMCI) instruction set.
• Scalar instructions are processed in a separate unit.
• The KNC core is an in-order processor with 4-way hyper-threading.

• Every hyper-thread issues instructions every other cycle, and therefore 2 hyper-threads per core are necessary to utilize all available cycles. The additional two hyper-threads may improve performance in the same situations as hyper-threading does on Intel Xeon processors.


Figure 1.6: The topology of a single Knights Corner core. Image credit: Intel Corporation.

The hierarchical cache structure is a significant component of KNC productivity. The details of cache organization and properties are discussed in Section 1.2.3.

Parallel Programming and Optimization with Intel Xeon Phi Coprocessors


1.2. MIC ARCHITECTURE: DEVELOPER’S PERSPECTIVE 7

1.2.3 Memory Hierarchy and Cache Properties


The memory hierarchy in KNC features two levels of caches: the L1 cache nearest to the processor (32 KB per core) and the L2 cache (512 KB per core). The caches are 8-way associative and fully coherent, with the LRU (Least Recently Used) replacement policy.

Cache line size: 64 B
L1 size: 32 KB data, 32 KB code
L1 set conflict: 4 KB (Dcache), 8 KB (Icache)
L1 ways: 8 (Dcache), 4 (Icache)
L1 latency: 1 cycle
L2 to L1 prefetch (vprefetch0) buffer depth: 8
L2 size (inclusive of L1D and L1I unless I$ snooping is off): 512 KB
L2 set conflict: 64 KB
L2 ways: 8
L2 latency: 15-30 cycles depending on load
Memory to L2 prefetch buffer depth: 32
TLB coverage options (L1, data): 64 pages of size 4 KB (256 KB coverage), 8 pages of size 2 MB (16 MB coverage), 32 pages of size 64 KB (2 MB coverage)
TLB coverage (L1, instruction): 32 pages of size 4 KB (shared amongst all threads running on the cores)
TLB coverage (L2): 64 entries, stores 2 MB backup PTEs and PDEs (reducing page walks for 4 KB and 64 KB TLB misses)
TLB miss penalty for the 2 MB L1 TLB: 12 cycles
TLB miss penalty for the 4 KB and 64 KB TLBs: ≈30 cycles (assuming a PDE hit in the L2 TLB)
GDDR bandwidth: theoretical maximum 5.5 GT/s x 16 channels x 4 B/T = 352 GB/s, but ring bandwidth and Error Correction Code (ECC) overheads limit achieved performance to 200 GB/s or less

Table 1.1: Cache properties of the Knights Corner architecture.

Associativity

Eight-way associativity strikes a balance between the low overhead of direct-mapped caches and the versatility of fully-associative caches. An 8-way set associative cache chooses, for each memory address, one of 8 ways of cache (i.e., cache segments) into which the data at that memory address may be placed. Within the way, the data can be placed anywhere.

Replacement Policy

Under the Least Recently Used policy, when some data has to be evicted from the cache in order to load new data, the data that was least recently used is evicted first. LRU is implemented by dedicated hardware units in the cache.


Set Conflicts
To the developer, an important property of multi-way associative caches with LRU is the possibility of a set conflict. A set conflict may occur when the code processes data with a certain stride in virtual memory. For KNC, the stride is 4 KB in the L1 cache and 64 KB in the L2 cache. With this stride, data from memory must be mapped into the same set, and, if LRU is not functioning properly, some data may be evicted prematurely, causing performance loss.
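To illustrate (a sketch of our own, not a listing from the book; the dimensions and padding amount are arbitrary), a column traversal of a matrix whose rows are exactly 4 KB long strides through memory at the L1 set-conflict stride; padding each row by one cache line spreads the accesses over different sets:

  const int n   = 1024;            // 1024 floats per row = 4096 bytes, the L1 set-conflict stride on KNC
  const int pad = 16;              // 16 floats = 64 bytes = one cache line of padding per row
  float a[n][n + pad];             // without "+ pad", a[i][j] and a[i+1][j] map to the same L1 set

  float sum_column(int j) {        // traverses one column of the matrix
    float s = 0.0f;
    for (int i = 0; i < n; i++)
      s += a[i][j];                // with padding, consecutive accesses fall into different cache sets
    return s;
  }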

Coherency
A coherent cache guarantees that when data is modified in one cache, copies of this data in all other
caches will be correspondingly updated before they are made available to the cores accessing these other
caches. In KNC, L2 caches are not truly shared between all cores; each core has its private slice of the
aggregate cache (see Figure 1.5). Therefore, the coherency of the L2 cache comes at the cost of potential
performance loss when data is transferred across the ring interconnect. Generally, data locality is the best way
to optimize cache operation. See Section 4.5 for more information on improving data locality.

Translation Lookaside Buffer (TLB)

Translation Lookaside Buffer, or TLB, is a cache residing on each core that speeds up the lookup of the physical memory address corresponding to a virtual memory address. Entries, or pages, in the TLB can vary in the amount of memory that they map. The physical size of the TLB places restrictions on the number of pages and on the total address range stored in the TLB. When a memory address accessed by the code is not found in the TLB, the TLB entries must be re-built in order to look up that address. This causes a data page walk operation, which is fairly expensive compared to misses in the L1 and L2 caches. Optimal TLB page properties depend on the memory access pattern of the application. As with other cache functions, TLB performance can generally be improved by increasing the locality of data access in time and space. See Section 4.5 for information on TLB performance tuning.

Prefetching
Another important property of caches is prefetching. During program execution, it is possible to request that data be fetched into the cache before the core uses this data. This diminishes the impact of memory latency on performance. Two types of prefetching are available in KNC: software prefetching, when the prefetch instruction is issued by the code in advance of the data usage, and hardware prefetching, when a dedicated hardware unit in the cache learns the data access pattern and issues prefetch instructions automatically. The L2 cache in KNC has a hardware prefetcher, while the L1 cache does not. Normally, Intel compilers automatically introduce L1 prefetch instructions into the compiled code. However, in some cases it may be desirable to manually tune the prefetch distances or to disable software prefetching when it introduces undesirable TLB misses. See Section 4.5.7 for more information on this topic.
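As an illustration of such manual tuning (our own sketch rather than a listing from the book; the loop, array names and prefetch distances are arbitrary, and the pragmas shown are Intel compiler extensions), software prefetching for one array can be tuned or suppressed as follows:

  void scale(float* a, const float* b, int n) {
  #pragma noprefetch a                    // suppress compiler-generated prefetches for the output array
  #pragma prefetch b:1:16                 // prefetch b into L2 (hint 1) 16 iterations ahead
  #pragma prefetch b:0:4                  // prefetch b into L1 (hint 0) 4 iterations ahead
    for (int i = 0; i < n; i++)
      a[i] = 2.0f * b[i];
  }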

Additional Reading
A comprehensive source on microprocessor caches is the book "Computer Architecture: A Quantitative Approach" by Hennessy and Patterson [1].


1.2.4 Technical Specifications


Cumulative technical specifications of Intel Xeon Phi coprocessors based on the Knights Corner chip are listed in Table 1.2.

Process: 22 nm
Peak SP FLOPs, 300 W: 2340 GFLOPS (P1750, +/- 10%)
Peak DP FLOPs, 300 W: 1170 GFLOPS (P1750, +/- 10%)
Peak SP FLOPs, 225 W: 2020 GFLOPS (P1640, +/- 10%)
Peak DP FLOPs, 225 W: 1010 GFLOPS (P1640, +/- 10%)
Reliability: Memory ECC
Max memory size: 3/6/8 GB
Peak memory bandwidth: up to 350 GB/sec
L2 cache per core: 512 KB
L2 cache (all cores): up to 31 MB
PCIe Gen 2 I/O: up to 6.5 GB/s (KNC → Host) and 6.0 GB/s (Host → KNC)
Form factor: double-wide PCIe with fan sink or passive cooling; dense form factor
Transcendentals: Exp, Log, Sqrt, Rsqrt, Recip, Div
Power management: C1, C3, C6
Micro OS: Linux

Table 1.2: Technical specifications of Intel Xeon Phi coprocessors. Table credit: Intel Corporation.
Most properties listed in Table 1.2 have been discussed above. The form factor property describes
the options for the physical packaging of the product. Active (with fan) and passive (fanless) cooling solutions
require more space inside the system than the dense form factor solution; however, the latter requires a
custom, usually liquid-based, cooling solution. The dense form-factor option is shown in Figure 1.7 as a visual
supplement to the hardware description.

Figure 1.7: KNC without a heat sink (dense form factor). A traditional active-cooling solution is shown in Figure 1.2.


1.2.5 Integration into the Host System


From a developer's perspective, an Intel Xeon Phi coprocessor is a compute node with an IP address and
a Linux operating system running on it. In particular:

• the coprocessor responds to ping;

• it runs a Secure Shell (SSH) server, which allows the user to log into the coprocessor and obtain a shell;

• it hosts a virtual filesystem with standard Linux ownerships and permissions;

• and it is capable of running other services such as the Network File System (NFS).

On the operating system level, the above-mentioned functionality is provided by the MIC Platform
Software Stack (MPSS), a suite of tools including drivers, daemons, and command-line and graphical utilities. The
role of MPSS is to boot the coprocessor, load the Linux operating system, populate the virtual file system, and
enable the host system user to interact with the Intel Xeon Phi coprocessor in the same way as the user would
interact with an independent compute node on the network.

Figure 1.8: MPSS, the MIC Platform Software Stack. The diagram contrasts a host-side offload application, a
target-side "native" application, and a target-side offload application. In each case, user code relies on offload
libraries, user-level drivers, and standard OS, Intel, or third-party libraries, while system-level coprocessor
support libraries, tools, and drivers on the host communicate over the PCIe bus with the coprocessor communication
and application-launching support running on the Linux uOS; a virtual terminal session is available via SSH.

Figure 1.8 illustrates the role of MPSS in the operation of an Intel Xeon Phi coprocessor. User-level code
for the coprocessor runs in an environment that resembles a compute node. The network traffic is carried over
the PCIe bus instead of network interconnects.
User applications can be built in two ways. For high performance workloads, Intel compilers can be used
to compile C, C++ and Fortran code for the MIC architecture. Compilers are not a part of the MPSS; they
are distributed in additional software suites such as Intel Parallel Studio XE or Intel Cluster Studio XE (see
Section 1.1.3 and Section 1.2.7). For the Linux operating system running on Intel Xeon Phi coprocessors, a
specialized version of the GNU Compiler Collection (GCC) is available. However, this specialized GCC is not
capable of automatic vectorization and other optimizations that the Intel compilers perform.

1.2.6 Intel Xeon Processors versus Intel Xeon Phi Coprocessors: Developer Experience
The following is an excerpt from an article “Programming for the Intel Xeon family of products
(Intel Xeon processors and Intel Xeon Phi coprocessors)” by James Reinders, Intel’s Chief Evangelist and
Spokesperson for Software Tools and Parallel Programming [2].

Forgiving Nature: Easier to port, flexible enough for initial base performance

Because an Intel Xeon Phi coprocessor is an x86 SMP-on-a-chip, it is true that a port to an Intel Xeon Phi
coprocessor is often trivial. However, the high degree of parallelism of Intel Xeon Phi coprocessors requires
applications that are structured to use the parallelism. Almost all applications will benefit from some tuning
beyond the initial base performance to achieve maximum performance. This can range from minor work
to major restructuring to expose and exploit parallelism through multiple tasks and use of vectors. The
experiences of users of Intel Xeon Phi coprocessors and the "forgiving nature" of this approach are generally
promising but point out one challenge: the temptation to stop tuning before the best performance is reached.
This can be a good thing if the return on investment of further tuning is insufficient and the results are good
enough. It can be a bad thing if expectations were that working code would always be high performance.
There is no free lunch! The hidden bonus is the "transforming-and-tuning" double advantage of programming
investments for Intel Xeon Phi coprocessors that generally applies directly to any general-purpose processor
as well. This greatly enhances the preservation of any investment to tune working code by applying to other
processors and offering more forward scaling to future systems.


Transformation for Performance

There are a number of possible user-level optimizations that have been found effective for ultimate performance.
These advanced techniques are not essential. They are possible ways to extract additional performance for
your application. The "forgiving nature" of Intel Xeon Phi coprocessors makes transformations optional but
should be kept in mind when looking for the highest performance. It is unlikely that peak performance will be
achieved without considering some of these optimizations:

• Memory access and loop transformations (e.g., cache blocking, loop unrolling, prefetching, tiling,
loop interchange, alignment, affinity).

• Vectorization works best on unit-stride vectors (the data being consumed is contiguous in memory).
Data structure transformations can increase the amount of data accessed with unit strides (such as AoS
(Array of Structures) to SoA (Structure of Arrays) transformations, or recoding to use packed arrays
instead of indirect accesses).

• Use of full (not partial) vectors is best, and data transformations to accomplish this should be considered.

• Vectorization is best with properly aligned data.

• Large page considerations.

• Algorithm selection (change) to favor those that are parallelization and vectorization friendly.

A detailed description of Intel Xeon Phi coprocessor programming models can be found in Chapter 2. A
thorough exploration of the optimization techniques mentioned in this quote is undertaken in Chapter 4.
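To make the AoS-to-SoA transformation mentioned in the quote above concrete, the following sketch (added for illustration; the type and field names are hypothetical) contrasts the two data layouts for a set of 3D particles.

/* Array of Structures (AoS): the fields of one particle are adjacent, so a
   loop over all x coordinates strides through memory by sizeof(ParticleAoS). */
typedef struct { float x, y, z; } ParticleAoS;

/* Structure of Arrays (SoA): each coordinate is stored contiguously, so a
   loop over x is unit-stride and maps naturally onto wide SIMD registers.   */
#define N 1024
typedef struct { float x[N], y[N], z[N]; } ParticleSoA;

void shift_x(ParticleSoA* p, float dx) {
  for (int i = 0; i < N; i++)   /* unit-stride, vectorization-friendly loop  */
    p->x[i] += dx;
}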


1.2.7 Development Tools


As described in Section 1.1.3, the same tools and parallelization libraries are suitable for Intel Xeon
processors and Intel Xeon Phi coprocessors. Figure 1.9 illustrates Intel software development product suites
recommended for software development on MIC-enabled platforms: Intel Parallel Studio XE and Intel
Cluster Studio XE. Both suites contain a comprehensive set of tools for the development and optimization of
applications for Intel Xeon processors and Intel Xeon Phi coprocessors. In addition, Intel Cluster Studio XE
contains distributed-memory development and tuning tools.


Figure 1.9: Intel software development tool suites for shared-memory and distributed-memory application design.
Intel Xeon processors and Intel Xeon Phi coprocessors are supported by both suites.

1.3 Identifying Algorithms Appropriate for Execution on Intel Xeon Phi Coprocessors
Not all algorithms and applications are expected to be highly efficient on Intel Xeon Phi coprocessors
even with the best optimization effort. For a number of applications, general-purpose processors will remain a
better option. This section discusses the properties of applications and algorithms that identify applications as
“MIC-friendly” or not.

1.3.1 Task-Parallel Scalability


Intel Xeon Phi coprocessors contain more than 50 cores clocked at approximately 1 GHz, whereas
today's Intel Xeon processors can have up to 8 cores clocked at around 3 GHz. One cannot expect better
performance from a single power-efficient, low clock speed Knights Corner core than from a hardware-rich,
high clock speed Sandy Bridge core. Indeed, the key to good performance on Intel Xeon Phi coprocessors is
parallelism. Moreover, coprocessor applications require a high level of parallelism: they should scale well
to at least 100 threads, considering the 4-way hyper-threading capability of KNC cores.
Figure 1.10 illustrates the necessity of parallelism for Intel Xeon Phi utilization.

e ng
nh
Yu
serial serial serial
application processor performance
r
fo
d
re
pa

parallel serial serial


application processor performance
re
yP
el
iv

serial parallel serial


us

application processor performance


cl
Ex

parallel parallel parallel


application processor performance

Figure 1.10: Diagram credit: James Reinders, Intel Corporation

Examples

1. Compilation of program code is an example of a task more suited for Intel Xeon processors
than for Intel Xeon Phi coprocessors, because compilation involves inherently sequential algorithms.

2. Monte Carlo simulations are well-suited for Intel Xeon Phi coprocessors because of their inherent
massive parallelism. See, however, a comment in Section 1.3.2.


1.3.2 Data-Parallel Component


Comparing the number of clock cycles per second issued by a 60-core coprocessor (60 cores × 1 GHz)
to the same metric of a two-socket host system with 8-core processors (2 × 8 × 3.4 GHz), one can see that
the coprocessor is executing only some 10% more clock cycles than the host. However, an Intel Xeon Phi
coprocessor can provide significantly greater arithmetic performance than the host system. Why?
The answer is: vector operations. While Sandy Bridge cores support AVX instructions with 256-bit wide
SIMD registers, KNC cores support IMCI instructions with 512-bit wide SIMD registers. Neglecting other
aspects of performance, such as memory bandwidth, pipelining and transcendental arithmetic instructions, one
can see that KNC has the capability to perform roughly twice as many arithmetic operations per second as
Sandy Bridge.
The flip side of this feature is also true. Suppose a potentially vectorizable calculation is running in
scalar (i.e., non-SIMD) mode on a Sandy Bridge processor in single precision. This calculation experiences
up to a 256 / (8 × sizeof(float)) = 8-fold performance penalty for failure to employ vector arithmetic. On
KNC, the penalty for the lack of vectorization is a factor of 512 / (8 × sizeof(float)) = 16.
Therefore, if a compute-bound application does not employ vectorization, it is unlikely to exhibit better
performance on an Intel Xeon Phi coprocessor than on an Intel Xeon processor. See Section 3.1 for more
information about vectorization.
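As an added illustration of these factors (the function and array names are hypothetical), consider a simple single-precision loop: each 512-bit IMCI register holds 16 floats, so a scalar build leaves up to 16 times the arithmetic throughput of KNC unused.

/* Single-precision "scale-and-add" loop: with 512-bit vectors, 16 floats
   (16 x 32 bits = 512 bits) are processed per vector instruction, versus one
   float per instruction when the loop runs in scalar mode. */
void saxpy(float* restrict y, const float* restrict x, float a, int n) {
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];   /* unit-stride loop that the compiler can vectorize */
}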

W
Figure 1.11 is a visual pointer to the need for SIMD parallelism (vectorization) in KNC workloads.
Figure 1.11: Scalar/vector and single/multi-threaded performance of Intel Xeon processors and Intel Xeon Phi
coprocessors (relative performance on a logarithmic scale, from scalar single-threaded to vectorized multi-threaded
code). Diagram credit: James Reinders, Intel Corporation.

Examples
1. Monte Carlo algorithms may be well-suited for execution on Intel Xeon Phi coprocessors if they either
use vector operations for the calculations inside each Monte Carlo iteration, or perform multiple simultaneous
Monte Carlo iterations using multiple SIMD lanes.
2. For common linear algebraic operations, there exist SIMD-friendly algorithms and implementations.
These calculations are well-suited for Intel Xeon Phi coprocessors.


1.3.3 Memory Access Pattern


Memory streaming in Intel Xeon Phi coprocessors is fast. The theoretical limit of the memory bandwidth is
384 GB/s, and practical performance reaches up to a half of that. This is almost 3x the memory bandwidth
of a two-socket system with Intel Xeon processors. Therefore, for applications in which the memory access
pattern is streamlined, Intel Xeon Phi coprocessors can yield better performance.
However, the streaming bandwidth is irrelevant in cases where the memory access pattern is complicated. In
these cases, the memory latency or cache performance is the limiting performance factor. Even though the
GDDR5 RAM of Intel Xeon Phi coprocessors is cached by a total of 30 MB of L2 cache, coprocessor caches
are less powerful than those in Intel Xeon processors. For instance, the lack of an L1 prefetcher places greater
demands on code optimization.
Considering the above facts, one can only expect better performance from an Intel Xeon Phi coprocessor
than from an Intel Xeon processor in two cases:

a) the data set is so small, or the arithmetic intensity (number of operations on every word fetched from
memory) is so high, that memory performance is irrelevant — the compute-bound case, or

b) the memory access pattern is streamlined enough that the application is limited by memory bandwidth and
not memory latency — the bandwidth-bound case.

See also Section 4.5 for a discussion of memory and cache traffic tuning. Multi-threading is as important for
bandwidth-bound applications as it is for compute-bound workloads, because all available memory controllers
must be utilized. Figure 1.12 is a visual reminder of this fact.
Figure 1.12: Parallelism and bandwidth-bound application performance: a single thread cannot utilize all memory
controllers, so bandwidth-limited workloads also require multi-threading on both Intel Xeon processors and Intel
Xeon Phi coprocessors. Diagram credit: James Reinders, Intel.

Examples
1. Matrix transposition involves a memory access pattern with a stride equal to one of the matrix dimensions.
Therefore, this operation may be unable to fully utilize the coprocessor memory bandwidth;
2. Some stencil operations on dense data have a streamlined memory access pattern. These algorithms are
a good match for the Intel Xeon Phi architecture.
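The contrast between the two examples above can be seen directly in the access patterns; the sketch below (added for illustration, with hypothetical function names) compares a unit-stride streaming loop with a loop whose stride equals the matrix dimension.

/* Streaming (unit-stride) access: consecutive iterations touch consecutive
   addresses, so the memory controllers can be driven at high bandwidth.    */
void sum_rows(const float* a, float* s, int n) {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      s[i] += a[(long)i * n + j];     /* stride of 1 element                 */
}

/* Strided access (as in a naive transpose-like traversal): each iteration
   jumps n elements, stressing caches, the TLB and hardware prefetching.    */
void sum_cols(const float* a, float* s, int n) {
  for (int j = 0; j < n; j++)
    for (int i = 0; i < n; i++)
      s[j] += a[(long)i * n + j];     /* stride of n elements                */
}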


1.3.4 PCIe Bandwidth Considerations


When data needs to be transferred across the PCIe bus from the host to the coprocessor, or between
coprocessors, consideration should be given to the data transfer time. Depending on how much work will be
done on the coprocessor with the data before the result is returned, and depending on how quickly this work
will be done, the usage of the coprocessor may or may not be justified.

Ratio of Arithmetic Operations to Data Size


The following technical characteristics can be used to estimate the benefit of coprocessor usage: the PCIe v2.0
bandwidth of 6 GB/s, the 1 TFLOP/s theoretical maximum arithmetic performance of the coprocessor, and the
200 GB/s practical peak memory bandwidth of Intel Xeon Phi coprocessors. These numbers yield the
following "rules of thumb" for identifying situations when the PCIe overhead is insignificant:

a) for compute-bound calculations, if the coprocessor performs many more than

   N_a = (1000 GFLOP/s) / (6 GB/s / sizeof(double)) ≈ 1300    (1.1)

lightweight floating-point operations (additions and multiplications) per transferred floating-point number, then
the data transport overhead is insignificant;

b) for compute-bound calculations with division and transcendental operations, the arithmetic intensity
threshold at which the communication overhead is justified is lower than 1300 operations per transferred
floating-point number;

c) for bandwidth-bound calculations, if the data in the coprocessor memory is read many more than

   N_s = (200 GB/s) / (6 GB/s) ≈ 30 times    (1.2)

in streaming fashion, then the data transport across the PCIe bus is likely not to be the bottleneck.
el
Arithmetic Complexity

In this context, it is informative to establish a link between the complexity of an algorithm and its ability
to benefit from Intel Xeon Phi coprocessors. Namely,

• if the data size is n, and the arithmetic complexity (i.e., the number of arithmetic operations) of the
algorithm scales as O(n), such an algorithm may experience a bottleneck in the data transport. This is
because the coprocessor performs a fixed number of arithmetic operations on every number sent across
the PCIe bus. If this number is too small, the data transport overhead is not justifiable.

• for algorithms in which the arithmetic complexity scales faster than O(n) (e.g., O(n log n) or O(n²)),
larger problems are likely to be less limited by PCIe traffic than smaller problems, as the arithmetic
intensity in this case increases with n. The stronger the arithmetic complexity scaling, the less important
the communication overhead.

Masking
Masking the data transfer time can potentially increase overall performance by up to a factor of two. In
order to mask data transfer, the asynchronous transfer and asynchronous execution capabilities of Intel Xeon
Phi coprocessors can be used. More details are provided in Section 2.2.9 and Section 2.3.1.


Examples
1. Computing a vector dot-product, when one or both vectors need to be transferred to the coprocessor, is an
inefficient use of Intel Xeon Phi coprocessors, because the complexity of the algorithm is O(n), and only
2 arithmetic operations per transferred floating-point number are performed. The PCIe communication
overhead is expected to be too high, and such a calculation can be done more efficiently on the host;
2. Computing a matrix-vector product with a square matrix, when only the vector must be transferred to
the coprocessor, is expected to have a small PCIe communication overhead if the vector size is large
enough (so that the communication latency is unimportant). The algorithm complexity is O(n²), and
therefore each transferred floating-point number will be used n times.
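As an added numeric illustration of Equation 1.1 (the helper function and its thresholds are hypothetical and only restate the 1 TFLOP/s and 6 GB/s figures quoted above), the snippet below checks whether a kernel performs enough operations per transferred double to hide the PCIe traffic.

#include <stdio.h>

/* Rough rule-of-thumb check based on Eq. (1.1): is the arithmetic intensity
   (operations per transferred double) high enough to justify offloading?   */
int offload_worthwhile(double flop_per_word) {
  const double peak_flops = 1.0e12;  /* ~1 TFLOP/s coprocessor peak          */
  const double pcie_bw    = 6.0e9;   /* ~6 GB/s practical PCIe v2.0 rate     */
  const double word_size  = 8.0;     /* sizeof(double) in bytes              */
  return flop_per_word > peak_flops / (pcie_bw / word_size);  /* ~1300       */
}

int main(void) {
  printf("dot product (2 flop/word):        %d\n", offload_worthwhile(2.0));
  printf("mat-vec, n = 10000 (n flop/word): %d\n", offload_worthwhile(1.0e4));
  return 0;
}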


1.4 Installing the Intel Xeon Phi Coprocessor and MPSS
This section overviews the process of installing an Intel Xeon Phi coprocessor and related software.

1.4.1 Hardware Installation


Computing systems enabled with Intel Xeon Phi coprocessors provisioned by Colfax International come
with Intel Xeon Phi coprocessors and related software and drivers already installed. System configurations
validated for use with Intel Xeon Phi coprocessors can be found at http://www.colfax-intl.com/xeonphi/.
Self-installation of Intel Xeon Phi coprocessors into computing systems not validated for usage with these
devices is not recommended. If thermal or electrical specifications of the host system are not met, the system
or the coprocessor can be irreparably damaged.
In order to verify that an Intel Xeon Phi coprocessor is installed in the system, boot the system and use
the Linux tool lspci, as shown in Listing 1.1.

user@host% lspci | grep -i "co-processor"
82:00.0 Co-processor: Intel Corporation Device 2250

Listing 1.1: Using lspci to check whether an Intel Xeon Phi coprocessor is installed in the computing system.

1.4.2 Installing MPSS


Drivers and administrative tools required for Intel Xeon Phi operation are included in the MPSS (Intel
MIC Platform Software Stack) package. The role of MPSS is to boot the Intel Xeon Phi coprocessor, populate
its virtual file system and start the operating system on the coprocessor, to provide connectivity protocols, and
to enable management and monitoring of the coprocessor using specialized tools.
As is the case with hardware installation, computing systems provisioned by Colfax International will
have the drivers already installed. For MPSS installation on other systems, instructions can be found in the
corresponding "readme" file included with the software stack. MPSS can be freely downloaded from the Intel
Web site [3].
After MPSS has been installed, initial configuration steps are required, as shown in Listing 1.2.

user@host% micctrl --initdefaults

Listing 1.2: Initial configuration of Intel Xeon Phi coprocessor.

That command will create configuration files /etc/sysconfig/mic/* and the system-wide configuration
file /etc/modprobe.d/mic.conf. In addition, the hosts file /etc/hosts will be modified by
MPSS. The IP addresses and hostnames of Intel Xeon Phi coprocessors will be placed into that file.
The Intel Xeon Phi coprocessor driver is a part of MPSS, and it is available as the system service mpss.
It can be enabled at boot by running the following two commands:

user@host% chkconfig --add mpss


user@host% chkconfig mpss on

Listing 1.3: Enabling mpss service at boot.


In order to stop, start or restart the driver, the following command can be used:

user@host% sudo service mpss [stop | start | restart | status]

Listing 1.4: Controlling the mpss service.

In order to verify the installation and configuration of MPSS, the command miccheck can be used. For
more information on MPSS, refer to Section 1.5.

1.4.3 Intel Compilers and Miscellaneous Tools


Separately from MPSS, a development workstation with an Intel Xeon Phi coprocessor must have Intel
software development tools installed, such as compilers, parallelization libraries and performance tuning
utilities. These products are not a part of the MPSS.
As of the day this document is being written, only the Intel compilers support high-performance
code compilation for Intel Xeon Phi coprocessors (see also the comment about GCC for the MIC architecture
in Section 1.2.5). While a stand-alone Intel compiler for C, C++ or Fortran is sufficient for the development
of applications for Intel Xeon Phi coprocessors, we recommend a suite of Intel software development tools
called Intel Parallel Studio XE 2013. This suite includes valuable additional libraries and performance tuning
tools that can accelerate the process of software development. For cluster environments, a better option is Intel
Cluster Studio XE, which includes the Intel MPI library and the Intel Trace Analyzer and Collector in addition to
all of the components of Intel Parallel Studio XE. In order to choose the correct software for your needs, see
the product comparison chart available at http://software.intel.com/en-us/intel-xe-product-comparison.


Intel compilers and product suites such as Intel Parallel Studio XE can be purchased directly from Intel
or from one of the authorized resellers. Colfax International is an authorized reseller offering discounts for
bundling software licenses with hardware purchases, and academic discounts for eligible customers. For
more information, refer to http://www.colfax-intl.com/ms/devws/developer-ws-microsite-home.html or contact
[email protected].
Installation instructions are included with the downloadable software suites. After installation, it is
important to set up the environment variables for Intel Parallel Studio XE as shown in Listing 1.5.
user@host% /opt/intel/composerxe/bin/compilervars.sh intel64

Listing 1.5: Enabling environment variables for the intel64 architecture.
The setup of environment variables using the compilervars script can be automated. The automation
process depends on the operating system. For example, on RedHat Linux, in order to automate loading the
script for an individual user, place the command shown in Listing 1.5 into the file ~/.bashrc.
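For example, appending a line like the following to ~/.bashrc (a sketch; the installation path may differ on your system) makes the environment available in every new login shell:

# Load the Intel Parallel Studio XE environment for the intel64 architecture
source /opt/intel/composerxe/bin/compilervars.sh intel64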
In order to verify that the compilers have been successfully installed, run the following commands:

user@host% icc -v
icc version 13.1.0 (gcc version 4.4.6 compatibility)
user@host% icpc -v
icpc version 13.1.0 (gcc version 4.4.6 compatibility)
user@host% ifort -v
ifort version 13.1.0

Listing 1.6: Verifying the installation of the Intel compilers.


1.4.4 Restoring MPSS Functionality after Linux Kernel Updates


The MPSS integrates with the operating system by means of a kernel module. This kernel module may
become non-functional after the vendor of the operating system provides an update to the Linux kernel. In this
case, starting MPSS will fail with the following error message:

Starting MPSS failed: module MIC not found.

and the Intel Xeon Phi coprocessor will be unavailable.

When this happens, there are two solutions:

A) Reboot the system, enter the boot menu and choose to boot the old version of the Linux kernel. Alter-
natively, the choice of the old kernel may be made permanent by modifying the Grub configuration file
/boot/grub/grub.conf. This method is a workaround, and should only be used temporarily, until
method B can be applied.
B) Rebuild the MIC kernel module.

In order to rebuild the MIC kernel module, superuser access is required. The steps illustrated in Listing 1.7
describe the rebuild process of the kernel module.

user@host% sudo su                          # Become superuser
root@host% service mpss status              # Verify MPSS status
mpss is stopped
root@host% service mpss start               # Verify MPSS error message
Starting MPSS Stack: failed
root@host% # Some packages will be necessary for rebuilding:
root@host% yum -y install rpm-build kernel-devel kernel-headers
# some output skipped
root@host% ls KNC*                          # Locate the MPSS archive
KNC_gold-2.1.4346-16-rhel-6.3.gz
root@host% tar -xf KNC_gold-2.1.4346-16-rhel-6.3.gz
root@host% cd KNC_gold-2.1.4346-16-rhel-6.3/src
root@host% # Rebuild the MPSS kernel module:
root@host% rpmbuild --rebuild intel-mic-knc-kmod-2.1.4346-16.el6.src.rpm
# output skipped
root@host% cd ..
root@host% mkdir old_modules                # store old modules
root@host% mv intel-mic-knc-kmod* old_modules/   # save old module
root@host% # Fetch the newly rebuilt module
root@host% mv /root/rpmbuild/RPMS/x86_64/intel-mic-knc-kmod* ./
root@host% yum -y remove intel-mic-knc-kmod # uninstall old module
root@host% yum install intel-mic*.rpm       # install new module
root@host% service mpss start               # start MPSS
Starting MPSS Stack:                        [ OK ]
mic0: online (mode: linux image: /lib/firmware/mic/uos.img)
mic1: online (mode: linux image: /lib/firmware/mic/uos.img)
root@host% exit                             # end the superuser session
user@host%

Listing 1.7: Restoring MPSS functionality after a Linux kernel update by rebuilding the MIC kernel module.

1.5 MPSS Tools and Linux Environment on Intel Xeon Phi Coprocessors
This section lists the essential tools and configuration options for managing the Intel Xeon Phi coprocessor
operating environment.

1.5.1 mic* Utilities


This section describes the MPSS utilities for the management and diagnostics of Intel Xeon Phi copro-
cessors.
micinfo   a system information query tool,
micsmc    a utility for monitoring the physical parameters of Intel Xeon Phi coprocessors: model, memory,
          core rail temperatures, core frequency, power usage, etc.,
micctrl   a comprehensive configuration tool for the Intel Xeon Phi coprocessor operating system,
miccheck  a set of diagnostic tests for the verification of the Intel Xeon Phi coprocessor configuration,
micrasd   a host daemon that logs hardware errors reported by Intel Xeon Phi coprocessors,
micflash  an Intel Xeon Phi flash memory agent.
r Yu
Most of the administrative tools and utilities can be found in the /opt/intel/mic/bin directory. Some
of these utilities require superuser privileges. In order to facilitate the path lookup for these tools, modify the
PATH environment variable, as shown in Listing 1.8.

user@host% export PATH=/opt/intel/mic/bin:$PATH

Listing 1.8: Setting the path lookup for MPSS utilities.

The information in this section is a brief overview of the above-mentioned tools. As usual, the usage and
arguments of these tools can be obtained by running any of the tools with the argument --help or by
using man, as illustrated in Listing 1.9.

user@host% micctrl --help


...
user@host% man micsmc

Listing 1.9: Obtaining help information on MPSS utilities.


micinfo: coprocessor, firmware, driver information


The micinfo tool can be used in order to obtain detailed information about the Intel Xeon Phi
coprocessor, installed system and the driver version:

user@host% /opt/intel/mic/bin/micinfo

VERSION: Copyright 2011-2012 Intel Corporation All Rights Reserved.
VERSION: ODM/OEM Tools
VERSION: 3126-12

MicInfo Utility Log
Created Mon Aug 20 11:32:50 2012

System Info
        Host OS              : Linux
        OS Version           : 2.6.32-71.el6.x86_64
        Driver Version       : 3126-12
        MPSS Version         : 2.1.3126-12
        Host Physical Memory : 65922 MB
        CPU Family           : GenuineIntel Family 6 Model 45 Stepping 5
        CPU Speed            : 1200
...

Listing 1.10: Example of micinfo tool output.


Using the -listdevices option provides a list of the Intel Xeon Phi coprocessors present in the system.
user@host% /opt/intel/mic/bin/micinfo -listdevices

VERSION: Copyright 2011-2012 Intel Corporation All Rights Reserved.
VERSION: ODM/OEM Tools
VERSION: 3653-8

MicInfo Utility Log
Created Fri Aug 31 13:22:22 2012

List of Available Devices

deviceId | bus# | pciDev# | hardwareId
---------|------|---------|-----------
       0 |    5 |       0 |   22508086
       1 |   83 |       0 |   22508086
--------------------------------------

Listing 1.11: Listing available Intel Xeon Phi coprocessors with the micinfo utility.

To request detailed information about a specific device, the option -deviceinfo <number> should
be used. Additionally, the information displayed by this command can be narrowed down by including the
option -group <group name>, where valid group names are:


• Version

• Board
• Core
• Thermal
• GDDR

For instance, the following shell command returns the information about the total number of cores on the
first Intel Xeon Phi coprocessor, current voltage and frequency:

user@host% sudo /opt/intel/mic/bin/micinfo -deviceinfo 0 -group core

VERSION: Copyright 2011-2012 Intel Corporation All Rights Reserved.
VERSION: ODM/OEM Tools
VERSION: 3653-8

MicInfo Utility Log
Created Fri Aug 31 13:36:03 2012

Device No: 0, Device Name: K1OM

Core
        Total No of Active Cores: 60
        Voltage                 : 944000 uV
        Frequency               : 1000000 kHz

Listing 1.12: Printing out detailed information about the cores of the first Intel Xeon Phi coprocessor.

micsmc: Real-Time Monitoring Tool


The micsmc tool returns information about the physical parameters of the Intel Xeon Phi coprocessor:
processor, memory, and core rail temperatures; core frequency; and power usage. micsmc can also be used for
viewing error logs, monitoring and connecting to Intel Xeon Phi coprocessors, and viewing and managing log files;
root/admin users can manage per-coprocessor or per-node settings, such as ECC, Turbo Mode, and power
states.
The micsmc tool operates in two modes: Graphical User Interface (GUI) mode and Command Line
(CLI) mode. In order to invoke the GUI mode, micsmc should be executed without any additional parameters.
In this mode, the tool provides continuously updated information on Intel Xeon Phi coprocessor core utilization,
core temperature, memory usage, and power usage statistics. The CLI mode is activated with command-line
arguments. This mode produces similar information, but in a one-shot operation, which allows usage in a
script environment.
micsmc invoked in the command-line mode accepts the following arguments:

-c or --cores returns the average and per-core utilization levels for each available board in the system.

-f or --freq returns the clock frequency and power levels for each available board in the system.

-i or --info returns general system info.

-m or --mem returns memory utilization data.

-t or --temp returns temperature levels for each available board in the system.

--pwrenable [cpufreq | corec6 | pc3 | pc6 | all] enables the specified power management
features and disables the unspecified ones.

--pwrstatus returns the status of power management features for each coprocessor.

--turbo [status | enable | disable] returns or modifies the Turbo Mode status on all coprocessors.

--ecc [status | enable | disable] returns or modifies the ECC status on all coprocessors.

-a or --all results in the processing of all valid options, excluding --help, --turbo, and --ecc.

-h or --help displays command-specific help information.


Example output in the CLI mode is shown in Listing 1.13.

user@host% sudo /opt/intel/mic/bin/micsmc -a

Card 1 (info):
   Device Name: ............. KNC
   Device ID: ............... 2250
   Number of Cores: ......... 60
   OS Version: .............. 2.6.34-ga914e40
   Flash Version: ........... 2.1.02.0314
   Driver Version: .......... DRIVERS_3126-12
   Stepping: ................ 0
   SubStepping: ............. 2500

Card 1 (temp):
   Cpu Temp: ................ 59.00 C
   Memory Temp: ............. 42.00 C
   Fan-In Temp: ............. 31.00 C
   Fan-Out Temp: ............ 42.00 C
   Core Rail Temp: .......... 43.00 C
   Uncore Rail Temp: ........ 44.00 C
   Memory Rail Temp: ........ 44.00 C

Card 1 (freq):
   Core Frequency: .......... 1.00 GHz
   Total Power: ............. 112.00 Watts
   Lo Power Limit: .......... 315.00 Watts
   Hi Power Limit: .......... 395.00 Watts
   Phys Power Limit: ........ 395.00 Watts

Card 1 (mem):
   Free Memory: ............. 7252.92 MB
   Total Memory: ............ 7693.34 MB
   Memory Usage: ............ 440.41 MB

Card 1 (cores):
   Card Utilization: User: 0.00%, System: 0.16%, Idle: 99.84%
   Per Core Utilization (60 cores in use)
      Core #1: User: 0.00%, System: 4.03%, Idle: 95.97%
      Core #2: User: 0.00%, System: 0.00%, Idle: 100.00%
      ...
      Core #60: User: 0.00%, System: 4.57%, Idle: 95.43%

Listing 1.13: micsmc command-line (CLI) output showing the Intel Xeon Phi coprocessor core utilization, temperature,
memory usage, and power usage statistics.


Running micsmc without the command-line arguments will open the application’s GUI and display
system physical characteristics with graphical primitives, as demonstrated in Figure 1.13 and Figure 2.1.

Figure 1.13: The GUI mode of the micsmc tool illustrating the execution of a workload on a system with two Intel
Xeon Phi coprocessors.



micctrl: Power State and Configuration Management


The micctrl command is the power state and configuration system administration tool. The options
supported by this command are:

-h or --help displays help.

-s or --status shows the boot state of Intel Xeon Phi coprocessors in the system.

-b or --boot boots one or more Intel Xeon Phi coprocessors. The MPSS service must be running.

-r or --reset resets one or more Intel Xeon Phi coprocessors.

-w or --wait waits for one or more Intel Xeon Phi coprocessors to leave the booting or resetting states.

--initdefaults used once after the MPSS software install. Creates the unique per-coprocessor
configuration files in /etc/sysconfig/mic. The MPSS service must not be running.

--resetconfig used after changes are made to configuration files. It recreates all the default files based
on the new configuration. The MPSS service must not be running.

--resetdefaults used to reset configuration files back to defaults if hand editing of files has created
unknown situations. The MPSS service must not be running.

--cleanconfig [MIC list] removes: the filelist file and directories associated with the MicDir
configuration parameter; the image file specified by the FileSystem configuration parameter; the
/etc/sysconfig/mic/micN.conf file(s) associated with the coprocessor(s); and the
/etc/sysconfig/mic/default.conf file.

--configuser | --useradd | --userdel | --passwd | --groupadd | --groupdel
modify the Intel Xeon Phi coprocessor filesystem to configure, add and delete Linux users and groups.

Listing 1.14 demonstrates using micctrl.


user@mic% micctrl -s
mic0: online (mode: linux image: /lib/firmware/mic/uos.img)
mic1: online (mode: linux image: /lib/firmware/mic/uos.img)
user@mic%

Listing 1.14: Using the micctrl utility to query the power status of coprocessors.


miccheck: Configuration Test


miccheck runs a set of diagnostic tests in order to verify the configuration of an Intel Xeon Phi
coprocessor system. By default, all available tests are run on all Intel Xeon Phi coprocessors. However, a
subset of tests and devices can be selected. Listing 1.15 demonstrates the output of miccheck.

user@host% miccheck

miccheck 2.1, created 10:23:01 Oct 19 2012
Copyright 2011-2012 Intel Corporation All rights reserved

Test 1 Ensure installation matches manifest : OK
Test 2 Ensure host driver is loaded : OK
Test 3 Ensure driver matches manifest : OK
Test 4 Detect all MICs : OK
MIC 0 Test 1 Find the MIC : OK
MIC 0 Test 2 Read device configuration file : OK
MIC 0 Test 3 Ensure IP address is unique : OK
MIC 0 Test 4 Ensure MAC address is unique : OK
MIC 0 Test 5 Check the POST code via PCI : OK
MIC 0 Test 6 Ping the MIC : OK
MIC 0 Test 7 Connect to the MIC : OK
MIC 0 Test 8 Check for normal mode : OK
MIC 0 Test 9 Check the POST code via SCIF : OK
MIC 0 Test 10 Send data to the MIC : OK
MIC 0 Test 11 Compare the PCI configuration : OK
MIC 0 Test 12 Ensure Flash version matches manifest : OK
MIC 0 Test 13 Ping the host : OK
MIC 1 Test 1 Find the MIC : OK
MIC 1 Test 2 Read device configuration file : OK
MIC 1 Test 3 Ensure IP address is unique : OK
MIC 1 Test 4 Ensure MAC address is unique : OK
MIC 1 Test 5 Check the POST code via PCI : OK
MIC 1 Test 6 Ping the MIC : OK
MIC 1 Test 7 Connect to the MIC : OK
MIC 1 Test 8 Check for normal mode : OK
MIC 1 Test 9 Check the POST code via SCIF : OK
MIC 1 Test 10 Send data to the MIC : OK
MIC 1 Test 11 Compare the PCI configuration : OK
MIC 1 Test 12 Ensure Flash version matches manifest : OK
MIC 1 Test 13 Ping the host : OK

Status: OK.

Listing 1.15: miccheck tests the Intel Xeon Phi coprocessor's configuration and functionality.


micrasd
micrasd is an application running on the host system to handle and log the hardware errors reported by
Intel Xeon Phi coprocessors. It can be run as a service daemon. This tool requires administrative privileges.
The following command starts micrasd:

root@host% /opt/intel/mic/bin/micrasd

Listing 1.16: Starting micrasd, the Intel Xeon Phi coprocessor hardware error logger.

In order to run the utility in the daemon mode, the command line argument -daemon should be used. In
the daemon mode, micrasd will run in the background and handle/log errors silently.
The usage of micrasd:

root@host% micrasd [-daemon] [-loglevel LEVELS] [-help]

Listing 1.17: micrasd usage.

-daemon to run it in daemon mode

-loglevel to set the logging level, from 1 to 7
    bit 0: info messages
    bit 1: warning messages
    bit 2: error messages
    default: all messages ON

-help to show the help info.


The errors will be logged into the Linux system log /var/log/messages with the tag "micras".



micflash
The primary purpose of this tool is to update the firmware in Intel Xeon Phi coprocessor’s flash memory.
In addition, micflash can save and retrieve the current flash image version.
Prior to using the micflash utility, the Intel Xeon Phi coprocessor must be put in either the offline, or
the ready state. The utility will fail if the device is in the normal mode. The Intel Xeon Phi coprocessor can be
placed in the ready state with the following command (root privileges are required):

root@host% micctrl -r
root@host% micctrl -w
mic0: ready

Listing 1.18: Placing the Intel Xeon Phi coprocessor in the ready state and verifying its status with micctrl.

The functions of micflash are invoked with the following commands:

• micflash -info -device 0 – provides information about which sections of the flash are
update-able on the hardware;

• micflash -save <destination_file> – saves the current flash image from the hardware to
the specified file;

• micflash -Update <flash_image> – writes a flash image to the device;

• micflash -getversion 2 -device 0 – retrieves the version information of the firmware.
The -getversion option indicates what type of version is requested:

    2 or Flash image version – for the flash image version;
    3 or Fboot0 version – for the fboot0 flash section version (valid only for KNF and earlier MIC versions);
    4 or Fboot1 version – for the fboot1 flash section version.

    Note: Flash version information can be retrieved from the micsmc -i output as well, which does not
    require an MPSS service restart afterwards.

• Additionally, micflash -info <flash_image> can be used to find the version information from
the flash image file.

If a firmware update is performed, the host system must be rebooted prior to using Intel Xeon Phi
coprocessors. If any other flash operation besides update is performed, start the mpss service to ensure the
MPSS is fully functional.
If several Intel Xeon Phi coprocessors are present in the system, micflash will operate on the first
device only (by default). To perform update/save/info operations on the other coprocessors, the device
number must be specified with the -device <number> option (numbering starts from 0) or -device all
(to run on all coprocessors).
WARNING: Multiple instances of micflash should never be allowed to access the same Intel Xeon Phi
coprocessor simultaneously!


1.5.2 Network Configuration of the uOS on Intel Xeon Phi Coprocessors
Communication with the uOS (the embedded Linux* operating system on the Intel Xeon Phi coprocessor) is
provided by virtual network interfaces mic0, mic1, etc. These interfaces are created by the MPSS. When
data is sent to or from these virtual interfaces, it is physically transferred over the PCIe bus from the host system
to the coprocessor or vice versa. Standard networking protocols, such as ping, TCP/IP and SSH, are supported.
Network properties of Intel Xeon Phi devices can be configured using the configuration file located at
/etc/sysconfig/mic/default.conf. Listing 1.19 illustrates the format of that file:

# If NetworkBridgeName is not defined the use of a static pair of IP addresses
# is used by default
#
# Bridge names starting with "mic" will be created by the MPSS daemon. Other bridges
# are assumed to already exist.
#BridgeName micbr0

# Define the first 2 quads of the network address.
# Static pair configurations will fill in the second 2 quads by default. The individual
# MIC configuration files can override the defaults with MicIPaddress and HostIPaddress.
Subnet 172.31
...
# Include all additional functionality configuration files by default
Include "conf.d/*.conf"
...

Listing 1.19: Network configuration in the default configuration file /etc/sysconfig/mic/default.conf.
In order to apply any modifications made to default.conf, the following command must be executed
with root privileges:

root@host% micctrl --resetconfig

Listing 1.20: Resetting the Intel Xeon Phi coprocessor configuration with the micctrl utility.

In addition to default.conf, files /etc/sysconfig/mic/micN.conf provide fine-grained


configuration control for individual Intel Xeon Phi coprocessors. Here, N is the zero-based number of the Intel
Xeon Phi device in the system. For each coprocessor for which micN.conf has been modified, the coprocessor
configuration should be updated as follows:

root@host% micctrl --resetconfig micN

Listing 1.21: Resetting configuration file of the specified Intel Xeon Phi coprocessor.

The rest of Section 1.5.2 discusses the networking parameters that can be controlled in these configuration
files. The configuration files can be modified by direct editing as well as using the tool micctrl. Run
micctrl -h for information on the latter method.


HostMacAddress and MicMacAddress


The HostMacAddress and MicMacAddress configuration parameters in micN.conf provide a
mechanism to assign MAC addresses to interfaces. This may be necessary if a MAC address collision occurs
in a multi-node system.

Specifying Network Topology

Two network topologies can be used for networking with Intel Xeon Phi coprocessors:

a) Static pair topology (default), where a private network is created on the host system, and the host and all
coprocessors are assigned IP addresses on this private network. In this configuration, all devices within
a system can communicate with each other, but coprocessors cannot communicate with any IP-addressable
devices outside the host system, unless routing is set up on the host.

b) Bridge topology, in which a network bridge is connected to one of the host's NICs, and all coprocessors
in the system can use this bridge to join the private network to which the host system is connected. In a
computing cluster, this means that Intel Xeon Phi coprocessors on one host can communicate over
TCP/IP directly with coprocessors on another host.

The parameter BridgeName in /etc/sysconfig/mic/micN.conf defines the name of the static
bridge to link to. The name specified is overloaded to provide three different sets of topologies:

1. If this parameter is commented out, it is assumed the network topology is a static pair. In this case, the
CardIPaddress and HostIPaddress parameters become relevant.

2. If the bridge name starts with the string "mic", then the static bridge is created with this name, binding
the Intel Xeon Phi coprocessors into a single subnet.

3. Finally, if the bridge name does not start with the string "mic", then the coprocessors will be attached
to what is assumed to be an existing static bridge to one of the host's networking interfaces.
Subnet

The parameter Subnet defines the leading two or three elements of the coprocessor's IP address. The
default value of Subnet is 172.31. This places the coprocessor in the private network range (see [4] for more
information on private networks).
In the static pair network topology, the IP addresses of the host and coprocessors are constructed by
appending to the two-element Subnet a third element equal to the Intel Xeon Phi coprocessor ID plus one,
followed by .1 for the coprocessor and .254 for the host (see Table 1.3).

Device Number   Network Interface   Device IP Address   Host IP Address
0               mic0                172.31.1.1          172.31.1.254
1               mic1                172.31.2.1          172.31.2.254
2               mic2                172.31.3.1          172.31.3.254
3               mic3                172.31.4.1          172.31.4.254

Table 1.3: Default IP address assignment for the static pair network topology.

If a static bridge is defined, the IP addresses are constructed by appending to the subnet a .1. followed by
the ID assigned to the Intel Xeon Phi coprocessor plus one; the host bridge is assigned 172.31.1.254. See
Table 1.4 for this case.


Host Bridge IP Address   Card   Network Interface   Device IP Address
172.31.1.254 (micbr0)    0      mic0                172.31.1.1
                         1      mic1                172.31.1.2
                         2      mic2                172.31.1.3
                         3      mic3                172.31.1.4

Table 1.4: Default IP address assignment for the static bridge network topology.

It is also possible to use more than one static bridge. The BridgeName parameter needs to be specified
in each individual Intel Xeon Phi coprocessor configuration file to assign the correct bridge ID. The files
also need to assign the Subnet parameter a three-element value in each configuration file. For example, one
set of files may assign the BridgeName parameter the string micbr0 and Subnet the value 172.31.1. Another
set of files may assign the BridgeName parameter the string micbr1 and Subnet the value 172.31.2.

MicIPaddress and HostIPaddress

By default, host and coprocessor IP addresses are automatically generated from the Subnet parameter.
In circumstances in which these automatically generated addresses are inadequate, MicIPaddress and
HostIPaddress can be specified in each micN.conf. These values will override the automatically
generated IP addresses.
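For illustration only (the IP values below are hypothetical and not from the original text), an override in /etc/sysconfig/mic/mic0.conf could look as follows; remember to apply it with micctrl --resetconfig as described above:

# Hypothetical per-device overrides in /etc/sysconfig/mic/mic0.conf
MicIPaddress  10.10.10.1
HostIPaddress 10.10.10.254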
Setting the Coprocessor's Host Name

In addition to assigning IP addresses to the coprocessors, MPSS automatically defines hostnames for
coprocessors and stores them in the hosts file /etc/hosts.
By default, mpssd sets the host names of coprocessors to a modified version of the host's name. For
example, if the host has the hostname compute1.mycluster.com, then the first Intel Xeon Phi coprocessor
will be assigned the hostname compute1-mic0.mycluster.com. The Hostname parameter
in micN.conf allows this assignment to be overridden for each individual Intel Xeon Phi coprocessor.
The Hostname parameter will be added to the /etc/hosts file on the host. It will also be used to
create the files /etc/sysconfig/network.conf and /etc/hosts.


us
cl
Ex

MTUsize
The MTUsize parameter allows setting of the network Maximum Transmission Unit (MTU) size. The
default is the max jumbo packet size of 64 kilobytes. This parameter should be set to the default network
packet size being used in the subnet that belongs to. With clusters this is often 9 kilobytes.

DHCP
IP address assignment through DHCP is not natively supported in the current MPSS release (Gold Update
2). However, it can still be enabled through external setup by setting IPADDR=dhcp in the files
/opt/intel/mic/filesystem/micN/etc/sysconfig/network/ifcfg-micN
prior to starting the MPSS.


1.5.3 SSH Access to Intel Xeon Phi Coprocessors
The Linux OS on the Intel Xeon Phi coprocessor supports SSH access for all users, including root,
using public-key authentication. The configuration phase of the MPSS stack creates users for each
coprocessor using the file /etc/passwd on the host. For each user, the public SSH key files found in the
user's /home/user/.ssh directory are copied to the Intel Xeon Phi coprocessor filesystem.
Listing 1.22 demonstrates the generation of an SSH key pair and its inclusion into the coprocessor
filesystem.

user@host% ssh-keygen
# ... output omitted ...
user@host% sudo service mpss stop
user@host% sudo micctrl --resetconfig
user@host% sudo service mpss start

Listing 1.22: Generating SSH keys and reconfiguring/restarting MPSS.

If MPSS and encryption keys are configured correctly, all Linux users should be able to log in from the
host to Intel Xeon Phi coprocessors using the Linux SSH client ssh. No password is necessary if the SSH key
is copied over to the coprocessor. Listing 1.23 demonstrates interaction with an Intel Xeon Phi coprocessor via
a terminal shell.
user@host% ssh mic0
user@mic0% cat /proc/cpuinfo | grep "processor" | wc -l
228
user@mic0% ls /
bin  etc   lib    linuxrc  proc  sbin    sys  usr
dev  home  lib64  oldroot  root  sep3.8  tmp  var
user@mic0% uname -a
Linux mic0 2.6.34.11-g4af9302 #2 SMP Thu Aug 16 18:52:36 PDT 2012 k1om GNU/Linux

Listing 1.23: Logging in to an Intel Xeon Phi coprocessor using SSH, querying the number of cores, listing the files,
and checking the Linux kernel version.



1.5.4 NFS Mounting a Host Export


NFS export from the host to the coprocessor is configured in the same way as NFS exports between
regular Linux hosts. The root user on the host must allow the export of the shared folder by including the respective
line in the /etc/exports file. In addition, the host firewall or iptables needs to be configured to allow
traffic on the following ports:

• tcp/udp port 111 - RPC 4.0 portmapper

• tcp/udp port 2049 - NFS server

On the client, an NFS share can be mounted either manually using the command mount, or automatically by
adding an entry to the file /etc/fstab.

Example
As an example, let us illustrate how to share the Intel MPI library located on the host at the path
/opt/intel/impi with all Intel Xeon Phi coprocessors. The mount point on the coprocessors will be the
same as on the host, i.e., /opt/intel/impi. The instructions below assume that the reader is familiar with
the text editor vi.
First of all, let us ensure that all the necessary services are started (you will need to install the package
nfs-utils if some of these services are missing):
root@host% service rpcbind status
rpcbind (pid 3495) is running...
root@host% service nfslock status
rpc.statd (pid 3624) is running...
root@host% service nfs status
rpc.svcgssd is stopped
rpc.mountd (pid 3864) is running...
nfsd (pid 3927 3926 3925 3924 3923 3922 3921 3920) is running...
rpc.rquotad (pid 3860) is running...
root@host%

Listing 1.24: Verifying the NFS server status on the host.


In order to enable sharing of /opt/intel/impi, the line shown in Listing 1.25 must be present in the
host file /etc/exports.

/opt/intel/impi mic0(rw,no_root_squash) mic1(rw,no_root_squash)

Listing 1.25: Text to append to /etc/exports on host.

The file /etc/hosts.allow on the host must have the line shown in Listing 1.26:

ALL: mic0,mic1

Listing 1.26: Text to append to /etc/hosts.allow on host.


After the /etc/exports and /etc/hosts.allow files have been updated, the command sudo
exportfs -a should be executed in order to pass the modifications to the NFS server.
On the coprocessor side, the file /etc/fstab must be modified to allow mounting the exported NFS
file system. It must contain the line shown in Listing 1.27.

host:/opt/intel/impi /opt/intel/impi nfs rsize=8192,wsize=8192,nolock,intr 0 0

Listing 1.27: Line to append to the file /etc/fstab on coprocessor(s)

Finally, on the coprocessor, the mount point /opt/intel/impi must be created, and then the
command mount -a will mount the directory.

user@mic0% mkdir -p /opt/intel/impi


user@mic0% mount -a

Listing 1.28: Enabling NFS mount on a coprocessor.

If the above steps succeed, the contents of the host directory /opt/intel/impi will be available on all
coprocessors. The next time MPSS is restarted or the host system is rebooted, however, the NFS share will vanish.
In order to make these changes persistent between system or MPSS reboots, some files must be edited on the host in
the directory /opt/intel/mic/filesystem as shown in Listing 1.29. This procedure must be repeated
for every coprocessor in the system (mic0, mic1, etc.).



user@host% sudo su
root@host% cd /opt/intel/mic/filesystem
root@host% vi mic0/etc/fstab
# Append one line:
host:/opt/intel/impi /opt/intel/impi nfs rsize=8192,wsize=8192,nolock,intr 0 0
root@host% mkdir -p mic0/opt/intel/impi
root@host% vi mic0.filelist
# Append these three lines:
dir /opt 755 0 0
dir /opt/intel 755 0 0
dir /opt/intel/impi 755 0 0

Listing 1.29: Making NFS mount on a coprocessor persistent by modifying the mic0 filesystem files on the host.

The procedure shown in Listing 1.29 can be performed automatically using the tool micctrl:

user@host% sudo micctrl --addnfs=/opt/intel/impi --dir=/opt/intel/impi


user@host% sudo service mpss restart

Listing 1.30: Automated procedure for making NFS mount on a coprocessor persistent by modifying the mic0 filesystem
files on the host (an alternative to the procedure shown in Listing 1.29).

For more information regarding the NFS service, refer to [5].


Chapter 2

Programming Models for Intel Xeon Phi Applications

In Chapter 1, we introduced the MIC architecture without going into the details of how to program Intel
Xeon Phi coprocessors. This chapter demonstrates the utilization of the Intel Xeon Phi coprocessor from
user applications written in C/C++ and Fortran. It focuses on transferring data and executable code to the
coprocessor. Parallelism will be discussed in Chapter 3, and performance optimization in Chapter 4.

2.1 Native Applications and MPI on Intel Xeon Phi Coprocessors

Intel Xeon Phi coprocessors run a Linux operating system and support traditional Linux services,
including SSH. This allows the user to run applications directly on an Intel Xeon Phi coprocessor by compiling
a native executable for the MIC architecture and transferring it to the coprocessor’s virtual filesystem.

2.1.1 Using Compiler Argument -mmic to Compile Native Applications for Intel Xeon Phi Coprocessors
In order to compile a C/C++ code to an Intel Xeon Phi executable, Intel compilers must be given the
argument -mmic. A “Hello World” code for the coprocessor is shown in Listing 2.1. Compiling and running
this code on the host is trivial, and the compilation procedure and runtime output are shown in Listing 2.2.
The name of the executable is not specified, so the compiler sets it to the default name a.out.

1 #include <stdio.h>
2 #include <unistd.h>
3
4 int main(){
5 printf("Hello world! I have %ld logical cores.\n",
6 sysconf(_SC_NPROCESSORS_ONLN ));
7 }

Listing 2.1: This C code (“hello.c”) can be compiled for execution on the host as well as on an Intel Xeon Phi
coprocessor.

Prepared for Yunheng Wang c Colfax International, 2013


TM
38 CHAPTER 2. PROGRAMMING MODELS FOR INTEL R XEON PHI APPLICATIONS

user@host% icc hello.c


user@host% ./a.out
Hello world! I have 32 logical cores.
user@host%

Listing 2.2: Compiling and running the “Hello World” code on the host.
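As a side note, the executable name can be chosen explicitly with the standard -o compiler option (shown here only for completeness; the name hello.XEON is an arbitrary illustrative choice, not taken from the original text):

user@host% icc hello.c -o hello.XEON
user@host% ./hello.XEON
Hello world! I have 32 logical cores.
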

Listing 2.3 shows how this code can be compiled for native execution on an Intel Xeon Phi coprocessor.
The code fails to run on the host, because it is not compiled for the Intel Xeon architecture. Section 2.1.2
shows how this code can be transferred to an Intel Xeon Phi coprocessor and executed on it.

user@host% icc hello.c -mmic


user@host% ./a.out
-bash: ./a.out: cannot execute binary file

Listing 2.3: A native application for Intel Xeon Phi coprocessors cannot be run on the host system.
2.1.2 Establishing SSH Sessions with Coprocessors

Intel Xeon Phi coprocessors run a Linux operating system with an SSH server (see also Section 1.4.2)
and, when the MPSS is loaded, the list of Linux users and their SSH keys on the host are transferred to the
Intel Xeon Phi coprocessor. By default, the first Intel Xeon Phi coprocessor in the system resolves to the
hostname mic0, as specified in the file /etc/hosts.
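For reference, the relevant /etc/hosts entries on the host typically look like the following (the addresses shown are only illustrative of the MPSS defaults and are not taken from this text):

172.31.1.1   host-mic0 mic0
172.31.2.1   host-mic1 mic1
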


The user can therefore transfer the executable a.out to the coprocessor using the secure copy tool scp,
as shown in Listing 2.4. After that, the user can log into the coprocessor using ssh and use the shell to
run the application on the coprocessor. Running this executable produces the expected “Hello world” output,
and the number of logical cores is correctly detected as 240.
user@host% scp a.out mic0:~/
a.out                                    100%   10KB  10.4KB/s   00:00
user@host% ssh mic0
user@mic0% pwd
/home/user
user@mic0% ls
a.out
user@mic0% ./a.out
Hello world! I have 240 logical cores.
user@mic0%

Listing 2.4: Transferring and running a native application on an Intel Xeon Phi coprocessor.


2.1.3 Running Native Applications with micnativeloadex


The micnativeloadex utility is a tool for running native applications on Intel Xeon Phi coprocessors.
It copies the native binary to a specified Intel Xeon Phi coprocessor and executes it. In addition,
micnativeloadex automatically checks library dependencies for the application, and, if they can be
located, these libraries are also copied to the device prior to execution. By default, the output from the
application running remotely on the Intel Xeon Phi coprocessor is redirected back to the local console. This
redirection can be disabled with a command line option.
The default search path for dependent libraries is set using the SINK_LD_LIBRARY_PATH environment
variable. This environment variable works just like LD_LIBRARY_PATH for normal Linux applications. In
order to only display the list of dependencies, micnativeloadex should be run with the command line
argument -l. Any dependencies not found can be included in SINK_LD_LIBRARY_PATH.
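For instance, an additional directory containing user libraries could be appended to the search path as follows (the path /home/user/lib/mic is a hypothetical example, not from the original text):

user@host% export SINK_LD_LIBRARY_PATH=$SINK_LD_LIBRARY_PATH:/home/user/lib/mic
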
Listing 2.5 illustrates the usage of micnativeloadex on an application that uses the Intel OpenMP
library as well as some user libraries.
user@host% export SINK_LD_LIBRARY_PATH=/opt/intel/composerxe/compiler/lib/mic
user@host% micnativeloadex ./myapp -l
Dependency information for my
Full path was resolved as
/home/user/myapp
Binary was built for Intel(R) Xeon Phi(TM) Coprocessor
(codename: Knights Corner) architecture
SINK_LD_LIBRARY_PATH = /opt/intel/composerxe/lib/mic
Dependencies Found:
/opt/intel/composer_xe_2013.2.146/compiler/lib/mic/libiomp5.so
Dependencies Not Found Locally (but may exist already on the coprocessor):
libm.so.6
libpthread.so.0
libc.so.6
libdl.so.2
libstdc++.so.6
libgcc_s.so.1
user@host% micnativeloadex ./app
# User application using the OpenMP library is running on the coprocessor...
user@host%

Listing 2.5: Using micnativeloadex to find library dependencies for native MIC application and launch it from the
host.

Note: micnativeloadex may also be used for performance analysis of native MIC applications. See
Section A.4.1 for more information.


2.1.4 Monitoring the Coprocessor Activity with micsmc


In order to demonstrate how the Intel Xeon Phi coprocessor activity can be monitored, and at the same
time to show how POSIX threads (pthreads) can be used to execute parallel codes on coprocessors, below
we construct a pthreads-based workload. The source code shown in Listing 2.6 spawns as many threads as
there are logical cores in the system, and each thread executes an infinite loop in function spin(void*).
This application does not perform any useful calculations, but it keeps all cores occupied to produce activity
which we will monitor. Naturally, this code is suitable for both the host (an Intel Xeon processor) and the
target (an Intel Xeon Phi coprocessor).

1  #include <stdio.h>
2  #include <unistd.h>
3  #include <pthread.h>
4
5  void* spin(void* arg) {
6    while(true);
7    return NULL;
8  }
9
10 int main(){
11   int n=sysconf(_SC_NPROCESSORS_ONLN);
12   printf("Spawning %d threads that do nothing. Press ^C to terminate.\n", n);
13   for (int i = 0; i < n-1; i++) {
14     pthread_t foo;
15     pthread_create(&foo, NULL, &spin, NULL);
16   }
17   spin(NULL);
18 }
ed
Listing 2.6: This C code (“spin.c”) illustrates how the pthreads library can be used to produce parallel codes on
Intel Xeon Phi coprocessors. This workload is produced in order to illustrate monitoring tools for the coprocessor.

The output in Listing 2.7 illustrates the compilation and running of the code spin.c on a coprocessor.
The code enters an infinite loop and never terminates, so the execution must be terminated by pressing Ctrl+C.
However, while the program is running, we can monitor the Intel Xeon Phi coprocessor load.
user@host% icpc spin.c -lpthread -mmic


user@host% scp a.out mic0:~/
a.out 100% 11KB 11.0KB/s 00:00
user@host% ssh mic0
user@mic0% ./a.out
Spawning 240 threads that do nothing. Press ^C to terminate.
^C
user@mic0%

Listing 2.7: Compiling and running the code in Listing 2.6 as a native workload for Intel Xeon Phi coprocessors.

The utility micsmc included with the MPSS is a graphical user interface that allows the user to monitor the
coprocessor load and temperature, read logs and error messages, and control some of the coprocessor’s settings.
Figure 2.1 (top panel) illustrates how the load on the coprocessor increases for the duration of the
execution of the workload code, and drops afterwards. Three panels in the bottom part of Figure 2.1 show
some of the information and controls accessible via the micsmc interface. Much of this information, and many of
these controls, are also available via the MPSS command-line tools. See Section 1.5.1 for more information.


[Figure 2.1: screenshot of the micsmc graphical interface appears here in the original document.]

Figure 2.1: The interface of the micsmc utility. The top panel demonstrates the load on the Intel Xeon Phi coprocessor.
Three panels in the bottom part of the screenshot demonstrate the controls and information available via micsmc.


2.1.5 Compiling and Running MPI Applications on the Coprocessor


The ability of Intel Xeon Phi coprocessors to run native applications and the IP-addressability feature
of coprocessors make it possible to seamlessly integrate Intel Xeon Phi coprocessors into applications that
employ the traditional solution for cluster computing, Message Passing Interface (MPI). This section describes
how to compile a native application for Intel Xeon Phi coprocessors with MPI and to run it on the coprocessor.

About MPI
MPI, or Message Passing Interface, is a communication protocol for distributed memory high performance
applications. Intel’s proprietary implementation of MPI is available as Intel MPI Library. This library
implements version 2.2 of the MPI protocol. For information about using MPI to express distributed-memory
parallel algorithms, refer to Section 3.3. The Intel MPI Reference Guide [6] contains more detailed information
about using Intel MPI.

Prerequisites

Before compiling and running MPI applications, environment variables should be set by calling a script
included in Intel MPI:

user@host% source /opt/intel/impi/4.1.0/intel64/bin/mpivars.sh

Listing 2.8: Setting Intel MPI environment variables on the host.

Additionally, the Intel MPI binaries and libraries have to be available on the Intel Xeon Phi coprocessor.
There are two ways to achieve that. A straightforward, but not recommended, method is to copy certain files
from /opt/intel/impi to the coprocessor. A better method is to NFS-share the required files with the
coprocessor or coprocessors. The procedure for doing so is described in Section 1.5.4. We will assume that
the latter method is used, and that all the required files are already available on the coprocessor.

Usage

MPI applications must be compiled with special wrapper applications: mpiicc for C, mpiicpc for
C++, or mpiifort for Fortran codes. In order to launch the resulting executable as a parallel MPI application,
it should be run using a wrapper script called mpirun. MPI executables can also be executed as regular
applications; however, in this case, parallelization does not occur.


“Hello World” with MPI on the Host


Listing 2.9 shows a “Hello World” example of MPI usage. When this application is run in the MPI
environment, multiple processes execute this code. Each of these processes is assigned an identification
number, known as rank, which may be used by the programmer to determine the role of each process in the
application. MPI processes can exchange information by passing messages in a variety of ways. Message
passing is discussed in more detail in Section 3.3.

1  #include "mpi.h"
2  #include <stdio.h>
3  #include <string.h>
4
5  int main (int argc, char *argv[]) {
6    int i, rank, size, namelen;
7    char name[MPI_MAX_PROCESSOR_NAME];
8
9    MPI_Init (&argc, &argv);
10
11   MPI_Comm_size (MPI_COMM_WORLD, &size);
12   MPI_Comm_rank (MPI_COMM_WORLD, &rank);
13   MPI_Get_processor_name (name, &namelen);
14
15   printf ("Hello World from rank %d running on %s!\n", rank, name);
16
17   if (rank == 0) printf("MPI World size = %d processes\n", size);
18   MPI_Finalize ();
19 }
Listing 2.9: Source code HelloMPI.c of a “Hello world” program with MPI.

In order to compile and run the source file from Listing 2.9, we use the procedure demonstrated in
Listing 2.10.

user@host% mpiicc -o HelloMPI.XEON HelloMPI.c
user@host% mpirun -n 2 ./HelloMPI.XEON
user@host% ./HelloMPI.XEON    # Running without MPI
Hello World from rank 0 running on host!
MPI World size = 1 processes
user@host% mpirun -host localhost -np 2 ./HelloMPI.XEON   # Two parallel processes on host
Hello World from rank 1 running on host!
Hello World from rank 0 running on host!
MPI World size = 2 processes

Listing 2.10: Compiling the “Hello World!” code with Intel MPI for the host system and running it using two processes.


Hello World with MPI on the Coprocessor


In order to compile the “Hello World” code for Intel Xeon Phi coprocessors, one must use the -mmic
compiler flag. No modification of the source code is required for this. In order to run the code, the resulting
executable file must be copied to the Intel Xeon Phi coprocessor, and then mpirun must be invoked on the
host as shown in Listing 2.11.

user@host% mpiicc -mmic -o HelloMPI.MIC HelloMPI.c


user@host% sudo scp HelloMPI.MIC mic0:~/
user@host% export I_MPI_MIC=1
user@host% mpirun -host mic0 -np 2 ~/HelloMPI.MIC # Parallel processes on coprocessor
Hello World from rank 1 running on host-mic0!
Hello World from rank 0 running on host-mic0!
MPI World size = 2 processes

Listing 2.11: Compiling and running a Hello World code with Intel MPI on an Intel Xeon Phi coprocessor.

The difference between this case and the case shown in Listing 2.10 is that we included the argument
-host mic0 instead of -host localhost. The hostname mic0 resolves to the first coprocessor in the
system, as mentioned above. In addition, we had to set the environment variable I_MPI_MIC=1 in order to
enable Intel MPI processes on the MIC architecture.
This concludes a brief introduction into using Intel MPI to compile and run MPI processes on Intel Xeon
Phi coprocessors. The discussion of MPI will continue in Section 2.4.3, where we will demonstrate how to run
MPI calculations on multiple coprocessors or on the host and coprocessors simultaneously. Subsequently, in
Section 3.3, we will introduce message passing in order to effect cooperation between processes. Optimization
in MPI is discussed in Section 4.7.


2.2 Explicit Offload Model


Section 2.1 demonstrated how native codes for the MIC architecture may be run directly on the coprocessor
without the involvement of the host. It is also possible to develop applications so that they run on the host
and employ the MIC architecture by transferring only some of the data and functions to the coprocessors.
The process of data and code transfer to the coprocessor is generally called offload, and applications
using this procedure are known as offload applications. This section describes a set of C language extensions
for the explicit offload model, and Section 2.3 introduces alternative extensions, the virtual-shared memory
model.

2.2.1 “Hello World” Example in the Explicit Offload Model


The source code in the C language in Listing 2.12 demonstrates offloading a section of the program to
an Intel Xeon Phi coprocessor using #pragma offload.

1  #include <stdio.h>
2
3  int main(int argc, char * argv[] ) {
4    printf("Hello World from host!\n");
5    #pragma offload target(mic)
6    {
7      printf("Hello World from coprocessor!\n");
8      fflush(0);
9    }
10   printf("Bye\n");
11 }
Listing 2.12: Source code of hello-offload.cpp example with the offload segment to be executed on an Intel Xeon
Phi coprocessor.

Line 6 in Listing 2.12 — #pragma offload target(mic) — indicates that the following segment
of the code should be executed on an Intel Xeon Phi coprocessor (i.e., “offloaded”). This application must
be compiled as a usual host application: no additional compiler arguments are necessary in order to compile
offload applications. This code produces the following output:

user@host% icpc hello_offload.cpp -o hello_offload


user@host% ./hello_offload
Hello World from host!
Bye
Hello World from coprocessor!

Listing 2.13: Output of the execution of hello-offload.cpp.


2.2.2 Offloading Functions


When user-defined functions are called within the scope of #pragma offload, these functions must
be declared with the qualifier __attribute__((target(mic))) [7] (see Listing 2.14). This qualifier
tells the compiler to generate the MIC architecture code for the function. The coprocessor implementation of
the function is automatically transferred to the coprocessor and launched when offload occurs.

1  #include <stdio.h>
2
3  __attribute__((target(mic))) void MyFunction() {
4    printf("Hello World from coprocessor!\n");
5    fflush(0);
6  }
7
8  int main(int argc, char * argv[] ) {
9    printf("Hello World from host!\n");
10   #pragma offload target(mic)
11   {
12     MyFunction();
13   }
14   printf("Bye\n");
15 }
Listing 2.14: Offloading a function to an Intel Xeon Phi coprocessor.

If multiple functions must be declared with this qualifier, there is a short-hand way to set and unset this
qualifier inside a source file (see Listing 2.15). This is also useful when using #include to inline header files.

1  #include <stdio.h>
2
3  #pragma offload_attribute(push, target(mic))
4  // The target(mic) attribute is set for all functions after the above pragma
5
6  void MyFunctionOne() { // This function has target(mic) set by the pragma above
7    printf("Hello World from coprocessor!\n");
8  }
9  void MyFunctionTwo() { // The target(mic) attribute is still active for this function
10   fflush(0);
11 }
12
13 #pragma offload_attribute(pop)
14 // The target(mic) attribute is unset after the above pragma
15
16 int main(int argc, char * argv[] ) {
17   printf("Hello World from host!\n");
18   #pragma offload target(mic)
19   {
20     MyFunctionOne();
21     MyFunctionTwo();
22   }
23   printf("Bye\n");
24 }

Listing 2.15: Declaring multiple functions with the target attribute qualifier.

Parallel Programming and Optimization with Intel Xeon Phi Coprocessors


2.2. EXPLICIT OFFLOAD MODEL 47

2.2.3 Proxy Console I/O


The example in Section 2.2.1 demonstrates that the code executing on the coprocessor can output data to
the standard output stream, and these data appear on the host console. How does that happen?
When console output operations are called on an Intel Xeon Phi coprocessor, e.g. with printf(), they
are buffered in the uOS and later passed on (proxied) to the host console by the coi daemon running on the
coprocessor. The communication scheme of the console proxy is shown in Figure 2.2.

[Figure 2.2: proxy console I/O diagram appears here in the original document.]

Figure 2.2: Proxy console I/O diagram. Output to standard output and standard error streams on the coprocessor is
buffered and passed on to the host terminal by the coi daemon running on the Intel Xeon Phi coprocessor. Image credit:
Intel Corporation.
In the case of the “Hello World” code (Listing 2.12 and Listing 2.13), buffering delays caused the stream
from the coprocessor to be printed out after the host had finished the last printf() function call (line 11 in
Listing 2.12).
The output buffer must be flushed using the fflush(0) function of the stdio library in order to
ensure consistent operation of the console proxy. Without fflush(0) in the coprocessor code, the output of
the printf function might be lost if the program is terminated prematurely.
The proxy console I/O service is enabled by default. It can be disabled by setting the environment
variable MIC_PROXY_IO=0.
Despite the name "Proxy console I/O", the coi service proxies only the standard output and standard
error streams. Proxy console input is not supported.
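For example, the proxy could be switched off for a single shell session as follows (a minimal illustration of the environment variable described above; the executable is the one from Listing 2.13):

user@host% export MIC_PROXY_IO=0
user@host% ./hello_offload      # output printed on the coprocessor is no longer forwarded to this console
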


2.2.4 Offload Diagnostics


It is possible to generate diagnostic output for offload applications that utilize Intel Xeon Phi coprocessors.
The Linux environment variable OFFLOAD_REPORT controls the verbosity of the diagnostic output:

a) When this variable is not set or has the value 0, no diagnostic output is produced.
b) Setting OFFLOAD_REPORT=1 produces output including the offload locations and times.
c) Setting OFFLOAD_REPORT=2, in addition, produces information regarding data traffic.

Data traffic is discussed in detail in Section 2.2.8.


Listing 2.16 demonstrates the effect of the environment variable OFFLOAD_REPORT.

user@host% export OFFLOAD_REPORT=0
user@host% ./hello_offload
Hello World from host!
Bye
Hello World from coprocessor!
user@host%
user@host% export OFFLOAD_REPORT=1
user@host% ./hello_offload
Hello World from host!
Bye
Hello World from coprocessor!
[Offload] [MIC 0] [File]          hello_offload.cpp
[Offload] [MIC 0] [Line]          6
[Offload] [MIC 0] [CPU Time]      0.491965 (seconds)
[Offload] [MIC 0] [MIC Time]      0.000167 (seconds)
user@host% export OFFLOAD_REPORT=2
user@host% ./hello_offload
Hello World from host!
Bye
Hello World from coprocessor!
[Offload] [MIC 0] [File]          hello_offload.cpp
[Offload] [MIC 0] [Line]          6
[Offload] [MIC 0] [CPU Time]      0.552854 (seconds)
[Offload] [MIC 0] [CPU->MIC Data] 0 (bytes)
[Offload] [MIC 0] [MIC Time]      0.000205 (seconds)
[Offload] [MIC 0] [MIC->CPU Data] 0 (bytes)
user@host%

Listing 2.16: Using the environment variable OFFLOAD_REPORT to monitor the execution of an application performing
offload to an Intel Xeon Phi coprocessor.

Parallel Programming and Optimization with Intel Xeon Phi Coprocessors


2.2. EXPLICIT OFFLOAD MODEL 49

2.2.5 Environment Variable Forwarding and MIC_ENV_PREFIX


Environment variables defined on the host are automatically forwarded to the coprocessor when an
offload application is launched. By default, variable names are not changed on the coprocessor.
Example: suppose the user sets the environment variable MYVAR=myval. In the offloaded program,
environment variable MYVAR on the coprocessor will have the value myval.
In order to avoid environment variable name collisions on the host and the coprocessor, the user can
utilize the environment variable MIC_ENV_PREFIX. When this variable is set, only environment variables
with names beginning with the specified prefix are forwarded, and the prefix is stripped on the coprocessor.
Example: suppose the user sets MIC_ENV_PREFIX=MIC and MIC_MYVAR=myval2. In the offloaded
part of the application running on the coprocessor, the environment variable MYVAR will have the value
myval2 (variable value is passed, and the variable name is stripped of “MIC_”).
Note that the variable MIC_LD_LIBRARY_PATH is an exception. It is never passed to the coprocessor
with its prefix stripped, so it is not possible to change the value of LD_LIBRARY_PATH on the coprocessor
using environment forwarding.
Listing 2.17 and Listing 2.18 demonstrate environment forwarding and the effect of MIC_ENV_PREFIX.

1  #include <stdio.h>
2  #include <stdlib.h>
3
4  int main(){
5    #pragma offload target (mic)
6    {
7      char* myvar = getenv("MYVAR");
8      if (myvar) {
9        printf("MYVAR=%s on the coprocessor.\n", myvar);
10     } else {
11       printf("MYVAR is not defined on the coprocessor.\n");
12     }
13   }
14 }
Listing 2.17: This code, environment.cc, prints the value of the environment variable MYVAR on the coprocessor.

user@host% icpc environment.cc -o environment


user@host% ./environment
MYVAR is not defined on the coprocessor.
user@host%
user@host% export MYVAR=myval
user@host% ./environment
MYVAR=myval on the coprocessor.
user@host%
user@host% export MIC_ENV_PREFIX=MIC
user@host% ./environment
MYVAR is not defined on the coprocessor.
user@host%
user@host% export MIC_MYVAR=myval2
user@host% ./environment
MYVAR=myval2 on the coprocessor.
user@host%

Listing 2.18: With MIC_ENV_PREFIX undefined, environment variables are passed to the coprocessor without name
changes. With MIC_ENV_PREFIX=MIC, only variables starting with MIC_ are passed, with prefix dropped.


2.2.6 Target-Specific Code with the Preprocessor Macro __MIC__


When the Intel compiler detects offloaded sections in a code, it produces two versions of executables
for these segments: one for Intel Xeon processors, and another for Intel Xeon Phi coprocessors. For the
coprocessor version, the compiler defines the macro __MIC__. For the processor code, this macro is undefined.
This macro allows the programmer to “check” where the code is executed, and to write multiversioned offload
codes. In native applications for the Intel Xeon Phi architecture, the macro __MIC__ is also defined. The
usage of this macro is illustrated in Section 2.2.7, where fall-back to host is discussed.

2.2.7 Fall-Back to Execution on the Host upon Unsuccessful Offload


If no coprocessors are found in the system, the offload code will be executed anyway: the host’s
processor will be used instead of the coprocessor in this case. In order to make this fall-back possible,
the compiler always generates a CPU version and a MIC version of the offloaded code. It is possible to
check whether the code is executing on the coprocessor or on the host using the compiler macro __MIC__
(see Section 2.2.6). A modification of the code in Listing 2.12, with a check of the execution platform, is
demonstrated in Listing 2.19.

1  printf("Hello World from host!\n");
2  #pragma offload target(mic)
3  {
4    printf("Hello World ");
5  #ifdef __MIC__
6    printf("from coprocessor (offload succeeded).\n");
7  #else
8    printf("from host (offload to coprocessor has failed, running on the host).\n");
9  #endif
10   fflush(0);
11 }

Listing 2.19: Fragment of hello-fallback.cpp: handling the fall-back to host in cases when offload fails.

In Listing 2.20, the code hello-fallback.cpp is compiled and executed. In the first execution
attempt, the coprocessor is available, and offload occurs. In the second attempt, the MIC driver is intentionally
disabled, and offload fails. Execution proceeds nevertheless; the code simply runs on the host.

user@host% icpc hello_fallback.cpp -o hello_fallback


user@host% ./hello_fallback
Hello World from host!
Hello World from coprocessor (offload succeeded).
user@host%
user@host% # Shutting down MPSS in order to disable the coprocessor
user@host% sudo service mpss stop
Shutting down MPSS Stack: [ OK ]
user@host% micctrl -s
mic0: ready
user@host% ./hello_fallback
Hello World from host!
Hello World from host (offload to coprocessor has failed, running on the host).
user@host%

Listing 2.20: Compiling and running hello-fallback.cpp: fall-back to execution on the host when offload fails.

Parallel Programming and Optimization with Intel Xeon Phi Coprocessors


2.2. EXPLICIT OFFLOAD MODEL 51

2.2.8 Using Pragmas to Transfer Bitwise-Copyable Data to the Coprocessor


Up until now, we only discussed the methods to transfer executable code and to start calculations on the
coprocessor. In this section, we will discuss the transfer of data to the coprocessor in the explicit offload model.
The explicit offload model, as the name suggests, allows the user to specify exactly which data must be copied.

Offloading Local Scalars and Arrays of Known Size


In order to transfer local scalar variables and arrays of known size, nothing needs to be done, as shown in
Listing 2.21. The value of N and the array data will be copied to the coprocessor at the start of the offload,
and at the end, these variables will be copied back.

1  void MyFunction() {
2    const int N = 1000;
3    int data[N];
4    #pragma offload target(mic)
5    {
6      for (int i = 0; i < N; i++)
7        data[i] = 0;
8    }
9  }

Listing 2.21: Offload of local scalars and arrays of known size occurs automatically

Offloading Pointer-Based Arrays


When data is stored in an array referenced by a pointer, and the array size is unknown at compile time,
the programmer must indicate the array length in a clause of #pragma offload, as shown in Listing 2.22.
The length is indicated in array elements and not bytes.

1  void MyFunction(const int N, int* data) {
2    #pragma offload target(mic) inout(data: length(N))
3    {
4      for (int i = 0; i < N; i++)
5        data[i] = 0;
6    }
7  }

Listing 2.22: Offload of pointer-based arrays of unknown size

The clauses indicating data transfer direction and amount are further discussed in Sections 2.2.9 and
2.2.10.


Offloading Global and Static Variables


When an offloaded variable is used in the global scope or with the static attribute, it must also be
declared with the qualifier __attribute__((target(mic))):

1  int* __attribute__((target(mic))) data;
2
3  void MyFunction() {
4    static __attribute__((target(mic))) int N = 1000;
5    #pragma offload target(mic) inout(data: length(N))
6    {
7      for (int i = 0; i < N; i++)
8        data[i] = 0;
9    }
10 }

Listing 2.23: Offload of global and static variables

Data Transfer without Computation

If it is necessary to send data to the coprocessor without launching any processing of this data, either
the body of the offloaded code can be left blank (i.e., use “{}” after #pragma offload), or a special
#pragma offload_transfer can be used as shown in Listing 2.24.

1  // Offload with nothing to do...
2  #pragma offload target(mic) in(data: length(N))
3  { }
4
5  // ... is equivalent to the following pragma:
6  #pragma offload_transfer target(mic) in(data: length(N))
7
8  // The above pragma does not have a body.
9  // Continuing on the host...
10 }

Listing 2.24: Transferring data to the coprocessor without computation.



This pragma is especially useful when combined with the clause signal in order to start data transfer
without blocking (i.e., to initiate an asynchronous transfer). In this way, data transfer may be overlapped with
some other work on the host or on the coprocessor. See Section 2.2.10 for a discussion of asynchronous
transfer.


2.2.9 Data Traffic and Persistence between Offloads


In order to control the details of data traffic between the host and coprocessor,
#pragma offload_transfer or #pragma offload may be configured with specifiers:

• Specifiers in/out indicate that the data must be sent to (“in”) or from (“out”) the coprocessor;

• inout indicates that data must be passed both to and from the target, and

• nocopy can be used to indicate that data should not be transferred in either direction.

These four specifiers only apply to bitwise-copyable data referenced by a pointer (e.g., an array of scalars or
an array of structs). Refer to [8] and [9] for complete information. The rest of this subsection demonstrates the
basic usage of language constructs for data transfer between the host and the coprocessor.
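As a minimal sketch of the clause syntax (the arrays a and b and the length N are hypothetical names, not from the original text), an offload that sends one array to the coprocessor and retrieves another could be written as follows:

#pragma offload target(mic) in(a : length(N)) out(b : length(N))
{
  for (int i = 0; i < N; i++)
    b[i] = 2 * a[i];    // b is computed on the coprocessor and copied back to the host
}
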

Basic Data Traffic

The following example shows how to initiate an offload instance, in which arrays p1 and p2 are sent to
the Intel Xeon Phi coprocessor, and array sum is fetched back.

1  #include <stdlib.h>
2  #define N 1000
3
4  int main() {
5    double *p1=(double*)malloc(sizeof(double)*N);
6    double *p2=(double*)malloc(sizeof(double)*N);
7    p1[0:N]=1; p2[0:N]=2;
8    double sum[N];
9    // Local variable sum will be automatically passed to/from coprocessor,
10   // no need for inout clause for sum
11   #pragma offload target(mic) in(p1, p2 : length(N))
12   {
13     for (int i = 0; i < N; i++) {
14       sum[i] = p1[i] + p2[i];
15     }
16   }

Listing 2.25: Illustration of synchronous transfer of bitwise-copyable data referenced by pointers.

In the example shown in Listing 2.25, arrays p1 and p2 referenced by pointers are passed to the coprocessor.
The pointers must be declared in the scope of the offload pragma, and their length must be specified in the in
clause. At the same time, the data returned from the coprocessor at the end of the offload is stored in sum,
which is an array with size known at compile time. Its length does not have to be specified in the out
clause. This type of data transfer is synchronous, because the execution of the host code is blocked until the
offloaded code returns.

Transfer with Data Persistence


By default, every time some pointer-based data is sent from or to the coprocessor, the runtime system
allocates memory on the coprocessor at the start of offload, and deallocates this memory at the end of offload.
Dynamic memory allocation is a serial operation, which may incur significant overhead on the Intel Xeon
Phi coprocessor, and it may be beneficial for performance to preserve memory allocation between offloads.
Besides allocated memory, the user may wish to preserve the data in that memory as well.


In order to preserve some coprocessor allocated memory, and, optionally, the data in it, between offloads,
clauses alloc_if and free_if may be used. These clauses are given arguments which, if evaluated to 1,
enforce the allocation and deallocation of data, respectively. By default, the argument of both alloc_if and
free_if evaluates to 1. Allocation takes place at the start of the offload, and freeing takes place at the end
of the offload.
Example in Listing 2.26 demonstrates how data can be transferred to the coprocessor and preserved there
until the next offload.

1  SetupPersistentData(N, persistent);
2
3  #pragma offload_transfer target(mic:0) \
4    in(persistent : length(N) alloc_if(1) free_if(0) )
5
6  for (int iter = 0; iter < nIterations; iter++) {
7    SetupDataset(iter, dataset);
8    #pragma offload target(mic:0) \
9      in (dataset : length(N) alloc_if(iter==0) free_if(iter==nIterations-1) ) \
10     out (results : length(N) alloc_if(iter==0) free_if(iter==nIterations-1) ) \
11     nocopy (persistent : length(N) alloc_if(0) free_if(iter==nIterations-1) )
12   {
13     Compute(N, dataset, results, persistent);
14   }
15   ProcessResults(N, results);
16 }
16 }

Listing 2.26: Illustration of data transfer with persistence on the coprocessor between offload regions.

In Listing 2.26, the first pragma allocates array persistent on the coprocessor and initializes it
by transferring some data into this array from the host. Then, inside the for-loop, the data in the array
persistent is re-used, because the nocopy specifier is used in #pragma offload. In order for
nocopy to work, the allocated memory must persist between offloads, and this behavior is requested by
using alloc_if(iter==0) in order to limit data allocation to the first iteration. Finally, the clause
free_if(iter==nIterations-1) makes sure that the memory on the coprocessor is eventually deallocated
to prevent memory leaks. The deallocation occurs only in the last iteration.
Note how the character ‘\’ is used in order to make the specification of the pragma continue onto the
next line.


2.2.10 Asynchronous Offload


As mentioned above, offloaded function calls and data transport to the coprocessor block execution on
the host until the offloaded function returns. The explicit offload model also supports asynchronous (i.e.,
non-blocking) transfers and functions calls. Asynchronous data transfer opens additional possibilities for
optimization, as

a) data transfer time can be masked,

b) the host processor and coprocessor can be employed simultaneously, and

c) work can be distributed across multiple coprocessors.

In order to effect asynchronous data transfer, the specifiers signal and wait and #pragma offload_wait
are used. Complete information about asynchronous transfer can be found in the Intel C++ Compiler reference
[10].

1  #pragma offload_transfer target(mic:0) in(data : length(N)) signal(data)
2
3  // Execution will not block until transfer is complete.
4  // The function called below will be executed concurrently with data transfer.
5  SomeOtherFunction(otherData);
6
7  #pragma offload target(mic:0) wait(data) \
8    nocopy(data : length(N)) out(result : length(N))
9  { // Function Calculate() will not be launched until data are transferred
10   Calculate(data, result);
11 }
pa
re

Listing 2.27: Illustration of asynchronous data transfer and wait clause.


The example in Listing 2.27 illustrates the use of asynchronous transfer pragmas. In this code, #pragma
offload_transfer initiates the transfer, and the specifier signal indicates that it should be asynchronous.
With asynchronous offload, SomeOtherFunction() will be executed concurrently with data transport. The
second pragma statement, #pragma offload, performs an offloaded calculation. Specifier wait(data)
indicates that the offloaded calculation should not start until the data transport signaled by data has been
completed. Any pointer variable can serve as the signal, not just the pointer to the array being transferred.
Besides including the wait clause in an offload pragma, the compiler supports the offload_wait
pragma, which is illustrated in Listing 2.28.

1 #pragma offload_wait target(mic:0) wait(data)

Listing 2.28: Illustration of #pragma offload_wait.

Here, the host code execution will wait at this pragma until the transport signaled by data has finished.
This pragma is useful when it is not necessary to initiate another offload or data transfer at the synchronization
point.
Similarly to asynchronous data transfer, function offload can be done asynchronously, as shown in
Listing 2.29.


1 char* offload0;
2 char* offload1;
3
4 #pragma offload target(mic:0) signal(offload0) \
5 in(data0 : length(N)) out(result0 : length(N))
6 { // Offload will not begin until data are transferred
7 Calculate(data0, result0);
8 }
9
10 #pragma offload target(mic:1) signal(offload1) \
11 in(data1 : length(N)) out(result1 : length(N))
12 { // Offload will not begin until data are transferred
13 Calculate(data1, result1);
14 }
15
16 #pragma offload_wait target(mic:0) wait(offload0)
17 #pragma offload_wait target(mic:1) wait(offload1)

Listing 2.29: Illustration of asynchronous offload to different coprocessors.

In this code, two coprocessors are employed simultaneously using asynchronous offloads. More information
on managing multiple coprocessors in a system with the explicit offload model can be found in Section 2.4.1.

2.2.11 Review: Core Language Constructs of the Explicit Offload Model


The key language constructs used for the offload model programming are listed below. For complete
description, refer to the Intel C++ compiler reference [11].

• __attribute__((target(mic))) is a declaration qualifier that indicates that the declared object


(variable or function) must be compiled into the target code.

1 __attribute__((target(mic))) int data[1000];

1  int __attribute__((target(mic)))
2  CountNonzero(const int N, const int* arr) {
3    int nz=0;
4    for (int i = 0; i < N; i++) {
5      if (arr[i] != 0) nz++;
6    }
7    return nz;
8  }
8 }

W
ng
Listing 2.30: Illustration of __attribute__((target(mic))) usage. The code in the top panel indicates

e
that global variable data may be used in coprocessor code. The code in the top panel indicates that function
CountNonzero() may be used in coprocessor code.
nh

Examples in Listing 2.30 show how the non-scalar variable data can be made visible in the scope of
the target code and how the function CountNonzero() can be compiled for the coprocessor.
pa

• #pragma offload_attribute(push, target(mic)) and


re

#pragma offload_attribute(pop) can be used instead of the qualifier


yP

__attribute__((target(mic))) when multiple consecutive elements in a source file need to


be included in the offload code.
el
iv
us
cl

#pragma offload_attribute(push, target(mic))


Ex

1
2
3 double* ptrdata; // Apply the offload qualifier to a pointer-based array,
4 void MyFunction(); // a function
5 #include "myvariables.h" // or even a whole file
6
7 #pragma offload_attribute(pop)

Listing 2.31: Illustration of offload_attribute(push) usage.

The example in Listing 2.31 specifies that several arrays and all of the variables and functions declared in
the header file myvariables.h should be accessible to the coprocessor code. Note that these must
be global variables.

• #pragma offload_transfer target(mic) requests that certain non-scalar data must be


copied to the coprocessor. This pragma takes a number of clauses to specify data traffic. These clauses
are described in Section 2.2.9.


1  #pragma offload_transfer target (mic:0) \
2    in(ptrdata : length(N)) \
3    alloc_if(1) free_if(0)

Listing 2.32: Illustration of #pragma offload_transfer usage.

The single pragma in Listing 2.32 requests that the array ptrdata, which contains N elements, be
transferred in, i.e., from the host to coprocessor number 0, and that the memory on the coprocessor
must be allocated before the offload, but not freed afterwards. The symbol “\” is used to break the
pragma code into several lines. This is a blocking operation, which means that code execution will stop
until the transfer is complete. It is also possible to request a non-blocking (asynchronous) offload, using
the signal clause, as described in Section 2.2.9.
In this example, no operations will be applied to the transferred data. In order to request some processing
along with data transfer, #pragma offload should be used, as described below.

• #pragma offload target(mic) specifies that the code following this pragma must be executed
on the coprocessor if possible. Note that code offload using this method blocks until the offloaded
code returns. This pragma takes a number of clauses to specify data traffic. These clauses are described
in Section 2.2.9.
1  #pragma offload target (mic) in(ptrdata : length(N))
2  {
3    ct = CountNonzero(N, ptrdata);
4  }

Listing 2.33: Illustration of #pragma offload usage.



The function CountNonzero() in the example shown in Listing 2.33 will be offloaded to the coprocessor
if possible. Code execution on the host blocks until the offloaded code returns from the coprocessor.
Note that the scalar variables N and ct are lexically visible to both the host and the coprocessor and are
therefore transferred automatically.


2.3 MYO (Virtual-Shared) Memory Model


An alternative to the offload model, a virtual-shared memory approach called MYO, eliminates the need
for data marshaling (i.e., #pragma offload) by utilizing a software emulation of memory shared between
multi-core and many-core processors residing in a single system. MYO is an acronym for “Mine Yours Ours”
and refers to the software abstraction that shares memory within a system for determining current access
and ownership privileges. MYO is a run-time user-mode library. The virtual-shared memory model is only
available in C++; it is not available in Fortran.
The MYO approach is more implicit in nature than the offload model, which has several advantages. It
allows synchronization of data between the multi-core processors and an Intel Xeon Phi coprocessor, with
compiler support enabling allocation of data at the same virtual addresses on the host and the coprocessor.
MYO allows sharing complex (i.e., not bitwise-copyable) objects without marshaling. For instance, C++
classes cannot be transferred using offload pragmas, but can be shared with the coprocessor in the virtual-shared
memory model. Importantly, MYO does not copy all shared variables upon synchronization. Instead, it only
copies the values that have changed between two synchronization points.
To summarize, the host processor and the Intel Xeon Phi coprocessor do not share physical or virtual
memory in hardware. In order to use the virtual-shared memory model, the programmer has to specify what
data should be shared and how it should be accessed by the target:

1. Programmer marks variables that need to be shared between the host system and the target.

2. The same shared variable can then be used in both host and coprocessor code.

3. Runtime automatically maintains coherence at the beginning and at the end of offload statements. Upon
entry to the offload code, data modified on the host are automatically copied to the target, and upon exit
from the offload call, data modified on the target are copied to the host.

4. Syntax is based on the keyword extensions: _Cilk_shared and _Cilk_offload.

5. Note that, despite _Cilk being a part of these keywords, the programmer is not limited to using Intel
Cilk Plus to parallelize the offloaded code. OpenMP, Pthreads, and other frameworks can be used within
the offloaded segment. See Section 3.2 for more information about Intel Cilk Plus and OpenMP.


2.3.1 Sharing Objects with _Cilk_shared and _Cilk_offload Keywords


In order to use the MYO capability, the programmer must mark data items that are meant to be shared
with the _Cilk_shared keyword, and use the _Cilk_offload keyword to invoke work on the Intel
Xeon Phi coprocessor. The syntax is illustrated by statement “x = _Cilk_offload func(y);”, which
means that the function func() is executed on the coprocessor, and the return value is assigned to the host
variable x. The function may read and modify shared data, and the modified values will be synchronized with
all processors at the end of offload. Listing 2.34 illustrates the usage of the MYO model.

1  #include <stdio.h>
2  #define N 1000
3  _Cilk_shared int ar1[N];
4  _Cilk_shared int ar2[N];
5  _Cilk_shared int res[N];
6
7  void initialize() {
8    for (int i = 0; i < N; i++) {
9      ar1[i] = i;
10     ar2[i] = 1;
11   }
12 }
13
14 _Cilk_shared void add() {
15 #ifdef __MIC__
16   for (int i = 0; i < N; i++)
17     res[i] = ar1[i] + ar2[i];
18 #else
19   printf("Offload to coprocessor failed!\n");
20 #endif
21 }
22
23 void verify() {
24   bool errors = false;
25   for (int i = 0; i < N; i++)
26     errors |= (res[i] != (ar1[i] + ar2[i]));
27   printf("%s\n", (errors ? "ERROR" : "CORRECT"));
28 }
29
30 int main(int argc, char *argv[]) {
31   initialize();
32   _Cilk_offload add(); // Function call on coprocessor:
33                        // ar1, ar2 are copied in, res copied out
34   verify();
35 }

Listing 2.34: Example of using virtual-shared memory and offloading a function call with _Cilk_shared and
_Cilk_offload. Note that, even though data are not explicitly passed from the host to the coprocessor,
function add(), executed on the coprocessor, has access to data initialized on the host.


2.3.2 Dynamically Allocating Virtual-Shared Objects


Dynamic memory shared between the host and the target must be allocated and deallocated with the
special functions listed below (note: not supported in Fortran). Details are available in the Intel C++ Compiler
Reference [12].
• Allocation: _Offload_shared_malloc and _Offload_shared_aligned_malloc;
• Deallocation: _Offload_shared_free and _Offload_shared_aligned_free.
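As a brief sketch of the aligned variants (assuming the aligned allocator takes the requested size followed by the alignment in bytes; the 64-byte alignment and the variable names are illustrative choices, not from the original text):

data = (_Cilk_shared int*)_Offload_shared_aligned_malloc(N*sizeof(int), 64);
// ... use data on the host and in offloaded code ...
_Offload_shared_aligned_free(data);
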
Listing 2.35 demonstrates how pointer-based data can be dynamically allocated and used in the target code.

1  #include <stdio.h>
2  #define N 10000
3  int* _Cilk_shared data; // Shared pointer to shared data
4  int _Cilk_shared sum;
5
6  _Cilk_shared void ComputeSum() {
7  #ifdef __MIC__
8    printf("Address of data[0] on coprocessor: %p\n", &data[0]);
9    sum = 0;
10   #pragma omp parallel for reduction(+: sum)
11   for (int i = 0; i < N; ++i)
12     sum += data[i];
13 #else
14   printf("Offload to coprocessor failed!\n");
15 #endif
16 }
17
18 int main() {
19   data = (_Cilk_shared int*)_Offload_shared_malloc(N*sizeof(float));
20   for (int i = 0; i < N; i++)
21     data[i] = i%2;
22   printf("Address of data[0] on host: %p\n", &data[0]);
23   _Cilk_offload ComputeSum();
24   printf("%s\n", (sum==N/2 ? "CORRECT" : "ERROR"));
25   _Offload_shared_free(data);
26 }

Listing 2.35: Using the _Offload_shared_malloc for dynamic virtual-shared memory allocation in C/C++.

Listing 2.36 demonstrates the result:

user@host% icpc shared-malloc.cc


user@host% ./a.out
Address of data[0] on host: 0x820000030
CORRECT
Address of data[0] on coprocessor: 0x820000030

Listing 2.36: Output of the code in Listing 2.35

Note that in the above listing, data is a global variable declared as a pointer to shared memory, and that it
is allocated using a special _Offload_shared_malloc call. Variables marked with the _Cilk_shared
keyword will be placed at the same virtual addresses on both the host and the coprocessor, and will synchronize
their values at the beginning and end of offload function calls marked with the _Cilk_offload keyword.


2.3.3 Virtual-Shared Classes


Listing 2.37 — Listing 2.39 illustrate how complex objects, such as structures and classes, can be shared
between the host and the target. Note that transferring bitwise-copyable structures is possible with virtual-
shared memory as well as with the explicit offload model, whereas sharing structures with pointer elements
and C++ classes is only possible using virtual-shared memory, as these objects are not bitwise-copyable.

1  #include <stdio.h>
2  #include <string.h>
3
4  typedef struct {
5    int i;
6    char c[10];
7  } person;
8
9  _Cilk_shared void SetPerson(_Cilk_shared person & p,
10                             _Cilk_shared const char* name, const int i) {
11 #ifdef __MIC__
12   p.i = i;
13   strcpy(p.c, name);
14   printf("On coprocessor: %d %s\n", p.i, p.c);
15 #else
16   printf("Offload to coprocessor failed.\n");
17 #endif
18 }
19
20 person _Cilk_shared someone;
21 char _Cilk_shared who[10];
22
23 int main(){
24   strcpy(who, "John");
25   _Cilk_offload SetPerson(someone, who, 1);
26   printf("On host: %d %s\n", someone.i, someone.c);
27 }
el
iv
us

Listing 2.37: C struct data will be synchronized in virtual-shared memory.



Listing 2.38 shows the result:

user@host % icpc shared-struct.cc


user@host % ./a.out
On host: 1 John
On coprocessor: 1 John

Listing 2.38: Output of the code in Listing 2.37

Note that in this example, function SetPerson accepts an argument initialized on the host; however, it
is executed on the coprocessor and produces an object (someone), which is later used on the host.
The example in Listing 2.37 and Listing 2.38 could also be implemented in the explicit offload model
using pragmas. However, a more complex object, such as the class shown in Listing 2.39 and Listing 2.40, can
only be shared in the virtual-shared memory model.


1  #include <stdio.h>
2  #include <string.h>
3
4  class _Cilk_shared Person {
5  public:
6    int i;
7    char c[10];
8
9    Person() {
10     i=0; c[0]='\0';
11   }
12
13   void Set(_Cilk_shared const char* name, const int i0) {
14 #ifdef __MIC__
15     i = i0;
16     strcpy(c, name);
17     printf("On coprocessor: %d %s\n", i, c);
18 #else
19     printf("Offload to coprocessor failed.\n");
20 #endif
21   }
22 };
23
24 Person _Cilk_shared someone;
25 char _Cilk_shared who[10];
26
27 int main(){
28   strcpy(who, "Mary");
29   _Cilk_offload someone.Set(who, 2);
30   printf("On host: %d %s\n", someone.i, someone.c);
31 }

Listing 2.39: C++ class data will be synchronized in virtual-shared memory.



Listing 2.40 shows the result:



user@host % icpc shared-class.cc
user@host % ./a.out
On host: 2 Mary
On coprocessor: 2 Mary

Listing 2.40: Output of the code in Listing 2.39


2.3.4 Placement Version of Operator new for Shared Classes


In order to allocate virtual-shared data, special functions such as _Offload_shared_malloc must be
used. However, what if a virtual-shared class needs to be allocated? Regular usage of operator new to
allocate memory and call the class constructor is not applicable in this case, because it allocates local,
and not shared, memory. However, the placement version of operator new can be used in tandem with
_Offload_shared_malloc in order to create a virtual-shared class. This method is shown in Listing 2.41.

1 #include <stdio.h>
2 #include <stdlib.h>
3 #include <new>
4
5 class _Cilk_shared MyClass {
6   int i;
7 public:
8   MyClass(){ i = 0; };
9   void set(const int l) { i = l; }
10  void print(){
11 #ifdef __MIC__
12    printf("On coprocessor: ");
13 #else
14    printf("On host: ");
15 #endif
16    printf("%d\n", i);
17  }
18 };
19
20 MyClass* _Cilk_shared sharedData;
21
22 int main()
23 {
24   const int size = sizeof(MyClass);
25   _Cilk_shared MyClass* address = (_Cilk_shared MyClass*) _Offload_shared_malloc(size);
26   sharedData = new( address ) MyClass;
27
28   sharedData->set(1000);              // Shared data initialized on host
29   _Cilk_offload sharedData->print();  // Shared data used on coprocessor
30   sharedData->print();                // Shared data used on host
31 }

Listing 2.41: Using the placement version of operator new to allocate a virtual-shared object of type MyClass.

The placement version of operator new is made available by including the header file <new>. The
presence of the argument address instructs the operator new that memory has already been allocated for
the object at that address, and that only the constructor of the class needs to be called.
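The cleanup mirrors the allocation: because the memory came from _Offload_shared_malloc rather than from a plain new, the delete operator must not be applied to sharedData. A minimal sketch of the teardown, written as a continuation of main() in Listing 2.41 and assuming that _Offload_shared_free is available as the deallocation counterpart of _Offload_shared_malloc, could look as follows.

  // Continuation of main() from Listing 2.41 (sketch):
  sharedData->~MyClass();          // run the destructor explicitly; no memory is freed here
  _Offload_shared_free(address);   // release the virtual-shared memory block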
The result of the code in Listing 2.41 is shown in Listing 2.42.

user@host % ./a.out
On host: 1000
On coprocessor: 1000

Listing 2.42: Using a virtual-shared class.


2.3.5 Summary of Language Extensions for the MYO Model


Table 2.1 summarizes the usage of _Cilk_shared and _Cilk_offload keywords.

Entity                      Syntax                               Semantics

Function                    int _Cilk_shared f(int x)            Executable code for both host and
                            { return x+1; }                      target; may be called from either side

Global variable             _Cilk_shared int x = 0;              Visible on both sides

File/Function static        static _Cilk_shared int x;           Visible on both sides, only to code
                                                                 within the file/function

Class                       class _Cilk_shared x {...};          Class methods, members, and
                                                                 operators are available on both sides

Pointer to shared data      int _Cilk_shared *p;                 p is local (not shared),
                                                                 can point to shared data

A shared pointer            int *_Cilk_shared p;                 p is shared;
                                                                 should only point at shared data

Offloading                  x = _Cilk_offload func(y);           func executes on coprocessor
a function call                                                  if possible

                            x = _Cilk_offload_to(n)              func must be executed on
                            func(y);                             specified (n-th) coprocessor

Offloading                  x = _Cilk_spawn _Cilk_offload        Non-blocking offload
asynchronously              func(y);

Offload a                   _Cilk_offload                        Loop executes in parallel on target.
parallel for-loop           _Cilk_for(i=0; i<N; i++)             The loop is implicitly outlined
                            { a[i] = b[i] + c[i] }               as a function call

Table 2.1: Keywords _Cilk_shared and _Cilk_offload usage for data and functions.


2.4 Multiple Intel Xeon Phi Coprocessors in a System and Clusters with Intel Xeon Phi Coprocessors
We have discussed earlier in this chapter how to use a single Intel Xeon Phi coprocessor using native (or MPI)
applications, the explicit offload model, or the virtual-shared memory model. This section describes how
multiple coprocessors can be used in these programming models.
There are several ways to employ multiple Intel Xeon Phi coprocessors simultaneously. The best method
depends on the structure and parallel algorithm of the application.
In distributed memory applications using MPI, there exists a multitude of methods for utilizing multiple
hosts and multiple devices (see Section 3.3.1). However, all of these methods can be placed into one of the
following two categories:
(1) MPI processes run only on hosts and perform offload to coprocessors, and
(2) MPI processes run as native applications on coprocessors (or on coprocessors as well as hosts).

For applications utilizing MPI in mode (1), and for offload applications using only a single host, multiple
coprocessors per host can be utilized using a combination of approaches described in Section 2.4.1 and
Section 2.4.2:

(1a) spawning multiple threads on the host, each performing offload to the respective coprocessor, and

(1b) performing asynchronous offloads from one host thread.

For MPI applications in mode (2), scaling across multiple coprocessors occurs naturally.

We will start with the discussion of the offload model, in its explicit implementation and in the MYO
variation, and then proceed to discussing the usage of MPI for heterogeneous applications with multiple Intel
Xeon Phi coprocessors. Note that this section is not a tutorial on OpenMP, Intel Cilk Plus or MPI. Refer to
Chapter 3 for information on expressing parallelism using these frameworks.


2.4.1 Using Several Coprocessors in the Explicit Offload Model


Querying the Number of Devices
In the host code, the number of available Intel Xeon Phi coprocessors can be queried with a call to the
function _Offload_number_of_devices(), as shown in Listing 2.43.

1 #include <stdio.h>
2
3 _Cilk_shared int numDevices;
4
5 int main() {
6 numDevices = _Offload_number_of_devices();
7 printf("Number of available coprocessors: %d\n\n" ,numDevices);
8 }

Listing 2.43: _Offload_number_of_devices() will return the number of Intel Xeon Phi coprocessors in the system.

Note: at the time of the writing of this document, the Intel C Compiler version 13.0.1 recognizes
the function _Offload_number_of_devices() only if (a) at least one _Cilk_shared variable or
function is declared in the compiled code, or (b) the code is compiled with the argument -offload-build.
Specifying an Explicit Offload Target
With several Intel Xeon Phi coprocessors installed in a system, it is possible to request offload to a
specific coprocessor. This has been demonstrated in Listing 2.28, where mic:0 indicates that the offload must
be performed to the first coprocessor in the system. Another example is shown in Listing 2.44.

1 #pragma offload target(mic: 0)
2 {
3   foo();
4 }

Listing 2.44: target(mic:0) directs the offload to the first Intel Xeon Phi coprocessor in the system.

Specifying a target number of 0 or greater indicates that the call applies to the coprocessor with the
corresponding zero-based number. For a target number greater than or equal to the coprocessor count, the
offload will be directed to the coprocessor equal to the target number modulo device count. For example, with
4 coprocessors in the system, mic:1, mic:5, mic:9, etc., direct offload to the second coprocessor.
Specifying mic:-1 instead will invite the runtime system to choose a coprocessor or fail if none are
found.
In applications using asynchronous offloads, specifying target numbers is critical, as waiting for a signal
from the wrong coprocessor can result in the code hanging. The same applies to applications that use data
persistence on the coprocessor. If a persistent array is allocated on a specific coprocessor, but an offload
pragma tries to re-use that array on a different coprocessor, a runtime error will occur.
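As an illustration of the modulo rule, independent work items can be dealt out to the available coprocessors in round-robin fashion by deriving the target number from the item index. The sketch below is hypothetical: do_work is assumed to be a function declared with __attribute__((target(mic))), and nItems is the number of independent work items.

  int n_d = _Offload_number_of_devices();
  for (int item = 0; item < nItems; item++) {
    // item % n_d cycles through coprocessors 0, 1, ..., n_d-1;
    // target(mic: item) would behave identically, because the runtime
    // applies the same modulo rule to target numbers itself.
#pragma offload target(mic: item % n_d)
    do_work(item);
  }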


Multiple Blocking Offloads from Host in Parallel Threads


One of the core methods to employ multiple coprocessors, described in this section, is to spawn several
host tasks or threads, and perform an offload to the respective device from every thread. In order to do that,
Pthreads, OpenMP or Intel Cilk Plus may be used. Listing 2.45 illustrates this approach using OpenMP.

1 #include <stdlib.h>
2 #include <stdio.h>
3
4 __attribute__((target(mic))) int* response;
5
6 int main(){
7   int n_d = _Offload_number_of_devices();
8   if (n_d < 1) {
9     printf("No devices available!");
10    return 2;
11  }
12  response = (int*) malloc(n_d*sizeof(int));
13  response[0:n_d] = 0;
14 #pragma omp parallel for
15  for (int i = 0; i < n_d; i++) {
16    // The body of this loop is executed by n_d host threads concurrently
17 #pragma offload target(mic:i) inout(response[i:1])
18    {
19      // Each offloaded segment blocks the execution of the thread that launched it
20      response[i] = 1;
21    }
22  }
23  for (int i = 0; i < n_d; i++)
24    if (response[i] == 1) {
25      printf("OK: device %d responded\n", i);
26    } else {
27      printf("Error: device %d did not respond\n", i);
28    }
29 }

Listing 2.45: Illustration of employing several Intel Xeon Phi coprocessors simultaneously using multiple host threads.

This code must be compiled with the compiler argument -openmp in order to enable the processing of the #pragma omp directives.

The for-loop in line 15 is executed in parallel on the host. Today’s computing systems support a maximum
of eight Intel Xeon Phi coprocessors, and the number of CPU cores in these systems is no less than eight.
Therefore, the default behavior of this parallel loop is to launch all n_d host threads simultaneously. Each
thread executes its own offloaded segment, and all offloaded segments will therefore run concurrently. See
Section 3.2.3 for more information about parallel loops in OpenMP.
In high performance applications, algorithms generally distribute work across available coprocessors.
The example in Listing 2.45 illustrates one of the language constructs that may be used for work distribution.
The clause inout(response[i:1]) indicates that only a segment of array response should be sent in
and out of coprocessor mic:i, namely, the segment starting with the index i and having a length of 1. This
is an example of Intel Cilk Plus array notation further discussed in Section 3.1.7.
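The same clause can be used to split one large array between the coprocessors so that each device transfers and processes only its own contiguous chunk. The following sketch is not part of Listing 2.45: the array data (declared with __attribute__((target(mic)))) and its length N, assumed divisible by n_d for simplicity, are hypothetical.

  const int chunk = N / n_d;   // N is assumed to be divisible by n_d
#pragma omp parallel for
  for (int i = 0; i < n_d; i++) {
#pragma offload target(mic:i) inout(data[i*chunk : chunk])
    {
      // Coprocessor i receives, processes and returns only its own chunk
      for (int j = i*chunk; j < (i+1)*chunk; j++)
        data[j] *= 2.0f;
    }
  }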


Multiple Asynchronous Offloads from a Single Host Thread


Another way to employ several coprocessors does not involve spawning multiple host threads. Instead, a
single host thread spawns multiple asynchronous (i.e., non-blocking) offloads. This approach is illustrated in
Listing 2.46.

1 #include <stdlib.h>
2 #include <stdio.h>
3
4 __attribute__((target(mic))) int* response;
5
6 int main(){
7   int n_d = _Offload_number_of_devices();
8   if (n_d < 1) {
9     printf("No devices available!");
10    return 2;
11  }
12  response = (int*) malloc(n_d*sizeof(int));
13  response[0:n_d] = 0;
14  for (int i = 0; i < n_d; i++) {
15 #pragma offload target(mic:i) inout(response[i:1]) signal(&response[i])
16    {
17      // The offloaded job does not block the execution on the host
18      response[i] = 1;
19    }
20  }
21
22  for (int i = 0; i < n_d; i++) {
23    // This loop waits for all asynchronous offloads to finish
24 #pragma offload_wait target(mic:i) wait(&response[i])
25  }
26
27  for (int i = 0; i < n_d; i++)
28    if (response[i] == 1) {
29      printf("OK: device %d responded\n", i);
30    } else {
31      printf("Error: device %d did not respond\n", i);
32    }
33 }

Listing 2.46: Illustration of employing several Intel Xeon Phi coprocessors simultaneously using asynchronous offloads.
The code sample shown above uses only one host thread, but this thread spawns multiple offloads in the
for-loop in line 14. The asynchronous nature of the offload is requested by the signal clause. Any pointer can be
chosen as a signal, as long as the pointer assigned to each coprocessor is unique. In this code, for simplicity,
the signal is chosen as a pointer to the array element sent to the respective coprocessor. The loop in line 22
waits for the signals; the arrival of each signal indicates the end of the corresponding offload.


2.4.2 Using Multiple Coprocessors in the MYO Model


With the MYO model, multiple coprocessors can be employed similarly to the explicit offload model.

Querying the Number of Coprocessors


Querying the number of coprocessors in the MYO model is done in the same way as in the explicit
offload model (see Section 2.4.1 and Listing 2.43).

Specifying Implicit Offload Target


If there are several Intel Xeon Phi coprocessors present, the programmer can choose which one to use for
offloading with the _Cilk_offload_to(number) keyword, as shown in Listing 2.47.

1 _Cilk_offload_to(i) func();

Listing 2.47: _Cilk_offload_to(i) will use Intel Xeon Phi coprocessor number i (counted from zero) for offloading. See also Section 2.4.1 for information about the rules of coprocessor specification.

Note that, even though the keyword _Cilk_spawn is a part of the parallel framework Intel Cilk Plus,
the programmer is not restricted to using Intel Cilk Plus for parallelizing the offloaded code. OpenMP and
Pthreads may be used as well.

Asynchronous Implicit Offload


In order to effect asynchronous offload in the MYO model, the keyword _Cilk_spawn should be
placed before _Cilk_offload_to.

1 _Cilk_spawn _Cilk_offload_to(i) func();



Listing 2.48: Asynchronous offload of a function to a specified target.



The keyword _Cilk_spawn is a part of Intel Cilk Plus, and therefore synchronization between spawned
offloads is done in the same way as with jobs spawned on the host, i.e., using _Cilk_sync. More information
can be found in Section 3.2.4.


Multiple Blocking Offloads in Threads


Listing 2.49 illustrates the same approach in the MYO model as Listing 2.45 in the explicit offload model.

1 #include <stdlib.h>
2 #include <stdio.h>
3
4 int _Cilk_shared *response;
5
6 void _Cilk_shared Respond(int _Cilk_shared & a) {
7   a = 1;
8 }
9
10 int main(){
11   int n_d = _Offload_number_of_devices();
12   if (n_d < 1) {
13     printf("No devices available!");
14     return 2;
15   }
16   response = (int _Cilk_shared *) _Offload_shared_malloc(n_d*sizeof(int));
17   response[0:n_d] = 0;
18   _Cilk_for (int i = 0; i < n_d; i++) {
19     // All iterations start simultaneously in n_d host threads
20     _Cilk_offload_to(i)
21       Respond(response[i]);
22   }
23   for (int i = 0; i < n_d; i++)
24     if (response[i] == 1) {
25       printf("OK: device %d responded\n", i);
26     } else {
27       printf("Error: device %d did not respond\n", i);
28     }
29 }

Listing 2.49: Illustration of employing several Intel Xeon Phi coprocessors simultaneously using multiple host threads.

In this case, the loop in line 18 is executed in parallel with the help of the Intel Cilk Plus library. It is
expected that the number of available Cilk Plus workers is greater than the number of coprocessors in the
system, and therefore all offloads will start simultaneously. See Section 3.2.3 for more information about
parallel loops in Intel Cilk Plus.


Multiple Asynchronous Offloads


OpenMP and Intel Cilk Plus can be inter-operated; however, in the case of nested parallelism, it is
preferable to use parallel Intel Cilk Plus constructs inside of parallel OpenMP constructs, and not vice-versa.
However, when spawned tasks execute on the coprocessor, this rule does not need to be followed. Listing 2.50
illustrates the same approach in the MYO model as Listing 2.46 in the explicit offload model.

1 #include <stdlib.h>
2 #include <stdio.h>
3
4 int _Cilk_shared *response;
5
6 void _Cilk_shared Respond(int _Cilk_shared & a) {
7   a = 1;
8 }
9
10 int main(){
11   int n_d = _Offload_number_of_devices();
12   if (n_d < 1) {
13     printf("No devices available!");
14     return 2;
15   }
16   response = (int _Cilk_shared *) _Offload_shared_malloc(n_d*sizeof(int));
17   response[0:n_d] = 0;
18   for (int i = 0; i < n_d; i++) {
19     _Cilk_spawn _Cilk_offload_to(i)
20       Respond(response[i]);
21   }
22   _Cilk_sync;
23   for (int i = 0; i < n_d; i++)
24     if (response[i] == 1) {
25       printf("OK: device %d responded\n", i);
26     } else {
27       printf("Error: device %d did not respond\n", i);
28     }
29 }

Listing 2.50: Illustration of employing several Intel Xeon Phi coprocessors simultaneously using asynchronous offloads spawned from a single host thread.


2.4.3 Running MPI Applications on Multiple Intel Xeon Phi Coprocessors.


In cluster environments, there are two fundamental approaches to running MPI jobs that employ Intel
Xeon Phi coprocessors.

(1) MPI processes run only on processors and perform offload to coprocessors attached to their respective
host. In this case, multiple coprocessors can be used as described in Section 2.4.1 and Section 2.4.2. It is
possible to employ this method with either the bridge or the static pair network topology of coprocessors
(see Section 1.5.2).

(2) MPI processes run as native applications on coprocessors (or on coprocessors as well as processors). The
procedure for employing multiple coprocessors with this approach is presented in this section. Note that in
order to use this approach with more than one host (i.e., on a cluster), the network connections of Intel
Xeon Phi coprocessors must be configured in the bridge topology, so that all coprocessors are directly
IP-addressable on the same private network as the hosts. However, in this section, we restrict the example
to a single host with multiple coprocessors, and therefore the network configuration is unimportant.

Code

Let us re-use the “Hello World” application for MPI shown in Listing 2.9. For convenience, this code is
reproduced in Listing 2.51.
1 #include "mpi.h"
2 #include <stdio.h>
r
fo

3 #include <string.h>
4
d

5 int main (int argc, char *argv[]) {


re

6 int i, rank, size, namelen;


pa

7 char name[MPI_MAX_PROCESSOR_NAME];
re

8
MPI_Init (&argc, &argv);
yP

9
10
el

11 MPI_Comm_size (MPI_COMM_WORLD, &size);


MPI_Comm_rank (MPI_COMM_WORLD, &rank);
iv

12
MPI_Get_processor_name (name, &namelen);
us

13
14
cl

15 printf ("Hello World from rank %d running on %s!\n", rank, name);


Ex

16 if (rank == 0) printf("MPI World size = %d processes\n", size);


17
18 MPI_Finalize ();
19 }

Listing 2.51: Source code HelloMPI.c of a “Hello world!” program with MPI.

Note that we assume that the MPI library has been NFS-shared with both coprocessors as discussed in
Section 1.5.4.


Launching MPI Applications on Coprocessor from Host

In order to run this code on two coprocessors attached to the machine, let us first see how we can launch
an MPI job on the coprocessor from the host (see Listing 2.52). In this case, an additional environment variable,
I_MPI_MIC, must be set on the host, and the argument -host mic0 must be passed to mpirun.

user@host% source /opt/intel/impi/4.1.0/intel64/bin/mpivars.sh


user@host% mpiicc -o HelloMPI.MIC HelloMPI.c -mmic
user@host% scp HelloMPI.MIC mic0:~/
HelloMPI.MIC 100% 12KB 12.4KB/s 00:00
user@host% export I_MPI_MIC=1
user@host% mpirun -host mic0 -n 2 ~/HelloMPI.MIC
Hello World from rank 1 running on mic0!
Hello World from rank 0 running on mic0!
MPI World size = 2 ranks
user@host%

Listing 2.52: Launching an Intel MPI application from the host.


MPI Applications on Multiple Coprocessors


In order to start the application on two coprocessors, we can specify the list of hosts and their respective
parameters using the separator ‘:’, as shown in Listing 2.53. In the same way, applications on remote
coprocessors and remote hosts can be launched, if the bridged network topology makes these remote
coprocessors or hosts directly addressable on the private network of the host.

user@host% scp HelloMPI.MIC mic1:~/
HelloMPI.MIC 100% 12KB 12.4KB/s 00:00
user@host% mpirun -host mic0 -n 2 ~/HelloMPI.MIC : -host mic1 -n 2 ~/HelloMPI.MIC
Hello World from rank 2 running on mic1!
Hello World from rank 3 running on mic1!
Hello World from rank 1 running on mic0!
Hello World from rank 0 running on mic0!
MPI World size = 4 ranks

Listing 2.53: Launching an Intel MPI application on two coprocessors from the host.

MPI Machine File

In practice, in order to run jobs on hundreds or thousands of hosts and coprocessors, mpirun accepts a
file with the list of machines instead of individual hosts separated with ‘:’, as demonstrated in Listing 2.54.


user@host% cat mymachinefile.txt


mic0
mic1
user@host% mpirun -machinefile mymachinefile.txt -n 4 ~/HelloMPI.MIC
Hello World from rank 3 running on mic1!
Hello World from rank 1 running on mic1!
Hello World from rank 2 running on mic0!
Hello World from rank 0 running on mic0!
MPI World size = 4 ranks

Listing 2.54: Launching an Intel MPI application using a machine file.
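The machine file can also carry the number of processes to be started on each machine, which is convenient when hosts and coprocessors should receive different process counts. The session below is a hypothetical variant of Listing 2.54 and assumes that the installed Intel MPI version accepts the hostname:count form in machine files.

user@host% cat mymachinefile.txt
mic0:2
mic1:2
user@host% mpirun -machinefile mymachinefile.txt -n 4 ~/HelloMPI.MIC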



Heterogeneous MPI Applications: Host and Coprocessor(s)


It is also possible to add some processes executing on the host, as shown in Listing 2.55.

user@host% mpiicc -o ~/HelloMPI.XEON HelloMPI.c


user@host% mpirun -host mic0 -n 2 ~/HelloMPI.MIC : -host mic1 -n 2 ~/HelloMPI.MIC : \
% -host localhost -n 2 ~/HelloMPI.XEON
Hello World from rank 5 running on localhost!
Hello World from rank 4 running on localhost!
Hello World from rank 2 running on mic1!
Hello World from rank 3 running on mic1!
Hello World from rank 1 running on mic0!
Hello World from rank 0 running on mic0!
MPI World size = 6 ranks

Listing 2.55: Launching an Intel MPI application on two coprocessors and the host itself. The symbol ‘\’ in the second
line indicates the continuation of the shell command onto the next line.

Peer to Peer Communication between Coprocessors

Note that in order for MPI jobs on two or more coprocessors to work, they must be able to communicate
with each other via TCP/IP. In order to check whether IP packets can travel from one coprocessor to another,
one can log in to the coprocessor and try to ping another coprocessor. If this test fails, the administrator must
check that packet forwarding is enabled on the host. Enabling packet forwarding can be done by editing the
file /etc/sysctl.conf and ensuring that the following line is present (or changing 0 to 1 otherwise):
net.ipv4.ip_forward = 1
Listing 2.56: Enabling packet forwarding in the host file /etc/sysctl.conf to facilitate peer to peer communication between coprocessors.

If the file /etc/sysctl.conf was edited, the settings will become effective after a system reboot. It is
also possible to enable packet forwarding for the current session using the command shown in Listing 2.57.

user@host% sudo /sbin/sysctl -w net.ipv4.ip_forward=1

Listing 2.57: Enabling packet forwarding on the host.

If /etc/sysctl.conf has been edited as described above, it is not necessary to run this command again after a reboot.
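As an illustration of the connectivity check mentioned above, a session along the following lines (host names mic0 and mic1 as in the previous examples; the prompt shown on the coprocessor is assumed) verifies that packets can travel from one coprocessor to the other:

user@host% ssh mic0
user@mic0% ping -c 3 mic1

If the ping succeeds, the coprocessors can reach each other, and MPI jobs spanning both devices should be able to communicate.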


Chapter 3

Expressing Parallelism

Chapter 2 discussed the methods of data sharing in applications employing Intel Xeon Phi coprocessors.
This chapter introduces parallel programming language extensions of C/C++ supported by the Intel C++
Compiler for programming the Intel Xeon and Intel Xeon Phi architectures. It discusses data parallelism
(SIMD instructions and automatic vectorization), shared-memory thread parallelism (OpenMP, Intel Cilk Plus)
and distributed-memory process parallelism with message passing (MPI). The purpose of this chapter is to
introduce parallel programming paradigms and language constructs, rather than to provide optimization advice.
For optimization, refer to Chapter 4.

3.1 Data Parallelism in Serial Codes


This section introduces vector instructions (SIMD parallelism) in Intel Xeon processors and Intel Xeon
Phi coprocessors and outlines the Intel C++ Compiler support for these instructions. Vector operations
illustrated in this section can be used in both serial and multi-threaded codes; however, examples are limited to
serial codes for simplicity. This section introduces the language extensions and concepts of SIMD calculations;
optimization practices for vectorization are discussed in Section 4.3.

3.1.1 SIMD Operations: Concept and History


Most processor architectures today include SIMD (Single Instruction Multiple Data) parallelism in the
form of a vector instruction set. SIMD instructions are designed to apply the same mathematical operation to
several integer or floating-point numbers. The following pseudocode illustrates SIMD instructions:

Scalar Loop                               SIMD Loop

1 for (i = 0; i < n; i++)                 1 for (i = 0; i < n; i += 4)
2   A[i] = A[i] + B[i];                   2   A[i:(i+4)] = A[i:(i+4)] + B[i:(i+4)];

Listing 3.1: This pseudocode illustrates the concept of SIMD operations. The SIMD loop (right) performs 1/4 the number
of iterations of the scalar loop (left), and each addition operator acts on 4 numbers at a time (i.e., addition here is a single
instruction for multiple data elements).

The maximum potential speedup of this SIMD-enabled calculation with respect to a scalar version is equal
to the number of values held in the processor’s vector registers. In the example in Listing 3.1, this factor is


equal to 4. The practical speedup with SIMD depends on the width of vector registers, type of scalar operands,
type of instruction and associated memory traffic. See Section 3.1.2 for more information about various SIMD
instruction sets. Additional reading for code vectorization can be found in the book [13] by Aart Bik, former
lead architect of automatic vectorization in the Intel compilers.

3.1.2 MMX, SSE, AVX and IMCI Instruction Sets


SIMD instructions typically include common arithmetic operations (addition, subtraction, multiplication
and division), as well as comparisons, reduction and bit-masked operations. Section “Intrinsics” of the Intel
C++ Compiler Reference [14] contains complete information on the instruction sets and operations within
them supported by the Intel Compilers.
One of the most important factors determining the theoretical maximum speedup of a vector instruction
set is the width of vector registers. Table 3.1 below lists the instruction sets along with the number types and
operations supported by the Intel compiler.

Instruction Set   Year and Intel Processor   Vector registers   Packed Data Types

MMX               1997, Pentium              64-bit             8-, 16- and 32-bit integers
SSE               1999, Pentium III          128-bit            32-bit single precision FP (floating-point)
SSE2              2001, Pentium 4            128-bit            8 to 64-bit integers; single & double prec. FP
SSE3–SSE4.2       2004 – 2009                128-bit            (additional instructions)
AVX               2011, Sandy Bridge         256-bit            single and double precision FP
AVX2              2013, (future) Haswell     256-bit            integers, additional instructions
IMCI              2012, Knights Corner       512-bit            32- and 64-bit integers; single & double prec. FP
Table 3.1: History of SIMD instruction sets supported by the Intel processors. Processors supporting modern instruction sets are backward-compatible with older instruction sets. The Intel Xeon Phi coprocessor is an exception to this trend, supporting only the IMCI (“Initial Many-Core Instructions”) instruction set.



3.1.3 Is Your Code Using SIMD Instructions?


Even if you did not know about SIMD instructions before, or did not make specific efforts to employ
them in your code, your application may already be using SIMD parallelism.
Some high-level mathematics libraries, such as the Intel MKL, contain implementations of common
operations for linear algebra, signal analysis, statistics, etc., which use SIMD instructions. In codes where
performance-critical calculations call such library functions, vectorization is employed without burdening
the programmer. Whenever your application performs an operation that can be expressed as an Intel MKL
library function, the easiest way to vectorize this operation is to call the library implementation. This applies
to workloads for the Intel Xeon and Intel Xeon Phi architectures alike.
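For instance, the element-wise addition used throughout this section maps directly onto the BLAS saxpy routine. The sketch below is illustrative rather than taken from the text; it assumes that Intel MKL's CBLAS interface is available through the header mkl.h, and the library dispatches to an implementation vectorized for the processor it runs on.

#include <mkl.h>

// Computes a[i] += b[i] for i = 0..n-1 using the MKL saxpy kernel (y := alpha*x + y).
void add_arrays(const int n, float *a, const float *b) {
  cblas_saxpy(n, 1.0f, b, 1, a, 1);
}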
In original high performance codes, SIMD operations may be implemented by the compiler through a
feature known as automatic vectorization. Automatic vectorization is enabled at the default optimization level
-O2. However, in order to gain the most from automatic vectorization, the programmer must organize data
and loops in a certain way, as described further in this section. Automatic vectorization is the most convenient
way to employ SIMD, because cross-platform porting is performed by the compiler.
Finally, SIMD instructions may be called explicitly via assembly code or vector intrinsics. This method
may sometimes yield better performance than automatic vectorization, but cross-platform porting is difficult.
The rest of this section explains how to ensure that user code is vectorized.


3.1.4 Data Alignment


Before demonstrating how to utilize SIMD instructions in codes for Intel Xeon processors and Intel
Xeon Phi coprocessors, let us diverge into a very important related topic: data alignment. A prerequisite for
successful use of SIMD instructions is placing the data to be processed at a memory address which allows for
aligned data access. This typically means that the address is a multiple of the vector register width in bytes.
For example, 128-bit SSE load and store instructions require 16-byte alignment, and 512-bit KNC load and
store instructions require 64-byte alignment.
Strictly speaking, the definition of data alignment is the following. Pointer p is said to address a memory
location aligned on an n-byte boundary if ((size_t)p%n==0).
Complete information on data alignment with the Intel C++ compiler can be found in the compiler
reference [15]. This section summarizes compiler support for data alignment and provides practical examples.
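The definition above translates directly into a small run-time check that can be used, for example, in debugging assertions. The helper below is an illustrative sketch, not a function provided by the compiler.

#include <stddef.h>

// Returns 1 if pointer p is aligned on an n-byte boundary, 0 otherwise.
// n is expected to be a power of two (e.g., 16 for SSE, 64 for IMCI).
static int is_aligned(const void *p, size_t n) {
  return ((size_t)p % n) == 0;
}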

Data Alignment on the Stack


When alignment of data on the stack is necessary, the __declspec(align) qualifier can be used, as
shown in Listing 3.2.

1 __declspec(align(64)) float A[n];

Listing 3.2: Allocating a stack array A aligned on a 64-byte boundary.

An alternative method, preferred in Linux, is the keyword __attribute__((aligned(X))), demonstrated in Listing 3.3.



1 float A[n] __attribute__((aligned(64)));



Listing 3.3: Alternative way of creating a stack array A aligned on a 64-byte boundary.

In both examples shown above, array A will be placed in memory in such a way that the address of A[0]
is a multiple of 64, i.e., aligned on a 64-byte boundary. Note that setting a very high alignment value may
lead to a significant fraction of virtual memory being wasted. Also remember that the boundary value must be a
power of two. See the Intel C++ Compiler Reference [16] for more information.

Alignment of Memory Blocks on the Heap


With the Intel C++ compiler, aligned arrays can be allocated with the _mm_malloc and _mm_free
intrinsics [17], which replace the unaligned malloc and free calls. In order to access these intrinsics, the
developer must include the header file malloc.h. See usage example in Listing 3.4.

1 #include <malloc.h>
2 // ...
3 float *A = (float*)_mm_malloc(n*sizeof(float), 16);
4 // ...
5 _mm_free(A);

Listing 3.4: Allocating and freeing a memory block aligned on a 16-byte boundary with _mm_malloc/_mm_free.


An alternative way to achieve alignment when allocating memory on the heap is to use the malloc call
to allocate a block of memory slightly larger than needed, and then point a new pointer to an aligned address
within that block. The advantage of this method is that it can be used in compilers that do not support the
_mm_malloc/_mm_free calls. See Listing 3.5 for an example of this procedure.

1 #include <stdlib.h>
2 // ...
3 char *holder = (char*) malloc(bytes+32-1); // Not guaranteed to be aligned
4 size_t offset = (32-((size_t)(holder))%32)%32; // From holder to nearest aligned addr.
5 float *ptr=(float*) ((char*)(holder) + offset); // ptr[0] aligned on 32-byte boundary
6 // ...
7 free(holder); // use original pointer to deallocate memory

Listing 3.5: Allocating and freeing a memory block aligned on a 32-byte boundary with malloc and free. Note that in
this case, the pointer ptr should be used to access data, but memory must be free-d via holder.

Alignment of Objects Created with the Operator new

In C++, the operator new does not guarantee alignment. In order to align a C++ class on a boundary, the
programmer can allocate an aligned block of memory using one of the methods shown above, and then use the
placement version of the operator new as shown in Listing 3.6. Naturally, if this method is used for objects of
derived types (classes and structures), then the internal structure of these types must be designed in such a way
that the data used for vector operations is aligned.

1 #include <new>
2 // ...
3 void *buf = _mm_malloc(sizeof(MyClass), 64); // buf[0] is aligned on a 64-byte boundary
4 MyClass *ptr = new (buf) MyClass; // placing MyClass without allocating new memory
5 // ...
6 ptr->~MyClass();
7 _mm_free(buf);

Listing 3.6: Placing an object of type MyClass into a memory block aligned on a 64-byte boundary. Note that the delete operator should not be called on ptr; instead, the destructor should be run explicitly, followed by freeing the allocated memory block.


3.1.5 Using SIMD with Inline Assembly Code, Compiler Intrinsics and Class Libraries
SIMD instructions can be explicitly called from the user code using inline assembly, compiler intrinsics
and class libraries. Note that this method of using SIMD instructions is not recommended, as it limits the
portability of the code across different architectures. For example, porting a code that runs on Intel Xeon
processors and uses AVX intrinsics to run on Intel Xeon Phi coprocessors with IMCI intrinsics requires that
the portion of code with intrinsics is completely re-written. Instead of explicit SIMD calls, developers are
encouraged to employ automatic vectorization with methods described in Section 3.1.6 through Section 3.1.10.
However, this section provides information about intrinsics for reference.

Inline Assembly Code


While inline assembly code is a very fine-grained method of using processor instructions, it is beyond
the scope of this training, and interested readers can refer to the compiler documentation [18]. For an in-depth
review of the Intel Xeon Phi coprocessor instruction set, refer to the document [19]. An alternative to inline
assembly, compiler intrinsics, provide the same level of control and performance, while keeping the code more
readable. The next section introduces the use of compiler intrinsics.

Intel Compilers Intrinsics
For every instruction set supported by the Intel compiler, there is a corresponding header file that declares
the corresponding short vector types and vector functions. Table 3.2 lists these header files.

Instruction Set   Include header file

MMX               mmintrin.h
SSE               xmmintrin.h or ia32intrin.h
SSE2              emmintrin.h or ia32intrin.h
SSE3              pmmintrin.h or ia32intrin.h
SSSE3             tmmintrin.h or ia32intrin.h
SSE4              smmintrin.h and nmmintrin.h
AVX               immintrin.h
AVX2              immintrin.h
IMCI              immintrin.h

Table 3.2: Header files for the Intel C++ Compiler intrinsics.

In order to use intrinsics to apply arithmetic operations to data stored in memory:

1. the data has to be loaded into variables representing the content of vector registers;

2. intrinsic functions must be applied to these variables;

3. the data in resultant vector register variables must be stored back in memory.

In addition, in some cases, the data loaded into vector registers must be aligned, i.e., placed at a memory
address which is a multiple of a certain number of bytes. See Section 3.1.4 for more information on data
alignment.
Codes in Listing 3.7 illustrate using the SSE2 and IMCI intrinsics for the addition of two arrays shown in
Listing 3.1. Note that the stride of the loop variable i is 4 for the SSE2 code and 16 for the IMCI code.


1 for (int i=0; i<n; i+=4) {               1 for (int i=0; i<n; i+=16) {
2   __m128 Avec=_mm_load_ps(A+i);          2   __m512 Avec=_mm512_load_ps(A+i);
3   __m128 Bvec=_mm_load_ps(B+i);          3   __m512 Bvec=_mm512_load_ps(B+i);
4   Avec=_mm_add_ps(Avec, Bvec);           4   Avec=_mm512_add_ps(Avec, Bvec);
5   _mm_store_ps(A+i, Avec);               5   _mm512_store_ps(A+i, Avec);
6 }                                        6 }

Listing 3.7: Addition of two arrays using SSE2 intrinsics (left) and IMCI intrinsics (right). These codes assume that
the arrays float A[n] and float B[n] are aligned on a 16-byte boundary for SSE2 and on a 64-byte boundary for
IMCI, and that n is a multiple of 4 for SSE2 and a multiple of 16 for IMCI. Variables Avec and Bvec are 128 bits
(4 × sizeof(float) bytes) in size for SSE2 and 512 bits (16 × sizeof(float) bytes) for the Intel Xeon Phi architecture.

The SSE2 code in Listing 3.7 will run only on Intel Xeon processors, and the IMCI code will run
only on Intel Xeon Phi coprocessors. The necessity to maintain a separate version of a SIMD code for each
target instruction set is generally undesirable; however, it cannot be avoided when code is vectorized with
intrinsics. A better approach to expressing SIMD parallelism is using the Intel Cilk Plus extensions for array
notation (see Section 3.1.7) or auto-vectorizable C loops and vectorization pragmas (see Section 3.1.6 through
Section 3.1.10).
Note that switching between different instruction sets in a code employing SIMD intrinsics should be
done with care. In some cases, in order to switch between different instruction sets supported by a processor,
registers have to be set to a certain state to avoid a performance penalty. See the Intel Compilers Reference [20]
for details.

Class Libraries
The C++ vector class library provided by the Intel Compilers ([21], [22]) defines short vectors as C++
classes, and operators acting on these vectors are defined in terms of SIMD instructions. A similar library was
recently released by Agner Fog [23]. Table 3.3 lists the header files that should be included in order to gain
access to the Intel C++ Class Library.



Instruction Set   Include header file

MMX               ivec.h
SSE               fvec.h
SSE2              dvec.h
AVX               TBA
IMCI              TBA

Table 3.3: Header files for the Intel C++ Class Library.

Codes in Listing 3.8 demonstrate how the C++ vector class library included with the Intel C++ compiler
can be used to execute the SIMD loop shown in Listing 3.1.

1 for (int i=0; i<n; i+=4) {               1 for (int i=0; i<n; i+=16) {
2   F32vec4 *Avec=(F32vec4*)(A+i);         2   F32vec16 *Avec=(F32vec16*)(A+i);
3   F32vec4 *Bvec=(F32vec4*)(B+i);         3   F32vec16 *Bvec=(F32vec16*)(B+i);
4   *Avec = *Avec + *Bvec;                 4   *Avec = *Avec + *Bvec;
5 }                                        5 }

Listing 3.8: Addition of two arrays using the Intel C++ vector class library with SSE2 (left) and IMCI instructions (right).
These codes assume that the arrays float A[n] and float B[n] are aligned on a 16-byte boundary for SSE2 and on a
64-byte boundary for the Intel Xeon Phi architecture, and that n is a multiple of 4 for SSE2 and a multiple of 16 for the
Intel Xeon Phi architecture.


3.1.6 Automatic Vectorization of Loops


In order to take advantage of SIMD instructions, the developer does not need to call them explicitly via
inline assembly or intrinsics. The alternative route is using the automatic vectorization feature of the Intel
compiler. This feature transforms scalar C/C++ loops into loops with short vectors and SIMD instructions
during code compilation. The benefits of using automatic vectorization instead of intrinsic functions or vector
class libraries are:
1) code portability (porting to a new instruction set is possible by re-compilation),
2) ease of development and, in some cases,
3) heuristic compile-time analysis and empirical run-time analysis of the profitability of various vectorization
paths implemented by the compiler.
A practical example of automatic vectorization of the loop from Listing 3.1 is shown in Listing 3.9. This
example, while trivial, demonstrates some of the important aspects of automatic vectorization.

1 #include <stdio.h>
2
3 int main(){
4   const int n=8;
5   int i;
6   __declspec(align(64)) int A[n];
7   __declspec(align(64)) int B[n];
8
9   // Initialization
10  for (i=0; i<n; i++)
11    A[i]=B[i]=i;
12
13  // This loop will be auto-vectorized
14  for (i=0; i<n; i++)
15    A[i]+=B[i];
16
17  // Output
18  for (i=0; i<n; i++)
19    printf("%2d %2d %2d\n", i, A[i], B[i]);
20 }

user@host% icpc autovec.c -vec-report3


autovec.c(10): (col. 3) remark: loop was not vectorized:
vectorization possible but seems inefficient.
autovec.c(14): (col. 3) remark: LOOP WAS VECTORIZED.
autovec.c(18): (col. 3) remark: loop was not vectorized:
existence of vector dependence.
user@host% ./a.out
0 0 0
1 2 1
2 4 2
3 6 3
4 8 4
5 10 5
6 12 6
7 14 7

Listing 3.9: The source code file autovec.c (top panel) illustrates a regular C++ code that will be auto-vectorized by
the compiler. The only step the developer had to make in this example is allocating the arrays on a 64-byte boundary. The
bottom panel shows the compilation and runtime output of the code.


Let us focus on the source code in Listing 3.9 first. Unlike codes in Listing 3.7 and Listing 3.8, the code
in Listing 3.9 is oblivious of the architecture that it is compiled for. Indeed, this code can be compiled and
auto-vectorized for Pentium 4 processors with SSE2 instructions as well as for Intel Xeon Phi coprocessors.
The only place where architecture is implicitly assumed is the alignment boundary value of 64. This value is
greater than the SSE requirement (16) and is chosen to satisfy the IMCI alignment requirements.
Now let us take a look at the compilation and runtime output of the code shown in Listing 3.9.
• The code was compiled with the argument -vec-report3, which forces the compiler to print some
of the automatic vectorization status information.
• No special optimization arguments were used. Automatic vectorization is enabled for optimization level
-O2, which is the default optimization level, and higher.
• The first line of the compiler output indicates that the initialization loop in line 10 of the source code
was not vectorized: “vectorization possible but seems inefficient”. This happened because the array size
is known at compile time and is very small. The heuristics analyzed by the auto-vectorizer suggest that
the vectorization overhead is not going to be beneficial for performance.

• The second line of the compiler output reads in capitals: “LOOP WAS VECTORIZED”. This is the
expected result, and it applies to the loop in line 14 that performs addition.

• The third line indicates that the loop in line 18 was not vectorized because of the “existence of vector
dependence”: the printf statement in that loop cannot be expressed via vector instructions.

• Code output following the compilation report shows that the code works as expected.

As proof that this C code is indeed a portable solution, one can compile it for native execution on
Intel Xeon Phi coprocessors. Listing 3.10 illustrates the result, which is self-explanatory.

user@host% icpc autovec.c -vec-report3 -mmic
autovec.c(10): (col. 3) remark: LOOP WAS VECTORIZED.
autovec.c(14): (col. 3) remark: LOOP WAS VECTORIZED.
autovec.c(18): (col. 3) remark: loop was not vectorized:
existence of vector dependence.

Listing 3.10: Compilation and runtime output of the code in Listing 3.9 for Intel Xeon Phi execution

In addition to portability across architectures, reliance on automatic vectorization provides other benefits.
For instance, auto-vectorizable code may release the programmer from the requirement that the number of
iterations should be a multiple of the number of data elements in the vector register. Indeed, the compiler will
peel off the last few iterations if necessary, and perform them with scalar instructions. It is also possible to
automatically vectorize loops working with data that are not aligned on a proper boundary. In this case, the
compiler will generate code to check the data alignment at runtime and, if necessary, peel off a few iterations
at the start of the loop in order to perform the bulk of the calculations with fast aligned instructions.
Generally, the only type of loop that the compiler will auto-vectorize is a for-loop with the number of
iterations known at runtime or, better yet, at compile time. Memory access in the loop must have a regular
pattern, ideally with unit stride.
Non-standard loops that cannot be automatically vectorized include: loops with irregular memory access
pattern, calculations with vector dependence, while-loops or for-loops in which the number of iterations
cannot be determined at the start of the loop, outer loops, loops with complex branches (i.e., if-conditions),
and anything else that cannot be, or is very difficult to vectorize.
Further information on automatic vectorization of loops can be found in Section 3.1.10 and Section 4.3,
and in the Intel C++ compiler reference [24].


3.1.7 Extensions for Array Notation in Intel Cilk Plus
Automatic vectorization in the Intel C++ compiler is not limited to loops. The Intel Cilk Plus extension
provides additional tools that enable the programmer to indicate data parallelism so that the compiler can
automatically vectorize with SIMD operations.
Array notation is a method for specifying slices of arrays or whole arrays, and applying element-wise
operations to arrays of the same shape. The Intel C++ Compiler implements these operations using vector
code, mapping data-parallel constructs to the SIMD hardware.
In the example code in Listing 3.9, the addition loop in lines 14-15 can be replaced with the code shown
in Listing 3.11. When this code is compiled with the Intel C++ Compiler, the addition operation will be
automatically vectorized.

1 A[:] += B[:];

Listing 3.11: Intel Cilk Plus extensions for array notation example: to all elements of array A, add elements of array B.

It is also possible to specify a slice of arrays:

1 A[0:16] += B[32:16];

Listing 3.12: Intel Cilk Plus extensions for array notation example: to sixteen elements of array A (0 through 15) add sixteen elements of array B (32 through 47).

And it is possible to indicate a stride:



1 A[0:16:2] += B[32:16:4];

Listing 3.13: Intel Cilk Plus extensions for array notation example: to sixteen elements of array A (0, 2, 4, ..., 30) add sixteen elements of array B (32, 36, 40, ..., 92).



The Intel Cilk Plus extensions are enabled by default, and therefore no additional modifications of the
code or compiler arguments are necessary. However, in order to enable compilation with non-Intel compilers,
the programmer must protect the expressions with array notation with preprocessor directives and provide an
alternative implementation of these expressions with loops that can be understood by other compilers. See
Listing 3.14 for an example.

1 #ifdef __INTEL_COMPILER
2 A[:] += B[:];
3 #else
4 for (int i = 0; i < 16; i++)
5 A[i] += B[i];
6 #endif

Listing 3.14: Protecting Intel Cilk Plus array notation in order to enable compilation with non-Intel compilers.

Array notation extensions also work with multidimensional arrays. Refer to the Intel C++ Compiler
documentation for more details on Intel Cilk Plus [25] and the array notation extensions of this library [26].
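As a brief sketch of the multidimensional case (the array shapes below are illustrative and not taken from the text), a section can be specified in each dimension of a two-dimensional array:

float A[64][64], B[64][64];
// Add the top-left 16x16 block of B to the corresponding block of A.
A[0:16][0:16] += B[0:16][0:16];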


3.1.8 Elemental Functions


Elemental functions are an additional method available with the Intel C++ Compiler to enforce code
vectorization. They are written as regular C/C++ functions, operating only on scalar numbers with scalar
syntax. In the code, elemental functions can be called to operate in a data-parallel context, and the compiler
will automatically implement vectorization where possible. They can also be used in data- and thread-parallel
contexts with automatic parallelization across multiple threads. Every elemental function must be pure, i.e.,
without side-effects. In particular, elemental functions must not modify global data that other instances of
that function depend on. Listing 3.15 demonstrates the addition of two arrays where the addition operation is
executed in a regular (not elemental) function.

1 float my_simple_add(float x1, float x2){
2   return x1 + x2;
3 }
4
5 // ...in a separate source file:
6 for (int i = 0; i < N; ++i) {
7   output[i] = my_simple_add(inputa[i], inputb[i]);
8 }

Listing 3.15: Scalar function for addition in C.
If the code of the function and the call to the function are located in the same file, the compiler may
perform inter-procedural optimization (inline the function) and vectorize this loop. However, what if the
function is a part of a library? That would make it impossible for the compiler to inline the function code to
replace scalar addition with SIMD operations in that case. The solution to this situation is offered by elemental
functions. In order to declare my_simple_add as an elemental function, __attribute__((vector))
must be added to the function declaration. And in order to force the vectorization of the loop using this
function, #pragma simd must be used. Listing 3.16 demonstrates this method.
yP
el

1 __attribute__((vector)) float my_simple_add(float x1, float x2) {
2   return x1 + x2;
3 }
4 #pragma simd
5 for (int i = 0; i < N; ++i) {
6   output[i] = my_simple_add(inputa[i], inputb[i]);
7 }

Listing 3.16: Vectorized function for addition in Intel Cilk Plus.

The usage of elemental functions may be combined with array notation as shown in Listing 3.17.

1 __attribute__((vector)) float my_simple_add(float x1, float x2){
2   return x1 + x2;
3 }
4 my_simple_add(inputa[:], inputb[:]);

Listing 3.17: Vectorized function for addition in Intel Cilk Plus.

For more information on elemental functions in Intel Cilk Plus, refer to the Intel C++ Compiler
documentation [27].


3.1.9 Assumed Vector Dependence. The restrict Keyword.


True vector dependence, such as in the code in Listing 3.18, makes it impossible to vectorize loops
manually or automatically.

1 float* a;
2 // ...
3 for (int i = 1; i < n; i++)
4 a[i] += b[i]*a[i-1];

Listing 3.18: Vector dependence makes the vectorization of this loop impossible.

However, in some cases the compiler may not have sufficient information in order to determine whether
a true vector dependence is present in the loop. Such cases are referred to as assumed vector dependence.

Assumed Vector Dependence Example

Code in Listing 3.19 shows a case where it is impossible to determine whether a vector dependence exists.

If pointers a and b point to distinct, non-overlapping memory segments, then there is no vector dependence.
However, there is a possibility that the user will pass to the function a and b pointing to overlapping memory
addresses (e.g., a==b+1), in which case vector dependence will exist.

1 void mycopy(int n, float* a, float* b) {
2   for (int i = 0; i < n; i++)
3     a[i] = b[i];
4 }

Listing 3.19: Vector dependence may occur if memory regions referred to by a and b overlap. Intel Compilers may refuse to vectorize this loop due to assumed vector dependence.



In order to illustrate what happens in situations with assumed vector dependence, we place the code from
Listing 3.19 into file vdep.cc and compile it. The compiler output shown in Listing 3.20 reports that the
loop is not vectorized. The reason is that an assumed vector dependence has been found.

user@host% icpc -vec-report3 -c vdep.cc


vdep.cc(2): (col. 3) remark: loop skipped: multiversioned.
vdep.cc(2): (col. 3) remark: loop was not vectorized: not inner loop.

Listing 3.20: Compiler argument -vec-report3 prints diagnostic information about automatic vectorization.

Ignoring Assumed Vector Dependence

In cases when the developer knows that there will not be a true vector dependence situation, it is possible
to instruct the compiler to ignore assumed vector dependencies found in a loop. This can be done with
#pragma ivdep, as shown in Listing 3.21.


1 void mycopy(int n, float* a, float* b) {
2 #pragma ivdep
3   for (int i = 0; i < n; i++)
4     a[i] = b[i];
5 }

Listing 3.21: The pragma before this loop instructs the compiler to ignore assumed vector dependence.

Listing 3.22 shows the compilation output. This time, automatic vectorization has succeeded.

user@host% icpc -vec-report3 -c vdep.cc


vdep.cc(3): (col. 3) remark: LOOP WAS VECTORIZED.
vdep.cc(3): (col. 3) remark: loop was not vectorized: not inner loop.

Listing 3.22: Automatic vectorization succeeds thanks to #pragma ivdep.

It must be noted that if the function compiled in this way is called with overlapping arrays a and b (i.e.,
with true vector dependence), the code may produce incorrect results or crash.

Pointer Disambiguation

A more fine-grained method to disambiguate the possibility of vector dependence is the restrict
keyword. This keyword applies to each pointer variable qualified with it, and instructs the compiler that
the object accessed by the pointer is only accessed by that pointer in the given scope. In order to enable
the keyword restrict, the compiler argument -restrict must be used. An example of the usage of
keyword restrict is shown in Listing 3.23. This time, automatic vectorization has succeeded as well. Note
that the compiler was given the argument -restrict, without which compilation would have failed.

1 void mycopy(int n, float* restrict a, float* restrict b) {
2   for (int i = 0; i < n; i++)
3     a[i] = b[i];
4 }

user@host% icpc -vec-report3 -restrict -c vdep.cc


vdep.cc(2): (col. 3) remark: LOOP WAS VECTORIZED.
vdep.cc(2): (col. 3) remark: loop was not vectorized: not inner loop.

Listing 3.23: Automatic vectorization succeeds thanks to the restrict keyword.

Hint: sometimes it may be desirable to disable the restrict keyword. In order to avoid editing code to do
that, it is useful to define a compiler macro RESTRICT and set it to the value “restrict” or to an empty
value, depending on the purpose. In the code, the macro RESTRICT should be used instead of the word
restrict. This is illustrated in Listing 3.24.


1 void mycopy(int n, float* RESTRICT a, float* RESTRICT b) {
2   for (int i = 0; i < n; i++)
3     a[i] = b[i];
4 }

user@host% icpc -c vdep.cc -vec-report3 -DRESTRICT=restrict -restrict


vdep.cc(2): (col. 3) remark: LOOP WAS VECTORIZED.
vdep.cc(2): (col. 3) remark: loop was not vectorized: not inner loop.
user@host% icpc -c vdep.cc -vec-report3 -DRESTRICT=
vdep.cc(2): (col. 3) remark: loop skipped: multiversioned.
vdep.cc(2): (col. 3) remark: loop was not vectorized: not inner loop.

Listing 3.24: Using the macro RESTRICT to toggle pointer disambiguation.

3.1.10 Summary of Vectorization Pragmas, Keywords and Compiler Arguments.

The following list contains some compiler pragmas that may be useful for tuning vectorized code performance. Details can be found in the Intel C++ compiler reference [28]. In the PDF version of this document, the items in the list below are hyperlinks pointing to the corresponding articles in the compiler reference. A short code sketch combining several of these pragmas follows the list.

• #pragma simd
  Used to guide the compiler to automatically vectorize more loops (e.g., some outer loops). Arguments of this pragma can guide the compiler in cases when automatic vectorization is difficult. It is also possible to improve vectorization efficiency by specifying expected runtime parameters of the loop in the arguments of #pragma simd.

• #pragma vector always
  Instructs the compiler to implement automatic vectorization of the loop following this pragma, even if heuristic analysis suggests otherwise, or if non-unit stride or unaligned accesses make vectorization inefficient.

• #pragma vector aligned | unaligned
  Instructs the compiler to always use aligned or unaligned data movement instructions. Useful, for instance, when the developer guarantees data alignment. In this case, placing #pragma vector aligned before the loop eliminates unnecessary run-time checks for data alignment, which improves performance.

• #pragma vector nontemporal | temporal
  Instructs the compiler to use non-temporal (i.e., streaming) or temporal (i.e., non-streaming) stores. Can be useful when the result of a vector operation is not used down the line in the automatically vectorized loop. In this case, placing #pragma vector nontemporal will force the code to send the result of each vector instruction directly to RAM instead of placing it in cache. This may improve performance, as the data sent to RAM will not contaminate the valuable processor cache, leaving more room for frequently re-used data.
• #pragma novector
  Instructs the compiler not to vectorize the loop following this pragma. Can be used for convenience: the developer may place this pragma before a non-vectorizable loop in order to keep the vectorization report cleaner. In some cases, #pragma novector can improve performance: for example, when a loop contains a calculation-heavy branch that is rarely taken. If such a loop is auto-vectorized, the branch will always be evaluated, and iterations in which the branch should not have been taken will be masked out from the result. However, #pragma novector can turn such a loop into a scalar loop, in which only the taken branches are evaluated.


• #pragma ivdep
Instructs the compiler to ignore vector dependence, which increases the likelihood of automatic loop
vectorization. See Section 3.1.9 for more information. The keyword restrict can often help to
achieve a similar result.
• restrict qualifier and -restrict command-line argument
  This keyword qualifies a pointer as restricted, i.e., the developer using the restrict keyword guarantees to the compiler that in the scope of this pointer's visibility, it points to data which is not referenced by any other pointer. Qualifying function arguments with the restrict keyword helps in the elimination of assumed vector dependencies. The restrict keyword in the code must be enabled by the -restrict compiler argument. See Section 3.1.9 for more detail.
• #pragma loop count
Informs the compiler of the number of loop iterations anticipated at runtime. This helps the auto-
vectorizer to make more accurate predictions regarding the optimal vectorization strategy.
• __assume_aligned keyword
  Helps to eliminate runtime alignment checks when data is guaranteed to be properly aligned. This keyword produces an effect similar to that of #pragma vector aligned, but provides more granular control, as __assume_aligned applies to an individual array that participates in the calculation, and not to the whole loop.
• -vec-report[n]
  This compiler argument indicates the level of verbosity of the automatic vectorizer. -vec-report3 provides the most verbose report, including vectorized and non-vectorized loops and any proven or assumed data dependencies.


• -O[n]
  Optimization level, defaults to -O2. Automatic vectorization is enabled with -O2 and higher optimization levels.
• -x[code]
  Instructs the compiler to target specific processor features, including instruction sets and optimizations. For example, to generate AVX code, -xAVX can be used; for SSE2, -xSSE2. Using -xhost targets the architecture of the system on which the compilation is performed.
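
The sketch below is a hypothetical fragment (the function and array names are assumptions, not taken from the compiler reference) showing how several of these hints can be combined when the developer can guarantee 64-byte aligned, non-overlapping arrays:

// Hypothetical example combining restrict, #pragma simd and #pragma vector aligned.
// Compile with, e.g.: icpc -restrict -vec-report3 -c scale_add.cc
void scale_add(const int n, float* restrict a, const float* restrict b) {
  // The developer guarantees that a and b are 64-byte aligned and do not overlap.
#pragma simd
#pragma vector aligned
  for (int i = 0; i < n; i++)
    a[i] += 2.0f*b[i];
}

As with #pragma ivdep, these hints are promises to the compiler: if the arrays are in fact unaligned or overlapping, the program may crash or produce incorrect results.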



3.1.11 Exclusive Features of the IMCI Instruction Set


One of the strongest features of the Intel Xeon Phi architecture as a computing platform is that it is
possible to achieve high performance without low-level optimizations, intrinsics or assembly code in user
applications. Indeed, automatic vectorization and support for parallel libraries make it possible to write a
single code that runs efficiently on both Intel Xeon processors and Intel Xeon Phi coprocessors. This code
may also be able to scale to future generations of the Intel Xeon product family. For the programmer that
takes advantage of this portability feature by relying on automatic vectorization, it is not necessary to know
every detail of the instruction set. However, understanding the type of instructions that the compiler uses
to automatically vectorize C, C++ or Fortran codes allows the programmer to design algorithms and data
structures in a way that is most efficient for automatic vectorization.
Intel Initial Many Core Instructions (Intel IMCI) is the instruction set supported by Intel Xeon Phi
coprocessors. In terms of functionality, Intel IMCI can be considered a superset of the SSE 4.2 and AVX instructions supported by Intel Xeon processors; however, it is important to realize that Intel Xeon Phi coprocessors do not directly support the SSE or AVX instructions themselves.
The instructions of Intel IMCI operate on special 512-bit registers, which can pack up to eight 64-bit elements (long integers or double precision floating-point numbers) or up to sixteen 32-bit elements (integers or single precision floating-point numbers). For use with intrinsic functions, these registers are represented by three data types declared in the header file immintrin.h: __m512 (single precision floating-point vector), __m512i (32- or 64-bit integer vector) and __m512d (double precision floating-point vector). Most instructions operate on three arguments: either two source registers with a separate destination register, or three source registers, one of which is also a destination.
For each operation, two types of instructions are available: unmasked and masked. Unmasked instructions apply the requested operation to all elements of the vector registers. Masked instructions apply the operation to some of the elements and preserve the value of other elements in the output register. The set of elements that must be modified in the output registers is controlled by an additional argument of type __mmask16 or __mmask8. This is a short integer value, in which bits set to 1 or 0 indicate that the corresponding output elements should be modified or preserved by the masked operation using this bitmask.
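
A minimal sketch of a masked operation follows; the values and the mask pattern are illustrative assumptions, and the fragment compiles only for the coprocessor target.

#include <immintrin.h>

void masked_add_example() {
  __m512    a   = _mm512_set1_ps(1.0f);
  __m512    b   = _mm512_set1_ps(2.0f);
  __m512    src = _mm512_set1_ps(0.0f);
  // Bits set to 1 select the elements to be modified; 0x00FF selects
  // the low 8 of the 16 single precision elements.
  __mmask16 k   = _mm512_int2mask(0x00FF);
  // Selected elements receive a+b; the others keep the value of src.
  __m512    res = _mm512_mask_add_ps(src, k, a, b);
  (void)res;  // res now holds eight elements equal to 3.0f and eight equal to 0.0f
}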

The classes of available IMCI instructions are outlined in the list below, illustrated with calls to the respective intrinsic functions.

Initialization instructions are used to fill a 512-bit vector register with one or multiple values of scalar elements. Example:

1 __m512 myvec = _mm512_set1_ps(3.14f);

The above example creates a 512-bit short vector of sixteen SP floating-point numbers and initializes all sixteen elements to a value of 3.14f.

Load and store instructions copy a contiguous 512-bit chunk of data from a memory location to the vector register (load) or from the vector register to a memory location (store). The address from/to which the copying takes place must be 64-byte aligned. Additional versions of these instructions operate only on the high or low 64 bits of the vector. Example:

1 float myarr[128] __attribute__((aligned(64)));
2 myarr[:] = 1.0f;
3 __m512 myvec = _mm512_load_ps(&myarr[32]);

In this example, elements 32 through 47 of array myarr are loaded into the vector register assigned to variable myvec.


Gather and scatter instructions are used to copy non-contiguous data from memory to vector registers
(gather), or from vector registers to memory (scatter). This type of instructions is unique to the Intel
Xeon Phi architecture and is not available in Intel Xeon processors. The memory access pattern must
have a power of 2 stride (1, 2, 4, 8, . . . elements). The copying of data can be done simultaneously
with type conversion. It is also possible to specify prefetching from memory to cache for this type of
operation. Example:

1 __m512i myvec = _mm512_set1_epi32(-1);
2 float myarr[128] __attribute__((aligned(64)));
3 _mm512_i32scatter_ps(myvec, &myarr[0], 4);

The above code scatters the values in the integer short vector myvec to array myarr starting with index 0 and with a stride of 4. That is, elements 0, 1, 2, ..., 15 of the short vector myvec will be copied to array elements myarr[0], myarr[4], myarr[8], ..., myarr[60], respectively.

Arithmetic instructions are the core of high performance calculations. The list below illustrates the scope of these instructions.

a) Addition, subtraction and multiplication are available for all data types supported in the IMCI. It is possible to specify the rounding method for floating-point operations. Example:

1 __m512 c = _mm512_mul_ps(a, b);

b) Fused Multiply-Add instruction (FMA) is the basis of several operations in linear algebra, including xAXPY and dot-product calculations. These instructions perform element-wise multiplication of vectors v1 and v2 and add the result to vector v3. The FMA instruction is currently only supported by the Intel Xeon Phi architecture, and there is no FMA support in today's Intel Xeon processors. The latency and throughput of FMA are comparable to those of an individual addition or an individual multiplication instruction, and therefore it is always preferable to use FMA instead of separate addition and multiplication where possible. It is possible to specify the rounding method for floating-point operations. Example:

1 _mm512_fmadd_ps(v1, v2, v3);

c) Division and transcendental function implementations are available in the Intel Short Vector Math
Library (SVML). The following transcendental operations are supported:
- Division and reciprocal calculation;
- Error function;
- Inverse error function;
- Exponential functions (natural, base 2 and base 10) and the power function. Base 2 exponential
is the fastest implementation;
- Logarithms (natural, base 2 and base 10). Base 2 logarithm is the fastest implementation;
- Square root, inverse square root, hypotenuse value and cubic root;
- Trigonometric functions (sin, cos, tan, sinh, cosh, tanh, asin, acos, atan);
- Rounding functions
The following example calculates the reciprocal square root of each element of vector y:
1 __m512 x = _mm512_invsqrt_ps(y);


Swizzle and permute instructions rearrange (shuffle) scalar elements in a vector register. For these operations
it is convenient to think of a 512-bit register as a set of four 128-bit blocks. The swizzle operation
rearranges elements within each 128-bit block, and the permute operation rearranges the 128-bit
blocks in the register according to the pattern specified by the user. These instructions can be used in
combination with another intrinsic, which saves processor cycles. Example:

1 __m512 myv1, myv2;
2 // ...
3 _mm512_add_ps(myv1, _mm512_swizzle_ps(myv2, _MM_SWIZ_REG_DCAB));

In this example, the swizzle operation with the pattern DCAB is applied to the 512-bit SP floating-point
vector myv2, and then this vector, swizzled, is added to another vector of the same type, myv1.

Comparison instructions perform element-wise comparison between two 512-bit vectors and return a bit-
mask value with bits set to 0 or 1 depending on the result of the respective comparison. Example:

1 __mmask16 result = _mm512_cmp_ps_mask(x, y, _MM_CMPINT_LT);

The above code compares vectors x and y and returns the bitmask result where bits are set to 1 if the corresponding element in x is less than the corresponding element in y.

Conversion and type cast instructions perform conversion from single to double precision and from double to single precision floating-point numbers, from floating-point numbers to integers and from integers to floating-point numbers.

Bitwise instructions perform bit-wise AND, ANDNOT, OR and XOR operations on elements in 512-bit short vectors.

Reduction and minimum/maximum instructions allow the calculation of the sum of all elements in a vector, the product of all elements in a vector, and the evaluation of the minimum or maximum of all elements in a vector. These instructions are exclusive to the Intel IMCI.
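
For illustration, a horizontal sum and maximum can be computed with the reduction intrinsics; the vector contents below are an assumed example.

__m512 v   = _mm512_set1_ps(2.0f);     // sixteen single precision elements, all 2.0f
float  sum = _mm512_reduce_add_ps(v);  // 32.0f: sum of all sixteen elements
float  mx  = _mm512_reduce_max_ps(v);  // 2.0f: maximum over all elements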


Vector mask instructions make it possible to set the values of type __mmask16 and __mmask8 and to perform bitwise operations on them. Masks can be used in all IMCI instructions to control which of the elements in the resulting vector are modified, and which are preserved in an operation. Bitmasked operations are an exclusive feature of the Intel IMCI.


Miscellaneous instructions are available for decomposing floating-point numbers into the mantissa and the
exponent, fixing NaNs and performing low-precision transcendental operations.
Scalar instructions are available for bit counting, cache eviction and thread delay.


3.2 Task Parallelism in Shared Memory


The contents of this section are focused on multi-threaded programming (i.e., task parallelism) using
the Intel Cilk Plus and Intel Open Multi-Processing (OpenMP) parallel libraries. This section introduces the
API and programming paradigms of multi-threaded codes. For optimization advice in shared-memory parallel
applications, refer to Section 4.4.
While POSIX threads, also known as Pthreads, are sometimes used to parallelize applications, the
Pthreads standard does not contain HPC-specific features such as workload balancing, processor affinity,
reducers, etc. Computationally intensive algorithms are usually better implemented using one of the specialized
standards for building thread-parallel applications, such as OpenMP or Intel Cilk Plus.
This training will present OpenMP and Cilk Plus side by side, and we will leave it to the programmer to
make the decision regarding which standard to use.

3.2.1 About OpenMP and Intel Cilk Plus

OpenMP and Cilk Plus have the same scope of application to parallel algorithms and similar functionality. The choice between OpenMP and Cilk Plus as the parallelization method may be dictated either by convenience, or by performance considerations. It is often easy enough to implement the code with both frameworks and compare the performance. In general, trivial and highly parallel algorithms should run equally well in either of these two parallel frameworks. For complex algorithms with nested parallelism and heterogeneous tasks,

• Intel Cilk Plus generally provides good performance “out of the box”, but offers little freedom for fine-tuning. With this framework, the programmer should focus on exposing the parallelism in the application rather than optimizing low-level aspects such as thread creation, work distribution and data sharing.

• OpenMP may require more tuning to perform well, however, it allows more control over scheduling and work distribution.

Additionally, Intel OpenMP and Intel Cilk Plus libraries can be used side by side in the same code without conflicts. In case of nested parallelism, it is preferable to use Cilk Plus parallel regions inside OpenMP parallel regions, and not the other way around.

Program Structure in OpenMP


OpenMP is a traditional, well-established cross-platform standard with which many high performance
application developers are familiar. It provides high-level abstraction for task parallelism, and eliminates the
low-level details of iteration space partitioning, data sharing, and thread creation, scheduling, and synchro-
nization. In order to parallelize an application with OpenMP, the programmer supplements the code with
OpenMP pragmas. These pragmas instruct OpenMP-aware compilers to produce parallel versions of the
respective statements and to bind to the OpenMP implementation. It is possible to disable OpenMP support in
the compiler, and the code with OpenMP pragmas will still compile. In this case the pragmas will be treated
as comments, and parallelization will not occur (i.e., the code will be serialized). The OpenMP standard,
however, does not guarantee that the results of the application will be the same with and without OpenMP.
A program with OpenMP directives begins execution as a single thread, called the initial thread of
execution. It is executed sequentially until the first parallel construct is encountered. After that the initial
thread creates a team of threads to be executed in parallel, and becomes the master of this team. All program
statements enclosed by the parallel construct are executed in parallel by each thread in the team, including all
routines called from within the enclosed statements. At the end of the parallel construct each thread waits for
others to arrive. When that happens the team is dissolved, and only the master thread continues execution of


the code following the parallel construct. The other threads in the team enter a wait state until they are needed
to form another team.
Listing 3.25 illustrates the structure of applications with OpenMP constructs; the comments explain each construct or section.

 1 main() {                   // Begin serial execution.
 2   ...                      // Only the initial thread executes
 3   #pragma omp parallel     // Begin a parallel construct and form
 4   {                        // a team.
 5     #pragma omp sections   // Begin a work-sharing construct.
 6     {
 7       #pragma omp section  // One unit of work.
 8       {...}
 9       #pragma omp section  // Another unit of work.
10       {...}
11     }                      // Wait until both units of work complete.
12     ...                    // This code is executed by each team member.
13     #pragma omp for nowait // Begin a work-sharing construct.
14     for(...)
15     {                      // Each iteration chunk is a unit of work.
16       ...                  // Work is distributed among the team members.
17     }                      // End of work-sharing construct.
18     ...                    // nowait was specified so threads proceed.
19     #pragma omp critical   // Begin a critical section.
20     {...}                  // Only one thread executes at a time.
21     #pragma omp task       // Execute in another thread without blocking this thread.
22     {...}
23     ...                    // This code is executed by each team member.
24     #pragma omp barrier    // Wait for all team members to arrive.
25     ...                    // This code is executed by each team member.
26   }                        // End of parallel construct.
27                            // Disband team and continue serial execution.
28   ...                      // Possibly more parallel constructs.
29 }                          // End serial execution.

Listing 3.25: The following example illustrates the execution model of an application with OpenMP constructs. Credit: Intel C++ Compiler XE 13.0 User and Reference Guides.

The principle illustrated by this listing is summarized by the following rules:

1. Code outside #pragma omp parallel is serial, i.e., executed by only one thread

2. Code directly inside #pragma omp parallel is executed by each thread of the team

3. Code inside work-sharing construct #pragma omp for is distributed across the threads in the team

In order to compile a C++ program with OpenMP pragmas using the Intel C++ Compiler the programmer
must specify the compiler argument -openmp. Without this argument, the code will still compile, but all
code will be executed with only one thread. In order to make certain functions and variables of the OpenMP
library available, #include <omp.h> must be used at the beginning of the code.

Program Structure with Intel Cilk Plus
Intel Cilk Plus is an emerging standard currently supported by GCC 4.7 and the Intel C++ Compiler.
Its functionality and scope of application are similar to those of OpenMP. There are only three keywords in


the Cilk Plus standard: _Cilk_for, _Cilk_spawn, and _Cilk_sync. Programming for Intel Xeon Phi
coprocessors may also require keywords _Cilk_shared and _Cilk_offload. However, these keywords
make it possible to implement a variety of parallel algorithms. Language extensions such as array notation, hyperobjects, elemental functions and #pragma simd are also a part of Intel Cilk Plus. Unlike OpenMP, the Cilk Plus
standard guarantees that serialized code will produce the same results as parallel code, if the program has a
deterministic behavior. Last, but not least, Intel Cilk Plus is designed to seamlessly integrate vectorization and
thread-parallelism in applications using this framework.

 1 void foo() {
 2   ...                      // Executed by a single worker
 3   _Cilk_spawn foo(...) {   // Nested parallelism:
 4     ...                    // Executed by a separate worker without blocking this function
 5   }
 6   _Cilk_sync;              // Wait for all tasks spawned from this function to complete
 7 }
 8
 9 void bar() {
10   _Cilk_for(...) {         // May be nested inside another parallel region
11     ...                    // Distribute workload across all available workers
12   }
13 }
14
15 main() {                   // Begin serial execution.
16   ...                      // Only one worker executes
17   _Cilk_spawn foo();       // Execute foo() without blocking current scope
18   _Cilk_for(...) {         // Share work between available workers
19     ...
20   }
21   _Cilk_sync;              // Wait until all jobs spawned from this function complete
22   _Cilk_offload bar();     // Offload function to coprocessor

Listing 3.26: Execution model of an Intel Cilk Plus application.


In order to make certain functions of Intel Cilk Plus available, the programmer must use #include <cilk/cilk.h>.
The Intel Cilk Plus keywords and semantics preserve the serial nature of codes. The lack
of locks in the code is compensated by the availability of hyperobjects, which facilitate and motivate more
scalable parallel algorithms.
Intel Cilk Plus uses an efficient scheduling algorithm based on “work stealing”, which may be more
efficient than OpenMP in complex multi-program applications.


3.2.2 “Hello World” OpenMP and Intel Cilk Plus Programs
A sample OpenMP program and its compilation are shown in Listing 3.27.

1 #include <omp.h>
2 #include <stdio.h>
3
4 int main(){
5 const int nt=omp_get_max_threads();
6 printf("OpenMP with %d threads\n", nt);
7
8 #pragma omp parallel
9 printf("Hello World from thread %d\n", omp_get_thread_num());
10 }

user@host$ export OMP_NUM_THREADS=5
user@host$ icpc -openmp hello_omp.cc
user@host$ ./a.out
OpenMP with 5 threads
Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
Hello World from thread 2
Hello World from thread 4
user@host$
user@host$ icpc -openmp-stubs hello_omp.cc
hello_omp.cc(8): warning #161: unrecognized #pragma
  #pragma omp parallel
  ^
user@host$ ./a.out
OpenMP with 1 threads
Hello World from thread 0
user@host$

Listing 3.27:
Top: Hello World program in OpenMP. Note the inclusion of the header file omp.h. Parallel execution is requested via #pragma omp parallel.
Bottom: Compiling the Hello World program with OpenMP. Intel Compilers flag -openmp links the OpenMP runtime library for parallel execution, -openmp-stubs serializes the program. Environment variable OMP_NUM_THREADS controls the default number of threads spawned by #pragma omp parallel. By default, the number of threads is set to the number of cores (or hyper-threads) in the system.


A sample Intel Cilk Plus program and its compilation are shown in Listing 3.28.

1 #include <cilk/cilk.h>
2 #include <stdio.h>
3
4 int main(){
5 const int nw=__cilkrts_get_nworkers();
6 printf("Cilk Plus with %d workers.\n", nw);
7
8 _Cilk_for (int i=0; i<nw; i++) // Light workload: gets serialized
9 printf("Hello World from worker %d\n", __cilkrts_get_worker_number());
10
11 _Cilk_for (int i=0; i<nw; i++) {
12 double f=1.0;
13 while (f<1.0e40) f*=2.0; // Extra workload: gets parallelized
14 printf("Hello Again from worker %d (%f)\n", __cilkrts_get_worker_number(), f);
15 }
16 }

user@host$ export CILK_NWORKERS=5
user@host$ icpc hello_cilk.cc
user@host$ ./a.out
Cilk Plus with 5 workers.
Hello World from worker 0
Hello World from worker 0
Hello World from worker 0
Hello World from worker 0
Hello World from worker 0
Hello Again from worker 0 (10889035741470030830827987437816582766592.000000)
Hello Again from worker 0 (10889035741470030830827987437816582766592.000000)
Hello Again from worker 1 (10889035741470030830827987437816582766592.000000)
Hello Again from worker 3 (10889035741470030830827987437816582766592.000000)
Hello Again from worker 0 (10889035741470030830827987437816582766592.000000)
user@host$
user@host$ icpc -cilk-serialize hello_cilk.cc
user@host$ ./a.out
Cilk Plus with 5 workers.
Hello World from worker 0
Hello World from worker 0
Hello World from worker 0
Hello World from worker 0
Hello World from worker 0
Hello Again from worker 0 (10889035741470030830827987437816582766592.000000)
Hello Again from worker 0 (10889035741470030830827987437816582766592.000000)
Hello Again from worker 0 (10889035741470030830827987437816582766592.000000)
Hello Again from worker 0 (10889035741470030830827987437816582766592.000000)
Hello Again from worker 0 (10889035741470030830827987437816582766592.000000)
user@host$

Listing 3.28:
Top: Hello World program in Intel Cilk Plus. Note the inclusion of the header file cilk.h to enable Intel Cilk Plus
constructs. Two parallel loops are included to demonstrate dynamic (i.e., determined at runtime) scheduling of loop
iterations.
Bottom: Compiling the Hello World program with Intel Cilk Plus. No compiler flags are necessary to enable Intel Cilk Plus;
however, the flag -cilk-serialize can be used to disable parallelism in Intel Cilk Plus constructs. Environment
variable CILK_NWORKERS controls the default number of Intel Cilk Plus workers.


3.2.3 Loop-Centric Parallelism: For-Loops in OpenMP and Intel Cilk Plus
A significant number of HPC tasks are centered around for-loops with pre-determined loop bounds and a
constant increment of the loop iterator. Such loops can be easily parallelized in shared-memory systems using
#pragma omp parallel for in OpenMP or the statement _Cilk_for in Intel Cilk Plus. Additional
arguments control how loop iterations are distributed across available threads or workers.
Figure 3.1 illustrates the workflow of a loop parallelized in shared memory using OpenMP or Intel Cilk
Plus.

[Figure: diagram of loop iterations being distributed across threads or workers along the program flow.]
Figure 3.1: Parallelizing a for-loop with OpenMP or Intel Cilk Plus.

As the above figure illustrates, the execution of a parallel loop is initiated by a single thread. When the
loop starts, multiple threads (in the case of OpenMP) or workers (for Intel Cilk Plus) are spawned, and each
thread gets a portion of the loop iteration space (called “chunk” in the terminology of OpenMP) to process.
When a thread (with OpenMP) or worker (with Intel Cilk Plus) has completed its initial task, it receives from
the scheduler (with OpenMP) or steals from another worker (with Intel Cilk Plus) another chunk to process.
It is possible to instruct parallelization libraries to choose the chunk size dynamically, starting with large and
progressing to smaller chunks as the job is nearing completion. This way, load balance across threads or
workers is maintained without a significant scheduling overhead.
Code samples illustrating the usage of OpenMP and Intel Cilk Plus language constructs to parallelize
loops follow.


For-Loops in OpenMP
With OpenMP, #pragma omp parallel for must be placed before the loop to request its paral-
lelization, as shown in Listing 3.29.

1 #pragma omp parallel for


2 for (int i=0; i<n; i++) {
3 printf("Iteration %d is processed by thread %d\n", i, omp_get_thread_num());
4 // ... iterations will be distributed across available threads...
5 }

Listing 3.29: The OpenMP library will distribute the iterations of the loop following the #pragma omp parallel
for across threads.

Alternatively, it is possible to call a parallelized loop by placing #pragma omp for nested inside a
#pragma omp parallel construct, as demonstrated in Listing 3.30.

 1 #pragma omp parallel
 2 {
 3   // Code placed here will be executed by all threads.
 4   // Stack variables declared here will be private to each thread.
 5   int private_number=0;
 6   #pragma omp for
 7   for (int i=0; i<n; i++) {
 8     // ... iterations will be distributed across available threads...
 9   }
10   // ... code placed here will be executed by all threads
11 }

Listing 3.30: When placing #pragma omp for closely nested inside a #pragma omp parallel region, there should be no word “parallel” before the word “for”. Thread synchronization is implied at the end of the for-loop (unless the nowait clause is used).

Stack variables declared inside the parallel context or in the body of the loop will be available only on the thread processing these variables. Variables visible in the scope in which the loop is launched are available to all threads, and therefore must be protected from race conditions. In order to efficiently share data between loop iterations with OpenMP, the reduction clause or locks must be used, as described in Section 3.2.5.
If a parallel loop has fewer iterations than the number of available OpenMP threads, then all iterations
will start immediately with one iteration per thread. For parallel loops with more iterations than OpenMP
threads, the run-time library will divide the iterations between threads. In each thread, iterations assigned to it
will be executed sequentially, i.e., the number of simultaneously processed iterations will never be greater than
the number of threads. By default, OpenMP sets the maximum number of threads to be equal to the number of
logical cores in the system.
Depending on the scheduling mode requested by the user, iteration assignment to threads can be either
done before the start of the loop, or it can be decided dynamically. It is possible to tune the performance of
for-loops in OpenMP by specifying the scheduling mode (static, dynamic or guided) and the granularity of
work distribution, known as chunk size.
static : with this scheduling mode, OpenMP evenly distributes loop iterations across threads before the loop
begins. This scheduling method has the smallest parallelization overhead, because no communication
between threads is performed at runtime for scheduling purposes. The downside of this method is that it
may result in load imbalance, if threads complete their iterations at different rates.


dynamic : with this scheduling mode, OpenMP will distribute some fraction of the iteration space across
threads before the loop begins. As threads complete their iterations, they are assigned more work, until
all the work is completed. This method has a greater overhead, but may improve load balance across
threads.
guided : this method is similar to dynamic, except that the granularity of work assignment to threads
decreases as the work nears completion. This method has even greater overhead than dynamic, but
may result in higher overall performance due to better load balancing.

The chunk size controls the minimum number of iterations that are assigned to each thread at any given
scheduling step (except the last one). With small chunk size, dynamic and guided have the potential to
achieve better load balance at the cost of performing more scheduling work. With greater chunk size, the
scheduling overhead is reduced, but load imbalance may be increased. Typically, the optimal chunk size must
be chosen by the programmer empirically.
There are two ways to specify the scheduling method for a loop. The first method is to set the
environment variable OMP_SCHEDULE in order to control the execution of the whole application:

user@host% export OMP_SCHEDULE="dynamic,4"
user@host% ./my_application

Listing 3.31: Controlling run-time scheduling of parallel loops with an environment variable. The format of the value of OMP_SCHEDULE is “mode[,chunk_size]”, where mode is one of: static, dynamic, guided, and chunk_size is an integer.

The second is to indicate the scheduling method in the clauses of #pragma omp for. This method provides finer control, but less freedom to modify program behavior after compilation. Listing 3.32 illustrates that method:

1 #pragma omp parallel for schedule(dynamic, 4)
2 for (int i = 0; i < N; i++) {
3   // ...
4 }

Listing 3.32: Controlling the run-time scheduling of a parallel loop with clauses of #pragma omp for.




For-Loops in Intel Cilk Plus
In Intel Cilk Plus, a parallel for-loop is created as shown in Listing 3.33.

1 _Cilk_for (int i=0; i<n; i++) {


2 // ... iterations will be distributed across available threads...
3 printf("Iteration %d is processed by worker %d\n", i, __cilkrts_get_worker_number());
4 }

Listing 3.33: The Intel Cilk Plus library will distribute the iterations of the loop across the available workers.

Stack variables declared in the body of the loop will be available only on the worker processing these
variables. Variables visible in the scope in which the loop is launched are available to all strands, and therefore
must be protected from race conditions. In order to efficiently share data between Intel Cilk Plus workers,
hyperobjects must be used, as described in Section 3.2.5.
Just like with OpenMP, the run-time Intel Cilk Plus library will process loop iterations in parallel. The

g
an
total iteration space will be divided into chunks, each of which will be executed serially by one of the Intel

W
Cilk Plus workers. By default, the maximum number of workers in Intel Cilk Plus is equal to the number of
logical cores in the system. The number of workers actually used at runtime is dependent on the amount of
ng
work in the loop, and may be smaller than the maximum. This behavior is different from OpenMP, as OpenMP
e
nh
by default spawns a pre-determined number of threads, regardless of the amount of work in the loop.
Yu

Similarly to OpenMP, Intel Cilk Plus allows the user to control the work sharing algorithm in for-loops by
setting the granularity of work distribution. This is done with #pragma cilk grainsize, as illustrated
r
fo

in Listing 3.34
ed
ar

1 #pragma cilk grainsize = 4


p

2 _Cilk_for (int i = 0; i < N; i++) {


re

3 // ...
yP

4 }
el
iv

Listing 3.34: Controlling grain size in Intel Cilk Plus.


us
cl

The value of grainsize is the minimum number of iterations assigned to any worker in one scheduling
Ex

step. Like with OpenMP, the choice of grainsize is a compromise between load balancing and the overhead of
scheduling. The default value of grainsize chosen by Intel Cilk Plus works well enough in many cases.
Unlike OpenMP, Intel Cilk Plus has only one mode of scheduling, called work stealing. Work stealing,
depending on the nature of the calculation, may be more or less efficient than OpenMP scheduling methods.

Parallel Programming and Optimization with Intel Xeon Phi Coprocessors


3.2. TASK PARALLELISM IN SHARED MEMORY 103

3.2.4 Fork-Join Model of Parallel Execution: Tasks in OpenMP and Spawning in


Intel Cilk Plus
The fork-join model of parallel execution consists of creating a child task that can be executed in parallel
with the parent task (the ‘fork’ step), and running both tasks until a barrier is reached, which signals to
terminate the child task (the ‘join’ step). Child tasks can fork, too, creating a tree of tasks. This model enables
parallel algorithms that cannot be expressed with loop-centric parallelism, such as parallel recursion.
Implementations of the fork-join model in the OpenMP and Intel Cilk Plus libraries offer a significant
advantage over the fork() function in Pthreads. When a very large number of tasks is spawned, the OpenMP
and Intel Cilk Plus implementation schedule their execution in such a way that the hardware system is never
over-subscribed, whereas the Pthreads fork model does not guarantee that property.

[Figure: a tree of parallel tasks created by fork and join operations; legend: elemental function, fork, join.]

Figure 3.2: Fork-join model of shared-memory parallel execution.

Figure 3.2 illustrates the progress of a parallel code employing the fork-join model. Note that the number
of parallel tasks can far exceed the physical number of cores in the computing platform. The order of execution
of parallel tasks in available cores is generally not the same as the order in which the tasks were spawned.




Fork-Join Model in OpenMP: Tasks

A new feature of the OpenMP 3.0 standard, supported by the Intel OpenMP Library, is #pragma omp
task. This pragma makes it possible to create a task that is executed in parallel with the current scope. A very
large number of tasks can be spawned, however, they will not oversubscribe the system, because the runtime
library will start task execution only when a thread becomes available.
Listing 3.35 illustrates the usage of the OpenMP task pragma to create a parallel recursive algorithm.

1 #include <omp.h>

g
2 #include <stdio.h>

an
3

W
4 void Recurse(const int task) {
5 if (task < 10) {

ng
6 printf("Creating task %d...", task+1);
7 #pragma omp task e
nh
8 {
9 Recurse(task+1);
Yu

10 }
long foo=0; for (long i = 0; i < (1<<20); i++) foo+=i;
r

11
fo

12 printf("result of task %d in thread %d is %ld\n", task, omp_get_thread_num(), foo);


}
ed

13
14 }
ar

15
p

16 int main() {
re

17 #pragma omp parallel


yP

18 {
19 #pragma omp single
el

20 Recurse(0);
iv

21 }
us

22 }
cl
Ex

Listing 3.35: Source code omptask.cc demonstrating the use of #pragma omp task to effect the fork-join parallel
algorithm.

This code calls the function Recurse(), which forks off recursive calls to itself, requesting that those
recursive calls be run in parallel, without any synchronization of the caller function to its forks. The for-loop
in the code is used only to make the tasks perform some arithmetic work, so that we can see the pattern of task
creation and execution.
Note that #pragma omp task occurs inside of a parallel region, however, parallel execution is
initially restricted to only one thread with #pragma omp single. This is a necessary condition for
parallel tasking. Without #pragma omp parallel, all tasks will be executed by a single thread. Without
#pragma omp single, multiple threads will start task number 0, which is not the desired behavior.
Listing 3.36 demonstrates the execution pattern of this code with four threads.




user@host% icpc -openmp -o omptask omptask.cc


user@host% export OMP_NUM_THREADS=4
user@host% ./omptask
Creating task 1...Creating task 2...Creating task 3...Creating task 4...result of task 0
in thread 0 is 549755289600
result of task 1 in thread 2 is 549755289600
Creating task 5...Creating task 6...result of task 2 in thread 1 is 549755289600
Creating task 7...result of task 3 in thread 3 is 549755289600
Creating task 8...result of task 4 in thread 0 is 549755289600
Creating task 9...result of task 5 in thread 2 is 549755289600
Creating task 10...result of task 6 in thread 1 is 549755289600
result of task 7 in thread 3 is 549755289600
result of task 9 in thread 2 is 549755289600
result of task 8 in thread 0 is 549755289600

Listing 3.36: Compilation and running omptask.cc shown in Listing 3.35.

g
One can see that the code forked off as many jobs as there were available threads (in this case, four), and

an
the creation of other jobs had to wait until one of the threads became free.

W
It is also informative to see the difference between the parallel execution pattern and the serial execution.

ng
In order to run the code serially, we can set the maximum number of OpenMP threads to 1, as shown in

e
Listing 3.37
nh
Yu
user@host% export OMP_NUM_THREADS=1
r

user@host% ./omptask
fo

Creating task 1...Creating task 2...Creating task 3...Creating task 4...Creating task 5.
d

..Creating task 6...Creating task 7...Creating task 8...Creating task 9...Creating task
re

10...result of task 9 in thread 0 is 549755289600


pa

result of task 8 in thread 0 is 549755289600


result of task 7 in thread 0 is 549755289600
re

result of task 6 in thread 0 is 549755289600


yP

result of task 5 in thread 0 is 549755289600


result of task 4 in thread 0 is 549755289600
el

result of task 3 in thread 0 is 549755289600


iv

result of task 2 in thread 0 is 549755289600


us

result of task 1 in thread 0 is 549755289600


cl

result of task 0 in thread 0 is 549755289600


Ex

user@host%

Listing 3.37: Running omptask.cc from Listing 3.35 with a single OpenMP thread.

Evidently, in the serial version, the execution recursed into the deepest level before returning to the
calling function. This is the behavior that one would expect from this code if it was stripped of all OpenMP
pragmas.




Fork-Join Model in Intel Cilk Plus: Spawning
In Intel Cilk Plus, the fork-join model is effected by the keyword _Cilk_spawn. This keyword must
be placed before the function that is forked off, and the function will then be executed in parallel with the
current scope. Listing 3.38 demonstrates the same program as Listing 3.35, but now in the Intel Cilk Plus
framework.

1 #include <stdio.h>
2 #include <cilk/cilk.h>
3
4 void Recurse(const int task) {
5 if (task < 10) {
6 printf("Creating task %d...", task+1);
7 _Cilk_spawn Recurse(task+1);
8 long foo=0; for (long i = 0; i < (1L<<20L); i++) foo+=i;
9 printf("result of task %d in worker %d is %ld\n", task,
10 __cilkrts_get_worker_number(), foo);
11 }

g
an
12 }
13

W
14 int main() {

ng
15 Recurse(0);
16 } e
nh
Yu

Listing 3.38: Source code cilkspawn.cc demonstrating the use of _Cilk_spawn to effect the fork-join parallel
algorithm.
r
fo
ed

No additional compiler arguments are required to compile cilkspawn.cc. Listing 3.39 demonstrates
ar

compiling and running this code.


p
re
yP

user@host% icpc -o cilkspawn cilkspawn.cc


user@host% export CILK_NWORKERS=4
el

user@host% ./cilkspawn
iv

Creating task 1...Creating task 2...Creating task 3...Creating task 4...Creating task 5.
us

..Creating task 6...Creating task 7...Creating task 8...Creating task 9...Creating task
10...result of task 9 in worker 0 is 549755289600
cl

result of task 0 in worker 2 is 549755289600


Ex

result of task 8 in worker 0 is 549755289600


result of task 1 in worker 1 is 549755289600
result of task 2 in worker 3 is 549755289600
result of task 3 in worker 2 is 549755289600
result of task 7 in worker 0 is 549755289600
result of task 6 in worker 2 is 549755289600
result of task 5 in worker 3 is 549755289600
result of task 4 in worker 1 is 549755289600
user@host%

Listing 3.39: Compiling and running cilkspawn.cc from Listing 3.38 with four Intel Cilk Plus workers.

Unlike OpenMP code omptask.cc, this code parallelized with Intel Cilk Plus had spawned all tasks
and queued them for pick up by workers. After that, the code proceeded to run the tasks, as workers employed
work-stealing to balance the load.




3.2.5 Shared and Private Variables


Variable Sharing in OpenMP

In OpenMP parallel regions and loops, multiple threads have access to variables that had been declared
before the parallel region was started. Consider example in Listing 3.40.

1 #include <omp.h>
2 #include <stdio.h>
3
4 int main() {
5 int someVariable = 5;
6 #pragma omp parallel
7 {
8 printf("For thread %d, someVariable=%d\n", omp_get_thread_num(), someVariable);
9 }
10 }

g
an
user@host% icpc -o omp-shared omp-shared.cc -openmp

W
user@host% export OMP_NUM_THREADS=4
user@host% ./omp-shared

ng
For thread 0, someVariable=5

e
For thread 2, someVariable=5
For thread 1, someVariable=5
nh
Yu
For thread 3, someVariable=5
user@host%
r
fo
d

Listing 3.40: Code omp-shared.cc illustrating the use of shared variables in OpenMP.
re
pa

In omp-shared.cc, all threads execute the code inside of #pragma omp parallel. All of these
re
yP

threads have access to variable someVariable declared before the parallel region. By default, all variables
declared before the parallel region are shared between threads. This means that (a) all threads see the value of
el

shared variables, and (b) if one thread writes to the shared variable, all other threads see the modified value.
iv

The latter case may lead to race conditions and unpredictable behavior, unless the write operation is protected
us

as discussed in Section 3.2.6.


cl

In some cases, it is preferable to have a variable of private nature, i.e., have an independent copy of this
Ex

variable in each thread. In order to effect this behavior, the programmer may declare this variable inside the
parallel region as shown in Listing 3.41. Naturally, the programmer can initialize the value of this private
variable with the value of a shared variable.

1 int varPrivate = 3;
2 #pragma omp parallel
3 {
4 int varPrivateLocal = varPrivate; // Each thread will have a copy of varPrivateLocal
5 // ...
6 #pragma omp for
7 for (int i = 0; i < N; i++) {
8 int varTemporary = varPrivateLocal;
9 }
10 }
11 }

Listing 3.41: Variables declared outside the OpenMP parallel region are shared, variables declared inside are private.




In the code in Listing 3.41, an independent copy of varPrivateLocal is available in each thread.
This variable persists throughout the parallel region. Similarly, an independent copy of varTemporary will
exist in each thread. The value of this variable persists for the duration of a single loop iteration, but does not
persist across loop iterations.
There is an additional way to provide to each thread a private copy of some of the variables declared
before the parallel region. This can be done by using clauses private and firstprivate in #pragma
omp parallel as shown in Listing 3.42. With clause private,
a) the variable is private to each thread,
b) the initial value of a private variable is undefined, and
c) the value of the variable in the encompassing scope does not change at the end of the parallel region.
Clause firstprivate is similar to private, but the initial value is initialized with the value outside of
the parallel region.

g
an
1 #include <omp.h>
#include <stdio.h>

W
2
3

ng
4 int main() {
5 int varShared = 5; e
nh
6 int varPrivate = 1;
7 int varFirstprivate = 2;
Yu

8 #pragma omp parallel private(varPrivate) firstprivate(varFirstprivate)


{
r

9
fo

10 printf("For thread %d, varShared=%d varPrivate=%d varFirstprivate=%d\n",


11 omp_get_thread_num(), varShared, varPrivate, varFirstprivate);
ed

12 if (omp_get_thread_num() == 0) {
ar

13 varShared = -varShared; // Race condition, undefined behavior!


p

14 varPrivate = -varPrivate; // ok: each thread has own varPrivate


re

15 varFirstprivate = -varFirstprivate; // ok for the same reason


yP

16 }
17 }
el

18 printf("After parallel region, varShared=%d varPrivate=%d varFirstprivate=%d\n",


iv

19 varShared, varPrivate, varFirstprivate);


us

20 }
cl
Ex

user@host% icpc -o omp-private omp-private.cc -openmp


user@host% export OMP_NUM_THREADS=4
user@host% ./omp-private
For thread 0, varShared=5 varPrivate=0 varFirstprivate=2
For thread 1, varShared=5 varPrivate=0 varFirstprivate=2
For thread 2, varShared=5 varPrivate=0 varFirstprivate=2
For thread 3, varShared=-5 varPrivate=0 varFirstprivate=2
After parallel region, varShared=-5 varPrivate=1 varFirstprivate=2
user@host%

Listing 3.42: Code omp-private.cc illustrating the use of private and firstprivate variables in OpenMP.

Note that in C++, clauses private and firstprivate duplicate the functionality of scope-local
variables demonstrated in Listing 3.41. However in Fortran, the user must declare all variables at the beginning
of the function, and therefore there is no way to avoid using the clauses private, firstprivate and
lastprivate.
Another type of private variable behavior in OpenMP is effected by clause lastprivate, which
applies only to #pragma omp parallel for. For lastprivate variables, the value in the last




iteration is copied back to the scope outside the parallel region.
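
A minimal sketch of lastprivate follows; the variable names here are illustrative, not taken from a listing in this book.

int last = -1;
#pragma omp parallel for lastprivate(last)
for (int i = 0; i < n; i++) {
  last = i;   // each thread updates its own private copy of last
}
// After the loop, last holds the value assigned in the final iteration, i.e., n-1.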


Some programmers may also find the clause default useful, as it makes it possible to define the default nature of variables as shared or none, e.g.,

1 #pragma omp parallel for default(none) shared(a,b,c) lastprivate(d,e)


2 for (int i = 0; i < n; i++) {
3 ...
4 }

Listing 3.43: Using clause default to request that all variables declared outside the OpenMP parallel region is not
visible within the region.

In the above code, variables a, b, c and i will be shared, variables d and e will be lastprivate,
and all other variables will be none, thus not visible for the parallel region. With default(none), if the
programmer forgets to specify the sharing type for any of the variables used in the loop, the compilation will
fail — this behavior may be desirable in complex cases for explicit variable behavior check.

g
an
W
Variable Sharing in Intel Cilk Plus

ng
In Intel Cilk Plus, there is no additional pragma-like control over shared or private nature of variables.

e
nh
All variables declared before _Cilk_for are shared, and all variables declared inside the loop are only
Yu
visible to the strand executing the iteration, and exist for the duration of the loop iteration. There is no
native way to declare a variable that persists in a given worker throughout the parallel loop, like variable
r
fo

varPrivateLocal in Listing 3.41. The syntax of Intel Cilk Plus intentionally prohibits the user from
assigning a variable to a worker rather than to a chunk of data. Instead of doing this, developers must design
d
re

their algorithm to use hyperobjects such as reducers and holders, as discussed in Section 3.2.7.
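
As a preview of Section 3.2.7, the sketch below shows a reducer hyperobject replacing a shared accumulator; the variable names are illustrative.

#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>
#include <cstdio>

int main() {
  const int n = 1000;
  cilk::reducer_opadd<long> sum(0);  // each worker operates on a private view
  _Cilk_for (int i = 0; i < n; i++)
    sum += i;                        // no race condition: views are merged at the end
  std::printf("sum=%ld\n", sum.get_value());
}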
pa
re
yP
el
iv
us
cl
Ex




3.2.6 Synchronization: Avoiding Unpredictable Program Behavior


Up until now, the discussion of parallelism in shared memory was restricted to algorithms without any
interaction between threads or workers. However, for certain algorithms and operations, synchronization
between threads may be necessary. For example, one must never allow multiple threads to simultaneously
write to a shared variable, because concurrent modification of data may lead to unpredictable results. In other
cases, a thread or a group of threads must wait until some other threads have finished their work. This section
discusses the tools available in the OpenMP and Intel Cilk Plus parallel frameworks. Note that in general,
synchronization impedes the parallel scalability of applications. Whenever possible, instead of synchronization
operations, programmers must use reduction and private variables as discussed in Section 3.2.7.

Synchronization in OpenMP: Critical Sections


Consider the following example:

1 #include <omp.h>

g
#include <stdio.h>

an
2
3

W
4 int main() {

ng
5 const int n = 1000;
6 int sum = 0; e
#pragma omp parallel for
nh
7
8 for (int i = 0; i < n; i++) {
Yu

9 // Race condition
10 sum = sum + i;
r

}
fo

11
12 printf("sum=%d (must be %d)\n", sum, ((n-1)*n)/2);
ed

13 }
p ar

user@host% icpc -o omp-race omp-race.cc -openmp


re

user@host% export OMP_NUM_THREADS=32


yP

user@host% ./omp-race
sum=208112 (must be 499500)
el

user@host%
iv
us

Listing 3.44: Code omp-race.cc has unpredictable behavior and produces incorrect results due to a race condition in
cl

line 10.
Ex

In line 10 of code omp-race.cc in Listing 3.44, a situation known as a race condition occurs. The
problem is that variable sum is shared between all threads, and therefore more than one thread may execute this
line concurrently. If two threads simultaneously execute line 10, both will have to read, increment and write
the updated value of sum. However, what if one thread updates sum while another thread was incrementing
the old value of sum? This may, and will, lead to an incorrect calculation. Indeed, the output shows a value of
sum=208112 instead of 499500. Moreover, if we run this code multiple times, every time the result will be
different, because the pattern of races between threads will vary from run to run. The parallel program has a
non-predictable behavior! How does one resolve this problem?
The easiest, yet the most inefficient way to protect a portion of a parallel code from concurrent execution
in OpenMP is a critical section, as illustrated in Listing 3.45. #pragma omp critical used in this code
protects the code inside its scope from concurrent execution. The whole iteration space will still be executed
by the code in parallel, but only one thread at a time will be allowed to enter the critical section, while other
threads wait their turn. At this stage in the training we are not concerned with performance, but let us note that
this is a very inefficient way to resolve the race condition in the problem shown in Listing 3.44. We provide
this example because in some cases, a critical section is the only way to avoid unpredictable behavior.




1 #pragma omp parallel for


2 for (int i = 0; i < n; i++) {
3 #pragma omp critical
4 { // Only one thread at a time can execute this section
5 sum = sum + i;
6 }
7 }

user@host% icpc -o omp-critical omp-critical.cc -openmp


user@host% ./omp-race
sum=499500 (must be 499500)
user@host%

Listing 3.45: Parallel fragment of code omp-critical.cc has predictable behavior, because the race condition was
eliminated with a critical section.

Synchronization in OpenMP: Atomic Operations

A more efficient method of synchronization, albeit limited to certain operations, is the use of atomic
operations. Atomic operations allow the program to safely update a scalar variable in a parallel context. These
operations are effected with #pragma omp atomic, as shown in Listing 3.46.
#pragma omp parallel for
for (int i = 0; i < n; i++) {
  // Lightweight synchronization
#pragma omp atomic
  sum += i;
}

Listing 3.46: This parallel fragment of code omp-critical.cc has predictable behavior, because the race condition was eliminated with an atomic operation. Note that for this specific example, atomic operations are not the most efficient solution.

Only the following operations can be executed as atomic:

Read : operations in the form v = x
Write : operations in the form x = v
Update : operations in the form x++, x--, --x, ++x, x binop= expr and x = x binop expr
Capture : operations in the form v = x++, v = x--, v = --x, v = ++x, v = x binop expr

Here x and v are scalar variables, and binop is one of +, *, -, /, &, ^, |, <<, >>. No “trickery” is
allowed for atomic operations: no operator overloading, no non-scalar types, no complex expressions.
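As a minimal sketch (not a listing from the book; the variable names are invented for illustration), the capture form lets each thread atomically read and increment a shared counter in one indivisible step, obtaining a unique ticket number:

#include <omp.h>
#include <stdio.h>

int main() {
  int counter = 0;                       // shared counter
#pragma omp parallel num_threads(4)
  {
    int my_ticket;
    // Capture: read the old value of counter and increment it in one atomic step
#pragma omp atomic capture
    my_ticket = counter++;
    printf("thread %d got ticket %d\n", omp_get_thread_num(), my_ticket);
  }
  printf("tickets issued: %d\n", counter);
}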
In many cases, atomic operations are an adequate solution for accessing and modifying shared data.
However, in this particular case, the parallel scalability of the algorithm may be further improved by using
reducers instead of atomic operations, as discussed in Section 3.2.7.


Synchronization in OpenMP: #pragma omp taskwait


For algorithms employing the fork-join model (#pragma omp task, see Section 3.2.4), OpenMP
has a special pragma that pauses the execution in the current thread until all tasks spawned by that thread have
completed. Listing 3.47 demonstrates the use of #pragma omp taskwait.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  const int N=1000;
  int* A = (int*)malloc(N*sizeof(int));
  for (int i = 0; i < N; i++) A[i]=i;
#pragma omp parallel
  {
#pragma omp single
    {
      // Compute the sum in two threads
      int sum1=0, sum2=0;
#pragma omp task shared(A, N, sum1)
      {
        for (int i = 0; i < N/2; i++)
          sum1 += A[i];
      }
#pragma omp task shared(A, N, sum2)
      {
        for (int i = N/2; i < N; i++)
          sum2 += A[i];
      }

      // Wait for forked off tasks to complete
#pragma omp taskwait

      printf("Result=%d (must be %d)\n", sum1+sum2, ((N-1)*N)/2);
    }
  }
  free(A);
}

user@host% icpc -o omptaskwait omptaskwait.cc -openmp
user@host% ./omptaskwait
Result=499500 (must be 499500)
user@host%

Listing 3.47: Code omp-taskwait.cc illustrates the usage of #pragma omp taskwait.

The code in Listing 3.47 is an inefficient way to approach the problem, because it uses only two tasks.
A better way to perform parallel reduction is described in Section 3.2.7. Nevertheless, for scalable fork-join
parallel algorithms, #pragma omp taskwait is a native way in OpenMP to implement synchronization
points.


Synchronization in Intel Cilk Plus: _Cilk_sync


At this point, the discussion will proceed to synchronization in Intel Cilk Plus. The equivalent of
#pragma omp taskwait in Intel Cilk Plus is the compiler statement _Cilk_sync. It plays the same
role for tasks that have been forked off with _Cilk_spawn. However, in contrast with OpenMP, this
statement is the only native means of explicit synchronization in Intel Cilk Plus. Listing 3.48 is an Intel Cilk
Plus implementation of the algorithm demonstrated in Listing 3.47.

#include <stdio.h>
#include <stdlib.h>

void Sum(const int* A, const int start, const int end, int & result) {
  for (int i = start; i < end; i++)
    result += A[i];
}

int main() {
  const int N=1000;
  int* A = (int*)malloc(N*sizeof(int));
  for (int i = 0; i < N; i++) A[i]=i;

  // Compute the sum with two tasks
  int sum1=0, sum2=0;

  _Cilk_spawn Sum(A, 0, N/2, sum1);
  _Cilk_spawn Sum(A, N/2, N, sum2);

  // Wait for forked off sums to complete
  _Cilk_sync;

  printf("Result=%d (must be %d)\n", sum1+sum2, ((N-1)*N)/2);

  free(A);
}

Listing 3.48: Code cilk-sync.cc illustrates the usage of _Cilk_sync.



Just as with OpenMP, this is an inefficient way to implement parallel reduction; a better method is described in Section 3.2.7.



Intel Cilk Plus: Locks


There are no native locks in Intel Cilk Plus. However, Intel Cilk Plus interoperates with locks in, e.g.,
Threading Building Blocks. This topic is beyond the scope of this training. As a rule of thumb, for optimum
performance and portability, developers should try to design parallel algorithms using only native Intel Cilk
Plus keywords and hyperobjects, instead of resorting to additional synchronization methods.

Implicit Synchronization in OpenMP and Intel Cilk Plus
In addition to synchronization methods described above, OpenMP and Intel Cilk Plus contain implicit
synchronization points at the beginning and end of parallel loops and parallel regions (in OpenMP only). This
means that code execution does not proceed until all iterations of the parallel loop have been performed, or
until the last statement of the parallel region has been executed in every thread.
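As a brief illustration (a minimal sketch, not a listing from the book), the implicit barrier at the end of an OpenMP loop can be removed with the nowait clause when the work that follows does not depend on the loop having completed in all threads:

#include <omp.h>
#include <stdio.h>

int main() {
  const int n = 1000;
  float a[1000], b[1000];
#pragma omp parallel
  {
    // The implicit barrier at the end of this loop is removed by "nowait":
    // threads that finish early proceed to the next loop immediately.
#pragma omp for nowait
    for (int i = 0; i < n; i++) a[i] = 2.0f*i;

    // Safe only because the two loops touch independent arrays.
#pragma omp for
    for (int i = 0; i < n; i++) b[i] = -1.0f*i;
  }
  printf("a[10]=%f, b[10]=%f\n", a[10], b[10]);
}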


3.2.7 Reduction: Avoiding Synchronization

Some parallel algorithms that require synchronization only to modify a common quantity can be expressed
in terms of reduction. This possibility arises if the operation with which the common quantity is calculated
is associative (such as integer addition or multiplication) or approximately associative (such as floating-point
addition or multiplication), i.e., the order of operations does not affect the result, at least to within rounding error. OpenMP has
reduction clauses for parallel pragmas, and Intel Cilk Plus has specialized variables called reducers in
order to effect reduction. It is also possible to instrument a reduction algorithm using private variables and
minimal synchronization. Properly instrumented parallel reduction avoids excessive synchronization and
communication, which improves the parallel scalability and, therefore, the application performance.

Reduction Clause in OpenMP

In OpenMP, for-loops can automatically perform reduction for certain operations on scalar variables.
Listing 3.49 illustrates the algorithm shown in Listing 3.44, Listing 3.45 and Listing 3.46, instrumented using
the OpenMP reduction clause:

#include <omp.h>
#include <stdio.h>

int main() {
  const int n = 1000;
  int sum = 0;
#pragma omp parallel for reduction(+: sum)
  for (int i = 0; i < n; i++) {
    sum = sum + i;
  }
  printf("sum=%d (must be %d)\n", sum, ((n-1)*n)/2);
}

user@host% icpc -o omp-reduction omp-reduction.cc -openmp
user@host% ./omp-reduction
sum=499500 (must be 499500)
user@host%

Listing 3.49: Code omp-reduction.cc has the race condition eliminated with a reduction clause.

The syntax of the reduction clause is reduction(operator:variables), where operator is one of: +,
*, -, &, |, ^, &&, ||, max or min, and variables is a comma-separated list of variables to which these
operations are applied.
It is possible to implement reduction for other operations and other types of variables using private
variables and a critical section or an atomic operation after the loop. This, in fact, is what happens behind the
scenes when the reduction clause is specified in OpenMP. With this method, each thread must have a
private variable of the same type as the global reduction variable. In each thread, reduction is performed to
that private variable without synchronization with other threads. At the end of the loop, a critical section is
used in order to reduce the private variables from each thread into the global variable. The principle of this
method is shown in Listing 3.50.


#include <omp.h>
#include <stdio.h>

int main() {
  const int n = 1000;
  int sum = 0;
#pragma omp parallel
  {
    int sum_th = 0;
#pragma omp for
    for (int i = 0; i < n; i++) {
      sum_th = sum_th + i;
    }

#pragma omp atomic
    sum += sum_th;

  }
  printf("sum=%d (must be %d)\n", sum, ((n-1)*n)/2);
}

Listing 3.50: Code omp-reduction2.cc implements reduction using private variables and minimal synchronization.

In Listing 3.50, this specific example could also be implemented with a critical section instead of an atomic
operation. The solution with a critical section is not optimal in this case; however, it may be necessary in other
cases, when the reduction operation is not atomic, or the data type of the reduction variable is not supported by
the reduction clause. For example, a C++ container class as a reduction variable is not eligible for the OpenMP
reduction clause or for atomic operations. However, reduction into a C++ container class can be done using a
critical section.
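As a minimal sketch of this technique (not a listing from the book; the container and loop are chosen purely for illustration), each thread accumulates its partial result in a private std::vector, and a critical section is entered only once per thread to merge it into the shared container:

#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
  const int n = 1000;
  std::vector<int> evens;            // shared container: not usable with reduction()
#pragma omp parallel
  {
    std::vector<int> evens_th;       // thread-private partial result
#pragma omp for nowait
    for (int i = 0; i < n; i++)
      if (i % 2 == 0) evens_th.push_back(i);
    // Merge the partial results, one thread at a time
#pragma omp critical
    evens.insert(evens.end(), evens_th.begin(), evens_th.end());
  }
  printf("collected %d even numbers\n", (int)evens.size());
}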

Reducers in Intel Cilk Plus
Intel Cilk Plus, compared to OpenMP, allows the user less fine-grained control over synchronization, but
makes up for it with versatile support for hyperobjects: reducers and holders. The restricted lexicon of the
Intel Cilk Plus framework encourages the programmer to employ efficient parallel algorithms, which avoid
excessive synchronization and exhibit high parallel scalability. In addition, lexical restrictions of Intel Cilk
Plus enforce serial semantics and ensure that the serialized version of the code will produce the same results as the
parallel version.
Reducers are variables that hold shared data, yet these variables can be safely used by multiple strands
of a parallel code. At runtime, each worker operates on its own private copy of the data, which reduces
synchronization and communication between workers.
Listing 3.51 demonstrates an Intel Cilk Plus implementation of the parallel sum reduction from the example
shown in Listing 3.49.

#include <cilk/reducer_opadd.h>
#include <stdio.h>

int main() {
  const int n = 1000;
  cilk::reducer_opadd<int> sum;
  sum.set_value(0);
  _Cilk_for (int i = 0; i < n; i++) {
    sum += i;
  }
  printf("sum=%d (must be %d)\n", sum.get_value(), ((n-1)*n)/2);
}

Listing 3.51: Code cilk-reduction.cc implements reduction using reducers.



Note the following details in cilk-reduction.cc:

a) The header file corresponding to a specific reducer must be included. In this case, it is cilk/reducer_opadd.h for the addition reducer.

b) Reducers are generic (template) C++ classes.

c) Inside the parallel region, the reducer sum is used just like a regular variable of type int, except that only the operations supported by the reducer are allowed (here, +=).

d) Outside the parallel region, the reducer can only be used via accessors and mutators (in this case, get_value() and set_value()).

The power of reducers in Intel Cilk Plus is greatly enhanced by support for user-defined reducers. This
procedure is described in the Intel C++ Compiler reference [29]. However, for many applications, the set
of reducers provided with Intel Cilk Plus may be sufficient.


The list of reducers supported by Intel Cilk Plus is shown below. Reducer names are self-explanatory,
and additional information can be found in Intel C++ Compiler reference [30].

reducer_list_append in <cilk/reducer_list_append.h> supports operation push_back().
reducer_list_prepend in <cilk/reducer_list_prepend.h> supports operation push_front().
reducer_max and reducer_max_index in <cilk/reducer_max.h> support operation cilk::max_of.
reducer_min and reducer_min_index in <cilk/reducer_min.h> support operation cilk::min_of.
reducer_opadd in <cilk/reducer_opadd.h> supports operations +=, -=, ++ and --.
reducer_opand in <cilk/reducer_opand.h> supports operations & and &=.
reducer_opor in <cilk/reducer_opor.h> supports operations | and |=.
reducer_opxor in <cilk/reducer_opxor.h> supports operations ^ and ^=.
reducer_ostream in <cilk/reducer_ostream.h> supports operation <<.
reducer_basic_string in <cilk/reducer_string.h> supports operation += to create a string.
reducer_string and reducer_wstring are shorthands for reducer_basic_string for types char and wchar_t, respectively.
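As a minimal sketch (not a listing from the book), the following code finds the maximum of an array with reducer_max and the cilk::max_of operation named above; the exact reducer interface should be verified against the Intel C++ Compiler reference [30]:

#include <cilk/reducer_max.h>
#include <stdio.h>

int main() {
  const int n = 1000;
  float data[1000];
  for (int i = 0; i < n; i++) data[i] = (i*7) % 101;   // arbitrary test values
  cilk::reducer_max<float> rmax;                       // holds the running maximum
  _Cilk_for (int i = 0; i < n; i++) {
    rmax = cilk::max_of(rmax, data[i]);                // each worker updates its own view
  }
  printf("max=%f\n", rmax.get_value());
}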

Holders in Intel Cilk Plus
Holders in Intel Cilk Plus are hyperobjects that allow thread-safe read/write accesses to common data.
Holders are similar to reducers, with the exception that they do not support synchronization at the end of the
parallel region. This makes it possible to implement all holders with a single C++ template class called cilk::holder.
The role of holders in Intel Cilk Plus is similar to the role of private variables in OpenMP declared in
the same way that the variable sum_th is declared in Listing 3.50. However, holders provide additional
functionality in fork-join codes. Namely, the view of a holder upon the first spawned child of a function (or
the first child spawned after a sync) is the same as upon the entry to the function, even if a different worker is
executing the child. This functionality allows holders to be used as a replacement for argument passing. Unlike a
truly shared variable, a holder has an undetermined state in some cases (in spawned children after the first one
and in an arbitrary iteration of a _Cilk_for loop), because each strand manipulates its private view of the
holder.
Listing 3.52 and Listing 3.54 demonstrate the use of a holder as a private variable. In Listing 3.52, in
the _Cilk_for loop, a separate copy of the variable scratch is constructed for each iteration. If the cost
of the constructor ScratchType() is too high, then using a holder as shown in Listing 3.54 may improve
efficiency. When ScratchType is wrapped in the template class cilk::holder, the constructor of
ScratchType() will be called only once for each worker. At the same time, the view of the variable
scratch is undetermined in an arbitrary iteration of the loop.

#include <stdio.h>
#include <cilk/holder.h>
#include <cilk/cilk_api.h>
const int N = 100000;
class ScratchType {
public:
  int data[N];
  ScratchType(){printf("Constructor called by worker %d\n",
                       __cilkrts_get_worker_number());}
};
int main(){
  _Cilk_for (int i = 0; i < 10; i++) {
    ScratchType scratch;
    scratch.data[0:N] = i;
    int sum = 0;
    for (int j = 0; j < N; j++) sum += scratch.data[j];
    printf("i=%d, worker=%d, sum=%d\n", i, __cilkrts_get_worker_number(), sum);
  }
}

Listing 3.52: Source cilk-noholder.cc demonstrates using a private variable for intermediate calculations in a _Cilk_for loop.

user@host% icpc -o cilk-noholder cilk-noholder.cc
user@host% export CILK_NWORKERS=2
user@host% ./cilk-noholder
Constructor called by worker 0
Constructor called by worker 1
i=0, worker=0, sum=0
Constructor called by worker 0
i=5, worker=1, sum=500000
Constructor called by worker 1
i=1, worker=0, sum=100000
Constructor called by worker 0
i=6, worker=1, sum=600000
Constructor called by worker 1
i=2, worker=0, sum=200000
Constructor called by worker 0
i=7, worker=1, sum=700000
Constructor called by worker 1
i=8, worker=1, sum=800000
i=3, worker=0, sum=300000
Constructor called by worker 0
Constructor called by worker 1
i=4, worker=0, sum=400000
i=9, worker=1, sum=900000

Listing 3.53: Compiling and running code cilk-noholder.cc from Listing 3.52. Note that the constructor of ScratchType() is called for every loop iteration.


#include <stdio.h>
#include <cilk/holder.h>
#include <cilk/cilk_api.h>
const int N = 100000;
class ScratchType {
public:
  int data[N];
  ScratchType(){printf("Constructor called by worker %d\n",
                       __cilkrts_get_worker_number());}
};
int main(){
  cilk::holder<ScratchType> scratch;
  _Cilk_for (int i = 0; i < 10; i++) {
    scratch().data[0:N] = i; // Operator () is an accessor to data in a holder
    int sum = 0;
    for (int j = 0; j < N; j++) sum += scratch().data[j];
    printf("i=%d, worker=%d, sum=%d\n", i, __cilkrts_get_worker_number(), sum);
  }
}

Listing 3.54: Source cilk-holder.cc demonstrates holder usage in Intel Cilk Plus. Listing 3.55 demonstrates that this code may yield better efficiency than the code without holders in Listing 3.52.

user@host% icpc -o cilk-holder cilk-holder.cc
user@host% export CILK_NWORKERS=2
user@host% ./cilk-holder
Constructor called by worker 0
Constructor called by worker 1
i=0, worker=0, sum=0
i=5, worker=1, sum=500000
i=1, worker=0, sum=100000
i=6, worker=1, sum=600000
i=2, worker=0, sum=200000
i=7, worker=1, sum=700000
i=3, worker=0, sum=300000
i=8, worker=1, sum=800000
i=4, worker=0, sum=400000
i=9, worker=1, sum=900000

Listing 3.55: Compiling and running cilk-holder.cc. Note that the constructor of ScratchType() is called once for every worker, but not once for every iteration. If the cost of the constructor is high, this code may provide better efficiency than the code in Listing 3.52.


3.2.8 Additional Resources on Shared Memory Parallelism


We have provided an overview of the fundamentals of parallel programming in shared memory with Intel
compilers in two frameworks: OpenMP and Intel Cilk Plus. We focused on expressing and controlling task
parallelism, leaving the optimization discussion for the next chapter. Understanding the methods and language
extensions covered here is sufficient for leveraging the performance optimization examples in Chapter 4. In
many real-world applications, this tool set will also be sufficient.
For readers wishing to continue studying OpenMP and Intel Cilk Plus, or to learn about other parallel
frameworks and parallel programming, we provide a list of references below.

1) A dry, but comprehensive, description can be found in the OpenMP specifications at the OpenMP Architecture Review Board Web site http://openmp.org/wp/openmp-specifications/ [31]

2) A detailed OpenMP tutorial from Blaise Barney of Lawrence Livermore National Laboratory is available at https://computing.llnl.gov/tutorials/openMP/ [32]

3) Intel Cilk Plus pages in the Intel C++ Compiler reference provide details and examples for programming with this parallel framework [25]

4) The Intel Threading Building Blocks (TBB) project is another powerful parallel framework and library: http://threadingbuildingblocks.org [33]. This product has an open-source implementation.

5) Intel Array Building Blocks (ArBB) is a high-level library for parallel data processing [34].

6) The book “Intel Xeon Phi Coprocessor High Performance Programming” by Jim Jeffers and James Reinders [35] (see also http://www.lotsofcores.com/ [36]).

7) The book “Structured Parallel Programming: Patterns for Efficient Computation” by Michael McCool, Arch D. Robison and James Reinders [37] is a developer's guide to patterns for high-performance parallel programming (see also the Web site of the book [38]). The book discusses fundamental parallel algorithms and their implementations in Intel Cilk Plus and TBB.

8) The book “Parallel Programming in C with MPI and OpenMP” by Michael J. Quinn [39] is full of examples of high performance applications implemented in OpenMP and MPI, illustrating the programming, optimization and benchmarking methodology studied in detail in this work.



3.3 Task Parallelism in Distributed Memory, MPI

At this point, the reader is familiar with data parallelism in Intel Xeon family processors (SIMD
instructions), with task parallelism in multi- and many-core systems (multiple threads operating in shared
memory). The next level of parallelism is scaling an application across multiple compute nodes in distributed
memory. The most commonly used framework for distributed memory HPC calculations is the Message
Passing Interface (MPI). This section discusses expressing parallelism with MPI.

3.3.1 Parallel Computing in Clusters with Multi-Core and Many-Core Nodes

MPI is a communication protocol. It allows multiple processes, which do not share common memory
but reside on a common network, to perform parallel calculations, communicating with each other by passing
messages. MPI messages are arrays of predefined and user-defined data types. The purposes of MPI messages
range from task scheduling to exchanging large amounts of data necessary to perform the calculation. MPI
guarantees that the order of sent messages is preserved on the receiver side. The MPI protocol also provides
error control. However, the developer is responsible for communication fairness control, as well as for task
scheduling and computational load balancing.

Originally, in the era of single-core compute nodes, the dominant MPI usage model in clusters was to run
one MPI process per physical machine. With the advent of multi-core, multi-socket, and now heterogeneous
many-core systems, the range of usage models of MPI has grown (see Figures 3.3 — 3.6):

a) It is possible to run one MPI process per compute node, exploiting parallelism in each machine with a
shared-memory parallel framework, such as OpenMP or Intel Cilk Plus (see Section 3.2). Figure 3.3
illustrates this configuration.

Figure 3.3: Hybrid MPI and OpenMP parallelism diagram. One multi-threaded MPI process per node.

b) Alternatively, one single-threaded MPI process can run on each physical or logical core of each machine in
the cluster. In this case, MPI processes running on one compute node still do not share memory address
space. However, message passing between these processes is more efficient, because fast virtual fabrics
can be used for communication. This approach is illustrated in Figure 3.4.


Figure 3.4: Pure MPI parallelism diagram. One single-threaded MPI process per core.

c) Another option is to run multiple multi-threaded MPI processes per compute node, as shown in Figure 3.5.
In this case, each process exploits parallelism in shared memory, and MPI communication between
processes adds distributed-memory parallelism. This hybrid approach may yield optimum performance for
applications with a high frequency or large volume of communication.


Figure 3.5: Hybrid MPI and OpenMP parallelism. Several multi-threaded MPI processes per node.

d) In heterogeneous clusters with Intel Xeon Phi coprocessors, MPI programmers have a choice of running
MPI processes on hosts and coprocessors natively (as in cases a, b and c), or running MPI processes only
on hosts and performing offload to coprocessors (see Figure 3.6).



Figure 3.6: Hybrid MPI and OpenMP parallelism diagram with offload from hosts to coprocessors.

Multiple implementations of MPI have been developed since the protocol’s inception in 1991. In this
training, we will be using the Intel MPI library version 4.1, which implements MPI version 2.2 specification.
Intel MPI has native support for Intel Xeon Phi coprocessors, integrates with Intel software development tools,
and operates with a variety of interconnect fabrics.


3.3.2 Program Structure in Intel MPI


Compiling and Running Applications
MPI applications in C, C++ and Fortran must be compiled with special wrappers over the respective
compilers. The following commands invoke Intel MPI compilers:
mpiicc for C language (icc is default compiler),
mpiicpc for C++ language (icpc is default compiler),
mpiifort for Fortran 77 and Fortran 95 (ifort is default compiler).
In order to run an MPI application, it must be executed with an MPI execution tool. Intel MPI contains a
simplified script that starts MPI applications, called mpirun. This script accepts the list of hosts on which the
application is executed, either as command line arguments, or in a machine file.
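For illustration only (the host names, file name, process count and option spellings below are assumptions to be verified against the Intel MPI reference [6]), a launch on two hosts through a machine file could look like this:

user@host% cat ./machines
node001
node002
user@host% mpirun -machinefile ./machines -np 4 ./my_mpi_app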
Typically, the same MPI application is launched on each MPI host. That is, each MPI host executes the
same program. However, this does not mean that all processes perform the same work. At runtime, each MPI
process is assigned a unique identifier called the MPI rank. MPI ranks are integers that begin at 0 and increase
contiguously. Using these ranks, processes can coordinate execution and identify their role in the application
even before they exchange any messages. It is also possible to launch multiple executables on different hosts
as a part of a single application. For complex applications, processes can be bundled into communicators and
groups.
A “Hello World” MPI application was demonstrated in Chapter 2 in Section 2.1 and Section 2.4, and the
reader may refer to these sections to refresh this material.



Structure of MPI Applications



Listing 3.56 schematically demonstrates the structure of all MPI applications in C++.

1 #include "mpi.h"
2
el

3 int main(int argc, char** argv) {


iv

4
us

5 // Set up MPI environment


cl

6 int ret = MPI_Init(&argc,&argv);


if (ret != MPI_SUCCESS) {
Ex

7
8 MyErrorLogger("...");
9 MPI_Abort(MPI_COMM_WORLD, ret);
10 }
11
12 int worldSize, myRank, myNameLength;
13 char myName[MPI_MAX_PROCESSOR_NAME];
14 MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
15 MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
16 MPI_Get_processor_name(myName, &myNameLength);
17
18 // Perform work
19 // Exchange messages with MPI_Send, MPI_Recv, etc.
20 // ...
21
22 // Terminate MPI environment
23 MPI_Finalize();
24 }

Listing 3.56: Structure of an MPI application.


The code in Listing 3.56 illustrates the following rules:

• Header file #include <mpi.h> is required for all programs that make Intel MPI library calls.
• Intel MPI calls begin with MPI_
• The MPI portion of the program begins with a call to MPI_Init and ends with MPI_Finalize
• Communicators can be used to address a substructure of MPI processes, and the default communicator
MPI_COMM_WORLD includes all current MPI processes.
• Each process within a communicator identifies itself with a rank, which can be queried by calling the
function MPI_Comm_rank
• The number of processes in the given communicator can be queried with MPI_Comm_size.

• Using the ranks and the world size, it is possible to distribute roles between processes in an application
even before any messages are exchanged.

• Most MPI routines return an error code. The default MPI behavior is to abort program execution if there
was an error.


3.3.3 Point-to-Point Communication


Now that we have created, compiled and executed a “Hello world” parallel MPI application (see
Section 2.1 and Section 2.4), let us move on to passing messages between MPI processes.
In this section we will discuss only blocking communication routines. These routines pause the execution
of a task until it is safe to re-use (for sending) or use (for receiving) the memory space holding the message.
Non-blocking routines are discussed in Section 3.3.4.
At this point, we will consider point-to-point communication, i.e., operations in which messages have only
one source rank and one destination rank. In Section 3.3.5, we will also discuss collective communication, i.e.,
message exchange operations with more than one source or more than one destination.

Example
Listing 3.57 demonstrates the use of blocking point-to-point communication routines. In this code,
multiple “worker” processes report to the “master” process with rank equal to 0.

#include <mpi.h>
#include <stdio.h>
int main (int argc, char *argv[]) {
  int i, rank, size, namelen;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Status stat;
  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name (name, &namelen);
  if (rank == 0) {
    printf ("I am the master process, rank %d of %d running on %s\n", rank, size, name);
    for (i = 1; i < size; i++) {
      // Blocking receive operations in the master process
      MPI_Recv (&rank, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
      MPI_Recv (&namelen, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
      MPI_Recv (name, namelen + 1, MPI_CHAR, i, 1, MPI_COMM_WORLD, &stat);
      printf ("Received hello from rank %d running on %s\n", rank, name);
    }
  } else {
    // Blocking send operations in all other processes
    MPI_Send (&rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    MPI_Send (&namelen, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    MPI_Send (name, namelen + 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
  }
  MPI_Finalize ();
}

user@host% mpiicpc -o mpi-p2p mpi-p2p.cc
user@host% mpirun -np 4 ./mpi-p2p
I am the master process, rank 0 of 4 running on host
Received hello from rank 1 running on host
Received hello from rank 2 running on host
Received hello from rank 3 running on host

Listing 3.57: Source code mpi-p2p.cc illustrates basic MPI communication.

The program shown in Listing 3.57 uses two functions that are new in our discussion: MPI_Send and
MPI_Recv. These functions, respectively, send and receive a message. MPI_Send and MPI_Recv and
their variations (read on) are the basis of MPI.


Blocking Point-to-Point Message Passing Routines


The syntax of the standard MPI routines for blocking point-to-point communication is described below.

MPI_Recv (&buf, count, datatype, source, tag, comm, &status) is a basic blocking
receive operation. It posts the intent to receive a message and blocks (i.e., waits) until the requested
message is received into the receive buffer buf.
MPI_Send (&buf, count, datatype, dest, tag, comm) is a basic blocking send operation.
It sends the message contained in the send buffer buf, and blocks until it is safe to re-use the send buffer.

Here and elsewhere, the meanings and types of the common parameters are:

Type and Name            Role

void* buf                Pointer to the message data.

int count                Number of elements in the send buffer.

MPI_Datatype datatype    Indicates the type of data elements in the buffer. Table 3.5 lists predefined MPI data types.

int dest                 Rank of the process to which the message is sent.

int source               Rank of the process from which a message is received. The special wild card value MPI_ANY_SOURCE allows a message to be received from any task.

int tag                  User-defined arbitrary non-negative integer used to uniquely identify a message. The tag specified in a send operation must be matched in the corresponding receive operation. The special tag value MPI_ANY_TAG overrides this behavior, allowing a message with any tag to be received. According to the MPI standard, integers 0 through 32767 can be used as tags. Depending on the implementation, the allowed range may be wider.

MPI_Comm comm            Communication context, or set of processes, for which the source or destination fields are valid. MPI_COMM_WORLD is used to access all processes belonging to the current MPI application.

MPI_Status* status       Pointer to a structure containing the source, the tag and the length of the received message. In order to access the length from status, function MPI_Get_count must be used.

Table 3.4: Common MPI function arguments.


The data types supported by MPI are shown in Table 3.5. Note that user-defined data types can be created
in MPI.

Required types             Length, bytes
MPI_PACKED                 1
MPI_BYTE                   1
MPI_CHAR                   1
MPI_UNSIGNED_CHAR          1
MPI_SIGNED_CHAR            1
MPI_WCHAR                  2

Optional types             Length, bytes
MPI_SHORT                  2
MPI_INTEGER1               1
MPI_UNSIGNED_SHORT         2
MPI_INTEGER2               2
MPI_INT                    4
MPI_INTEGER4               4
MPI_UNSIGNED               4
MPI_INTEGER8               8
MPI_LONG                   4
MPI_LONG_LONG              8
MPI_UNSIGNED_LONG          4
MPI_UNSIGNED_LONG_LONG     8
MPI_FLOAT                  4
MPI_DOUBLE                 8
MPI_REAL4                  4
MPI_LONG_DOUBLE            16
MPI_REAL8                  8
MPI_REAL16                 16
MPI_CHARACTER              1
MPI_LOGICAL                4
MPI_INTEGER                4
MPI_REAL                   4
MPI_DOUBLE_PRECISION       8
MPI_COMPLEX                2*4
MPI_DOUBLE_COMPLEX         2*8

Table 3.5: Data types required and recommended by the MPI standard. For a list of the types available in a specific MPI implementation, read the <mpi.h> file of that implementation.



Reliability, Order and Fairness


The MPI protocol provides reliable message transmission. This means that a sent message is always
received correctly. If messages are sent over unreliable network layers (e.g., TCP/IP), then it is the job of the
MPI implementation to ensure reliability at the MPI message level. At the same time, MPI does not provide
any mechanisms available to the user for transmission error correction.
Additionally, MPI guarantees that messages will not overtake each other. Namely, if a single sender
sends two messages, they will be received in the order that they were sent. If a single receiver posts receives
for two messages, they will be satisfied in the order posted. Note that these rules do not apply if multiple
threads in a host are performing communication.
On the other hand, MPI does not guarantee fairness in servicing connection attempts. For instance, if a
task posts a receive from MPI_ANY_SOURCE, and two other tasks send messages with matching tags (see
Figure 3.7), then only one of the sends will complete. There is no guarantee which send will complete. It is the
programmer’s responsibility to prevent such conflicts.

Figure 3.7: Diagram of an MPI fairness conflict: two sender tasks (task 0 and task 1) each send data to a single receiver (task 2).



“Now you know your MPI”


Functions MPI_Recv and MPI_Send are easy to use, and they provide message passing functionality
that is sufficient for many real-world HPC applications. That said, the discussion of parallelism in MPI could
be terminated at this point. However, we will discuss additional topics of MPI in the rest of Section 3.3. Here
is what to expect in the continuation of this discussion:

1. Buffering is a system-level functionality of MPI that enables significant optimization of communication
efficiency. The use of buffering may require additional efforts to prevent errors.
2. Non-blocking send and receive operations can be used to overlap computation and communication.
3. Collective communication is helpful for certain parallel patterns, including reduction. Reduction is built
into MPI in the form of dedicated functions.

Finally, in this chapter we did not touch on the issues of performance with MPI. This, along with other
performance tuning questions, is left for Chapter 4 (Section 4.7).


3.3.4 MPI Communication Modes


Functions MPI_Send and MPI_Recv are the “standard” functions for sending and receiving messages.
MPI implementations optimize these functions for a balance between efficiency and stability. The expected
behavior is that these functions take as little time as possible, and yet, other send and receive operations may be
safely called after MPI_Send or MPI_Recv complete. However, there are other flavors of send and receive
operations available to users who wish to fine-tune their applications, as discussed below.

Terminology: Application (Send) Buffer, System Buffer and User Space Buffer
Historically, the word “buffer” in the context of MPI is used in multiple terms with very different meanings.
It is important to understand the difference between these terms for the following discussion.

a) Application buffer collectively refers to send buffers and receive buffers. This is a memory region in the
user application which holds the data of the sent or received message. In Table 3.4, the variable void
*buf represents either the send or the receive buffer. In the code in Listing 3.57, the role of send and
receive application buffers is played by the variables rank, namelen and name.

b) System buffer is a memory space managed by the MPI runtime library, which is used to hold messages
that are pending for transmission on the sender side, or for reception to the application on the receiver
side. The purpose of the system buffer is to enable asynchronous communication. The system buffer is
not visible to the programmer. System buffers in MPI may exist on both the sender side and the receiver
side. The standard functions MPI_Send and MPI_Recv typically use system-level buffers provided and
managed by the MPI runtime library.

c) User space buffer plays the same role as the system buffer: it can temporarily store messages to enable
asynchronous communication. However, this special buffer space is allocated and managed by the user and
can only be used in specialized buffered send functions.
p
re
yP

Terminology: Synchronous and Asynchronous Communication



In this discussion, we will be using the terms synchronous and asynchronous communication modes and
iv

the terms blocking and non-blocking operations. In MPI, these pairs of terms are not synonymous. It may
us

further add to the confusion that the meaning of synchronous and asynchronous in MPI is different from that in
cl

the offload programming model for Intel Xeon Phi coprocessors. Let us clarify these terms before discussing

specific MPI communication modes.

a) Synchronous communication means that the sender must wait until the corresponding receive request
is posted by the receiver. After a “handshake” between the sender and receiver occurs, the message is
passed without buffering. This mode is more deterministic and uses less memory than asynchronous
communication, but at the cost of the time lost for waiting.

b) Asynchronous communication in the case of sending means that the sender does not have to wait for the
receiver to be ready. The sender may put the message into the system buffer (either on the sender, or on the
receiver side) or into the user space buffer, and return.

Terminology: Blocking and Non-Blocking Functions


Another concept in MPI is blocking and non-blocking functions.

a) Blocking send functions pause execution until it is safe to modify the current send buffer. Blocking receive
functions wait until the message is fetched into the receive buffer.


b) Non-blocking send functions return immediately and execute the transmission “in the background”. Non-
blocking receive functions only post the intent to receive a message, but do not pause execution. It is not
safe to re-use or modify the send buffer before ensuring that a non-blocking send operation has completed.
Likewise, it is unsafe to read from the receive buffer before ensuring that a non-blocking receive operation
has completed. In order to ensure that a non-blocking operation is complete, each non-blocking MPI
function must have a corresponding MPI_Wait function.
Blocking and non-blocking functions exist in synchronous as well as asynchronous flavors.

Terminology: Ready Communication


If the programmer can guarantee that, by the time a send function is called, a matching receive is
already pending, it is possible to use the ready mode send. This eliminates the handshake and may accelerate
communication. If the receive is not posted, this is an error condition and the whole application must be
aborted. There are both blocking and non-blocking ready mode functions. Any mode of the send function can
be paired with any mode of the receive function.
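The following minimal sketch (not a listing from the book) shows one hypothetical way to provide this guarantee: the receiver posts a non-blocking receive first, a barrier establishes that the receive is pending, and only then does the sender issue the ready mode send (run with at least two processes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  const int N = 1000, tag = 1;
  float data[N];
  int rank;
  MPI_Request req;
  MPI_Status stat;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 1)
    MPI_Irecv(data, N, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &req);   // receive posted first
  MPI_Barrier(MPI_COMM_WORLD);       // after the barrier, the receive is known to be pending
  if (rank == 0) {
    for (int i = 0; i < N; i++) data[i] = 1.0f;
    MPI_Rsend(data, N, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);         // ready mode send: no handshake
  } else if (rank == 1) {
    MPI_Wait(&req, &stat);           // wait until the message has actually arrived
    printf("received data[0] = %f\n", data[0]);
  }
  MPI_Finalize();
}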

Explanation of Communication Modes

In order to better illustrate synchronous and asynchronous, blocking and non-blocking, and ready mode
functions, consider this real-world analogy. Suppose the sender (let us call her Sierra) wants to communicate
to the receiver (let us call him Romeo) the time and place of their lunch meeting. The following situations are
equivalent to the various communication modes in MPI:

1) Blocking asynchronous send: Sierra dials Romeo's telephone number and leaves a message on Romeo's
answering machine. Sierra does not return to her activities until she has left the message. This reflects the
blocking nature of this transaction. At the same time, after the transaction is complete, there is no guarantee
that Romeo has personally received the message. This reflects the asynchronous nature of the transaction.
The answering machine plays the role of a receiver-side system buffer in this case.

2) Blocking synchronous send: Sierra keeps dialing Romeo's telephone number until Romeo personally
picks up the phone. This transaction is blocking, because Sierra cannot return to her other activities until
she speaks to Romeo. This transaction is synchronous, because at the end of the transaction, Romeo has
definitely received the message.

3) Non-blocking asynchronous send: Sierra tells her assistant to call Romeo and leave a message on his
answering machine. Sierra returns to her other activities immediately, so this transaction is non-blocking.
Another property of non-blocking transactions: Sierra must wait for her assistant to finish with this task
before assigning him another task. Her assistant does not have to reach Romeo personally; leaving the
message on the answering machine is satisfactory in this case, so this transaction is asynchronous.

4) Non-blocking synchronous send: Sierra tells her assistant to call Romeo, and to make sure to talk to
him personally, and not to his answering machine. Sierra can do other things while her assistant works
on transmitting the message, so this is a non-blocking transaction. This non-blocking transaction is
synchronous, because after the assistant has finished with this task, Romeo is sure to have received the
message.

5) Blocking ready mode send: Romeo is already on hold on Sierra's phone line when Sierra picks up
the phone. Sierra returns to her other activities only after she has transmitted her message (blocking
transaction).

6) Non-blocking ready mode send: Romeo is already on hold on Sierra's phone line, but she re-directs him
to her assistant. Sierra returns to her other activities immediately, so this is non-blocking communication.


Summary of Communication Modes


Table 3.6 summarizes the MPI communication modes and functions effecting them.

Function: MPI_Send
Effect: Blocking send operation. Synchronous or asynchronous depending on the MPI implementation and runtime conditions. Returns when it is safe to re-use the send buffer.
Use scenarios: Default blocking send operation.

Function: MPI_Bsend, MPI_Buffer_attach, MPI_Buffer_detach
Effect: Blocking asynchronous send operation with a user space buffer. Returns when it is safe to re-use the send buffer. The user space buffer must be allocated with MPI_Buffer_attach prior to calling MPI_Bsend.
Use scenarios: Used for asynchronous blocking communication when the system buffer is inefficient, prone to overflows, or is not used by MPI_Send.

Function: MPI_Ssend
Effect: Blocking synchronous send operation. Not buffered. Returns when it is safe to re-use the send buffer.
Use scenarios: Used (a) when the message must be received before the function returns, or (b) to eliminate the memory overhead of system or user space buffers.

Function: MPI_Rsend
Effect: Blocking ready mode send operation. Synchronous or asynchronous depending on the MPI implementation and runtime conditions. Returns when it is safe to re-use the send buffer. Assumes that the matching receive has already been posted; error otherwise.
Use scenarios: Used in codes with fine-grained communication to improve performance. It is the programmer's responsibility to ensure that matching receives are posted before MPI_Rsend.

Function: MPI_Recv
Effect: Blocking receive operation.
Use scenarios: Can be paired with any send operation.

Function: MPI_Isend
Effect: Non-blocking send operation. Synchronous or asynchronous depending on the MPI implementation and runtime conditions. MPI_Wait must be called prior to re-using the send buffer.
Use scenarios: Default non-blocking send operation. Used to overlap communication and computation between MPI_Isend and MPI_Wait.

Function: MPI_Ibsend, MPI_Buffer_attach, MPI_Buffer_detach
Effect: Non-blocking asynchronous send operation with a user space buffer. MPI_Wait must be called prior to re-using the send buffer. The user space buffer must be allocated with MPI_Buffer_attach prior to calling MPI_Ibsend.
Use scenarios: Potentially the most efficient send method. Asynchronous and non-blocking, allows overlapping computation and communication. See also the comment for MPI_Bsend.

Function: MPI_Issend
Effect: Non-blocking synchronous send operation. Not buffered. MPI_Wait must be called prior to re-using the send buffer.
Use scenarios: Used to overlap communication and computation. At the same time, eliminates the memory overhead of system or user space buffers.

Function: MPI_Irsend
Effect: Non-blocking ready mode send operation. Synchronous or asynchronous depending on the MPI implementation and runtime conditions. MPI_Wait must be called prior to re-using the send buffer. Assumes that the matching receive has already been posted; error otherwise.
Use scenarios: Used instead of MPI_Isend in codes with fine-grained communication to improve performance by eliminating “handshakes”.

Function: MPI_Irecv
Effect: Non-blocking receive operation. MPI_Wait must be called prior to using the receive buffer.
Use scenarios: Can be paired with any send operation. Used to overlap computation and communication between MPI_Irecv and MPI_Wait.

Function: MPI_Wait, MPI_Waitall, MPI_Waitany, MPI_Waitsome
Effect: Blocks execution until one or more matching non-blocking send or receive operations return. After that, it is safe to re-use the send buffer or use the receive buffer.
Use scenarios: Every non-blocking operation must have a matching MPI_Wait.

Table 3.6: Basic functions for MPI communication. Details may be found at [40] or by clicking function names.


Example: Blocking Asynchronous Send with User Space Buffer


Functions MPI_Send and MPI_Recv demonstrated in Listing 3.57 are the standard blocking send
and receive operations in MPI. The runtime MPI library decides how to use the system buffer and whether
to perform asynchronous transfers. For applications with large and frequent data transfers, the user may
wish to take control over buffering in order to ensure that (a) buffering is used consistently and therefore all
transactions are asynchronous, and (b) the system buffer does not overflow. In order to give the user control over
buffering of blocking transactions, MPI provides the function MPI_Bsend. Listing 3.58 demonstrates how this
function is used with a user space buffer.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
  const int M = 100000, N = 200000;
  float data1[M]; data1[:]=1.0f;
  double data2[N]; data2[:]=2.0;
  int myRank, worldSize, size1, size2;

  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &worldSize);
  MPI_Comm_rank (MPI_COMM_WORLD, &myRank);

  if (worldSize > 1) {
    if (myRank == 0) {
      // Sender side: allocate user-space buffer for asynchronous communication
      MPI_Pack_size(M, MPI_FLOAT, MPI_COMM_WORLD, &size1);
      MPI_Pack_size(N, MPI_DOUBLE, MPI_COMM_WORLD, &size2);
      int bufsize = size1 + size2 + 2*MPI_BSEND_OVERHEAD;
      printf("size1 = %d, size2 = %d, MPI_BSEND_OVERHEAD = %d, allocating %d bytes\n",
             size1, size2, MPI_BSEND_OVERHEAD, bufsize);
      void* buffer = malloc(bufsize);
      MPI_Buffer_attach(buffer, bufsize);
      MPI_Bsend(data1, M, MPI_FLOAT, 1, 1, MPI_COMM_WORLD);
      MPI_Bsend(data2, N, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
      MPI_Buffer_detach(&buffer, &bufsize);
      free(buffer);
    } else if (myRank == 1) {
      // Receiver side does not have to do anything special
      MPI_Status stat;
      MPI_Recv(data1, M, MPI_FLOAT, 0, 1, MPI_COMM_WORLD, &stat);
      MPI_Recv(data2, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &stat);
    }
  }
  MPI_Finalize ();
}

user@host% mpiicpc -o mpi-buffered mpi-buffered.cc
user@host% mpirun ./mpi-buffered
size1 = 400000, size2 = 1600000, MPI_BSEND_OVERHEAD = 95, allocating 2000190 bytes
user@host%

Listing 3.58: Source code mpi-buffered.cc illustrates blocking asynchronous transactions with a user space buffer.

Note that the required size of the buffer is calculated using the MPI function MPI_Pack_size, and a
constant MPI_BSEND_OVERHEAD is added to the buffer size.


Example: Non-Blocking Standard Send

In order to overlap communication and computation, MPI provides non-blocking send and receive
functions MPI_Isend and MPI_Irecv. Non-blocking functions return immediately, and the code can
perform other operations while communication proceeds. However, for non-blocking send operations, it is
unsafe to re-use the send buffer until blocking function MPI_Wait is called. Likewise, for non-blocking
receive operations, it is unsafe to assume that the message was delivered to the receive buffer until MPI_Wait
is called.
Note that calling MPI_Wait immediately after MPI_Isend or MPI_Irecv is equivalent to using
MPI_Send or MPI_Recv. The purpose of non-blocking functions is to enable the code to perform some
additional operations while the message is in transit.
Listing 3.59 demonstrates the use of non-blocking send operation.

#include <mpi.h>
#include <stdio.h>

int main (int argc, char *argv[]) {
  const int N = 100000, tag=1;
  float data1[N], data2[N]; data1[:]=0.0f;
  int myRank, worldSize;

  MPI_Request request;
  MPI_Status stat;

  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &worldSize);
  MPI_Comm_rank (MPI_COMM_WORLD, &myRank);

  if (worldSize > 1) {
    if (myRank == 0) {
      // Sender side: starting non-blocking send of data1
      MPI_Isend(data1, N, MPI_FLOAT, 1, tag, MPI_COMM_WORLD, &request);
      // Sender can perform some other work while data1 is in transit
      for (int i = 0; i < N; i++)
        data2[i] = 1.0f;
      // MPI_Wait will block until it is safe to modify data1
      MPI_Wait(&request, &stat);
    } else if (myRank == 1) {
      // Receiver side: blocking receive of data1
      MPI_Recv(data1, N, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &stat);
      // At the end of blocking MPI_Recv, it is safe to use data1
    }
  }

  MPI_Finalize ();
}

Listing 3.59: Source code mpi-nonblocking.cc illustrates a non-blocking standard send.

Note that when communicating processes use a network interconnect (e.g., Ethernet or Infiniband) as
physical network fabric, communication does not stall calculations. In this case, non-blocking communication
may improve performance by masking communication time. However, if the sender and receiver are executing
on the same host, their communication is essentially a memory copy. In this case, non-blocking communication
may be detrimental to performance, because communication and computation will compete for resources,
resulting in undesirable contention.


3.3.5 Collective Communication and Reduction


Intel MPI collective communication routines involve all processes in the scope of a communicator. There
are three major types of collective communication routines:
Synchronization — all processes wait until each of them has reached a synchronization point;
Data Movement — broadcast, scatter and gather, all-to-all communication;
Collective Computation (reduction) — one member of the group collects data from the other members and
performs an associative operation (min, max, add, multiply, etc.) on that data.
Patterns of commonly used broadcast, scatter and gather data movement operations and of reduction operations
are illustrated in Figure 3.8.

Figure 3.8: Diagram of MPI collective communication patterns: broadcast (one sender delivers the same data to every receiver), scatter (one sender distributes distinct pieces of data to each receiver), gather (distinct pieces of data from every sender are collected by one receiver), and reduction (values from all senders, e.g., 1, 3, 5 and 7, are combined into a single result, 16, delivered to one receiver).


MPI collective communication functions are summarized in Table 3.7.

Function                Effect

MPI_Barrier             Performs group barrier synchronization. Upon reaching the MPI_Barrier call, each task is blocked until all tasks in the group reach the same MPI_Barrier call.

MPI_Bcast               Broadcasts (i.e., sends) a message from one process to all other processes in the group.

MPI_Scatter             Distributes distinct messages from a single source task to each task in the group.

MPI_Gather              Gathers distinct messages from each task in the group into a single destination task.

MPI_Allgather           For each task, performs a one-to-all broadcasting operation within the group.

MPI_Reduce              Applies a reduction operation on all tasks in the group and places the result in one task. Predefined MPI reduction operations are summarized in Table 3.8.

MPI_Allreduce           Applies a reduction operation and places the result in all tasks in the group. This is equivalent to MPI_Reduce followed by MPI_Bcast.

MPI_Reduce_scatter      Performs an element-wise reduction on a vector across all tasks in the group. The resulting vector is split into disjoint segments and distributed across the tasks. This is equivalent to MPI_Reduce followed by an MPI_Scatter operation.

MPI_Op_create           Creates a user-defined reduction operation for MPI_Reduce and MPI_Allreduce.

MPI_Alltoall            Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group ordered by index.

MPI_Scan                Performs a scan operation with respect to a reduction operation across a task group.

Table 3.7: Collective communication functions in MPI. Details may be found at [40] or by clicking function names.

Intel MPI Reduction Operations              C Data Types                     Fortran Data Types

MPI_MAX      maximum                        integer, float                   integer, real, complex
MPI_MIN      minimum                        integer, float                   integer, real, complex
MPI_SUM      sum                            integer, float                   integer, real, complex
MPI_PROD     product                        integer, float                   integer, real, complex
MPI_LAND     logical AND                    integer                          logical
MPI_BAND     bit-wise AND                   integer, MPI_BYTE                integer, MPI_BYTE
MPI_LOR      logical OR                     integer                          logical
MPI_BOR      bit-wise OR                    integer, MPI_BYTE                integer, MPI_BYTE
MPI_LXOR     logical XOR                    integer                          logical
MPI_BXOR     bit-wise XOR                   integer, MPI_BYTE                integer, MPI_BYTE
MPI_MAXLOC   max value and location         float, double and long double    real, complex, double precision
MPI_MINLOC   min value and location         float, double and long double    real, complex, double precision

Table 3.8: MPI reduction operations.

Parallel Programming and Optimization with Intel Xeon Phi Coprocessors


3.3. TASK PARALLELISM IN DISTRIBUTED MEMORY, MPI 137

Example: Using the Scatter Operation in MPI


The usage of one of the collective communication operations in MPI is shown in Listing 3.60.

1 #include "mpi.h"
2 #include <stdio.h>
3 #define SIZE 4
4
5 int main(int argc, char *argv[]) {
6 int numtasks, rank, sendcount, recvcount, source;
7 float sendbuf[SIZE][SIZE] = {
8 {1.0, 2.0, 3.0, 4.0},
9 {5.0, 6.0, 7.0, 8.0},
10 {9.0, 10.0, 11.0, 12.0},
11 {13.0, 14.0, 15.0, 16.0}};
12 float recvbuf[SIZE];
13
14 MPI_Init(&argc,&argv);

g
15 MPI_Comm_rank(MPI_COMM_WORLD, &rank);

an
16 MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

W
17
18 if (numtasks == SIZE) {

ng
19 source = 1;

e
20 sendcount = SIZE;
21 recvcount = SIZE;
nh
MPI_Scatter(sendbuf,sendcount,MPI_FLOAT,recvbuf,recvcount,
Yu
22
23 MPI_FLOAT,source,MPI_COMM_WORLD);
r

24 printf("rank= %d Results: %f %f %f %f\n",rank,recvbuf[0],


fo

25 recvbuf[1],recvbuf[2],recvbuf[3]);
}
d

26
re

27 else
pa

28 printf("Must specify %d processors. Terminating.\n",SIZE);


29
re

30 MPI_Finalize();
yP

31 }
el
iv

Listing 3.60: Source code mpi-scatter.cc demonstrates one of Intel MPI collective communication operations: a
us

scatter operation on the rows of a two-dimensional array.


cl
Ex

This code demonstrates how the function MPI_Scatter is used to distribute (i.e., scatter) the rows of a matrix from one source process to all processes in the communicator. Execution of this application with 4 processes produces the following output:

user@host% mpirun -n 4 ./MPIscatter


rank= 1 Results: 5.000000 6.000000 7.000000 8.000000
rank= 2 Results: 9.000000 10.000000 11.000000 12.000000
rank= 3 Results: 13.000000 14.000000 15.000000 16.000000
rank= 0 Results: 1.000000 2.000000 3.000000 4.000000

Listing 3.61: Intel MPI scatter collective communication example output.
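To complement the scatter example, the following sketch (not one of the numbered listings of this book) illustrates the collective computation pattern from Table 3.7: each rank contributes a partial value, and MPI_Reduce combines the contributions with the MPI_SUM operation on the root rank.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, numtasks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

  // Each rank contributes its own partial value
  const float partial = (float)(rank + 1);
  float total = 0.0f;
  // Combine the partial values with MPI_SUM; the result is placed on rank 0
  MPI_Reduce(&partial, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("Sum over %d ranks = %f\n", numtasks, total);

  MPI_Finalize();
}

Replacing MPI_Reduce with MPI_Allreduce (and dropping the root argument) would make the result available on every rank, as described in Table 3.7.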


3.3.6 Further Reading


The following list suggests additional sources of information on expressing parallelism in MPI and using
MPI implementations.

1) The official source of all information on MPI is the MPI Forum Web page http://www.mpi-forum.org/ [41]

2) The MPI version 2.2 standard specification can be found in [42]

3) As of September 21, 2012, a new MPI standard version 3.0 is available [43]. However, Intel MPI, as of version 4.1 used in this training, does not support it.

4) A detailed list of resources on MPI (manuals, tutorials, white papers, etc.) compiled by Argonne National Laboratory (ANL) is available at http://www.mcs.anl.gov/mpi/ [44]

5) The ANL Web site also features a reference of all MPI routines, which includes syntax and specification. This reference can be found at http://www.mcs.anl.gov/research/projects/mpi/www/ [40]. All hyperlinks attached to MPI function names in this book point to that document.

6) A reference guide for Intel MPI version 4.1 can be found at [6]. This manual contains information about the specifics of Intel’s implementation of MPI.

7) A popular book on MPI and its interoperation with OpenMP is “Parallel Programming in C with MPI and OpenMP” by Michael J. Quinn [39]

Chapter 4 of this document contains more MPI examples and discusses performance analysis and optimization methods in distributed memory computing.


Chapter 4

Optimizing Applications for Intel® Xeon® Product Family
This chapter delves deeper into the issues of extracting performance out of parallel applications on Intel Xeon processors and Intel Xeon Phi coprocessors and provides practical examples of high-performance codes. It reiterates and builds on the skills and methods introduced in Chapter 3.

4.1 Roadmap to Optimal Code on Intel® Xeon Phi™ Coprocessors

The Intel Xeon Phi coprocessor is a massively parallel vector processor, and optimization strategies for it are qualitatively the same as those for Intel Xeon processors: SIMD support (vectorization), thread parallelism, and high arithmetic intensity or a streaming memory access pattern. However, these requirements are quantitatively more strict for Intel Xeon Phi calculations: the code must be able to utilize wider vectors, support a greater number of threads, and the penalty for non-local memory access is greater on the coprocessor.

4.1.1 Optimization Checklist

In general, in order to expect better performance from an Intel Xeon Phi coprocessor than from the host system, the developer should be able to answer “yes” to the following questions:

Has scalar optimization been performed? Some applications can be improved by consistently employing
single precision floating-point arithmetics instead of double precision or a mix of the two, using
array notation instead of pointer arithmetics, removing unnecessary type conversions and eliminating
redundant operations. Section 4.2 discusses these optimizations.

Does the code vectorize? The compiler report should indicate that the automatic vectorization of performance-critical loops has succeeded. In addition, the programmer must enforce a unit-stride memory access pattern and proper data alignment, and eliminate type conversions in vector calculations. See Section 4.3 for details.

Does the application scale above 100 threads? Some applications designed for earlier-generation processors may be serial or only support 2 to 4 threads. While this is sufficient to extract significant additional performance from an Intel Xeon processor, these applications will not show satisfactory performance on Intel Xeon Phi coprocessors. Massive parallelism is required to fully utilize the coprocessor, because it derives its performance from the concurrent work of over 50 cores with four-way hyper-threading and low clock speeds. Even if an application can utilize hundreds of threads, thread contention due to various effects can quench performance. Methods for improving the parallel scalability of task-parallel calculations are described in Section 4.4.
Is it arithmetically intensive or bandwidth-limited? Applications that are not optimized for data locality in space and time, and programs with complex memory access patterns, may exhibit better performance on the host system than on the Intel Xeon Phi coprocessor. If complex memory access is an inherent property of the algorithm, it may be possible to re-structure data to pack memory accesses more closely. Some algorithms can be modified to better utilize the cache hierarchy (for compute-bound calculations) or to improve memory streaming capabilities (for bandwidth-bound calculations) with techniques such as loop tiling and cache-oblivious algorithms. These optimizations are described in Section 4.5.
Is cooperation between the host and the coprocessor(s) efficient? When an application utilizes more than one coprocessor or more than one compute node, load balancing across compute units and the efficiency of data communication between the host system(s) and coprocessor(s) become an issue. Load balancing across compute units can be tuned using specialized Intel software development tools. In addition, it may be possible to reduce the communication overhead by utilizing improved algorithms and/or optimizing data marshaling policies. These optimizations are described in Section 4.6 and Section 4.7.

4.1.2 Expectations
It is often the case that an application providing satisfactory performance on Intel Xeon processors initially performs poorly on Intel Xeon Phi coprocessors. This does not necessarily mean that the problem is not “MIC-friendly”. Intel Xeon processors have a resource-rich architecture with hardware prefetchers, branch predictors, deep pipelines and high clock speeds of the cores, which can compensate for sub-optimal aspects of a variety of workloads. At the same time, on Intel Xeon Phi coprocessors, the same sub-optimal behaviors of unoptimized codes are widely exposed by the resource-efficient MIC architecture. The good news is that, generally, optimization for Intel Xeon processors leads to performance benefits for Intel Xeon Phi coprocessors, and vice versa.

Optimization methods described in this section yield performance benefits for both the many-core and the multi-core architecture. In the best-case scenario, a single Intel Xeon Phi coprocessor is capable of outperforming a two-way Intel Xeon processor system by a factor of 2x (for linear arithmetics-bound codes) up to 5x (for bandwidth-bound and transcendental arithmetics-bound calculations). This estimate is based on the theoretical limit of the arithmetic capabilities and memory bandwidth of a single Intel Xeon Phi coprocessor compared to those of a two-way Intel Xeon processor-based host system with the Sandy Bridge microarchitecture. However, applications that do not achieve this speedup can still benefit from an Intel Xeon Phi coprocessor in the system, because the available programming models allow the host and the coprocessor to be teamed up using asynchronous offload or heterogeneous MPI.

Note that the best-case speedup of 2x-5x is lower than the frequently quoted speedups of 20x-100x introduced by using a General Purpose Graphics Processing Unit (GPGPU). This does not mean that the performance of Intel Xeon Phi coprocessors is an order of magnitude lower than the performance of GPGPUs. The comparison between GPGPU and CPU performance is complicated by the fact that these systems feature completely different hardware architectures and use different codes. Therefore, it is easy to be misled regarding the performance of a system by comparing poorly optimized CPU codes with highly optimized GPGPU codes. On the other hand, an Intel Xeon Phi coprocessor runs applications compiled from the same code as the CPU, and features a similar architecture. Therefore, an application highly optimized for Intel Xeon Phi coprocessors is likely to show good performance on Intel Xeon processors as well.

In general, before optimizing an application for the MIC architecture, it is important to optimize it for the multi-core architecture first. Most of the methodology described in this chapter applies to both architectures.


4.2 Scalar Optimizations


Before proceeding to the details of data- and task-parallel code optimization, it is useful to consider scalar optimizations that improve the performance of each parallel task. These code modifications naturally translate into improved performance of vectorized parallel applications. Unlike optimizations of vectorization, parallelism and cache traffic, which require modifying data structures and the order of operations, scalar optimizations generally amount to reducing the total number and computational cost of operations.

4.2.1 Assisting the Compiler


Intel Compilers can perform some of the optimizations described in this section automatically. However,
it is important to know how to facilitate the compiler’s job.

Optimization Level

Performance-critical code should be compiled with the optimization level -O2 or -O3. A simple way to set a specific optimization level is to use the compiler argument -O2 or -O3 [45]. However, a more fine-grained approach is possible by including #pragma intel optimization_level in the code [46]. Listing 4.1 illustrates these methods.

user@host% icc -o mycode -O3 source.c

#pragma intel optimization_level 3
void my_function() {
  //...
}

Listing 4.1: Top: specifying the optimization level -O3 as a compiler argument. The specified optimization level is applied to the whole source file. Bottom: specifying the optimization level -O3 as a pragma. The optimization level specified in this way applies only to the statement following the pragma.

The default optimization level is -O2, which optimizes the application for speed. At this level, the enabled optimizations include automatic vectorization, inlining, constant propagation, dead-code elimination, loop unrolling, and others. This is the generally recommended optimization level for most purposes.
The optimization level -O3 enables more aggressive optimization than -O2. It includes all of the features of -O2 and, in addition, performs loop fusion, block-unroll-and-jam, if-statement collapse, and others. While -O3 may improve performance with respect to -O2, it may sometimes result in worse performance. It is straightforward to empirically determine the fastest optimization level.

Using the const Qualifier

Whenever a local variable or a function argument is not supposed to change value in the code, it is
beneficial to declare it with the const qualifier. This enables more aggressive compiler optimizations,
including pre-computing commonly used combinations of constants. For example, the code in Listing 4.2
executes 4.5x faster when w and T are declared with the const qualifier.


Sub-optimal:

#include <stdio.h>

int main() {
  const int N=1<<28;
  double w = 0.5;
  double T = (double)N;
  double s = 0.0;
  for (int i = 0; i < N; i++)
    s += w*(double)i/T;
  printf("%e\n", s);
}

user@host% icpc noconst.cc
user@host% time ./a.out
6.710886e+07

real    0m0.461s
user    0m0.460s
sys     0m0.001s

Optimized:

#include <stdio.h>

int main() {
  const int N=1<<28;
  const double w = 0.5;
  const double T = (double)N;
  double s = 0.0;
  for (int i = 0; i < N; i++)
    s += w*(double)i/T;
  printf("%e\n", s);
}

user@host% icpc const.cc
user@host% time ./a.out
6.710886e+07

real    0m0.097s
user    0m0.094s
sys     0m0.003s

Listing 4.2: The sub-optimal code (top) takes 4.5x longer to compute than the optimized code (bottom). The const qualifier on variables w and T permits the compiler to pre-compute the ratio w/T instead of computing it in every iteration.

The difference in the performance of these two codes is explained by the fact that the compiler is certain that it is safe to pre-compute the common expression w/T. The value of this expression is then used in the body of the loop in the multiplication operation.



Array Reference by Index instead of Pointer Arithmetics


Intel Compilers are able to better optimize the code when an array index is used instead of pointer arithmetics to access pointer-based data (i.e., a[i] instead of *(a+i)). Listing 4.3 illustrates this recommendation.

Sub-optimal:

#include <stdio.h>

int main() {
  const long N = 1000;
  float a[N*N], b[N*N], c[N*N];
  a[:] = b[:] = 1.0f;
  c[:] = 0.0f;

  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      float* cp = c + i*N + j;
      for (int k = 0; k < N; k++)
        *cp += a[i*N + k] * b[k*N + j];
    }

  printf("%f\n", c[0]);
}

user@host% icc array_pointer.cc
user@host% time ./a.out
1000.000000

real    0m1.110s
user    0m1.104s
sys     0m0.005s

Optimized:

#include <stdio.h>

int main() {
  const long N = 1000;
  float a[N*N], b[N*N], c[N*N];
  a[:] = b[:] = 1.0f;
  c[:] = 0.0f;

  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        c[i*N + j] +=
          a[i*N + k] * b[k*N + j];

  printf("%f\n", c[0]);
}

user@host% icpc array_index.cc
user@host% time ./a.out
1000.000000

real    0m0.228s
user    0m0.225s
sys     0m0.002s

Listing 4.3: The intentionally crippled code (top) takes 5x longer to compute than the optimized code (bottom). The only difference between the codes is the reference to the array element c[i*N + j]. This example illustrates that the index notation works faster, as it enables the compiler to implement automatic vectorization (with reduction, in this case).


4.2.2 Eliminating Redundant Operations


In some cases, the code may perform unnecessary calculations even though the programmer did not
intend that. Removing those redundant operations, as outlined in this section, can benefit the application
performance.

Common Subexpression Elimination

Sometimes the compiler can automatically detect when an expression is re-used multiple times or within a loop, and pre-compute it, as was shown in Listing 4.2. This procedure is known as common subexpression elimination. In order to insure against situations when the compiler is unable to implement this optimization, it can be expressed in the code as shown in Listing 4.4.

Sub-optimal:

for (int i = 0; i < n; i++)
{
  for (int j = 0; j < m; j++) {
    const double r =
      sin(A[i])*cos(B[j]);
    // ...
  }
}

Optimized:

for (int i = 0; i < n; i++) {
  const double sin_A = sin(A[i]);
  for (int j = 0; j < m; j++) {
    const double cos_B = cos(B[j]);
    const double r = sin_A*cos_B;
    // ...
  }
}

Listing 4.4: Top: unoptimized code computes the value of sin(A[i]) for every iteration in j. Bottom: optimized code eliminates redundant calculations of sin(A[i]) by pre-computing it. Note that the assignment of the variable cos_B will be eliminated by the compiler at -O2 and higher in a procedure known as constant propagation.
yP

Ternary Operator Trap


The ternary operator ?: is a short-hand expression for the if...else statement. Sometimes the syntax of this operator can cause redundant calculations, as shown in the example below.
#define min(a, b) ( (a) < (b) ? (a) : (b) )

const float c = min(my_function(x), my_function(y));

Listing 4.5: In this sub-optimal code, three calls to my_function() will be made: two calls to evaluate the condition
(a<b) and one more call to substitute the result.

In order to avoid this trap, the programmer may pre-compute the expressions used in the ternary operator.

#define min(a, b) ( (a) < (b) ? (a) : (b) )

const float result_a = my_function(x);
const float result_b = my_function(y);
const float c = min(result_a, result_b);

Listing 4.6: In this optimized code, only two calls to my_function() will be made.
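An alternative approach (a sketch, not from the original text) is to replace the macro with an inline function. Function arguments are evaluated exactly once, so the double-evaluation trap disappears altogether:

static inline float min_float(const float a, const float b) {
  // Each argument is evaluated once, so my_function() is called only twice
  return (a < b) ? a : b;
}

const float c = min_float(my_function(x), my_function(y));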


Overhead of Abstraction
Complex C++ classes may introduce a significant overhead for operations that are expected to be fast
for simple objects like C arrays. For example, data containers may perform some operations to update their
internal state with every read or write operation. It may be possible to reduce the computational expense of
manipulations with complex classes by outsourcing a part of the calculation to simpler objects.
For example, the calculation shown in Listing 4.7 was found in a scientific code employing the CERN ROOT library [47]. The code performs binning (i.e., hashing) of events marked by energy values into a histogram, in which the values of the bins represent the number of events with energies between the bin boundaries. The histogram is represented by the ROOT library’s class TH1F, which offers additional histogram functionality required in the project. However, the binning process is a bottleneck of the project, because the method Fill, called over 10^9 times, involves unnecessary overhead.

// class TH1F contains a histogram with the number of bins equal to nBins,
// which span the values from energyMin to energyMax. nBins is of order 100
TH1F* energyHistogram = new TH1F("energyHistogram", "", nBins, energyMin, energyMax);

// Method TH1F::Fill adds an event with the value eventEnergy[i] to the histogram.
for (unsigned long i = 0; i < nEvents; i++) // nEvents is of order 1e9.
  energyHistogram->Fill( eventEnergy[i] );

Listing 4.7: This code employs the ROOT library to construct a histogram implemented in class TH1F. The generation of the histogram is a bottleneck of the project.

The code was optimized by pre-computing the values of the bins in the histogram using a more lightweight object, an array of long integers. The array was then loaded into the histogram using an overloaded Fill method, which takes the number of events in the bin, as opposed to adding a single event at a time. As a result, the execution time of the binning process was significantly reduced.

// array tempHistogram is used to prepare the histogram for loading into class TH1F
long tempHistogram[nBins]; tempHistogram[:] = 0;
const float binWidth = (energyMax - energyMin) / (float)nBins;
const float invBinWidth = 1.0f / binWidth;
for (long i = 0; i < nEvents; i++) {
  const int bin = (int)floorf((eventEnergy[i] - energyMin) * invBinWidth);
  if ((0 <= bin) && (bin < nBins))
    tempHistogram[bin]++;
}
// Now the histogram class is filled, but this time
// only nBins=100 calls to TH1F::Fill are made, instead of nEvents=1e9 calls
TH1F* energyHistogram = new TH1F("energyHistogram", "", nBins, energyMin, energyMax);
for (int i = 0; i < nBins; i++)
  energyHistogram->Fill( energyMin + ((float)i+0.5f)*binWidth, (float)tempHistogram[i]);

Listing 4.8: This code produces approximately the same results as the code in Listing 4.7, but works much faster, because
the expensive method Fill is called fewer times.

The calculation of the histogram can be further accelerated through vectorization and thread parallelism.
An example of these optimizations is discussed in Section 4.4.1.
While the example above is specific to the ROOT library, the principle illustrated here applies to other
situations as well. When the overhead of operations in a library function or class is beyond the control of the
developer, it may be possible to eliminate that overhead by pre-computing some of the data in lightweight
objects like C arrays.


Algebraic Elimination of Expensive Instructions (Strength Reduction)


One of the most commonly used arithmetic operations, floating-point division, is significantly more
expensive than floating-point multiplication (see, e.g., [48] for benchmark results). Replacing division with
multiplication generally improves performance.
This optimization is trivial in expressions involving constants. For example, x*0.5f computes signifi-
cantly faster than x/2.0f. In loops, performance can be improved by precomputing the reciprocal of the
denominator and multiplying by it, as illustrated in Listing 4.9.

Sub-optimal:

for (int i = 0; i < n; i++) {
  A[i] /= n;
}

Optimized:

const float rn = 1.0f/(float)n;
for (int i = 0; i < n; i++)
  A[i] *= rn;

Listing 4.9: Top: unoptimized code uses the division operation with implicit type conversion in a loop. Bottom: optimized code pre-computes the reciprocal of the denominator and replaces division with multiplication.

In some cases, the compiler can automatically precompute the reciprocal; however, doing it explicitly as in Listing 4.9 may improve cross-platform and cross-compiler portability of the code. In other cases, simplifying expressions so as to reduce the number of divisions can help. Listing 4.10 demonstrates examples of such cases:

Sub-optimal:

for (int i = 0; i < n; i++) {
  A[i] = (B[i]/C[i])/D[i];
  E[i] = A[i]/B[i] + C[i]/D[i];
}

Optimized:

for (int i = 0; i < n; i++) {
  A[i] = B[i]/(C[i]*D[i]);
  E[i] = (A[i]*D[i] + B[i]*C[i])/
         (B[i]*D[i]);
}

Listing 4.10: Top: unoptimized code uses two division operations in each line. Bottom: optimized code produces approximately the same results but eliminates one division operation in each line.

The same applies to arithmetic expressions with expensive transcendental operations. Simplifying expressions in order to perform fewer expensive operations at the cost of performing a greater number of less expensive operations may result in a performance increase. This technique is known as strength reduction in the context of compiler optimization.
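As a small illustration of strength reduction applied to transcendental functions (a sketch, not taken from the benchmarks in this chapter), two exponential calls per iteration can be traded for one by using the identity exp(a)*exp(b) = exp(a+b):

// Sub-optimal: two expensive exponential function calls per iteration
for (int i = 0; i < n; i++)
  F[i] = expf(A[i]) * expf(B[i]);

// Optimized: one exponential per iteration, plus a cheap addition
for (int i = 0; i < n; i++)
  F[i] = expf(A[i] + B[i]);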


4.2.3 Controlling Precision and Accuracy


Consistency of Precision
The following list outlines the guidelines for precision optimization:

1. Single precision floating-point numbers (float) should be used instead of double precision (double)
wherever possible. Similarly, signed 32-bit integers (int) should be preferred to unsigned and 64-bit
integers (long), including array indices.
2. Typecast operations should be avoided. Note the following conventions:
a) Integer constants 1, 0 and -1 are of type int. Constant -1L is long, and 1UL is unsigned long.
b) Floating-point constants 1.0 and 0.0 are of type double. Constant 1.0f is float;
c) In expressions with mixed types, implicit typecast of lower-precision numbers to higher-precision
numbers is assumed. For example, x+i*a where x is of type double, i is int, and a is float, is
equivalent to x + (double)i*(double)a.

3. The Intel Math library defines in math.h fast single precision versions of most arithmetic functions. The names of single precision functions are derived from double precision function names by appending the suffix -f. For example,
a) function sin(x) takes an argument of type double and returns a value of the same type;
b) function sinf(x) takes and returns float and executes faster;
c) for float x, expression sin(x) is equivalent to sin((double)x), which is inefficient;
d) for double x, expression sinf(x) is equivalent to sinf((float)x), which incurs a precision loss.
4. Instead of the exponential function exp()/expf() and the natural logarithm log()/logf(), use the faster base 2 versions offered by the Intel Math Library: exp2()/exp2f() and log2()/log2f().
5. Note that single precision functions ending with -f (for example, sinf(), expf(), fabsf(), etc.) are guaranteed to work faster than their double precision counterparts (sin(), exp(), fabs(), etc.) with the Intel compilers. However, we have seen cases where, with other compilers, single precision functions are slower than double precision functions. The same applies to the base-2 exponential and logarithm.

Listing 4.11 illustrates the above recommendations.

const int m = 1000000, n = 1000000;
const long N1 = m*n;                 // Overflow error
const long N2 = 1000000*1000000;     // Overflow error
const long N3 = 1000000L*1000000L;   // Correct
const double twoPi = 6.283185307179586;
const float phase = 0.3f;
const float s1 = sin(twoPi*phase);   // Equivalent to (float)sin(twoPi*(double)phase)
const float s2 = sinf(twoPi*phase);  // Equivalent to sinf((float)(twoPi*(double)phase));
const float twoPif = 6.2831853f;
const float s3 = sinf(twoPif*phase); // Efficient solution
const double d1 = exp(a);            // Inefficient if computed for multiple values of a; instead:
const double l2e = log2(exp(1.0));   // Base 2 logarithm of e=2.71828...
const double d2 = exp2(a*l2e);       // Efficient if used multiple times with precomputed l2e
Listing 4.11: Controlling the implied precision of constants and functions.


Floating-Point Semantics
The Intel C++ Compiler may represent floating-point expressions in executable code differently, depending on the floating-point semantics, i.e., the rules for finite-precision algebra allowed in the code. These rules are controlled by an extensive set of command-line compiler arguments. The argument -fp-model controls floating-point semantics at a high level.
Table 4.1 explains the usage of the argument -fp-model. For more information, see the Compiler
Reference [49] and the white paper “Consistency of Floating-Point Results using the Intel Compiler or Why
doesn’t my application always give the same answer?” by Dr. Martyn J. Corden and David Kreitzer [50].

Argument                 Effect
-fp-model strict         Only value-safe optimizations, exception control is enabled (but may be disabled using -fp-model no-except), floating-point contractions (e.g., the fused multiply-add instruction) are disabled. This is the strictest floating-point model.
-fp-model precise        Only value-safe optimizations, exception control is disabled (but may be enabled using -fp-model except). Serial floating-point calculations are reproducible from run to run. Some parallel OpenMP calculations can be made reproducible by using the environment variable KMP_DETERMINISTIC_REDUCTION. The combination -fp-model precise -fp-model source produces floating-point results compliant with the IEEE-754 standard.
-fp-model fast=1         Value-unsafe optimizations are allowed, exceptions are not enforced, contractions are enabled. This is the default floating-point semantics model. The short-hand for this model is -fp-model fast.
-fp-model fast=2         Enables more aggressive optimizations than fast=1, possibly leading to better performance at the cost of lower accuracy.
-fp-model source         Intermediate arithmetic results are rounded to the precision defined in the source code. Using source also assumes precise, unless overridden by strict or fast.
-fp-model double         Intermediate arithmetic results are rounded to 53-bit (double) precision. Using double also assumes precise, unless overridden by strict or fast.
-fp-model extended       Intermediate arithmetic results are rounded to 64-bit (extended) precision. Using extended also assumes precise, unless overridden by strict or fast.
-fp-model [no-]except    except enables, no-except disables the floating-point exception semantics.

Table 4.1: Command-line arguments for high-level floating-point semantics control with the Intel C++ Compiler.

In this context, “value-unsafe” optimizations refer to code transformations that produce only approximately the same result. For example, floating-point multiplication is generally non-associative in finite-precision arithmetics, i.e., a*(b*c) ≠ (a*b)*c. If value-unsafe optimizations are enabled, the compiler may replace an expression like bar=a*a*a*a with foo=a*a; bar=foo*foo. However, if only value-safe optimizations are allowed, then the expression will be computed from left to right, i.e., bar=((a*a)*a)*a. The two expressions produce approximately the same result, but the former employs one less multiplication operation.
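The following minimal sketch makes this re-association explicit (the variable names are arbitrary):

const double a = 1.0000001;

// Value-safe semantics: the expression is evaluated strictly left to right
const double bar_safe = ((a*a)*a)*a;   // three multiplications

// A value-unsafe transformation the compiler may apply under -fp-model fast:
const double foo = a*a;
const double bar_fast = foo*foo;       // two multiplications, slightly different rounding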


Listing 4.12 and Listing 4.13 illustrate floating-point model control.

#include <stdio.h>
#include <math.h>
int main() {
  for (int i = 0; i < 100; i++) {
    const int N=i*5000;
    double A = 0.1;
    for (int r = 0; r < N; r++)
      A = sqrt(1.0-4.0*(A-0.5)*(A-0.5));
    if (i<10) printf("After %5d iterations, A=%e\n", N, A);
  }
}

Listing 4.12: Code fpmodel.cc used in Listing 4.13 to illustrate the effect of relaxed floating-point model. The loop
with the sqrt() function performs an iterative update of the value A.

Default floating-point model (-fp-model fast=1):

user@host% icpc -o fpmodel-1 -mmic \
  fpmodel.cc
user@host% scp fpmodel-1 mic0:~/
fpmodel-1            100%   11KB  11.6KB/s   00:00
user@host% ssh mic0 time ./fpmodel-1
After     0 iterations, A=0.100000
After  5000 iterations, A=0.178449
After 10000 iterations, A=0.633073
After 15000 iterations, A=0.988303
After 20000 iterations, A=0.534324
After 25000 iterations, A=0.802155
After 30000 iterations, A=0.513582
After 35000 iterations, A=0.887305
After 40000 iterations, A=0.552932
After 45000 iterations, A=0.939329
real    0m 2.07s
user    0m 2.07s
sys     0m 0.00s

Relaxed floating-point model (-fp-model fast=2):

user@host% icpc -o fpmodel-2 -mmic \
  fpmodel.cc -fp-model fast=2
user@host% scp fpmodel-2 mic0:~/
fpmodel-2            100%   11KB  11.2KB/s   00:00
user@host% ssh mic0 time ./fpmodel-2
After     0 iterations, A=0.100000
After  5000 iterations, A=0.178449
After 10000 iterations, A=0.633073
After 15000 iterations, A=0.988303
After 20000 iterations, A=0.534324
After 25000 iterations, A=0.113889
After 30000 iterations, A=0.244553
After 35000 iterations, A=0.432902
After 40000 iterations, A=0.997650
After 45000 iterations, A=0.794235
real    0m 1.16s
user    0m 1.16s
sys     0m 0.00s

Listing 4.13: Compiling and running the code illustrated in Listing 4.12 on an Intel Xeon Phi coprocessor. The first test case uses the default floating-point model, -fp-model fast=1. The second test case uses the relaxed model, -fp-model fast=2.

In the code shown in Listing 4.12, a single floating-point number is subjected to an iterative procedure.
It can be demonstrated analytically that this procedure has a stochastic character, i.e., small perturbations in
initial conditions lead to large deviations in the result after several iterations. Listing 4.13 demonstrates that up
to 20000 iterations, codes compiled with -fp-model fast=1 and 2 produce identical results. However,
by iteration 25000, the results are completely different. This occurs because at iteration 23431 (as tested on
our hardware), the two codes produce slightly different results due to different numerical accuracy, and this
subtle difference is amplified by the stochastic iteration in the subsequent steps. Note that at the same time,
the code compiled with -fp-model fast=2 performs almost twice as fast as the code compiled with the
default floating-point model.
This example demonstrates how relaxing the floating-point model may lead to a significant performance
increase on Intel Xeon Phi coprocessors. However, it is only safe to do so in well-behaved, numerically stable
applications.
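Note also that the floating-point model is a per-compilation setting, so the relaxed and the safe models can be mixed within one application by compiling different source files with different arguments. The following is only a sketch; the file names are hypothetical:

user@host% icpc -c -fp-model fast=2 kernels.cc
user@host% icpc -c -fp-model precise -fp-model source sensitive.cc
user@host% icpc -o myapp kernels.o sensitive.o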


Precision Control for Transcendental Functions

By default, the Intel C++ Compiler replaces calls to arithmetic operations, such as division, square root, sine, and cosine, with the respective calls to the Intel Math Library or Intel Short Vector Math Library functions, or with processor vector instructions. It is possible to instruct the compiler to use low-precision functions for some arithmetic operations in order to gain more performance. Naturally, this must be done with care and only in “well-behaved” applications that can tolerate the imprecise results.
Table 4.2 summarizes the Intel C++ Compiler command-line arguments for precision control. For more information on function precision control in the Intel C++ Compiler, see the Intel C++ Compiler Reference [51] and the white paper “Advanced Optimizations for Intel MIC Architecture, Low Precision Optimizations” by Wendy Doerner [52].

Argument                               Effect
-fimf-precision=value[:funclist]       Defines the precision for math library functions. Here, value is one of high, medium or low, which correspond to progressively less accurate but more efficient math functions, and funclist is a comma-separated list of functions that this rule is applied to. Value high is equivalent to max-error=0.6, medium is equivalent to max-error=4, and low to accuracy-bits=11 in single precision or accuracy-bits=26 in double precision (see below). By default, this option is not specified, and the compiler uses default heuristics when calling math library functions. This is an aggregate compiler option; see -fimf-max-error and -fimf-accuracy-bits for fine-grained control.
-fimf-max-error=ulps[:funclist]        The maximum allowable error expressed in ulps (units in last place) [53]. A max error of 1 ulp corresponds to the last mantissa bit being uncertain; 4 ulps is three uncertain bits, etc. This is a more fine-grained method of setting accuracy than -fimf-precision.
-fimf-accuracy-bits=n[:funclist]       The number of correct bits required for mathematical function accuracy. The conversion formula between accuracy bits and ulps is ulps = 2^(p-1-bits), where p is 24 for single precision, 53 for double precision and 64 for long double precision (the number of mantissa bits). This is a more fine-grained method of setting accuracy than -fimf-precision.
-fimf-domain-exclusion=n[:funclist]    Defines a list of special-value numbers that do not need to be handled by the functions. Here, n is an integer derived by the bitwise OR of the following values: extremes: 1, NaNs: 2, infinites: 4, denormals: 8, zeroes: 16. For example, n=15 indicates that extremes, NaNs, infinites and denormals should be excluded from the domain of numbers that the mathematical functions must correctly process.

Table 4.2: Command-line arguments for mathematical function precision control with the Intel C++ Compiler.

Listing 4.14 and Listing 4.15 illustrate math function precision control.


#include <stdio.h>
#include <math.h>

int main() {
  const int N = 1000000;
  const int P = 10;
  double A[N];
  const double startValue = 1.0;
  A[:] = startValue;
  for (int i = 0; i < P; i++)
#pragma simd
    for (int r = 0; r < N; r++)
      A[r] = exp(-A[r]);

  printf("Result=%.17e\n", A[0]);
}

Listing 4.14: Code precision.cc used in Listing 4.15 to illustrate the effect of relaxed transcendental function precision.
Low precision (-fimf-precision=low):

user@host% icpc -o precision-1 -mmic \
  -fimf-precision=low precision.cc
user@host% scp precision-1 mic0:~/
precision-1          100%   11KB  11.3KB/s   00:00
user@host% ssh mic0 time ./precision-1
Result=5.68428695201873779e-01
real    0m 0.08s
user    0m 0.06s
sys     0m 0.02s
user@host%

High precision (-fimf-precision=high):

user@host% icpc -o precision-2 -mmic \
  -fimf-precision=high precision.cc
user@host% scp precision-2 mic0:~/
precision-2          100%   19KB  19.4KB/s   00:00
user@host% ssh mic0 time ./precision-2
Result=5.68428725029060722e-01
real    0m 0.14s
user    0m 0.12s
sys     0m 0.02s
user@host%

Listing 4.15: Compiling and running the code illustrated in Listing 4.14 on an Intel Xeon Phi coprocessor. The first test case uses low precision, the second uses high precision.

The change of the precision of the exponential function from high to low results in almost a factor of 2 speedup.


4.2.4 Library Functions for Standard Tasks


Intel MKL implementations of certain standard tasks are highly optimized for Intel processors as well as
Intel Xeon Phi coprocessors [54]. Therefore, it usually makes sense to use an Intel MKL function instead of using a generic library function or designing and optimizing a custom implementation. More information on using
Intel MKL can be found in Section 4.8.
Examples in Listing 4.16 demonstrate the use of Intel MKL to generate a set of random numbers.

C Standard Library version:

#include <stdlib.h>
#include <stdio.h>

int main() {
  const size_t N = 1<<29L;
  const size_t F = sizeof(float);
  float* A = (float*)malloc(N*F);
  srand(0); // Initialize RNG
  for (int i = 0; i < N; i++) {
    A[i]=(float)rand()/(float)RAND_MAX;
  }

  printf("Generated %ld random \
numbers.\nA[0]=%e\n", N, A[0]);
  free(A);
}

user@host% icpc -mmic -o rand rand.cc
user@host%
user@host% # Run on coprocessor
user@host% # and benchmark
user@host% time micnativeloadex rand
Generated 536870912 random numbers.
A[0]=8.401877e-01

real    0m56.591s
user    0m0.002s
sys     0m0.011s

Intel MKL version:

#include <stdlib.h>
#include <stdio.h>
#include <mkl_vsl.h>
int main() {
  const size_t N = 1<<29L;
  const size_t F = sizeof(float);
  float* A = (float*)malloc(N*F);
  VSLStreamStatePtr rnStream;
  vslNewStream( &rnStream, // Init. RNG
                VSL_BRNG_MT19937, 1 );
  vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD,
               rnStream, N, A, 0.0f, 1.0f);
  printf("Generated %ld random \
numbers.\nA[0]=%e\n", N, A[0]);
  free(A);
}

user@host% icpc -mkl -mmic -o rand-mkl \
  rand-mkl.cc
user@host% export SINK_LD_LIBRARY_PATH=\
/opt/intel/composerxe/mkl/lib/mic:\
/opt/intel/composerxe/lib/mic
user@host% time micnativeloadex rand-mkl
Generated 536870912 random numbers.
A[0]=1.343642e-01

real    0m7.951s
user    0m0.053s
sys     0m0.168s

Listing 4.16: Comparison of random number generation on an Intel Xeon Phi coprocessor with the C Standard General Utilities Library (first version) and Intel MKL (second version).

Both codes in Listing 4.16 perform the same task: the generation of a set of random numbers. However, the first code is based on the C Standard General Utilities Library, and the second code uses Intel MKL. The performance on the Intel Xeon Phi coprocessor is better with MKL by a factor of at least 7x. In addition, the MKL implementation is thread-safe and can be efficiently used by multiple threads in an application. That is not true of the random number generator implemented in the C Standard General Utilities Library.
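One common way to exploit this thread safety is to give each OpenMP thread its own random number stream and its own chunk of the output array. The following is a sketch under the assumption that independent streams from the MT2203 family of basic generators are acceptable for the application; the function fill_random and its parameters are illustrative only:

#include <omp.h>
#include <mkl_vsl.h>

void fill_random(float* A, const int n) {
#pragma omp parallel
  {
    const int t  = omp_get_thread_num();
    const int nt = omp_get_num_threads();
    // Each thread initializes an independent basic generator from the MT2203 family
    VSLStreamStatePtr stream;
    vslNewStream(&stream, VSL_BRNG_MT2203 + t, 77);
    // Each thread fills its own contiguous chunk of the array
    const int begin = (int)(((long)t*n)/nt);
    const int end   = (int)(((long)(t+1)*n)/nt);
    vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, end-begin, &A[begin], 0.0f, 1.0f);
    vslDeleteStream(&stream);
  }
}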


4.3 Data Parallelism: Vectorization the Right and Wrong Way


This section presents optimization advice for automatic vectorization. The guidelines for developing applications suitable for auto-vectorization are outlined: unit-stride access, elimination of real and assumed vector dependences, and data alignment, as well as some supplementary compiler features, such as vectorization pragmas and the vectorization report.

4.3.1 Unit-Stride Access and Spatial Locality of Reference


While the automatic vectorization function of the Intel C++ compiler is able to deal with complex data layouts, the best performance is achieved when data is arranged in a way that allows unit-stride access, i.e., for every iteration of a vectorized loop, the scalar data elements packed into the SIMD register must be adjacent in memory. The rule of thumb for achieving unit-stride access is to use structures of arrays (SoA) instead of arrays of structures (AoS).
Consider the following physical problem (examples presented here are based on a Colfax Research publication [55]). Suppose we have m particles identified by index i, and each particle is described by a set of three Cartesian coordinates $\vec{r}_i \equiv (r_{i,x}, r_{i,y}, r_{i,z})$ and charge $q_i$. We need to calculate the electric potential $\Phi(\vec{R})$ at multiple observation locations $\vec{R}_j \equiv (R_{j,x}, R_{j,y}, R_{j,z})$, where the index j denotes one of n locations. The expression for that potential is given by Coulomb’s law:

$$\Phi\left(\vec{R}_j\right) = -\sum_{i=1}^{m} \frac{q_i}{\left|\vec{r}_i - \vec{R}_j\right|}, \qquad (4.1)$$

where $|\,|$ denotes the magnitude (i.e., length) of a vector:

$$\left|\vec{r}_i - \vec{R}\right| = \sqrt{(r_{i,x} - R_x)^2 + (r_{i,y} - R_y)^2 + (r_{i,z} - R_z)^2}. \qquad (4.2)$$



Figure 4.1 is a visual illustration of the problem. In the left panel, m = 512 charges are distributed in a lattice-like pattern. Each of these particles contributes to the electric potential at every point in space. The right panel of the figure shows the electric potential at n = 128 × 128 points in the xy-plane at z = 0, calculated using Coulomb’s law (Equation 4.1).



[Figure 4.1: left panel, “Charge Distribution” (positive and negative charges); right panel, “Electric Potential” Φ(x, y, z=0).]

Figure 4.1: Left panel: a set of charged particles. Right panel: the electric potential Φ in the z = 0 plane produced by the charged particles shown in the left panel. For every point in the xy-plane, Equation 4.1 was applied to calculate $\Phi(\vec{R})$, where the summation from i = 1 to i = m is taken over the m charged particles.


Elegant, but Inefficient Solution


For a physicist, it is natural to treat particles as the basis of physical models, and therefore it is tempting to start designing a C++ code by defining the particle structure as in Listing 4.17, and then declare an array of such structures. This is an elegant and natural solution. However, it is inefficient for automatic vectorization.

struct Charge { // Elegant, but ineffective data layout
  float x, y, z, q; // Coordinates and value of this charge
};
// The following line declares a set of m point charges:
Charge chg[m];

Listing 4.17: Data organization as an array of structures is often inefficient for vectorization.

The code in Listing 4.18 demonstrates a function that calculates the electric potential at a point $\vec{R}$ defined by the quantities Rx, Ry and Rz in the code.

 1 // This version performs poorly, because data layout of class Charge
 2 // does not allow efficient vectorization
 3 void calculate_electric_potential(
 4     const int m,       // Number of charges
 5     const Charge* chg, // Charge distribution (array of structures)
 6     const float Rx, const float Ry, const float Rz, // Observation point
 7     float & phi        // Output: electric potential
 8     ) {
 9   phi=0.0f;
10   for (int i=0; i<m; i++) { // This loop will be auto-vectorized
11     // Non-unit stride: (&chg[i+1].x - &chg[i].x) != sizeof(float)
12     const float dx=chg[i].x - Rx;
13     const float dy=chg[i].y - Ry;
14     const float dz=chg[i].z - Rz;
15     phi -= chg[i].q / sqrtf(dx*dx+dy*dy+dz*dz); // Coulomb’s law
16   }
17 }

Listing 4.18: Inefficient solution for the Coulomb’s law application: access to quantities x, y, z and q has a stride of 4 rather than 1, because data are stored as an array of structures.

A reference calculation time for this code on a two-way system with Intel Xeon E5-2680 processors with m = 2^11 and n = 2^22 is 0.90 seconds. On a single Intel Xeon Phi coprocessor (pre-production sample), the runtime is 0.73 seconds.
In order to understand why this result can be improved, consider the inner for-loop in line 10 of Listing 4.18. The variable chg[i].x in the i-th iteration is 4*sizeof(float)=16 bytes away in memory from chg[i+1].x used in the next iteration. This corresponds to a stride of 16/sizeof(float)=4 instead of 1, which will incur a performance hit when the data are loaded into the processor’s vector registers. The same goes for members y, z and q of class Charge. Even though Intel Xeon Phi coprocessors support gather/scatter vector instructions, unit-stride access to vector data is almost always more efficient.


Optimized Solution with Unit-Stride Access

In order to achieve unit-stride data access in the for-loop of the function calculate_electric_potential, the structure of the data needs to be re-organized. Instead of the inefficient struct Charge, we can declare a structure that contains the properties of the charges as arrays, as shown in Listing 4.19.

struct Charge_Distribution {
  // This data layout permits effective vectorization of Coulomb’s law application
  const int m; // Number of charges
  float * x;   // Array of x-coordinates of charges
  float * y;   // ...y-coordinates...
  float * z;   // ...etc.
  float * q;   // These arrays are allocated in the constructor
};

Listing 4.19: Data storage as a structure of arrays is usually beneficial for vectorization.

With this new data structure, the function calculating the electric potential takes on the form shown in Listing 4.20.

 1 // This version vectorizes better thanks to unit-stride data access
 2 void calculate_electric_potential(
 3     const int m, // Number of charges
 4     const Charge_Distribution & chg, // Charge distribution (structure of arrays)
 5     const float Rx, const float Ry, const float Rz, // Observation point
 6     float & phi  // Output: electric potential
 7     ) {
 8   phi=0.0f;
 9   for (int i=0; i<chg.m; i++) {
10     // Unit stride: (&chg.x[i+1] - &chg.x[i]) == sizeof(float)
11     const float dx=chg.x[i] - Rx;
12     const float dy=chg.y[i] - Ry;
13     const float dz=chg.z[i] - Rz;
14     phi -= chg.q[i] / sqrtf(dx*dx+dy*dy+dz*dz);
15   }
16 }

Listing 4.20: Efficient vectorization: unit-stride access to quantities x, y, z and q.

Clearly, the inner for-loop in line 9 of Listing 4.20 has unit-stride data access, as chg.x[i] is immediately followed by chg.x[i+1] in memory, and the same is true for all other quantities accessed via the array iterator i. The code execution time for m = 2^11, n = 2^22 is now 0.51 seconds on the host system and 0.37 seconds on the coprocessor. Figure 4.2 summarizes the results.
Note that this performance can be further improved by excluding denormals from the special cases of floating-point numbers handled by the reciprocal square root function. This optimization is discussed in the Exercise Section A.4.4.
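One way such an exclusion can be requested at compile time is through the -fimf-domain-exclusion option from Table 4.2, where the value 8 corresponds to denormals. The command below is only a sketch: the source file name is hypothetical, and the exact flags used in the exercise may differ.

user@host% icpc -mmic -fimf-domain-exclusion=8 coulomb.cc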


[Figure 4.2: bar chart “Electric potential calculation”, time in seconds (lower is better), for the host system and the Intel Xeon Phi coprocessor: non-unit stride (array of structures) 0.90 s and 0.73 s; unit-stride (structure of arrays) 0.51 s and 0.37 s; unit-stride with relaxed precision 0.51 s and 0.22 s.]

Figure 4.2: Electric potential calculation with Coulomb’s law discussed in Section 4.3.1. The non-unit stride case uses an array of structures to represent particles, the unit-stride case uses a structure of arrays. The relaxed arithmetics case is not discussed in the text, but is available in the Exercise Section A.4.4.

The example considered above demonstrates how converting the data layout from an array of structures to a structure of arrays can significantly improve vectorized code performance. Note that the optimal data layout, a structure of arrays, somewhat limits the opportunities for abstraction in C++ codes, reverting the code back to the old C-style loop- and array-centric paradigm. This limitation has been expressed as the statement “abstraction kills parallelism”, which is attributed to Prof. Paul Kelly, Imperial College [56]. However, the necessity of unit-stride access is an inherent property of all computer architectures with a hierarchical memory structure. It is a consequence of a more general prerequisite of high performance, locality of reference of data in space. See also Section 4.5 for an illustration of how improving locality of reference in time can improve performance by optimizing the cache traffic.


4.3.2 Guiding Automatic Vectorization with Compiler Hints


Loop Count

By default, when the compiler encounters a loop for which vectorization may be beneficial, multiple
versions of this loop will be generated. For instance, if the loop iteration count is not known at compile time,
the compiler may generate a scalar version for very short loops and a SIMD version for longer loops.
If the programmer expects a certain iteration count of loops in the code, it may be beneficial to inform
the compiler of that count. This may result in a more efficient execution path selection, because the compiler
will be able to make better decisions regarding the vectorization strategy, loop unrolling, iteration peeling, etc.
#pragma loop count is used to declare the expected loop iteration count. Note that the iteration count
must be a constant known at compile time. Listing 4.21 shows a function performing the multiplication of a
packed sparse matrix by a vector. The sparse matrix is stored as an array of contiguous chunks of non-zero
elements. The average length of the chunk is 100.

void PackedSparseMatrix::MultiplyByVector(const float* inVector, float* outVector) {
  // This function computes the matrix-vector product Ma=b, where
  // M is a sparse matrix stored in a packed format in this->packedData;
  // a is the input vector of length this->nRows represented by inVector
  // b is the output vector of length this->nCols represented by outVector
#pragma omp parallel for schedule(dynamic,30)
  for (int i = 0; i < nRows; i++) {
    outVector[i] = 0.0f;
    for (int nb = 0; nb < blocksInRow[i]; nb++) {
      // For each row i, the value blocksInRow[i]
      // is the number of packed contiguous blocks of non-zero elements
      // of the matrix in this row (computed in class constructor).
      const int idx = blockFirstIdxInRow[i]+nb; // Block index in storage
      const int offs = blockOffset[idx]; // Offset of the block in storage
      const int j0 = blockCol[idx]; // Column in the original matrix for this block
      float sum = 0.0f;
      // Pragma loop count assists the application at runtime
      // in choosing the optimal execution path.
      // When the actual loop count blockLen[idx] is guessed correctly at compile time,
      // this pragma may boost performance.
#pragma loop count avg(100)
      for (int c = 0; c < blockLen[idx]; c++) {
        // This computes the expression b[i] = sum(M[i,j]*a[j])
        // using packed matrix data.
        sum += packedData[offs+c]*inVector[j0+c];
      }
      outVector[i] += sum;
    }
  }
}

Listing 4.21: Multiplication of a sparse matrix in packed row format by a vector. Performance is improved with the help of #pragma loop count, which informs the compiler of the expected for-loop count. This facilitates a better choice of vectorization strategy.

Before including #pragma loop count, the multiplication of a sparse 20000 × 20000 matrix with
a filling factor of 10% and average contiguous block size of 100 by a vector takes 2.00 ± 0.08 ms on the
coprocessor. After the inclusion of #pragma loop count avg(100), this time drops to 1.79 ± 0.03 ms.
More details on this problem can be found in the Exercise Section A.4.5.


Aligned Data Notice


Data alignment on a 64-byte boundary is required for vector instructions in the Many Integrated Core
architecture of Intel Xeon Phi coprocessors. The alignment of the first element in a pointer-based array
is generally not known at compile time. Therefore, in automatically vectorized loops, the compiler must
implement a check for alignment. Depending on the results of the check, the application may peel off several
iterations at the beginning of the loop in order to reach the first aligned element. The alignment check may
take a significant time, especially for short loops, and multiple versions of the code required for execution take
up space in the L1 instruction cache.
If the programmer can guarantee that pointer-based arrays in a vectorized loop are aligned, it is beneficial
to tell the compiler to assume alignment at the beginning of the loop. This is done using #pragma vector
aligned. Listing 4.22 demonstrates the use of this pragma.

void PackedSparseMatrix::MultiplyByVector(const float* inVector, float* outVector) {
  // This function computes the matrix-vector product Ma=b, where
  // M is a sparse matrix stored in a packed format in this->packedData;
  // a is the input vector of length this->nRows represented by inVector
  // b is the output vector of length this->nCols represented by outVector
#pragma omp parallel for schedule(dynamic,30)
  for (int i = 0; i < nRows; i++) {
    outVector[i] = 0.0f;
    for (int nb = 0; nb < blocksInRow[i]; nb++) {
      // For each row i, the value blocksInRow[i]
      // is the number of packed contiguous blocks of non-zero elements
      // of the matrix in this row (computed in class constructor).
      const int idx = blockFirstIdxInRow[i]+nb; // Block index in storage
      const int offs = blockOffset[idx]; // Offset of the block in storage
      const int j0 = blockCol[idx]; // Column in the original matrix at start of block
      float sum = 0.0f;
      // Pragma vector aligned makes a promise to the compiler that the elements of
      // vectorized arrays used in the first iteration are aligned on a 64-byte boundary
#pragma vector aligned
#pragma loop count avg(128) min(16)
      for (int c = 0; c < blockLen[idx]; c++) {
        // This computes the expression b[i] = sum(M[i,j]*a[j])
        // using packed matrix data.
        sum += packedData[offs+c]*inVector[j0+c];
      }
      outVector[i] += sum;
    }
  }
}

Listing 4.22: Multiplication of a sparse matrix in packed row format by a vector. Here, #pragma vector aligned informs the compiler that data alignment in memory is guaranteed by the programmer, and therefore alignment checks are not necessary.

Note that in order for the code in Listing 4.22 to work, arrays packedData and inVector must
be aligned in such a way that for every instance of the for (int c = 0; ...) loop, the addresses of
packedData[offs] and inVector[j0] are on a 64-byte aligned boundary. If this data is not aligned,
but #pragma vector aligned is used, the code will crash. In this code, the alignment property is
guaranteed by the following:

a) Arrays packedData and inVector are allocated using the function _mm_malloc() discussed in
Section 3.1.4. The listing below illustrates the allocation and deallocation of these arrays:


1 float* packedData = (float*) _mm_malloc(sizeof(float)*nData, 64);
2 float* inVector = (float*) _mm_malloc(sizeof(float)*nRows, 64);
3 // ...
4 _mm_free(packedData);
5 _mm_free(inVector);

b) The length of each block of contiguous non-zero elements in the matrix is padded to a value that is a multiple of 64. That is, for every idx, blockLen[idx] % 64 == 0 and therefore blockCol[idx] % 64 == 0 and blockOffset[idx] % 64 == 0. This is illustrated in the full listing included in Exercise Section A.4.5.

Summary
Figure 4.3 summarizes the effect of user-guided automatic vectorization with #pragma loop count
and #pragma vector aligned.

[Figure 4.3 (bar chart): "Multiplication of a packed sparse matrix by vector", comparing execution times on the host system and on the Intel Xeon Phi coprocessor for three code versions: Unoptimized, #pragma loop_count, and Alignment with #pragma vector aligned. Bar values: 4.60 s, 3.99 s, 3.97 s, 2.10 s, 1.95 s, 1.65 s; time in seconds, lower is better.]

Figure 4.3: Results of the sparse matrix by vector multiplication benchmark after assisting the compiler with vectorization pragmas.

Note that in the case of the #pragma loop count, no additional modifications in the code are
required to reap the benefits of increased performance. However, the performance of the code may be degraded
if the loop count used at runtime is significantly different from the value used at compile time in the pragma.
In the case of #pragma vector aligned, the programmer must ensure that the accessed data is
indeed aligned in memory in such a way that in every instance of the vectorized loop, the data accessed in the
first iteration resides on a 64-byte aligned address in memory. See Section 3.1.4 for more information on data
alignment.
Finally, one must note that on the coprocessor, data alignment and the alignment notice increased
performance. However, on the host system, the performance slightly dropped when data alignment was used.
This happened because:
a) AVX instructions used by the Intel Xeon E5 Sandy Bridge processor are not sensitive to alignment.
Therefore, the host version of the application was not slowed down by misaligned data.


b) In this particular application, data alignment could be guaranteed only by padding some of the data blocks
to a length that is a multiple of 64 bytes. Therefore, the total data set on which the application was operating
was increased when data alignment was implemented. This explains the performance drop on the host
system. At the same time, it illustrates that on the Intel Xeon Phi coprocessor, it may be more efficient to
process a larger amount of aligned data than a smaller amount of unaligned data.
See Exercise Section A.4.5 for the complete code used in this example.


4.3.3 Branches in Automatically Vectorized Loops


Combinatorial Algorithms

Branches expressed with if-statements, while-loops and the ternary operator (?:) are common in
combinatorial problems. Combinatorial algorithms with numerous branches are generally non-vectorizable and
often inherently sequential. In order to optimize branch performance in scalar algorithms, one must consider
the behavior of branch predictors and the cost of pipeline flushes for mispredicted branches. Oftentimes,
algorithms benefit from eliminating fine-grained branches and replacing them with coarse-grained branches, because this reduces the frequency of branching and branch misprediction.
In this training, we do not discuss optimizations relevant to combinatorial algorithms, because combina-
torial problems are rarely suitable for execution on Intel Xeon Phi coprocessors. However, we will discuss a
special case of branches in vectorized loops that can occur in numerical problems: masked calculations.

Masked SIMD Calculations

In this section we consider masked calculations expressed as branches that are applied to every iteration in a vector loop. For instance, consider the code in Listing 4.23.
1  void NonMaskedOperations(const int m, const int n, const int* flag, float* data) {
2  #pragma omp parallel for schedule(dynamic)
3    for (int i = 0; i < m; i++)
4      for (int pass = 0; pass < 10; pass++)
5        for (int j = 0; j < n; j++)
6          data[i*n+j] = sqrtf(data[i*n+j]);
7  }
8
9  void MaskedOperations(const int m, const int n, const int* flag, float* data) {
10 #pragma omp parallel for schedule(dynamic)
11   for (int i = 0; i < m; i++)
12     for (int pass = 0; pass < 10; pass++)
13       for (int j = 0; j < n; j++)
14         if (flag[j] == 1)
15           data[i*n+j] = sqrtf(data[i*n+j]);
16 }

Listing 4.23: Function NonMaskedOperations applies an expensive arithmetic operation to the elements of matrix
data. Function MaskedOperations applies the same operation only to elements for which the mask flag[] is set
to 1.

In function NonMaskedOperations, the inner j-loop performs a computationally expensive opera-


tion sqrtf() on every element of the [m x n] matrix data. In contrast, function MaskedOperations
applies the sqrtf() operation only when the value flag[j] is set to 1. The array flag[j] can be
considered a mask for the application of the operation sqrtf(). One might attempt to optimize the function
MaskedOperations by interchanging the loops in order to increase the granularity of the branch, as shown
in Listing 4.24. However, it will quickly become clear that this route is not optimal, as the stride of data access
in the new inner loop is now n instead of 1, which results in drastically slower code. Therefore, we must
abandon the idea of eliminating branches from the vector loop.


1 void MaskedOperationsHugeStride(const int m, const int n, const int* flag, float* data) {
2 #pragma omp parallel for schedule(dynamic)
3 for (int j = 0; j < n; j++) {
4 for (int pass = 0; pass < 10; pass++)
5 if (flag[j] == 1)
6 for (int i = 0; i < m; i++)
7 data[i*n+j] = sqrtf(data[i*n+j]);
8 }
9 }

Listing 4.24: Inefficient attempt at optimizing the code shown in Listing 4.23. In this code, the branch is taken outside the
loop, which reduces the number of branch condition evaluations. However, matrix data is accessed with a stride of n,
which is inefficient.

Let us return to the original code in Listing 4.23. Compilation with the flag -vec-report3 reveals that the compiler did vectorize the inner j-loop in both functions: NonMaskedOperations() and MaskedOperations(). In order to understand what exactly happened, we can benchmark the code with different masks:

(a) With the mask flag[0]=flag[1]=...=flag[n-1]=0, the branch if (flag[j]==1) is never taken, and all elements of the array data remain unchanged.

(b) With the mask flag[0]=...=flag[n-1]=1, the branch if (flag[j]==1) is always taken, and sqrtf() must be applied to all elements of data.

(c) If the mask has alternating zeroes and ones (flag[0]=0, flag[1]=1, flag[2]=0, flag[3]=1, etc.), then the branch is taken every other time, and sqrtf() must be applied to every other element.

(d) Finally, we will consider a mask that contains 16-element blocks of zeroes alternating with 16-element blocks of ones: flag[0]=flag[1]=...=flag[15]=0, flag[16]=...=flag[31]=1, etc.

We benchmarked this application on a matrix with m=2^13, n=2^15 on the host system and on the coprocessor (as a native application). The result can be seen in Figure 4.4.

In the case of mask (a) (all flags set to 0), the execution time is significantly lower than in all other cases.
This means that the code was able to recognize at runtime that none of the expensive sqrtf() operations
must be executed.
Between the cases masks (b) (all flags set to 1) and (c) (alternating 0 and 1 flags), the performance is
not significantly different. This is because the code took a vectorized path. In this path, each vector iteration
computes sqrtf() for several consecutive scalar iterations (8 on processor, 16 on coprocessor). After that,
the results for scalar iterations with the flag set to 1 are written to the destination array, and results for which
the flag is not set are discarded. Therefore, in cases (b) and (c), the sqrtf() function must be computed for
all elements, which explains similar performance.
For mask (d) (16 flags set to 1 alternating with 16 flags set to 0), the execution time is considerably lower.
This case deserves special attention. Indeed, the fraction of the flags set to 1 in this case is 50%, just like in
case (c), but the performance is almost twice as high. The difference is that with the pattern of flags as in mask
(d), the application and processor are able to recognize that every other SIMD vector gets discarded completely.
The application takes advantage of this and saves computing time by calculating only every other vector. Note
that data alignment is crucial for case (d) to be efficient on the Intel Xeon Phi coprocessor, because without
alignment, the pattern of saved and discarded vector lanes may not permit skipping some vector iterations.


Branches with #pragma simd


Consider one modification to the code in Listing 4.23: the inclusion of #pragma simd. This modification is shown in Listing 4.25. This pragma makes the vectorization of the j-loop compulsory.

1  void NonMaskedOperations(const int m, const int n, const int* flag, float* data) {
2  #pragma omp parallel for schedule(dynamic)
3    for (int i = 0; i < m; i++)
4      for (int pass = 0; pass < 10; pass++)
5  #pragma simd
6        for (int j = 0; j < n; j++)
7          data[i*n+j] = sqrtf(data[i*n+j]);
8  }
9
10 void MaskedOperations(const int m, const int n, const int* flag, float* data) {
11 #pragma omp parallel for schedule(dynamic)
12   for (int i = 0; i < m; i++)
13     for (int pass = 0; pass < 10; pass++)
14 #pragma simd
15       for (int j = 0; j < n; j++)
16         if (flag[j] == 1)
17           data[i*n+j] = sqrtf(data[i*n+j]);
18 }

Listing 4.25: A modification of code in Listing 4.23 with vectorization made mandatory using #pragma simd.

The benchmark of the masked vector operations with #pragma simd (Listing 4.25) is shown in Figure 4.5. The execution time now does not depend on the mask. Most notably, for masks (a) and (b), the execution time is identical, even though in (a), all flags are unset, and in (b), all flags are set. At the same time, on the host system, the execution time for masks (b), (c) and (d) is lower than in the case without #pragma simd (Listing 4.23).
This example demonstrates that whether mandatory vectorization with #pragma simd is beneficial depends on the pattern of branches in vectorized code. Specifically, when the vectorized loop contains branches which are rarely taken (e.g., mask (a)), the default vectorization method produces better results than the method with mandatory vectorization.
Some codes with masked operations may benefit from tuning the vectorization method to the pattern of masks, or from changing the data layout in such a way that the mask has a predictable pattern convenient for vectorization.


[Figure 4.4 (bar chart): "Masked Operations, default vectorization options", comparing the host system and the Intel Xeon Phi coprocessor for Mask (a) 0 0 0 0 ..., Mask (b) 1 1 1 1 ..., Mask (c) 0101..., Mask (d) 0..0 1..1 ..., and No Mask. Bar values: 168.7 ms, 159.1 ms, 92.7 ms, 65.4 ms, 65.4 ms, 58.4 ms, 48.8 ms, 31.7 ms, 32.5 ms, 21.2 ms; time in ms, lower is better.]

Figure 4.4: Benchmark of the code in Listing 4.23 performing masked operations on an [m x n] matrix with m=2^13 and n=2^15. Mask (a) has all elements unset (i.e., the branch leading to the computation of sqrtf() is never taken), mask (b) has all elements set (branch is always taken), mask (c) has every other element unset, and mask (d) is made of blocks of length 16 of unset and set flags.


[Figure 4.5 (bar chart): "Masked Operations, #pragma SIMD", comparing the host system and the Intel Xeon Phi coprocessor for Mask (a) 0 0 0 0 ..., Mask (b) 1 1 1 1 ..., Mask (c) 0101..., Mask (d) 0..0 1..1 ..., and No Mask. Bar values: 76.0 ms, 75.9 ms, 75.9 ms, 75.9 ms, 66.8 ms, 66.8 ms, 66.8 ms, 66.9 ms, 53.3 ms, 32.8 ms; time in ms, lower is better.]

Figure 4.5: Benchmark of the code in Listing 4.25, in which vectorization is made mandatory with #pragma simd. See caption to Figure 4.4 for details.


4.3.4 Diagnosing the Utilization of Vector Instructions


When one begins to port and optimize an application for execution on Intel Xeon processors and/or Intel
Xeon Phi coprocessors, it is important to determine whether the existing application takes advantage of vector
instructions. There are multiple ways to do that:
1. When the performance-critical parts of the application are known, the programmer may use the compiler
option -vec-report3. With this argument, the compiler prints the information about loops that are
automatically vectorized or not vectorized, along with brief explanations.

2. It is also possible to use the Intel VTune Amplifier XE in order to diagnose the number of scalar and vector
instructions issued by the code.
3. Finally, there is a simple and practical way to test the effect of automatic vectorization on the application performance. First, compile and benchmark the code with all the default compiler options. Then, compile the code with arguments -no-vec -no-simd and benchmark again. With these options, automatic vectorization is disabled. If the difference in the performance is significant, it indicates that automatic vectorization already contributes to the performance. Note that this method works best when the application is benchmarked with only one thread. This reduces the impact of other factors (such as memory traffic and multithreading overhead) on the execution time of the code.


4.4 Task Parallelism: Common Pitfalls in Shared-Memory Parallel Code

Optimization advice for shared-memory parallel codes is presented in this section. This section addresses
the most basic performance considerations for thread parallelism: reducing the synchronization overhead,
optimizing the size of the iteration space, and load balancing.

4.4.1 Too Much Synchronization. Solution: Avoiding True Sharing with Private Variables and Reduction
When more than one thread accesses a memory location, and at least one of these accesses is a write, a
race condition occurs. In order to avoid unpredictable behavior of a program with a race condition, mutexes
must be used, which generally incurs a performance penalty. In some cases it is possible to avoid a mutex by
using private variables, atomic operations or reducers, as shown in this section.

Example Problem: Computing a Histogram

Consider the problem of computing a histogram. Let us assume that array float age[n] contains numbers from 0 to 100 representing the ages of n people. The task is to create array hist[m], elements of which contain the number of people in the age groups 0-20 years, 20-40 years, 40-60 years, etc. For simplicity, suppose const int m=5 (the number of age groups covering the range 0-100) and const float group_width=20.0f; (how many years each group spans).
An unoptimized serial C code that performs this calculation is shown in Listing 4.26. This code is not protected from situations when one of the members of age[] is outside the range [0.0 . . . 100.0). We assume that the user of the function Histogram() is responsible for ensuring that the array age has only valid entries.
1 void Histogram(const float* age, int* const hist, const int n,
2                const float group_width, const int m) {
3   for (int i = 0; i < n; i++) {
4     const int j = (int) ( age[i] / group_width );
5     hist[j]++;
6   }
7 }

Listing 4.26: Serial, scalar code to compute the histogram of the number of people in age groups.

This code is not optimal, because it cannot be auto-vectorized. The problem with vectorization is a true
vector dependence in the access to array hist. Indeed, consecutive iterations of the i-loop cause scattered
writes to hist, which is not possible to express with vector instructions. However, the operation of computing
the index j does not have a vector dependence, and therefore this part of the calculation can be vectorized.
Before we proceed with parallelizing this code, let us first ensure that it is vectorized. In order to facilitate
automatic vectorization, we can apply a technique called “strip-mining”. In order to “strip-mine” the i-loop,
we express it as two nested loops: an outer loop with index ii and a stride of vLen = 16, and an inner loop with index i that “mines” the strip of indexes between ii and ii+vLen. After that, we can split the inner loop into two consecutive loops: a vector loop for computing the index j and a scalar loop for incrementing the histogram value. The strip-mining technique is further discussed in Section 4.4.4. Note that the choice of the value of vLen=16 is dictated by the fact that vector registers of Intel Xeon Phi coprocessors can
fit 16 values of type int.


Listing 4.27 demonstrates a code that produces the same results as the code in Listing 4.26, but faster,
thanks to automatic vectorization. In addition to vectorization, we also implemented a scalar optimization by
replacing the division by group_width with the multiplication by its reciprocal value. We assume in this
code that n is a multiple of vLen, i.e., n%vLen==0. This assumption is easy to lift, as shown in Exercise
Section A.4.7.

1  void Histogram(const float* age, int* const hist, const int n,
2                 const float group_width, const int m) {
3
4    const int vecLen = 16; // Length of vectorized loop
5    const float invGroupWidth = 1.0f/group_width; // Pre-compute the reciprocal
6
7    // Strip-mining the loop in order to vectorize the inner short loop
8    // Note: this algorithm works only if n%vLen == 0.
9    for (int ii = 0; ii < n; ii += vecLen) {
10     // Temporary storage for vecLen indices. Necessary for vectorization
11     int histIdx[vecLen] __attribute__((aligned(64)));
12
13     // Vectorize the multiplication and rounding
14 #pragma vector aligned
15     for (int i = ii; i < ii + vecLen; i++)
16       histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
17
18     // Scattered memory access, does not get vectorized
19     for (int c = 0; c < vecLen; c++)
20       hist[histIdx[c]]++;
21   }
22 }

Listing 4.27: Vectorizable serial code to compute the histogram of the number of people in age groups.

The performance of codes in Listing 4.26 and Listing 4.27 can be found in Figure 4.6. The vector code performance is the baseline for this example. The function computes with n=2^30 in 1.27 s on the host system and in 9.23 s on the Intel Xeon Phi coprocessor.
Now that scalar optimization and vectorization have been implemented, let us proceed with the parallelization of this code.

Bad Idea: Mutexes


Naturally, the code in Listing 4.27 can be parallelized by distributing the outer for loop (line 9) across
multiple threads. This can be easily done using #pragma omp parallel for. However, the array
hist is shared between all threads, and therefore, race conditions will occur when multiple threads increment
the same element hist[j]. This will lead to unpredictable program behavior and incorrect results.
In order to stabilize program execution, some sort of mutexes should be used. Atomic operations in
OpenMP (see Section 3.2.6) are lightweight mutexes suitable for an increment operation as in line 20 of the
code in Listing 4.27. Listing 4.28 demonstrates an implementation of the histogram computation with atomic
operations.

1  void Histogram(const float* age, int* const hist, const int n,
2                 const float group_width, const int m) {
3
4    const int vecLen = 16;
5    const float invGroupWidth = 1.0f/group_width;
6
7    // Distribute work across threads (n%vLen==0 is assumed)
8  #pragma omp parallel for schedule(guided)
9    for (int ii = 0; ii < n; ii += vecLen) {
10     int histIdx[vecLen] __attribute__((aligned(64)));
11
12 #pragma vector aligned
13     for (int i = ii; i < ii + vecLen; i++)
14       histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
15
16     for (int c = 0; c < vecLen; c++)
17       // Protect the ++ operation with the atomic mutex (inefficient!)
18 #pragma omp atomic
19       hist[histIdx[c]]++;
20   }
21 }

Listing 4.28: Parallel code to compute the histogram with atomic operations to avoid race conditions.

The third set of bars in Figure 4.6 reports the performance result for this code. On the host system, the execution time is 24.0 s and on the coprocessor, it is 37.7 s. This result shows that the use of atomic operations in this code is not a scalable solution. The parallel performance is, in fact, worse than the performance of the serial code in Listing 4.27. Atomic operations may be a viable solution if they are not used very frequently in an application; however, they are used too often in the histogram calculation. Another approach must be taken to parallelize this code, as discussed in the next section.

Better Idea: Private Variables to Avoid True Sharing


As we discussed in Section 3.2.7, a common method for reducing the amount of synchronization between
threads is parallel reduction. In our case, the code performs reduction into an array, and therefore, we cannot
use the reduction clause of OpenMP. However, we can effectively implement reduction by giving each
thread an independent copy of array hist and then accumulating the results of all threads at the end of the
calculation. Listing 4.29 illustrates this idea.

1  void Histogram(const float* age, int* const hist, const int n,
2                 const float group_width, const int m) {
3
4    const int vecLen = 16;
5    const float invGroupWidth = 1.0f/group_width;
6
7  #pragma omp parallel
8    { // Spawning threads before starting the for-loop in order to declare
9      // private variables to hold a copy of histogram in each thread
10     int hist_priv[m];
11     hist_priv[:] = 0;
12
13     int histIdx[vecLen] __attribute__((aligned(64)));
14
15     // Distribute work across threads
16 #pragma omp for schedule(guided)
17     for (int ii = 0; ii < n; ii += vecLen) {
18 #pragma vector aligned
19       for (int i = ii; i < ii + vecLen; i++)
20         histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
21
22       for (int c = 0; c < vecLen; c++)
23         // This time, writing into the thread's private variable
24         hist_priv[histIdx[c]]++;
25     }
26
27     // Reduce private copies into global histogram
28     for (int c = 0; c < m; c++) {
29       // Protect the += operation with the lightweight atomic mutex
30 #pragma omp atomic
31       hist[c] += hist_priv[c];
32     }
33   }
34 }

Listing 4.29: Parallel code to compute the histogram with private copies of shared data. Execution time for n=2^30 is 0.35 s, which is 13x faster than the serial code.

In Listing 4.29, threads are spawned with #pragma omp parallel before the loop begins. Variable int hist_priv[m] declared within the scope of that pragma is automatically considered private to each thread. In the for-loop that follows in line 17, each thread writes to its own histogram copy, and no race conditions occur. Notice the absence of the word parallel in #pragma omp for in line 16: the loop is already inside of a parallel region.
After the loop in line 17 is over, the loop in line 28 is executed in each thread, accumulating the result of all calculations in the shared variable. Atomic operations must still be used here. However, they do not incur a significant overhead, because m is much smaller than n.
The optimized parallel code performs significantly faster than the serial code, completing n=2^30 iterations in 0.12 s on the host system, and in 0.07 s on the coprocessor. This benchmark produces optimal results on the host when hyper-threading is not used, i.e., the variable OMP_NUM_THREADS is set to 16 on the host system with two eight-core Intel Xeon processors.
Figure 4.6 summarizes the effect of optimization of the histogram calculation code discussed in this section.

[Figure 4.6 (bar chart): "Computing a histogram: elimination of synchronization", comparing the host system and the Intel Xeon Phi coprocessor for the Scalar Serial Code, Vectorized Serial Code, Vectorized Parallel Code (Atomic Operations), and Vectorized Parallel Code (Private Variables). Bar values: 71.30 s, 37.70 s, 24.00 s, 9.23 s, 5.06 s, 1.27 s, 0.12 s, 0.07 s; time in seconds, lower is better.]

Figure 4.6: The performance of histogram calculation for n=2^30 and m=5 using codes in Listing 4.26 ("Scalar Serial Code"), Listing 4.27 ("Vectorized Serial Code"), Listing 4.28 ("Vectorized Parallel Code (Atomic Operations)") and Listing 4.29 ("Vectorized Parallel Code (Private Variables)"). The third case (atomic operations) is severely bottlenecked by synchronization in the atomic mutex. The amount of synchronization is significantly reduced in the fourth case through the use of private variables for parallel reduction.


yP
el

Exercise Section A.4.7 contains working code samples and a more general version of the code, which
iv

does not assume n%vLen==0.


us
cl
Ex

Parallel Programming and Optimization with Intel Xeon Phi Coprocessors


4.4. TASK PARALLELISM: COMMON PITFALLS IN SHARED-MEMORY PARALLEL CODE 171

4.4.2 False Sharing. Solution: Data Padding and Private Variables


False sharing is a situation similar to a race condition, except that it occurs when two or more threads
access the same cache line or the same block of cache in a coherent cache system (as opposed to accessing the
same data element), and one of those accesses is a write. False sharing does not result in a race condition; however, it negatively impacts performance.
Performance degradation occurs because the x86 architecture processors, as well as the Intel Xeon Phi
coprocessor, have coherent caches. When two or more threads access the same block of cache or the same
cache line, the processor must lock the whole cache line until the write operation is complete, and coherence is
enforced. The cache line size in most modern Intel architectures is 64 bytes, and cache lines are mapped to
memory on 64-byte boundaries. Therefore, if one thread is writing to memory address A, and another thread is
reading from or writing to memory address B, which is within 64 bytes of A, false sharing may occur.
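To make this pattern concrete, the following minimal sketch (not taken from the book's benchmarks; the eight-thread count and the counter array are illustrative assumptions) shows threads that never touch each other's elements and yet contend for the same cache line:

#include <omp.h>

int counters[8];             // 8*sizeof(int) = 32 bytes: all counters fit in one 64-byte cache line

void CountInParallel(const int n) {
#pragma omp parallel num_threads(8)
  {
    const int t = omp_get_thread_num();   // each thread owns exactly one counter
    for (int i = 0; i < n; i++)
      counters[t]++;         // no race condition, but the cache line bounces between cores
  }
}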

Figure 4.7: Illustration of false sharing in parallel architectures with cache coherency.


Example of False Sharing


In order to demonstrate false sharing, consider the following example. Suppose that we are solving the
histogram calculation problem of Section 4.4.1 and avoiding race conditions using a modification of the private
variable method shown in Listing 4.29. That is, instead of declaring the variable hist_priv private to each
thread, we declare an array of histograms, and use this array as a shared variable in parallel OpenMP context.
Even though this is a shared variable, each thread will write to its own section of it, and therefore no true
sharing will occur. The code that implements this method is shown in Listing 4.30.

1  #include <omp.h>
2
3  void Histogram(const float* age, int* const hist, const int n,
4                 const float group_width, const int m) {
5
6    const int vecLen = 16;
7    const float invGroupWidth = 1.0f/group_width;
8    const int nThreads = omp_get_max_threads();
9    // Shared histogram with a private section for each thread
10   int hist_thr[nThreads][m];
11   hist_thr[:][:] = 0;
12
13 #pragma omp parallel
14   {
15     // Get the number of this thread
16     const int iThread = omp_get_thread_num();
17     int histIdx[vecLen] __attribute__((aligned(64)));
18
19     // There is no synchronization in this for-loop,
20     // however, false sharing ruins the performance
21 #pragma omp for schedule(guided)
22     for (int ii = 0; ii < n; ii += vecLen) {
23
24 #pragma vector aligned
25       for (int i = ii; i < ii + vecLen; i++)
26         histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
27
28       // False sharing occurs here
29       for (int c = 0; c < vecLen; c++)
30         hist_thr[iThread][histIdx[c]]++;
31     }
32   }
33
34   // Reducing results from all threads to the common histogram hist
35   for (int iThread = 0; iThread < nThreads; iThread++)
36     hist[0:m] += hist_thr[iThread][0:m];
37 }

Listing 4.30: This code computes a histogram for data in array age. This code is vectorized and it utilizes thread parallelism, like the code in Listing 4.29. However, unlike the code in Listing 4.29, this code has no synchronization at all. Race condition is avoided by introducing an array of histograms hist_thr, one histogram for each thread. However, with m=5, false sharing occurs when threads access their regions of the shared array hist_thr.

The code in Listing 4.30 may look like it should work similarly to the code in Listing 4.29, because each
thread accesses its own region of memory. At first glance, this method is even better than the method
with private variables illustrated in Listing 4.29, because there are no mutexes at all in this implementation.
However, in practice, code in Listing 4.30 exhibits poor performance on the host system (see Figure 4.8) and
on the coprocessor.


The cause of performance degradation is that the value of m=5 is rather small, and therefore array ele-
ments hist_thr[0][:] are within m*sizeof(int)=20 bytes of array elements hist_thr[1][:].
Therefore, when thread 0 and thread 1 are accessing their elements simultaneously, there is a chance of hitting
the same cache line or the same block of the coherent L1 cache, which results in one of the threads having to
wait until the other thread unlocks that cache line.

Padding to Avoid False Sharing


If data must be written to adjacent memory locations by different threads, false sharing situations may be
avoided by padding. The amount of padding must be at least equal to the size of the cache line, which is 64 bytes.

1  #include <omp.h>
2
3  void Histogram(const float* age, int* const hist, const int n,
4                 const float group_width, const int m) {
5
6    const int vecLen = 16;
7    const float invGroupWidth = 1.0f/group_width;
8    const int nThreads = omp_get_max_threads();
9    // Padding for hist_thr[][] in order to avoid a situation
10   // where two (or more) rows share a cache line.
11   const int paddingBytes = 64;
12   const int paddingElements = paddingBytes / sizeof(int);
13   const int mPadded = m + (paddingElements-m%paddingElements);
14   // Shared histogram with a private section for each thread
15   int hist_thr[nThreads][mPadded];
16   hist_thr[:][:] = 0;
17
18 #pragma omp parallel
19   {
20     // Get the number of this thread
21     const int iThread = omp_get_thread_num();
22
23     int histIdx[vecLen] __attribute__((aligned(64)));
24
25 #pragma omp for schedule(guided)
26     for (int ii = 0; ii < n; ii += vecLen) {
27
28 #pragma vector aligned
29       for (int i = ii; i < ii + vecLen; i++)
30         histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
31
32       for (int c = 0; c < vecLen; c++)
33         hist_thr[iThread][histIdx[c]]++;
34     }
35   }
36
37   for (int iThread = 0; iThread < nThreads; iThread++)
38     hist[0:m] += hist_thr[iThread][0:m];
39
40 }

Listing 4.31: Increasing the size of the inner dimension of the array hist_thr eliminates false sharing.

Listing 4.31 shows how padding can be done in the case considered above. The only difference between Listing 4.31 and Listing 4.30 is that the inner dimension of hist_thr is now mPadded instead of m. Increasing the size of the inner dimension from m to mPadded separates the memory regions in which different threads operate. This reduces the penalty paid for cache coherence, which the processor must enforce even though it is not required in this case. As a result, false sharing is eliminated.
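As a quick sanity check of the padding arithmetic (a hand evaluation of the formula from Listing 4.31 for m=5; the concrete numbers below are an illustration, not additional benchmark data):

// Worked example of the padding computation in Listing 4.31 for m = 5:
const int paddingBytes    = 64;                                   // one cache line
const int paddingElements = paddingBytes / sizeof(int);           // 64 / 4 = 16 elements
const int mPadded = 5 + (paddingElements - 5 % paddingElements);  // 5 + (16 - 5) = 16
// Each row of hist_thr now occupies 16*sizeof(int) = 64 bytes, so consecutive rows fall
// into different cache lines (provided the array itself starts on a 64-byte boundary).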
Figure 4.8 summarizes the performance results of the code in Listing 4.30 and Listing 4.31. The latter
code was compiled and benchmarked in three variations: with paddingBytes=64, 128 and 256. For the
last case, the performance of the code is restored to the performance of the baseline code.

[Figure 4.8 (bar chart): "Computing a histogram: elimination of false sharing", comparing the host system and the Intel Xeon Phi coprocessor for the Baseline Parallel Code (Private Variables), the code exhibiting false sharing, and Padding to 64, 128 and 256 bytes. Bar values: 1.600 s, 0.720 s, 0.369 s, 0.270 s, 0.116 s, 0.114 s, 0.073 s, 0.068 s, 0.067 s, 0.067 s; time in seconds, lower is better.]

Figure 4.8: The performance of histogram calculation for n=2^30 and m=5 using codes in Listing 4.29 ("Baseline: Parallel Code"), Listing 4.30 ("Poor Performance: False Sharing") and Listing 4.31 ("Padding to 64/128/256 bytes").

More information on false sharing can be found in the article [57] by Nicholas Butler.

4.4.3 Load Imbalance. Solution: Load Scheduling and Grain Size Specification
In Section 3.2.3 we discussed how parallel loops can be executed in different scheduling modes. Specifi-
cally, in OpenMP, static, dynamic and guided modes are available, and scheduling granularity can also
be specified. Choosing a scheduling mode is often a trade-off. Lightweight, coarse-grained scheduling modes
incur little overhead, but may lead to load imbalance. On the other hand, complex, fine-grained scheduling
modes can improve load balance, but may introduce a significant scheduling overhead.
Consider a parallel for-loop that calls a thread-safe serial function in every iteration shown in Listing 4.32.

1 #pragma omp parallel for


2 for (int i = 0; i < n; i++) {
3 BlackboxFunction(i, data[i]);
4 }

Listing 4.32: Sample parallel loop calling a serial function, the execution time of which varies from iteration to iteration.

Suppose that the execution time of the function varies significantly from call to call. Such an application is prone to load imbalance because some of the parallel threads may be “lucky” to get a quick workload, and other threads may struggle with a more expensive task. “Lucky” threads will have to wait for all other threads, because the application cannot proceed further until all of the loop iterations are processed. In order to improve the performance, we can specify a scheduling mode and a grain size, as in Listing 4.33.
1 #pragma omp parallel for schedule(dynamic, 4)
2 // ...

Listing 4.33: The parallel loop of Listing 4.32 with the dynamic scheduling mode and a grain size of 4 specified.
Here, the dynamic scheduling mode indicates that the iteration space must be split into chunks of length 4, and these chunks must be distributed across available threads. As threads finish with their task, they will receive another chunk of the problem to work on. Other scheduling modes are static, where iterations are distributed across threads before the calculations begin, and guided, which is analogous to dynamic, except that the chunk size starts large and is gradually reduced toward the end of the calculation.
The grain size of scheduling in Listing 4.33 is chosen as 4. Choosing the grain size is a trade-off: with too small a grain size, too much communication between threads and the scheduler may occur, and the application may be slowed down by the task of scheduling; with too large a grain size, load balancing may be limited. In order to be effective, the grain size must be greater than 1 and smaller than n/T, where n is the number of loop iterations and T is the number of parallel threads.
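As a complementary sketch of the guided mode described above (the loop body, the function name and the grain size of 4 below are illustrative assumptions, not part of the book's benchmark):

#include <omp.h>

double ExpensiveCalculation(const int i);   // hypothetical function whose runtime varies per call

void ProcessAll(const int n, double* result) {
  // guided scheduling: chunks start large and shrink down to the specified grain size (4),
  // combining low scheduling overhead with load balancing for iterations of uneven cost
#pragma omp parallel for schedule(guided, 4)
  for (int i = 0; i < n; i++)
    result[i] = ExpensiveCalculation(i);
}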

Example of Load Imbalance Resolution

In the following analysis we study a function that solves a nonhomogeneous system of linear algebraic equations M~x = ~b using the iterative Jacobi method. We construct a parallel for-loop that calls the Jacobi solver, and in every loop iteration, a different vector ~b is used for the problem. The iterative Jacobi method has some inherent variation in the runtime, because for different vectors ~b, it may take different numbers of iterations to obtain the solution ~x. In order to emphasize the load imbalance, we request a varying degree of accuracy for different loop iterations. This causes the number of Jacobi iterations to fluctuate greatly from one call of the solver to another. The solver code is shown in Listing 4.34 and the loop with which the solver is called is shown in Listing 4.35. The complete working code can be found in Exercise Section A.4.8.


1  double RelativeNormOfDifference(const int n, const double* v1, const double* v2) {
2    // Calculates ||v1 - v2|| / (||v1|| + ||v2||)
3    double norm2 = 0.0; double v1sq = 0.0; double v2sq = 0.0;
4  #pragma vector aligned
5    for (int i = 0; i < n; i++) {
6      norm2 += (v1[i] - v2[i])*(v1[i] - v2[i]);
7      v1sq += v1[i]*v1[i];
8      v2sq += v2[i]*v2[i];
9    }
10   return sqrt(norm2/(v1sq+v2sq));
11 }
12
13 int IterativeSolver(const int n, const double* M, const double* b, double* x,
14                     const double minAccuracy) {
15   // Iteratively solves the equation Mx=b with accuracy of at least minAccuracy
16   // using the Jacobi method
17   double accuracy;
18   double bTrial[n] __attribute__((align(64)));
19   x[0:n] = 0.0; // Initial guess
20   int iterations = 0;
21   do {
22     iterations++;
23     // Jacobi method
24     for (int i = 0; i < n; i++) {
25       double c = 0.0;
26 #pragma vector aligned
27       for (int j = 0; j < n; j++)
28         c += M[i*n+j]*x[j];
29       x[i] = x[i] + (b[i] - c)/M[i*n+i];
30     }
31
32     // Verification
33     bTrial[:] = 0.0;
34     for (int i = 0; i < n; i++) {
35 #pragma vector aligned
36       for (int j = 0; j < n; j++)
37         bTrial[i] += M[i*n+j]*x[j];
38     }
39     accuracy = RelativeNormOfDifference(n, b, bTrial);
40
41   } while (accuracy > minAccuracy);
42   return iterations;
43 }

Listing 4.34: An iterative Jacobi solver for nonhomogeneous systems of linear algebraic equations. The number of iterations depends on the initial value of vector ~x and on the requested accuracy minAccuracy.

1 #pragma omp parallel for // scheduling mode goes here
2 for (int c = 0; c < nBVectors; c++)
3   IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);

Listing 4.35: Parallel loop that calls the Jacobi solver with different vectors ~b and requests a different accuracy for every
call. This loop exhibits load imbalance if scheduling mode is not specified.

We benchmarked the Jacobi solver code with various settings for the loop scheduling mode. The results
can be found in Figure 4.9.


[Figure 4.9 (bar chart): "Parallel loop execution scheduling modes", comparing execution times on the host system and on the Intel Xeon Phi coprocessor for the default schedule, static, dynamic and guided schedules with grain sizes of 1, 4, 32 and 256, and Cilk Plus (_Cilk_for); times range roughly from 0.05 s to 0.22 s, lower is better.]

Figure 4.9: Performance of the parallel loop executing the Jacobi solver (Listing 4.34 and Listing 4.35) for a set of vectors ~b with various OpenMP loop scheduling modes.
For the default (unspecified) scheduling mode, the performance on the host system is almost 2x worse than in the case of dynamic or guided scheduling, which exhibits the best performance. This is explained by the fact that we intentionally randomized the accuracy requirement for each call to the solver. With dynamic or guided scheduling, threads that were “lucky” to get low-accuracy calculations are loaded with additional work. On the coprocessor, the performance was optimal for the default scheduling mode.
The grain size for dynamic scheduling in this problem has a “sweet spot” at the value of 4. However, with guided scheduling, a grain size of 1 works as well as a grain size of 4. That is because guided scheduling reduces the scheduling overhead by gradually reducing the grain size from a large value down to the grain size specified by the user. A large grain size has a significant negative effect on performance. That happens because a large grain size is a significant fraction of the iteration space size. This is true for both the host system and the coprocessor, but more pronounced on the latter, due to a greater number of threads.
The performance of the _Cilk_for on the coprocessor fluctuated greatly (almost by a factor of two) from one trial to another, and the average performance is reported on the plot.

Diagnosing Load Imbalance with VTune


It is possible to detect situations where load imbalance causes performance loss in a multi-threaded
application. Intel VTune Amplifier XE can be used for that. We created a VTune project for the Jacobi solver
running on the host. The type of analysis used for this benchmark is called “Concurrency”. The benchmark
was run for two versions of the code: default loop scheduling and schedule(guided). For each code
version, the parallel loop was run 10 times.
Results are shown in Figure 4.10 (default scheduling) and Figure 4.11 (guided scheduling). These
screenshots correspond to the “Bottom-up” view in VTune. We expanded the thread panel and chose the “Tiny”
band height. Pale green regions in the thread panel correspond to thread waiting, and deep green — to thread
running. With the default load scheduling, Figure 4.10 reveals situations when just one or two threads were running, and all others were waiting. Such situations can be eliminated with the help of guided scheduling. Indeed, in Figure 4.11 we can see that all threads finish their work in about the same time, and the waiting time is reduced.


Figure 4.10: Concurrency profile of the Jacobi solver (Listing 4.34 and Listing 4.35) on the host system in Intel VTune Amplifier XE. 10 instances of the parallel loop were run with default scheduling. Load imbalance can be seen at the end of each of the 10 parallel calculations.


Figure 4.11: Same as Figure 4.10, but with schedule(guided).


4.4.4 Insufficient Parallelism. Solution: Strip-Mining and Collapsing Nested Loops


Intel Xeon Phi coprocessors feature more than 50 cores with four-way hyper-threading, a total of more
than 200 logical cores. Such a degree of hardware parallelism may be difficult to utilize for some algorithms. For instance, algorithms in which the parallel loop count is smaller than the number of logical cores will not
perform well on the coprocessor. A possible solution to this problem is increasing the iteration space over
which parallelization is performed. This section discusses two simple, yet efficient techniques for increasing
available thread parallelism in an application: strip-mining and loop collapse.

Principles: Strip-Mining

Strip-mining is a technique that applies to vectorized loops operating on one-dimensional arrays. This technique allows such loops to be parallelized while retaining vectorization. When this technique is applied, a single loop operating on a one-dimensional array is converted into two nested loops. The outer loop strides through “strips” of the array, and the inner loop operates on the data inside the strip (“mining” it). Sometimes, this technique is used by the compiler “behind the scenes” in order to apply thread parallelism to vectorized loops. However, in some cases, explicit application of strip-mining may be necessary. For example, when nested loops are collapsed (see Listing 4.40 and Listing 4.44), the compiler may be unable to automatically vectorize the loops. Listing 4.36 demonstrates the strip-mining transformation.

1  // Compiler may be able to simultaneously parallelize and auto-vectorize this loop
2  #pragma omp parallel for
3  #pragma simd
4  for (int i = 0; i < n; i++) {
5    // ... do work
6  }
7
8  // The strip-mining technique separates parallelization from vectorization
9  const int STRIP=80;
10 #pragma omp parallel for
11 for (int ii = 0; ii < n; ii += STRIP)
12 #pragma simd
13   for (int i = ii; i < ii + STRIP; i++) {
14     // ... do work
15   }

Listing 4.36: Strip-mining technique is usually implemented by the compiler “behind the scenes”. However, it is easy to implement it manually, as shown in this listing.

The size of the strip must usually be chosen as a multiple of the SIMD vector length in order to facilitate
the vectorization of the inner loop. Furthermore, if the iteration count n is not a multiple of the strip size, then
the programmer must peel off n%STRIP iterations at the end of the loop.
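A minimal sketch of this remainder handling is shown below (the function name and the doubling operation in the loop body are placeholders standing in for the "... do work" of Listing 4.36):

void StripMineWithRemainder(const int n, float* data) {
  const int STRIP = 80;                  // chosen as a multiple of the SIMD vector length
  const int nStripped = n - n%STRIP;     // largest multiple of STRIP not exceeding n
#pragma omp parallel for
  for (int ii = 0; ii < nStripped; ii += STRIP)
#pragma simd
    for (int i = ii; i < ii + STRIP; i++)
      data[i] *= 2.0f;                   // ... do work (placeholder)
  // Peel off the remaining n%STRIP iterations in a scalar loop
  for (int i = nStripped; i < n; i++)
    data[i] *= 2.0f;                     // ... do work (placeholder)
}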


Principles: Loop Collapse

Loop collapse is a technique that converts two nested loops into a single loop. This technique can be applied either automatically (for example, using the collapse clause of #pragma omp for), or explicitly. Loop collapse is demonstrated in Listing 4.37.

1  // In the example below, m iterations are distributed across threads
2  #pragma omp parallel for
3  for (int i = 0; i < m; i++) {
4    for (int j = 0; j < n; j++) {
5      // ... do work
6    }
7  }
8
9  // In the example below, m*n iterations are distributed across threads
10 #pragma omp parallel for collapse(2)
11 for (int i = 0; i < m; i++) {
12   for (int j = 0; j < n; j++) {
13     // ... do work
14   }
15 }
16
17 // The example below demonstrates explicit loop collapse.
18 // A total of m*n iterations are distributed across threads
19 #pragma omp parallel for
20 for (int c = 0; c < m*n; c++) {
21   const int i = c / n;
22   const int j = c % n;
23   // ... do work
24 }

Listing 4.37: Loop collapse exposes more thread parallelism in nested loops, rowsum_unoptimized.cc. The first piece of code does not use loop collapse; the second relies on the automatic loop collapse functionality of OpenMP, and the third implements loop collapse explicitly.

Example: Row-Wise Reduction of a Short, Wide Matrix

Consider the problem of performing a reduction (sum, average, or another cumulative characteristic) along the rows of a matrix M[m][n] as illustrated by the equation below:

    S_i = \sum_{j=0}^{n} M_{ij},   i = 0 . . . m.                  (4.3)

Assume that m is small (smaller than the number of threads in the system), and n is large (large enough so that the matrix does not fit into cache). A straightforward implementation of summing the elements of each row is shown in Listing 4.38.

Parallel Programming and Optimization with Intel Xeon Phi Coprocessors


4.4. TASK PARALLELISM: COMMON PITFALLS IN SHARED-MEMORY PARALLEL CODE 181

1  #include <omp.h>
2  #include <stdio.h>
3  #include <math.h>
4  #include <malloc.h>
5
6  void sum_unoptimized(const int m, const int n, long* M, long* s){
7  #pragma omp parallel for
8    for (int i=0; i<m; i++) {
9      long sum=0;
10 #pragma simd
11 #pragma vector aligned
12     for (int j=0; j<n; j++)
13       sum+=M[i*n+j];
14     s[i]=sum;
15   }
16 }
17
18 int main(){
19   const int n=100000000, m=4; // problem size
20   long* M=(long*)_mm_malloc(sizeof(long)*m*n, 64);
21   long* s=(long*)_mm_malloc(sizeof(long)*m, 64);
22
23   printf("Problem size: %.3f GB, outer dimension: %d, threads: %d\n",
24          (double)(sizeof(long))*(double)(n)*(double)m/(double)(1<<30),
25          m, omp_get_max_threads());
26
27   const int nl=10;
28   double t=0, dt=0;
29 #pragma omp parallel for
30   for (int i=0; i<m*n; i++) M[i]=0;
31
32   for (int l=0; l<nl; l++) { // Benchmarking row_sum(...)
33     const double t0=omp_get_wtime();
34     sum_unoptimized(m, n, M, s);
35     const double t1=omp_get_wtime();
36     if (l>=2) { // First two iterations slow on Xeon Phi; exclude them
37       t+=(t1-t0)/(double)(nl-2);
38       dt+=(t1-t0)*(t1-t0)/(double)(nl-2);
39     }
40   }
41   dt=sqrt(dt-t*t);
42   const float gbps=(double)sizeof(long)*(double)n*(double)m/t*1e-9;
43   printf("Unoptimized: %.3f +/- %.3f seconds (%.2f +/- %.2f GB/s)\n",
44          t, dt, gbps, gbps*(dt/t));
45
46   _mm_free(s); _mm_free(M);
47 }

Listing 4.38: Function sum_unoptimized() calculates the sum of the elements in each row of matrix M. When the number of rows, m, is smaller than the number of threads in the system, the performance of this loop suffers from a low degree of parallelism.

This implementation suffers from insufficient parallelism, because m is too small to keep all cores
occupied. In fact, this is a bandwidth-bound problem, because memory access has a regular pattern, and the
arithmetic intensity is equal to 1. Therefore, the performance concern is utilizing all memory controllers,
rather than all cores. There are 16 memory controllers in the KNC architecture. The performance of this
code, measured on the host system of two Intel Xeon E5-2680 processors, and on an Intel Xeon Phi 5110P
coprocessor, is shown in Listing 4.39.

Prepared for Yunheng Wang c Colfax International, 2013


182 CHAPTER 4. OPTIMIZING APPLICATIONS FOR INTEL R XEON R PRODUCT FAMILY

user@host$ icpc -openmp rowsum_unoptimized.cc -o rowsum_unoptimized
user@host$ ./rowsum_unoptimized
Problem size: 2.980 GB, outer dimension: 4, threads: 32
Unoptimized: 0.067 +/- 0.000 seconds (47.55 +/- 0.21 GB/s)
user@host$
user@host$ icpc -openmp rowsum_unoptimized.cc -o rowsum_unoptimized_mic -mmic
user@host$ scp rowsum_unoptimized_mic mic0:~/
rowsum_unoptimized_mic 100% 19KB 18.8KB/s 00:00
user@host$ ssh mic0
user@mic0$ ./rowsum_unoptimized_mic
Problem size: 2.980 GB, outer dimension: 4, threads: 240
Unoptimized: 0.546 +/- 0.002 seconds (5.86 +/- 0.03 GB/s)
user@mic0$

Listing 4.39: Baseline performance of the row-wise matrix reduction code shown in Listing 4.38.

Insufficient parallelism may be seen in the VTune profile of the application captured in Figure 4.12. The concurrency histogram (top panel) shows that for the bulk of the runtime, only 4 threads or fewer (out of the available 32) were running. The timeline (bottom panel) shows only 4 horizontal lines with dark green patches. This illustrates the same problem with insufficient parallelism as the concurrency histogram.
In order to improve the performance of this application, the amount of exploitable parallelism in the code must be expanded. In the remainder of this section, we will implement three optimization techniques:

1. First, we will try to move the parallel code into the inner loop, which has more iterations.

2. Second, we will attempt to use the loop collapse functionality of OpenMP.

3. Finally, we will apply the strip-mining technique and loop collapse.


Figure 4.12: Thread concurrency profile of host implementation of the row-wise matrix reduction code with insufficient parallelism (Listing 4.38). Top panel: concurrency histogram, bottom panel: timeline with the threads panel expanded.


Optimization Option 1 (Moderately Efficient): Parallelizing the Inner Loop


In order to improve parallelism, one may try to parallelize the inner loop, which has more iterations, as
shown in Listing 4.40.

1  void sum_inner(const int m, const int n, long* M, long* s){
2    for (int i=0; i<m; i++) {
3      long sum=0;
4  #pragma omp parallel for schedule(guided) reduction(+: sum)
5  #pragma simd
6  #pragma vector aligned
7      for (int j=0; j<n; j++)
8        sum+=M[i*n+j];
9      s[i]=sum;
10   }
11 }

Listing 4.40: rowsum_inner.cc improves the performance of rowsum_unoptimized.cc shown in Listing 4.38 by parallelizing the inner loop instead of the outer loop. This optimization improves the parallel scalability, but does not achieve the best performance.

Listing 4.41 demonstrates the performance of the code with parallelized inner loop.

user@host$ icpc -openmp rowsum_inner.cc -o rowsum_inner
user@host$ ./rowsum_inner
Problem size: 2.980 GB, outer dimension: 4, threads: 32
Inner loop parallelized: 0.083 +/- 0.001 seconds (38.65 +/- 0.43 GB/s)
user@host$
user@host$ icpc -openmp rowsum_inner.cc -o rowsum_inner_mic -mmic
user@host$ scp rowsum_inner_mic mic0:~/
rowsum_inner_mic 100% 20KB 20.2KB/s 00:00
user@host$ ssh mic0
user@mic0$ ./rowsum_inner_mic
Problem size: 2.980 GB, outer dimension: 4, threads: 240
Inner loop parallelized: 0.038 +/- 0.000 seconds (84.89 +/- 0.29 GB/s)
user@mic0$

Listing 4.41: Performance of the row-wise matrix reduction code with parallelized inner loop.

With the parallel inner loop, the performance on the coprocessor has improved, while the performance on the host system has dropped. The performance increase on the coprocessor is explained by the fact that with more parallel threads operating, the memory controllers are not as severely under-utilized as in the unoptimized version. However, the performance drop on the host indicates that the code is still not optimal. The host does not benefit from additional parallelism as much as the coprocessor, because the host has fewer memory controllers, and therefore it takes fewer threads to saturate the memory bandwidth. At the same time, the host performance is impeded in the new version of the code because a parallel region is entered (and a barrier is incurred) for every i-iteration, which adds parallelization overhead. Additionally, when the inner loop is parallelized, the OpenMP library does not see the whole iteration space of the problem and has less freedom for optimal load scheduling.
Even though we observed a performance increase on the coprocessor, we mark this method as sub-optimal for this problem because of the issues stated above.


Optimization Option 2 (Inefficient): Loop Collapse


In an attempt to implement a better algorithm, the programmer can use the OpenMP 3.0 loop collapse functionality, as shown in Listing 4.42. The performance of this code is reported in Listing 4.43.

1  void sum_collapse(const int m, const int n, long* M, long* s){
2    s[0:m]=0;
3  #pragma omp parallel
4    {
5      long sum[m]; // Private reduction array to avoid false sharing
6      sum[0:m]=0;
7  #pragma omp for collapse(2) schedule(guided)
8  #pragma simd
9  #pragma vector aligned
10     for (int i=0; i<m; i++)
11       for (int j=0; j<n; j++)
12         sum[i]+=M[i*n+j];
13
14     // Arrays cannot be declared as reducers in pragma omp,
15     // and so the reduction must be programmed explicitly.
16     for (int i=0; i<m; i++)
17 #pragma omp atomic
18       s[i]+=sum[i];
19   }
20 }
Listing 4.42: rowsum_collapse.cc attempts to expand the iteration space of the row-wise matrix reduction problem by collapsing nested loops. This gives OpenMP more freedom for load balancing, but precludes automatic vectorization of the inner loop.



user@host$ icpc -openmp rowsum_collapse.cc -o rowsum_collapse
user@host$ ./rowsum_collapse
Problem size: 2.980 GB, outer dimension: 4, threads: 32
Collapse nested loops: 0.113 +/- 0.000 seconds (28.33 +/- 0.01 GB/s)
user@host$
user@host$ icpc -openmp rowsum_collapse.cc -o rowsum_collapse_mic -mmic
user@host$ scp rowsum_collapse_mic mic0:~/
rowsum_collapse_mic 100% 20KB 20.1KB/s 00:00
user@host$ ssh mic0 ~/rowsum_collapse_mic
Problem size: 2.980 GB, outer dimension: 4, threads: 240
Collapse nested loops: 0.490 +/- 0.000 seconds (6.53 +/- 0.00 GB/s)
user@host$

Listing 4.43: Performance of the row-wise matrix reduction code with collapsed nested loops.

While the collapse(2) directive makes OpenMP spread the combined iteration space of both loops across threads, the code runs slowly on both the host system and the coprocessor. This happens because vectorization fails in this case, which can be verified by compiling the code with the argument -vec-report3. The inclusion of #pragma simd does not help to enforce vectorization.
Even though we did not achieve optimal performance with this optimization, we are on the right track, because we expose the most parallelism to the compiler. In the next optimization step, we will strip-mine the inner loop while still using the loop collapse pragma for the outer loops. This will enable automatic vectorization and, at the same time, expose the whole iteration space to thread parallelism.


Optimization Option 3 (Optimal): Strip-Mining and Loop Collapse


Finally, consider the code in Listing 4.44, which employs strip-mining in order to transform the j-loop into two nested loops, and utilizes the loop collapse directive on the outer i-loop and the new jj-loop. The performance of this code is the best of all attempts, as shown in Listing 4.45.

1  void sum_stripmine(const int m, const int n, long* M, long* s){
2    const int STRIP=800;
3
4    assert(n%STRIP==0);
5    s[0:m]=0;
6  #pragma omp parallel
7    {
8      long sum[m];
9      sum[0:m]=0;
10 #pragma omp for collapse(2) schedule(guided)
11     for (int i=0; i<m; i++)
12       for (int jj=0; jj<n; jj+=STRIP)
13 #pragma simd
14 #pragma vector aligned
15         for (int j=jj; j<jj+STRIP; j++)
16           sum[i]+=M[i*n+j];
17
18     // Reduction
19     for (int i=0; i<m; i++)
20 #pragma omp atomic
21       s[i]+=sum[i];
22   }
23 }
Listing 4.44: rowsum_stripmine.cc improves on rowsum_collapse.cc (Listing 4.42) by strip-mining the inner loop. This allows OpenMP to balance the load across available threads, while automatic vectorization succeeds in the inner loop.



user@host$ icpc -openmp rowsum_stripmine.cc -o rowsum_stripmine
user@host$ ./rowsum_stripmine
Problem size: 2.980 GB, outer dimension: 4, threads: 32
Strip-mine and collapse: 0.060 +/- 0.001 seconds (53.67 +/- 1.04 GB/s)
user@host$
user@host$ icpc -openmp rowsum_stripmine.cc -o rowsum_stripmine_mic -mmic
user@host$ scp rowsum_stripmine_mic mic0:~/
rowsum_stripmine_mic 100% 19KB 18.8KB/s 00:00
user@host$ ssh mic0
user@mic0$ ./rowsum_stripmine_mic
Problem size: 2.980 GB, outer dimension: 4, threads: 240
Strip-mine and collapse: 0.024 +/- 0.001 seconds (131.55 +/- 2.83 GB/s)
user@mic0$

Listing 4.45: Performance of the row-wise matrix reduction code with collapsed nested loops and strip-mined inner loop.

Evidently, this optimization is the best of all cases we have considered so far. This is true of both the host and the coprocessor performance. The success of this version is explained by the fact that parallelism was exposed to the compiler at two levels: a vectorizable inner loop and parallelizable outer loops with a large iteration count.


Note that for optimal performance, the value of the parameter STRIP must be a multiple of the SIMD vector length and much greater than the SIMD vector length. It must also be much smaller than the array length n. The code in Listing 4.44 assumes that n is a multiple of STRIP. It is easy to relax this condition with an additional peel loop, as sketched below (see also Listing 4.58).
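The following fragment is a minimal sketch, not part of the original exercise code, of how the strip-mined reduction of Listing 4.44 could handle a value of n that is not a multiple of STRIP; the function name sum_stripmine_peel is hypothetical.

void sum_stripmine_peel(const int m, const int n, long* M, long* s){
  const int STRIP=800;
  const int nStripped = n - n%STRIP; // largest multiple of STRIP not exceeding n
  s[0:m]=0;
#pragma omp parallel
  {
    long sum[m];              // thread-private partial sums, as in Listing 4.44
    sum[0:m]=0;
#pragma omp for collapse(2) schedule(guided)
    for (int i=0; i<m; i++)
      for (int jj=0; jj<nStripped; jj+=STRIP)
#pragma simd
#pragma vector aligned
        for (int j=jj; j<jj+STRIP; j++)
          sum[i]+=M[i*n+j];
    // Reduction of the thread-private partial sums
    for (int i=0; i<m; i++)
#pragma omp atomic
      s[i]+=sum[i];
  }
  // Peel loop: the remaining n%STRIP elements of each row, processed serially
  // after the parallel region has completed
  for (int i=0; i<m; i++)
    for (int j=nStripped; j<n; j++)
      s[i]+=M[i*n+j];
}

The serial peel loop touches at most m*(STRIP-1) elements, which is negligible compared to the strip-mined part when STRIP is much smaller than n.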
Figure 4.13 contains a summary of the performance of all versions of the row-wise matrix reduction algorithm considered in this section. Note that the metric shown in this plot is the effective memory bandwidth achieved by the algorithm, in GB/s. In this plot, greater bandwidth means better performance.

[Chart data for Figure 4.13, "Parallel row-wise matrix reduction", performance in GB/s (higher is better): Unoptimized: 47.5 (host), 5.9 (coprocessor); Parallel inner loop: 38.6, 84.9; Collapse nested loops: 28.3, 6.5; Strip-mine and collapse: 53.7, 131.6.]

Figure 4.13: Performance of all versions of the row-wise matrix reduction code (Listing 4.38, Listing 4.40, Listing 4.42 and Listing 4.44) on the host system and on the Intel Xeon Phi coprocessor.

It is also informative to consider the concurrency profile of the optimized code in VTune. The profiling results are shown in Figure 4.14. Comparing them to the profile of the unoptimized code shown in Figure 4.12, we see that all 32 threads were occupied on the host for the bulk of the execution time, which indicates a high level of thread parallelism.

The complete code with running instructions for this case study is available in Exercise Section A.4.9.



Figure 4.14: Thread concurrency profile of host implementation of the optimized row-wise matrix reduction code
(Listing 4.44). Top panel: concurrency histogram, bottom panel: timeline with the threads panel expanded.


4.4.5 Wandering Threads. Improving OpenMP Performance by Setting Thread Affinity

The Intel OpenMP library has the ability to bind OpenMP threads to a set of logical or physical cores. This functionality is available in both Intel Xeon processors and Intel Xeon Phi coprocessors. Such binding, known as thread affinity, may improve application performance. For example, it makes sense to use thread affinity in the following cases:

1. In HPC applications that utilize the whole system, OpenMP threads may migrate from one core to another according to the OS decisions. This migration leads to performance penalties. For example, the migrated thread may have to re-fetch the cache contents into the new core's L1 cache. Using thread affinity, the programmer can forbid thread migration and thus improve performance;

2. For memory bandwidth-bound codes, the optimum number of threads on Intel systems is usually equal to or smaller than the number of physical cores. In other words, hyper-threading is counter-productive for bandwidth-bound codes. However, for optimum performance, software threads must be distributed across different physical cores rather than share two logical cores on the same physical core. Placing the threads on different physical cores allows all available memory controllers to be utilized efficiently. Setting the corresponding thread affinity pattern helps to achieve such a thread distribution;

3. For compute-bound codes with hyper-threading, application performance may be improved by placing threads operating on adjacent data onto the same physical core, so that they may share the data in the local L2 cache slice on the Intel Xeon Phi coprocessor. This task may be accomplished by setting thread affinity and orchestrating the load distribution across OpenMP threads accordingly;

4. In applications using Intel Xeon Phi coprocessors in the offload mode, it is preferable to exclude cores 0-3 from the affinity mask of the calculation. These cores are used by the uOS for offload management, and thread contention on these cores may slow down the whole application;

5. When several independent processes are running on a Non-Uniform Memory Access (NUMA) system, sharing its resources, it is beneficial to keep each process assigned to a specific core or processor. This facilitates local data allocation and access in NUMA systems and benefits performance. Thread affinity can be used to effectively partition the system and bind each process to the respective local resources.

The KMP_AFFINITY Environment Variable



Thread affinity in OpenMP applications can be controlled at the application level by setting the environment variable KMP_AFFINITY [58]. The format of the variable is

KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]

Table 4.3 explains the meaning of the arguments.


Argument: modifier
Default: noverbose, respect, granularity=core
Description:
  verbose/noverbose: controls whether the application must print information about the supported affinity upon OpenMP library initialization: the number of packages (i.e., processors), the number of cores in each package, the number of thread contexts in each core, and the OpenMP thread bindings to physical thread contexts.
  respect/norespect: controls whether to respect the affinity mask in place for the thread that initializes the OpenMP run-time library.
  warnings/nowarnings: controls whether to print warning messages for the affinity interface.
  granularity=<specifier>: describes the lowest level that OpenMP threads are allowed to float within a topology map. The values of <specifier> are core or fine (the latter is equivalent to thread). With granularity=core, threads bound to a core are allowed to float between the different thread contexts (logical cores). With granularity=fine, each thread is bound to a specific thread context.
  proclist=[<proc_list>]: specifies an explicit mapping of OpenMP threads to OS procs. The format of <proc_list> is a comma-separated string containing the numerical identifiers of OS procs or their ranges, and float lists enclosed in brackets {}. Example: proclist=[7,4-6,{0,1,2,3}] maps OpenMP thread 0 to OS proc 7, threads 1, 2 and 3 to procs 4, 5 and 6, respectively, and thread 4 is allowed to float between procs 0, 1, 2 and 3.

Argument: type
Default: none
Description:
  type=compact assigns each OpenMP thread to a thread context as close as possible to the previous thread. This type is beneficial for compute-intensive calculations.
  type=scatter is the opposite of compact: OpenMP threads are distributed as evenly as possible across the system. This type is beneficial for bandwidth-bound applications.
  type=balanced is supported only in the MIC architecture and is a compromise between compact and scatter.
  type=explicit assigns thread affinity according to the list specified in the proclist= modifier.
  type=disabled completely disables the affinity interface and forces the OpenMP library to behave as if the affinity interface was not supported by the operating system.
  type=none does not bind OpenMP threads to particular thread contexts. The compiler still uses the OpenMP thread affinity interface to determine the machine topology, unlike with type=disabled.

Argument: permute
Default: 0
Description: for compact and scatter affinity maps, controls which levels are most significant when sorting the machine topology map. A value for permute forces the mappings to make the specified number of most significant levels of the sort the least significant, and it inverts the order of significance. The root node of the tree is not considered a separate level for the sort operations.

Argument: offset
Default: 0
Description: indicates the starting position (proc ID) for thread assignment.

Table 4.3: Arguments of the KMP_AFFINITY environment variable. This summary table is based on the complete description in the Intel C++ Compiler Reference Guide [58].


Note that in order to use different affinity masks on the host and on coprocessors with offload applications,
the environment variable MIC_ENV_PREFIX may be set (see Section 2.2.5). For example, the setup in
Listing 4.46 results in affinity type compact on the host and balanced on the coprocessor.

user@host% export MIC_ENV_PREFIX=MIC


user@host% export KMP_AFFINITY=compact,granularity=fine
user@host% export MIC_KMP_AFFINITY=balanced,granularity=fine

Listing 4.46: Using MIC_ENV_PREFIX to set different affinity masks on the host and on the coprocessor.

In most cases, affinity types compact, scatter or balanced, possibly combined with offset, allow the programmer to set up an efficient thread affinity mask. The examples below illustrate the usage of KMP_AFFINITY for commonly encountered cases.

Example: Bandwidth-Bound Applications and KMP_AFFINITY=scatter

In applications bound by memory bandwidth, it is usually beneficial to use one thread per core or less on the host system and two threads per core or less on the coprocessor. This reduces thread contention on memory controllers. Additionally, affinity type scatter can be used to improve the effective bandwidth. The bandwidth is improved because all memory controllers are utilized uniformly.

Consider the row-wise matrix reduction code from Section 4.4.4. The code shown in Listing 4.44 demonstrated optimal performance on the host and the coprocessor. However, running this code, one may notice that the performance fluctuates from run to run. It is possible to fix the performance at the maximum level by setting a thread affinity mask. We demonstrate this in Listing 4.47.



user@host$ export OMP_NUM_THREADS=32
user@host$ export KMP_AFFINITY=none
user@host$ ./rowsum_stripmine; for i in {1..5} ; do ./rowsum_stripmine | tail -1; done
Problem size: 2.980 GB, outer dimension: 4, threads: 32
Strip-mine and collapse: 0.061 +/- 0.002 seconds (52.89 +/- 1.31 GB/s)
Strip-mine and collapse: 0.059 +/- 0.002 seconds (54.11 +/- 1.56 GB/s)
Strip-mine and collapse: 0.077 +/- 0.001 seconds (41.71 +/- 0.69 GB/s)
Strip-mine and collapse: 0.079 +/- 0.002 seconds (40.42 +/- 1.01 GB/s)
Strip-mine and collapse: 0.070 +/- 0.005 seconds (45.59 +/- 3.14 GB/s)
Strip-mine and collapse: 0.077 +/- 0.001 seconds (41.43 +/- 0.75 GB/s)
user@host$
user@host$ export OMP_NUM_THREADS=16
user@host$ export KMP_AFFINITY=scatter
user@host$ ./rowsum_stripmine; for i in {1..5}; do ./rowsum_stripmine | tail -1 ; done
Problem size: 2.980 GB, outer dimension: 4, threads: 16
Strip-mine and collapse: 0.059 +/- 0.004 seconds (54.47 +/- 3.25 GB/s)
Strip-mine and collapse: 0.059 +/- 0.002 seconds (54.01 +/- 1.81 GB/s)
Strip-mine and collapse: 0.061 +/- 0.004 seconds (52.30 +/- 3.30 GB/s)
Strip-mine and collapse: 0.062 +/- 0.005 seconds (51.37 +/- 4.29 GB/s)
Strip-mine and collapse: 0.060 +/- 0.002 seconds (53.59 +/- 2.13 GB/s)
Strip-mine and collapse: 0.058 +/- 0.001 seconds (55.48 +/- 1.27 GB/s)
user@host$

Listing 4.47: Setting a thread affinity of type scatter in order to improve the performance of bandwidth-bound applica-
tion. Notice how the bandwidth fluctuates without thread affinity, but remains high with KMP_AFFINITY=scatter.

An additional case study of affinity settings in bandwidth-bound applications on a four-way NUMA system can be found in the Colfax Research publication [59].


Example: Compute-Bound Calculations and KMP_AFFINITY=compact/balanced


For applications in which cache utilization and arithmetic performance are more important than memory
bandwidth, it is beneficial to place threads close to each other on the processors.
The code in Listing 4.48 performs and benchmarks a DGEMM calculation. DGEMM is a highly
arithmetically intensive compute-bound problem, and its performance benefits from the affinity mask of type
compact.

1  #include <mkl.h>
2  #include <stdio.h>
3  #include <omp.h>
4
5  int main() {
6    const int N = 10000; const int Nld = N+64;
7    const char tr='N'; const double v=1.0;
8    double* A = (double*)_mm_malloc(sizeof(double)*N*Nld, 64);
9    double* B = (double*)_mm_malloc(sizeof(double)*N*Nld, 64);
10   double* C = (double*)_mm_malloc(sizeof(double)*N*Nld, 64);
11   _Cilk_for (int i = 0; i < N*Nld; i++) A[i] = B[i] = C[i] = 0.0f;
12   int nIter = 10;
13   for(int k = 0; k < nIter; k++)
14   {
15     double t1 = omp_get_wtime();
16     dgemm(&tr, &tr, &N, &N, &N, &v, A, &Nld, B, &Nld, &v, C, &N);
17     double t2 = omp_get_wtime();
18     double flopsNow = (2.0*N*N*N+1.0*N*N)*1e-9/(t2-t1);
19     printf("Iteration %d: %.1f GFLOP/s\n", k+1, flopsNow);
20   }
21   _mm_free(A); _mm_free(B); _mm_free(C);
22 }

Listing 4.48: Code bench-dgemm.cc, a benchmark of the DGEMM function in Intel MKL.
In Listing 4.49, the code bench-dgemm.cc is compiled as a native application for Intel Xeon Phi coprocessors, and executed on the coprocessor. First, the application is executed without an affinity mask. Then the affinity mask is specified by passing the environment variable KMP_AFFINITY=compact to the coprocessor. The effect of this optimization is a factor of two speedup.



user@host$ icpc -o bench-dgemm -mkl -mmic bench-dgemm.cc


user@host$ micnativeloadex ./bench-dgemm
Iteration 1: 312.7 GFLOP/s
Iteration 2: 346.5 GFLOP/s
Iteration 3: 348.5 GFLOP/s
Iteration 4: 347.2 GFLOP/s
Iteration 5: 348.3 GFLOP/s

user@host$ micnativeloadex ./bench-dgemm -e "KMP_AFFINITY=compact"


Iteration 1: 626.8 GFLOP/s
Iteration 2: 769.1 GFLOP/s
Iteration 3: 769.4 GFLOP/s
Iteration 4: 769.3 GFLOP/s
Iteration 5: 769.4 GFLOP/s

Listing 4.49: Compiling the DGEMM benchmark bench-dgemm.cc as a native Intel Xeon Phi coprocessor application
and running it on the coprocessor in two modes: without an affinity mask, and with an affinity mask of type compact.


Example: Multiple Processes in a NUMA System


Consider a situation when multiple independent processes operate in a NUMA system. This situation may
occur, for example, in batch processing tasks. Suppose that none of the concurrently running processes need
access to the whole system memory or to all CPU cores. In this case we can partition the system’s memory and
cores, and bind the concurrent processes to their respective partitions. This way, we guarantee that processes
will operate on their NUMA local memory, which is beneficial for performance. The KMP_AFFINITY
variable may be used for that. Listing 4.50 shows a code that calculates a Discrete Fast Fourier Transform
(DFFT) of a large one-dimensional array using the FFT functionality of Intel MKL. The array size is 4 GB.

1  #include <omp.h>
2  #include <math.h>
3  #include <stdio.h>
4  #include "mkl_dfti.h"
5
6  int main(int argc, char** argv) {
7    const size_t n = 1L<<30L;
8    const char* def = "(single instance)";
9    const char* inst = (argc < 2 ? def : argv[1]);
10   const double flopsPerTransfer = 2.5*log2((double)n)*n;
11   float *x = (float*)malloc(sizeof(float)*n);
12   _Cilk_for(int i = 0; i < n; i++) x[i] = 1.0f;
13   DFTI_DESCRIPTOR_HANDLE fftHandle;
14   MKL_LONG size = n;
15   DftiCreateDescriptor ( &fftHandle, DFTI_SINGLE, DFTI_REAL, 1, size);
16   DftiCommitDescriptor ( fftHandle );
17   const int nTrials = 5;
18   for (int t = 0; t < nTrials; t++) {
19     const double t1 = omp_get_wtime();
20     DftiComputeForward ( fftHandle, x );
21     const double t2 = omp_get_wtime();
22     const double gflops = flopsPerTransfer*1e-9/(t2-t1);
23     printf("Instance %s, iteration %d: %.3f ms (%.1f GFLOP/s)\n",
24            inst, t+1, 1e3*(t2-t1), gflops);
25   }
26   DftiFreeDescriptor ( &fftHandle );
27   free(x);
28 }

Listing 4.50: The code in bench-fft.cc computes a large one-dimensional Discrete Fast Fourier Transform (DFFT).

user@host$ icpc -o bench-fft -mkl bench-fft.cc


user@host$ export KMP_AFFINITY=
user@host$ export MKL_NUM_THREADS=32
user@host$ ./bench-fft
Instance (single instance), iteration 1: 2718.542 ms (29.6 GFLOP/s)
Instance (single instance), iteration 2: 2670.348 ms (30.2 GFLOP/s)
Instance (single instance), iteration 3: 2728.801 ms (29.5 GFLOP/s)
Instance (single instance), iteration 4: 2734.630 ms (29.4 GFLOP/s)
Instance (single instance), iteration 5: 2750.043 ms (29.3 GFLOP/s)

Listing 4.51: Running the FFT benchmark bench-fft.cc using all available logical cores.


In Listing 4.51, we benchmark this application using 32 threads, which is equal to the total number of
available logical cores on the system. The test platform is a system with two Intel Xeon processors, each
containing eight cores with two-way hyper-threading. We do not use thread affinity in this case.
Now suppose that we have to calculate a large number of such FFTs, and all calculations are completely
independent. We can speed up the calculation by using multiple processes with fewer threads, as shown in
Listing 4.52. This time, we are using a script called fftrun_noaffinity to launch two instances of the
application with 16 threads each.

user@host$ cat fftrun_noaffinity
#!/bin/bash
terminate()
{
killall ./bench-fft
}
trap terminate SIGINT
export MKL_NUM_THREADS=16
export KMP_AFFINITY=
./bench-fft 1 &
./bench-fft 2 &
wait
user@host$ ./fftrun_noaffinity
Instance 2, iteration 1: 4347.238 ms (18.5 GFLOP/s)
Instance 1, iteration 1: 4703.092 ms (17.1 GFLOP/s)
Instance 2, iteration 2: 4300.715 ms (18.7 GFLOP/s)
Instance 1, iteration 2: 4995.914 ms (16.1 GFLOP/s)
Instance 2, iteration 3: 4237.673 ms (19.0 GFLOP/s)
Instance 1, iteration 3: 4377.071 ms (18.4 GFLOP/s)
Instance 2, iteration 4: 4334.670 ms (18.6 GFLOP/s)
Instance 1, iteration 4: 4498.975 ms (17.9 GFLOP/s)
Instance 2, iteration 5: 4182.502 ms (19.3 GFLOP/s)
Instance 1, iteration 5: 3749.636 ms (21.5 GFLOP/s)



Listing 4.52: Running the FFT benchmark bench-fft.cc as two processes with 16 threads each.

Note that the average performance of the code in the case of two 16-threaded processes is about 18.5 GFLOP/s per process, which amounts to roughly 37 GFLOP/s for the whole system. This is 25% better than the performance of a single 32-threaded process reported in Listing 4.51. The cause of the performance difference is the multi-threading and NUMA overhead. This overhead is greater in a single process with 32 threads than in each of the 16-threaded processes.
In Listing 4.52, we did not restrict the affinity of the threads. This means that some threads were accessing
non-local NUMA memory, incurring a performance hit. In order to optimize the execution of the benchmark,
we can set the environment variable KMP_AFFINITY as shown in Listing 4.53.
The average performance now (Listing 4.53) is 19.4 GFLOP/s, which is 5% better than without the
affinity mask (Listing 4.52). The affinity mask requested here is of type compact, which places the threads
as close to one another as possible. With MKL_NUM_THREADS=16, all threads will end up on the same
CPU socket, because the CPU supports 16 hyper-thread contexts. The setting granularity=fine forbids
threads from moving across hyper-thread contexts. The numbers 0,0 and 0,16 tell the OpenMP library
to start placing threads from OS proc 0 (in the first case) or from OS proc 16 (in the second case). These
two numbers are the permute and offset arguments of KMP_AFFINITY. Note that we must specify
permute in order to set the value of offset, because if we specified only a single number, it would have
been interpreted as permute.


user@host$ cat myrun_affinity


#!/bin/bash
terminate()
{
killall ./bench-fft
}
trap terminate SIGINT
export MKL_NUM_THREADS=16
export KMP_AFFINITY=granularity=fine,compact,0,0
./bench-fft 1 &
export KMP_AFFINITY=granularity=fine,compact,0,16
./bench-fft 2 &
wait
user@host$ ./myrun_affinity
Instance 1, iteration 1: 4206.466 ms (19.1 GFLOP/s)
Instance 2, iteration 1: 4213.147 ms (19.1 GFLOP/s)
Instance 1, iteration 2: 4059.701 ms (19.8 GFLOP/s)
Instance 2, iteration 2: 4094.301 ms (19.7 GFLOP/s)
Instance 1, iteration 3: 4073.644 ms (19.8 GFLOP/s)
Instance 2, iteration 3: 4144.214 ms (19.4 GFLOP/s)
Instance 1, iteration 4: 4159.326 ms (19.4 GFLOP/s)
Instance 2, iteration 4: 4228.130 ms (19.0 GFLOP/s)
Instance 1, iteration 5: 4221.576 ms (19.1 GFLOP/s)
Instance 2, iteration 5: 4162.325 ms (19.3 GFLOP/s)
Listing 4.53: Running the FFT benchmark bench-fft.cc as two processes with 16 threads each, using the KMP_AFFINITY variable in order to bind each process to the respective CPU socket.

If the available RAM size permits, it may be possible to further improve the performance of this problem by increasing the number of processes while proportionally reducing the number of threads per process. While doing so, it is beneficial to set the affinity mask of each process in order to reserve a certain partition of the system's resources for it.

This example and optimization method are motivated by an astrophysical research project reported on in [60]. Note that the optimization demonstrated above is not specific to the computational problem of FFT; it applies to any other application in a NUMA system which can be run as multiple independent instances.


Affinity and Number of Threads in Offload Applications


In offload applications for Intel Xeon Phi coprocessors, one coprocessor core (usually the core with the greatest number) is reserved for offload-related tasks. By default, this core is excluded from the OpenMP affinity mask, and the number of threads in offload applications defaults to 4 · (N − 1) as opposed to 4N for native applications. Here N is the number of cores, and the factor 4 reflects four-way hyper-threading in Intel Xeon Phi coprocessors. For example, on a 60-core coprocessor (N = 60), the default number of OpenMP threads in native applications is 240, and in offload applications, 236.

Setting the affinity mask using the KMP_AFFINITY variable (or MIC_KMP_AFFINITY if the variable MIC_ENV_PREFIX=MIC is set as shown in Section 2.2.5) may override the default behavior and schedule calculations on the core reserved for offload-related tasks of MPSS. If this happens, the application may be throttled down by the impeded offload functionality. In order to avoid scheduling calculations on the MPSS core when using MIC_KMP_AFFINITY, it is recommended to set the variable MIC_PLACE_THREADS=59C,4t, which explicitly requests to use 59 cores with 4 threads on each.
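For illustration, a minimal environment setup for an offload application on a 60-core coprocessor might look as follows; this is a sketch that uses the variable names given above, and the executable name offload_app is a placeholder:

user@host$ export MIC_ENV_PREFIX=MIC
user@host$ export MIC_KMP_AFFINITY=balanced,granularity=fine
user@host$ export MIC_PLACE_THREADS=59C,4t
user@host$ ./offload_app

With this setup, the affinity settings apply only to the coprocessor side of the offload application, and the core reserved for MPSS is left free.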

4.4.6 Diagnosing Parallelization Problems with Scalability Tests

In the process of porting and optimizing applications on Intel Xeon processors and Intel Xeon Phi coprocessors, the programmer must ensure good parallel scalability of the application. On the host, applications must efficiently scale to 16-32 threads in order to harness the task parallelism of Intel Xeon processors. On a 60-core coprocessor, the application must have good scaling up to 120-240 threads in order to reap the benefits of parallelism. Excessive synchronization, insufficient exposed parallelism and false sharing may limit the parallel scalability and prevent performance gains on the Intel Xeon Phi architecture.
The following simple test may help to assess the need for shared-memory algorithm optimization. This test works in most OpenMP applications without any special tools. First, the application must be benchmarked with a single thread by setting OMP_NUM_THREADS=1. Then, the benchmark must be run with 32 OpenMP threads on the host and 240 threads (or 236 for offload applications) on the coprocessor. For compute-bound applications, KMP_AFFINITY=granularity=fine,compact may be used. In a highly scalable application, the performance difference between a single-threaded and multi-threaded calculation must be a factor of 16 or more on the host, and a factor of 120 or more on the coprocessor. If the performance difference is not significant, it is worthwhile investigating the common shared-memory pitfalls discussed in Section 4.4.
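On the host, such a test could be scripted as follows; this is a sketch, and my_benchmark stands for any OpenMP application that reports its own performance metric:

user@host$ export KMP_AFFINITY=granularity=fine,compact   # for a compute-bound code
user@host$ OMP_NUM_THREADS=1  ./my_benchmark               # single-threaded baseline
user@host$ OMP_NUM_THREADS=32 ./my_benchmark               # expect roughly 16x or more

The same procedure can be repeated natively on the coprocessor with OMP_NUM_THREADS=240, comparing against the single-threaded run.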

4.5 Memory Access: Computational Intensity and Cache Management


The estimate of the theoretical peak performance in double precision (64-bit floating-point numbers) for
a 60-core Intel Xeon Phi coprocessor clocked at 1.0 GHz and utilizing 512-bit SIMD registers is

Arithmetic Performance = 60 × 1.0 × (512/64) × 2 = 960 GFLOP/s. (4.4)

Here, the factor 2 assumes that the fused multiply-add operation is employed, performing two floating-point
operations per cycle. At the same time, the peak memory bandwidth of this system performing 6.0 GT/s using
8 memory controllers with 2 channels in each, working with 4 bytes per channel, is

Memory Bandwidth = η × 6.0 × 8 × 2 × 4 = η × 384 GB/s, (4.5)

where η is the practical efficiency of bandwidth accessibility, η ≈ 0.5. This amounts to 0.5 × 384/8 = 24
billion floating-point numbers per second (in double precision). Therefore, in order to sustain optimal load
on the arithmetic units of an Intel Xeon Phi coprocessor, the code must be tuned to perform no less than
960/24 = 40 floating-point operations on every number fetched from the main memory. In comparison, a

system based on two eight-core Intel Xeon E5 processors clocked at 3.0 GHz delivers up to

Arithmetic Performance = 2 sockets × 8 × 3.0 × (512/64) × 2 / 2 = 2 sockets × 8 × 3.0 × (256/64) × 2 = 384 GFLOP/s, (4.6)

with a memory bandwidth

Memory Bandwidth = 2 sockets × η × 6.4 × 8 = η × 102 GB/s, (4.7)
where the additional factor of 2 in the estimate of performance reflects the presence of two ALUs (Arithmetic Logic Units) in each Sandy Bridge processor. Even though this processor does not have an FMA instruction, xAXPY-like algorithms may favorably utilize the processor's pipeline and employ both ALUs.
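By the same estimate as above, and assuming the same bandwidth efficiency η ≈ 0.5, the host must perform roughly 384/(0.5 × 102/8) ≈ 60 floating-point operations per double-precision number fetched from main memory in order to keep its arithmetic units busy.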
Generally, the more arithmetic operations per memory access a code performs, the easier it is to fully utilize the arithmetic capabilities of the processor. That is, high arithmetic intensity applications tend to be compute-bound. In contrast, low arithmetic intensity applications are bandwidth-bound, if they access memory in a streaming manner, or latency-bound if their memory access pattern is irregular.
The relationship between the arithmetic intensity and the resource limitation of an application can be better understood with the help of the roofline model. The roofline model is a theoretical tool for assessing the optimization options of HPC applications. This model was suggested by Williams, Waterman & Patterson [61]. In order to build the roofline model for a specific architecture, one plots two lines on a log-log graph where the arithmetic intensity is plotted along the horizontal axis, and performance (in GFLOP/s) on the vertical axis. The two lines correspond to the bandwidth (a line with a unit slope normalized to the expected bandwidth) and to the peak arithmetic performance (a horizontal line normalized to the expected peak performance). The point where these two lines intersect corresponds to the maximum bandwidth and maximum arithmetic performance. An example of the roofline model is shown in Figure 4.15.

Figure 4.15: Basic roofline model for a host with two Intel Xeon E5 processors and for an Intel Xeon Phi coprocessor.

The utility of the roofline model plot is in its predictive power in the selection of code optimization options. Any application can be thought of as a column in this plot positioned at the arithmetic intensity of the


application and extending upwards until it hits the “roof” represented by the model. If the column hits the sloping part of the roof (the bandwidth line), then the application is bandwidth-bound. Such an application may be optimized by improving the memory access pattern to boost the bandwidth or by increasing the arithmetic intensity. If the column hits the horizontal part of the roof (the performance line), then the application is compute-bound. Therefore, it may be optimized by improving the arithmetic performance through vectorization, utilization of specialized arithmetic instructions, or other arithmetic-related methods.
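For instance, the row-wise reduction kernel of Listing 4.44 performs only one addition for every 8-byte matrix element it reads, so its column falls on the sloping (bandwidth) part of the roof on both systems; this is consistent with the bandwidth-limited behavior of that code observed in Section 4.4.4.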
The roofline model can be extended by adding ceilings to the model. Figure 4.16 demonstrates an extended roofline model for the host system with two Intel Xeon E5 processors and for a single Intel Xeon Phi coprocessor. In this figure, an additional model is produced by introducing a realistic memory bandwidth efficiency η=50%. Additionally, we introduced a ceiling “without FMA” for the coprocessor and “one ALU” for the host. One of these ceilings corresponds to applications that do not employ the fused multiply-add operation on the coprocessor, or do not fill the host processor pipeline in a fashion that utilizes both arithmetic logic units (ALUs) of Sandy Bridge processors. This assumption reduces the maximum arithmetic performance by approximately a factor of 2. Another ceiling additionally assumes that the application is scalar, i.e., does not use vector instructions. In double precision, this reduces the theoretical peak performance on the host by a factor of 4 and on the coprocessor by a factor of 8 (see Section 1.3.2 for additional discussion on this subject).


Figure 4.16: Extended roofline model for a host with two Intel Xeon E5 processors and for an Intel Xeon Phi coprocessor
with a realistic bandwidth efficiency factor and additional ceilings.

The information in the roofline model plot can be used in order to predict which optimizations are likely to benefit a given application. It also indicates the threshold arithmetic intensity at which the workload transitions from bandwidth-bound to compute-bound. The arithmetic intensity is a property of the numerical algorithm and can be varied for algorithms more expensive than O(N). Code optimizations that improve the memory performance and increase the arithmetic intensity are presented in this section.


4.5.1 Cache Organization on Intel Xeon Processors and Intel Xeon Phi Coprocessors
Intel Xeon processors and Intel Xeon Phi coprocessors have a similar memory hierarchy: a large but relatively slow RAM is cached by a smaller, faster L2 cache, which, in turn, is cached by an even smaller and even faster L1 cache, which resides in direct proximity of the core registers. See Figure 1.6 for the Knights Corner core topology, Table 1.2 for technical specifications, and Table 1.1 for cache properties.
One aspect of cache organization distinguishes Intel Xeon processors from Intel Xeon Phi coprocessors. Intel Xeon processors have a last-level cache symmetrically shared between all cores, while in Intel Xeon Phi coprocessors, the L2 cache can be viewed as slices local to every core and connected via an IPN (Figure 1.5 illustrates the die layout).

4.5.2 Cache Misses


Any algorithm that operates on data in RAM incurs cache misses when the data is loaded from RAM into all levels of cache hierarchy for the first time. Additionally, if the data set does not fit in the cache, the algorithm will incur additional cache misses as it processes the data, because the same data may be evicted from cache and fetched from RAM or lower-level cache multiple times. Every cache miss on a read operation makes the core stall until the data requested by the core is fetched from memory. A cache miss on a write does not necessarily stall the core, because the core may not need to wait until the data is written.

The latencies of communication with caches can be masked (i.e., overlapped with calculations). In both Intel Xeon processors and Intel Xeon Phi coprocessors, hyper-threading is used in order to allow one parallel thread to use the core while the other thread is waiting for data to be read or written. Additionally, Intel Xeon cores are out-of-order processors, and, depending on the algorithm, they can execute other instructions (sometimes speculatively) while a cache miss is being processed. This is not true of Intel Xeon Phi cores, which are in-order processors.


It is usually feasible to estimate the theoretical minimum of cache misses for any given algorithm. Furthermore, it is often possible to reduce the occurrence of cache misses in an algorithm by changing the order of operations. The techniques for such optimizations include:

1) permuting nested loops when it improves the locality of data access;

2) loop tiling (also known as loop blocking) for algorithms with nested loops acting on multi-dimensional arrays;

3) recursive cache-oblivious algorithms;

4) loop fusion and inter-procedural optimization.

These methods are described in Sections 4.5.3, 4.5.4, 4.5.5 and 4.5.6.


4.5.3 Loop Interchange (Permuting Nested Loops)


Loop interchange is an optimization technique in which the order of nesting in nested loops is inverted. In some cases, loop interchange can be performed automatically by the compiler, without the programmer's participation. However, as we will see in Example 1 in Section 4.5.4, the compiler may be unable to detect some situations in which loop interchange is favorable. In this section we illustrate how changing the order of loop nesting can improve cache traffic, and how cache problems can be diagnosed using the VTune Amplifier tool. Consider the problem of matrix-vector multiplication illustrated in Listing 4.54.

1  #include <omp.h>
2  #include <stdio.h>
3
4  void loop1(int n, double* M, double* a, double* b){
5    // More optimized: unit-stride access to matrix M
6    for (int i=0; i<n; i++)
7      for (int j=0; j<n; j++)
8        b[i]+=M[i*n+j]*a[j];
9  }
10
11 void loop2(int n, double* M, double* a, double* b){
12   // Less optimized: stride n access to matrix M
13   for (int j=0; j<n; j++)
14     for (int i=0; i<n; i++)
15       b[i]+=M[i*n+j]*a[j];
16 }
17
18 int main(){
19   const int n=10000, nl=10; // n is the problem size
20   double t0, t;
21   double* M=(double*)malloc(sizeof(double)*n*n);
22   double* a=(double*)malloc(sizeof(double)*n);
23   double* b=(double*)malloc(sizeof(double)*n);
24   M[0:n*n]=0; a[0:n]=0; b[0:n]=0;
25
26   t=0; // Benchmarking loop 1 (nl runs for statistics)
27   for (int l=0; l<nl; l++) {
28     t0=omp_get_wtime();
29     loop1(n, M, a, b);
30     t+=(omp_get_wtime()-t0)/(double)nl;
31   }
32   printf("Loop 1 (stride 1): %.3f s (%.2f GFLOP/s)\n",
33          t, (double)(2*n*n)/t*1e-9);
34
35   t=0; // Benchmarking loop 2 (nl runs for statistics)
36   for (int l=0; l<nl; l++) {
37     t0=omp_get_wtime();
38     loop2(n, M, a, b);
39     t+=(omp_get_wtime()-t0)/(double)nl;
40   }
41   printf("Loop 2 (stride n): %.3f s (%.2f GFLOP/s)\n",
42          t, (double)(2*n*n)/t*1e-9);
43
44   free(b); free(a); free(M);
45 }

Listing 4.54: matvec-miss.cc executes and times a serial matrix-vector multiplication. Two implementations are
tested: in loop1(), the j loop is nested inside the i loop, and in loop2(), the i-loop is nested inside the j-loop.


Cache Misses with Low Compiler Optimization: -O1

First, let us compile this code with the arguments -O1 and -openmp (the latter is needed only to enable the convenient OpenMP timing function omp_get_wtime()). The resulting code will have a low level of optimization, but it will allow us to illustrate the difference between loops 1 and 2. The result is shown in Listing 4.55.

user@host$ icpc -openmp -O1 matvec-miss.cc


user@host$ ./a.out
Loop 1 (stride 1): 0.135 s (1.48 GFLOP/s)
Loop 2 (stride n): 0.370 s (0.54 GFLOP/s)
user@host$
user@host$ icpc -openmp -O1 matvec-miss.cc -mmic
user@host$ scp a.out mic0:~/
a.out 100% 14KB 13.7KB/s 00:00
user@host$ ssh mic0
user@mic0$ ./a.out
Loop 1 (stride 1): 2.312 s (0.09 GFLOP/s)
Loop 2 (stride n): 71.567 s (0.00 GFLOP/s)
user@mic0$

e ng
Listing 4.55: Running the code in Listing 4.54 compiled with the optimization level -O1.
nh
r Yu
If the code in Listing 4.54 is compiled with the optimization level -O1, then on the host system, the function loop2() performs almost three times slower than loop1() due to the cache misses that it incurs. On the coprocessor, loop2() is more than 30x slower than loop1(). The penalty of unoptimized cache traffic is more pronounced on the Intel Xeon Phi architecture.


The performance of loop2() is poor due to the scattered memory access pattern in the nested loops. Indeed, the matrix M is the largest data container in the problem, and the inner i-loop in loop2() accesses this matrix with a stride of n. Such an access pattern is inefficient for two reasons:
a) Every access to memory fetches not one floating-point number, but a whole cache line containing this number. Cache lines are 64 bytes long in Intel Xeon processors and Intel Xeon Phi coprocessors, and they map to contiguous 64 bytes in the main memory. In loop2(), the inner i-loop uses only one floating-point number from the fetched cache line and moves on to fetch another cache line in the next i-iteration. In contrast, in loop1(), the inner loop is the j-loop, which continues to use the cache line that is already in cache, reading 64/sizeof(double)=8 consecutive floating-point numbers. This means that for 8 consecutive iterations of the j-loop, no further cache misses on matrix M are incurred.

b) By the time that the i-loop in loop2() finishes and returns to the same value of i in the next j-iteration, the cache line containing M[i*n+j] and M[i*n+j+1] will likely have been evicted from cache, and a new cache miss will occur.
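As a rough estimate: with n=10000 in Listing 4.54, consecutive accesses M[i*n+j] and M[(i+1)*n+j] in the inner i-loop of loop2() are 10000 × 8 = 80000 bytes apart, so every one of the 10000 inner iterations touches a different cache line. One full sweep of the i-loop therefore pulls in about 10000 × 64 bytes ≈ 640 kB of cache lines, far more than the L1 and L2 capacity available to a core, which is why these lines are evicted before they can be reused at the next value of j.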

Figure 4.17 shows the result of the General Exploration analysis for the Sandy Bridge architecture in Intel VTune Amplifier XE. The important metric indicating the need for cache optimization is the Last-level cache (LLC) hit ratio of 1.0 (see the Summary view screenshot in the top panel). Additionally, the bottom-up view, shown in the bottom screenshot, indicates that function loop2() suffers from a high ratio of memory replacements. In contrast, loop1() has far fewer memory replacements and takes almost 3 times fewer clock cycles.



Figure 4.17: VTune General Exploration for the code in Listing 4.54. Notice how function loop2() incurs far more
memory replacements, LLC hits and DTLB overhead than loop1().


Reduced Cache Misses with -O2

The compiler is often able to detect sub-optimal memory access patterns and optimize them. For instance, at the default optimization level, -O2, the compiler may permute (interchange) the nested for-loops in loop2(), so that memory traffic is optimized. Listing 4.56 shows the result of compiler optimization with the argument -O2. The argument -vec-report is used in order to make it visible that the loop in line 38 was permuted. This information is also included in the optimization report, which can be requested by adding the -opt-report compiler argument.
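For reference, such a report could be requested as follows (a sketch; the exact contents of the report depend on the compiler version):

user@host$ icpc -openmp -O2 -opt-report matvec-miss.cc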

user@host$ icpc -openmp -O2 matvec-miss.cc -vec-report
matvec-miss.cc(24): (col. 7) remark: LOOP WAS VECTORIZED.
matvec-miss.cc(24): (col. 20) remark: LOOP WAS VECTORIZED.
matvec-miss.cc(29): (col. 5) remark: LOOP WAS VECTORIZED.
matvec-miss.cc(38): (col. 5) remark: PERMUTED LOOP WAS VECTORIZED.
matvec-miss.cc(7): (col. 5) remark: LOOP WAS VECTORIZED.
matvec-miss.cc(14): (col. 5) remark: LOOP WAS VECTORIZED.
user@host$ ./a.out
Loop 1 (stride 1): 0.076 s (2.64 GFLOP/s)
Loop 2 (stride n): 0.077 s (2.59 GFLOP/s)
user@host$
user@host$ icpc -openmp -O2 matvec-miss.cc -vec-report -mmic
matvec-miss.cc(24): (col. 7) remark: LOOP WAS VECTORIZED.
matvec-miss.cc(24): (col. 20) remark: LOOP WAS VECTORIZED.
matvec-miss.cc(24): (col. 31) remark: LOOP WAS VECTORIZED.
matvec-miss.cc(29): (col. 5) remark: LOOP WAS VECTORIZED.
matvec-miss.cc(38): (col. 5) remark: PERMUTED LOOP WAS VECTORIZED.
matvec-miss.cc(7): (col. 5) remark: LOOP WAS VECTORIZED.
matvec-miss.cc(14): (col. 5) remark: LOOP WAS VECTORIZED.
user@host$ scp a.out mic0:~/
a.out 100% 22KB 22.0KB/s 00:00
user@host$ ssh mic0
user@mic0$ ./a.out
Loop 1 (stride 1): 0.220 s (0.91 GFLOP/s)
Loop 2 (stride n): 0.220 s (0.91 GFLOP/s)
user@mic0$
Listing 4.56: At optimization levels -O2 and higher, the compiler optimizes the code by permuting (interchanging) lines 13 and 14 in loop2() in order to reduce cache misses. Both loops run more efficiently and equally fast. Compiler argument -g is necessary in order to allow VTune to resolve the symbols in the C code.

With loop permutation automatically performed at optimization level -O2, the performance of both loops
is identical. In addition, both loops gain additional speed due to automatic vectorization, which is enabled at
-O2. Speedup is observed both on the host system, and on the Intel Xeon Phi coprocessor. The performance
boost is more pronounced on the coprocessor.
Note that profiling this optimized code with VTune yields higher values of negative cache performance metrics, such as memory replacements and LLC misses (see Figure 4.18). This may seem counter-intuitive, because the wall clock performance was actually improved. However, high values of these metrics do not mean that the code performs worse. These metrics are relative to the number of instructions retired, and vectorization reduces the number of retired instructions almost by a factor of 4. The fact that LLC misses and memory replacements are still high indicates the need for further improvement, as discussed in the next section.
While the compiler can detect and alleviate cache performance issues such as those shown here, additional techniques can yield even better results. Sections 4.5.4 and 4.5.5 discuss these techniques.



Figure 4.18: VTune General Exploration for the code in Listing 4.54. The compiler argument -O2 makes functions
loop1() and loop2() equally fast. It also enables automatic vectorization, which improves the wall clock performance
by a factor of 2. Cache performance metrics indicate opportunities for further cache optimization.


4.5.4 Loop Tiling (Blocking)


Principles
Loop tiling, also known as loop blocking, is an easy-to-implement technique for algorithms that involve nested loops, multidimensional arrays and regular access patterns. This technique is also known as “strip-mine and permute”, because this method involves strip-mining the outer loop (see Section 4.4.4) and permuting the middle and the inner loops. This technique is schematically illustrated in Listing 4.57.

1  // Plain nested loops
2  for (int i = 0; i < m; i++)
3    for (int j = 0; j < n; j++)
4      compute(a[i], b[j]); // Memory access is unit-stride in j
5
6  // Tiled nested loops
7  for (int ii = 0; ii < m; ii += TILE)
8    for (int j = 0; j < n; j++)
9      for (int i = ii; i < ii + TILE; i++) // Re-use data for each j with several i
10       compute(a[i], b[j]); // Memory access is unit-stride in j
11
12 // Doubly tiled nested loops
13 for (int ii = 0; ii < m; ii += TILE)
14   for (int jj = 0; jj < n; jj += TILE)
15     for (int i = ii; i < ii + TILE; i++) // Re-use data for each j with several i
16       for (int j = jj; j < jj + TILE; j++)
17         compute(a[i], b[j]); // Memory access is unit-stride in j
Listing 4.57: Schematic organization of loop tiling. If the array b[0:n] does not fit in cache, then tiling the outer loop (producing the ii loop) and adding the innermost i loop increases the locality of data access and thus improves cache traffic. Tiling the j-loop may additionally improve traffic.



In order to analyze this optimization, let us assume that the array b does not fit in cache. Then, in the unoptimized version (lines 1 through 4 in Listing 4.57), for every iteration in i, all the data of b will have to be read from memory into cache, evicted from cache and then fetched again in the next i-iteration. Re-organization of the loops with tiling (lines 6-10 in Listing 4.57) ensures that every value of b[j] is used several times before b[j+1] is fetched. In the case with loop tiling, the j-loop may be vectorized. Then every value of b[j]...b[j+V-1] is used several times (here V is the vector register length). Re-use of data reduces the number of times that array b has to be loaded into cache and, thus, improves performance. The loop in j can be tiled in a similar manner (lines 12-17 in Listing 4.57).
Ideally, the data spanned by the innermost loops should utilize the whole cache. This means that the size of the tile must be tuned to the specifics of the computer architecture.
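As a rough, hypothetical sizing example: if the innermost loops of the doubly tiled variant in Listing 4.57 touch TILE elements of a and TILE elements of b in double precision, then fitting both tiles into a 32 KB L1 data cache suggests TILE on the order of 32768/(2 × 8) = 2048 or smaller. In practice the best value should be found empirically, since other data and other hardware threads compete for the same cache.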
Special precautions should be taken with loop tiling:
1) When the values of m and n are not multiples of the tile size TILE, some iterations must be peeled off, as
shown in Listing 4.58.
2) If in the original algorithm, automatic vectorization was used in the inner j loop, then in the tiled version,
automatic vectorization must be applied to the j loop, which is not inner in the tiled version. That can be
achieved either by unrolling the inner i-loop (see Listing 4.59), or by using #pragma simd and making
the value of TILE a constant known at compile time (see Listing 4.60).
3) It is best to make array termination conditions as simple as possible, and use constants for tile sizes, in
order to facilitate automatic vectorization. Additionally, multiversioned, redundant code such as shown in
Listing 4.58 is instrumental for compiler-assisted optimization, even if it does not look elegant.


1 for (int ii = 0; ii < m - m%TILE; ii+=TILE) // m - m%TILE is always a multiple of TILE


2 for (int j = 0; j < n; j++)
3 for (int i = ii; i < ii + TILE; i++)
4 compute(a[i], b[j]);
5
6 for (int i = m - m%TILE; i<m; i++) // Remaining iterations
7 for (int j = 0; j < n; j++)
8 compute(a[i], b[j]);

Listing 4.58: Peeling off the last several iterations of a tiled loop when m is not a multiple of TILE.

1 for (int ii = 0; ii < m; ii += 4) // Explicitly defining TILE=4
2   for (int j = 0; j < n; j++) {
3     // Explicitly unrolling the i-loop in order to retain vectorization in j
4     compute(a[ii + 0], b[j]);
5     compute(a[ii + 1], b[j]);
6     compute(a[ii + 2], b[j]);
7     compute(a[ii + 3], b[j]);
8   }

e ng
Listing 4.59: Explicit unrolling of the inner loop in order to retain vectorization in j.
nh
r Yu
fo

1 const int TILE = 4; // TILE must be a constant in order to unroll the i-loop
2 for (int ii = 0; ii < m; ii += TILE)
3 #pragma simd
4   for (int j = 0; j < n; j++)
5 #pragma unroll(TILE)
6     for (int i = ii; i < ii + TILE; i++) // Compiler can unroll this loop
7       compute(a[i], b[j]);
el
iv

Listing 4.60: Automatic unrolling of the inner loop in order to retain vectorization in j. Note the #pragma simd
us

instruction necessary to vectorize the outer loop. #pragma unroll may not be necessary if the loop body is not too
cl

complex.

Example 1: Loop Interchange and Tiling to Reduce Cache Misses


The code shown in Listing 4.61 is a calculation of the transient emissivity of cosmic dust grains (code
courtesy of T. Porter and A. Vladimirov, Stanford University). This function is an unoptimized implementation
of Equation (56) in [62] with the solution vector calculated based on the works [62] and [63]. The original
code was simplified for use in this example.

void ComputeEmissivity(const int wlBins, double* emissivity, // Output
    const double* wavelength,
    const int gsMax, const double* grainSizeD, const double* absorption,
    const int tempBins, const double* planckFunc, const double* distribution // Input data
    ) {
  // This function computes the emissivity of transient dust grains
  for (int i = 0; i < wlBins; i++) {
    double sum = 0;
    for (int j = 0; j < gsMax; j++) {
      const double gsd = grainSizeD[j];
      const double crossSection = absorption[j*wlBins + i];
      const double product = gsd*crossSection;
      double result = 0;
#pragma vector aligned
      for (int k = 0; k < tempBins; k++)
        result += planckFunc[i*tempBins + k]*distribution[j*tempBins + k];

      sum += result*product;
    }
    emissivity[i] = sum*wavelength[i];
  }
}

Listing 4.61: Unoptimized emissivity calculation.


One does not have to understand the physical meaning of the calculation in Listing 4.61 in order to optimize this code. However, the important aspects of this function are how it is executed and how large the data set is.

• This function is a thread-safe routine. Multiple instances of this function are called from a parallel loop (not shown here).
• In the parallel loop, the input array distribution and the output array emissivity are private to each parallel instance of function ComputeEmissivity().
• All other input arrays are shared between all instances of the function.
• The input data of this function is guaranteed to remain intact for the duration of the execution.
• The values of the parameters in this code are in the neighborhood of wlBins≈gsMax≈tempBins≈128. It is also safe to assume that tempBins is a power of 2.

The size of the most used 2-dimensional arrays in this code, planckFunc and distribution, can be estimated at 128 kB each (128 × 128 double-precision values), and therefore these arrays do not fit into the L1 cache of the host processor, nor that of the coprocessor. Furthermore, with hyper-threading, there may be insufficient L2 cache in each core to hold the arrays belonging to all hyper-threads. Therefore, cache optimizations must be employed in this application.
First, one may consider opportunities for loop interchange. The code of the function has three nested loops in variables i, j and k, respectively. The inner loop in k is executed the greatest number of times, and therefore, optimizing memory access in this loop promises the greatest benefits. Visual inspection of the loop in variable k shows that the loop has unit stride and operates with aligned vector instructions. These are desirable properties, which we would like to preserve. Since we do not want to touch the loop in k, we can attempt to interchange the loops in i and j. This interchange leads to a significant improvement in performance, and we explain this optimization below. Listing 4.62 shows the code of the function with interchanged i- and j-loops.

void ComputeEmissivity(const int wlBins, double* emissivity, // Output
    const double* wavelength,
    const int gsMax, const double* grainSizeD, const double* absorption,
    const int tempBins, const double* planckFunc, const double* distribution // Input data
    ) {
  // This function computes the emissivity of transient dust grains
  // In this version, the i- and j-loops are permuted
  // in order to reduce cache misses upon reading grainSizeD[]
  // and improve the locality of access to absorption[]
  emissivity[0:wlBins] = 0.0;
  for (int j = 0; j < gsMax; j++) {
    const double gsd = grainSizeD[j];
    for (int i = 0; i < wlBins; i++) {
      double result = 0;
#pragma vector aligned
      for (int k = 0; k < tempBins; k++)
        result += planckFunc[i*tempBins + k]*distribution[j*tempBins + k];

      const double crossSection = absorption[j*wlBins + i];
      const double product = gsd*crossSection;
      emissivity[i] += result*product*wavelength[i];
    }
  }
}

Listing 4.62: Emissivity calculation with interchanged i- and j-loops. Note that we moved the access to grainSizeD outside the i-loop and eliminated the variable sum.


The performance of the new code is significantly better than that of the unoptimized version, as shown in Figure 4.19. This is explained by the change in the locality of access to array distribution:

• In the original code, array distribution was read from front to back in every iteration of the outer loop, the i-loop. By the end of the i-iteration, the beginning of the array distribution may have been evicted from the L2 cache, and in the subsequent i-iteration, this array must be fetched into cache again. Therefore, array distribution was read in from memory into the L2 cache up to wlBins≈128 times.
• In contrast, in the optimized code, for every iteration of the outer loop (the j-loop in the optimized version), only the j-th row of array distribution is used. A single row is only sizeof(double) × tempBins ≈ 1 kB long, which is small enough to fit in the L1 cache. Therefore, array distribution is fetched from memory into cache only once in function ComputeEmissivity().

At first glance, it may seem counter-intuitive that this loop interchange results in a performance increase. Indeed, we improved the locality of access to distribution, but ruined the locality of access to the equally large array planckFunc. In order to understand why this optimization was effective, recall that the function ComputeEmissivity() is called from a thread-parallel region, and distribution is private to each thread in the region, while planckFunc is shared between all threads. This explains why it is more important for performance to maintain the locality of access to distribution. Indeed, for four hyper-threads working on a single core of the Intel Xeon Phi coprocessor, only one copy of planckFunc needs to reside in the L2 cache, but four instances of distribution must also be there. Therefore, non-local access to distribution leads to a greater number of cache evictions and fetches in each core than non-local access to planckFunc, and this explains why loop interchange successfully improved performance.
Let us continue with the optimization of this code and apply loop tiling in order to further improve the
efficiency of memory access to array distribution. This optimization is demonstrated in Listing 4.63.
The size of the tile must be chosen empirically, i.e., we must use the tile size that produces the best performance.
Oftentimes, the optimal tile size is a power of 2 not exceeding the number 16. This upper limit is tied to the
number of vector registers in processor and coprocessor cores. In the case of this code, the optimal tile size
turned out to be equal to 8.

void ComputeEmissivity(const int wlBins, double* emissivity, // Output
    const double* wavelength,
    const int gsMax, const double* grainSizeD, const double* absorption,
    const int tempBins, const double* planckFunc, const double* distribution // Input data
    ) {
  // This function computes the emissivity of transient dust grains
  // In this version, loop tiling is implemented in the i-loop
  const int iTile = 8; // Found empirically
  assert(wlBins%iTile == 0);
  emissivity[0:wlBins] = 0.0;
  for (int j = 0; j < gsMax; j++) {
    const double gsd = grainSizeD[j];
    for (int ii = 0; ii < wlBins; ii += iTile) { // i-loop tiling
      double result[iTile]; result[:] = 0.0;
#pragma vector aligned
#pragma simd
      for (int k = 0; k < tempBins; k++)
#pragma novector
        for (int i = ii; i < ii+iTile; i++)
          result[i-ii] += planckFunc[i*tempBins + k]*distribution[j*tempBins + k];

      for (int i = ii; i < ii+iTile; i++) {
        const double crossSection = absorption[j*wlBins + i];
        const double product = gsd*crossSection;
        emissivity[i] += result[i-ii]*product*wavelength[i];
      }
    }
  }
}

Listing 4.63: Emissivity calculation with tiled i-loop. Note that the variable result must now be an array of size equal to the tile size, and that #pragma simd must be used in order to vectorize the k-loop, because now it has another loop nested in its body.

In the optimized code in Listing 4.63, additional performance is gained because the locality of access
to distribution is improved in a higher level of cache hierarchy. Indeed, in the previous version
(Listing 4.62), for every i-iteration, the j-th row of distribution was read from front to back. In the
subsequent i-iteration, this row must be fetched again from the L1 cache into registers and, potentially, from
the L2 cache into the L1 cache. With tiling (Listing 4.63), every time a vector register is filled with the data
of distribution, this register is used iTile=8 times before this register is discarded. It can be seen in
Figure 4.19 that the effect of this optimization is a factor of 1.6x improvement on the coprocessor and 1.4x on
the host system.


Figure 4.19: Performance of the emissivity calculation (Listing 4.61, Listing 4.62 and Listing 4.63) on the host system and on the Intel Xeon Phi coprocessor; the vertical axis shows the execution time in seconds (lower is better). The last case, “tiled j- and i-loops”, can be found in Exercise Section A.4.11.

It is possible to further optimize this code by improving the locality of access to array planckFunc. This can be done by tiling the j-loop. We will not discuss this optimization in the main text and refer the reader to Exercise Section A.4.11, where this optimization can be found along with the full code of this example.

Example 2: Tiling a Parallel For-Loop


Loop tiling can also improve cache traffic when a thread-parallel loop is tiled. In this case, the performance
increase comes from better locality of data in the thread executing the calculation. As an example of this
optimization, consider the problem of in-place parallel transposition of a square matrix. This problem is
especially sensitive to cache optimization, because it is completely devoid of any arithmetic workload. The
unoptimized (i.e., “cache-ignorant”) algorithm for this task is shown in Listing 4.64.

void Transpose(float* const A, const int n){
  _Cilk_for (int i = 0; i < n; i++) {
    for (int j = 0; j < i; j++) {
      const float c = A[i*n + j];
      A[i*n + j] = A[j*n + i];
      A[j*n + i] = c;
    }
  }
}

Listing 4.64: Cache-ignorant (i.e., unoptimized) algorithm of parallel in-place square matrix transposition.

Running on the host system with two Intel Xeon processors and natively on the Intel Xeon Phi coprocessor with n=28000, the code in Listing 4.64 performs as shown in Figure 4.20 for the case “Unoptimized”.
The performance of the code can be improved by loop tiling. In Listing 4.65 we show an optimized implementation of the transposition function with double loop tiling.
#include <cassert>

void Transpose(float* const A, const int n){
  // Tiled algorithm improves data locality by re-using data already in cache
#ifdef __MIC__
  const int TILE = 16;
#else
  const int TILE = 32;
#endif
  // The below restriction can be lifted using an additional peel loop
  assert(n%TILE == 0);
  _Cilk_for (int ii = 0; ii < n; ii += TILE) {
    const int iMax = (n < ii+TILE ? n : ii+TILE);
    for (int jj = 0; jj <= ii; jj += TILE) {
      for (int i = ii; i < iMax; i++) {
        const int jMax = (i < jj+TILE ? i : jj+TILE);
#pragma loop count avg(TILE)
#pragma simd
        for (int j = jj; j < jMax; j++) {
          const float c = A[i*n + j];
          A[i*n + j] = A[j*n + i];
          A[j*n + i] = c;
        }
      }
    }
  }
}

Listing 4.65: Tiled-loop algorithm of parallel in-place square matrix transposition.


Performance results of the tiled algorithm are shown in Figure 4.20, labelled as “Loop tiling”. Loop tiling increases the performance of the code on both the host system and the Intel Xeon Phi coprocessor by a factor of 5x-6x. There are two reasons why tiling is beneficial in this case:

1) First, similarly to the emissivity calculation code in Example 1 (Listing 4.63), tiling improves the locality of data access at the highest level of caching. For instance, the inner j-loop performs a scattered write to elements A[j*n+i]. In the subsequent i-iteration, the j-loop modifies elements A[j*n+i+1]. If the loop is not tiled, then every i-iteration modifies n cache lines in this way, and for a large n, these cache lines must be evicted to the higher-level cache or to the main memory. On the other hand, in a tiled loop, every i-iteration modifies only TILE ≤ 32 cache lines, and for iteration i+1, these cache lines are still in the L1 cache or even in the registers.
2) Another benefit of tiling in this case is that it reduces false sharing. Without tiling, one Intel Cilk Plus worker may be operating on a certain i and writing to A[j*n+i]; at the same time, another worker is processing i+1 and writing to A[j*n+i+1]. There will be collisions in which A[j*n+i] and A[j*n+i+1] are in the same cache line. As we have seen in Section 4.4.2, such situations lead to false sharing, where the processor locks the “dirty” cache line until its modification is propagated across all caches accessing it. This incurs significant performance penalties. With tiling, on the other hand, each worker processes a block of iterations in the ij-space, and therefore false sharing does not occur upon writing to A[i*n+j] or A[j*n+i].

Note that for optimum performance, we use different tile sizes for the host system and for the coprocessor. The values TILE=32 on the host and TILE=16 on the coprocessor were obtained empirically. We only had to test multiples of 16 and 8 for the tile size (on the coprocessor and the processor, respectively), because we intended to have the inner loop automatically vectorized. Additionally, the coprocessor performance is improved in this code with the help of #pragma simd to enforce vectorization, and #pragma loop count to tune the loop to the most frequently used number of iterations. More information about these pragmas can be found in Section 3.1.10 and Section 4.3.2.



4.5.5 Cache-Oblivious Recursive Methods


Principles

Loop tiling is most efficient when the tile size is precisely tuned to the cache size of the system. There
exists another approach to cache traffic optimization, known as cache-oblivious algorithms, in which tuning
does not have to be as precise as with tiling. This approach may provide a solution that is more portable across
different computer architectures.
Cache-oblivious algorithms exploit recursion in order to work efficiently for any size of the cache and
of the problem. These methods were introduced by Prokop [64] and subsequently elaborated by Frigo et
al. [65]. The principle of cache-oblivious algorithms is to recursively divide the data set into smaller and
smaller chunks. Regardless of the cache size of the system, recursion will eventually reach a small enough
data subset that fits into the cache. This property of cache-oblivious algorithms provides portability across
various architectures. Listing 4.66 illustrates this approach.

// Unoptimized algorithm
void CalculationUnoptimized(void* data, const int size) {
  for (int i = 0; i < size; i++) {
    // ... perform work
  }
}

// Optimized recursive cache-oblivious algorithm
void CalculationOptimized(void* data, const int size) {
  // Initiate recursion
  CalculationRecurse(data, 0, size);
}

// The recursive function
void CalculationRecurse(void* data, const int iStart, const int iEnd) {
  if (iEnd - iStart < recursionThreshold) {
    for (int i = iStart; i < iEnd; i++) {
      // ... perform work
    }
  } else {
    // Recursively split the data set
    CalculationRecurse(data, iStart, (iStart+iEnd)/2);
    CalculationRecurse(data, (iStart+iEnd)/2, iEnd);
  }
}

Listing 4.66: Schematic recursive cache-oblivious algorithm.

In practice, decomposing the problem into size 1 operations is not optimal, as the overhead of thread
creation and function calls may outweigh the benefit of cache-efficient data handling. In addition, size 1
operations prevent vectorization. Therefore, it is beneficial to introduce a threshold of the problem size, at
which the recursion stops, and another algorithm is applied to the reduced problem. This is the meaning of the
condition (iEnd - iStart < recursionThreshold). The value of the recursion threshold must
be chosen empirically; however, it usually does not have to be tuned to the architecture as precisely as the tile
size in tiled algorithms.
Parallelization of the divide-and-conquer approach of cache-oblivious algorithms is straightforward in
the fork-join model of parallelism. Indeed, with the spawning functionality of Intel Cilk Plus it is easy to
spawn off the first of the two recursive calls, as illustrated in the following example.


Example
A recursive cache-oblivious algorithm for matrix transposition was proposed and evaluated by Tsifakis,
Rendell & Strazdins [66]. An implementation of this algorithm employing Intel Cilk Plus is shown in
Listing 4.67. Note that this algorithm may be further improved by eliminating redundant forks, unnecessary
transposition of the diagonal elements, introducing a blocked, multiversioned code for the inner loop in the
non-recursive transposition engine, etc. However, the implementation presented here is intentionally kept
simple in order to convey the underlying principle.

void transpose_cache_oblivious_thread(
    const int iStart, const int iEnd,
    const int jStart, const int jEnd,
    float* A, const int n){
#ifdef __MIC__
  const int RT = 64; // Recursion threshold on coprocessor
#else
  const int RT = 32; // Recursion threshold on host
#endif
  if ( ((iEnd - iStart) <= RT) && ((jEnd - jStart) <= RT) ) {
    for (int i = iStart; i < iEnd; i++) {
      int je = (jEnd < i ? jEnd : i);
#pragma simd
      for (int j = jStart; j < je; j++) {
        const float c = A[i*n + j];
        A[i*n + j] = A[j*n + i];
        A[j*n + i] = c;
      }
    }
    return;
  }

  if ((jEnd - jStart) > (iEnd - iStart)) {
    // Split into subtasks j-wise
    int jSplit = jStart + (jEnd - jStart)/2;
    // The following line improves performance by splitting on aligned boundaries
    if (jSplit - jSplit%16 > jStart) jSplit -= jSplit%16;
    _Cilk_spawn transpose_cache_oblivious_thread(iStart, iEnd, jStart, jSplit, A, n);
    transpose_cache_oblivious_thread(iStart, iEnd, jSplit, jEnd, A, n);
  } else {
    // Split into subtasks i-wise
    int iSplit = iStart + (iEnd - iStart)/2;
    // The following line improves performance by splitting on aligned boundaries
    if (iSplit - iSplit%16 > iStart) iSplit -= iSplit%16;
    const int jMax = (jEnd < iSplit ? jEnd : iSplit);
    _Cilk_spawn transpose_cache_oblivious_thread(iStart, iSplit, jStart, jMax, A, n);
    transpose_cache_oblivious_thread(iSplit, iEnd, jStart, jEnd, A, n);
  }
}

void Transpose(float* const A, const int n){
  transpose_cache_oblivious_thread(0, n, 0, n, A, n);
}

Listing 4.67: Cache-oblivious recursive algorithm of parallel in-place square matrix transposition.

In order to effect parallel recursion, _Cilk_spawn is used to asynchronously execute the first of
the recursive functions. The problem splitting direction alternates between horizontal (i-wise) and vertical
(j-wise) sectioning of the matrix. Race conditions are avoided by ensuring that jMax<=iSplit.


The following fine-tuning aspects of the code in Listing 4.67 are crucial for performance:

1) Recursion stops when the subset of the problem is reduced to the threshold set by the variable RT.
2) The threshold RT is chosen empirically and has different optimal values on the host and on the coprocessor.
3) The smallest problem partition is processed serially in each Intel Cilk Plus strand, utilizing vector instructions for streaming reads and scattered writes.
4) Just like with the tiled algorithm, we found #pragma simd and #pragma loop count to be beneficial for the performance on the coprocessor.
5) Additionally, we facilitate aligned memory accesses on the coprocessor by ensuring that (a) the matrix begins at an aligned address, and (b) the row length and the problem split points are a multiple of the SIMD vector length. This length is equal to 16 for single-precision floating-point numbers on the coprocessor.

Figure 4.20 shows that the performance of this code exceeds that of the tiled code shown in Listing 4.65 by 1.3x on the coprocessor and 1.5x on the host system. The techniques of loop tiling and cache-oblivious recursion benefit the performance of codes both on the host system with Intel Xeon processors and on Intel Xeon Phi coprocessors.

Figure 4.20: Execution time of the parallel, in-place square matrix transposition algorithm on the host system and on the coprocessor for the cases “Unoptimized (cache-ignorant)”, “Loop tiling (cache-aware)” and “Recursion (cache-oblivious)”; time in ms, lower is better. Matrix size in this benchmark is n×n with n=28000. See Listing 4.64, Listing 4.65 and Listing 4.67 for the corresponding source code.

The complete code of the parallel matrix transpose with blocking and recursion, along with step-by-step
optimization instructions, can be found in Exercise Section A.4.11.


4.5.6 Cross-Procedural Loop Fusion


When different stages of data processing are executed in loops of similar structure, it may be beneficial
to fuse such loops. Loop fusion is a procedure through which two disjoint loops are merged into a single loop.
This procedure is often beneficial for performance, because it increases data locality. Listing 4.68 illustrates
loop fusion.

// Two distinct loops operating on the same data
for (int i = 0; i < n; i++)
  ProcessingStage1(inData[i], outData[i]);
for (int i = 0; i < n; i++)
  ProcessingStage2(outData[i]);

// The above code expressed as a fused loop
for (int i = 0; i < n; i++) {
  ProcessingStage1(inData[i], outData[i]);
  ProcessingStage2(outData[i]);
}

Listing 4.68: Illustration of loop fusion.
Loop fusion is beneficial for cache performance, because in the case of two disjoint loops, by the time that the first loop is finished, the beginning of the data set may have been evicted from caches. However, in a fused loop, all stages of data processing occur while the data is still in the caches. In addition, loop fusion may help to reduce the memory footprint of temporary storage if such storage was needed in order to carry some data from one loop to another.
If two loops that are candidates for fusion are located within the same lexical scope, the Intel C++ Compiler may fuse them automatically. The compiler is also capable of some inter-procedural optimization. However, automatic loop fusion may fail if the compiler does not see both loops at compile time (e.g., the loops are located in separate source files), or if additional measures must be taken for value-safe fusion. In cases when automatic loop fusion fails, the programmer may need to implement it explicitly.

Example: Fusing Data Generation and Processing

In the code in Listing 4.69, Intel MKL is used to generate random arrays, and in each array, the mean and
standard deviation of generated numbers is computed. This code is representative of pipelined applications in
which some temporary data are generated and processed in subsequent functions.

#include <omp.h>
#include <malloc.h>
#include <mkl_vsl.h>
#include <math.h>

void GenerateRandomNumbers(const int m, const int n, float* const data) {
  // Filling arrays with normally distributed random numbers
#pragma omp parallel
  {
    VSLStreamStatePtr rng; const int seed = omp_get_thread_num();
    int status = vslNewStream(&rng, VSL_BRNG_MT19937, omp_get_thread_num());

#pragma omp for schedule(guided)
    for (int i = 0; i < m; i++) {
      const float mean = (float)i; const float stdev = 1.0f;
      status = vsRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER,
                             rng, n, &data[i*n], mean, stdev);
    }
    vslDeleteStream(&rng);
  }
}

void ComputeMeanAndStdev(const int m, const int n, const float* data,
                         float* const resultMean, float* const resultStdev) {
  // Processing data to compute the mean and standard deviation
#pragma omp parallel for schedule(guided)
  for (int i = 0; i < m; i++) {
    float sumx=0.0f, sumx2=0.0f;
#pragma vector aligned
    for (int j = 0; j < n; j++) {
      sumx  += data[i*n + j];
      sumx2 += data[i*n + j]*data[i*n + j];
    }
    resultMean[i]  = sumx/(float)n;
    resultStdev[i] = sqrtf(sumx2/(float)n - resultMean[i]*resultMean[i]);
  }
}

void RunStatistics(const int m, const int n,
                   float* const resultMean, float* const resultStdev) {
  // Allocating memory for scratch space for the whole problem:
  // m*n elements on heap (does not fit on stack)
  float* data = (float*) _mm_malloc((size_t)m*(size_t)n*sizeof(float), 64);

  GenerateRandomNumbers(m, n, data);
  ComputeMeanAndStdev(m, n, data, resultMean, resultStdev);

  // Deallocating scratch space
  _mm_free(data);
}

Listing 4.69: Generation and processing of pseudo-random data in a function with disjoint parallel for-loops.


In the above code, function RunStatistics() is the interface function called from external code. This function allocates some temporary data to store random numbers and then calls two functions: one to generate the pseudo-random data, and another to analyze it statistically. In this example, the compiler does not fuse the i-loops in GenerateRandomNumbers() and ComputeMeanAndStdev(). With data generation isolated from data processing, this code is logically structured. However, if array data does not fit into caches, the performance may suffer, because
a) during the execution of GenerateRandomNumbers(), array data will be fetched into caches, but then most of it will be evicted by the time the function terminates;
b) the function ComputeMeanAndStdev() must fetch data from memory once again; and
c) the amount of memory temporarily allocated for data may be excessively large.
In order to improve the code, we can apply loop fusion, as shown in Listing 4.70.

#include <omp.h>
#include <mkl_vsl.h>
#include <math.h>

void RunStatistics(const int m, const int n,
                   float* const resultMean, float* const resultStdev) {
#pragma omp parallel
  {
    // Allocating scratch data, n elements on stack for each thread
    float data[n] __attribute__((aligned(64)));
    // Initializing a random number generator in each thread
    VSLStreamStatePtr rng;
    const int seed = omp_get_thread_num();
    int status = vslNewStream(&rng, VSL_BRNG_MT19937, omp_get_thread_num());

#pragma omp for schedule(guided)
    for (int i = 0; i < m; i++) {
      // Filling arrays with normally distributed random numbers
      const float seedMean = (float)i; const float seedStdev = 1.0f;
      status = vsRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER,
                             rng, n, data, seedMean, seedStdev);
      // Processing data to compute the mean and standard deviation
      float sumx=0.0f, sumx2=0.0f;
#pragma vector aligned
      for (int j = 0; j < n; j++) {
        sumx  += data[j];
        sumx2 += data[j]*data[j];
      }
      resultMean[i]  = sumx/(float)n;
      resultStdev[i] = sqrtf(sumx2/(float)n - resultMean[i]*resultMean[i]);
    }
    vslDeleteStream(&rng);
  }
}

Listing 4.70: Generation and processing of pseudo-random data in a function with fused parallel for-loops.

With loop fusion, both operations, the generation of random numbers and the computation of cumulative statistical information, have to reside in the same function. While this may be “poor style” from the structured programming point of view, this optimization improves cache traffic and allows us to get rid of the huge array data. Indeed, the data for every iteration in i can be generated, used, and discarded before proceeding to the next iteration. Therefore we can allocate only sizeof(float)*T*n bytes of scratch space as opposed to sizeof(float)*m*n bytes, where T is the number of threads in the system, and m is the number of test arrays. We are assuming here that m ≫ T, because otherwise the application will be troubled by insufficient parallelism (see Section 4.4.4).
The effect of loop fusion along with reduced scratch memory footprint is shown in Figure 4.21.

Figure 4.21: Performance of the code generating and analyzing pseudo-random data on the host system and on the Intel Xeon Phi coprocessor (time in ms, lower is better). The unoptimized case (disjoint loops) is shown in Listing 4.69, and the optimized case (fused loops) in Listing 4.70.



The complete working code for this example can be found in Exercise Section A.4.13.

4.5.7 Advanced Topic: Prefetching

Intel Xeon processors and Intel Xeon Phi coprocessors improve cache traffic on the hardware level with the help of hardware prefetchers. These devices monitor the memory access pattern, train on it, and predictively issue requests to fetch data from memory into caches several cycles before this data is used by the core. Intel Xeon processors have L1 and L2 hardware prefetchers, and Intel Xeon Phi coprocessors have only an L2 hardware prefetcher.
In addition to hardware prefetching, Intel Xeon processors and Intel Xeon Phi coprocessors support software prefetch instructions. These instructions request that a certain address (cache line) is fetched from memory into a cache. Software prefetch instructions do not stall execution, and therefore they can be issued many cycles before the fetched line is used by the core. The time between the prefetch instruction and the instruction consuming the data on the core is called the prefetch distance. In loops, prefetch distances are typically measured in the number of iterations between the prefetch instruction and data consumption.
The Intel C++ Compiler automatically inserts prefetch instructions into the compiled code for the MIC architecture at optimization level -O2 and above. The prefetch distance and prefetched variables are computed using heuristics. These heuristics work for array accesses (e.g., A[i][j] or B[i*n+j]) and pointer accesses where the address can be predicted in advance (e.g., *(A+i*n+j)). However, the compiler does not issue prefetch instructions for accesses in the form A[B[i]]. It is possible to see the report on compiler prefetching using the compiler arguments -opt-report3 -opt-report-phase hlo, for instance as shown below.
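For example, a command along the following lines produces the high-level optimization (HLO) report, which includes the prefetch distances chosen by the compiler (the source file name mycode.cc is a placeholder, and the exact report text varies between compiler versions):

user@host% icpc -mmic -O3 -opt-report3 -opt-report-phase hlo -c mycode.cc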
In order to fine-tune the application performance, the programmer may wish to control software prefetching. If this approach is taken, it is advisable to first turn off automatic compiler prefetching using the compiler argument -no-opt-prefetch (to disable prefetching in the whole source file) or placing #pragma noprefetch before a function or a loop (for a more fine-grained control). After that, prefetching can be modified with the argument -opt-prefetch-distance (to affect the whole file) or effected with the intrinsic _mm_prefetch() or #pragma prefetch (for precise control of each variable).
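As an illustration only (this sketch is not one of the numbered listings of this book; the hint and distance arguments 1:16 and 0:4 are arbitrary placeholders that would have to be tuned empirically, and the exact pragma syntax is documented in [68]), manual prefetch control for an inner loop similar to the one in Listing 4.61 could look as follows:

// Disable compiler-generated prefetches for this loop, then request prefetches
// of planckFunc into the L2 cache 16 iterations ahead and into the L1 cache
// 4 iterations ahead (the distances are illustrative placeholders).
#pragma noprefetch
#pragma prefetch planckFunc:1:16
#pragma prefetch planckFunc:0:4
for (int k = 0; k < tempBins; k++)
  result += planckFunc[i*tempBins + k]*distribution[j*tempBins + k];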


The following general considerations may be helpful in the planning of prefetch optimization:

a) It is possible to diagnose whether the performance of a particular application can be improved with software prefetching. The simplest test is to turn off prefetching in the whole application or in a particular loop or function. If the performance drops significantly, then prefetching plays an important role, and fine-tuning the prefetch distances can lead to a performance increase.
b) Prefetching is more important for Intel Xeon Phi coprocessors than for Intel Xeon processors. This is in part explained by the fact that Intel Xeon cores are out-of-order processors, while Intel Xeon Phi cores are in-order. Out-of-order execution allows Intel Xeon processors to overlap computation with memory latency. Additionally, the lack of a hardware L1 prefetcher on Intel Xeon Phi coprocessors makes software prefetching necessary on the lowest level of cache traffic.
c) Loop tiling and recursive cache-oblivious algorithms (see Section 4.5.4 and Section 4.5.5) improve application performance by reducing cache traffic, and, therefore, prefetching becomes less important for algorithms optimized with these techniques.
d) If software prefetching maintains good cache traffic, hardware prefetching does not come into effect.

Additional information on prefetching on Intel Xeon Phi coprocessors can be found in this presentation
by Rakesh Krishnaiyer [67]. The Intel C++ Compiler Reference Guide has detailed information about pragmas
prefetch and noprefetch [68] and compiler argument -opt-prefetch [69].


4.6 PCIe Traffic Control


Applications that use the coprocessor in the offload mode can benefit from the optimization of the
communication between the host and the coprocessor. In this section, we demonstrate the fundamental
strategies for optimizing communication in offload applications: memory retention and data persistence on the
coprocessor between offloads, aligning the offloaded arrays, and using large TLB pages.

Default Offload Mode


The code demonstrated in Listing 4.71 will be used for timing the offload performance.

void DefaultOffload(const size_t size, char* data) {
  // Default offload procedure: allocate memory on coprocessor, transfer data,
  // perform offload calculations, deallocate memory on coprocessor

#pragma offload target(mic:0) in(data: length(size) align(64))
  {
    data[0] = 0; // touch array to avoid dead code elimination
  }
}

Listing 4.71: Function performing offload of data to the coprocessor in the default offload mode. Memory for transferred data is allocated and deallocated at the beginning and end of each offload. Data is transferred fully in each offload.
This function transfers array data to the coprocessor in the default mode. At the beginning of each offload, the offload Runtime Library (RTL) will allocate memory for the respective array on the coprocessor, then the data will be transferred over the PCIe bus, calculations will be performed, and memory will be deallocated.
4.6.1 Memory Retention Between Offloads

Consider the case when a function performing offload is called multiple times, and all or some of the pointer-based arrays sent to the coprocessor have the same size. Then offload can be optimized using the clauses alloc_if and free_if in order to preserve the memory allocated for the array on the coprocessor (see also Section 2.2.9). This optimization is shown in Listing 4.72.

void OffloadWithMemoryRetention(const size_t size, char* data,
                                const int k, const int nTrials) {
  // Allocate arrays on coprocessor during the first iteration;
  // retain allocated memory for subsequent iterations.
  // Here k is the zero-based number of iteration, nTrials is the total iteration count.

#pragma offload target(mic:0) \
  in(data: length(size) alloc_if(k==0) free_if(k==nTrials-1) align(64))
  {
    data[0] = 0; // touch array to avoid dead code elimination
  }
}

Listing 4.72: Function performing offload with data transfer, optimized with memory retention on the coprocessor. Data is transferred fully in each offload; however, memory for the data on the coprocessor is allocated only during the first offload and deallocated during the last offload.


The effect of memory retention on the offload performance is very significant, because the memory
allocation operation is essentially serial and therefore slow. For large arrays, the memory retention reduces
the latency associated with data offload almost by a factor of 10x. For smaller arrays, the effect is even more
dramatic because the latency of the memory allocation operation comes into play. As a rule of thumb, expect
to transfer data across the PCIe bus at a rate of 6 GB/s, but to allocate memory on the coprocessor at a rate of
0.5 GB/s.
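As a back-of-the-envelope illustration of these rule-of-thumb rates (the array size below is arbitrary): offloading a 1 GB array costs roughly 1 GB / (6 GB/s) ≈ 0.17 s for the PCIe transfer itself, but about 1 GB / (0.5 GB/s) ≈ 2 s for the memory allocation, so retaining the buffer between offloads removes the dominant contribution to the latency.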

4.6.2 Data Persistence Between Offloads


Finally, let us consider the case when all or some of the pointer-based arrays in several consecutive
offloads contain the same data. In this case, we can save additional time by preventing the transfer of persistent
data across the PCIe bus. This is illustrated in Listing 4.73.

void OffloadWithDataPersistence(const size_t size, char* data,
                                const int k, const int nTrials) {

  // Transfer data during the first iteration;
  // skip transfer for subsequent iterations
  const size_t transferSize = ( k == 0 ? size : 0);
#pragma offload target(mic:1) \
  in(data: length(transferSize) alloc_if(k==0) free_if(k==nTrials-1) align(64))
  {
    data[0] = 0; // touch array to avoid dead code elimination
  }
}

Listing 4.73: Function performing offload of a calculation to the coprocessor with data persistence. After the first offload, data is not transferred again, and memory is not allocated or deallocated until the last offload.
In this case, we limited data transport with the help of the clause length(0) in all offloads but the first one. The array data will still be available in full on the coprocessor; however, the data will not be transferred.

4.6.3 Memory Alignment and TLB Page Size Control


For optimal offload performance, transferred arrays must be allocated on at least a 64-byte boundary. If the memory allocation boundary is an odd multiple of 32 bytes, then an intermediate host-side data copy is required (see [70]). In order to specify a 64-byte memory alignment on the coprocessor, use the specifier align(64) as shown in Listing 4.71, Listing 4.72 and Listing 4.73. In order to align host-side memory on a 64-byte boundary, use the _mm_malloc() intrinsic or other methods described in Section 3.1.4.
The performance of data transfer to/from the coprocessor and the performance of certain memory-intensive workloads can be improved by specifying large memory pages in the TLB of the coprocessor. The easiest way to achieve large TLB pages in offload applications is using the host-side environment variable MIC_USE_2MB_BUFFERS. Note that it is not necessary to set MIC_ENV_PREFIX=MIC in order for MIC_USE_2MB_BUFFERS to work. The value of this variable should be set to the minimum runtime size of pointer-based variables for which 2 MB TLB pages must be used instead of the standard 4 kB pages. For example, MIC_USE_2MB_BUFFERS=16k means that for arrays exceeding 16 kB in size, large TLB pages will be used.
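As a minimal sketch combining these recommendations (this is an illustration rather than one of the numbered listings of this book; the function name, binary name and the 64 kB threshold are placeholders), one could allocate the host-side buffer with _mm_malloc(), request alignment on the coprocessor with align(64), and enable large TLB pages at run time:

#include <malloc.h>  // _mm_malloc/_mm_free

void AlignedOffload(const size_t size) {
  // 64-byte aligned allocation on the host side
  char* data = (char*) _mm_malloc(size, 64);
#pragma offload target(mic:0) in(data: length(size) align(64))
  {
    data[0] = 0; // touch array to avoid dead code elimination
  }
  _mm_free(data);
}

and launch the application with, e.g.,

user@host% MIC_USE_2MB_BUFFERS=64k ./offload-app

where offload-app stands for the application binary.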


4.6.4 Offload Benchmark Results


We benchmarked the codes in Listing 4.71, Listing 4.72 and Listing 4.73 and show the results in Figure 4.22 and Figure 4.23.
The latencies of the different offload modes were measured directly by timing the corresponding functions in the above-mentioned code listings. The latencies are shown in Figure 4.22. Dash-dotted curves labelled “default offload” are the result of #pragma offload without any free_if or alloc_if specifiers. These latencies include the communication time with the offload RTL, the data transfer time, and the time of memory allocation and deallocation on the coprocessor in each offload. Solid curves labelled “with memory retention” are the result of using free_if(0) alloc_if(0) in all but the first and last offloads. These latencies include only the communication time with the offload RTL and the data transfer time. Finally, dotted curves labelled “with data persistence” are the result of adding length(0) free_if(0) alloc_if(0) to all but the first and last offload. These latencies are just the communication time with the offload RTL; data is not transferred, and memory is not allocated in this case. For each of the offload modes, two settings for the TLB pages are tested: default 4 kB TLB pages (red curves with filled circles) and huge 2 MB pages (blue curves with filled triangles). The latter mode is effected by setting the environment variable MIC_USE_2MB_BUFFERS=1K.
The effective bandwidth of the PCIe traffic shown in Figure 4.23 is derived by dividing the transferred array size by the offload latency with memory retention. The word “effective” indicates that this metric includes the communication time with the offload RTL and the data transfer time.
Results show that:

a) In the default offload mode, the offload latency is dominated by memory allocation. This is manifested in the fact that retaining memory allocated on the coprocessor reduces the offload latency by more than a factor of 10x.
b) When memory retention is used, the latency is comprised of two components: the data transfer time (proportional to the array size) and the time of communication with the offload RTL (the latter is independent of the array size). For arrays greater than 256 kB, the data transfer time dominates the latency.
c) The effective bandwidth increases with the array size. It plateaus at over 6.0-6.4 GB/s for arrays greater than 16 MB.
d) The maximum bandwidth is slightly greater for arrays over 16 MB when large TLB pages are set via MIC_USE_2MB_BUFFERS.
e) Arrays of size 4 MB have an anomalously low effective bandwidth. The bandwidth is also slightly below the general trend for 2 MB and 8 MB arrays.
f) The effective bandwidth for arrays smaller than 32 kB is less than 10% of the maximum bandwidth.

The results and optimization techniques discussed above may help to optimize applications in which the computation to communication ratio is not very high. The complete code of the benchmark can be found in Exercise Section A.4.14.


Figure 4.22: Latencies associated with data and function offload to the coprocessor as a function of the transferred array size (1 kB to 1 GB; latency in ms). Three sets of curves (dash-dotted, solid and dotted) reflect the different offload modes: “default offload” (memory allocation and data transfer), “with memory retention” (data transfer only), and “with data persistence” (no memory allocation or data transfer). For each of the offload modes, two settings for the TLB pages are tested: default 4 kB TLB pages (red curves with filled circles) and huge 2 MB pages (blue curves with filled triangles). See Section 4.6.4 for details.


Figure 4.23: The effective bandwidth of data transfer to the coprocessor (in GB/s, as a function of the array size from 1 kB to 1 GB), calculated by dividing the transferred data size by the offload latency for the “memory retention” case in Figure 4.22. Two settings for the TLB pages are tested: default 4 kB TLB pages (red curves with filled circles) and huge 2 MB pages (blue curves with filled triangles). See Section 4.6.4 for details.


4.7 Process Parallelism: MPI Optimization Strategies


MPI applications in computing systems with Intel Xeon Phi coprocessors face two unique challenges:

1. When the workload is shared between the host processors and the coprocessors on an equal basis, the computing system becomes heterogeneous. In traditional homogeneous computing clusters, one may orchestrate work sharing based on the assumption that equal parts of the work take equal amounts of time on any two compute devices. However, in computing systems with Intel Xeon Phi coprocessors, the same amount of work may take a different time depending on whether it is processed by a host processor or by a coprocessor, because they yield a different number of FLOPs per second. Therefore, work scheduling now must either take into account the relative performance of different compute units, or utilize dynamic scheduling to balance the workload.

2. The total number of MPI processes can easily exceed 240 even on a single compute node. This is at least an order of magnitude greater than in traditional systems. The consequence of this large number is a large amount of MPI communication. If communication quenches the performance of the algorithm, the programmer must consider communication-efficient algorithms. Additionally, hybrid OpenMP/MPI programming can be employed in order to reduce the number of MPI processes, but utilize multithreading within each process.

In this section, we discuss and provide examples of MPI application optimization with the help of load balancing and inter-operation with OpenMP. For load balancing, we utilize “boss-worker” scheduling with algorithms also adopted in parallel OpenMP loops:
a) static scheduling, where the work distribution is known before the beginning of the parallel loop,
b) dynamic scheduling, where the scheduler assigns work to the available workers in chunks, and when a worker finishes its chunk of work, it reports to the “boss” to receive the next chunk (a minimal sketch of this scheme is shown after this list), and
c) guided scheduling, which is similar to dynamic scheduling, except that at the beginning, chunks are large, and towards the end of the calculation, the chunk size is gradually reduced.
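The following sketch illustrates the dynamic (“boss-worker”) variant in plain MPI. It is a schematic illustration only, not the implementation developed later in this section; the function name, message tags and chunk handling are placeholder choices:

#include <mpi.h>

// Schematic dynamic ("boss-worker") scheduling: rank 0 hands out chunks of the
// iteration space [0, nBlocks); each worker requests a new chunk when it is idle.
void DynamicWorkSharing(const long nBlocks, const long chunkSize,
                        const int rank, const int nRanks) {
  const int TAG_REQUEST = 1, TAG_WORK = 2, TAG_STOP = 3;
  if (rank == 0) { // The "boss": distribute chunks until the work is exhausted
    long nextBlock = 0;
    int activeWorkers = nRanks - 1;
    while (activeWorkers > 0) {
      long dummy; MPI_Status st;
      MPI_Recv(&dummy, 1, MPI_LONG, MPI_ANY_SOURCE, TAG_REQUEST, MPI_COMM_WORLD, &st);
      if (nextBlock < nBlocks) { // Assign the chunk starting at nextBlock
        MPI_Send(&nextBlock, 1, MPI_LONG, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
        nextBlock += chunkSize;
      } else {                   // No work left: release this worker
        MPI_Send(&nextBlock, 1, MPI_LONG, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
        activeWorkers--;
      }
    }
  } else { // Workers: request a chunk, process it, repeat until told to stop
    while (1) {
      long firstBlock = 0; MPI_Status st;
      MPI_Send(&firstBlock, 1, MPI_LONG, 0, TAG_REQUEST, MPI_COMM_WORLD); // Request work
      MPI_Recv(&firstBlock, 1, MPI_LONG, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
      if (st.MPI_TAG == TAG_STOP) break;
      // ... process blocks [firstBlock, min(firstBlock+chunkSize, nBlocks)) here ...
    }
  }
}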
We demonstrate how using OpenMP in MPI processes lowers the number of workers, which may decrease the amount of communication, benefitting performance. We also discuss the subject of process pinning in MPI in the context of inter-operation with OpenMP.
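As an illustration of such a hybrid setup (the binary name pi_hybrid.mic and the specific values of -np and OMP_NUM_THREADS are hypothetical, and I_MPI_PIN_DOMAIN=omp is one possible pinning control of the Intel MPI library, not a setting prescribed by this section; consult the Intel MPI documentation for the authoritative list of pinning variables), a native run of 60 ranks with 4 OpenMP threads each could be launched as:

user@host% I_MPI_MIC=1 I_MPI_PIN_DOMAIN=omp OMP_NUM_THREADS=4 mpirun -np 60 -host mic0 ~/pi_hybrid.mic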


Naturally, the optimization of MPI applications does not end at the techniques discussed above. An efficient MPI application must minimize communication and overlap communication with computation, utilize SIMD instructions, partition problems for local data access, and use efficient computational microkernels. These topics have been discussed earlier in this chapter.


4.7.1 Example Problem: the Monte Carlo Method of Computing the Number π
We illustrate load balancing in MPI for an application that uses the Monte Carlo method to compute the
number π = 3.141592653589793 . . . This problem does not require intensive data transfer, and every Monte
Carlo trial is independent from other trials. Therefore, this method can be categorized as an embarrassingly
parallel algorithm.
The Monte Carlo method of computing the number π can be illustrated with a geometrical model. Consider a quarter of a circle of radius R = 1 inscribed in a square with the side length L = R (see Figure 4.24). The surface area of the quarter of a circle is

    A_{\mathrm{quarter\ circle}} = \frac{1}{4}\pi R^2,        (4.8)

and the surface area of the square is

    A_{\mathrm{square}} = L^2.        (4.9)

Let us uniformly distribute a set of N random points over the surface area of the square. The mathematical expectation of the number of these points inside the quarter of a circle is

    \langle N_{\mathrm{quarter\ circle}} \rangle = \frac{A_{\mathrm{quarter\ circle}}}{A_{\mathrm{square}}}\, N.        (4.10)

It is trivial to show that

    4\,\frac{\langle N_{\mathrm{quarter\ circle}} \rangle}{N} = 4\,\frac{\pi R^2}{4 L^2} = \pi.        (4.11)

Figure 4.24: Monte Carlo calculation of the number π. A quarter circle of radius R = 1 is inscribed in a unit square; each Monte Carlo trial is a random point in the square, and the points falling inside the quarter circle are counted.

In a computer code, we can generate N uniformly distributed points and compute N_{\mathrm{quarter\ circle}}. Using these two values, we can estimate the number π as

    \pi \approx 4\,\frac{N_{\mathrm{quarter\ circle}}}{N}.        (4.12)

The core algorithm for the number π calculation with this Monte Carlo method is expressed in the following C source code:


const long iter = 1L<<32L;  // total number of iterations 2^32
long dUnderCurve = 0;       // number of points inside the circle quarter
for (long i = 0; i < iter; i++){
  const float x = (float)rand()/(float)RAND_MAX; // random X coordinate between 0 and 1
  const float y = (float)rand()/(float)RAND_MAX; // random Y coordinate between 0 and 1
  if (x*x + y*y < 1.0f) dUnderCurve++; // check if distance is less than radius (R^2 = 1.0)
}
const double pi = (double)dUnderCurve / (double)iter * 4.0;

Listing 4.74: Serial algorithm of the number π calculation with the Monte Carlo method (C source code).

Before we proceed to the MPI implementation of this code, let us optimize it using the methods described earlier in this chapter. This code is bottlenecked by the generation of random numbers. We can improve the performance of random number generation using one of the SIMD-capable random number generators (RNGs) from the Intel MKL. These RNGs perform best when they generate an array of random numbers, as opposed to one number at a time. In order to accommodate this requirement in the code, we must divide the iteration space into sufficiently large blocks and transform the for-loop into two nested loops: one iterating over blocks, another operating within a block. This optimization is nothing more than the strip-mining technique discussed in Section 4.4.4. The resulting optimized code is shown in Listing 4.75.

#include <mkl_vsl.h>

//...

const long iter = 1L<<32L;
const long BLOCK_SIZE = 4096;
const long nBlocks = iter/BLOCK_SIZE;
const int seed = 2375041;

// Random number generator from MKL
VSLStreamStatePtr stream;
vslNewStream( &stream, VSL_BRNG_MT19937, seed );

for (long j = 0; j < nBlocks; j++) {
  vsRngUniform( 0, stream, BLOCK_SIZE*2, r, 0.0f, 1.0f );
  for (long i = 0; i < BLOCK_SIZE; i++) {
    const float x = r[i];
    const float y = r[i+BLOCK_SIZE];
    // Count points inside quarter circle
    if (x*x + y*y < 1.0f) dUnderCurve++;
  }
}
const double pi = (double)dUnderCurve / (double)iter * 4.0;

Listing 4.75: Optimized serial core algorithm of the number π calculation with the Monte Carlo method (C source code).
Note that random coordinates of a point (x and y) are obtained from the array of random numbers r in such a way that unit-stride access to r is ensured. This is done in order to enable automatic vectorization of the expression x*x+y*y.




4.7.2 MPI Implementation without Load Balancing


In Listing 4.76 we demonstrate an initial approach to the problem in MPI. The computing kernel of this code is based on the code in Listing 4.75. The parallelization of this code in MPI involves the following:

• MPI is used to distribute the work across multiple processes, and MPI initialization and termination functions are called at the beginning and the end of the calculation.
• Each process computes only a fraction of the work, and this fraction is the same for all processes. Variable blocksPerProc is the average number of blocks for which each process must run the Monte Carlo simulation. This variable is declared as a floating-point number in order to deal with situations where the number of blocks is not a multiple of the number of MPI processes.
• Variables myFirstBlock and myLastBlock contain the range of blocks for each process. These are integer variables computed in such a way that on average, each MPI process works on a total of blocksPerProc blocks. Note that computing the beginning and ending block is not necessary in the calculation of the number π, because blocks are not associated with any data. However, we use these quantities in order to make our approach applicable to other loop-centric problems.
• Every MPI process initializes the random number generator with a different random seed.
• MPI_SUM reduction collects the total number of points within the quarter circle area from all MPI processes into the variable UnderCurveSum. Only one MPI process (rank=0) performs the final calculations of the number π.
• For a better timing estimate, the calculation is run multiple times (trials=10).


#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <mkl_vsl.h>

const long iter=1L<<32L, BLOCK_SIZE=4096L, nBlocks=iter/BLOCK_SIZE, nTrials = 10;

void RunMonteCarlo(const long firstBlock, const long lastBlock,
                   VSLStreamStatePtr & stream, long & dUnderCurve) {
  // Performs the Monte Carlo calculation for blocks in range [firstBlock; lastBlock)
  // to count the number of random points inside of the quarter circle
  float r[BLOCK_SIZE*2] __attribute__((aligned(64)));
  for (long j = firstBlock; j < lastBlock; j++) {
    vsRngUniform( 0, stream, BLOCK_SIZE*2, r, 0.0f, 1.0f );
    for (long i = 0; i < BLOCK_SIZE; i++) {
      const float x = r[i];
      const float y = r[i+BLOCK_SIZE];
      if (x*x + y*y < 1.0f) dUnderCurve++; // Count points inside quarter circle
    }
  }
}

int main(int argc, char *argv[]){
  int rank, nRanks, trial;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Work sharing: equal amount of work in each process
  const double blocksPerProc = (double)nBlocks / (double)nRanks;

  for (trial = 0; trial < nTrials; trial++) { // Multiple trials
    const double start_time = MPI_Wtime();
    // Range of blocks processed by this process
    const long myFirstBlock = (long)(blocksPerProc*rank);
    const long myLastBlock  = (long)(blocksPerProc*(rank+1));
    long dUnderCurve=0, UnderCurveSum=0; // Results

    // Create and initialize a random number generator from MKL
    VSLStreamStatePtr stream;
    vslNewStream( &stream, VSL_BRNG_MT19937, trial*nRanks + rank );
    RunMonteCarlo(myFirstBlock, myLastBlock, stream, dUnderCurve);
    vslDeleteStream( &stream );

    // Compute pi
    MPI_Reduce(&dUnderCurve, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank==0) {
      const double pi = (double)UnderCurveSum / (double)iter * 4.0;
      const double end_time = MPI_Wtime();
      const double pi_exact=3.141592653589793;
      if (trial == 0) printf("#%9s %8s %7s\n", "pi", "Rel.err", "Time, s");
      printf ("%10.8f %8.1e %7.3f\n", pi, (pi-pi_exact)/pi_exact, end_time-start_time);
    }
    MPI_Barrier(MPI_COMM_WORLD);
  }
  MPI_Finalize();
}

Listing 4.76: Calculation of the number π with a Monte Carlo simulation.


First, let us establish the baseline of the performance of this code by running two benchmarks:

1. on the host system with two eight-core Intel Xeon E5-2670 processors with hyper-threading, using 32
MPI processes, and
2. on a 60-core Intel Xeon Phi B1PRQ-5110P coprocessor using 240 MPI processes.
The procedure for compiling and running the code is shown in Listing 4.77.

user@host% mpiicpc -mkl -o pi_mpi pi_mpi.cc -vec-report
pi_mpi.cc(27): (col. 7) remark: LOOP WAS VECTORIZED.
user@host% mpirun -np 32 -host localhost ./pi_mpi
# pi Rel.err Time, s
3.14163062 1.2e-05 0.824
3.14158209 -3.4e-06 0.837
3.14154471 -1.5e-05 0.827
3.14162596 1.1e-05 0.835
3.14158166 -3.5e-06 0.833
3.14160449 3.8e-06 0.844
3.14159184 -2.6e-07 0.846
3.14155282 -1.3e-05 0.846
3.14154737 -1.4e-05 0.839
3.14159142 -3.9e-07 0.846
user@host%
user@host% mpiicpc -mmic -mkl -o pi_mpi.mic pi_mpi.cc -vec-report
pi_mpi.cc(27): (col. 7) remark: LOOP WAS VECTORIZED.
user@host% scp pi_mpi.mic mic0:~/
pi_mpi.mic 100% 337KB 336.9KB/s 00:00
user@host% I_MPI_MIC=1 mpirun -np 240 -host mic0 ~/pi_mpi.mic
# pi Rel.err Time, s
3.14157176 -6.6e-06 1.005
3.14159216 -1.6e-07 0.453
3.14156315 -9.4e-06 0.450
3.14159555 9.2e-07 0.465
3.14162758 1.1e-05 0.445
3.14163318 1.3e-05 0.444
3.14154609 -1.5e-05 0.444
3.14158839 -1.4e-06 0.446
3.14160231 3.1e-06 0.447
3.14154475 -1.5e-05 0.445
user@host%

Listing 4.77: Compiling and running the π calculation on the host system and on an Intel Xeon Phi coprocessor.

The performance of the code on a single coprocessor is in this case 1.9x better than the performance on the host system. On the host, the calculation takes on average 0.84 seconds, and on the coprocessor it takes 0.44 seconds. The number of blocks processed in the course of this time is represented by the variable nBlocks $= 2^{32}/4096 = 2^{20}$. This translates to a performance of $2^{20}/0.84\,\mathrm{s} \approx 1.2\cdot10^6$ blocks per second on the host and $2^{20}/0.44\,\mathrm{s} \approx 2.4\cdot10^6$ blocks per second on the coprocessor.
Next, we will benchmark the code using both the host processor and the coprocessor. The theoretical maximum performance that we can expect from this combination is $1.2\cdot10^6 + 2.4\cdot10^6 = 3.6\cdot10^6$ blocks per second. Accordingly, the theoretical minimum runtime is $2^{20}/(3.6\cdot10^6) \approx 0.29$ s. However, as we will see shortly, achieving this performance requires some additional measures in order to improve load balancing. See Figure 4.27 for a summary of these performance results.


Load balancing is not an issue for a homogeneous system such as an Intel Xeon processor or an Intel Xeon
Phi coprocessor. However, problems begin when we try to utilize the coprocessor (or several coprocessors)
simultaneously with the host. As we estimated above, the theoretical minimum runtime for this code when
it runs on the host together with coprocessor is 0.29 s, which is 50% better than only on the coprocessor.
However, as shown in Listing 4.78, when we split the work between the host and the coprocessor, the runtime
is 0.37 seconds, which is only 20% better than on the coprocessor alone.

user@host% mpirun -np 32 -host localhost ./pi_mpi : -np 240 -host mic0 ~/pi_mpi.mic
# pi Rel.err Time, s
3.14156523 -8.7e-06 0.885
3.14156675 -8.2e-06 0.376
3.14159230 -1.1e-07 0.360
3.14161923 8.5e-06 0.378
3.14161673 7.7e-06 0.379
3.14157532 -5.5e-06 0.359
3.14156212 -9.7e-06 0.364
3.14158508 -2.4e-06 0.359
3.14153097 -2.0e-05 0.359
3.14160180 2.9e-06 0.359
user@host%
Listing 4.78: Heterogeneous calculation of the number π on the host system plus an Intel Xeon Phi coprocessor.

In order to understand the cause of the lower than expected performance, we will visualize the load balance using the Intel Trace Analyzer and Collector Event Timeline Charts. Listing 4.79 demonstrates how to collect data for Intel Trace Analyzer and Collector during the run of the application.

user@host% source /opt/intel/itac/8.1.0.024/bin/itacvars.sh
user@host% source /opt/intel/itac/8.1.0.024/mic/bin/itacvars.sh
user@host% mpiicpc -mkl -o pi_mpi pi_mpi.c
user@host% mpiicpc -mmic -mkl -o pi_mpi.mic pi_mpi.c
user@host% scp pi_mpi.mic mic0:~/
pi_mpi.mic 100% 433KB 432.5KB/s 00:00
user@host% export VT_LOGFILE_FORMAT=stfsingle
user@host% mpirun -trace -n 32 -host localhost ./pi_mpi : -n 240 -host mic0 ~/pi_mpi.mic
pi = 3.141536204, rel. error = -0.000017968, time = 0.929734 s
pi = 3.141542756, rel. error = -0.000015883, time = 0.390374 s
...
pi = 3.141504486, rel. error = -0.000028065, time = 0.387111 s
[0] Intel(R) Trace Collector INFO: Writing tracefile pi_mpi.single.stf in /home/user/pi
user@host% traceanalyzer pi_mpi.single.stf

Listing 4.79: Intel MPI execution of pi_mpi with data collection for Intel Trace Analyzer and Collector.

The event timeline chart corresponding to this run is shown in Figure 4.25. In this chart, blue regions
correspond to computation, and red regions correspond to MPI waiting for communication. The two bands shown in the figure group the information for host processes (top) and coprocessor processes (bottom).

Figure 4.25: Intel Trace Analyzer and Collector Event Timeline Chart for heterogeneous Intel MPI run with 32 processes
on the host and 240 on the coprocessor, grouped by the host name.

The red regions in the top band in Figure 4.25 are the key to understanding the cause of the lower than expected performance of the heterogeneous calculation. These regions indicate that the host was waiting for the coprocessor for the majority of the elapsed time. This waiting is caused by the fact that the source code in Listing 4.76 does not differentiate between MPI processes running on the host and on the coprocessor. Therefore, the total number of iterations iter is evenly distributed between all 240 + 32 = 272 processes of the MPI_COMM_WORLD communicator. Even though the host, as a whole, is slower than the coprocessor only by a factor of 0.84 s/0.44 s = 1.9, it receives merely a fraction 32/272 ≈ 0.12 of the total work, and so the MPI processes on the host are severely under-loaded.

Load imbalance in this problem is a consequence of the heterogeneous nature of the computing system comprised of an Intel Xeon processor and an Intel Xeon Phi coprocessor. In the next section we discuss how to alleviate this problem.


4.7.3 Load Balancing with Static Scheduling


Listing 4.80 shows a section of an improved version of the Monte Carlo calculation of the number π.
This version uses an additional parameter α (represented by variable alpha in the code), which controls the
relative amount of work to be processed on the host and on the coprocessor.

1  // ...
2
3  // Count the number of processes running on the host and on coprocessors
4  int nProcsOnMIC, nProcsOnHost, thisProcOnMIC=0, thisProcOnHost=0;
5  #ifdef __MIC__
6  thisProcOnMIC++;  // This process is running on an Intel Xeon Phi coprocessor
7  #else
8  thisProcOnHost++; // This process is running on an Intel Xeon processor
9  #endif
10 MPI_Allreduce(&thisProcOnMIC,  &nProcsOnMIC,  1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
11 MPI_Allreduce(&thisProcOnHost, &nProcsOnHost, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
12
13 // Work sharing calculation
14 const char* alphast = getenv("ALPHA"); // Load balance parameter
15 if (alphast==NULL) { printf("ALPHA is undefined\n"); exit(1); }
16 const double alpha = atof(alphast);
17 #ifndef __MIC__
18 // Blocks per rank on host
19 const double blocksPerRank =
20   ( nProcsOnMIC > 0 ? alpha*nBlocks/(alpha*nProcsOnHost+nProcsOnMIC) :
21     (double)nBlocks/nProcsOnHost );
22 const long blockOffset = 0;
23 const int rankOnDevice = rank;
24 #else
25 // Blocks per rank on coprocessor
26 const double blocksPerRank = nBlocks / (alpha*nProcsOnHost + nProcsOnMIC);
27 const long blockOffset = nProcsOnHost*alpha*nBlocks/(alpha*nProcsOnHost+nProcsOnMIC);
28 const int rankOnDevice = rank - nProcsOnHost;
29 #endif
30
31 // Range of blocks processed by this process
32 const long myFirstBlock = blockOffset + (long)(blocksPerRank*rankOnDevice);
33 const long myLastBlock  = blockOffset + (long)(blocksPerRank*(rankOnDevice+1));
34 long dUnderCurve=0, UnderCurveSum=0;
35
36 // Create and initialize a random number generator from MKL
37 VSLStreamStatePtr stream;
38 vslNewStream(&stream, VSL_BRNG_MT19937, rank*nTrials + t);
39 RunMonteCarlo(myFirstBlock, myLastBlock, stream, dUnderCurve);
40 vslDeleteStream( &stream );
41
42 // Reduction to collect the results of the Monte Carlo calculation
43 MPI_Reduce(&dUnderCurve, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
44
45 // Compute pi
46 if (rank==0) {
47   const double pi = (double) UnderCurveSum / (double) iter * 4.0 ;
48   // ...
49 }

Listing 4.80: Monte Carlo code modified to statically balance the load according to a user-defined parameter α.


In the code in Listing 4.80, parameter α is read from the environment variable ALPHA. This parameter is quantitatively defined as
\[ \alpha = \frac{b_\mathrm{host}}{b_\mathrm{MIC}} , \qquad (4.13) \]
where $b_\mathrm{host}$ and $b_\mathrm{MIC}$ are the number of blocks per process on the host and on the coprocessor, respectively. These values are represented by the variable blocksPerRank in the code. A reminder: in our implementation, a block is a minimal partition of the problem, equal to a set of 4096 random points.
The value α = 1.0 reproduces the case of Listing 4.76, where every MPI process is assigned the
same amount of work. Values α > 1.0 assign more work to each process running on the host than to each
process running on the coprocessor. Correspondingly, for α < 1.0, each host process receives less work than
each coprocessor process. The optimal value of α depends on the specific problem and computing system
components (the number of coprocessors, the clock frequency of host processors, etc).
In order to compute $b_\mathrm{host}$ and $b_\mathrm{MIC}$ for a given value of α, the following relationship must be used:
\[ B_\mathrm{total} = b_\mathrm{host} P_\mathrm{host} + b_\mathrm{MIC} P_\mathrm{MIC} , \qquad (4.14) \]
where $P_\mathrm{host}$ is the number of MPI processes running on host processors, $P_\mathrm{MIC}$ is the number of processes running on coprocessors, and $B_\mathrm{total}$ is the total number of blocks in the problem. In the code, these quantities are represented by variables nProcsOnHost, nProcsOnMIC and nBlocks, respectively. From these relations, the amounts of work per process on the host and on the coprocessor are expressed as
\[ b_\mathrm{host} = \frac{\alpha B_\mathrm{total}}{\alpha P_\mathrm{host} + P_\mathrm{MIC}} , \qquad (4.15) \]
\[ b_\mathrm{MIC} = \frac{B_\mathrm{total}}{\alpha P_\mathrm{host} + P_\mathrm{MIC}} . \qquad (4.16) \]
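As an illustration, using the configuration benchmarked later in this section ($P_\mathrm{host} = 32$, $P_\mathrm{MIC} = 240$, $B_\mathrm{total} = 2^{20}$) and α = 3.4, equations (4.15) and (4.16) give $b_\mathrm{host} \approx 3.4 \cdot 2^{20}/348.8 \approx 10200$ blocks per host process and $b_\mathrm{MIC} \approx 2^{20}/348.8 \approx 3000$ blocks per coprocessor process, so each host process indeed receives 3.4 times more work than each coprocessor process.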

Note that in order to count the number of MPI processes running on the host and coprocessors, target-specific code is used to increment the respective process counter in lines 5-9 of Listing 4.80, and then an all-to-all reduction with MPI_Allreduce is performed (in lines 10-11). Target-specific code is discussed in Section 2.2.6.

As mentioned above, α = 1.0 reproduces the work sharing scheme of the unoptimized code in Listing 4.76. Generally, in a heterogeneous system comprised of host processors and Intel Xeon Phi coprocessors, better load balancing can be achieved by using a value of α other than 1.0. The optimal value depends on the performance of the calculation on each of the platforms and on the system configuration. It is possible to estimate the optimal value of α from the event timeline chart in ITAC shown in Figure 4.25. In this figure, two numbers are highlighted with green ellipses. These numbers are the average times of the main loop calculation on the host and on the coprocessor. The ratio of those two numbers is $3.18/0.893 \approx 3.56$. Therefore, if the original code is modified so that MPI processes on the host execute 3.56 times more iterations, there will be less time wasted by host processes on waiting. This corresponds to the value of α = 3.56.
It is also possible to determine the optimal value of α with a calibration study that tests the performance for multiple values of this parameter. Figure 4.26 shows the dependence of the execution time of the Monte Carlo code from Listing 4.80 as a function of parameter α. This empirical approach provides the optimal α between 3.3 and 3.7, which is close to the estimate that we obtained using Intel Trace Analyzer and Collector.
In practical HPC applications, the determination of the optimal α can be performed either by running
a small calibration workload prior to starting a production calculation, or “on the fly”, by measuring the
performance of the computing kernels on all devices of the heterogeneous system.
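The following fragment is a minimal sketch of the "on the fly" approach; it is not part of the code in Listing 4.80. It assumes that each process has already timed its own kernel over its block range (the variable elapsedTime is hypothetical), and it reuses the process counters nProcsOnHost and nProcsOnMIC from Listing 4.80 (both assumed non-zero) to convert measured per-process rates into an estimate of α:

// Estimate alpha from measured throughput (hypothetical instrumentation)
const double myRate = (double)(myLastBlock - myFirstBlock) / elapsedTime; // blocks/s in this process
double hostContrib = 0.0, micContrib = 0.0;
#ifdef __MIC__
micContrib  = myRate;
#else
hostContrib = myRate;
#endif
double hostRateSum, micRateSum;
MPI_Allreduce(&hostContrib, &hostRateSum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
MPI_Allreduce(&micContrib,  &micRateSum,  1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
// alpha = (rate per host process) / (rate per coprocessor process), see Equation (4.13)
const double alphaEstimate = (hostRateSum/nProcsOnHost) / (micRateSum/nProcsOnMIC);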


[Figure 4.26 plot, "Effect of load balancing between host and coprocessor in the Monte Carlo calculation of π": run time, load imbalance on the host, load imbalance on the coprocessor, the baseline without load balancing, and the theoretical best time, plotted against the parameter α from 0 to 8; the y-axis is time in seconds (lower is better).]

Figure 4.26: Load balance: execution time (in seconds) of the number π calculation with the Monte Carlo method as a function of the parameter α (the ratio of the amount of work per process on the host to the amount of work per process on the coprocessor).
In Figure 4.27, we summarize the performance results for the Monte Carlo calculation of π on the host with two Intel Xeon E5-2670 processors, on an Intel Xeon Phi B1PRQ-5110P coprocessor, and on both devices simultaneously.
[Figure 4.27 bar chart, "Load balance: execution times": Xeon only (32 processes) 0.839 s; Xeon Phi only (240 processes) 0.449 s; Xeon + Xeon Phi, α = 1.0, 0.366 s; Xeon + Xeon Phi, α = 3.4, 0.283 s.]

Figure 4.27: Load balance: execution time (in seconds) of the Monte Carlo calculation of π with 32 MPI processes on an Intel Xeon host, 240 processes on an Intel Xeon Phi coprocessor, on both the host and the coprocessor without load balancing, and, finally, on both devices with static load balancing (α = 3.4).


4.7.4 Load Balancing with Dynamic Scheduling


Static load balancing allowed us to scale a Monte Carlo calculation across the host and the coprocessor
with optimal load balancing (Section 4.7.3). However, this approach is not always optimal. Static load
balancing requires fine tuning of parameters that control work sharing. With static load balancing, work
sharing parameters must be calibrated for every computing system configuration, which may be inconvenient.
In addition, static load balancing may not produce good results if the runtime of any part of the problem is
non-deterministic. For example, if the calculation involves an iterative process, and the number of iterations
varies from one subset of the data to another, then some processes may be “unlucky” to receive a longer
workload, and all other processes will have to wait for the slowest process at the synchronization point.
The solution to both problems is dynamic load balancing. With dynamic load balancing, MPI processes
that finish with their share of work receive additional work. If the scheduling scheme is optimized, then it is
not necessary to calibrate the application for every system configuration, and fluctuations of the execution time
from one part of the problem to another are naturally absorbed. The optimization of the scheduling scheme
may require some tuning. However, in general, parameters do not need to be tuned as precisely as with static
balancing.

The drawback of dynamic load balancing schemes is that from one run to another, the distribution of work across MPI processes may vary depending on the runtime conditions and MPI message arrival times. As a consequence, the exact result of a calculation is not reproducible if the calculation involves random numbers or not precisely associative operations. This is true of almost all applications except those operating exclusively on integers.
In this section we will show how to implement dynamic load balancing in the boss-worker model using Intel MPI communication between ranks. The boss, one of the MPI processes, is dedicated to assigning parts of the problem (called "chunks" in this context) to the workers, i.e., the rest of the MPI processes in the run. When a worker finishes its assigned chunk of work, it reports back to the boss to receive either another chunk, or a command to terminate calculations.



In order to illustrate dynamic load balancing in MPI, we will use the same example problem as in Section 4.7.3, the calculation of π with a Monte Carlo method. The core of the calculation remains the same; however, additional communication is included in the code of each process. Before a worker begins calculation, it waits for a message from the boss (rank 0) with the values of begin and end, indicating the first and last block that the worker must process. When all of the blocks are processed, the worker waits for more messages, until the received message contains a negative value for begin and end, which indicates the end of the calculation. This scheme is similar to the dynamic scheduling mode for OpenMP loops (see Section 4.4.3). The code that implements this scheduling algorithm is shown in Listing 4.81.

The following properties of this code must be noted:


a) The process with rank equal to 0 plays the role of the boss.
b) The size of the chunk is read from the environment variable GRAIN_SIZE.
c) The chunk size remains constant for the duration of the run.
d) Workers keep reporting to the boss for a new chunk of work until the boss returns a message with the value
of the starting block equal to −1. This is the signal to terminate work.
e) The random number generator is initialized only once for each worker. For every subsequent chunk, the
worker continues iterating its RNG, but does not re-seed it. The reason for this strategy is that with some
RNGs, the cross-correlation between random streams with different seeds can be much stronger than the
auto-correlation of a single random stream with a small lag. In other words, frequent re-seeding of the
RNG in a worker can ruin the quality of random numbers generated in this worker. We avoid re-seeding by
preserving the state of the random number generator between different chunks of work.


1  // ...
2  long dUnderCurve = 0, UnderCurveSum = 0;
3
4  if (rank == 0) {
5
6    // Boss assigns work
7    const char* grainSizeSt = getenv("GRAIN_SIZE");
8    if (grainSizeSt == NULL) { printf("GRAIN_SIZE undefined\n"); exit(1); }
9    grainSize = atof(grainSizeSt);
10   long currentBlock = 0;
11   while (currentBlock < nBlocks) {
12     msg[0] = currentBlock;                   // First block for next worker
13     msg[1] = currentBlock + grainSize;       // Last block
14     if (msg[1] > nBlocks) msg[1] = nBlocks;  // Stay in range
15     currentBlock = msg[1];                   // Update position
16
17     // Wait for next worker
18     MPI_Recv(&worker, 1, MPI_INT,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD, &stat);
19
20     // Assign work to next worker
21     MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
22   }
23
24   // Terminate workers
25   msg[0] = -1; msg[1] = -2;
26   for (int i = 1; i < nRanks; i++) {
27     MPI_Recv(&worker, 1, MPI_INT,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD, &stat);
28     MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
29   }
30 } else {
31
32   // Worker performs the Monte Carlo calculation
33   VSLStreamStatePtr stream; // Create and initialize a RNG from MKL
34   vslNewStream(&stream, VSL_BRNG_MT19937, rank*nTrials + t);
35   float r[BLOCK_SIZE*2] __attribute__((align(64)));
36
37   // Range of blocks processed by this worker
38   msg[0] = 0;
39   while (msg[0] >= 0) {
40     // Ask boss for the next chunk of work
41     MPI_Send(&rank, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
42     MPI_Recv(&msg, 2, MPI_LONG, 0, rank, MPI_COMM_WORLD, &stat);
43     const long myFirstBlock = msg[0];
44     const long myLastBlock  = msg[1];
45     RunMonteCarlo(myFirstBlock, myLastBlock, r, stream, dUnderCurve);
46   }
47   vslDeleteStream( &stream );
48 }
49
50 // Reduction to collect the results of the Monte Carlo calculation
51 MPI_Reduce(&dUnderCurve, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
52
53 // Compute pi
54 if (rank==0) {
55   const double pi = (double) UnderCurveSum / (double) iter * 4.0 ;
56   // ...
57 }

Listing 4.81: Boss-worker model used for dynamic load balancing of the Monte Carlo simulation.


The tuning parameter for this code is the size of the chunk, determined by the environment variable GRAIN_SIZE. The chunk size must be small enough that the number of chunks is much greater than the number of workers. This will enable even load distribution across workers. At the same time, the chunk size must be large enough so that the workers do not have to communicate with the boss too often. Therefore, we expect that there is a window of values for chunk size in which the performance is optimal.
For our example, BLOCK_SIZE = 4096, and the total number of iterations is $2^{32}$, so the number of blocks in the problem is $2^{32}/4096 = 2^{20} \approx 10^6$. So, for GRAIN_SIZE=1, the number of MPI point-to-point communications between the boss and the workers is of order $10^6$. For GRAIN_SIZE $= 10^6/272 \approx 4\cdot10^3$, the number of chunks is equal to the number of workers in the system comprised of dual host processors and a single 60-core Intel Xeon Phi coprocessor. Therefore, we expect that for GRAIN_SIZE somewhere between 1 and $4\cdot10^3$, the performance will be optimal.
Running the Monte Carlo calculation in Intel MPI requires some care with the placement of the boss
process. The boss is the process with the rank equal to 0, so this process must be the first one specified in the
command line or in the machine file. Furthermore, in an optimized calculation, the boss process will not use
much CPU time and does not have to occupy a whole logical core. Therefore, on the host (two eight-core
processors with two-way hyper-threading), we can launch 2 × 8 × 2 = 32 workers and one boss.

Important: by default, Intel MPI pins processes to certain parts of compute devices (sockets, cores, threads). This pinning generally improves performance and should be used for performance-critical calculations. However, the boss process must be unpinned. Otherwise, one of the worker processes on the host will be pinned to the same logical core as the boss, which may throttle down the scheduling workflow. Listing 4.82 demonstrates how to achieve that in a heterogeneous calculation that employs the host and the coprocessor.
r Yu

user@host% mpiicpc -mkl -o pi-dynamic-host pi-dynamic.cc
user@host% mpiicpc -mkl -mmic -o pi-dynamic-mic pi-dynamic.cc
user@host% scp pi-dynamic-mic mic0:~/
pi_mpi.mic 100% 341KB 340.7KB/s 00:00
user@host% export I_MPI_MIC=1
user@host% export GRAIN_SIZE=1024
user@host% mpirun \
  -host localhost -np 1 -env I_MPI_PIN 0 ./pi-dynamic-host : \
  -host localhost -np 32 ./pi-dynamic-host : \
  -host mic0 -np 240 ~/pi-dynamic-mic
# pi Rel.err Time, s GrainSize
3.14156455 -8.9e-06 0.744 1024
3.14158723 -1.7e-06 0.445 1024
3.14156120 -1.0e-05 0.442 1024
3.14158789 -1.5e-06 0.432 1024
3.14159633 1.2e-06 0.427 1024
3.14157595 -5.3e-06 0.410 1024
3.14160142 2.8e-06 0.408 1024
3.14157367 -6.0e-06 0.394 1024
3.14158843 -1.3e-06 0.429 1024
3.14160435 3.7e-06 0.425 1024
user@host%

Listing 4.82: Compiling and running the Monte Carlo calculation of π with dynamic load scheduling. Starting one
unpinned boss process on the host, 32 worker processes on the host and 240 workers on the coprocessor.


We measured the performance of the code with dynamic scheduling for a range of values of GRAIN_SIZE. The results are shown in Figure 4.28. The plot shows the total calculation time along with additional performance metrics: (i) MPI communication time is the average time in worker processes between the start of sending a request to the boss process and receiving a chunk of work to process, (ii) unbalanced time per process is the average time in worker processes between the end of the work and the reduction of results to the boss (a barrier was used before the reduction). The latter metric reflects the time that processes that received too little work spend waiting for processes that received too much work.
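A minimal sketch of how these two metrics could be measured inside a worker process is shown below; this is an assumed illustration (the accumulator commTime is hypothetical), not the exact instrumentation used to produce Figure 4.28:

// (i) MPI communication time: wait between requesting and receiving a chunk
const double requestStart = MPI_Wtime();
MPI_Send(&rank, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);         // ask the boss for work
MPI_Recv(&msg, 2, MPI_LONG, 0, rank, MPI_COMM_WORLD, &stat);  // receive the chunk
commTime += MPI_Wtime() - requestStart;

// ... after all chunks are processed ...

// (ii) unbalanced time: wait between finishing local work and the pre-reduction barrier
const double workDone = MPI_Wtime();
MPI_Barrier(MPI_COMM_WORLD);
const double unbalancedTime = MPI_Wtime() - workDone;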

Figure 4.28: Effect of grain size on dynamic scheduling in the heterogeneous Monte Carlo calculation of π: total computation time, MPI communication time and unbalanced time (average per process, on the host and on the coprocessor) as functions of the grain size from $2^0$ to $2^{14}$, compared to the theoretical best time.

As expected, for low values of GRAIN_SIZE, the runtime is poor due to excessive MPI communication for scheduling. This is illustrated in the Intel Trace Analyzer and Collector timeline chart in Figure 4.29, obtained for a run with GRAIN_SIZE=1 with only 32 workers on the host. In fact, according to this figure, due to message contention on the boss process, some of the workers never receive any work and remain idle for the duration of the calculation. This wrecks performance by reducing the amount of hardware parallelism available to this application.
For large grain sizes, performance is poor due to unbalanced load. The boss quickly hands out large chunks of work without regard for the number of workers expecting assignments. As a result, all work may be handed out before the last worker receives a chunk. In a less severe case of imbalance, some workers will receive two or three chunks while others will receive one or two.
The "sweet spot" for performance appears to be for GRAIN_SIZE between 1000 and 3000. However, the best performance achieved in this parameter window is 0.40 s per run, which is worse than 0.29 s per run achieved with static load balancing.
The ability of dynamic balancing to evenly distribute work across the processor and coprocessor is limited by the number and computational cost of iterations and by the MPI communication latency. However, the situation may be remedied with the help of OpenMP. Indeed, if we were to run fewer MPI processes, then a larger GRAIN_SIZE can be used, for which the MPI communication throughput is not saturated, but the load is balanced. If OpenMP is used in each of these processes, then all available cores can be occupied for parallel computing. We discuss this optimization in the next section.


Figure 4.29: Intel Trace Analyzer and Collector screenshot of dynamic load balancing MPI communication. In this
example, GRAIN_SIZE is too small for the available MPI communication throughput. As a result, due to message
contention on the boss process, some workers never receive work and remain idle for the duration of the calculation.


4.7.5 Multi-threading within MPI Processes


In some applications it is beneficial to use OpenMP or Intel Cilk Plus for multi-threading inside of MPI
processes. The degree of such nested OpenMP parallelism can vary from just two threads per MPI process to
having only one MPI process per processor or coprocessor, or even one MPI process per compute node. In the
latter case, the single MPI process can use OpenMP to utilize all cores on the host and, in addition, perform
offload to utilize all available coprocessors.
In the case of the Monte Carlo code with dynamic load scheduling, we expect that inter-operation of
OpenMP and MPI will improve the performance by reducing the number of workers. That will allow us to use
larger chunk sizes for scheduling and thus reduce the amount of MPI communication. In order to use OpenMP
in the Monte Carlo code for the calculation of π, only the worker code must be modified. The code of the boss
process may remain the same. Listing 4.83 shows a hybrid OpenMP/MPI implementation of the application.

1  if (rank != 0) {
2    // Worker performs the Monte Carlo calculation
3
4    // Create and initialize a random number generator from MKL
5    VSLStreamStatePtr stream[omp_get_max_threads()];
6  #pragma omp parallel
7    {
8      // Each thread gets its own random seed
9      const int randomSeed = nTrials*omp_get_thread_num()*nRanks + nTrials*rank + t;
10     vslNewStream(&stream[omp_get_thread_num()], VSL_BRNG_MT19937, randomSeed);
11   }
12
13   msg[0] = 0;
14   while (msg[0] >= 0) {
15     // Receive from boss the range of blocks to process
16     MPI_Send(&rank, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
17     MPI_Recv(&msg, 2, MPI_LONG, 0, rank, MPI_COMM_WORLD, &stat);
18     const long myFirstBlock = msg[0];
19     const long myLastBlock  = msg[1];
20
21 #pragma omp parallel
22     {
23       float r[BLOCK_SIZE*2] __attribute__((align(64)));
24       const int myThread = omp_get_thread_num();
25 #pragma omp for schedule(dynamic) reduction(+: dUnderCurve)
26       for (long j = myFirstBlock; j < myLastBlock; j++) {
27         vsRngUniform( 0, stream[myThread], BLOCK_SIZE*2, r, 0.0f, 1.0f );
28         for (long i = 0; i < BLOCK_SIZE; i++) {
29           const float x = r[i];
30           const float y = r[i+BLOCK_SIZE];
31           // Count points inside quarter circle
32           if (x*x + y*y < 1.0f) dUnderCurve++;
33         }
34       }
35     }
36   }
37
38 #pragma omp parallel
39   {
40     vslDeleteStream( &stream[omp_get_thread_num()] );
41   }
42 }

Listing 4.83: Worker code for hybrid MPI/OpenMP Monte Carlo calculation of the number π.


Note that in the thread-parallel implementation of the worker code, we make sure to assign an individual
RNG to every thread. We also re-use the RNG for all chunks of work that the worker processes. This is
preferable to initializing a new RNG for every chunk, because the overhead of initialization and the potential
cross-correlation between random streams with different seeds are undesirable.
Listing 4.84 demonstrates how to compile and run the hybrid code using 16 threads per process on
the host and on the coprocessor. We could choose any other number of threads per process, but to avoid
over-subscribing the system, the product of the number of processes and the number of threads should be
equal to the number of logical cores on the respective device. Therefore, we launch two 16-threaded workers
on the host (2 × 16 = 32) and fifteen 16-threaded workers on the coprocessor (15 × 16 = 240).

user@host% mpiicpc -mkl -openmp -o pi-dynamic-hybrid-host pi-dynamic-hybrid.cc
user@host% mpiicpc -mkl -openmp -mmic -o pi-dynamic-hybrid-mic pi-dynamic-hybrid.cc
user@host% scp pi-dynamic-hybrid-mic mic0:~/
pi_mpi.mic 100% 352KB 351.7KB/s 00:00
user@host% export I_MPI_MIC=1
user@host% export GRAIN_SIZE=8192
user@host% mpirun \
  -host localhost -np 1 -env I_MPI_PIN 0 ./pi-dynamic-hybrid-host : \
  -host localhost -np 2 -env OMP_NUM_THREADS 16 ./pi-dynamic-hybrid-host : \
  -host mic0 -np 15 -env OMP_NUM_THREADS 16 ~/pi-dynamic-hybrid-mic
# pi Rel.err Time, s GrainSize
3.14156822 -7.8e-06 0.781 16384
3.14159674 1.3e-06 0.312 16384
3.14159602 1.1e-06 0.307 16384
3.14157067 -7.0e-06 0.308 16384
3.14157802 -4.7e-06 0.307 16384
3.14158904 -1.2e-06 0.308 16384
3.14159922 2.1e-06 0.309 16384
3.14157800 -4.7e-06 0.308 16384
3.14160929 5.3e-06 0.308 16384
3.14154699 -1.5e-05 0.308 16384
user@host%
Listing 4.84: Compiling and running the Monte Carlo calculation of π with dynamic load scheduling and OpenMP in worker code. Starting one unpinned boss process on the host, two 16-threaded worker processes on the host and fifteen 16-threaded workers on the coprocessor.



The execution time in this hybrid OpenMP/MPI code is 0.31 s, which is significantly better than the
best case of 0.40 s with single-threaded MPI processes. The runtime achieved here is close to the theoretical
minimum runtime of 0.29 s.


In order to assess the range of the values of GRAIN_SIZE in which the performance is near the optimal
value, we performed benchmarks with multiple values of this parameter. For each value of GRAIN_SIZE, we
ran the code in four configurations:
1. Single-threaded MPI processes. In this case, 32 worker processes are launched on the host and 240
processes on the coprocessor. This is equivalent to the case discussed in Section 4.7.4.
2. MPI processes with 4 threads in each. In this setup, 8 worker processes run on the host and 60 worker
processes on the coprocessor.
3. MPI processes with 16 threads in each. This corresponds to 2 workers on the host and 15 workers on
the coprocessor.
4. A single MPI worker on the host with 32 threads and a single worker on the coprocessor with 240
threads.

The results of this benchmark are shown in Figure 4.30.
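For reference, the second configuration (4 OpenMP threads per MPI process) can be launched with a command line along the following lines; this is only a sketch that follows the pattern of Listing 4.84, with the same binary names and an illustrative grain size:

user@host% export GRAIN_SIZE=1024
user@host% mpirun \
  -host localhost -np 1 -env I_MPI_PIN 0 ./pi-dynamic-hybrid-host : \
  -host localhost -np 8 -env OMP_NUM_THREADS 4 ./pi-dynamic-hybrid-host : \
  -host mic0 -np 60 -env OMP_NUM_THREADS 4 ~/pi-dynamic-hybrid-mic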

g
an
W
2.0 Performance of Heterogeneous Hybrid Monte Carlo Calculation of π with Dynamic Scheduling

ng
Single-threaded MPI processes
4 OpenMP threads per MPI process

e
nh
16 OpenMP threads per MPI process
One 32-threaded MPI process on host,
Yu
1.5
one 240-threaded process on coprocessor
Time, s (lower is better)

r
fo
d
re

1.0
pa
re
yP

0.5
el

Theoretical best
iv
us
cl

0.0
21 23 25 27 29 211 213 215 217 219
Ex

Grain Size

Figure 4.30

As expected, using multiple OpenMP threads in worker processes improves performance. With 4 threads per worker, the performance is close to the theoretical maximum in the range of GRAIN_SIZE from $2^9$ to $2^{11}$. As the number of threads per worker increases, the window of optimal performance expands and shifts towards greater values of GRAIN_SIZE.


MPI Calls from OpenMP Threads


The inter-operation between MPI and OpenMP demonstrated in Listing 4.83 does not involve MPI calls from a parallel OpenMP region. In that sense, the usage of the MPI library is serial in this code. However, in applications that perform MPI communication from OpenMP threads within MPI processes, special measures must be taken.

1. First of all, the thread-safe version of Intel MPI Library must be linked by using the compiler flag
-mt_mpi.
2. MPI must be initialized with the call MPI_Init_thread(), as shown in Listing 4.85.

1  int required=MPI_THREAD_SERIALIZED;
2  int provided;
3
4  MPI_Init_thread(&argc, &argv, required, &provided);
5
6  if (provided < required){
7    if (rank == 0)
8      printf("Warning: MPI implementation provides insufficient threading support.\n");
9    omp_set_num_threads(1);
10 }

Listing 4.85: Hybrid OpenMP and MPI initialization.



Here, the parameter required can be one of the following:

MPI_THREAD_SINGLE The process is single-threaded.

MPI_THREAD_FUNNELED The process may be multi-threaded, but the application must ensure that only the main thread makes MPI calls.

MPI_THREAD_SERIALIZED The process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time.

MPI_THREAD_MULTIPLE Multiple threads may call MPI, with no restrictions.
The call to MPI_Init_thread() will set the value of the parameter provided to the level of thread support granted for the hybrid OpenMP and MPI model.
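As a minimal, self-contained sketch (not one of the book's listings), the following program requests MPI_THREAD_SERIALIZED, falls back to a single OpenMP thread per process if the requested level is not granted, and performs all MPI calls outside of the parallel region (which would be sufficient even with MPI_THREAD_FUNNELED). Such a code would be compiled with mpiicpc using the flags -mt_mpi and -openmp mentioned above.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int provided, rank;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (provided < MPI_THREAD_SERIALIZED)
    omp_set_num_threads(1);  // fall back to one thread per process

  long localSum = 0, globalSum = 0;
#pragma omp parallel reduction(+: localSum)
  {
    localSum += omp_get_thread_num() + 1;  // placeholder for thread-local work
  }

  // Only the main thread calls MPI here
  MPI_Reduce(&localSum, &globalSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) printf("Global sum = %ld\n", globalSum);
  MPI_Finalize();
  return 0;
}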


4.7.6 Load Balancing with Guided Scheduling


Load balancing with dynamic scheduling, as shown in Section 4.7.4 and Section 4.7.5, can provide good
performance in a broad range of values of GRAIN_SIZE, especially if multi-threaded MPI processes are
used to reduce the amount of MPI communication for scheduling. Dynamic work scheduling is an attractive
option for applications that cannot use static scheduling either because precise calibration of load distribution
is not practical, or because the different chunks of the problem take different amounts of time to compute due
to iterative or stochastic nature of the algorithm. However, with dynamic scheduling, the need to tune the
application by finding the window of optimal values of GRAIN_SIZE may be perceived as a limitation. In
this section we will demonstrate how to improve the scheduling algorithm in order to expand the window of
optimal values of GRAIN_SIZE.
The improved scheduling method presented here reduces the amount of communication by gradually
changing the size of chunks. The boss will begin with a large chunk size and gradually decrease it as the
calculation progresses. We will still use the parameter GRAIN_SIZE, but in this case it controls the lower
bound on the allowed chunk size. Continuing the analogy with OpenMP loop scheduling mode, we call this
method guided scheduling.

Listing 4.86 demonstrates the implementation of the boss process that performs load balancing with the
guided scheduling algorithm.

1  if (rank == 0) {
2    // Boss assigns work
3    const char* grainSizeSt = getenv("GRAIN_SIZE");
4    if (grainSizeSt == NULL) { printf("GRAIN_SIZE undefined\n"); exit(1); }
5    grainSize = atof(grainSizeSt);
6    long currentBlock = 0;
7    while (currentBlock < nBlocks) {
8      // Chunk size is proportional to the number of unassigned blocks
9      // divided by the number of workers...
10     long chunkSize = ((nBlocks-currentBlock)/(nRanks-1))/2;
11     // ...but never smaller than GRAIN_SIZE
12     if (chunkSize < grainSize) chunkSize = grainSize;
13     msg[0] = currentBlock;                   // First block for next worker
14     msg[1] = currentBlock + chunkSize;       // Last block
15     if (msg[1] > nBlocks) msg[1] = nBlocks;  // Stay in range
16
17     // Wait for next worker
18     MPI_Recv(&worker, 1, MPI_INT,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD, &stat);
19
20     // Assign work to next worker
21     MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
22
23     currentBlock = msg[1]; // Update position
24   }
25
26   // Terminate workers
27   msg[0] = -1; msg[1] = -2;
28   for (int i = 1; i < nRanks; i++) {
29     MPI_Recv(&worker, 1, MPI_INT,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD, &stat);
30     MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
31   }
32
33 } else {
34   // ...Worker code...
35 }

Listing 4.86: Boss worker code with guided work scheduling.


Quantitatively, the chunk size for any interaction with a worker is chosen using the following equation:
\[ B_\mathrm{chunk} = \max\left(\mathrm{GRAIN\_SIZE},\; \eta\,\frac{B_\mathrm{total} - B_\mathrm{scheduled}}{P_\mathrm{total}}\right) , \qquad (4.17) \]
where $B_\mathrm{chunk}$ is the number of blocks in the chunk assigned to the next worker, $B_\mathrm{total}$ is the total number of blocks in the iteration space, $B_\mathrm{scheduled}$ is the number of blocks already scheduled for processing by other workers, and $P_\mathrm{total}$ is the number of workers. The coefficient η is another parameter of the algorithm. If the ratio of the performance of the fastest worker to the performance of the slowest worker does not exceed the number of workers, then this coefficient can be set to η = 1. For a greater difference between the performance of individual workers, this parameter can be set to a value between 0 and 1. For the Monte Carlo calculation of π, we chose η = 0.5. With this value of η, even with only two workers, load balancing is achieved.
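As a quick illustration of Equation (4.17) with the configuration used in this section ($P_\mathrm{total} = 272$ workers, $B_\mathrm{total} = 2^{20}$ blocks, η = 0.5): the very first chunk contains approximately $0.5 \cdot 2^{20}/272 \approx 1.9\cdot10^3$ blocks, the next chunk is slightly smaller, and the chunk size keeps shrinking as the calculation progresses until it reaches the lower bound GRAIN_SIZE.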
We benchmarked the Monte Carlo calculation of π with the guided work scheduling scheme shown
in Listing 4.86 for a range of values of GRAIN_SIZE. For each value of GRAIN_SIZE, four cases were
studied, just as in Section 4.7.5: single-threaded workers, 4-threaded workers, 16-threaded workers and the
case with one 32-threaded worker on the host and one 240-threaded worker on the coprocessor. The results of the benchmark are shown in Figure 4.31.

Figure 4.31: Performance of the heterogeneous hybrid Monte Carlo calculation of π with guided scheduling: total time as a function of the grain size (from $2^1$ to $2^{19}$) for single-threaded MPI processes, 4 and 16 OpenMP threads per MPI process, and one 32-threaded process on the host with one 240-threaded process on the coprocessor, compared to the theoretical best time.

As expected, guided scheduling allows the heterogeneous application to utilize resources optimally in a
wide range of values of GRAIN_SIZE. In fact, the amount of MPI communication is reduced so much that
even GRAIN_SIZE= 1 achieves the theoretical best performance if a sufficient number of threads is used.


4.7.7 Load Balancing with Work Stealing


In addition to static, dynamic and guided scheduling methods, the algorithm of "work stealing" can be used to balance the load in an MPI application. This is the method used for loop scheduling in Intel Cilk Plus.
We do not provide an implementation of this method in MPI, leaving it as an exercise for the reader. However,
we will discuss the algorithm in this section.
The greatest difference between the “work stealing” algorithm and the “boss-worker” method is the
absence of a dedicated “boss” process. Each worker has a queue of work items that it needs to process. When
worker A has finished its queue of work, it randomly chooses another worker B. If worker B still has work
items in the queue, then worker A “steals” a part of this queued work.
The tuning of the “work stealing” method on a heterogeneous architecture involving Intel Xeon processors
and Intel Xeon Phi coprocessors may require:
a) Deciding how often workers must check whether another worker is trying to steal work items from them. Checking whether the MPI process has received any messages may be done using the function MPI_Probe (a minimal sketch is given after this list).
b) Choosing how much work must be stolen in each transaction.
c) Tuning the criterion based on which workers decide to stop contacting other workers and exit the calculation. This involves designing an algorithm that propagates the information about work completion across the MPI world.
d) Optimizing the selection of the worker to steal work from. In a heterogeneous architecture, it may be more efficient to contact workers from another platform with a greater probability, because load imbalance is more likely to occur because of the performance differences between the different platforms.
e) Instrumenting learning or another type of dynamic feedback to adjust the parameters of the work stealing algorithm for subsequent iterations.
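The fragment below is a minimal, hypothetical sketch of item (a) only, not a complete work-stealing implementation: the function name ServeStealRequests, the tag STEAL_REQUEST_TAG and the queue bounds are assumptions introduced for illustration. A worker would call such a function between blocks of work; MPI_Iprobe, the non-blocking counterpart of MPI_Probe, checks for a pending steal request, and, if one is found, the worker hands over the upper half of its remaining block range.

#include <mpi.h>

#define STEAL_REQUEST_TAG 99  // assumed message tag for steal requests

// Hypothetical helper: the worker's remaining work is the block range [*queueHead, *queueTail)
void ServeStealRequests(long *queueHead, long *queueTail) {
  int pending = 0;
  MPI_Status stat;
  MPI_Iprobe(MPI_ANY_SOURCE, STEAL_REQUEST_TAG, MPI_COMM_WORLD, &pending, &stat);
  if (pending) {
    int thief;
    MPI_Recv(&thief, 1, MPI_INT, stat.MPI_SOURCE, STEAL_REQUEST_TAG,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    long msg[2];
    msg[0] = (*queueHead + *queueTail) / 2;  // first block handed to the thief
    msg[1] = *queueTail;                     // one past the last block handed over
    *queueTail = msg[0];                     // shrink this worker's own queue
    MPI_Send(msg, 2, MPI_LONG, thief, STEAL_REQUEST_TAG, MPI_COMM_WORLD);
  }
}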


re
yP
el
iv
us
cl
Ex

Prepared for Yunheng Wang c Colfax International, 2013


248 CHAPTER 4. OPTIMIZING APPLICATIONS FOR INTEL R XEON R PRODUCT FAMILY

4.8 Using the Intel® MKL


Intel® Math Kernel Library (Intel MKL), first introduced to the public in 2003, is a collection of general
purpose mathematical functions. Core functionality includes Basic Linear Algebra Subprograms (BLAS),
Linear Algebra Package (LAPACK), Scalable Linear Algebra Package (ScaLAPACK), sparse solvers, Fast
Fourier transform, and vector math. Implementations of Intel MKL functions are optimized for Intel Xeon
processors, and a number of functions are also optimized for Intel Xeon Phi coprocessors. The scope of
functions optimized for the MIC architectures is expected to grow with every new release of the library.
Figure 4.32 illustrates the structure and applicability of Intel MKL.

[Figure 4.32 diagram: the functional domains of Intel MKL include Linear Algebra (BLAS, LAPACK, sparse solvers, ScaLAPACK), Fast Fourier Transforms (multidimensional FFT up to 7D, FFTW interfaces, cluster FFT), Vector Math (trigonometric, hyperbolic, exponential, logarithmic, power/root and rounding functions), Vector Random Number Generators (congruential, recursive, Wichmann-Hill, Mersenne Twister, Sobol, Niederreiter, non-deterministic), Summary Statistics (kurtosis, variation coefficient, quantiles and order statistics, min/max, variance-covariance) and Data Fitting (splines, interpolation, cell search); the same source code can target multicore CPUs, Intel Xeon Phi coprocessors, and clusters of multicore and many-core nodes.]

Figure 4.32: Intel MKL structure. Image credit: Intel Corporation.


Earlier in our discussion we have given examples of workloads that use the Intel MKL (see Sections 4.2.4,
4.4.5 and 4.7.1). In this section, we outline the MKL usage models and provide general usage and optimization
advice. Complete documentation on the Intel MKL can be found in the Reference Manual [54].
We discuss the Intel MKL version 11.0 for Linux* OS. It supports computation on Intel Xeon Phi coprocessors in three modes of operation, discussed in detail in this section:
1. Automatic Offload (AO)

• No code change is required in order to offload calculations to an Intel Xeon Phi coprocessor;
• Automatically uses both the host and the Intel Xeon Phi coprocessor;
• The library takes care of data transfer and execution management.
2. Compiler Assisted Offload (CAO)

• Programmer maintains explicit control of data transfer and remote execution, using compiler
offload pragmas and directives;

• Can be used together with Automatic Offload.

3. Native Execution

• Uses an Intel Xeon Phi coprocessor as an independent compute node.

• Data is initialized and processed on the coprocessor or communicated via MPI.
The operation modes discussed above enable heterogeneous computing, which takes advantage of both the multi-core host system and many-core Intel Xeon Phi coprocessors. The choice of operation mode makes it possible to run previously developed legacy code employing the Intel MKL without modification, or to retain fine control over the compute devices, if such an approach is required by the problem.


4.8.1 Functions Offered by MKL


The Intel MKL includes the following groups of routines:

• Basic Linear Algebra Subprograms (BLAS):
  – vector operations
  – matrix-vector operations
  – matrix-matrix operations
• Sparse BLAS Level 1, 2, and 3 (basic operations on sparse vectors and matrices)
• LAPACK routines for solving systems of linear equations
• LAPACK routines for solving least squares problems, eigenvalue and singular value problems, and Sylvester’s equations
• Auxiliary and utility LAPACK routines
• ScaLAPACK computational, driver and auxiliary routines (only in Intel MKL for Linux* and Windows* operating systems)
• PBLAS routines for distributed vector, matrix-vector, and matrix-matrix operations
• Direct and Iterative Sparse Solver routines
• Vector Mathematical Library (VML) functions for computing core mathematical functions on vector arguments (with Fortran and C interfaces)
• Vector Statistical Library (VSL) functions for generating vectors of pseudorandom numbers with different types of statistical distributions and for performing convolution and correlation computations
• General Fast Fourier Transform (FFT) Functions, providing fast computation of the Discrete Fourier Transform via the FFT algorithms and having Fortran and C interfaces
• Cluster FFT functions (only in Intel MKL for Linux* and Windows* operating systems)
• Tools for solving partial differential equations: trigonometric transform routines and Poisson solver
• Optimization solver routines for solving nonlinear least squares problems through trust region algorithms and computing the Jacobi matrix by central differences
• Basic Linear Algebra Communication Subprograms (BLACS) that are used to support a linear algebra oriented message passing interface
• Data fitting functions for spline-based approximation of functions, derivatives and integrals of functions, and search
• GNU Multiple Precision (GMP) arithmetic functions

4.8.2 Linking Applications with Intel MKL. Link Line Advisor


Generally, in order to compile applications using the Intel MKL with the Intel C++ Compiler, the
command line argument -mkl must be specified, and MKL header files must be included in the source code
in order to declare the functions and data types used in the application.
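As a minimal illustration (the file name and the choice of routine are arbitrary, not taken from this book's examples), a tiny host program calling the MKL random number generator used throughout this chapter could look as follows and be built with a single command:

// minimal_mkl.cc (hypothetical file name)
#include <mkl_vsl.h>
#include <stdio.h>

int main() {
  const int n = 8;
  float r[n];
  VSLStreamStatePtr stream;
  vslNewStream( &stream, VSL_BRNG_MT19937, 777 );  // create an MKL RNG stream
  vsRngUniform( 0, stream, n, r, 0.0f, 1.0f );     // generate n uniform random numbers
  vslDeleteStream( &stream );
  for (int i = 0; i < n; i++) printf("%f\n", r[i]);
  return 0;
}

user@host% icpc -mkl -o minimal_mkl minimal_mkl.cc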
When the application using Intel MKL is run in cluster environments, cross-compiled, or compiled with a
non-Intel compiler, it may be difficult to determine the set of compiler arguments. In order to assist users with
this problem, the Intel MKL Link Line Advisor1 can be used [71]. The Advisor is an interactive Web page,
which requests information about your system and how you intend to use the Intel MKL (link dynamically
or statically; use threaded or serial mode; use of OpenMP, MPI, and other libraries). Using this information,
the tool automatically generates the appropriate set of compiler and linker arguments.
Figure 4.33 illustrates the interface of the Intel MKL Link Line Advisor.


Figure 4.33: Web interface of the Intel MKL Link Line Advisor.

1 http://software.intel.com/sites/products/mkl/MKL_Link_Line_Advisor.html


4.8.3 Automatic Offload


For an application that already uses Intel MKL for calculations on the host system, the easiest way to launch calculations on an Intel Xeon Phi coprocessor is using the Automatic Offload (AO) mode. In order to do that, AO must be enabled either by setting an environment variable, or by calling the respective support function, as shown in Listing 4.87. In this and all other examples in this section, the function call overrides the environment variable setting.

C/C++ function call:
1 mkl_mic_enable();

Set an environment variable:
user@host% export MKL_MIC_ENABLE=1

Listing 4.87: Two ways to enable the Intel MKL Automatic Offload (AO).

In order for AO to work, the application must be compiled using the Intel C++ Compiler with support for the Intel Xeon Phi architecture. Nothing else needs to be done to use the coprocessor. The library will automatically detect available coprocessors, decide when it is beneficial to offload calculations to the coprocessor, transfer the data over the PCIe bus, and initiate offloaded computation on the coprocessor.
In order to disable AO after it was previously enabled, use the corresponding support function call or environment variable, as shown in Listing 4.88.


C/C++ function call:
1 mkl_mic_disable();

Set an environment variable:
user@host% export MKL_MIC_ENABLE=0

Listing 4.88: Two ways to disable the Intel MKL Automatic Offload (AO).
For some functions, users can control the amount of work that must be performed on the host and on the coprocessor in the AO mode. This can be done by setting an environment variable, or calling the respective function, as shown in Listing 4.89.

C/C++ function call:
1 mkl_mic_set_workdivision(
2   MKL_TARGET_MIC, 0, 0.5)

Set an environment variable:
user@host% export MKL_MIC0_WORKDIVISION=50

Listing 4.89: Offload 50% of computation to the first Intel Xeon Phi coprocessor. Note: The support function calls take
precedence over environment variables.

The third argument of the function mkl_mic_set_workdivision() is the fraction of the work to
be performed on the coprocessor (from 0 to 1), and the value of the environment variable
MKL_MIC<card_number>_WORKDIVISION is the percentage (from 0 to 100). Work is measured in
floating-point operations.


Example
Listing 4.90 demonstrates the usage of automatic offload in an application using the Intel MKL.

1  double op_count = (2.0*SIZE*SIZE*SIZE + 1.0*SIZE*SIZE);
2  double Flops = op_count/time_avg;
3
4  printf("\t Enabling Automatic Offload\n");
5  /* Alternatively, one could set environment variable MKL_MIC_ENABLE=1 */
6  if (mkl_mic_enable() != 0) // function call returns 0 if AO enables successfully
7  {
8    printf("Could not enable Automatic Offload - no MIC devices? Exiting \n");
9    return -1;
10 } else {
11
12   /************************* AO executing *****************************/
13   printf("\t ========= executing in Automatic Offload mode ========= \n");
14   sgemm(&transa, &transb, &SIZE, &SIZE, &SIZE, &alpha,
15         A, &newLda, B, &newLda, &beta, C, &SIZE);
16   double time_start_AO = dsecnd();
17   for( k = 0; k < LOOP_COUNT; k++)
18   {
19     sgemm(&transa, &transb, &SIZE, &SIZE, &SIZE, &alpha,
20           A, &newLda, B, &newLda, &beta, C, &SIZE);
21   }
22   double time_end_AO = dsecnd();
23   double time_avg_AO = ( time_end_AO - time_start_AO )/LOOP_COUNT;
24   op_count = (2.0*SIZE*SIZE*SIZE + 1.0*SIZE*SIZE);
25   double Flops_AO = op_count/time_avg_AO;
26   printf("\t size == %d, GFlops == %.3f \n", SIZE, Flops_AO/1000000000 );
27 }

Listing 4.90: Fragment of Automatic Offload code with the sgemm function call from Intel MKL with corresponding
yP

performance calculations.




4.8.4 Compiler-Assisted Offload


It is possible to offload Intel MKL functions to the coprocessor using #pragma offload. This
approach, known as Compiler Assisted Offload (CAO), requires that the user take care of data transfer to the
coprocessor. The benefit of CAO is more fine-grained control over data traffic and compute device usage
than with AO. For instance, when memory retention or data persistence on the coprocessor is possible (see
Section 4.6), CAO may produce better results than AO.
Listing 4.91 demonstrates calling the SGEMM routine using CAO. Listing 4.92 demonstrates using data
persistence on the coprocessor with CAO.

#pragma offload target(mic) \
  in(transa, transb, N, alpha, beta) \
  in(A:length(matrix_elements)) \
  in(B:length(matrix_elements)) \
  out(C:length(matrix_elements) alloc_if(0))
{
  sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}

Listing 4.91: C/C++ example of Intel MKL Compiler Assisted Offload usage.

__attribute__((target(mic))) static float *A, *B, *C, *C1;
// Allocate matrices

// Transfer matrices A, B, and C to coprocessor and do not deallocate matrices A and B
#pragma offload target(mic) \
  in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
  in(A:length(NCOLA * LDA) free_if(0)) \
  in(B:length(NCOLB * LDB) free_if(0)) \
  out(C:length(N * LDC) free_if(0))
{
  sgemm(&transa, &transb, &M, &N, &K, &alpha,
        A, &LDA, B, &LDB, &beta, C, &LDC);
}

// Re-use the data of matrix A on the coprocessor (data persistence)
// and re-use the memory allocated for B and C (memory retention)
#pragma offload target(mic) \
  in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
  nocopy(A: length(NCOLA * LDA) alloc_if(0) free_if(0)) \
  in(B: length(NCOLB * LDB) alloc_if(0) free_if(0)) \
  out(C: length(N * LDC1) into(C1) alloc_if(0) free_if(0))
{
  sgemm(&transa1, &transb1, &M, &N, &K, &alpha1,
        A, &LDA, B, &LDB, &beta1, C, &LDC1);
}

// Deallocate A, B and C on the coprocessor
#pragma offload target(mic) \
  nocopy(A:length(NCOLA * LDA) free_if(1)) \
  nocopy(B:length(NCOLB * LDB) free_if(1)) \
  nocopy(C:length(NCOLC * LDC) free_if(1))
{ }

Listing 4.92: C/C++ example of Intel MKL Data Persistence at Compiler Assisted Offload.




4.8.5 Native Execution


As discussed in Section 2.1, applications for native execution on Intel Xeon Phi coprocessors can be built
with the compiler option -mmic. In order to use Intel MKL in a native application, an additional argument
-mkl is required. Native applications with Intel MKL functions operate just like native applications with
user-defined functions. In MPI applications where MPI processes run on the host as well as on coprocessors,
the code for the coprocessor part is compiled as a native application.
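For example, a minimal sketch of this workflow follows (the file name my-mkl-app.c is illustrative, and it is assumed that the coprocessor-side Intel MKL libraries are reachable on the card, e.g., through an NFS share or SINK_LD_LIBRARY_PATH as discussed in the practical exercises of Appendix A):

user@host% icpc -mmic -mkl -o my-mkl-app.MIC my-mkl-app.c
user@host% scp my-mkl-app.MIC mic0:~/
user@host% ssh mic0
user@mic0% ./my-mkl-app.MIC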

4.8.6 General Performance Considerations for Applications Using Intel MKL


As previously described in Sections 4.4.5 and 4.6.3, thread affinity and TLB page size must be controlled
in order to achieve best performance on the coprocessor. For compute-bound tasks (e.g., dense linear algebraic
operations), affinity of type compact or balanced may significantly increase performance; for bandwidth-
bound calculations, affinity of type scatter and using half of all available logical cores is usually better. It
is also advisable to avoid using the µOS cores of Intel Xeon Phi coprocessors, which handle data transfer and
housekeeping tasks. Using large TLB pages (2 MB) helps applications with large data sets accessed in a
regular pattern.
For instance, on a 60-core coprocessor, the procedure shown in Listing 4.93 may produce better results
for SGEMM calculation than the default execution in the AO mode.
user@host% export MIC_OMP_NUM_THREADS=236
user@host% export MIC_ENV_PREFIX=MIC
user@host% export MIC_KMP_AFFINITY=compact,granularity=fine
user@host% export MIC_PLACE_THREADS=59C,4t
user@host% export MIC_USE_2MB_BUFFERS=16K
user@host% icpc -o my-sgemm -mkl my-sgemm.cc
user@host% ./my-sgemm

Listing 4.93: Optimizing execution parameters for an AO application with SGEMM calculation on a coprocessor.
For native applications, the prefix defined by MIC_ENV_PREFIX does not affect environment variable sharing.
In order to set environment variables for native applications, use one of the following methods:

a) for native applications launched from a shell on the coprocessor (i.e., when you SSH into the µOS), use the
shell command export. Example:

user@host% scp my-native-application mic0:~/
my-native-application 100% 101KB 100.4KB/s 00:00
user@host% ssh mic0
user@mic% export OMP_NUM_THREADS=120
user@mic% ./my-native-application

b) for applications launched with the tool micnativeloadex, use the argument -e to pass an environment
variable to the coprocessor. Example:

user@host% micnativeloadex ./my-native-application -e "OMP_NUM_THREADS=120"

c) for MPI applications launched with micrun, use the argument -env to pass environment variables:

user@host% micrun \
  -host mic0 -env OMP_NUM_THREADS 120 -np 1 ./my-native-application : \
  -host mic1 -env OMP_NUM_THREADS 120 -np 1 ./my-native-application




Chapter 5

Summary and Resources

Thank you for learning about Intel Xeon Phi coprocessor programming with “Parallel Programming and
Optimization with Intel® Xeon Phi™ Coprocessors” by Colfax International! We hope that, whatever scope
of information you were looking for, you were able to find answers or pointers in this book. In this last brief
chapter, we will summarize the key findings of our experience with Intel Xeon Phi coprocessor programming,
and provide references for future learning.
5.1 Programming Intel® Xeon Phi™ Coprocessors is Not Trivial, but Offers Double Rewards

Computing accelerators such as GPGPUs and Intel Xeon Phi coprocessors will be extremely important
in the future on all levels of high performance computing, from workstation to exascale. The launch of
the Intel Xeon Phi product family changed the landscape of computing accelerators by offering developers
something new that GPGPUs cannot match. However, it is important to realize that this novelty is not the ease
of programming. It is not trivial to achieve good performance with Intel Xeon Phi coprocessors, especially
when one compares it to the performance of modern Intel Xeon processors with the Sandy Bridge architecture.
The new truth that HPC programmers must learn is: if a parallel code does not perform fast on Intel Xeon Phi
coprocessors, it probably is not doing very well on Intel Xeon processors, either. The flip side of this truth
is that when developers invest time and effort into optimizing for the many-core architecture, they also reap
performance benefits on multi-core processors. In this sense, optimization for the Intel MIC platform yields
“double rewards” by also tapping more performance from the host system. That said, we concur with Intel’s
James Reinders, who expresses the “double advantage” in this way [2]:

The single most important lesson from working with Intel Xeon Phi coprocessors is this: the best way to prepare for
Intel Xeon Phi coprocessors is to fully exploit the performance that an application can get on Intel Xeon processors
first. Trying to use an Intel Xeon Phi coprocessor, without having maximized the use of parallelism on Intel Xeon
processor, will almost certainly be a disappointment.
...
The experiences of users of Intel Xeon Phi coprocessors . . . point out one challenge: the temptation to stop tuning before
the best performance is reached. . . . There ain’t no such thing as a free lunch! The hidden bonus is the “transforming-
and-tuning” double advantage of programming investments for Intel Xeon Phi coprocessors that generally applies
directly to any general-purpose processor as well. This greatly enhances the preservation of any investment to tune
working code by applying to other processors and offering more forward scaling to future systems.




In the programming and optimization examples presented throughout this book, we strived to convey
two important messages:

1) Optimization methods that benefit applications for Intel Xeon Phi coprocessors usually also improve
performance on Intel Xeon processors, and vice-versa. Consequently, an attractive feature of Intel Xeon
Phi coprocessors as accelerators is that the developer may write and optimize the computational kernel
code only once to run on the host system as well as on the target coprocessor.
2) High performance can be achieved by relying on automatic vectorization in the Intel C++ Compiler and
traditional parallel frameworks such as OpenMP and MPI. This means that

a) “Ninja programming” (i.e., low-level optimization that may involve assembly coding or the usage of
intrinsics) is not necessary for high performance with Intel Xeon Phi coprocessors [72]. In fact, tradi-
tional programming methods can lead to good performance, if the programmer follows the guidelines
for developing vectorizable, scalable code with data locality and infrequent synchronization.
b) A single source code can be used on today’s Intel Xeon processors, Intel Xeon Phi coprocessors, and
future technologies based on x86-like architecture. In this sense, applications designed for the MIC
architecture using common programming methods are “future-proof”.

We do hope that your experience with the adoption of Intel Xeon Phi coprocessors is as intellectually
stimulating and enjoyable as it has been for us.

5.2 Practical Training



Colfax International is ready to offer you the opportunity to try using Intel Xeon Phi coprocessors
and Intel software development tools. You can get access to computing systems equipped with Intel Xeon
processors and Intel Xeon Phi coprocessors, and Intel software development tools by participating in the
Colfax Developer Training, for which this book was written. The training is available in two formats:

1) Self-study course with remote access to computing systems hosted by Colfax International, and

2) Instructor-led classes, which can be taken at Colfax’s location, or brought to your company’s offices.

For information on booking the training, please visit http://www.colfax-intl.com/xeonphi/training.html




5.3 Additional Resources


Books
We can recommend the following books for additional information on parallel programming and the
Intel MIC architecture.

1) Another perspective on programming for the MIC architecture, more examples of high performance
codes, and best practices advice from Intel’s senior engineers Jim Jeffers and James Reinders can be
found in “Intel Xeon Phi Coprocessor High Performance Programming” [35]. The book has a Web site at
http://www.lotsofcores.com/ [36]

2) For a solid foundation of traditional parallel programming methods with OpenMP and MPI, refer to
“Parallel Programming in C with MPI and OpenMP” by Michael J. Quinn [39].
3) A new look at parallel algorithms and novel parallel frameworks are presented in “Structured Parallel
Programming: Patterns for Efficient Computation” by Michael McCool, Arch D. Robison and James
Reinders [37]. The Web site of the book is http://parallelbook.com/ [38].

4) In order to gain a better understanding of computer architecture in general, and specifically the architecture
of Intel Xeon Phi coprocessors, refer to “Computer Architecture: A Quantitative Approach” by John L.
Hennessy and David A. Patterson [1] and “Intel Xeon Phi Coprocessor System Software Developer’s
Guide”, a publication by Intel [73].
r Yu
Reference Guides

The following list is a collection of URLs for software development tool and programming framework
reference guides.

1. Intel C++ Compiler User and Reference Guide [20]:
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/cpp-lin/index.htm

2. Intel VTune Amplifier XE User’s Guide [74]:
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/lin/ug_docs/index.htm

3. Intel Trace Analyzer and Collector Reference Guide [75], [76]:
http://software.intel.com/sites/products/documentation/hpc/ics/itac/81/ITA_Reference_Guide/index.htm
http://software.intel.com/sites/products/documentation/hpc/ics/itac/81/ITC_Reference_Guide/index.htm

4. Intel MPI Library Reference Manual [6]:
http://software.intel.com/sites/products/documentation/hpc/ics/impi/41/lin/Reference_Manual/index.htm

5. MPI routines on the ANL Web site [40]:
http://www.mcs.anl.gov/research/projects/mpi/www/

6. OpenMP specifications [31]:
http://openmp.org/wp/openmp-specifications/




Online Resources
1) The Intel Developer Zone has a portal for Intel Xeon Phi coprocessor developers with white papers, links
to products, forums and case studies, and other essential information [77]:
http://software.intel.com/mic-developer

2) This online article submitted by Intel’s Technical Consulting Engineer Wendy Doerner contains a wealth of
information on optimization for Intel Xeon Phi coprocessors in the form of blog posts, white papers and
presentation slides [78]:
http://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture

3) The YouTube channel Intel Software TV has published a set of video tutorials on Intel Xeon Phi coprocessor
programming [79]:
http://www.youtube.com/playlist?list=PLg-UKERBljNwuVuid_rhZ1yVUrTjC3gzx

Community Support

1) The forum “Intel Many Integrated Core Architecture” in the Intel Developer Zone is a great place to ask
questions and exchange ideas [80]:
http://software.intel.com/en-us/forums/intel-many-integrated-core
This forum gets contributions from developers working with Intel Xeon Phi coprocessors, and it is also
monitored by Intel’s engineers involved in the development of the MIC architecture.

2) Another forum in the Intel Developer Zone, “Threading on Intel Parallel Architectures” [81]
http://software.intel.com/en-us/forums/threading-on-intel-parallel-architectures
is a good place to communicate with peers about parallel programming, not necessarily in the context of
the MIC architecture.

3) Find connections and stay updated on the latest news related to the MIC technology by joining the LinkedIn
group “Parallel Computing with Intel Xeon Phi Coprocessors” [82]
http://www.linkedin.com/groups/Parallel-Computing-Intel-Xeon-Phi-4722265/about

Contact Us

If you have questions, ideas, suggestions, corrections, or need information about purchasing or test-
driving computing systems with Intel Xeon Phi coprocessors, please refer to:

a) the Colfax International Web site http://colfax-intl.com/


b) the information page for Intel Xeon Phi coprocessors http://colfax-intl.com/xeonphi/
c) or contact us at the following email address: [email protected].




Appendix A

Practical Exercises

A.1 Exercises for Chapter 1:

A.1.1 Power Management and Resource Monitoring

The following practical exercises and questions correspond to the material covered in Chapter 1,
Section 1.4 – Section 1.5 (pages 18 – 31).

Goal

The following practical exercises will show some of the tools we studied in Chapter 1 to get information
about the Intel Xeon Phi coprocessors and to monitor what resources are being used.

Instructions

1. Most of the administrative tools and utilities can be found in the /opt/intel/mic/bin directory.
Check if this location was added to your $PATH environment variable already.

user@host% echo $PATH



Some of these utilities require superuser privileges. If you have not already modified the PATH
environment variable, you should do so now, to facilitate the path lookup for these tools.

user@host% export PATH=$PATH:/opt/intel/mic/bin

The above command will modify the path environment variable only for the current terminal. To apply
these changes to all users, we need to create a script, which will do it automatically. Use su or sudo to
access the system folders.

root@host% echo ’pathmunge /opt/intel/mic/bin’ > /etc/profile.d/micpath.sh


root@host% chmod +x /etc/profile.d/micpath.sh

pathmunge is a function from /etc/profile¹, which will add the path to the environment at
startup. Thus, for the changes to take effect, we need to log out or reload the current profile.
¹ applicable to RHEL*/CentOS/Fedora Linux distributions




user@host% . /etc/profile

Question 1.1.a. What does the Intel MIC architecture stand for?

2. Let’s check if the MPSS service is running. Root privileges should be used.

root@host% service mpss status


mpss is running

Question 1.1.b. What command should be used to start the MPSS service, if it is not running?

3. While testing Intel Xeon Phi coprocessors, it is important to check the temperature regularly.
Question 1.1.c. What utility should be used to find out the current MIC core rail temperature?

4. After the MPSS installation and starting all corresponding services, we can manually connect to the
Intel Xeon Phi coprocessors, or we can run some tests automagically.
Question 1.1.d. What command should be used to run diagnostics on the coprocessors?
5. A memory swapping mechanism has not been implemented for the Intel Xeon Phi coprocessors yet.
Therefore, you should avoid situations which would cause an overflow; otherwise your application will
crash with a runtime error.
Question 1.1.e. How can I tell how much and what type of memory is installed on the Intel Xeon Phi
coprocessor(s)?

6. Intel will provide new versions of the software stack as they are developed and optimized for
performance.
Question 1.1.f. What utility should be used to display the MIC Flash version?

7. Currently up to eight Intel Xeon Phi coprocessors can be installed in one computational node.
Question 1.1.g. What should be done to reboot the second MIC card in a system (if several Intel Xeon
Phi coprocessors are installed)?

8. For highly parallel applications it is useful to know the load of individual cores for diagnostic, testing,
and debugging purposes.
Question 1.1.h. Is there a way I can display the utilization per core on my Intel Xeon Phi coprocessors?

9. Every new version of MPSS usually requires re-flashing the Intel Xeon Phi coprocessors.
Question 1.1.i. If I want to re-flash my Intel Xeon Phi coprocessors with a new flash image, what utility
would I use?

10. Intel Xeon Phi coprocessors can be reconfigured for a specific network configuration, power management
policy, etc.
Question 1.1.j. Once MPSS has been installed, what is the utility that is used to create the coprocessor
configuration files? And where are these configuration files created?




Answers

Answer 1.1.a. MIC = Many Integrated Cores.

Answer 1.1.b. sudo service mpss start

Answer 1.1.c. micsmc in GUI mode, or CLI version: micsmc -t
If only terminal access is available, we still can continuously monitor the temperatures on the Intel Xeon
Phi coprocessor(s) by using the following command:

user@host% watch -n 1 "micsmc -t"

The -n 1 parameter tells watch to run the micsmc -t command every second, thus providing a convenient
temperature monitoring tool.

Answer 1.1.d. miccheck
To check if the Intel Xeon Phi coprocessor is running properly, we can run miccheck, which will test
the standard unit operations.

user@host% miccheck

Answer 1.1.e. micinfo provides memory information

Answer 1.1.f. sudo micinfo or

root@host% micctrl -r
root@host% micctrl -w
root@host% micflash -GetVersion

Answer 1.1.g. sudo micctrl -reboot mic1

Answer 1.1.h. micsmc -cores or ssh mic0; top and press “1"

Answer 1.1.i. micflash -Update <flashImage> -device <deviceID>

Answer 1.1.j. micctrl -initdefaults and micctrl -resetconfig to remove and recreate a
default configuration from the current MIC configuration files, which are stored at
/etc/sysconfig/mic/mic<N>.cfg

A.1.2 Networking on Intel® Xeon Phi™ Coprocessors
The following practical exercises and questions correspond to the material covered in Section 1.5.2 –
Section 1.5.4 (pages 31 – 37).

Goal
The following practical exercises will show communication patterns with the Intel Xeon Phi coprocessors
with SSH, and will provide detailed instruction on using an NFS-shared mount of an Intel MPI folder on the
coprocessors.

Instructions
1. Generate SSH RSA and DSA keys and copy them to the Intel Xeon Phi coprocessor by reinitializing the
configuration files. Before we can do anything with the configuration files, however, we need to stop the
MPSS service, and restart it after we are done.
user@host% ssh-keygen
... follow the instructions ...
user@host% ssh-keygen -t dsa
... follow the instructions ...
user@host% sudo service mpss stop
Shutting down MPSS Stack: [ OK ]
user@host% sudo micctrl --resetconfig
user@host% sudo service mpss start
Starting MPSS Stack: [ OK ]
mic0: online (mode: linux image: /lib/firmware/mic/uos.img)
mic1: online (mode: linux image: /lib/firmware/mic/uos.img)

These actions are required for each newly created user. Once the SSH keys have been created, they will
be copied to the Intel Xeon Phi coprocessor and will be used for remote access via ssh for native mode
execution, copying files to the coprocessor with scp, etc.

2. Login to the Intel Xeon Phi coprocessor with the ssh command and run the following commands to
find the specifications of the device(s).

user@host% cat /etc/hosts
...
172.31.1.1 host-mic0 mic0
172.31.1.254 hostmic0
172.31.2.1 host-mic1 mic1
172.31.2.254 hostmic1
user@host% ssh mic0
user@mic0% cat /proc/cpuinfo | grep "processor" | wc -l
...it will show total number of logical cores on the device
user@mic0% uname -a
Linux host-mic0 2.6.34.11-g65c0cd9 #2 SMP Wed Nov 21 12:43:06 PST 2012 k1om k1om k1om GNU/Linux
user@mic0% less /proc/meminfo
...press "q" to quit.
user@mic0% top
...press "1" for "load per processor" view; press "q" to quit.
user@mic0% route
...shows kernel IP routing table
user@mic0% ifconfig
...
user@mic0% reboot
...Well, let's not use it... the proper way is to use "sudo micctrl -R"

3. Next we will use NFS mount to access /opt/intel/impi, the Intel MPI folder on the host. It will be
needed later on for the Intel MPI labs.
(a) If iptables is enabled, allow traffic on ports 111 and 2049. Otherwise disable it, if it will not
compromise the security of the host.
user@host% sudo service iptables stop




(b) Check the status of services and install/start them, if any of them are stopped or missing.
root@host% sudo yum install nfs-utils
root@host% service rpcbind start
root@host% service nfslock start
root@host% service nfs start

(c) On the host system, modify the /etc/exports file. We assume you have two Intel Xeon Phi
coprocessors, otherwise, use only mic0 settings. Add the following line to the file:
# add this to the /etc/exports file
/opt/intel/impi mic0(rw,no_root_squash) mic1(rw,no_root_squash)

This can be done with your favorite text editor, for instance with vi. Or the following way:

root@host% \
% echo ’/opt/intel/impi mic0(rw,no_root_squash) mic1(rw,no_root_squash)’ \
% >> /etc/exports

Warning: Be sure to use two “greater than" signs to insert the line. If only one “greater than" sign
is used, you will overwrite the contents of the file!

(d) Add the line ALL: mic0,mic1 to the /etc/hosts.allow file:

root@host% echo ’ALL: mic0,mic1’ >> /etc/hosts.allow

(e) To apply the changes:

root@host% exportfs -a

(f) Everything is ready on the host for the export. Now we need to configure the coprocessor side to
allow the mounting of the exported NFS share.
It can be done with the micctrl utility:

root@host% micctrl --addnfs=/opt/intel/impi --dir=/opt/intel/impi
root@host% service mpss restart

This instruction is equivalent to the following list of commands:


(g) On the coprocessor, add the following line to the /etc/fstab file:
host:/opt/intel/impi /opt/intel/impi nfs rsize=8192,wsize=8192,nolock,intr 0 0

(h) As root, create the mount folder on the coprocessor, and mount the NFS share.
root@host% ssh mic0
root@mic0% mkdir -p /opt/intel/impi
root@mic0% mount -a
root@mic0% ls /opt/intel/impi

4. You should see the Intel MPI folders mounted from the host, if we succeeded. But the mount will disappear
the next time MPSS restarts. Thus we need to change the MPSS files on the host, to apply those mounting
settings automagically.




root@host% cd /opt/intel/mic/filesystem
root@host% \
% echo ’host:/opt/intel/impi /opt/intel/impi nfs rsize=8192,wsize=8192,nolock,intr 0 0’ \
% >> mic0/etc/fstab
root@host% \
% echo ’host:/opt/intel/impi /opt/intel/impi nfs rsize=8192,wsize=8192,nolock,intr 0 0’ \
% >> mic1/etc/fstab
root@host% mkdir -p mic0/opt/intel/impi
root@host% mkdir -p mic1/opt/intel/impi
root@host% echo ’dir opt 755 0 0’ >> mic0.filelist
root@host% echo ’dir opt/intel/ 755 0 0’ >> mic0.filelist
root@host% echo ’dir opt/intel/impi 755 0 0’ >> mic0.filelist
root@host% echo ’dir opt 755 0 0’ >> mic1.filelist
root@host% echo ’dir opt/intel/ 755 0 0’ >> mic1.filelist
root@host% echo ’dir opt/intel/impi 755 0 0’ >> mic1.filelist





A.2 Exercises for Chapter 2: Programming Models


A.2.1 Compiling and Running Native Intel® Xeon Phi™ Applications
The following practical exercises correspond to the material covered in the Chapter 2: Section 2.1.1
through Section 2.1.4 (pages 37 – 41)

Goal
This practical exercise will show how to compile and link simple source code for native Intel Xeon Phi
coprocessor execution, how to use the micnativeloadex tool for automatic native execution and resolving
library dependencies, and how to monitor activity on the Intel Xeon Phi coprocessor.

Preparation
Before linking and compiling any source code, we need to be sure that the compiler is installed in the

g
system and the environment is set up properly.

an
W
1. In the terminal, execute the following command to check if the Intel C Compiler is installed:

e ng
user@host% which icc
/opt/intel/composer_xe_2013.1.117/bin/intel64/icc
nh
r Yu
2. As was previously described in Section 1.4.3 it is essential to properly set up the environment variables
fo

for the Intel C Compiler and Intel C++ Compiler with the ia32 or intel64 option, which should
d

indicate the host system architecture.


re
pa

user@host% /opt/intel/composerxe/bin/compilervars.sh intel64


re
yP

This script will export environment variables of the compilers, Intel Threading Building Blocks (TBB),
Intel MKL, and others. For convenience's sake, this command line can be added to the .profile or
el

.bash_profile files to be executed automatically. Check your system configuration to see if it is


iv

already present.
us
cl
Ex

Instructions
1. Link and compile the source code hello.c (code Lab B.2.1.1), which can be found in the labs folder:
labs/2/2.1-native/hello.c
user@host% cd ~/labs/2/2.1-native/
user@host% icc -o hello hello.c
user@host% ./hello
Hello world! I have 32 logical cores.

2. Next compile it with the -mmic flag to make it natively executable for the Intel Xeon Phi coprocessor.

user@host% icc -mmic -o hello.MIC hello.c


user@host% ./hello.MIC
-bash: ./hello.MIC: cannot execute binary file

Note that the resultant binary file can only be executed on the Intel Xeon Phi coprocessor. It can not be
executed on the host system, as shown in the listing above.




3. The Intel Xeon Phi coprocessor is an IP-addressable PCIe device running an independent Linux-based µOS, with
an SSH server daemon installed. So, we can use scp to copy the hello.MIC file to the home folder on
the card. Aliases and IP addresses for the devices can be found at the /etc/hosts file on the host.
Connect to the Intel Xeon Phi coprocessor through SSH, and execute the binary file locally (native
execution model):

user@host% scp hello.MIC mic0:~/


user@host% ssh mic0
user@mic0% ./hello.MIC
Hello world! I have 228 logical cores.

4. We will use the micnativeloadex tool next, which can be used to upload a native application and
related dependent libraries to the Intel Xeon Phi coprocessor upon execution.

user@host% micnativeloadex hello.MIC


Hello world! I have 228 logical cores.

g
an
W
5. It also can be used to find detailed information about the binary target and library dependencies:

user@host% micnativeloadex hello.MIC -l


e ng
nh
Yu

Dependency information for hello.MIC


r

Full path was resolved as


fo

/home/user/labs/2/2.1-native/hello.MIC
ed
ar

Binary was built for Intel(R) Xeon Phi(TM) Coprocessor


(codename: Knights Corner) architecture
p
re

SINK_LD_LIBRARY_PATH = /opt/intel/composer_xe_2013.0.079/compiler/lib/mic/:
yP

/opt/intel/mic/filesystem/:/opt/intel/impi/4.1.0/mic/lib/lib:
/opt/intel/impi/4.1.0/mic/bin/
el
iv

Dependencies Found:
us

(none found)
cl
Ex

Dependencies Not Found Locally (but may exist already on the coprocessor):
libm.so.6
libgcc_s.so.1
libc.so.6
libdl.so.2

Note: If the binary file cannot be executed due to missing dependencies, micnativeloadex will
inform you about it. Those missing libraries can be found with locate and added to the libraries path
environment variable (SINK_LD_LIBRARY_PATH). Then micnativeloadex can upload them
automatically.

6. Resolving library dependencies for native execution with micnativeloadex.


Compile hello.c source code with the -mkl and -mmic flags to add links for the Intel MKL libraries
to the Intel Xeon Phi coprocessor native executable. Try to run the resultant binary file natively with
the micnativeloadex tool. You will find the error message: The remote process indicated that the
following libraries could not be loaded: libmkl_intel_lp64.so libmkl_intel_thread.so libmkl_core.so
libiomp5.so Error creating remote process, at least one library dependency is missing. Please check




the list of dependencies below to see which one is missing and update the SINK_LD_LIBRARY_PATH
environment variable to include the missing library.
Question 2.1.a. How would you resolve the missing dependencies by adding the path to those libraries
to the environment variable (SINK_LD_LIBRARY_PATH)?

(a) Run micnativeloadex with the source code compiled for the Intel Xeon Phi coprocessor and
Intel MKL (it should have been compiled with the -mmic and -mkl flags).
You should see the following error:
user@host% micnativeloadex hello.MIC
The remote process indicated that the following libraries could not be loaded:
libmkl_intel_lp64.so libmkl_intel_thread.so libmkl_core.so libiomp5.so
Error creating remote process, at least one library dependency is missing.
Please check the list of dependencies below to see which
one is missing and update the SINK_LD_LIBRARY_PATH
environment variable to include the missing library.
...

g
an
(b) When we compiled our source code with -mkl flag, we explicitly told the compiler to link our

W
binary with Intel MKL libraries:

ng
• libmkl_intel_lp64.so

e
• libmkl_intel_thread.so
nh
Yu
• libmkl_core.so
• libiomp5.so
r
fo

(c) To find the location of corresponding libraries we can use the following command:
d
re

user@host% locate libmkl_core.so|grep mic


pa

/opt/intel/composer_xe_2013.1.117/mkl/lib/mic/libmkl_core.so
user@host% locate libiomp5.so|grep mic
re

/opt/intel/composer_xe_2013.1.117/compiler/lib/mic/libiomp5.so
yP
el

(d) So those locations should be added to the SINK_LD_LIBRARY_PATH path environment variable
iv

separated with a colon (“:").


us
cl

7. Next we will monitor activity on the Intel Xeon Phi coprocessor with the micsmc tool.
Ex

Consider the following source code, which uses pthreads for parallelism (see B.2.1.2).
Note: Threads load the CPU with a series of infinite loops, thus the user will have to terminate the
process manually.
fflush(0) on line 9 ensures that all printf() (line 8) function calls on the Intel Xeon Phi coprocessor
are printed out by flushing the I/O buffers.
This code will spawn as many threads as there are logical cores in the system, which is found using
sysconf(_SC_NPROCESSORS_ONLN). The code was written with C99 standard in mind to keep
local variables within local scopes, like with the for loop at the line 18, thus -std=c99 flag should
be used during the compilation:
user@host% icc -pthread -std=c99 -o donothinger donothinger.c
user@host% ./donothinger
Spawning 2 threads that do nothing, press ^C to terminate.
Hello World from thread #0!
Hello World from thread #1!
...




Question 2.1.b. How do you compile the donothinger.c source code for native Intel Xeon Phi
coprocessor execution and run it with the micnativeloadex tool?
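For orientation, a minimal sketch of such a pthread test program is shown below. This is not the actual lab listing B.2.1.2, so its line numbers do not correspond to the ones referenced above; the function and variable names are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>

void* DoNothing(void* arg) {
  printf("Hello World from thread #%ld!\n", (long)arg);
  fflush(0);                 /* flush proxy I/O so the message is printed */
  while (1) { }              /* keep this logical core busy until the user presses ^C */
  return NULL;
}

int main() {
  const long nThreads = sysconf(_SC_NPROCESSORS_ONLN);
  printf("Spawning %ld threads that do nothing, press ^C to terminate.\n", nThreads);
  pthread_t* threads = (pthread_t*)malloc(nThreads*sizeof(pthread_t));
  for (long i = 0; i < nThreads; i++)
    pthread_create(&threads[i], NULL, DoNothing, (void*)i);
  for (long i = 0; i < nThreads; i++)   /* never returns; terminate manually */
    pthread_join(threads[i], NULL);
  return 0;
}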

8. Use micsmc to monitor the activity of the Intel Xeon Phi coprocessors.

Figure A.1: Screenshot of micsmc tool.



Initially, only combined statistics are available, with the average temperature and the sum over all cores and
memory volume. To see individual statistics per card, press the “Cards” button, highlighted in green in
Figure A.1. The areas highlighted in red indicate additional buttons, which will change the view of
the individual card statistics.

9. Run the donothinger.c code compiled for native execution on the Intel Xeon Phi coprocessor with
micnativeloadex and monitor the activity through the micsmc statistics output.
Question 2.1.c. What should you use to execute the code on the second Intel Xeon Phi coprocessor
with the micnativeloadex tool?

10. Advanced: Study the offloading code B.2.1.3. Compile it and execute on the host with the offload on
different Intel Xeon Phi coprocessors.

Note: Alternatively, to avoid libraries copying or modifying SINK_LD_LIBRARY_PATH environment


variable, we can mount the /opt/intel/ folder as NFS-share, allowing coprocessors access all the libraries
located on the host. Detailed instruction for the Intel MPI NFS-shared folder was previously covered in
Lab A.2.1, Instruction set 3.




Answers

Answer 2.1.a. Write all in one line and substitute your current version of the compiler:
export SINK_LD_LIBRARY_PATH=
$SINK_LD_LIBRARY_PATH:/opt/intel/composerxe/mkl/lib/mic/:
/opt/intel/composer_xe_2013.1.117/compiler/lib/mic/

Answer 2.1.b.
user@host% icc -mmic -pthread -std=c99 -o donothinger donothinger.c
user@host% micnativeloadex donothinger

Answer 2.1.c. If you have several cards available, use micnativeloadex with the -d 1 flag for native
execution on the 2nd coprocessor.

A.2.2 Explicit Offload: Sharing Arrays and Structures

ng
The following practical exercises correspond to the material covered in Section 2.2.1 – Section 2.2.9

e
(pages 45 – 55)
nh
Yu
Goal
r
fo

The explicit offload execution model differs from the native execution model. The main function is
d
re

executed on the host, and part of the code or specified function calls are offloaded to the Intel Xeon Phi
pa

coprocessor. It is a blocked operation. Therefore, the host processor will wait until the offloaded code finishes
re

execution before it continues.


yP
el

Instructions
iv

1. Go to the corresponding lab folder and study the source code for offload.cpp (see B.2.2.2) , which
us

is just counting the number of non-zero elements randomly generated by the host processor.
cl
Ex

user@host% cd ~/labs/2/2.2-explicit-offload/step-00
user@host% make
icpc -c -o "offload.o" "offload.cpp"
icpc -o runme offload.o
user@host% ./runme
There are 893 non-zero elements in the array.

You can use the make command (see B.2.2.1) to compile the object file .o (use the -c flag with the
compiler) and link it to the ./runme executable. You can modify the source code to experiment with
the program. To clean the folder (e.g. delete the executable and the object file produced by the Intel C++
Compiler) just type:

user@host% make clean

2. We want the CountNonZero function to be offloaded to the Intel Xeon Phi coprocessor, as the final
result. But let us start from something simple.




Question 2.2.a. How would you add the printf function call with the message, “Hello World from
MIC!" and offload it to the Intel Xeon Phi coprocessor?

To print out the string from the offloaded printf() function call, proxy console I/O will be used.
Question 2.2.b. What can we do to guarantee that the text will be printed?

Corresponding changes were made to the files in the step-01 subfolder (see B.2.2.3).

3. To get some information about the offloading process, use the OFFLOAD_REPORT environment variable.
Assign “1" for the basic information, and “2" for more detailed report.

user@host% export OFFLOAD_REPORT=2


user@host% make
icpc -c -o "offload.o" "offload.cpp"
icpc -o runme offload.o
user@host% ls

g
Makefile offload.cpp offloadMIC.o offload.o runme

an
user@host% ./runme

W
There are 893 non-zero elements in the array.
Hello from MIC![Offload] [MIC 0] [File] offload.cpp

ng
[Offload] [MIC 0] [Line] 24
[Offload] [MIC 0] [CPU Time] 0.522934 (seconds)
e
nh
[Offload] [MIC 0] [CPU->MIC Data] 0 (bytes)
[Offload] [MIC 0] [MIC Time] 0.000174 (seconds)
Yu

[Offload] [MIC 0] [MIC->CPU Data] 0 (bytes)


r
fo

Note: There are two object files (<name>.o and <name>MIC.o) created. The object file with the
ed

MIC.o ending contains objects for the Intel Xeon Phi coprocessor architecture.
p ar

4. Use the __MIC__ precompiler macros within the #ifdef conditional directive to check if the binary
re

executed on the Intel Xeon Phi coprocessor or fell back and executed on the host processor. Print out
yP

the corresponding line: “Offload is successful!" or “Offload has failed miserably!".


el

Question 2.2.c. How would you implement the above task?


iv
us
cl

Compare your solution with the source code from the subfolder step-02 (see B.2.2.4).
Ex

To check the program behavior with the fall-back scenario, we can explicitly ask the compiler to ignore
all the offload pragmas with the -no-offload compiler flag :

user@host% icpc -o runme -no-offload offload.cpp


offload.cpp(24): warning #161: unrecognized #pragma
#pragma offload target(mic)
^

user@host% ./runme
Offload has failed miserably!
There are 893 non-zero elements in the array.

5. Finally, modify the original code: define the size variable and the data array as local variables within
the main function, and put this segment of code within the #pragma offload directive section,
together with the CountNonZero function call.
Question 2.2.d. What should be specified at the declaration of the offloaded CountNonZero function?




6. Modify the code to initialize all variables on the host and offload only the CountNonZero function
call. Use the #pragma offload_attribute push/pop directive to select all the variables and
the function for offloading.
Compare your solution with the source code from the subfolder step-04 (see B.2.2.6).
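One possible shape of such a solution is sketched below. This is only an illustrative sketch with an assumed array size and random initialization; the actual lab code is in B.2.2.6.

#include <stdio.h>
#include <stdlib.h>

#define N 1000

#pragma offload_attribute(push, target(mic))
int data[N];
int CountNonZero() {
  int n = 0;
  for (int i = 0; i < N; i++)
    if (data[i] != 0) n++;
  return n;
}
#pragma offload_attribute(pop)

int main() {
  for (int i = 0; i < N; i++)
    data[i] = rand() % 2;                 /* initialize on the host */
  int result = 0;
#pragma offload target(mic) in(data) out(result)
  {
    result = CountNonZero();              /* runs on the coprocessor */
  }
  printf("There are %d non-zero elements in the array.\n", result);
  return 0;
}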

Answers

Answer 2.2.a. Use the following:


#pragma offload target(mic)
printf("Hello World from MIC!");

Answer 2.2.b. fflush(0) will flush the Intel Xeon Phi coprocessor’s output buffer.

g
an
Answer 2.2.c. Use the following construction:

W
#ifdef __MIC__

ng
printf("Offload is successful!");
#else

e
nh
printf("Offload has failed miserably!");
#endif
r Yu
fo

Answer 2.2.d. __attribute__((target(mic))) should be used (step-03 or B.2.2.5).


d
re
pa

A.2.3 Explicit Offload: Data Traffic and Asynchronous Offload


re

The following practical exercises correspond to the material covered in Section 2.2.9 – Section 2.2.10
yP

(pages 53 – 59)
el
iv

Goal
us
cl

Additional performance can be gained and memory can be saved through reuse of transferred data and
Ex

asynchronous function calls. The following practical exercises will cover those topics.

Instructions
1. In the previous lab we considered a simple case of the synchronous offload model with the variables
from the local scope (on the stack) transferred from the host memory to the coprocessor and back, after
the calculation is done.
Next we will write a source code, where the globally defined double sum variable will contain the
result of an array summation dynamically allocated on the heap, initialized on the host and passed to the
summation function call on the coprocessor (see B.2.3.1 and B.2.3.2).
Pass those variables with the in/out/inout/nocopy clauses, calculate the sum, and print it out.
If you have any difficulties, you can compare your result with the step-01 subfolder’s source file
offload.cpp (see B.2.3.3).
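One possible shape of such a code is sketched below. It is only an illustrative sketch with assumed array size and values; the step-01 source in B.2.3.3 will differ in details.

#include <stdio.h>
#include <stdlib.h>

__attribute__((target(mic))) double sum;        /* globally defined result */

int main() {
  const int N = 1000000;
  double* data = (double*)malloc(N*sizeof(double));
  for (int i = 0; i < N; i++)
    data[i] = 1.0;                               /* initialize on the host */

#pragma offload target(mic) in(data:length(N)) out(sum)
  {
    sum = 0.0;
    for (int i = 0; i < N; i++)
      sum += data[i];                            /* summation on the coprocessor */
  }

  printf("sum = %f\n", sum);
  free(data);
  return 0;
}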
2. Using the previously written source code, modify the offload pragma in such a way, so that the sum
variable will be not freed after the pragma’s end. Print out its value. Modify its value on the host by




incrementing it by one, and print out the value again. Use offload_transfer pragma to restore
the sum variable value from the coprocessor, and free the allocated memory. (step-02 or B.2.3.4)
Note: Do not forget to specify the number of the Intel Xeon Phi coprocessor card in the target
(mic:N) clause. If you do not, the variable sum might be copied from a different coprocessor, if you
have more than one.

3. Use the signal(p) and wait(p) clauses to implement asynchronous offload execution on the target
and synchronization at the offload_transfer pragma. (See step-03 subfolder or B.2.3.5)
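For orientation, a minimal sketch of the asynchronous pattern follows. It is illustrative only: it synchronizes with #pragma offload_wait, whereas the lab's step-03 code in B.2.3.5 uses an offload_transfer pragma with a wait clause to the same effect, and the array size is an arbitrary assumption.

#include <stdio.h>
#include <stdlib.h>

__attribute__((target(mic))) double sum;

int main() {
  const int N = 1000000;
  double* data = (double*)malloc(N*sizeof(double));
  for (int i = 0; i < N; i++) data[i] = 1.0;

  /* Launch the offload without blocking; 'data' also serves as the signal tag */
#pragma offload target(mic:0) signal(data) in(data:length(N)) out(sum)
  {
    sum = 0.0;
    for (int i = 0; i < N; i++) sum += data[i];
  }

  printf("Host continues working while the coprocessor sums the array...\n");

  /* Block here until the offload tagged with 'data' has completed */
#pragma offload_wait target(mic:0) wait(data)

  printf("sum = %f\n", sum);
  free(data);
  return 0;
}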

A.2.4 Explicit Offload: Putting It All Together


The following practical exercise is based on unique example and uses topics from Section 2.1 – Sec-
tion 2.2, Chapter 2 (pages 37 – 59).

Goal

g
an
A matrix-vector multiplication example is studied in the context of data persistence. Matrix content is

W
transferred beforehand, and used for multiplication with multiple vectors asynchronously.

Instructions
e ng
nh
Yu

1. Develop and run a code that performs serial matrix-vector multiplication on the host. Matrix-vector
multiplication is defined as A·b = c, where A is an [m × n] matrix of double precision numbers, b
is a vector of length n and c is the resultant vector of length m. Do not worry about performance at
fo

is a vector of length n and ~c is the resultant vector of length m. Do not worry about performance at
ed

this point, just design a serial code. The suggested C code of matrix-vector multiplication is shown in
ar

Listing A.1 (matrix A is stored as a one-dimensional array). Allocate all quantities on the stack.
p
re
yP

A[:]=1.0/(double)n;
b[:]=1.0;
c[:]=0;
for (int i=0; i<m; i++)
  for (int j=0; j<n; j++)
    c[i] += A[i*n+j] * b[j];

Listing A.1: Suggested matrix-vector multiplication code.

Or use source code from the step-00 subfolder of the corresponding lab folder (see B.2.4.1 and
B.2.4.2).

2. Modify this code so that matrix A and vector b are initialized on the host, but the calculation is offloaded
to the Intel Xeon Phi coprocessor. Vector c should be returned back to the host and verified against the
expected result. (step-01 or B.2.4.3).

3. Test the maximum value of m*n for which matrix A can be allocated on the stack.

4. Change data allocation: keep vector b on the stack, but matrix A on the heap, and test the maximum
problem size. (step-02 or B.2.4.4)

5. Improve the code so that it performs matrix-vector multiplication for multiple vectors b, but the same
matrix A. Take care to avoid unnecessary transfer of the data of matrix A.




Use the OFFLOAD_REPORT=2 environment variable to see the amount of data being transferred during
the offload:
user@host% export OFFLOAD_REPORT=2

The #pragma offload_transfer and #pragma offload nocopy directive constructions


can be used for the initial copy of the dynamically allocated array A and used without additional data
transfer. (step-03 or B.2.4.5)
Or we can only use the #pragma offload in directive with this small trick: pass the full array A
at the initialization and only one element at the iterations. (see step-04 or B.2.4.6).
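As an illustration of the offload_transfer approach, a compact sketch follows. The matrix size, vector values, and loop count are arbitrary assumptions, and the deallocation mirrors the empty-offload pattern of Listing 4.92; compare with the step-03 code in B.2.4.5.

#include <stdio.h>
#include <stdlib.h>

int main() {
  const int m = 1000, n = 1000, nVectors = 4;
  double* A = (double*)malloc(sizeof(double)*m*n);
  double b[1000], c[1000];
  A[0:m*n] = 1.0/(double)n;

  /* Send matrix A to the coprocessor once and keep it allocated there */
#pragma offload_transfer target(mic:0) in(A:length(m*n) alloc_if(1) free_if(0))

  for (int v = 0; v < nVectors; v++) {
    b[:] = (double)(v+1);
    /* Re-use the copy of A already on the coprocessor; move only b and c */
#pragma offload target(mic:0) nocopy(A:length(m*n) alloc_if(0) free_if(0)) in(b) out(c)
    {
      for (int i = 0; i < m; i++) {
        c[i] = 0.0;
        for (int j = 0; j < n; j++)
          c[i] += A[i*n+j]*b[j];
      }
    }
    printf("vector %d: c[0] = %f (expected %f)\n", v, c[0], (double)(v+1));
  }

  /* Release the copy of A on the coprocessor, then the host buffer */
#pragma offload target(mic:0) nocopy(A:length(m*n) alloc_if(0) free_if(1))
  { }
  free(A);
  return 0;
}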

6. Modify the code so that the data of matrix A is transferred to the coprocessor beforehand (previous step),
and matrix-vector multiplication executed in offload mode asynchronously, while the same calculations
are produced on the host, to be compared later with the coprocessor’s results.
Take a look at our implementation in the step-05 subfolder (see B.2.4.7).

g
an
W
A.2.5 Virtual-Shared Memory Model: Sharing Complex Objects

ng
The following practical exercises correspond to the material covered in Section 2.3 – Section 2.3.5 (pages

e
59 – 65).
nh
Yu
Goal
r
fo

The MYO model allows sharing complex objects (not only bit-wise copyable ones) between the host system and
d
re

Intel Xeon Phi coprocessors. You will be asked to do so for dynamically allocated data, structures, classes,
pa

and objects created with the new operator.


re
yP

Instructions
el

1. Using a simple serial program, which initializes two arrays with a predefined size, adds each of the
iv

corresponding elements, and saves the resulting summation in a third array res of the same size, with
us

the post-processing result check.


cl
Ex

See the serial source code B.2.5.2 and B.2.5.1, at:


labs/2/2.5-sharing-complex-objects/step-00/cilk-shared-offload.cpp
Using the _Cilk_shared and _Cilk_offload keywords, initialize variables on the host system
and pass them with the offloaded add() function for use on the Intel Xeon Phi coprocessor. The offload
function call should use the virtual-shared memory model. For the arrays to be synchronized between
the host and the target coprocessor, they need to be declared with the _Cilk_shared keyword.

2. Compare your version of the modified code with the implementation in the folder step-01 or the
source code at B.2.5.3.
If you have several Intel Xeon Phi coprocessors available, instead of using _Cilk_offload, you can
use _Cilk_offload_to to specify which coprocessors should be used for the offload.

3. Dynamically allocated data can be synchronized before offloading and after in a similar manner. Take
a look at the example in the following location:
labs/2/2.5-sharing-complex-objects/step-03/dynamic-alloc.cpp, (see B.2.5.4)
where the pointer to the dynamically allocated data:




int* data = (int*)malloc(N*sizeof(float));

should be allocated on both the host and the coprocessor at the dynamically synchronized memory area.
Therefore, _Offload_shared_malloc should be used instead of regular malloc. But since the
pointer will be shared as well, it will be declared with the _Cilk_shared keyword.

4. For extra credit, you can try to figure out how to recode the summation using parallel processing in the
ComputeSum() function call, which will be offloaded to the Intel Xeon Phi coprocessor for execution.
You can use the OpenMP reduction mechanism.
Compare your code with one possible solution at step-03/dynamic-alloc.cpp or the source
code at B.2.5.5.
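A serial sketch of the shared allocation and the offloaded summation follows. It is illustrative only: the array size and values are assumptions, and the lab's parallel version in B.2.5.5 performs the summation with an OpenMP reduction inside ComputeSum().

#include <stdio.h>
#include <offload.h>       /* _Offload_shared_malloc, _Offload_shared_free */

int* _Cilk_shared data;    /* the pointer itself lives in virtual-shared memory */

int _Cilk_shared ComputeSum(const int N) {
  int sum = 0;
  for (int i = 0; i < N; i++)
    sum += data[i];
  return sum;
}

int main() {
  const int N = 1000;
  data = (int*)_Offload_shared_malloc(N*sizeof(int));  /* shared allocation */
  for (int i = 0; i < N; i++)
    data[i] = 1;                                        /* initialize on the host */
  int sum = _Cilk_offload ComputeSum(N);                /* runs on the coprocessor */
  printf("sum = %d (expected %d)\n", sum, N);
  _Offload_shared_free(data);
  return 0;
}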

5. Structures can be virtually shared between the host and the coprocessor as well. Take a look in the next
example at step-04/structures.cpp (B.2.5.6)

g
an
1 typedef struct {

W
2 int i;

ng
3 char c[10];
4 } person; e
nh
Yu

Write the code, where the structure above should be virtually shared. And offloaded function call
SetPerson() should change the variables of this structure, which will be printed out later on the
r
fo

host.
ed

Compare your results with step-05/structures.cpp source code (B.2.5.7).


p ar

6. Class Person serial implementation is presented in the source code step-06/classes.cpp (see
re

B.2.5.8).
yP

Use _Cilk_shared keyword for sharing the object of this class in the virtual memory. Make offload
el
iv

call of the class method on the Intel Xeon Phi coprocessor. Function’s arguments should be in the
us

virtual-shared memory as well. Therefore, method declaration will have _Cilk_shared keyword in
cl

front of the parameters.


Ex

Compare your result with the one at the step-07 folder, source code B.2.5.9.

7. To use the standard new operator to create an object in the virtual-shared memory, we need to use a
special implementation of this operator from the new library (#include <new>).
We also need to allocate the corresponding amount of memory in the virtual-shared memory region first,
and pass the pointer as the new() function parameter.
Based on the serial example from step-08/new.cpp (or source code B.2.5.10), try to modify it
for virtual-shared memory model, dynamic object creation with the new operator, and method call
offloaded to the Intel Xeon Phi coprocessor.
To check your result you can use the source code from the step-09 folder, or B.2.5.11.
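A rough sketch of this pattern is shown below. The Person class, its SetAge() method, and the value 42 are illustrative stand-ins, not the lab's actual class; compare with B.2.5.11.

#include <cstdio>
#include <new>            // placement new
#include <offload.h>      // _Offload_shared_malloc

class _Cilk_shared Person {
public:
  int age;
  void SetAge(int a) { age = a; }
};

Person * _Cilk_shared p;  // shared pointer to the shared object

int main() {
  // Reserve space in the virtual-shared region, then construct the object there
  p = new (_Offload_shared_malloc(sizeof(Person))) Person;
  _Cilk_offload p->SetAge(42);      // the method call runs on the coprocessor
  printf("age = %d\n", p->age);     // the updated value is visible on the host
  return 0;
}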

A.2.6 Using Multiple Coprocessors


The following practical exercises correspond to the material covered in Section 2.4 (pages 66 – 73)




Goal
Many parallel algorithms may utilize several computing devices and show almost linear or even super-
linear performance gain, if communication overhead is not significant. This practical exercise will focus on
using multiple Intel Xeon Phi coprocessors within a single computational node.

Instructions
1. We will use several common techniques of offloading a function to multiple Intel Xeon Phi coprocessors.
with Intel Cilk Plus:
Compile and execute some simple C/C++ code shown below, which will print out the number of Intel
Xeon Phi coprocessors available in the system (see B.2.6.1 and B.2.6.2):

#include <stdio.h>
int _Cilk_shared numDevices;
int main(int argc, char *argv[]) {
  numDevices = _Offload_number_of_devices();
  printf("Number of Target devices installed: %d\n\n", numDevices);
}

This code can be found at the following location:
labs/2/2.6-multiple-coprocessors/step-00/multiple.cpp
We used the _Cilk_shared keyword here to make the _Offload_number_of_devices()
function available at the linking stage. Try to delete the keyword _Cilk_shared from the source
code and recompile it again. You will see an error message stating, “undefined reference to
’_Offload_number_of_devices’". Intel Cilk Plus is a language extension, and it will be utilized by the
compiler only if Intel Cilk Plus keywords are used in the source code.
This very simple program prints out the number of available Intel Xeon Phi coprocessors, which we will
be using for the second step, to print out the current device number.

2. If you have two or more coprocessors installed in your system (the program run from the previous step
returned two or more devices), then write an offloaded function call within the for loop. This function
should print out the device number of the coprocessor currently running the offloaded function (use
_Offload_get_device_number()).
To make an offload function call, use the Intel Cilk Plus compiler extension with the corresponding
keywords _Cilk_shared and _Cilk_offload.


3. Compare your source code with the one in the step-01 folder or B.2.6.3. It should be noted, that if
you implement your source code exactly the same way as we did, this approach does not guarantee that
an offloaded function call will use the same Intel Xeon Phi coprocessor each time. To resolve this issue,
the _Cilk_offload_to(int) keyword can be used (instead of the _Cilk_offload keyword),
where the integer – is the zero-based number of the target device.
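For orientation, one possible shape of such a loop is sketched below. It is illustrative only (the function name WhatIsMyNumber is an assumption); compare with the step-01 code in B.2.6.3.

#include <stdio.h>
#include <offload.h>

int _Cilk_shared WhatIsMyNumber() {
  return _Offload_get_device_number();   /* zero-based device number on a coprocessor */
}

int main() {
  const int numDevices = _Offload_number_of_devices();
  for (int i = 0; i < numDevices; i++) {
    /* Pin each offloaded call to a specific coprocessor */
    int num = _Cilk_offload_to(i) WhatIsMyNumber();
    printf("Offload to coprocessor %d reported device number %d\n", i, num);
  }
  return 0;
}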
4. with #pragma offload target(mic:i):
Take a look at the serial implementation of the code B.2.6.4 in the listing:
labs/2/2.6-multiple-coprocessors/step-02/multiple.cpp
Use the offload pragma directive to make a function call on the target device (index i is the zero-based
numbering of the Intel Xeon Phi coprocessors).
The offloaded function call of whatIsMyNumber(), using the #pragma offload directive is
a blocking operation, meaning the main program on the host will wait for the offloaded function to
complete execution and return the result, before it will continue the rest of the for loop.




We can save some data transfer operations and pass only one corresponding element of the response
array to each individual card. This can be done by specifying the slice of the array (Intel Cilk Plus
array notation) – the first element to be passed and the length of the slice, which is only one element
in this particular case. And since the array will be shared between the host system and Intel Xeon Phi
coprocessors, the target attribute should be specified for the response pointer.
Note: if you will get the following error code:

offload error: unexpected number of variable descriptors


offload error: process on the device 0 unexpectedly exited with code 1

It might be due to zero-code elimination (one of the optimization techniques of the Intel C++ Compiler)
within the #ifdef __MIC__ statement. The source code within this statement will not be visible
to the host at compilation, and thus, the compiler will assume that the response array was not
manipulated. Therefore, this variable will be eliminated completely.
To fix this issue we can add an #else branch to the #ifdef condition, and assign a zero value instead,
if the code is executed on the host system. In this case the variable and the manipulations with it will be visible
to the compiler (the host part) and we will not get this error.
Compare your result with the code B.2.6.5 at:
labs/2/2.6-multiple-coprocessors/step-03/multiple.cpp
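One way the fixed version can look is sketched below. It is illustrative only (it assumes at most eight coprocessors and uses 0 as the host-side placeholder value, as suggested in the note above); the lab's own solution is in B.2.6.5.

#include <stdio.h>
#include <offload.h>

int main() {
  const int numDevices = _Offload_number_of_devices();
  int response[8];                       /* up to eight coprocessors per node */

  for (int i = 0; i < numDevices; i++) {
    /* Transfer only the one element this card is responsible for */
#pragma offload target(mic:i) out(response[i:1])
    {
#ifdef __MIC__
      response[i] = _Offload_get_device_number();  /* running on coprocessor i */
#else
      response[i] = 0;   /* host fallback branch keeps 'response' visible to the host compiler */
#endif
    }
    printf("Card %d responded with device number %d\n", i, response[i]);
  }
  return 0;
}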
nh
Yu

A.2.7 Asynchronous Execution on One and Multiple Coprocessors

The following practical exercises correspond to the material covered in Section 2.4 (pages 66 – 73).

Goal

Asynchronous execution allows computations to run in parallel on multiple Intel Xeon Phi coprocessors.

Instructions

1. with Intel Cilk Plus:
Make the source code (B.2.7.1 and B.2.7.2) offload the whatIsMyNumber() function call to the corresponding Intel Xeon Phi coprocessor asynchronously:
labs/2/2.7-asynchronous-offload/step-00/async.cpp

2. If the _Cilk_spawn keyword is used, offloaded function calls are submitted asynchronously, without blocking (see the source code B.2.7.3 in the step-01 folder).
We are not using explicit synchronization (_Cilk_sync) here, since it is not needed: it happens implicitly at the end of the main() function.
Note also that instead of the _Cilk_* keywords we can include the header file <cilk/cilk.h>, which defines macros with simpler naming conventions (cilk_spawn, cilk_sync and cilk_for), described in the Intel C++ Compiler reference manual. Which form to use is your choice.

3. with OpenMP and #pragma offload:


Review the source code B.2.7.4 from the step-02 subfolder.
Use the omp parallel for pragma to specify a parallel loop. There will probably be more OpenMP threads available on the host processor than Intel Xeon Phi coprocessors installed in the system. Each thread runs in parallel and offloads its portion of the code, which we specify with the offload pragma and the in/out/inout/nocopy data manipulation clauses, as well as the target(mic:i) clause, to direct the offloads to different target devices.
Your result can be compared with B.2.7.5 from the step-03 subfolder.
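A possible shape of this step, as a hedged sketch (assumed names; compare against B.2.7.4/B.2.7.5 for the actual listings):

#include <cstdio>
#include <omp.h>
#include <offload.h>

int main() {
  const int numDevices = _Offload_number_of_devices();
  int response[8];   // assume at most 8 coprocessors

  // One host thread per coprocessor; each thread issues a blocking offload
  // to its own device, so the offloads run concurrently.
  #pragma omp parallel for num_threads(numDevices)
  for (int i = 0; i < numDevices; i++) {
    #pragma offload target(mic:i) out(response[i:1])
    {
      response[i] = _Offload_get_device_number();
    }
  }

  for (int i = 0; i < numDevices; i++)
    printf("Device %d answered %d\n", i, response[i]);
  return 0;
}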

4. with #pragma offload signal and #pragma offload_wait synchronization:


Let’s try to write an asynchronous offload to multiple coprocessors with the offload and
offload_wait pragmas. You can use your previous program, or use B.2.7.6 at the following location:
step-04/async.cpp.
Within the offload pragma, use the signal clause to launch the offload asynchronously, and also specify the target and data manipulation clauses.
To synchronize with the offloaded work, a second loop is needed which uses the offload_wait pragma with a wait clause matching each signal.
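A hedged sketch of the signal/offload_wait pattern (assumed names; see B.2.7.6/B.2.7.7 for the book's version):

int response[8];   // assume at most 8 coprocessors
const int numDevices = _Offload_number_of_devices();

// First loop: launch all offloads asynchronously. The signal tag (here the
// address of the result element) identifies each pending offload.
for (int i = 0; i < numDevices; i++) {
  #pragma offload target(mic:i) out(response[i:1]) signal(&response[i])
  {
    response[i] = _Offload_get_device_number();
  }
}

// Second loop: block until each tagged offload has completed.
for (int i = 0; i < numDevices; i++) {
  #pragma offload_wait target(mic:i) wait(&response[i])
}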

Compare your result with B.2.7.7 from step-05/async.cpp.

5. Intel Cilk Plus _Cilk_for and _Cilk_offload

Previously we used an OpenMP parallel for loop to offload the function calls to the Intel Xeon Phi coprocessors simultaneously. The same approach can be used with the Intel Cilk Plus extension of the Intel C++ Compiler.
Use the _Cilk_shared and _Cilk_offload keywords where needed. Memory allocation for shared arrays should be done with the _Offload_shared_malloc() function call. Instead of a regular for loop or an OpenMP parallel for, _Cilk_for can be used, which lets the iterations run concurrently.
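A sketch of this combination under assumed names (the exact pointer declarations in B.2.7.8/B.2.7.9 may differ):

// Shared function, as in the earlier steps.
_Cilk_shared int whatIsMyNumber() { return _Offload_get_device_number(); }

// ... inside main():
const int numDevices = _Offload_number_of_devices();

// Memory visible to the host and to all coprocessors (virtual shared memory).
_Cilk_shared int *response =
    (_Cilk_shared int *)_Offload_shared_malloc(numDevices * sizeof(int));

// _Cilk_for lets the offloads to different devices proceed concurrently.
_Cilk_for (int i = 0; i < numDevices; i++) {
  response[i] = _Cilk_offload_to(i) whatIsMyNumber();
}

_Offload_shared_free(response);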
Use your previous code, or B.2.7.8 at step-06/async.cpp. Afterwards, compare your results with B.2.7.9 from step-07/async.cpp.



6. Intel Cilk Plus _Cilk_spawn and _Cilk_sync



An asynchronous offload with Intel Cilk Plus is covered next. You can start with the code B.2.7.10 in step-08/async.cpp and add the _Cilk_offload keyword to the Respond() function call. In addition, add the _Cilk_spawn keyword to make this offload asynchronous, and add the _Cilk_sync keyword for synchronization between the threads.
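A sketch of the resulting structure (hedged: Respond() and its signature are assumptions based on the description above):

// Launch one asynchronous offload per coprocessor; _Cilk_spawn returns
// immediately, so the offloads overlap in time.
for (int i = 0; i < numDevices; i++) {
  _Cilk_spawn _Cilk_offload_to(i) Respond(i);
}
_Cilk_sync;   // wait here until every spawned offload has finished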
Don’t forget to compare your result with B.2.7.11 in step-09/async.cpp.

A.2.8 Using MPI for Multiple Coprocessors


The following practical exercise shows Intel MPI execution on the host and the Intel Xeon Phi coproces-
sors, as described in Section 2.4.3 (pages 73 – 77).

Goal
Message Passing Interface (MPI) makes it possible to organize heterogeneous parallelism between several Intel Xeon Phi coprocessors and the host within a single computational node, as well as between several computers and multiple coprocessors.


Preparation
1. To use Intel MPI on the Intel Xeon Phi coprocessors, the corresponding libraries and binary files need to be copied, or otherwise made available, to the target devices. You can copy the files onto the target devices, or you can NFS-mount the host Intel MPI folder to allow the coprocessors to access them.
To NFS-mount the Intel MPI folder, follow the instructions in Section 1.5.4 on page 35, or Lab A.1.2, Instruction set 3. Modify the corresponding files to use the NFS share on the appropriate number of Intel Xeon Phi coprocessors.
2. Later in this practical exercise we will assign MPI jobs to multiple Intel Xeon Phi coprocessors. To enable communication between those devices, we also need to enable peer-to-peer communication between them. Follow the instructions in Section 2.4.3 on page 76 to do so.

Instructions

1. Study the simple MPI "Hello World!" makefile B.2.8.1 and the source code B.2.8.2 at the following location:
labs/2/2.8-MPI/step-00/HelloMPI.c
To compile it, we will need the mpiicc or mpiicpc compiler wrappers installed with Intel MPI. We will also use the -mmic Intel C++ Compiler flag to compile the Intel Xeon Phi binary HelloMPI.MIC for native execution.
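For reference, a minimal program in the spirit of B.2.8.2 (our own sketch, not the book's listing) looks like this:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
  int rank, size, len;
  char name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process' rank
  MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of MPI processes
  MPI_Get_processor_name(name, &len);     // "host", "mic0", "mic1", ...

  printf("Hello World from rank %d running on %s!\n", rank, name);
  if (rank == 0) printf("MPI World size = %d processes\n", size);

  MPI_Finalize();
  return 0;
}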

2. First, we will execute this "Hello World!" application on the host system:

user@host% mpirun -n 4 ./HelloMPI
Hello World from rank 2 running on host!
Hello World from rank 3 running on host!
Hello World from rank 1 running on host!
Hello World from rank 0 running on host!
MPI World size = 4 processes

3. Copy the HelloMPI.MIC file to the home folder on the Intel Xeon Phi coprocessors. Since coprocessors are IP-addressable devices and have SSH servers running on them, you can use scp to copy the files.
Question 2.8.a. What environment variable enables Intel Xeon Phi coprocessor support in Intel MPI applications?

4. In a similar manner as we executed the code on the host, we can run it on an Intel Xeon Phi coprocessor by specifying the -host flag.
Question 2.8.b. What command should we use on the host to execute the Intel MPI code on one of the Intel Xeon Phi coprocessors?

5. Intel MPI processes can be assigned to multiple Intel Xeon Phi coprocessors, and even the host system (heterogeneous model).
Question 2.8.c. What is the format of an explicit multiple-host assignment for an Intel MPI program?

6. For large clusters with many hosts and coprocessors it might not be convenient to specify all the hostnames separated with the ":" symbol. Instead, we can put all the hostnames in a text file and use that file.
Question 2.8.d. What parameters can we use to specify hostnames and mapping for an Intel MPI program?


Answers

Answer 2.8.a.
user@host% export I_MPI_MIC=1

Answer 2.8.b.
user@host% mpirun -host mic0 -n 2 ~/HelloMPI.MIC
Hello World from rank 1 running on mic0!
Hello World from rank 0 running on mic0!
MPI World size = 2 processes

Answer 2.8.c.

user@host% mpirun -host hostmic0 -n 2 ./HelloMPI : -host mic0 -n 2 \
% ~/HelloMPI.MIC : -host mic1 -n 2 ~/HelloMPI.MIC
Hello World from rank 0 running on host!
MPI World size = 6 processes
Hello World from rank 1 running on host!
Hello World from rank 4 running on mic1!
Hello World from rank 2 running on mic0!
Hello World from rank 3 running on mic0!
Hello World from rank 5 running on mic1!

Note: Spaces around the colon symbol ":" are very important. Without them, the colon would be considered part of the executable name.


Answer 2.8.d.

-f {filename} | -hostfile {filename}            file containing the host names
-hosts {host list}                              comma delimited host list
-machine {filename} | -machinefile {filename}   file mapping procs to machines

For instance, using file B.2.8.3 with the hostnames:

user@host% mpirun -f hosts -n 10 ~/HelloMPI.MIC


A.3 Exercises for Chapter 3: Expressing Parallelism


A.3.1 Automatic Vectorization: Compiler Pragmas and Vectorization Report
The following practical exercises correspond to the material covered in Section 3.1 (pages 77 – 94).

Goal
In the following practical exercises, we will use the Intel C++ Compiler automatic vectorization feature.

Instructions
1. The automatic vectorization feature of the Intel C++ Compiler allows it to recognize operations that can be applied to multiple data elements simultaneously, and thus speed up the computations by exploiting the vector registers of Intel Xeon processors or Intel Xeon Phi coprocessors.
Question 3.1.a. What compiler flag allows you to turn on the explicit output of the Intel C++ Compiler automatic vectorization log?

2. Starting from the serial code B.3.1.1, which sums two arrays together, located at labs/3/3.1-vectorization/step-00/vectorization.cpp, see if the code is automatically vectorized by the Intel C++ Compiler.
Question 3.1.b. How can we find out that the Intel C++ Compiler automatically vectorized a specific loop successfully?

Additional instructions and pragmas will let the compiler auto-vectorize the code more effectively. In the code shown above, use the align attribute for the arrays' alignment.
In the main summation loop, use explicit and implicit Intel Cilk Plus array notation. Check if the loop is still vectorized automatically.
Compare your result with the source code B.3.1.2 in the step-01 subfolder.

3. Next, use dynamic memory allocation for the arrays A[:] and B[:], and explicit Intel Cilk Plus array notation for the summation loop.
Question 3.1.c. Why do you think implicit Intel Cilk Plus array notation will raise compilation errors for dynamically allocated arrays?
4. If the align attribute is not specified, the array is placed at an arbitrary address in memory and can have a random offset. Using the source code from the previous steps, add the calculation of the offsets of the arrays A[:] and B[:] relative to some alignment constant, for instance, const int al=64;
Question 3.1.d. Can you express mathematically and implement in C/C++ the offset calculation for some pointer address relative to a constant al – the size of the memory block?

For the dynamically allocated arrays A[:] and B[:] with standard malloc() function calls, write a
program to calculate the offset for some alignment constant al, and print out those values.
Compare your implementation with B.3.1.4 located in the step-03 subfolder.

5. In the previous step, we calculated the offset relative to some alignment constant al. If the align attribute is not supported by the compiler, we can align an array or a variable manually by allocating slightly more memory (sizeof(array)+al-1) and shifting the address by the offset to get the alignment.
Implement this algorithm of dynamic memory allocation and shifting by the offset to obtain alignment. Remember to free the initial pointer, rather than the aligned one. Compare your result with B.3.1.5 from the subfolder step-04.
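A compact sketch of the manual alignment technique (array size and type are arbitrary here):

#include <cstdlib>

const int al = 64;                    // desired alignment in bytes
const size_t n = 1024;
const size_t bytes = n * sizeof(float);

// Over-allocate by al-1 bytes, then shift to the next al-byte boundary.
char* raw = (char*)malloc(bytes + al - 1);
const size_t offset = (al - ((size_t)raw % al)) % al;
float* A = (float*)(raw + offset);    // aligned view of the same buffer

// ... use A[0..n-1] in the vectorized loop ...

free(raw);                            // free the original pointer, not A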
6. To simplify the alignment of dynamically allocated memory, we can use the Intel C++ Compiler’s
intrinsic functions to allocate and free aligned blocks of memory:

void* _mm_malloc(int size, int alignment);
void _mm_free(void *p);

Note: Memory that is allocated using _mm_malloc must be freed using _mm_free. Calling free
on memory allocated with _mm_malloc or calling _mm_free on memory allocated with malloc
will cause unpredictable behavior.
Use these intrinsic function calls to allocate memory blocks for two arrays, as in the previous steps.

Compare your results with B.3.1.6 from step-05.
As an additional exercise, combine the offset calculation with the intrinsic memory allocation to show that the allocated memory is indeed aligned.

7. Intrinsics, as described in the reference manual:

Intrinsics are functions coded in assembly language that allow you to use C++ function calls and variables in place of assembly instructions.
Intrinsics are expanded inline, eliminating function call overhead. Providing the same benefit as using inline assembly, intrinsics improve code readability, assist instruction scheduling, and help reduce debugging.
Intrinsics provide access to instructions that cannot be generated using the standard constructs of the C and C++ languages.

Intrinsic function calls provide fine-tuned, direct access to vector registers and related instructions. However, they hardwire the code to a specific architecture and its feature set, which is generally a bad idea; the preferable approach is to let the compiler take care of those details.
For educational purposes, try to implement the summation of two arrays using intrinsic functions for vector summation. Compile the code for native execution on the Intel Xeon Phi coprocessor. Compare it with the B.3.1.7 source code from the step-06 subfolder.
8. Scalar function calls can be vectorized automatically by the compiler through inlining at the compilation stage, if the function body is in the same source file as the loop that calls it.
Write the scalar summation of two integers, int my_simple_add(int x1, int x2){ return x1+x2; }, and call it within a for loop iterating over the elements of arrays A and B. Compile it and make sure that the for loop was vectorized (see B.3.1.8 from step-07).
9. The next step is to move the my_simple_add function to a separate file (worker.cpp) and leave the rest of the code in main.cpp. This code will not be vectorized, since at compilation time the Intel C++ Compiler creates the object files separately and does not inline the function call (see B.3.1.9 and B.3.1.10 from step-08).
10. Elemental functions are a general language construct for expressing a data-parallel algorithm. If (as in the previous step) the function body is located in a separate file, or the function is part of an external library, the calling loop can still be vectorized automatically by applying __attribute__((vector)) to the function.
Placing #pragma simd before the for loop assures the Intel C++ Compiler that the loop can be safely vectorized.
Apply those changes to the previous example and compile it to make sure that the compiler indeed auto-vectorized the corresponding regions of code in worker.cpp and main.cpp. Compare your source code with B.3.1.11 and B.3.1.12 from the step-09 subfolder.
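A sketch of the two files (names follow the exercise; the actual listings B.3.1.11/B.3.1.12 may differ):

// worker.cpp: the vector attribute asks the compiler to also generate a
// short-vector (SIMD) version of this externally defined function.
__attribute__((vector)) int my_simple_add(const int x1, const int x2) {
  return x1 + x2;
}

// main.cpp: declare the elemental function and assert that the loop is safe
// to vectorize even though the function body lives in another file.
__attribute__((vector)) int my_simple_add(const int x1, const int x2);

void sum_arrays(int* A, const int* B, const int n) {
  #pragma simd
  for (int i = 0; i < n; i++)
    A[i] = my_simple_add(A[i], B[i]);
}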

11. In many cases the program developer knows more about data organization and access patterns than the compiler. Therefore, additional instructions and pragmas passed to the compiler will help it optimize the code much better. But it is the developer's responsibility to provide correct information.
In the next example we show what might happen if the ignore-vector-dependency pragma #pragma ivdep is used where it should not be, i.e., where an actual vector dependency exists.
Take a look at the source code files B.3.1.13 and B.3.1.14. #pragma ivdep in worker.cpp tells the compiler that we guarantee that the integer references passed as arguments to the function my_simple_add and used within the for loop are independent. But in main.cpp we call this function for n − 1 elements with references pointing to B and B + 1 – the next element of the array.
The result is unpredictable and most likely wrong:

user@host% ./runme
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 4
6 6 5
7 7 6

while the whole array B should contain only zeros: the first element is zero, and each next element is assigned the value of the previous one.
12. The restrict keyword can be used in a similar manner as #pragma ivdep, but for individual pointers. For those pointers the developer guarantees mutual independence, and therefore loops over those pointer variables will be successfully vectorized.
The compilation flag -restrict should be used to enable the keyword.
Using the previous example, modify the code to use the restrict keyword. Compare your results with B.3.1.16 and B.3.1.17 from the step-0b subfolder and Makefile B.3.1.15.
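A minimal sketch of the restrict variant (compile with -restrict):

// restrict promises the compiler that A and B never alias each other,
// so the loop can be vectorized without a runtime dependence check.
void add_arrays(int* restrict A, const int* restrict B, const int n) {
  for (int i = 0; i < n; i++)
    A[i] += B[i];
}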

13. The _Cilk_for keyword (and the cilk_for macro from cilk/cilk.h) indicates that a for loop's iterations can be executed independently in parallel, and, moreover, the loop is considered a candidate for auto-vectorization. Therefore, two levels of parallelism can be applied to a loop marked with the Intel Cilk Plus keyword: data and thread parallelism.
Implement source code with an Intel Cilk Plus for loop over 256 elements adding the array elements B[:] to the array elements A[:]. Compare your results with B.3.1.18 from the step-0c subfolder.

Answers

Answer 3.1.a. -vec-report3 and -vec-report6 (for more verbose output) will show the automatic
vectorization log of the compiler.


Answer 3.1.b. With the -vec-report3 flag, the Intel C++ Compiler will include the following line
among the rest of its output:
vectorization.cpp(14): (col. 3) remark: LOOP WAS VECTORIZED.

which is an obvious indication that the loop was vectorized.

Answer 3.1.c. The compiler does not know the size of the array in dynamically allocated memory at compilation time. Therefore, you will get the following error:
array section length must be specified for incomplete array types.

Answer 3.1.d.
const int offset = (al - ((size_t)A % al)) % al;

A.3.2 Parallelism with OpenMP: Shared and Private Variables, Reduction

The following practical exercises correspond to the material covered in Section 3.2 (pages 94 – 122).

Goal

OpenMP (Open Multi-Processing) is one of the most widely used parallel programming Application Programming Interfaces (APIs); it supports multi-platform shared-memory multiprocessing.

Instructions

1. Write C++ source code for a simple OpenMP program which prints out the total number of OpenMP threads and, in each fork-join branch, prints "Hello world from thread %d" with a printf() function call.
Question 3.2.a. What OpenMP function will return the total number of available threads?
Question 3.2.b. What environment variable can control the total number of OpenMP threads?

Question 3.2.c. Multiple OpenMP threads can be used to run a code region in parallel. What pragma
can we use to do that?

Question 3.2.d. What OpenMP function will return the current thread number?

Compare your result with B.3.2.1 from labs/3/3.2-OpenMP/step-00/openmp.cpp

2. Using OpenMP, write a program which runs an OpenMP parallel for loop with the total number of iterations equal to the maximum number of OpenMP threads available on the system, and prints the iteration number and the current thread number.

Question 3.2.e. What pragma should be used for OpenMP for loop?

Compare your result with B.3.2.2 source code from step-01 subfolder.


3. Variable visibility in an OpenMP program depends on the location where the variables are defined. Write an OpenMP program where a constant variable nt is initialized with the maximum number of OpenMP threads and is available to all of those threads. A private integer private_number should be independent in each parallel region. Using an OpenMP for loop, print out the current thread number and the value of the private variable after incrementing it by one, to make sure that it is indeed private for each thread.
Compare your result with B.3.2.3 from the step-02 subfolder.
Question 3.2.f. The OpenMP parallel region pragma creates the available number of threads. What pragma will distribute the iterations of a for loop between those threads without creating nested parallelism?

4. The maximum number of OpenMP threads can be controlled by the environment variable OMP_NUM_THREADS. Change this variable and run the already compiled program from the previous step:

user@host$ export OMP_NUM_THREADS=2
user@host$ ./runme
OpenMP with 2 threads
Hello World from thread 0 (private_number = 1)
Hello World from thread 1 (private_number = 1)
5. The number of threads in any OpenMP parallel region can also be controlled with the corresponding clause parameter. The OpenMP for loop's scheduling mechanism can be specified through clauses as well.
Modify the source code from the previous step to specify the guided scheduling mechanism for the for loop, and also specify the number of threads needed for the OpenMP parallel region and the for loop.
Compare your result with B.3.2.4 from the step-03 subfolder.


p ar
re

6. Recursive algorithms can be parallelized as well by using OpenMP task pragma. Take a look at the
yP

implementation of recursive parallel function call from B.3.2.5 (step-04).


el

Question 3.2.g. Why did we use #pragma omp parallel and #pragma omp single for the
iv

initial recursive function call?


7. The scope of variables can be controlled with the OpenMP parallel clauses private/shared/firstprivate.
Create a program with three variables and control their behavior with the clauses mentioned above. Check what values are assigned to them within the parallel region and how they react to modifications of their values. Source code B.3.2.6 from step-05 shows a race condition for the shared variable varShared and the use of private and firstprivate variables.

8. Probably the most common mistake in implementing parallel algorithms is creating race conditions, where shared variables are read and written by different threads at the same time. Write code with an OpenMP parallel for loop over the first N = 1000 numbers added together in a shared variable sum. The correct value should be $\sum_{i=0}^{N-1} i = N(N-1)/2$. Note: the upper boundary is N − 1, since the for loop has the "i < N;" exit condition. Print out the resulting value of sum and the expected value of the summation. Compare your code with B.3.2.7 from step-06.

9. There are several ways to fix race conditions in parallel codes. One of them is applying #pragma omp critical to the region of code where the race condition occurs. Modify your code to fix the summation problem.
It should be noted that only one thread at a time executes the region of code marked with the critical pragma. Therefore, the parallel code we created technically becomes serial, since only one thread executes it at a time.
Compare your result with B.3.2.8 from the step-07 subfolder.

10. Some scalar operations can be marked with the atomic pragma, which ensures that a specific memory location is updated atomically, preventing simultaneous reads and writes by multiple threads.
In the previous code, add #pragma omp atomic before the summation inside the OpenMP for loop. Execute the compiled program and compare the result with the expected value.
B.3.2.9 shows how the atomic pragma can be applied (step-08).

11. Another common approach is to have private variables collect partial sums and then add them together to get the final answer.
Implement this idea by using two OpenMP task pragmas and shared variables sum1 and sum2, each accessed only by its corresponding task. Use the taskwait pragma to synchronize the tasks.
You can compare your implementation with B.3.2.10 from the step-09 subfolder.

e ng
nh
12. For the highly parallel systems it is better to write parallel region in a way, that OpenMP will split the
work between the available threads automatically. Use the similar approach as in previous step – collect
Yu
the temporary summation result in the private variables. After the OpenMP for loop (but still within the
r
fo

OpenMP parallel region collect the values from those private variables into the shared variable sum,
and to avoid racing conditions use #pragma omp critical (B.3.2.11 from step-0a subfolder).
d
re
pa

13. Reduction is a clause of the OpenMP for loop which indicates what operation is applied to which reduction variable. OpenMP automatically takes care of avoiding race conditions and producing the correct result.
Implement the summation by specifying the reduction clause for sum. Compare your result with B.3.2.12 from the step-0b subfolder.
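A minimal sketch of the reduction version of the summation exercise:

#include <cstdio>
#include <omp.h>

int main() {
  const int N = 1000;
  int sum = 0;
  // Each thread gets a private copy of sum; OpenMP combines them at the end.
  #pragma omp parallel for reduction(+: sum)
  for (int i = 0; i < N; i++)
    sum += i;
  printf("sum = %d, expected = %d\n", sum, N*(N-1)/2);
  return 0;
}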



Answers

Answer 3.2.a. int omp_get_max_threads(); from omp.h will provide the maximum number of OpenMP threads available.

Answer 3.2.b. OMP_NUM_THREADS defines how many OpenMP threads will be created if this number is not specified with the corresponding OpenMP function calls.

Answer 3.2.c. #pragma omp parallel will run the code following it with the maximum number of OpenMP threads available on the system.

Answer 3.2.d. int omp_get_thread_num(); from omp.h will provide the current thread number.

Answer 3.2.e.

#pragma omp parallel for
for (int i = 0; i < omp_get_max_threads(); i++) { ... }

Answer 3.2.f.

#pragma omp parallel
#pragma omp for
for (int i = 0; i < N; i++) { ... }

Note: If #pragma omp parallel for is used instead, it creates nested parallelism, which is not desired in our case.

Answer 3.2.g. Without #pragma omp single the maximum number of OpenMP threads will be created, and all of them will execute the initial recursive function call Recurse(0);, which is not the desired behavior.

A.3.3 Complex Algorithms with Intel Cilk Plus: Recursive Divide-and-Conquer

The following practical exercises correspond to the material covered in Section 3.2 (pages 94 – 122).
Goal

The instructions listed below will help you get familiar with Intel Cilk Plus, an extension to the C and C++ programming languages designed for multi-threaded parallel computing.

Instructions

1. In the following practical exercise you will write code that uses the Intel Cilk Plus parallelism model. Print out the total number of Intel Cilk Plus workers available on the system. Use _Cilk_for to iterate over the number of available workers and print out the current worker number. Since the workload is very light, all iterations may be done by only one worker (the for loop gets serialized). Therefore, we need to add extra workload to the for loop to see Intel Cilk Plus parallelism. Within the for loop, write an additional while loop adding or multiplying some numbers into a private variable. At the end, print out the current worker number doing those calculations, and print the final result to avoid zero-code elimination.

Question 3.3.a. What environment variable controls the number of Intel Cilk Plus workers?

Compare your result with B.3.3.1 located at labs/3/3.2-Cilk-Plus/step-00/cilk.cpp.

2. In the previous step we used a _Cilk_for loop with the total number of iterations equal to the total number of Intel Cilk Plus workers. If the workload is significant, each worker should be involved in the calculations exactly once: Intel Cilk Plus operates on a "hungry workers" work distribution model. The number of iterations given to each worker can be controlled with the #pragma cilk grainsize N pragma.
Modify the previous code to grant 4 iteration steps to each worker. Make sure that only a quarter of the total number of workers is involved in the calculations this time.
You can compare your result with B.3.3.2 from step-01.


3. Using _Cilk_spawn for asynchronous parallelism, we can run recursive tasks in parallel.
Write a program which recursively calls some function Recurse(const int task); and prints out the number of the current worker doing calculations within the function.
Compare your results with B.3.3.3 from step-02.
4. Quite often we need synchronization between parallel tasks. Intel Cilk Plus has the _Cilk_sync keyword for this.
Write a program where an array of 1000 dynamically allocated consecutive integer elements is summed up by two parallel (_Cilk_spawn) calls to a function Sum() over the two halves of the array; synchronize and print the result with a printf() statement.
Compare your result with B.3.3.4 from step-03.
5. A more elegant way to organize parallelism is to avoid hardwiring the number of parallel tasks and let Intel Cilk Plus take care of this automatically.
To prevent race conditions we will need to use reducers in Intel Cilk Plus, defined as cilk::reducer_opadd<int> sum from <cilk/reducer_opadd.h>. Access to the reducer sum is done through the sum.set_value(N) and sum.get_value() calls.
Write a program which uses the reducer sum to store the result of adding 20 consecutive integers iterated over with a _Cilk_for loop, and prints out the final result. Compare your code with B.3.3.5 from the step-04 subfolder.
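A minimal sketch of the reducer version (the book's B.3.3.5 may differ in details):

#include <cstdio>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

int main() {
  cilk::reducer_opadd<int> sum;
  sum.set_value(0);
  // The reducer behaves like a race-free accumulator inside _Cilk_for.
  _Cilk_for (int i = 1; i <= 20; i++)
    sum += i;
  printf("sum = %d (expected 210)\n", sum.get_value());
  return 0;
}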
6. The maximum number of Intel Cilk Plus workers can be controlled not only by an environment variable, but also via a function call within the program.
Question 3.3.b. What function can we use to change the maximum number of Intel Cilk Plus workers?
Consider the following source code B.3.3.6 in the step-05 subfolder. Class Scratch has a public attribute, the array data, with many elements, which makes it quite expensive to construct objects of this class. With the current implementation, an object scratch of class Scratch is constructed by every Intel Cilk Plus worker on every iteration of the _Cilk_for loop:

user@host% ./runme
Constructor called by worker 0
Constructor called by worker 1
i=0, worker=0, sum=0
Constructor called by worker 0
i=5, worker=1, sum=500000
Constructor called by worker 1
i=1, worker=0, sum=100000
Constructor called by worker 0
i=6, worker=1, sum=600000
Constructor called by worker 1
i=2, worker=0, sum=200000
Constructor called by worker 0
i=7, worker=1, sum=700000
Constructor called by worker 1
i=3, worker=0, sum=300000
Constructor called by worker 0
i=8, worker=1, sum=800000
Constructor called by worker 1
i=4, worker=0, sum=400000
i=9, worker=1, sum=900000


By using cilk::holder<Scratch> scratch we can decrease the overhead of constructing new objects. Only one object is created per worker and is preserved across the iterations assigned to that worker:

user@host$ ./runme
Constructor called by worker 0
Constructor called by worker 1
i=5, worker=1, sum=5000000
i=0, worker=0, sum=0
i=6, worker=1, sum=6000000
i=1, worker=0, sum=1000000
i=7, worker=1, sum=7000000
i=2, worker=0, sum=2000000
i=8, worker=1, sum=8000000
i=3, worker=0, sum=3000000
i=9, worker=1, sum=9000000
i=4, worker=0, sum=4000000

Modify the source code to use Intel Cilk Plus holders. Compare your result with B.3.3.7 from the step-06 subfolder.

Answers

Answer 3.3.a. CILK_NWORKERS controls the number of Intel Cilk Plus workers.

Answer 3.3.b. __cilkrts_set_param("nworkers", "2"); will set the maximum number of Intel Cilk Plus workers to 2.


A.3.4 Data Traffic with MPI

The next practical exercise corresponds to the material covered in Section 3.3 (pages 122 – 138).

Goal

This exercise will show you practical aspects of heterogeneous execution of a parallel application in distributed memory with MPI, and provides the basis for clustering Intel Xeon Phi coprocessors.

Instructions
1. Write a simple Intel MPI “Hello World!" program: find the rank, world size, and name of the host
running the code. Print out this information with only rank 0 printing out the total number of MPI
processes (world size).
The source code B.3.4.2 and the corresponding Makefile B.3.4.1 can be found in the labs/3/3.4-MPI/step-00/ folder.
Initialize MPI support for Intel Xeon Phi coprocessors.

user@host% export I_MPI_MIC=1

Question 3.4.a. What command would you use to run compiled code manually on the host and two
Intel Xeon Phi coprocessors, with two MPI processes per host?


2. Communication between MPI processes can be organized with MPI_Send and MPI_Recv function calls.
Write a program based on the source code from the previous step, where all ranks except the master process send their rank and the node name. This information should be collected by the master process and printed out.
To control proper communication between MPI processes we can specify from which rank we expect the message. But we can also specify the tag number, which can be used for ordering control, etc.
A message can be received by a receive operation only if it is addressed to the receiving process, and
if its source, tag, and communicator (comm) values match the source, tag, and comm values specified
by the receive operation. The receive operation may specify a wildcard value for source and/or tag,
indicating that any source and/or tag are acceptable. The wildcard value for source is source =
MPI_ANY_SOURCE. The wildcard value for tag is tag = MPI_ANY_TAG. There is no wildcard
value for comm. The scope of these wildcards is limited to the processes in the group of the specified
communicator.

Note the asymmetry between send and receive operations: a receive operation may accept messages from an arbitrary sender; on the other hand, a send operation must specify a unique receiver. This matches a "push" communication mechanism, where data transfer is effected by the sender (rather than a "pull" mechanism, where data transfer is effected by the receiver).
If you specified the source as MPI_ANY_SOURCE and controlled the message sequencing by the tag, then change your code to specify the rank number of the sender/receiver; and vice versa if you used the rank to control the order.
Compare your result with the source code B.3.4.3 from the step-01 subfolder.
3. Write a program with user-provided buffering communication: use MPI_Bsend and a regular MPI receive to pass two arrays of floats and doubles.
Use even ranks as senders and odd ranks as receivers. For each sender/receiver pair use a unique tag number, for instance:
ranks – tag
0, 1 – 0
2, 3 – 1
4, 5 – 2
Compare your result with B.3.4.4 from the step-02 subfolder.


4. Write a program with a non-blocking send of an array using MPI_Isend, do some additional work in the meantime, and synchronize with MPI_Wait.
Compare your result with B.3.4.5 from the step-03 subfolder.
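A hedged sketch of the non-blocking pattern (ours, not the book's B.3.4.5):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  float data[1024];
  if (rank == 0) {
    for (int i = 0; i < 1024; i++) data[i] = (float)i;
    MPI_Request request;
    MPI_Isend(data, 1024, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &request);
    double busy = 0.0;                      // additional workload overlapping the transfer
    for (int i = 0; i < 1000000; i++) busy += i * 0.5;
    MPI_Status status;
    MPI_Wait(&request, &status);            // the buffer may be reused only after this returns
    printf("Rank 0: send complete (busy = %f)\n", busy);
  } else if (rank == 1) {
    MPI_Recv(data, 1024, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Rank 1: received %f ... %f\n", data[0], data[1023]);
  }
  MPI_Finalize();
  return 0;
}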
5. Using MPI_Scatter, share parts of the two-dimensional array sendbuf[SIZE][SIZE] between SIZE MPI processes.
Compare your result with B.3.4.6 from the step-04 subfolder.
6. Next you will write source code with MPI_Allreduce. For some optimization techniques, especially within the heterogeneous computation model, it is essential to know the number of MPI processes running on the Intel Xeon Phi coprocessors and on the host system.
Write code which prints out the number of MPI processes running on the host and on the coprocessors.


Question 3.4.b. What should we use to check whether an MPI process is currently running on an Intel Xeon Phi coprocessor?

Compare your result with B.3.4.7 from step-05 subfolder.


http://www.linux-mag.com/id/7210/

Answers

Answer 3.4.a.
user@host% export I_MPI_MIC=1
user@host% mpirun -host hostmic0 -n 2 ./runme-mpi : \
> -host mic0 -n 2 ~/runme-mpi.MIC : \
> -host mic1 -n 2 ~/runme-mpi.MIC

Answer 3.4.b.

#ifdef __MIC__
  mic++;
#else
  host++;
#endif

A.4 Exercises for Chapter 4: Optimizing Applications


A.4.1 Using Intel VTune Amplifier XE
The following practical exercises correspond to the material covered in Chapter 4 (pages 139 – 257)

Goal
Intel VTune Amplifier XE is a commercial application for software performance analysis on 32-bit and 64-bit x86-based machines, with advanced hardware-based sampling of Intel-manufactured CPUs and coprocessors.

Instructions
In this lab, we will walk through the workflow for application performance analysis in the Intel VTune Amplifier XE tool. VTune is an application performance profiling tool that relies on hardware event sampling. Some optimization examples in Chapter 4 demonstrate analysis in Intel VTune Amplifier XE, relying on the procedures described in this lab.

1. First, let us compile the applications that will be used for profiling in VTune. Navigate to the directory labs/4/4.1-vtune, enter each subdirectory in it, and run make. As you could guess from the names of the source files, we will have one application that runs on the host system, one that performs offload to an Intel Xeon Phi coprocessor, and one that runs natively on a coprocessor.

2. Before we start VTune, some preparation may be needed as shown in Figure A.2.

Figure A.2: Preparing to run Intel VTune Amplifier XE.


In order to use VTune, environment variables must be set by sourcing the script file located at the
following path: /opt/intel/vtune_amplifier_xe/amplxe-vars.sh. Additionally, the
user of VTune must belong to user group vtune, the sampling driver must be loaded, and the NMI
watchdog must be disabled.
When the preparation work is done, launch VTune with the command amplxe-gui.


Figure A.3: Intel VTune Amplifier XE flash screen.


3. After you launch the VTune graphical user interface with command amplxe-gui, you will see a
window inviting you to create a new Project or open an existing one. Projects are containers for analysis
settings and results. Create and configure a new project named “Host-Workload” as shown in Figure A.4.
This project will contain the host-only application in step-00-xeon.


Figure A.4: Creating and configuring a new project in Intel VTune Amplifier XE.


4. Now we are ready to profile the application. Click the orange triangle in the toolbar and choose “Sandy
Bridge” / “General Exploration” in the sidebar menu as the Analysis Type as shown (see the top panel
of Figure A.5). As the name suggests, this is a general-purpose analysis. Click the large button “Start”
at the right-hand side of the VTune window. VTune will launch your application. You can monitor the
progress of the application by switching to the terminal window (see the bottom panel of Figure A.5). In
order to switch to the terminal window, you can press Alt+Tab or mouse-click the Terminal window at
the bottom of the Gnome desktop.


Figure A.5: Launching and monitoring the General Exploration analysis for a host application.


5. Once VTune processes the analysis results, we can view them. Let us navigate the VTune interface to
get accustomed to the information that it displays.


Figure A.6: Viewing the General Exploration analysis results for a host application.

Initially, you will see the “Summary” tab (top panel of Figure A.6). It contains cumulative metrics such
as the elapsed time, CPI rate and platform information. You can mouse over the question marks on this
page, and VTune will display help information on the respective metric in a pop-up window.
You can also see a breakdown of the sampled events by switching to the “Bottom-Up” or “Top-Down”
tab (shown in the bottom panel of Figure A.6). There, you see functions and modules and the number of
events measured in these modules. Event counts that appear sub-optimal are automatically highlighted
in pink. The “Bottom-Up” and “Top-Down” tabs are helpful in identifying which part or parts of a code
are responsible for certain negative metrics.


6. Information collected by VTune can be presented in different viewpoints. In order to switch to a different
viewpoint, click the word “change” in the header of the window (top panel of Figure A.7). The viewpoint
“Hotspots” is particularly helpful for optimizing applications. In this viewpoint, the primary metric
shown in the "Bottom-Up" and "Top-Down" tabs is the CPU time. This makes it possible to find the bottlenecks (hotspots) of the application (see the bottom panel of Figure A.7).


Figure A.7: Switching to the “Hotspots” view.



7. A very powerful feature of Intel VTune Amplifier XE is the ability to narrow down the hotspots to individual lines of code or even individual assembly instructions. In order to get to that view, double-click any function in the "Bottom-Up" view. The result is shown in Figure A.8. In order to see the assembly listing corresponding to the C/C++ code, click the "Assembly" button above the code listing.
Note that in order to enable source code viewing, the application must be compiled with the compiler argument -g. It is advisable to also use -O3 to avoid slowing down the calculation.

Figure A.8: Viewing hotspots in the code.


8. Now that we have learned how to analyze a host application, let us profile an application that uses an
Intel Xeon Phi coprocessor in the offload mode. In order to do that, create a new project by clicking the
button with the “+” symbol in the toolbar (top panel of Figure A.9). Then configure a new project called
“Offload-Workload” with the executable step-01-offload/offload-workload.


Figure A.9: Configuring a project for an offload application.

In fact, there is no difference between configuring a project for a host-only application and one with
offload. However, now is a good time to learn how to control the sampling interval.
Enter value “5” into the box “Automatically resume collection after”. With this setting, VTune will begin
sampling 5 seconds after the launch of your application. This allows you to exclude the initialization of
the application from the analysis. For offload applications, this is especially important, because while
the application and dependent libraries are being transferred to the coprocessor at the beginning of the
run, nothing worth profiling usually happens.
Enter value “12” into the box “Automatically stop collection after”. This setting makes VTune terminate
sampling 12 seconds after the launch of the application. This allows you to exclude finalization stages
from the analysis. You can also manually stop profiling any time using the buttons in the right-hand side
of the VTune window.


9. Now we will run analysis on the coprocessor. Click the orange triangle in the toolbar and choose analysis
type “Knights Corner Platform Analysis” / “General Exploration” (you can also choose “Lightweight
Hotspots” if you wish) as shown in the top panel of Figure A.10. This will run the analysis on the
coprocessor. If you wish to profile the host part of an offload application, then choose “Sandy Bridge...”
/ “General Exploration” (we will not do it in this case).
Click the button “Start Paused” to launch the application and start profiling. The button “Start” is greyed
out because in the previous step we chose to start sampling 5 seconds after the launch of the application.


Figure A.10: Profiling an application with offload to an Intel Xeon Phi coprocessor.

When the application terminates, or if you terminate sampling manually, you will see cumulative
sampling information (bottom panel of Figure A.10). The metrics here are different from the metrics
that you saw in the Sandy Bridge architecture analysis. However, you can still get information about the
metrics by placing the mouse cursor over the question mark symbols.


10. Finally, let us analyze an application compiled for native execution on an Intel Xeon Phi coprocessor.
The configuration of a VTune project in this case is slightly different from the configuration for a
host application. You must set micnativeloadex as the application to run. The name of the
executable, native-workload, must be placed in the line “Application Parameters”. You must also
specify the working directory so that the micnativeloadex tool can find the executable. If the
application uses any external libraries, such as the Intel OpenMP library used in this application, you
must also set the value of the environment variable SINK_LD_LIBRARY_PATH. This variable points
to the directories where micnativeloadex searches for libraries that must be transferred to the
coprocessor. See Section 2.1.3 for more information about using micnativeloadex to run native
coprocessor applications. Figure A.11 shows the project configuration window for a native application.


Figure A.11: Configuring a VTune project for a native application for Intel Xeon Phi coprocessors.

11. Create a new project called “Native-Workload” with the application native-workload from the
directory step-02-native, as shown in the previous step. Run the analysis of type “Knights Corner
Platform Analysis” / “Lightweight Hotspots” for this application. Ensure that the run was successful by
switching to the terminal window and monitoring the output of the application.


12. At this point, you should be able to analyze applications that run on the host, use the offload model,
or run on the coprocessor. You can find hotspots and determine which application modules incur
negative performance metrics. We have not discussed how to use these metrics in order to improve the
application performance, because the rest of Chapter 4 is dedicated to this subject. However, when you
see references to profiling of an application using VTune in the main text, you will be able to reproduce
those steps.
13. Before concluding, we would like to show you some additional useful techniques in Intel VTune
Amplifier XE. When you view the “Bottom-Up” or the “Top-Down” tab, you can zoom in on a time
interval to study the events in it. In order to zoom in, click on the timeline and drag the mouse to the left
or to the right. Then choose “Zoom In on Selection” from the context menu that appears. This is shown
in Figure A.12.


Figure A.12: Zooming in on a time interval.


14. It is possible to create a custom analysis with events that you want to study specifically. In order to do
that, use one of the buttons at the top of the sidebar menu. You will be given the opportunity to select
the events that you wish to collect for your custom analysis. Once this is done, your custom analysis
type will appear at the bottom of the sidebar menu. See Figure A.13.


Figure A.13: Creating a custom analysis type.


15. You can start analysis from the command line. In order to see what command line VTune uses to launch
the profiling that you configured, click the button “Command Line...” at the bottom right corner of the
VTune window. This is shown in Figure A.14. When you have collected profiling information for an
application, you can then use amplxe-gui to load and view the results.


Figure A.14: Obtaining the shell command to start the configured analysis.

16. Optional: use the techniques discussed in Section 3.2.7 to optimize one of the workloads used in this lab. Use VTune to perform the General Exploration analysis. Compare the results. You can just look at both results, or use the "Compare Results" function available via a button in the tool bar (below the menu bar) that looks like two halves of a circle (see Figure A.15).

Figure A.15: Comparing results in VTune.

17. Optional: you can also study the tutorial included in Intel VTune Amplifier XE. This tutorial can be
found by pointing the web browser to the following local URL on a machine with installed Intel VTune
Amplifier XE:
file:///opt/intel/vtune_amplifier_xe/documentation/en/tutorials/
The official documentation for VTune can be found in [74].


Question 4.1.a. What is the difference between the configuration of a VTune project for a host-only application
and the configuration of a project for a native application for Intel Xeon Phi coprocessors?
Question 4.1.b. What happens if you analyze an application with offload to the coprocessor using the “Sandy
Bridge” / “General Exploration” analysis type?
Question 4.1.c. If you want to identify hotspots on the level of individual lines of source code, what compiler
argument must you use when you compile the application?

Answers

Answer 4.1.a. For host-only workloads, one must specify the executable file of the workload as the
application to analyze. For native workloads for coprocessors, one must specify micnativeloadex as the
application and specify the executable as the application parameter.

Answer 4.1.b. You will obtain the performance metrics of the host portion of the application.

Answer 4.1.c. Use -g to include symbols in the executable, and -O3 to avoid slowing down the application during the analysis.

A.4.2 Using Intel Trace Analyzer and Collector

The following practical exercises correspond to the material covered in Chapter 4 (pages 139 – 257)
Goal

Intel Trace Analyzer and Collector is a powerful tool for understanding MPI application behavior, quickly finding bottlenecks, and achieving high performance for parallel cluster applications.
The following instructions cover the basics of the Intel Trace Analyzer and Collector interface and functionality. Let us return to the problem of calculating the number π, as presented in Section 4.7.3. You can experiment with the source code and see how it affects the results in the Intel Trace Analyzer and Collector. Use the Makefile B.4.2.1 and the source code B.4.2.2, which are located in the corresponding lab folder labs/4/4.2-itac/step-00/.

Preparation
The Intel Trace Analyzer and Collector should be installed on the host computer. Its current version (as
of this writing) is 8.1.0.024. If your system has a different version, use it instead of the one presented in the
instructions.
The Intel Trace Analyzer and Collector requires that libVT.so be located on the Intel Xeon Phi coprocessor for collecting trace data, and requires the setup of other environment variables:
user@host% sudo scp /opt/intel/itac/8.1.0.024/mic/slib/libVT.so mic0:/lib64
user@host% . /opt/intel/itac/8.1.0.024/intel64/bin/itacvars.sh impi4

The parameter impi4 indicates which version of Intel MPI will be used with the Intel Trace Analyzer and Collector. We can save some space by using NFS sharing, described in Lab A.1.2. The whole /opt/intel folder can be shared, to include the required libraries from all Intel products installed on the system. To use the Intel Trace Analyzer and Collector libraries and traces from the Intel Xeon Phi coprocessors we need to run the following script:


user@host% . $VT_ROOT/mic/bin/itacvars.sh impi4

It will configure environment variables and corresponding paths for libraries.

Troubleshooting
In the following section you will see standard error messages from MPI runs, the reasons that cause them, and the ways to fix those problems.

intel64/bin/pmi_proxy: line 1: syntax error: unexpected ")" Intel MPI is not configured to be used with the MIC architecture. Set the I_MPI_MIC=1 environment variable.

MPI run halts while running on several Intel Xeon Phi coprocessors. Caused by a missing connection between the Intel Xeon Phi coprocessors. Enable IP packet forwarding on the host: modify /etc/sysctl.conf and set net.ipv4.ip_forward = 1.

MPI run halts on any two devices. Communication between the devices is blocked. Turn off iptables and see if it helps: sudo service iptables stop.

Instructions

1. Using the files B.4.2.1 and B.4.2.2 from the labs/4/4.2-itac/step-00 folder, compile and execute the binary files to produce a trace file:

user@host% cd ~/labs/4/4.2-itac/step-00
user@host% make
user@host% make run
The last command is equivalent to the following list of commands:


user@host% export I_MPI_MIC=1                  # turn on MIC architecture support for Intel MPI
user@host% export VT_LOGFILE_FORMAT=stfsingle  # create a single file for the MPI trace
user@host% export I_MPI_PIN_DOMAIN=omp         # pin MPI processes to correspond to OpenMP
user@host% mpirun -trace -n 2 -host hostmic0 ./runme-mpi : \
% -host mic0 -n 1 ~/runme-mpi.MIC : \
% -host mic1 -n 1 ~/runme-mpi.MIC

Most probably you will see the following error messages:

user@host% make run


...
ERROR: ld.so: object ’libVT.so’ from LD_PRELOAD cannot be preloaded: ignored.
~/runme-mpi.MIC: error while loading shared libraries: libmkl_intel_lp64.so:
cannot open shared object file: No such file or directory

The first message is produced by the Intel Trace Analyzer and Collector on the Intel Xeon Phi coprocessors and can be fixed by setting up the proper environment for the MIC architecture:

user@host% . /opt/intel/itac/8.1.0.024/mic/bin/itacvars.sh

The second message is about the Intel MKL library location on the Intel Xeon Phi coprocessors. If you mounted the NFS share and set up the Intel C++ Compiler environment, then the following small trick will resolve the issue:


user@host$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MIC_LD_LIBRARY_PATH

This combines the library paths for the host architecture with the MIC architecture files.
2. The procedure above should result in the creation of the runme-mpi.single.stf trace log file. This file can be visualized with the Intel Trace Analyzer and Collector:
user@host% traceanalyzer runme-mpi.single.stf
This command should be executed in the terminal of the remote desktop client. Otherwise, X11 forwarding should be enabled to display the GUI of the Intel Trace Analyzer and Collector.
You will see the main interface of the Intel Trace Analyzer and Collector, similar to the one shown in Figure A.16.


Figure A.16: Initial view of Intel Trace Analyzer and Collector application.
pa
re

This window provides general information about the traced MPI run with a summary of the time spent on
MPI communication (Group MPI) and on other calculations (Group Application).

The MPI communication timeline provides more information about the application run. To open the timeline
visualization for all MPI ranks, click the "Charts" menu of the internal window and choose "Event Timeline",
as shown in Figure A.17.



Figure A.17: Choosing “Event Timeline" chart from Intel Trace Analyzer and Collector.

The default color codes are the following: red and blue zones correspond to the MPI and Application groups;
black and blue lines show point-to-point and collective operations. This can be modified by right-clicking
on the chart and choosing "Event Timeline Settings...". Since collective operations use the same color
as application blocks, it is recommended to change the color by clicking the "Collective Operations Color"
button and also checking the "Use thick Lines for Collective Operations" checkbox.

Figure A.18: Event Timeline chart with highlighted broadcast messages (green).

3. Statistics about point-to-point MPI communication between ranks can be activated by choosing "Message
Profile" from the Charts menu. The color coding corresponds to the latency of the MPI communication.

The default view of the Event Timeline chart shows individual timelines per rank. It might be helpful to
use hostname grouping instead. Choosing "Process Aggregation" from the "Advanced" menu allows you to group
processes by node. Select "All_Nodes" from the list and click the "Apply" button.

Figure A.19: MPI process tracks grouped by host names, and the communication statistics between them.


In the "Advanced" menu we can select "Function Aggregation" and change "Major Function Groups"
to "All Functions", which will change the captions of the blocks and provide more information on which
MPI functions were used within the MPI groups.
Zooming into a specific area can be done with a mouse selection of the horizontal region, with the menu
selection, or with the keyboard shortcuts.
Filtering, tagging, and explicit frame limit specification can be changed through the buttons at the bottom of
the Intel Trace Analyzer and Collector window.

4. Advanced exercise: the Intel Trace Analyzer and Collector can visualize function calls within the user
application. To do this, the application should be compiled with the -tcollect flag and the corresponding
path to the Intel Trace Analyzer and Collector libraries specified in the Makefile.

Figure A.20: Timeline chart of MPI processes with traces of function calls.

Change the Function Aggregation view to get the desired function naming scheme.


A.4.3 Serial Optimization: Precision Control, Eliminating Redundant Operations


The next practical exercise corresponds to the material covered in Section 4.2 (pages 141 – 153)

Goal
In the following practical exercise, you will be asked to optimize source code that calculates the error
function (also known as the Gauss error function):
\[
\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt \tag{A.1}
\]

with the rational approximation [83]:

\[
\mathrm{erf}(x) \approx 1 - \left(a_1 t + a_2 t^2 + \cdots + a_5 t^5\right) e^{-x^2},
\qquad t = \frac{1}{1 + px} \tag{A.2}
\]

\[
p = 0.3275911,\quad a_1 = 0.254829592,\quad a_2 = -0.284496736,
\]
\[
a_3 = 1.421413741,\quad a_4 = -1.453152027,\quad a_5 = 1.061405429,
\]

which accurately (\(\epsilon \le 1.5 \times 10^{-7}\)) represents the non-negative part of the error function. And since the error
function is an odd function, the following property will be used:

\[
f(-x) = -f(x) \tag{A.3}
\]


Figure A.21: The erf(x) function (A.1) is shown in red (solid) and the rational approximation (A.2) in blue (dashed).


Instructions
1. Consider the unoptimized source code B.4.3.3 from
labs/4/4.3-serial-optimization/step-00/erf.cpp of the error function erf(x), and the
corresponding main.cpp (B.4.3.2) and Makefile (B.4.3.1).
Compile and run the code.
Note: Write down the calculation times at each step of the optimization.
The myerff() function is implemented in a separate file (erf.cpp), and you will be asked to modify
it for optimization purposes. To use this scalar function within a vector environment, the #pragma
simd and __attribute__((vector)) constructs are used. During the compilation, the
-vec-report3 flag is used to display the vectorization report. Since we plan to run this code
on the Intel Xeon Phi coprocessors, we use intrinsic aligned memory allocation and deallocation for the fIn and
fOut input/output arrays. Use printf() to output: the minimum/maximum argument values and the
corresponding function values; the number of seconds spent on the calculation of 2^28 points on the grid; and the
relative error obtained by comparison with the library implementation of erf().

Question 4.3.a. What optimization techniques can be applied to the source code to speed up the
execution?

2. Common subexpression elimination. Using the original unoptimized code B.4.3.3 from the step-00
subfolder, modify the method of computing the powers of the variable t. Try to implement two different
approaches. For the first method, try using the pow() function from the math.h library (see
Listing B.4.3.4). For the second method, try using multiplication by the previous power values (see
Listing B.4.3.4). Compare the performance of the two implementations. A sketch of both approaches follows.
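A minimal sketch of the two approaches discussed above. The coefficients follow Equation (A.2); the function names and surrounding structure are illustrative assumptions, not the exact code of the listings.

#include <math.h>

// Coefficients from Equation (A.2)
const float a1 =  0.254829592f, a2 = -0.284496736f, a3 = 1.421413741f,
            a4 = -1.453152027f, a5 =  1.061405429f;

// Approach 1: powers of t computed with powf() from math.h (an expensive call per power)
float PolyWithPow(const float t) {
  return a1*powf(t, 1.0f) + a2*powf(t, 2.0f) + a3*powf(t, 3.0f) +
         a4*powf(t, 4.0f) + a5*powf(t, 5.0f);
}

// Approach 2: common subexpression elimination -- each power re-uses the
// previously computed one, replacing powf() calls with multiplications
float PolyWithMultiplication(const float t) {
  const float t2 = t*t, t3 = t2*t, t4 = t3*t, t5 = t4*t;
  return a1*t + a2*t2 + a3*t3 + a4*t4 + a5*t5;
}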


3. Explicit specification of literal constant types. Since we are using float type variables as the
input and output parameters of the function myerff(), all the literal constants should be specified
as float as well. This avoids unnecessary implicit type casting. Floating-point constants have
double type by default, and thus we need to use the "f" suffix to make them float type, e.g.,
1.0f.

Question 4.3.b. What suffixes should be used to explicitly specify the constant "1" as long and
unsigned long?

Compare your result with B.4.3.6 from the step-03 subfolder.


Ex

4. Precision control and optimized functions. In our code, most of the computational resources are
spent on calculating the exp(-x*x) multiplier of the resultant value. The double exp(double);
function call typecasts our (float) -x*x into double and we lose precision again when the result
is converted back to the float type. This can be avoided by using float expf(float); function
call instead.
Another approach is to use binary mathematical functions, which due to system architecture get better
performance. The following mathematical property can be used:

ea = 2a log2 e (A.4)
In the Equation A.4, log2 e = 1.442695040 is a constant, which can be specified before the result
calculation line. In this case it will be inlined by compiler to the expression. exp2f() function call
can be used to calculate powers of 2.
Implement the method described above and compare your results with the source code B.4.3.7 from
step-04 subfolder.
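A minimal sketch of the two variants discussed above (the function and variable names are illustrative assumptions, not the exact code of listing B.4.3.7):

#include <math.h>

// Variant 1: stay in single precision by calling expf() instead of exp()
float ExpTermSingle(const float x) {
  return expf(-x*x);                   // no float -> double -> float round trip
}

// Variant 2: use the base-2 exponential via the identity e^a = 2^(a*log2(e))
float ExpTermBase2(const float x) {
  const float log2e = 1.442695040f;    // log2(e), folded into the expression by the compiler
  return exp2f(-x*x*log2e);
}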


5. Branch elimination. In general, branching code is not good for efficient auto-vectorization,
and it significantly slows down execution when used within a function that is called multiple
times. Therefore, branching should be avoided as much as possible.
In our previous implementations we used an explicit comparison check twice: the argument of the
approximation is supposed to be a non-negative number. This branching corresponds to the oddness property of the function (see A.3).
Using bit-wise operations we can speed up the execution. On the downside, we make the code architecture-
dependent: it will work on 64-bit systems with little-endian storage (Windows, Linux, Intel-based
Mac OS, etc.), but other systems may have big-endian storage (e.g., PowerPC) and thus will
calculate the wrong result. Use this technique with caution.
On the Intel Xeon and Intel Xeon Phi coprocessor architectures, float numbers have the upper bit
corresponding to the sign, 8 bits representing the exponent, and 23 bits containing the fractional part of the
number. We will use a bit-wise AND operation to clear only the upper bit, which gives the absolute
value of the argument.
To get the sign of the argument, a bit-wise AND with the 0x80000000 mask constant should be used. Afterwards,
it should be applied to the result value with a bit-wise OR operation.
Implement your code and compare it with B.4.3.8 from the step-05 subfolder.
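A minimal sketch of this technique, assuming the float is reinterpreted through an unsigned 32-bit integer via memcpy; the exact variable names and mechanics in listing B.4.3.8 may differ.

#include <stdint.h>
#include <string.h>

// Branch-free handling of negative arguments for an odd function f(-x) = -f(x).
// Valid only where float is IEEE 754 single precision with little-endian storage.
float OddFunctionNoBranch(const float x) {
  uint32_t bits;
  memcpy(&bits, &x, sizeof(bits));                 // reinterpret the float as raw bits
  const uint32_t sign    = bits & 0x80000000u;     // isolate the sign bit
  const uint32_t absBits = bits & 0x7FFFFFFFu;     // clear the sign bit: |x|
  float ax;
  memcpy(&ax, &absBits, sizeof(ax));

  float result = /* rational approximation evaluated at ax, omitted here */ ax;

  uint32_t resBits;
  memcpy(&resBits, &result, sizeof(resBits));
  resBits |= sign;                                 // re-apply the original sign with bit-wise OR
  memcpy(&result, &resBits, sizeof(result));
  return result;
}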
e
6. Explicitly specify which vector instructions should be used. This may speed up the resulting code. Since
we plan to run the compiled binary code on an Intel Xeon processor, which supports the AVX instruction set,
the -xAVX compiler flag should be used. Changing the value of the -fp-model flag affects the final performance
as well.
The step-06 subfolder has a modified Makefile with the corresponding flags.


Advanced Exercise

Modify the main.cpp source code file to implement parallelism and vectorization using OpenMP. Compare
your result with B.4.3.9 from the step-0p subfolder.

[Bar chart: Serial optimization performance (lab 4.4), showing execution time in seconds (lower is better) on the host system and on the Intel Xeon Phi coprocessor for each optimization step: unoptimized, pow(), common subexpression elimination, constant types, optimized functions, eliminated branches, -xAVX flag, and parallelized and vectorized.]


Answers

Answer 4.3.a. Optimization techniques:

• Common subexpression elimination

• Precision control

• Explicit specification of literal constant types

• Use of optimized functions

• Eliminating branches

• Compiler switches

• Using parallelism and vectorization

Answer 4.3.b.

1 long lvar = 1L;
2 unsigned long luvar = 1UL;

A.4.4 Vector Optimization: Unit-Stride Access, Data Alignment



The next practical exercise corresponds to the material covered in Section 4.3.1 (pages 153 – 157)

Goal

Optimize the code for automatic vectorization. Apply the technique to the problem of calculating the electric
potential on a grid.

Instructions

1. Compile and execute source code B.4.4.2 and the corresponding Makefile B.4.4.1 from
labs/4/4.4-vectorization-data-structure/step-00.
This code calculates the values of the potential on a grid formed by charged points. These points
are described as struct Charge structures and contain the corresponding coordinates and charge values.
The coordinates and the charge value have float type. Therefore, each structure takes 4x4=16 bytes of
memory.
Modify this source code to apply unit-stride data access in order to speed up the program by utilizing vector
instructions more efficiently.
Compare your result with the source code B.4.4.3 from the step-01 subfolder.
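A minimal sketch of the data-layout change behind unit-stride access. The member names mirror the description above, but the exact names in the lab code may differ.

// Array of Structures (AoS): accessing chg[i].x across i is a strided access
struct Charge {
  float x, y, z, q;    // coordinates and charge value, 4x4 = 16 bytes
};

// Structure of Arrays (SoA): each member is a contiguous array, so a
// vectorized loop over i reads x[i], y[i], z[i], q[i] with unit stride
struct ChargeArrays {
  float *x, *y, *z, *q;
};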

2. Additional performance can be achieved by using special compiler flags to control the precision of
floating-point operations.
For instance, using the -fimf-domain-exclusion flag we can exclude some special, computationally
expensive classes of floating-point values:

    value class     integer value
    extremes              1
    NaNs                  2
    infinities            4
    denormals             8
    zeros                16
    none                  0
    all                  31
    common               15

The integer value of the flag is calculated as the bitwise OR of the bit flags of the excluded classes.
"Exclude" means that the code generated by the compiler does not have to handle that category of values,
thus providing additional speedup.
-fimf-accuracy-bits defines the relative error allowed for the results of math library functions.
In this step, try using these compiler arguments and monitor their effect on performance. Compare your
result with B.4.4.5 and B.4.4.4 from step-02.
A.4.5 Vector Optimization: Assisting the Compiler

The next practical exercise corresponds to the material covered in Section 4.3.2 (pages 157 – 161)

Goal

You will be asked to optimize sparse matrix-vector multiplication.



Consider the problem of finding the result of the multiplication of a sparse matrix M by a vector A:
M x A = B. To save memory space and calculation time, the original matrix M can be stored as a packed array
of contiguous chunks of non-zero elements plus additional arrays with information about those non-zero
chunks: e.g., starting position, length, offset.

An example of a small 16 × 16 sparse random matrix, the vector, and the multiplication result is presented below:

Results of packing a sparse 16 x 16 matrix:


Contains 7 non-zero blocks, a total of 32 non-zero elements.
Average number of non-zero blocks per row: 0
Average length of non-zero blocks: 4
Matrix fill factor: 12.50%
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.27 0.00 T
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.54 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.38 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.72 0.14 0.61 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.76 0.67
0.00 0.97 0.90 0.85 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 1.51
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.67 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.89 0.35 0.06 0.02 0.00 0.53 0.80
0.00 0.00 0.00 0.00 0.00 0.00 0.80 0.91 0.20 0.34 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.86
0.00 0.00 0.00 0.53 0.77 0.40 0.89 0.16 0.40 0.92 0.07 0.95 0.53 0.00 0.00 0.00 x 0.44 = 3.47
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.93 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.64 0.52 0.49 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.93 0.58
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.72 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.48 0.63 0.36 0.51 0.00 0.00 0.00 0.28 1.44
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.74 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.64 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.35 0.00

For matrix-vector multiplication, the processing of rows of matrix M can be parallelized.


Preparation
Note: For the application studied in this lab, hyper-threading is counter-productive. Set the number of
OpenMP threads as follows (assuming 16 physical cores on the host and 60 physical cores on the coprocessor):

user@host% export OMP_NUM_THREADS=16


user@host% export MIC_ENV_PREFIX=MIC
user@host% export MIC_OMP_NUM_THREADS=120

Instructions
1. Look at the following list of files located in the
labs/4/4.5-vectorization-compiler-hints/step-00 folder:

Makefile (see B.4.5.1) compiles our source code for the host and for Intel Xeon Phi coprocessors,
with flags activated for OpenMP and the auto-vectorization report.

main.cc (see B.4.5.2) demonstrates a simple example with a small 16 × 16 sparse matrix multiplied
by a vector, and then performs benchmark testing for larger 20000x20000 matrices with
rows of 100 non-zero elements on average; the initialization and testing functions are also implemented
here.

worker.h (see B.4.5.3) contains the declaration of the PackedSparseMatrix class and a
detailed description of the variables used in this class.

worker.cc (see B.4.5.4) has the class implementation. The constructor of the class creates the
packed version of a sparse matrix provided to it. The MultiplyByVector method implements
multiplication of the packed representation of a sparse matrix by a given vector.

Compile and execute the code for the host system and the Intel Xeon Phi coprocessors.
2. #pragma loop_count can be used to help the Intel C++ Compiler optimize the executable for the expected
number of loop iterations by choosing the optimal execution path. It only leads to an increase in
performance when the actual loop count in the program is in agreement with the prediction value
specified in the pragma at compile time.

Question 4.5.a. Where do you think #pragma loop_count should be used, and with what parameter
value?

Compare your result with B.4.5.5 from the step-01 subfolder.


3. Intel Xeon Phi coprocessors may benefit if we make matrix M and vector B aligned in memory, since
this helps utilize the MIC vector instructions more efficiently.
In the source code above, modify the declaration and usage of matrix M and vector B. Provide an additional
clue to the compiler by specifying #pragma vector aligned. A small sketch follows.
Compare your result with B.4.5.6, B.4.5.7, and B.4.5.8 from the step-02 subfolder.
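A minimal sketch combining the two compiler hints discussed here, under the simplifying assumption that both pointers of the inner loop are 64-byte aligned; the class and member names of worker.cc may differ.

#include <malloc.h>   // _mm_malloc / _mm_free

// Illustrative inner loop of a packed sparse matrix-vector product.
// chunk points into the packed matrix data, xChunk into the matching part of the vector.
float ChunkDot(const float* chunk, const float* xChunk, const int len) {
  float sum = 0.0f;
#pragma loop_count avg(100)   // about 100 non-zero elements per row are expected
#pragma vector aligned        // both pointers are promised to be 64-byte aligned
  for (int k = 0; k < len; k++)
    sum += chunk[k] * xChunk[k];
  return sum;
}

// The "aligned" promise is backed by aligned allocation (64 bytes = one MIC cache line):
//   float* data = (float*)_mm_malloc(n * sizeof(float), 64);
//   ...
//   _mm_free(data);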


Answers

Answer 4.5.a.
1 #pragma loop_count avg(100)

This pragma can be added before the summation loop within the MultiplyByVector method implementation of
the PackedSparseMatrix class, since we know a priori that this loop will have 100 iterations on average.

A.4.6 Vector Optimization: Branches in Auto-Vectorized Loops


The next practical exercise corresponds to the material covered in Section 4.3.3 (pages 161 – 166).

Goal
In the following practical exercise we will examine conditional branching within the innermost auto-
vectorized loop. The intrinsic instructions of the MIC architecture (the Intel Xeon Phi coprocessor architecture) include
masked versions of most vector instructions, which apply the specified operation only if the mask for the
corresponding element is set to a non-zero value. With the #pragma simd pragma we can force the compiler
to use those masked instructions, provided there are no other issues with auto-vectorization (for instance, a vector
dependency).

Preparation

Note: For the application studied in this lab, hyper-threading is counter-productive. Set the number of
OpenMP threads as follows (assuming 16 physical cores on the host and 60 physical cores on the coprocessor):

user@host% export OMP_NUM_THREADS=16
user@host% export MIC_ENV_PREFIX=MIC
user@host% export MIC_OMP_NUM_THREADS=120

Instructions

1. Source code B.4.6.3 from the following location:

labs/4/4.6-vectorization-branches/step-00/worker.cc

contains implementations of the two functions we will study during this short practical exercise.
The function NonMaskedOperations() contains several nested loops. The innermost loop will be
automatically vectorized by the Intel C++ Compiler, while the other loops will not.
The function MaskedOperations() is implemented in a similar manner, but has an additional conditional
check within the internal loop. Even with this branching, the innermost loop will still be vectorized by
the compiler, since masked intrinsic vector instructions can be used.
Both of these functions use #pragma omp parallel for on the outer loop to parallelize the
calculations with OpenMP.
The source code main.cc B.4.6.2 and Makefile B.4.6.1 take care of data preparation, performance
measurement, and printing out the results. Aligned memory allocation is used for the data and flag
arrays. A pointer to each function is passed as an argument to the Benchmark() function, which also prints
out the execution statistics. Four different mask patterns are used for the testing.
Depending on the mask we apply, the performance of the two implementations can differ.


Compile and execute this program. Compare your result with the one provided in Section 4.3.3.
2. Modify the source code above to explicitly vectorize the internal loop with #pragma simd.
Check the performance change of those modifications.
Compare your source code with B.4.6.4 from step-01.
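A minimal sketch of the pattern studied here (the function and array names are placeholders, not the exact code of B.4.6.4): the conditional inside the loop can be compiled into masked vector instructions when the loop is vectorized explicitly with #pragma simd.

// Branch inside an explicitly vectorized loop: with #pragma simd the compiler
// may emit masked vector instructions, applying the update only where flag[i] != 0.
// For brevity, n is assumed to be a multiple of 16.
void MaskedOperationsSketch(const int n, const float* in, const char* flag, float* out) {
#pragma omp parallel for
  for (int i = 0; i < n; i += 16) {
#pragma simd
    for (int j = i; j < i + 16; j++) {
      if (flag[j])
        out[j] = in[j] * in[j] + 1.0f;
    }
  }
}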

A.4.7 Shared-Memory Optimization: Reducing the Synchronization Cost


The next practical exercise corresponds to the material covered in Section 4.4.1 and Section 4.4.2
(pages 166 – 175)

Goal
Implement a parallel version of the histogram creation algorithm and optimize it for caching, avoiding false
sharing and cache line stealing.

Instructions

1. The files for this practical exercise (Makefile B.4.7.1, main.cc B.4.7.2, and worker.cc B.4.7.3)
are located in the labs/4/4.7-optimize-shared-mutexes/step-00 folder. main.cc initializes a
random array of ages, which will be used for histogram creation (the Histogram() function call from
worker.cc). The calculated histogram occupancy is compared with the result of the serial implementation for
correctness, and performance statistics are printed out.

The function Histogram() from the worker.cc source code file is a serial, unoptimized version of the histogram
calculation function. Your task is to optimize this code. From our previous practical exercises we
know that division is slower than multiplication. Thus, you need to modify the code to use
multiplication wherever possible; pre-compute the reciprocal. Also use the strip-mining technique to split
the loop into two nested loops. This will allow the inner loop to be vectorized. Use #pragma vector
aligned to notify the compiler that an aligned array will be used.

Try to implement an additional loop that takes care of the tail iterations if the total number of elements
in the array is not a multiple of the vecLen variable (the length of the vectorized loop). A sketch of this
transformation is shown below.

Compare your result with the source code B.4.7.4 from the step-01 subfolder. Make sure that the inner loop
is vectorized by setting the -vec-report3 compiler flag.
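A minimal sketch of the strip-mined loop with a pre-computed reciprocal and a tail loop. The variable names and the two-phase index/increment split are illustrative assumptions, not the exact code of B.4.7.4.

void HistogramSketch(const float* age, int* hist, const int n, const float groupWidth) {
  const int vecLen = 16;                           // strip length chosen for vectorization
  const float recGroupWidth = 1.0f / groupWidth;   // pre-computed reciprocal replaces division
  const int nPrime = n - n % vecLen;               // largest multiple of vecLen not exceeding n
  int index[vecLen] __attribute__((aligned(64)));  // scratch array for bin indices

  for (int ii = 0; ii < nPrime; ii += vecLen) {
#pragma vector aligned
    for (int i = ii; i < ii + vecLen; i++)         // vectorized strip: only computes bin indices
      index[i - ii] = (int)(age[i] * recGroupWidth);
    for (int c = 0; c < vecLen; c++)               // scalar phase: increments the bins
      hist[index[c]]++;
  }
  for (int i = nPrime; i < n; i++)                 // tail loop for the remaining elements
    hist[(int)(age[i] * recGroupWidth)]++;
}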



2. In the previous step we applied data parallelism (vectorization of the code). The next step is to use thread
parallelism, which will be implemented with OpenMP.
Apply an OpenMP pragma to the external for loop so that it runs in parallel. To avoid a race condition, use
the #pragma omp atomic mutex to protect the hist[] array modification. Although this operation is
highly inefficient and is presented here only for educational purposes, the approach can still be used for
lightly loaded operations.
The parallel version with the OpenMP atomic mutex pragma can be found in the step-02 subfolder, source code
B.4.7.5.
3. Optimize the previous parallel code by using private variables to hold a copy of the histogram in each thread.
To do so, use #pragma omp parallel and #pragma omp for separately. Use an aligned array
for storing the temporary histogram values. You should then collect the total count for each
corresponding cell from all the private histogram arrays. Use #pragma omp atomic to avoid race
conditions.
Compare your result with B.4.7.6 from the step-03 subfolder.


4. False sharing and cache line stealing. Using the previous example, create a shared two-dimensional array
that will keep the values of the histogram entries (the first dimension) for each individual thread (the second
dimension).
If your code is not automatically vectorized, use a private variable to collect the histogram indexes and use a
second loop to accumulate the counts of those values into the two-dimensional array.
Compile and run your code. Compare your source code with B.4.7.7 from the step-04 subfolder.
Since there are only 5 histogram groups with counters of integer type, we will notice performance issues
due to false sharing.

5. To prevent false cache line sharing, we can increase the distance between the accessed elements by
increasing the number of elements in the array.
Rewrite the code to calculate the new array size using the provided paddingBytes variable. The new
implementation of the array should have this new, larger size. A sketch of the idea is shown below.
Compare your source code with B.4.7.8 from step-05.
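A minimal sketch of padding the per-thread histograms so that each thread's bins occupy their own cache lines. The names binCount, paddingBytes, and nThreads are illustrative, not taken from the lab code.

#include <omp.h>

void PaddedHistogramsSketch(const int binCount, const int nThreads) {
  const int paddingBytes = 64;                             // one cache line on Xeon and Xeon Phi
  const int paddedCount =                                  // bins per thread, rounded up so that
      ((binCount * sizeof(int) + paddingBytes - 1) / paddingBytes)  // each thread's row starts on
      * paddingBytes / sizeof(int);                        // its own cache line
  int* hist = new int[nThreads * paddedCount]();           // zero-initialized, one padded row per thread

#pragma omp parallel
  {
    int* myHist = &hist[omp_get_thread_num() * paddedCount]; // private row: no false sharing
    // ... fill myHist in the parallel loop, then reduce the rows into the final histogram ...
  }
  delete[] hist;
}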

A.4.8 Shared-Memory Optimization: Load Balancing
The next practical exercise corresponds to the material covered in Section 4.4.3 (pages 175 – 179)

Goal

In the following practical exercise you will be asked to write a solver for the linear system M x = b
using the Jacobi method.
To show load imbalance and methods of preventing it, we will not use a single accuracy (threshold) number, but
rather a vector of length nBVectors of accuracy values with a large spread of values, and those
values will be assigned to individual OpenMP threads in parallel. To store the solution vectors x and the vectors b we
will increase their length n by the factor nBVectors.
We will use a "good" type of matrix for the Jacobi method (diagonally dominant). An example of such a 10 × 10 matrix M is below:

90.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
10.0 268.0 12.0 13.0 14.0 15.0 16.0 17.0 18.0 19.0
20.0 21.0 446.0 23.0 24.0 25.0 26.0 27.0 28.0 29.0
30.0 31.0 32.0 624.0 34.0 35.0 36.0 37.0 38.0 39.0
40.0 41.0 42.0 43.0 802.0 45.0 46.0 47.0 48.0 49.0
50.0 51.0 52.0 53.0 54.0 980.0 56.0 57.0 58.0 59.0
60.0 61.0 62.0 63.0 64.0 65.0 1158.0 67.0 68.0 69.0
70.0 71.0 72.0 73.0 74.0 75.0 76.0 1336.0 78.0 79.0
80.0 81.0 82.0 83.0 84.0 85.0 86.0 87.0 1514.0 89.0
90.0 91.0 92.0 93.0 94.0 95.0 96.0 97.0 98.0 1692.0

The vector b is initialized with uniformly distributed random numbers using Intel MKL streams.

Instructions
1. Compile and execute the source code files from labs/4/4.8-optimize-scheduling/step-00:
Makefile B.4.8.1, main.cc B.4.8.2, and worker.cc B.4.8.3.
In the main.cc file, a for loop is iterated nTrials times, calling the IterativeSolver()
function from worker.cc. The average number of iterations is returned by this function and printed out
as part of the execution statistics, as well as the execution time.


The #pragma omp parallel for has a reduction clause and a schedule clause. By changing the
parameter of the schedule clause we can optimize the workload distribution between the parallel threads
and thus avoid load imbalance.

2. Modify the source code above to use Intel Cilk Plus as the parallelism engine.
Compare your result with the source code B.4.8.4 from the step-01 subfolder.

3. To compare the performance difference between scheduling parameters, write code which will call the
IterativeSolver() function within different OpenMP
#pragma omp parallel for schedule(...) pragma environments. The second integer parameter
passed to the schedule clause indicates the grain size.
We suggest you try the following scheduling modes (a minimal sketch follows this list):

• Intel Cilk Plus
• without a schedule clause
• schedule(static, 1)
• schedule(static, 4)
• schedule(static, 32)
• schedule(static, 256)
• schedule(dynamic, 1)
• schedule(dynamic, 4)
• schedule(dynamic, 32)
• schedule(dynamic, 256)
• schedule(guided, 1)
• schedule(guided, 4)
• schedule(guided, 32)
• schedule(guided, 256)

Compare your results with the source code B.4.8.5 from the step-02 subfolder.
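A minimal sketch of one of the scheduling variants. The stub stands in for the lab's IterativeSolver(); the names and signature are illustrative assumptions.

#include <omp.h>

// Stands in for the real Jacobi solver from worker.cc; returns a fake iteration count.
static int IterativeSolverStub(const int v, const double tol) {
  return v + (tol > 0.0);
}

// Dynamic scheduling with a grain size of 4: each thread grabs 4 consecutive
// right-hand-side vectors at a time, which helps when the per-vector iteration
// counts (and therefore run times) differ widely.
void SolveAllSketch(const int nBVectors, const double* tolerance, int* iterations) {
#pragma omp parallel for schedule(dynamic, 4)
  for (int v = 0; v < nBVectors; v++)
    iterations[v] = IterativeSolverStub(v, tolerance[v]);
}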

4. Use Intel VTune Amplifier XE to visualize the concurrency between threads for different scheduling modes.

(a) Use "Concurrency" from "Algorithm Analysis" for the program running on the host Intel Xeon
processor.
(b) For Intel Xeon Phi coprocessors you will have to use "Lightweight Hotspots" from "Knights
Corner Platform Analysis", and filter out the IterativeSolver function calls. See the illustration
below:

Can you explain the waiting areas at different scheduling modes and grain sizes?
fo

A.4.9 Shared-Memory Optimization: Loop Collapse and Strip-Mining for Parallel Scalability

The next practical exercise corresponds to the material covered in Section 4.4.4 and Section 4.4.5
(pages 179 – 196)

Goal

Here, we demonstrate several methods for optimizing code with insufficient parallelism in which
the parallel iteration space can be expanded.

Instructions
1. Source files B.4.9.1, B.4.9.2, and B.4.9.3 can be found at the following location:

user@host% cd ~/labs/4/4.9-insufficient-parallelism
user@host% cd step-00

2. Establish the baseline.


user@host% emacs worker.cc # study the performance-critical section of the code
user@host% emacs worker.cc # study the parameters passed to the code
user@host% make
user@host% ./runme
% record the result on host
user@host% micnativeloadex runmeMIC
% record the result on coprocessor


What was the bandwidth (in GB/s) on the host? What was it on the coprocessor? Can you explain the
poor performance of the coprocessor in this case?
3. Diagnose performance problems.
Use Intel VTune Amplifier XE to run the analysis of type “Concurrency” on the host version of the
application. What is the analysis telling you? Refer to Section 4.4.4 for additional information.
4. Optimization attempt: inner loop optimization.
You should now be in directory step-00. Modify the file worker.cc so that instead of the outer
loop with few iterations, parallelization is applied to the inner loop with multiple iterations. Do you
expect to get an improvement?
For the solution, go to the next step (B.4.9.4 at step-01).

user@host% cd ../step-01
user@host% emacs worker.cc
user@host% make

user@host% ./runme
% record new results on host
user@host% micnativeloadex runmeMIC
% record the results on coprocessor

e ng
nh
Use Intel VTune Amplifier XE to run the analysis of type “Concurrency” on the host version of the
Yu
application. What is the analysis telling you? Refer to Section 4.4.4 for additional information.
r

Explain the difference between the effect of this optimization on the performance of the host and of the
fo

coprocessor.
d
re

5. Optimization attempt: loop collapse.
You should now be in directory step-01. Now we will attempt to increase the iteration space by using
the clause collapse(2) for #pragma omp for.
Modify the file worker.cc:

a) apply parallelization to the outer loop (i);
b) use the clause collapse(2) in the #pragma omp for statement;
c) decide how to perform the reduction: it is now not possible to use the variable sum.

6. For the solution, go to the next step (B.4.9.5 at step-02).

user@host% cd ../step-02
user@host% emacs worker.cc
user@host% make
user@host% ./runme
% record new results on host
user@host% micnativeloadex runmeMIC
% record the results on coprocessor

Can you explain the results? Hint: try to compile worker.cc with the argument -vec-report3.
7. Optimization attempt: loop collapse + strip-mining.
The reason for the failure of the optimization in the previous step is that the compiler does not know
how to automatically vectorize the reduction when the loop collapse technique is used. Let us assist the
compiler by strip-mining the j-loop. You should now be in directory step-02. Use your previous
solution or the file worker.cc in this directory. Strip-mine the j-loop, so that the inner loop along the
strip can be automatically vectorized. A sketch of this transformation is shown below.
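A minimal sketch of the collapsed and strip-mined row-wise reduction. The array names, strip length, and the atomic combination of partial sums are illustrative assumptions, not the exact code of step-03.

// Row-wise reduction of an m x n matrix with few rows (m small, n large).
// collapse(2) expands the parallel iteration space to m*(n/STRIP) chunks,
// while the innermost loop over one strip remains auto-vectorizable.
// For brevity, n is assumed to be a multiple of STRIP.
void RowSumsSketch(const int m, const long n, const float* A, float* sums) {
  const long STRIP = 1024;                     // strip length
#pragma omp parallel for collapse(2)
  for (int i = 0; i < m; i++) {
    for (long jj = 0; jj < n; jj += STRIP) {
      float partial = 0.0f;
      for (long j = jj; j < jj + STRIP; j++)   // vectorized inner loop along the strip
        partial += A[i*n + j];
#pragma omp atomic
      sums[i] += partial;                      // combine partial sums across threads
    }
  }
}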
Look up the solution in step-03 and benchmark it:

user@host% cd ../step-03
user@host% emacs worker.cc
user@host% make
user@host% ./runme
% record new results on host
user@host% micnativeloadex runmeMIC
% record the results on coprocessor

A.4.10 Shared-Memory Optimization: Core Affinity Control


The next practical exercise corresponds to the material covered in Section 4.4.5 (pages 189 – 196)

Goal

Affinity control allows us to gain additional performance by optimizing the distribution of resources.

Instructions

1. To demonstrate core affinity control, for this first step we will use the source code from the last practical
exercise: summation of matrix elements along each row (aka the row-wise matrix reduction). Makefile
B.4.10.1, main.cc B.4.10.2, and worker.cc B.4.10.3 are located in the
labs/4/4.a-affinity/step-00 folder.
Memory bandwidth-intensive calculations like this one are best run with hyper-threading not used
and with KMP_AFFINITY=scatter. This is because the processor or the coprocessor can then employ
all available memory controllers, and at the same time there is no thread contention on the memory
controllers.
Compile and benchmark the affinity-optimized code in the step-00 subfolder:



user@host% make
user@host% export KMP_AFFINITY=scatter
user@host% export OMP_NUM_THREADS=16
user@host% ./runme
...
user@host% micnativeloadex runmeMIC -e "KMP_AFFINITY=scatter" \
%   -e "OMP_NUM_THREADS=120"
...

2. Compute-bound calculations, for instance the DGEMM function from Intel MKL (matrix-matrix
multiplication and summation of the type αA∗B + βC), are highly arithmetically intensive problems. An example
implementation can be found in the B.4.10.4 Makefile and the B.4.10.5 affinity.cpp source code file.
Compile and execute those files from the step-01 subfolder.
Use micnativeloadex to run this program on the Intel Xeon Phi coprocessors. Use the flag
-e "KMP_AFFINITY=compact" to specify the thread affinity mode. Compare the performance of the
two cases.

3. For some problems running on the Intel Xeon processors, KMP_AFFINITY can provide additional
performance as well. Consider the problem of a one-dimensional discrete fast Fourier transform
(DFFT) of a large 4 GB array. Makefile B.4.10.6 and the corresponding affinity.cpp source code
file B.4.10.7 are located in the step-02 subfolder. Compile and execute this program. Note the
performance.
In the same folder two additional shell scripts are provided
(run1_noaffinity.sh and run2_affinity.sh).
Each of them will execute the compiled runme program with modified environment variables, changing the
number of threads used by Intel MKL and the affinity mode.
Run those scripts and compare the performance.

4. Modify the Makefile to specify the -par-affinity compiler flag. This defines the affinity mode at
compile time, and it will be used at run time.

A.4.11 Cache Optimization: Loop Interchange and Tiling


The next practical exercise corresponds to the material covered in Section 4.5.3 and Section 4.5.4,

Chapter 4 (pages 200 – 213)

Goal

Study cache optimization techniques and compare the performance gains from loop interchange and tiling.
Instructions

1. The following practical exercise is based on a program calculating the transient emissivity of cosmic dust
grains (by T. Porter and A. Vladimirov, Stanford University).
Study the source code B.4.11.2, B.4.11.3, and the corresponding Makefile B.4.11.1 from the
labs/4/4.b-tiling/step-00 folder. The physical meaning of the variables and the structure of the
program are explained in Section 4.5.4.
There are three nested loops in worker.cc, iterated with i, j, and k. Within those loops we
access two arrays, planckFunc[] and distribution[]:



1 for (int i = 0; i < wlBins; i++)
2   for (int j = 0; j < gsMax; j++)
3     for (int k = 0; k < tempBins; k++)
4       result += planckFunc[i*tempBins + k]*distribution[j*tempBins + k];

Question 4.11.a. Interchanging i-loop and j-loop will increase the performance. Can you explain
why?

Modify the source code to interchange the nested loops iterated over i and j. The k-loop provides unit-stride
access for vectorization, and thus changing its order would only decrease the performance. Compare your
result with B.4.11.4 from the step-01 subfolder.

2. In the next step you will be asked to tile the i-loop. Define an additional constant iTile and split the i-loop into
two nested loops, with the internal one making iTile iterations.
Try different values of iTile and find the optimal one. The iTile-loop will increase the performance,
since several vector registers will hold the planckFunc[] data, thus reducing the time needed to copy those
chunks from the L1 cache layer. A sketch of the tiled loop nest is shown below.
Compare your source code with B.4.11.5 from the step-02 folder.
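A minimal sketch of tiling the i-loop, continuing the variables of the snippet in instruction 1. The tile size, the placement of the tile loop, and the scalar accumulation follow that simplified snippet rather than the full lab code, and the optimal tile size must be found experimentally as the exercise asks.

// wlBins, gsMax, tempBins, planckFunc[], distribution[], and result are as in the snippet above.
// For brevity, wlBins is assumed to be a multiple of iTile.
const int iTile = 4;                             // illustrative tile size; tune for host and coprocessor
for (int ii = 0; ii < wlBins; ii += iTile)       // loop over tiles of the i-loop
  for (int j = 0; j < gsMax; j++)
    for (int i = ii; i < ii + iTile; i++)        // i-loop restricted to one tile
      for (int k = 0; k < tempBins; k++)         // unit-stride, auto-vectorized loop
        result += planckFunc[i*tempBins + k] * distribution[j*tempBins + k];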


3. Tiling can be used for both the i- and j-loops, providing data locality for both planckFunc[] and
distribution[].
Note: Two nested loops inside a third one prevent it from being auto-vectorized by the compiler. Thus, you will
need to explicitly unroll one of them.
Use the __MIC__ macro and find the optimal parameters iTile and jTile for the Intel Xeon processor
and the Intel Xeon Phi coprocessor.
Compare your results with B.4.11.6 from the step-03 folder.

4. Combine all the steps above together in one file, for instance as shown in B.4.11.7. Use Intel VTune
Amplifier XE to compare the cache replacements at the different cache levels. You should get something similar to
the following plot:

[Plot showing the number of L1 and L2 data cache replacements at each optimization step.]

This plot indicates that at every optimization step the number of L1 and L2 data cache replacements
decreased, since we optimized the data locality by using tiling.

Answers

Answer 4.11.a. The distribution[] array is a private array for each thread, while planckFunc[]
is shared between the threads in the parallel environment. Therefore, it is better to maintain data locality for the
distribution[] array; otherwise portions of it will be copied from the L2 to the L1 cache layer several times
more often (proportionally to the number of threads accessing the same cache) than planckFunc[].

A.4.12 Memory Access: Cache-Oblivious Algorithms


The next practical exercise corresponds to the material covered in the Section 4.5.5 (pages 213 – 216)


Goal
Study the cache optimization technique of cache-oblivious algorithms.

Instructions
1. The folder labs/4/4.c-cache-oblivious-recursion/step-00 contains several files:
Makefile B.4.12.1, main.cc B.4.12.2, and worker.cc B.4.12.3, the source code of a parallel
28000 × 28000 matrix transposition.
The file main.cc implements matrix initialization, verification of correct transposition, and timed calls
of the Transpose() function from the worker.cc source code file. This function uses an Intel Cilk Plus
parallel for to iterate over the external loop. The function is not optimized; it exchanges elements of the
matrix below the matrix's diagonal with elements above it. The Intel C++ Compiler suspects a
vector dependence and thus does not vectorize the inner loop. Using #pragma ivdep we can force the
compiler to auto-vectorize it, but this actually does not increase the performance.
We can increase the performance by applying a tiling algorithm, which improves data locality by
re-using data already in the cache. Try to implement this optimization technique. Compare your result
with the source code B.4.12.4 from the step-01 folder.

2. The program from the previous step can be additionally optimized by providing the #pragma loop_count
avg(TILE) and #pragma simd pragmas, as shown in the B.4.12.5 source code file.

3. A cache-oblivious algorithm shows even better performance for this problem. Try to implement a recursive
function for matrix transposition, using different values of the recursion threshold constant RT for the Intel Xeon processor
and the Intel Xeon Phi coprocessor. A sketch of the recursion is shown below.
Compare your result with B.4.12.6 from the step-03 folder.
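A minimal sketch of the cache-oblivious recursion, shown here for an out-of-place transposition for simplicity. The names, the threshold RT, and the split strategy are illustrative; the lab's in-place version in B.4.12.6 recurses in the same spirit but must additionally respect the matrix diagonal.

// Cache-oblivious transposition B = transpose(A), both n x n.
// The index range is split recursively until a block fits comfortably in cache (<= RT),
// then that block is transposed directly.
const long RT = 64;   // illustrative recursion threshold; tune for Xeon vs. Xeon Phi

void TransposeRec(const float* A, float* B, const long n,
                  const long i0, const long i1, const long j0, const long j1) {
  if (i1 - i0 > RT || j1 - j0 > RT) {
    if (i1 - i0 >= j1 - j0) {            // split the longer dimension in half
      const long im = i0 + (i1 - i0)/2;
      TransposeRec(A, B, n, i0, im, j0, j1);
      TransposeRec(A, B, n, im, i1, j0, j1);
    } else {
      const long jm = j0 + (j1 - j0)/2;
      TransposeRec(A, B, n, i0, i1, j0, jm);
      TransposeRec(A, B, n, i0, i1, jm, j1);
    }
  } else {
    for (long i = i0; i < i1; i++)       // base case: small block, good locality
      for (long j = j0; j < j1; j++)
        B[j*n + i] = A[i*n + j];
  }
}
// Initial call: TransposeRec(A, B, n, 0, n, 0, n);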
4. Vector operations will benefit significantly if the data split points are multiples of the SIMD vector
length. Using the modulo operation, implement recursive splitting at those points. Compare your result with
B.4.12.7 from the step-04 folder.



A.4.13 Memory Access: Loop Fusion



The next practical exercise corresponds to the material covered in Section 4.5.6 (pages 216 – 220)

Goal
Study a cache optimization technique based on loop fusion.

Instructions
1. In the following practical exercise we will calculate the mean value and standard deviation of randomly
distributed data (generated with Intel MKL): 10000 arrays of 50000 elements each.
Initially (see labs/4/4.d-cache-loop-fusion/step-00, B.4.13.1, B.4.13.2, and B.4.13.3), the
initialization function and the calculation of the mean and standard deviation values are called separately.
However, those functions can be combined, providing additional speed-up due to loop fusion
and avoiding additional overhead by using only one OpenMP parallel region. A sketch of the fused loop is shown below.
Combine those functions and compare your result with the source code B.4.13.4 from the step-01
subfolder.
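A minimal sketch of the fused loop, assuming each array is handled by one thread and the mean and standard deviation are accumulated in a single pass. The names are illustrative; the lab code additionally fuses the random-number initialization into the same parallel region.

#include <math.h>
#include <omp.h>

// Fused computation: one parallel region, one pass over each array computing
// both the sum and the sum of squares, from which mean and stddev follow.
void MeanAndStdev(const int nArrays, const int n, const float* data,
                  float* mean, float* stdev) {
#pragma omp parallel for
  for (int a = 0; a < nArrays; a++) {
    const float* x = &data[(long)a * n];
    float s = 0.0f, s2 = 0.0f;
    for (int i = 0; i < n; i++) {        // single fused loop instead of two passes
      s  += x[i];
      s2 += x[i] * x[i];
    }
    mean[a]  = s / n;
    stdev[a] = sqrtf(s2 / n - mean[a] * mean[a]);
  }
}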


2. For this particular problem we do not need to keep all the randomly generated data on the heap; rather, we
can generate it within the parallel OpenMP region for each individual thread on the stack. Compare
your result with B.4.13.5 from the step-02 subfolder.

A.4.14 Offload Traffic Optimization


The next practical exercise corresponds to the material covered in Section 4.6 (pages 221 – 225)

Goal
Offloaded function calls can be optimized through precise control of data movement.

Instructions
1. Using the source code files from labs/4/4.e-offload/step-00: main.cc B.4.14.2,
worker.cc B.4.14.3, and Makefile B.4.14.1, compare the performance of different offload
implementations. The default offload procedure contains the following steps: allocating memory on the coprocessor,
transferring data, performing the offload calculations, and deallocating memory on the coprocessor. For the
offload with memory retention, please write the body of the function so that the memory container
for the data is allocated during the first iteration, but this allocated memory is retained in the subsequent
iterations and deallocated during the last iteration. A sketch of this pattern is shown below.
Compare your results with the B.4.14.4 source code from step-01.
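A minimal sketch of memory retention with the alloc_if/free_if specifiers of #pragma offload. The array name, size, and iteration bookkeeping are illustrative; the data persistence variant of step-02 additionally skips the data transfer itself on intermediate iterations.

// Memory retention across offloads: allocate the coprocessor buffer only on the
// first iteration, keep it alive in between, and free it only on the last iteration.
void OffloadRetentionSketch(float* data, const long n, const int iter, const int nIters) {
  const int isFirst = (iter == 0);
  const int isLast  = (iter == nIters - 1);
#pragma offload target(mic:0) \
        inout(data : length(n) alloc_if(isFirst) free_if(isLast))
  {
    for (long i = 0; i < n; i++)   // offloaded work on the coprocessor
      data[i] += 1.0f;
  }
}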

2. Next, implement the offload function with data persistence. In the body of the function the data is
transferred to the coprocessor during the first iteration, the allocated memory is retained afterwards, and the
data is not transferred in subsequent iterations.
Compare your source code with B.4.14.5 from the step-02 subfolder.



A.4.15 MPI: Load Balancing



The next practical exercise corresponds to the material covered in Section 4.7 (pages 225 – 248)

Goal

Heterogeneous parallel computing requires proper load balancing, which will be demonstrated next.

Instructions
1. Reproduce the code for calculating the number π with a simple Monte Carlo simulation, in which points with
random coordinates, uniformly distributed in the unit square, also cover one quarter of a unit
circle. A detailed description of the problem can be found in the corresponding section of the main text (see
Section 4.7.1).
The total number of iterations, iter = 2^32, should be fixed and distributed between the available MPI processes.
Use the fixed constant blockSize = 2^12 as the number of iterations of the innermost vectorized loop. The quick
random number generator from the Intel MKL library can be used. Since the problem is two-
dimensional, you will need 2*blockSize random numbers. Try different access patterns for choosing the
x and y coordinates. See which one is the most efficient and explain why.
The computations should be evenly distributed between all MPI processes. The final count of points inside
the quarter of the unit circle should be collected from all processes with an MPI_Reduce() function call,
and the final answer printed out by a single MPI process (rank 0). A minimal sketch of this scheme is shown
after the commands below.


Write, compile, and execute your code. To check your implementation you can use B.4.15.1 and
B.4.15.2 from labs/4/4.f-MPI-load-balance/step-00. You can use the following commands
for automatic runs of the MPI jobs:

user@host% make
user@host% make run
user@host% make runhost
user@host% make runmic
user@host% make runboth

These commands will compile the program, copy the corresponding version of the binary executable to the
Intel Xeon Phi coprocessors, and execute the MPI run. If you have a different number of cores on the
host CPU and the coprocessors, modify the Makefile to the correct values.
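A minimal sketch of the even distribution and the MPI_Reduce() collection. The random-number generation is reduced to rand_r() here instead of the Intel MKL generator, and all names are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nRanks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

  const long iter = 1L << 32;             // total number of Monte Carlo trials
  const long myIter = iter / nRanks;      // even split between ranks (remainder ignored for brevity)
  unsigned int seed = 1234 + rank;
  long myHits = 0;
  for (long i = 0; i < myIter; i++) {     // the lab code vectorizes this in blocks of 2^12
    const double x = (double)rand_r(&seed) / RAND_MAX;
    const double y = (double)rand_r(&seed) / RAND_MAX;
    if (x*x + y*y < 1.0) myHits++;        // point falls inside the quarter circle
  }

  long hits = 0;
  MPI_Reduce(&myHits, &hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("pi ~ %f\n", 4.0 * (double)hits / (double)iter);
  MPI_Finalize();
  return 0;
}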

2. Static load balance. For heterogeneous MPI calculations on the host and the Intel Xeon Phi coprocessors, we
need to distribute the workload proportionally to the performance of the nodes to guarantee
load balance and optimal use of the available resources. The ALPHA environment variable, specified by the
user, corresponds to the workload split ratio between the host system and the coprocessors.
Calculate the number of ranks running on the host system and on the Intel Xeon Phi coprocessors with
the __MIC__ macro, and divide the workload accordingly.
Compare your result with Makefile B.4.15.3 and the B.4.15.4 source code from the step-01 subfolder.
Change the environment variable ALPHA and see how it affects the performance. Try to plot this dependence
and calculate the theoretical value of the optimal proportion between the workload on the host and on the
Intel Xeon Phi coprocessors.



3. Boss-worker model, dynamic workload distribution. Implement the Monte Carlo calculation of the number π
using this model. Dedicate a special rank (rank 0) as a boss for assigning the work distribution between
the rest of the ranks, the workers. Each worker should request a new portion of work when it finishes
the previous portion. The boss process should respond with the number of Monte Carlo runs the worker will
execute. The same amount of work should be distributed per request, specified by the environment variable
GRAIN_SIZE. Use MPI_Reduce() to collect the results.
Do not forget to use auto-vectorized loops for better performance.
Compare your results with the source code B.4.15.5 from the step-02 subfolder.

4. Hybrid MPI and OpenMP. Modify the source code from the previous step to use a combination of MPI
and OpenMP, the hybrid model. Each worker MPI process receives a portion of the workload and spreads it
between OpenMP threads. Optimize the code for performance. Try different scheduling mechanisms
for #pragma omp for.
Compare your result with the files B.4.15.6 and B.4.15.7 from the step-03 subfolder.
Using different combinations of MPI processes and OpenMP threads for the host system and the Intel Xeon Phi
coprocessors, find the optimal hybrid mode parameters.
Additional exercise: measure the amount of MPI communication and the load imbalance of your program. How
do different combinations of MPI processes and OpenMP threads affect those characteristics?

5. Guided workload distribution, avoiding MPI communication. Using the code above, modify the
workload distribution algorithm used by the boss process. Instead of assigning the same amount
of computations per request, the boss should spread a portion of the workload between the workers,
calculated dynamically in the chunkSize variable. Thereafter, the workload amount should be
decreased by half, and so on. To avoid massive MPI communication at the end for the small
chunks of workload, use a threshold value for the smallest chunk, defined in the environment variable
GRAIN_SIZE.
Compare your results with the source code B.4.15.8 from the step-04 subfolder.




Appendix B

Source Code for Practical Exercises

B.1 Source Code for Chapter 1:

B.2 Source Code for Chapter 2: Programming Models

B.2.1 Compiling and Running Native Applications on Intel Xeon Phi Coprocessors
Back to Lab A.2.1.

B.2.1.1 labs/2/2.1-native/hello.c

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 hello.c, located at 2/2.1-native
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <unistd.h>
12 int main(){
13   printf("Hello world! I have %ld logical cores.\n",
14     sysconf(_SC_NPROCESSORS_ONLN ));
15 }

Back to Lab A.2.1.

B.2.1.2 labs/2/2.1-native/donothinger.c

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file donothinger.c, located at 2/2.1-native
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */


9
10 #include <stdio.h>
11 #include <unistd.h>
12 #include <pthread.h>
13
14 void *Spin(void *threadid){
15 long tid;
16 tid = (long)threadid;
17 printf("Hello World from thread #%ld!\n", tid);
18 fflush(0);
19 while(1);
20 pthread_exit(NULL);
21 }
22
23 int main (int argc, char *argv[]){
24 int numThreads=sysconf(_SC_NPROCESSORS_ONLN);
25 pthread_t threads[numThreads];
26 printf("Spawning %d threads that do nothing, press ^C to terminate.\n", numThreads);
27 if (numThreads > 0){
28 for (int i = 1; i < numThreads; i++){
29       int rc = pthread_create(&threads[i], NULL, Spin, (void *)i);
30       if (rc){
31         printf("ERROR; return code from pthread_create() is %d\n", rc);
32         return -1;
33       }
34     }
35   }
36   Spin(NULL);
37   pthread_exit(NULL);
38 }

Back to Lab A.2.1.


re
yP

B.2.1.3 labs/2/2.1-native/donothinger-offload.c
el
siv

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 donothinger-offload.c, located at 2/2.1-native
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <unistd.h>
12 #include <pthread.h>
13
14 __attribute__((target(mic))) void *Spin(void *threadid){
15 long tid;
16 tid = (long)threadid;
17 printf("Hello World from thread #%ld!\n", tid);
18 fflush(0);
19 while(1);
20 pthread_exit(NULL);
21 }
22
23 int main (int argc, char *argv[]){
24 #pragma offload target(mic)
25 {


26 int numThreads=sysconf(_SC_NPROCESSORS_ONLN);
27 pthread_t threads[numThreads];
28 printf("Spawning %d threads that do nothing, press ^C to terminate.\n", numThreads);
29 if (numThreads > 0){
30 for (int i = 1; i < numThreads; i++){
31 int rc = pthread_create(&threads[i], NULL, Spin, (void *)i);
32 if (rc){
33 printf("ERROR; return code from pthread_create() is %d\n", rc);
34 //return -1;
35 }
36 }
37 }
38 Spin(NULL);
39 pthread_exit(NULL);
40 }
41 }

B.2.2 Explicit Offload: Sharing Arrays and Structures


Back to Lab A.2.2.

B.2.2.1 labs/2/2.2-explicit-offload/step-00/Makefile

g
CXX = icpc
CXXFLAGS =
OBJECTS = offload.o

.SUFFIXES: .o .cpp

.cpp.o:
	$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

all: runme

runme: $(OBJECTS)
	$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

clean:
	rm -f *.o runme

Back to Lab A.2.2.

B.2.2.2 labs/2/2.2-explicit-offload/step-00/offload.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file offload.cpp, located at 2/2.2-explicit-offload/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12


13 const int size = 1000;


14 int data[size];
15
16 int CountNonZero(const int N, const int* arr){
17 int nz=0;
18 for ( int i = 0 ; i < N ; i++ ){
19 if ( arr[i] != 0 ) nz++;
20 }
21 return nz;
22 }
23
24 int main( int argc, const char* argv[] ){
25
26 // initialize array of integers
27 for ( int i = 0; i < size ; i++) {
28 data[i] = rand() % 10;
29 }
30
31 int numberOfNonZeroElements = CountNonZero(size, data);
32 printf("There are %d non-zero elements in the array.\n", numberOfNonZeroElements);
33
34   exit(0);
35 }

W
ng
Back to Lab A.2.2.
he
un
rY

B.2.2.3 labs/2/2.2-explicit-offload/step-01/offload.cpp
fo
d

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


re

2 file offload.cpp, located at 2/2.2-explicit-offload/step-01


pa

3 is a part of the practical supplement to the handbook


re

4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"


yP

5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8


Redistribution or commercial usage without a written permission
el

6
siv

7 from Colfax International is prohibited.


8 Contact information can be found at http://colfax-intl.com/ */
u
cl

9
Ex

10 #include <stdio.h>
11 #include <stdlib.h>
12
13 const int size = 1000;
14 int data[size];
15
16 int CountNonZero(const int N, const int* arr){
17 int nz=0;
18 for ( int i = 0 ; i < N ; i++ ){
19 if ( arr[i] != 0 ) nz++;
20 }
21 return nz;
22 }
23
24 int main( int argc, const char* argv[] ){
25
26 int numberOfNonZeroElements;
27
28 // initialize array of integers
29 for ( int i = 0; i < size ; i++) {
30 data[i] = rand() % 10;
31 }
32

Parallel Programming and Optimization with Intel Xeon Phi Coprocessors


B.2. SOURCE CODE FOR CHAPTER 2: PROGRAMMING MODELS 333

33 #pragma offload target(mic)


34 {
35 printf("Hello from MIC!\n");
36 fflush(0);
37 }
38
39 numberOfNonZeroElements = CountNonZero(size, data);
40 printf("There are %d non-zero elements in the array.\n", numberOfNonZeroElements);
41 }

Back to Lab A.2.2.

B.2.2.4 labs/2/2.2-explicit-offload/step-02/offload.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file offload.cpp, located at 2/2.2-explicit-offload/step-02

g
3 is a part of the practical supplement to the handbook

an
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
(c) Colfax International, 2013, ISBN: 978-0-9885234-1-8

W
5
6 Redistribution or commercial usage without a written permission

ng
7 from Colfax International is prohibited.
Contact information can be found at http://colfax-intl.com/ */

e
8
9
10 #include <stdio.h> nh
Yu
11 #include <stdlib.h>
r

12
fo

13 const int size = 1000;


14 int data[size];
d
re

15
int CountNonZero(const int N, const int* arr){
pa

16
17 int nz=0;
re

18 for ( int i = 0 ; i < N ; i++ ){


yP

19 if ( arr[i] != 0 ) nz++;
20 }
el

21 return nz;
iv

22 }
us

23
24 int main( int argc, const char* argv[] ){
cl

25
Ex

26 int numberOfNonZeroElements;
27
28 // initialize array of integers
29 for ( int i = 0; i < size ; i++) {
30 data[i] = rand() % 10;
31 }
32
33 #pragma offload target(mic)
34 {
35 #ifdef __MIC__
36 printf("Offload is successful!\n");
37 fflush(0);
38 #else
39 printf("Offload has failed miserably!\n");
40 #endif
41 }
42
43 numberOfNonZeroElements = CountNonZero(size, data);
44 printf("There are %d non-zero elements in the array.\n", numberOfNonZeroElements);
45 }

Prepared for Yunheng Wang c Colfax International, 2013


334 APPENDIX B. SOURCE CODE FOR PRACTICAL EXERCISES

Back to Lab A.2.2.

B.2.2.5 labs/2/2.2-explicit-offload/step-03/offload.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file offload.cpp, located at 2/2.2-explicit-offload/step-03
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 __attribute__((target(mic))) int CountNonZero(const int N, const int* arr){
14 int nz=0;
15 for ( int i = 0 ; i < N ; i++ ){
16 if ( arr[i] != 0 ) nz++;

17 }
18 return nz;
19 }
20
21 int main( int argc, const char* argv[] ){
22
23 int numberOfNonZeroElements;
24
25 #pragma offload target(mic)
26 {
27 const int size = 1000;
28 int data[size];
29
30 // initialize array of integers
31 for ( int i = 0; i < size ; i++) {
32 data[i] = rand() % 10;
33 }
34
35 numberOfNonZeroElements = CountNonZero(size, data);
36 }
37
38 printf("There are %d non-zero elements in the array.\n", numberOfNonZeroElements);
39 }

Back to Lab A.2.2.

B.2.2.6 labs/2/2.2-explicit-offload/step-04/offload.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file offload.cpp, located at 2/2.2-explicit-offload/step-04
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>


12
13 #pragma offload_attribute(push, target(mic))
14 const int size = 1000;
15 int data[size];
16
17 int CountNonZero(const int N, const int* arr){
18 int nz=0;
19 for ( int i = 0 ; i < N ; i++ ){
20 if ( arr[i] != 0 ) nz++;
21 }
22 return nz;
23 }
24 #pragma offload_attribute(pop)
25
26 int main( int argc, const char* argv[] ){
27
28 // initialize array of integers
29 for ( int i = 0; i < size ; i++) {
30 data[i] = rand() % 10;
31 }
32
33 int numberOfNonZeroElements;
34 #pragma offload target(mic)
35 numberOfNonZeroElements = CountNonZero(size, data);
36
37 printf("There are %d non-zero elements in the array.\n", numberOfNonZeroElements);
38 }

B.2.3 Explicit Offload: Data Traffic and Asynchronous Offload

Back to Lab A.2.3.

B.2.3.1 labs/2/2.3-explicit-offload-persistence/step-00/Makefile

CXX = icpc
CXXFLAGS =

OBJECTS = offload.o

.SUFFIXES: .o .cpp

.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

all: runme

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

clean:
rm -f *.o runme

Back to Lab A.2.3.

B.2.3.2 labs/2/2.3-explicit-offload-persistence/step-00/offload.cpp


1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file offload.cpp, located at 2/2.3-explicit-offload-persistence/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 double sum = 0;
14
15 int main(){
16
17 const long N=10000;
18 double *p = (double*) malloc(N*sizeof(double));
19 p[0:N] = 1.0; // Cilk Plus array notation

20
21 for ( long i = 0 ; i < N ; i++ ) {
22 sum += p[i];
23 }
24
25 printf("\nsum = %f\n", sum);
26 }

Back to Lab A.2.3.



B.2.3.3 labs/2/2.3-explicit-offload-persistence/step-01/offload.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file offload.cpp, located at 2/2.3-explicit-offload-persistence/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 __attribute__((target(mic))) double sum = 0;
14
15 int main(){
16
17 const long N=10000;
18 double *p = (double*) malloc(N*sizeof(double));
19 p[0:N] = 1.0; // Cilk Plus array notation
20
21 #pragma offload target (mic) in(p : length(N)) inout(sum)
22 {
23 for ( long i = 0 ; i < N ; i++ ) {
24 sum += p[i];
25 }
26 }
27 printf("\nsum = %f\n", sum);
28 }


Back to Lab A.2.3.

B.2.3.4 labs/2/2.3-explicit-offload-persistence/step-02/offload.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file offload.cpp, located at 2/2.3-explicit-offload-persistence/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 __attribute__((target(mic))) double sum = 0;
14
15 int main(){
16

17 double sumHost = 0;
18 const long N=10000;
19 double *p = (double*) malloc(N*sizeof(double));
20 p[0:N] = 1.0; // Cilk Plus array notation
21
22 #pragma offload target (mic:0) in(p : length(N)) inout(sum:free_if(0))
23 {
24 for ( long i = 0 ; i < N ; i++ ) {
25 sum += p[i];
26 }
27 }
28
29 printf("After the offload: sum = %f \n", sum);
30 sum += 1.0;
31 printf("Data change on the host: sum = %f \n", sum);
32
33 #pragma offload_transfer target (mic:0) out(sum:alloc_if(0) free_if(1))
34
35 printf("Copy data back from coprocessor: sum = %f \n", sum);
36 }
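The listing above relies on the data persistence clauses of the offload pragmas. As a supplementary illustration (not part of the lab code), the general persistence idiom can be sketched for a hypothetical host buffer buf of length n:

// Hypothetical sketch of the persistence idiom; buf and n are illustrative names.
#pragma offload_transfer target(mic:0) in(buf : length(n) alloc_if(1) free_if(0)) // allocate on the coprocessor and keep
#pragma offload target(mic:0) nocopy(buf : length(n) alloc_if(0) free_if(0))      // reuse the retained buffer
{
  // ... work on buf on the coprocessor ...
}
#pragma offload_transfer target(mic:0) out(buf : length(n) alloc_if(0) free_if(1)) // copy back and free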

Back to Lab A.2.3.

B.2.3.5 labs/2/2.3-explicit-offload-persistence/step-03/offload.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file offload.cpp, located at 2/2.3-explicit-offload-persistence/step-03
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 __attribute__((target(mic))) double sum = 0;
14


15 int main(){
16
17 double sumHost = 0;
18 const long N=10000;
19 double *p = (double*) malloc(N*sizeof(double));
20 p[0:N] = 1.0; // Cilk Plus array notation
21
22 #pragma offload target (mic:0) in(p : length(N)) signal(p)
23 {
24 for ( long i = 0 ; i < N ; i++ ) {
25 sum += p[i];
26 }
27 }
28
29 printf("After the offload: sum = %f \n", sum);
30 sum += 1.0;
31 printf("Data change on the host: sum = %f \n", sum);
32
33 #pragma offload_transfer target (mic:0) out(sum) wait(p)
34
35 printf("Copy data back from coprocessor: sum = %f \n", sum);
36 }

B.2.4 Explicit Offload: Putting it All Together

Back to Lab A.2.4.

B.2.4.1 labs/2/2.4-explicit-offload-matrix/step-00/Makefile

CXX = icpc
CXXFLAGS = -vec-report -g

OBJECTS = matrix.o

.SUFFIXES: .o .cpp

.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

all: runme

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

clean:
rm -f *.o runme

Back to Lab A.2.4.

B.2.4.2 labs/2/2.4-explicit-offload-matrix/step-00/matrix.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file matrix.cpp, located at 2/2.4-explicit-offload-matrix/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission


7 from Colfax International is prohibited.


8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 // use `ulimit -s unlimited` to increase the stack size for the process
14 // Otherwise, code will be stopped with the "segmentation fault" error
15
16 int main(){
17
18 const int m=10, n=100000;
19 double A[n*m], b[n], c[m];
20
21 // Cilk Plus array notation
22 A[:]=1.0/(double)n;
23 b[:]=1.0;
24 c[:]=0;
25
26 for ( int i = 0 ; i < m ; i++)
27 for ( int j = 0 ; j < n ; j++)
28 c[i] += A[i*n+j] * b[j];
29
30 for ( int i = 0 ; i < m ; i++)
31 printf("%f\t", c[i]);
32 printf("\n");
33 }

Back to Lab A.2.4.



B.2.4.3 labs/2/2.4-explicit-offload-matrix/step-01/matrix.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file matrix.cpp, located at 2/2.4-explicit-offload-matrix/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 // use `ulimit -s unlimited` to increase the stack size for the process
14 // Otherwise, code will be stopped with the "segmentation fault" error
15
16 int main(){
17
18 const int m=10, n=1000;
19 double A[n*m], b[n], c[m], cc[m];
20
21 // Cilk Plus array notation
22 A[:]=1.0/(double)n;
23 b[:]=1.0;
24 c[:]=0;
25 cc[:]=0;
26
27 #pragma offload target(mic)
28 for ( int i = 0 ; i < m ; i++)


29 for ( int j = 0 ; j < n ; j++)


30 c[i] += A[i*n+j] * b[j];
31
32 double norm = 0.0;
33 for ( int i = 0 ; i < m ; i++)
34 norm += (c[i] - 1.0)*(c[i] - 1.0);
35 if (norm > 1e-10)
36 printf("Norm is equal to %f\n", norm);
37 else
38 printf("Yep, we're good!\n");
39 }

Back to Lab A.2.4.

B.2.4.4 labs/2/2.4-explicit-offload-matrix/step-02/matrix.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file matrix.cpp, located at 2/2.4-explicit-offload-matrix/step-02

3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 // use `ulimit -s unlimited` to increase the stack size for the process
14 // Otherwise, code will be stopped with the "segmentation fault" error
15
16 int main(){
17
18 const int m=1000, n=100000;
19 double b[n], c[m];
20 double * A = (double*) malloc(sizeof(double)*n*m);
21
22 // Cilk Plus array notation
23 A[0:n*m]=1.0/(double)n;
24 b[:]=1.0;
25 c[:]=0;
26
27 #pragma offload target(mic) in (A:length(n*m))
28 for ( int i = 0 ; i < m ; i++)
29 for ( int j = 0 ; j < n ; j++)
30 c[i] += A[i*n+j] * b[j];
31
32 double norm = 0.0;
33 for ( int i = 0 ; i < m ; i++)
34 norm += (c[i] - 1.0)*(c[i] - 1.0);
35 if (norm > 1e-10)
36 printf("Norm is equal to %f\n", norm);
37 else
38 printf("Yep, we're good!\n");
39 }

Back to Lab A.2.4.


B.2.4.5 labs/2/2.4-explicit-offload-matrix/step-03/matrix.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file matrix.cpp, located at 2/2.4-explicit-offload-matrix/step-03
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 // use `ulimit -s unlimited` to increase the stack size for the process
14 // Otherwise, code will be stopped with the "segmentation fault" error
15
16 int main(){
17
18 const int m=1000, n=100000;
19 const int maxIter = 5;

20 double b[n], c[m];
21 double * A = (double*) malloc(sizeof(double)*n*m);
22
23 // Cilk Plus array notation
24 A[0:n*m]=1.0/(double)n;
25 b[:]=1.0;
26 c[:]=0;
27
28 #pragma offload_transfer target(mic:0) in (A:length(n*m) free_if(0))
29
30 for ( int iter = 0; iter < maxIter ; iter++) {
31
32 printf("Iteration %d of %d...\n", iter+1, maxIter);
33
34 b[:] = (double) iter;
35
36 #pragma offload target(mic:0) nocopy (A:length(n*m) free_if(iter==maxIter-1))
37 {

38 for ( int i = 0 ; i < m ; i++)


39 for ( int j = 0 ; j < n ; j++)
40 c[i] += A[i*n+j] * b[j];
41 }
42
43 }
44
45 }

Back to Lab A.2.4.

B.2.4.6 labs/2/2.4-explicit-offload-matrix/step-04/matrix.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file matrix.cpp, located at 2/2.4-explicit-offload-matrix/step-04
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.


8 Contact information can be found at http://colfax-intl.com/ */


9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 // use `ulimit -s unlimited` to increase the stack size for the process
14 // Otherwise, code will be stopped with the "segmentation fault" error
15
16 int main(){
17
18 const int m=1000, n=100000;
19 const int maxIter = 5;
20 double b[n], c[m];
21 double * A = (double*) malloc(sizeof(double)*n*m);
22
23 // Cilk Plus array notation
24 A[0:n*m]=1.0/(double)n;
25 b[:]=1.0;
26 c[:]=0;
27
28 for ( int iter = 0; iter < maxIter ; iter++) {

29
30 printf("Iteration %d of %d...\n", iter+1, maxIter);
31
32 b[:] = (double) iter;
33 int size = iter == 0 ? n*m : 1;
34 #pragma offload target(mic:0) in (A:length(size) free_if(iter==maxIter-1))
35 for ( int i = 0 ; i < m ; i++)
36 for ( int j = 0 ; j < n ; j++)
37 c[i] += A[i*n+j] * b[j];
38 }
39
40 }

Back to Lab A.2.4.



B.2.4.7 labs/2/2.4-explicit-offload-matrix/step-05/matrix.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file matrix.cpp, located at 2/2.4-explicit-offload-matrix/step-05
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 // use `ulimit -s unlimited` to increase the stack size for the process
14 // Otherwise, code will be stopped with the "segmentation fault" error
15
16 int main(){
17
18 const int iterMax = 3;
19 const int m=1000, n=100000;
20 double b[n], c_target[m], c_host[m];
21 double * A = (double*) malloc(sizeof(double)*n*m);
22


23 // Cilk Plus array notation


24 A[0:n*m]=1.0/(double)n;
25
26 #pragma offload_transfer target(mic:1) in (A:length(n*m) free_if(0))
27
28 for ( int iter = 0; iter < iterMax ; iter++) {
29 b[:] = (double) iter;
30 c_target[:]=0; // results calculated on the coprocessor
31 c_host[:]=0; // results calculated on the host
32
33 #pragma offload target(mic:1) nocopy(A : free_if(iter==iterMax-1)) \
34 signal(A)
35 {
36 // running the calculation on the coprocessor asynchronously
37 for ( int i = 0 ; i < m ; i++)
38 for ( int j = 0 ; j < n ; j++)
39 c_target [i] += A[i*n+j] * b[j];
40 }
41
42 // the following code is running on the host asynchronously
43 for ( int i = 0 ; i < m ; i++)
44 for ( int j = 0 ; j < n ; j++)
45 c_host[i] += A[i*n+j] * b[j];
46
47 // sync before proceeding
48 #pragma offload_transfer target (mic:1) wait(A)
49
50 double norm = 0.0;
51 for ( int i = 0 ; i < m ; i++)
52 norm += (c_target[i] - c_host[i])*(c_target[i] - c_host[i]);
53 printf("Difference is %e: ", norm);
54 if (norm > 1e-10)
55 printf("ERROR!\n");
56 else
57 printf("Yep, we're good!\n");
58 }
59 }
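The signal/wait pair used above is the core of asynchronous offload. A minimal supplementary sketch of the same idiom (not part of the lab code), assuming a hypothetical pointer x to n elements already available on the host, might look like:

// Hypothetical sketch: signal() makes the offload non-blocking; the host continues immediately.
#pragma offload target(mic:1) in(x : length(n)) signal(x)
{
  // ... coprocessor work on x ...
}
// ... independent host work overlaps with the offload here ...
#pragma offload_wait target(mic:1) wait(x)   // block until the offload tagged with x completes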

B.2.5 Virtual-Shared Memory Model: Sharing Complex Objects


Back to Lab A.2.5.

B.2.5.1 labs/2/2.5-sharing-complex-objects/step-00/Makefile

CXX = icpc
CXXFLAGS =

OBJECTS = cilk-shared-offload.o

.SUFFIXES: .o .cpp

.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

all: runme

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)


clean:
rm -f *.o runme

Back to Lab A.2.5.

B.2.5.2 labs/2/2.5-sharing-complex-objects/step-00/cilk-shared-offload.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file cilk-shared-offload.cpp, located at 2/2.5-sharing-complex-objects/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #define N 1000
12 int ar1[N];
13 int ar2[N];
14 int res[N];
15
16 void initialize() {
17 for (int i = 0; i < N; i++) {
18 ar1[i] = i;
19 ar2[i] = 1;
20 }
21 }
22
23 // This function should be offloaded to the coprocessor
24 void add() {
25 for (int i = 0; i < N; i++)
26 res[i] = ar1[i] + ar2[i];
27 //printf("Offload to coprocessor failed!\n");
28 }
29

30 void verify() {
31 bool errors = false;
32 for (int i = 0; i < N; i++)
33 errors |= (res[i] != (ar1[i] + ar2[i]));
34 printf("%s\n", (errors ? "ERROR" : "CORRECT"));
35 }
36
37 int main(int argc, char *argv[]) {
38 initialize();
39 add(); // Make function call on coprocessor:
40 // ar1, ar2 should be copied in, res copied out
41 verify();
42 }

Back to Lab A.2.5.

B.2.5.3 labs/2/2.5-sharing-complex-objects/step-01/cilk-shared-offload.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file cilk-shared-offload.cpp, located at 2/2.5-sharing-complex-objects/step-01


3 is a part of the practical supplement to the handbook


4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #define N 1000
12 _Cilk_shared int ar1[N];
13 _Cilk_shared int ar2[N];
14 _Cilk_shared int res[N];
15
16 void initialize() {
17 for (int i = 0; i < N; i++) {
18 ar1[i] = i;
19 ar2[i] = 1;
20 }
21 }
22
23 _Cilk_shared void add() {
24 #ifdef __MIC__
25 for (int i = 0; i < N; i++)
26 res[i] = ar1[i] + ar2[i];
27 #else
28 printf("Offload to coprocessor failed!\n");
29 #endif
30 }
31
32 void verify() {
33 bool errors = false;
34 for (int i = 0; i < N; i++)
35 errors |= (res[i] != (ar1[i] + ar2[i]));
36 printf("%s\n", (errors ? "ERROR" : "CORRECT"));
37 }
38
39 int main(int argc, char *argv[]) {
40 initialize();
41 _Cilk_offload add(); // Function call on coprocessor:
42 // ar1, ar2 are copied in, res copied out
43 verify();
44 }
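For reference, the essential virtual-shared memory pattern used in this listing can be reduced to a few lines; the following is a supplementary sketch with illustrative names, not part of the lab code:

// Minimal sketch: data and code marked _Cilk_shared are visible to both host and coprocessor;
// _Cilk_offload executes the call on the coprocessor (or on the host if no device is available).
_Cilk_shared int counter;
_Cilk_shared void Increment() { counter++; }

int main() {
  _Cilk_offload Increment();
}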

Back to Lab A.2.5.

B.2.5.4 labs/2/2.5-sharing-complex-objects/step-02/dynamic-alloc.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file dynamic-alloc.cpp, located at 2/2.5-sharing-complex-objects/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 #define N 10000


14 int* data; // Shared pointer to shared data


15 int sum;
16
17 // use Cilk Plus offloading method
18 void ComputeSum() {
19
20 printf("Address of data[0] on coprocessor: %p\n", &data[0]);
21 sum = 0;
22 // following for loop can be parallelized
23 for (int i = 0; i < N; ++i)
24 sum += data[i];
25
26 //printf("Offload to coprocessor failed!\n");
27
28 }
29
30 int main() {
31 data = (int*)malloc(N*sizeof(float));
32 for (int i = 0; i < N; i++)
33 data[i] = i%2;
34 printf("Address of data[0] on host: %p\n", &data[0]);
35 ComputeSum(); // offload the function call to the coprocessor
36 printf("%s\n", (sum==N/2 ? "CORRECT" : "ERROR"));
37 free(data);
38 }

Back to Lab A.2.5.



B.2.5.5 labs/2/2.5-sharing-complex-objects/step-03/dynamic-alloc.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file dynamic-alloc.cpp, located at 2/2.5-sharing-complex-objects/step-03
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #define N 10000
12 int* _Cilk_shared data; // Shared pointer to shared data
13 int _Cilk_shared sum;
14
15 _Cilk_shared void ComputeSum() {
16 #ifdef __MIC__
17 printf("Address of data[0] on coprocessor: %p\n", &data[0]);
18 sum = 0;
19 #pragma omp parallel for reduction(+: sum)
20 for (int i = 0; i < N; ++i)
21 sum += data[i];
22 #else
23 printf("Offload to coprocessor failed!\n");
24 #endif
25 }
26
27 int main() {
28 data = (_Cilk_shared int*)_Offload_shared_malloc(N*sizeof(float));
29 for (int i = 0; i < N; i++)
30 data[i] = i%2;


31 printf("Address of data[0] on host: %p\n", &data[0]);


32 _Cilk_offload ComputeSum();
33 printf("%s\n", (sum==N/2 ? "CORRECT" : "ERROR"));
34 _Offload_shared_free(data);
35 }

Back to Lab A.2.5.

B.2.5.6 labs/2/2.5-sharing-complex-objects/step-04/structures.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file structures.cpp, located at 2/2.5-sharing-complex-objects/step-04
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9

10 #include <stdio.h>
11 #include <string.h>
12
13 // share the structure between the host and the coprocessor
14 typedef struct {
15 int i;
16 char c[10];
17 } person;
18
19 // offload the function
20 void SetPerson( person & p, const char* name, const int i) {
21
22 p.i = i;
23 strcpy(p.c, name);
24 printf("On coprocessor: %d %s\n", p.i, p.c);
25
26 //printf("Offload to coprocessor failed.\n");
27
28 }
29
30 person someone;
31 char who[10];
32
33 int main(){
34 strcpy(who, "John");
35 SetPerson(someone, who, 1);
36 printf("On host: %d %s\n", someone.i, someone.c);
37 }

Back to Lab A.2.5.

B.2.5.7 labs/2/2.5-sharing-complex-objects/step-05/structures.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file structures.cpp, located at 2/2.5-sharing-complex-objects/step-05
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission


7 from Colfax International is prohibited.


8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <string.h>
12
13 typedef struct {
14 int i;
15 char c[10];
16 } person;
17
18 _Cilk_shared void SetPerson(_Cilk_shared person & p,
19 _Cilk_shared const char* name, const int i) {
20 #ifdef __MIC__
21 p.i = i;
22 strcpy(p.c, name);
23 printf("On coprocessor: %d %s\n", p.i, p.c);
24 #else
25 printf("Offload to coprocessor failed.\n");
26 #endif
27 }

28
29 person _Cilk_shared someone;
30 char _Cilk_shared who[10];
31
32 int main(){
33 strcpy(who, "John");
34 _Cilk_offload SetPerson(someone, who, 1);
35 printf("On host: %d %s\n", someone.i, someone.c);
36 }

Back to Lab A.2.5.



B.2.5.8 labs/2/2.5-sharing-complex-objects/step-06/classes.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file classes.cpp, located at 2/2.5-sharing-complex-objects/step-06
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <string.h>
12
13 // make the following class shared
14 class Person {
15 public:
16 int i;
17 char c[10];
18
19 Person() {
20 i=0; c[0]='\0';
21 }
22
23 void Set(const char* name, const int i0) {
24
25 i = i0;


26 strcpy(c, name);
27 printf("On coprocessor: %d %s\n", i, c);
28
29 //printf("Offload to coprocessor failed.\n");
30
31 }
32 };
33
34 Person someone;
35 char who[10];
36
37 int main(){
38 strcpy(who, "Mary");
39 someone.Set(who, 2); // make offload function call
40 printf("On host: %d %s\n", someone.i, someone.c);
41 }

Back to Lab A.2.5.

B.2.5.9 labs/2/2.5-sharing-complex-objects/step-07/classes.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file classes.cpp, located at 2/2.5-sharing-complex-objects/step-07
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <string.h>
12
13 class _Cilk_shared Person {
14 public:
15 int i;
16 char c[10];

17
18 Person() {
19 i=0; c[0]='\0';
20 }
21
22 void Set(_Cilk_shared const char* name, const int i0) {
23 #ifdef __MIC__
24 i = i0;
25 strcpy(c, name);
26 printf("On coprocessor: %d %s\n", i, c);
27 #else
28 printf("Offload to coprocessor failed.\n");
29 #endif
30 }
31 };
32
33 Person _Cilk_shared someone;
34 char _Cilk_shared who[10];
35
36 int main(){
37 strcpy(who, "Mary");
38 _Cilk_offload someone.Set(who, 2);
39 printf("On host: %d %s\n", someone.i, someone.c);


40 }

Back to Lab A.2.5.

B.2.5.10 labs/2/2.5-sharing-complex-objects/step-08/new.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file new.cpp, located at 2/2.5-sharing-complex-objects/step-08
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12 // for virtual-shared "new" operator we should use:
13 //#include <new>

14
15 class MyClass {
16 int i;
17
18 public:
19 MyClass(){ i = 0; };
20
21 void set(const int l) { i = l; }
22
23 void print(){
24 #ifdef __MIC__
25 printf("On coprocessor: ");
26 #else
27 printf("On host: ");
28 #endif
29 printf("%d\n", i);
30 }
31 };

32
33 MyClass* sharedData;
34
35 int main()
36 {
37 const int size = sizeof(MyClass);
38 // allocate the memory and pass the pointer to it to the new operator
39 MyClass* address = (MyClass*) malloc(size);
40 sharedData=new MyClass;
41
42 sharedData->set(1000); // Shared data initialized on host
43 //sharedData->print(); // Shared data used on coprocessor
44 sharedData->print(); // Shared data used on host
45 }

Back to Lab A.2.5.

B.2.5.11 labs/2/2.5-sharing-complex-objects/step-09/new.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file new.cpp, located at 2/2.5-sharing-complex-objects/step-09


3 is a part of the practical supplement to the handbook


4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12 #include <new>
13
14 class _Cilk_shared MyClass {
15 int i;
16
17 public:
18 MyClass(){ i = 0; };
19
20 void set(const int l) { i = l; }
21
22 void print(){
23 #ifdef __MIC__
printf("On coprocessor: ");

g
24

an
25 #else

W
26 printf("On host: ");
#endif

g
27

en
28 printf("%d\n", i); h
29 }
un
30 };
rY

31
32 MyClass* _Cilk_shared sharedData;
33
34 int main()
35 {
36 const int size = sizeof(MyClass);
37 _Cilk_shared MyClass* address = (_Cilk_shared MyClass*) _Offload_shared_malloc(size);
38 sharedData=new( address ) MyClass;
39
40 sharedData->set(1000); // Shared data initialized on host
41 _Cilk_offload sharedData->print(); // Shared data used on coprocessor
42 sharedData->print(); // Shared data used on host
43 }
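The listing allocates the shared object with _Offload_shared_malloc and constructs it with placement new. A corresponding teardown (not shown in the lab code) would call the destructor explicitly and then release the shared buffer; a hypothetical cleanup sketch:

// Hypothetical cleanup for the placement-new pattern above.
sharedData->~MyClass();          // placement new requires an explicit destructor call
_Offload_shared_free(address);   // release the virtual-shared allocation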

B.2.6 Using Multiple Coprocessors


Back to Lab A.2.6.

B.2.6.1 labs/2/2.6-multiple-coprocessors/step-03/Makefile

CXX = icpc
CXXFLAGS = -openmp

OBJECTS = multiple.o

.SUFFIXES: .o .cpp

.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

all: runme


runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

clean:
rm -f *.o runme

Back to Lab A.2.6.

B.2.6.2 labs/2/2.6-multiple-coprocessors/step-00/multiple.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file multiple.cpp, located at 2/2.6-multiple-coprocessors/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */

9
10 #include <stdio.h>
11
12 // Write offloaded function, which will print out the device number, using
13 // _Offload_get_device_number() function call.
14
15 int _Cilk_shared numDevices;
16
17 int main(int argc, char *argv[]) {
18 numDevices = _Offload_number_of_devices();
19 printf("Number of Target devices installed: %d\n\n" ,numDevices);
20 }

Back to Lab A.2.6.



B.2.6.3 labs/2/2.6-multiple-coprocessors/step-01/multiple.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file multiple.cpp, located at 2/2.6-multiple-coprocessors/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11
12 void _Cilk_shared whatIsMyNumber(int numDevices){
13 int currentDevice = _Offload_get_device_number();
14 printf("Hello from %d coprocessor out of %d.\n", currentDevice, numDevices);
15 fflush(0);
16 }
17
18 int _Cilk_shared numDevices;
19
20 int main(int argc, char *argv[]) {
21 numDevices = _Offload_number_of_devices();
22 printf("Number of Target devices installed: %d\n\n" ,numDevices);


23
24 for(int i=0; i<numDevices; i++){
25 _Cilk_offload_to(i) whatIsMyNumber(numDevices);
26 }
27 }

Back to Lab A.2.6.

B.2.6.4 labs/2/2.6-multiple-coprocessors/step-02/multiple.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file multiple.cpp, located at 2/2.6-multiple-coprocessors/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9

10 #include <stdlib.h>
11 #include <stdio.h>
12
13 int* response;
14 int _Cilk_shared n_d;
15
16 int main(){
17 n_d = _Offload_number_of_devices();
18 if (n_d < 1) {
19 printf("No devices available!");
20 return 2;
21 }
22
23 response = (int*) malloc(n_d*sizeof(int));
24 response[0:n_d] = 0;
25
26 for (int i = 0; i < n_d; i++) {
27 // Use pragma to specify the targets and data manipulation clauses
28 {
29 #ifdef __MIC__
30 response[i] = 1;
31 #endif
32 }
33 }
34
35 for (int i = 0; i < n_d; i++)
36 if (response[i] == 1) {
37 printf("OK: device %d responded\n", i);
38 } else {
39 printf("Error: device %d did not respond\n", i);
40 }
41 }

Back to Lab A.2.6.

B.2.6.5 labs/2/2.6-multiple-coprocessors/step-03/multiple.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file multiple.cpp, located at 2/2.6-multiple-coprocessors/step-03


3 is a part of the practical supplement to the handbook


4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdlib.h>
11 #include <stdio.h>
12
13 __attribute__((target(mic))) int* response;
14
15
16 int main(){
17 int n_d = _Offload_number_of_devices();
18 if (n_d < 1) {
19 printf("No devices available!");
20 return 2;
21 }
22
23 response = (int*) malloc(n_d*sizeof(int));
24 response[0:n_d] = 0;
25
26 for (int i = 0; i < n_d; i++) {
27 #pragma offload target(mic:i) inout(response[i:1])
28 {
29 #ifdef __MIC__
30 response[i] = 1;
31 #else
32 response[i] = 0;
33 #endif
34 }
35 }
36
37 for (int i = 0; i < n_d; i++)
38 if (response[i] == 1) {
39 printf("OK: device %d responded\n", i);
40 } else {
41 printf("Error: device %d did not respond\n", i);
42 }
43 }

B.2.7 Asynchronous Execution on One and Multiple Coprocessors


Back to Lab A.2.7.

B.2.7.1 labs/2/2.7-asynchronous-offload/step-03/Makefile

CXX = icpc
CXXFLAGS = -openmp

OBJECTS = async.o

.SUFFIXES: .o .cpp

.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

all: runme


runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

clean:
rm -f *.o runme

Back to Lab A.2.7.

B.2.7.2 labs/2/2.7-asynchronous-offload/step-00/async.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file async.cpp, located at 2/2.7-asynchronous-offload/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */

9
10 #include <stdio.h>
11
12 void _Cilk_shared whatIsMyNumber(int numDevices){
13 int currentDevice = _Offload_get_device_number();
14 printf("Hello from %d coprocessor out of %d.\n", currentDevice, numDevices);
15 fflush(0);
16 }
17
18 int _Cilk_shared numDevices;
19
20 int main(int argc, char *argv[]) {
21 numDevices = _Offload_number_of_devices();
22 printf("Number of Target devices installed: %d\n\n" ,numDevices);
23
24 for(int i=0; i<numDevices; i++){
25 _Cilk_offload_to(i) whatIsMyNumber(numDevices);
26 }
27 }

Back to Lab A.2.7.

B.2.7.3 labs/2/2.7-asynchronous-offload/step-01/async.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file async.cpp, located at 2/2.7-asynchronous-offload/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11
12 void _Cilk_shared whatIsMyNumber(int numDevices){
13 int currentDevice = _Offload_get_device_number();
14 printf("Hello from %d coprocessor out of %d.\n", currentDevice, numDevices);
15 fflush(0);


16 }
17
18 int _Cilk_shared numDevices;
19
20 int main(int argc, char *argv[]) {
21 numDevices = _Offload_number_of_devices();
22 printf("Number of Target devices installed: %d\n\n" ,numDevices);
23
24 for(int i=0; i<numDevices; i++){
25 _Cilk_spawn _Cilk_offload_to(i) whatIsMyNumber(numDevices);
26 }
27 }

Back to Lab A.2.7.

B.2.7.4 labs/2/2.7-asynchronous-offload/step-02/async.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file async.cpp, located at 2/2.7-asynchronous-offload/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdlib.h>
11 #include <stdio.h>
12
13 int* response;
14 int _Cilk_shared n_d;
15
16 int main(){
17 n_d = _Offload_number_of_devices();
18 if (n_d < 1) {
19 printf("No devices available!");
20 return 2;
21 }
22
23 response = (int*) malloc(n_d*sizeof(int));
24 response[0:n_d] = 0;
25
26 // Make the following loop run in parallel with OpenMP
27 for (int i = 0; i < n_d; i++) {
28 // The body of this loop is executed by n_d host threads concurrently
29 //
30 // Use pragma to specify the targets and data manipulation clauses
31 {
32 // Each offloaded segment blocks the execution of the thread that launched it
33 response[i] = 1;
34 }
35 }
36
37 for (int i = 0; i < n_d; i++)
38 if (response[i] == 1) {
39 printf("OK: device %d responded\n", i);
40 } else {
41 printf("Error: device %d did not respond\n", i);
42 }
43 }


Back to Lab A.2.7.

B.2.7.5 labs/2/2.7-asynchronous-offload/step-03/async.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file async.cpp, located at 2/2.7-asynchronous-offload/step-03
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdlib.h>
11 #include <stdio.h>
12
13 __attribute__((target(mic))) int* response;
14
15 int main(){
16 int n_d = _Offload_number_of_devices();

17 if (n_d < 1) {
18 printf("No devices available!");
19 return 2;
20 }
21
22 response = (int*) malloc(n_d*sizeof(int));
23 response[0:n_d] = 0;
24
25 #pragma omp parallel for
26 for (int i = 0; i < n_d; i++) {
27 // The body of this loop is executed by n_d host threads concurrently
28 #pragma offload target(mic:i) inout(response[i:1])
29 {
30 // Each offloaded segment blocks the execution of the thread that launched it
31 response[i] = 1;
32 }
33 }
34
35 for (int i = 0; i < n_d; i++)


36 if (response[i] == 1) {
37 printf("OK: device %d responded\n", i);
38 } else {
39 printf("Error: device %d did not respond\n", i);
40 }
41 }

Back to Lab A.2.7.

B.2.7.6 labs/2/2.7-asynchronous-offload/step-04/async.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file async.cpp, located at 2/2.7-asynchronous-offload/step-04
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9


10 #include <stdlib.h>
11 #include <stdio.h>
12
13 int* response;
14 int _Cilk_shared n_d;
15
16 int main(){
17
18 n_d = _Offload_number_of_devices();
19
20 if (n_d < 1) {
21 printf("No devices available!");
22 return 2;
23 }
24
25 response = (int*) malloc(n_d*sizeof(int));
26 response[0:n_d] = 0;
27
28 for (int i = 0; i < n_d; i++) {
29 //use offload pragma with target, data manipulation clause and signal
30 {
31 // The offloaded job does not block the execution on the host
32 response[i] = 1;
33 }
34 }
35
36 for (int i = 0; i < n_d; i++) {
37 // This loop waits for all asynchronous offloads to finish
38 // Add synchronization offload_wait pragma
39 }
40
41 for (int i = 0; i < n_d; i++)
42 if (response[i] == 1) {
43 printf("OK: device %d responded\n", i);
44 } else {
45 printf("Error: device %d did not respond\n", i);
46 }
47 }

Back to Lab A.2.7.

B.2.7.7 labs/2/2.7-asynchronous-offload/step-05/async.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file async.cpp, located at 2/2.7-asynchronous-offload/step-05
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdlib.h>
11 #include <stdio.h>
12
13 __attribute__((target(mic))) int* response;
14
15 int main(){
16 int n_d = _Offload_number_of_devices();
17 if (n_d < 1) {


18 printf("No devices available!");


19 return 2;
20 }
21 response = (int*) malloc(n_d*sizeof(int));
22 response[0:n_d] = 0;
23 for (int i = 0; i < n_d; i++) {
24 #pragma offload target(mic:i) inout(response[i:1]) signal(&response[i])
25 {
26 // The offloaded job does not block the execution on the host
27 response[i] = 1;
28 }
29 }
30
31 for (int i = 0; i < n_d; i++) {
32 // This loop waits for all asynchronous offloads to finish
33 #pragma offload_wait target(mic:i) wait(&response[i])
34 }
35
36 for (int i = 0; i < n_d; i++)
37 if (response[i] == 1) {
38 printf("OK: device %d responded\n", i);
39 } else {
40 printf("Error: device %d did not respond\n", i);
41 }
42 }

Back to Lab A.2.7.

B.2.7.8 labs/2/2.7-asynchronous-offload/step-06/async.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file async.cpp, located at 2/2.7-asynchronous-offload/step-06
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */


9
10 #include <stdlib.h>
11 #include <stdio.h>
12
13 int *response;
14
15 void _Cilk_shared Respond(int _Cilk_shared & a) {
16 a = 1;
17 }
18
19 int main(){
20
21 int n_d = _Offload_number_of_devices();
22
23 if (n_d < 1) {
24 printf("No devices available!");
25 return 2;
26 }
27
28 response = (int *) malloc(n_d*sizeof(int));
29 response[0:n_d] = 0;
30 // Use cilk_for to run this loop in parallel


31 for (int i = 0; i < n_d; i++) {


32 // All iterations start simultaneously in n_d host threads
33 // Offload the function call to the individual targets
34 Respond(response[i]);
35 }
36
37 for (int i = 0; i < n_d; i++)
38 if (response[i] == 1) {
39 printf("OK: device %d responded\n", i);
40 } else {
41 printf("Error: device %d did not respond\n", i);
42 }
43 }

Back to Lab A.2.7.

B.2.7.9 labs/2/2.7-asynchronous-offload/step-07/async.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.

2 file async.cpp, located at 2/2.7-asynchronous-offload/step-07
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdlib.h>
11 #include <stdio.h>
12
13 int _Cilk_shared *response;
14
15 void _Cilk_shared Respond(int _Cilk_shared & a) {
16 a = 1;
17 }
18
19 int main(){
20
21 int n_d = _Offload_number_of_devices();
22
23 if (n_d < 1) {
24 printf("No devices available!");
25 return 2;
26 }
27
28 response = (int _Cilk_shared *) _Offload_shared_malloc(n_d*sizeof(int));
29 response[0:n_d] = 0;
30
31 _Cilk_for (int i = 0; i < n_d; i++) {
32 // All iterations start simultaneously in n_d host threads
33 _Cilk_offload_to(i)
34 Respond(response[i]);
35 }
36
37 for (int i = 0; i < n_d; i++)
38 if (response[i] == 1) {
39 printf("OK: device %d responded\n", i);
40 } else {
41 printf("Error: device %d did not respond\n", i);
42 }


43 }

Back to Lab A.2.7.

B.2.7.10 labs/2/2.7-asynchronous-offload/step-08/async.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file async.cpp, located at 2/2.7-asynchronous-offload/step-08
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdlib.h>
11 #include <stdio.h>
12
13 int _Cilk_shared *response;

14
15 void _Cilk_shared Respond(int _Cilk_shared & a) {
16 a = 1;
17 }
18
19 int main(){
20
21 int n_d = _Offload_number_of_devices();
22
23 if (n_d < 1) {
24 printf("No devices available!");
25 return 2;
26 }
27
28 response = (int _Cilk_shared *) _Offload_shared_malloc(n_d*sizeof(int));
29 response[0:n_d] = 0;
30
31 for (int i = 0; i < n_d; i++) {
32 // use Cilk asynchronous offload to the specified target


33 Respond(response[i]);
34 }
35
36 // use _Cilk_sync for synchronization
37
38 for (int i = 0; i < n_d; i++)
39 if (response[i] == 1) {
40 printf("OK: device %d responded\n", i);
41 } else {
42 printf("Error: device %d did not respond\n", i);
43 }
44 }

Back to Lab A.2.7.

B.2.7.11 labs/2/2.7-asynchronous-offload/step-09/async.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file async.cpp, located at 2/2.7-asynchronous-offload/step-09
3 is a part of the practical supplement to the handbook


4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"


5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdlib.h>
11 #include <stdio.h>
12
13 int _Cilk_shared *response;
14
15 void _Cilk_shared Respond(int _Cilk_shared & a) {
16 a = 1;
17 }
18
19 int main(){
20
21 int n_d = _Offload_number_of_devices();
22
23 if (n_d < 1) {
24 printf("No devices available!");
25 return 2;
26 }
27
28 response = (int _Cilk_shared *) _Offload_shared_malloc(n_d*sizeof(int));
29 response[0:n_d] = 0;
30
31 for (int i = 0; i < n_d; i++) {
32 _Cilk_spawn _Cilk_offload_to(i)
33 Respond(response[i]);
34 }
35
36 _Cilk_sync;
37
38 for (int i = 0; i < n_d; i++)
39 if (response[i] == 1) {
40 printf("OK: device %d responded\n", i);
41 } else {
42 printf("Error: device %d did not respond\n", i);
43 }
44 }

B.2.8 Using MPI for Multiple Coprocessors


Back to Lab A.2.8.

B.2.8.1 labs/2/2.8-MPI/step-00/Makefile

CXX = mpiicpc
CXXFLAGS =

OBJECTS = HelloMPI.o
MICOBJECTS = HelloMPI.oMIC

.SUFFIXES: .o .cc .oMIC

.c.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"


.c.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: HelloMPI HelloMPI.MIC

HelloMPI: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o HelloMPI $(OBJECTS)

HelloMPI.MIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o HelloMPI.MIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) HelloMPI HelloMPI.MIC

run: HelloMPI HelloMPI.MIC


scp HelloMPI.MIC mic0:~/
scp HelloMPI.MIC mic1:~/
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 2 ./HelloMPI : \
-host mic0 -np 2 ~/HelloMPI.MIC : \
-host mic1 -np 2 ~/HelloMPI.MIC

Back to Lab A.2.8.

B.2.8.2 labs/2/2.8-MPI/step-00/HelloMPI.c

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file HelloMPI.c, located at 2/2.8-MPI/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */

9
10 #include "mpi.h"
11 #include <stdio.h>
12 #include <string.h>
13
14 int main (int argc, char *argv[]) {
15 int i, rank, size, namelen;
16 char name[MPI_MAX_PROCESSOR_NAME];
17
18 MPI_Init (&argc, &argv);
19
20 MPI_Comm_size (MPI_COMM_WORLD, &size);
21 MPI_Comm_rank (MPI_COMM_WORLD, &rank);
22 MPI_Get_processor_name (name, &namelen);
23
24 printf ("Hello World from rank %d running on %s!\n", rank, name);
25 if (rank == 0) printf("MPI World size = %d processes\n", size);
26
27 MPI_Finalize ();
28 }

Back to Lab A.2.8.


B.2.8.3 labs/2/2.8-MPI/step-00/hosts

1 mic0
2 mic1

B.3 Source Code for Chapter 3: Expressing Parallelism


B.3.1 Automatic Vectorization: Compiler Pragmas and Vectorization Report
Back to Lab A.3.1.

B.3.1.1 labs/3/3.1-vectorization/step-00/vectorization.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file vectorization.cpp, located at 3/3.1-vectorization/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11
12 int main(){
13 const int n=8;
14 int i;
15 int A[n];
16 int B[n];
17
18 // Initialization
19 for (i=0; i<n; i++)
20 A[i]=B[i]=i;
21
22 // This loop will be auto-vectorized
23 for (i=0; i<n; i++)
24 A[i]+=B[i];
25
26 // Output
27 for (i=0; i<n; i++)
28 printf("%2d %2d %2d\n", i, A[i], B[i]);
29 }

Back to Lab A.3.1.

B.3.1.2 labs/3/3.1-vectorization/step-01/vectorization.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file vectorization.cpp, located at 3/3.1-vectorization/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */


9
10 #include <stdio.h>
11
12 int main(){
13 const int n=8;
14 int i;
15 __attribute__((align(64))) int A[n];
16 __attribute__((align(64))) int B[n];
17
18 // Initialization
19 for (i=0; i<n; i++)
20 A[i]=B[i]=i;
21
22 // This loop will be auto-vectorized
23 A[:]+=B[:];
24
25 // Output
26 for (i=0; i<n; i++)
27 printf("%2d %2d %2d\n", i, A[i], B[i]);
28 }

Back to Lab A.3.1.

B.3.1.3 labs/3/3.1-vectorization/step-02/vectorization.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file vectorization.cpp, located at 3/3.1-vectorization/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 int main(){
14 const int n=8;
15 int i;
16 int* A = (int*) malloc(n*sizeof(int));
17 int* B = (int*) malloc(n*sizeof(int));
18
19 // Initialization
20 for (i=0; i<n; i++)
21 A[i]=B[i]=i;
22
23 // This loop will be auto-vectorized
24 A[0:n]+=B[0:n];
25
26 // Output
27 for (i=0; i<n; i++)
28 printf("%2d %2d %2d\n", i, A[i], B[i]);
29
30 free(A);
31 free(B);
32 }

Back to Lab A.3.1.


B.3.1.4 labs/3/3.1-vectorization/step-03/vectorization.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file vectorization.cpp, located at 3/3.1-vectorization/step-03
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 int main(){
14 const int n=8;
15 const int al=64; //alignment
16 int i;
17 int* A = (int*) malloc(n*sizeof(int));
18 int* B = (int*) malloc(n*sizeof(int));
19

20 // Alignment check
21 printf("Offset for A is: %lu bytes\n",
22 (al - ( (size_t) A % al ) )%al );
23 printf("Offset for B is: %lu bytes\n",
24 (al - ( (size_t) B % al ) )%al );
25
26 // Initialization
27 for (i=0; i<n; i++)
28 A[i]=B[i]=i;
29
30 // This loop will be auto-vectorized
31 A[0:n]+=B[0:n];
32
33 // Output
34 //for (i=0; i<n; i++)
35 // printf("%2d %2d %2d\n", i, A[i], B[i]);
36
37 free(A);
38 free(B);
39 }

Back to Lab A.3.1.

B.3.1.5 labs/3/3.1-vectorization/step-04/vectorization.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file vectorization.cpp, located at 3/3.1-vectorization/step-04
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 int main(){


14 const int n=8;


15 const int al=64; //alignment
16 int i;
17 char* Aspace = (char*) malloc(n*sizeof(int)+al-1);
18 char* Bspace = (char*) malloc(n*sizeof(int)+al-1);
19 size_t Aoffset = (al-((size_t)(Aspace))%al)%al;
20 size_t Boffset = (al-((size_t)(Bspace))%al)%al;
21 int* A = (int*) ((char*)(Aspace) + Aoffset);
22 int* B = (int*) ((char*)(Bspace) + Boffset);
23
24 // Initialization
25 for (i=0; i<n; i++)
26 A[i]=B[i]=i;
27
28 // This loop will be auto-vectorized
29 A[0:n]+=B[0:n];
30
31 // Output
32 for (i=0; i<n; i++)
33 printf("%2d %2d %2d\n", i, A[i], B[i]);
34
35 free(Aspace);
36 free(Bspace);
37 }

Back to Lab A.3.1.
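
The padding arithmetic in the step-04 listing above (lines 19-22) deserves a brief walk-through. The following sketch is not part of the lab sources; it simply pushes one hypothetical malloc return value through the same expressions to show why the resulting pointer lands on a 64-byte boundary:

/* Sketch only: the same offset arithmetic as step-04, with a worked example. */
#include <assert.h>
#include <stdlib.h>

int main(){
  const size_t al = 64;                                    // desired alignment
  char* space = (char*) malloc(100*sizeof(int) + al - 1);  // padded allocation
  // Example: if malloc returned 0x1003, then 0x1003 % 64 = 3,
  // (64 - 3) % 64 = 61, and space + 61 = 0x1040, a multiple of 64.
  // The outer % al maps the "already aligned" case to an offset of 0.
  size_t offset = (al - ((size_t)space % al)) % al;
  int* A = (int*) (space + offset);
  assert(((size_t)A % al) == 0);                           // A is 64-byte aligned
  free(space);                                             // free the original pointer, not A
  return 0;
}

The extra al-1 bytes requested up front guarantee that the shifted pointer still leaves room for the full array.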
un

B.3.1.6 labs/3/3.1-vectorization/step-05/vectorization.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file vectorization.cpp, located at 3/3.1-vectorization/step-05
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11
12 int main(){
13 const int n=8;
14 const int al=64; //alignment
15 int i;
16 int* A = (int*) _mm_malloc(n*sizeof(int), al);
17 int* B = (int*) _mm_malloc(n*sizeof(int), al);
18
19 // Initialization
20 for (i=0; i<n; i++)
21 A[i]=B[i]=i;
22
23 // This loop will be auto-vectorized
24 A[0:n]+=B[0:n];
25
26 // Output
27 for (i=0; i<n; i++)
28 printf("%2d %2d %2d\n", i, A[i], B[i]);
29
30 _mm_free(A);
31 _mm_free(B);


32 }

Back to Lab A.3.1.

B.3.1.7 labs/3/3.1-vectorization/step-06/vectorization.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file vectorization.cpp, located at 3/3.1-vectorization/step-06
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <immintrin.h>
12
13 int main(){
14 const int n=16;
15 const int al=64; //alignment
16 int i;
17 int* A = (int*) _mm_malloc(n*sizeof(int), al);
18 int* B = (int*) _mm_malloc(n*sizeof(int), al);
19
20 // Initialization
21 for (i=0; i<n; i++)
22 A[i]=B[i]=i;
23
24 // This loop will be auto-vectorized
25 for (i=0; i<n; i+=16){
26 __m512i Avec = _mm512_load_epi32(A+i);
27 __m512i Bvec = _mm512_load_epi32(B+i);
28 Avec = _mm512_add_epi32(Avec, Bvec);
29 _mm512_store_epi32(A+i, Avec);
30 }
31

32 // Output
33 for (i=0; i<n; i++)
34 printf("%2d %2d %2d\n", i, A[i], B[i]);
35
36 _mm_free(A);
37 _mm_free(B);
38 }

Back to Lab A.3.1.

B.3.1.8 labs/3/3.1-vectorization/step-07/vectorization.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file vectorization.cpp, located at 3/3.1-vectorization/step-07
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9


10 #include <stdio.h>
11
12 int my_simple_add(int x1, int x2){
13 return x1+x2;
14 }
15
16 int main(){
17 const int n=8;
18 const int al=64; //alignment
19 int i;
20 int* A = (int*) _mm_malloc(n*sizeof(int), al);
21 int* B = (int*) _mm_malloc(n*sizeof(int), al);
22
23 // Initialization
24 for (i=0; i<n; i++)
25 A[i]=B[i]=i;
26
27 // This loop will be auto-vectorized
28 for (i=0; i<n; i++)
29 A[i] = my_simple_add(A[i], B[i]);
30
31 // Output
32 for (i=0; i<n; i++)
33 printf("%2d %2d %2d\n", i, A[i], B[i]);
34
35 _mm_free(A);
36 _mm_free(B);
37 }

Back to Lab A.3.1.

B.3.1.9 labs/3/3.1-vectorization/step-08/main.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file main.cpp, located at 3/3.1-vectorization/step-08
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11
12 int my_simple_add(int, int);
13
14 int main(){
15 const int n=8;
16 const int al=64; //alignment
17 int i;
18 int* A = (int*) _mm_malloc(n*sizeof(int), al);
19 int* B = (int*) _mm_malloc(n*sizeof(int), al);
20
21 // Initialization
22 for (i=0; i<n; i++)
23 A[i]=B[i]=i;
24
25 // This loop will be auto-vectorized
26
27 for (i=0; i<n; i++)


28 A[i] = my_simple_add(A[i], B[i]);


29
30 // Output
31 for (i=0; i<n; i++)
32 printf("%2d %2d %2d\n", i, A[i], B[i]);
33
34 _mm_free(A);
35 _mm_free(B);
36 }

Back to Lab A.3.1.

B.3.1.10 labs/3/3.1-vectorization/step-08/worker.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cpp, located at 3/3.1-vectorization/step-08
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8

6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 int my_simple_add(int x1, int x2){
11 return x1+x2;
12 }

Back to Lab A.3.1.

B.3.1.11 labs/3/3.1-vectorization/step-09/main.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file main.cpp, located at 3/3.1-vectorization/step-09
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11
12 __attribute__((vector)) int my_simple_add(int, int);
13
14 int main(){
15 const int n=8;
16 const int al=64; //alignment
17 int i;
18 int* A = (int*) _mm_malloc(n*sizeof(int), al);
19 int* B = (int*) _mm_malloc(n*sizeof(int), al);
20
21 // Initialization
22 for (i=0; i<n; i++)
23 A[i]=B[i]=i;
24
25 // This loop will be auto-vectorized
26 #pragma simd
27 for (i=0; i<n; i++)


28 A[i] = my_simple_add(A[i], B[i]);


29
30 // Output
31 for (i=0; i<n; i++)
32 printf("%2d %2d %2d\n", i, A[i], B[i]);
33
34 _mm_free(A);
35 _mm_free(B);
36 }

Back to Lab A.3.1.

B.3.1.12 labs/3/3.1-vectorization/step-09/worker.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cpp, located at 3/3.1-vectorization/step-09
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8

6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 __attribute__((vector)) int my_simple_add(int x1, int x2){
11 return x1+x2;
12 }

Back to Lab A.3.1.

B.3.1.13 labs/3/3.1-vectorization/step-0a/main.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file main.cpp, located at 3/3.1-vectorization/step-0a
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11
12 void my_simple_add(int, int*, int*);
13
14 int main(){
15 const int n=8;
16 int i;
17 int A[n];
18 int B[n];
19
20 // Initialization
21 for (i=0; i<n; i++)
22 A[i]=B[i]=i;
23
24 // This loop will be auto-vectorized
25 my_simple_add(n-1, B+1, B);
26
27 // Output


28 for (i=0; i<n; i++)


29 printf("%2d %2d %2d\n", i, A[i], B[i]);
30 }

Back to Lab A.3.1.

B.3.1.14 labs/3/3.1-vectorization/step-0a/worker.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cpp, located at 3/3.1-vectorization/step-0a
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 void my_simple_add(int n, int* a, int* b){
11 #pragma ivdep

12 for (int i=0; i<n; i++)
13 a[i] = b[i];
14 }

Back to Lab A.3.1.

B.3.1.15 labs/3/3.1-vectorization/step-0b/Makefile

CXX = icpc
CXXFLAGS = -vec-report6 -restrict

OBJECTS = worker.o main.o

.SUFFIXES: .o .cpp

.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

all: runme

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

clean:
rm -f *.o runme

Back to Lab A.3.1.

B.3.1.16 labs/3/3.1-vectorization/step-0b/main.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cpp, located at 3/3.1-vectorization/step-0b
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.


8 Contact information can be found at http://colfax-intl.com/ */


9
10 #include <stdio.h>
11
12 void my_simple_add(int, int*, int*);
13
14 int main(){
15 const int n=8;
16 int i;
17 int A[n];
18 int B[n];
19
20 // Initialization
21 for (i=0; i<n; i++)
22 A[i]=B[i]=i;
23
24 // This loop will be auto-vectorized
25 my_simple_add(n, A, B);
26
27 // Output
28 for (i=0; i<n; i++)
29 printf("%2d %2d %2d\n", i, A[i], B[i]);
30 }

Back to Lab A.3.1.

B.3.1.17 labs/3/3.1-vectorization/step-0b/worker.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file worker.cpp, located at 3/3.1-vectorization/step-0b
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9

10 void my_simple_add(int n, int* restrict a, int* restrict b){


11 for (int i=0; i<n; i++)
12 a[i] = b[i];
13 }

Back to Lab A.3.1.

B.3.1.18 labs/3/3.1-vectorization/step-0c/vectorization.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file vectorization.cpp, located at 3/3.1-vectorization/step-0c
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11
12 __attribute__((vector)) int my_simple_add(int x1, int x2){


13 return x1+x2;
14 }
15
16 int main(){
17 const int n=256;
18 int i;
19 int A[n];
20 int B[n];
21
22 // Initialization
23 for (i=0; i<n; i++)
24 A[i]=B[i]=i;
25
26 // This loop will be auto-vectorized
27 _Cilk_for(int j=0; j<n; j++)
28 A[j] = my_simple_add(A[j], B[j]);
29
30 // Output
31 for (i=0; i<n; i++)
32 printf("%2d %2d %2d\n", i, A[i], B[i]);
33 }

B.3.2 Parallelism with OpenMP: Shared and Private Variables, Reduction

Back to Lab A.3.2.



B.3.2.1 labs/3/3.2-OpenMP/step-00/openmp.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file openmp.cpp, located at 3/3.2-OpenMP/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
14 const int nt=omp_get_max_threads();
15 printf("OpenMP with %d threads\n", nt);
16
17 #pragma omp parallel
18 printf("Hello World from thread %d\n", omp_get_thread_num());
19 }

Back to Lab A.3.2.

B.3.2.2 labs/3/3.2-OpenMP/step-01/openmp.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file openmp.cpp, located at 3/3.2-OpenMP/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8


6 Redistribution or commercial usage without a written permission


7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
14 const int nt=omp_get_max_threads();
15 printf("OpenMP with %d threads\n", nt);
16
17 #pragma omp parallel for
18 for (int i=0; i<nt; i++)
19 printf("Hello World from thread %d on %d iteration\n", omp_get_thread_num(), i);
20 }

Back to Lab A.3.2.

B.3.2.3 labs/3/3.2-OpenMP/step-02/openmp.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file openmp.cpp, located at 3/3.2-OpenMP/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
14 const int nt=omp_get_max_threads(); // this variable is declared before the parallel
15 // region, it will be available for every thread
16 printf("OpenMP with %d threads\n", nt);
17
18 #pragma omp parallel
19 {
20 // Code placed here will be executed by all threads.
21 // Stack variables declared here will be private to each thread.
22 int private_number=0;
23
24 #pragma omp for
25 for (int i=0; i<nt; i++){
26 // iteration will be distributed across available threads
27 private_number += 1;
28 printf("Hello World from thread %d (private_number = %d)\n",
29 omp_get_thread_num(), private_number);
30 }
31 // code placed here will be executed by all threads
32 }
33 }

Back to Lab A.3.2.

B.3.2.4 labs/3/3.2-OpenMP/step-03/openmp.cpp


1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file openmp.cpp, located at 3/3.2-OpenMP/step-03
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
14 const int nt=omp_get_max_threads(); // this variable is declared before the parallel
15 // region it will be available for every thread
16 printf("OpenMP with %d threads\n", nt);
17
18 #pragma omp parallel num_threads(4)
19 {
20 // Code placed here will be executed by all threads.
21 // Stack variables declared here will be private to each thread.
22 int private_number=0;
23
24 #pragma omp for schedule(guided, 4)
25 for (int i=0; i<nt; i++){
26 // iteration will be distributed across available threads
27 private_number += 1;
28 printf("Hello World from thread %d (private_number = %d)\n",
29 omp_get_thread_num(), private_number);
30 }
31 // code placed here will be executed by all threads
32 }
33 }

Back to Lab A.3.2.



B.3.2.5 labs/3/3.2-OpenMP/step-04/openmp.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file openmp.cpp, located at 3/3.2-OpenMP/step-04
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 void Recurse(const int task){
14 if (task<10){
15 printf("Creating task %d...\n", task+1);
16 #pragma omp task
17 {
18 Recurse(task+1);
19 }
20 long foo=0; for (long i = 0; i < (1<<20); i++) foo+=i;
21 printf("Result of task %d in thread %d is %ld\n", task, omp_get_thread_num(), foo);
22 }


23 }
24
25 int main(){
26 #pragma omp parallel
27 {
28 #pragma omp single
29 Recurse(0);
30 }
31 }

Back to Lab A.3.2.

B.3.2.6 labs/3/3.2-OpenMP/step-05/openmp.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file openmp.cpp, located at 3/3.2-OpenMP/step-05
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8

6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
14 int varShared = 5;
15 int varPrivate = 4;
16 int varFirstPrivate = 2;
17
18 #pragma omp parallel private(varPrivate) firstprivate(varFirstPrivate)
19 {
20 varPrivate = 0; // Private variables should be initialized within
21 // the parallel region
22 varShared += 1; // Race condition, undefined behavior!
23 varPrivate += 1;

24 varFirstPrivate += 1;
25 printf("For thread %d,\t varShared=%d\t varPrivate=%d\t varFirstPrivate=%d\n",
26 omp_get_thread_num(), varShared, varPrivate, varFirstPrivate);
27 }
28 printf("After parallel region, varShared=%d\t varPrivate=%d\t varFirstPrivate=%d\n",
29 varShared, varPrivate, varFirstPrivate);
30 }

Back to Lab A.3.2.
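
To summarize the step-05 listing: varShared stays a single copy visible to all threads (and is updated without synchronization, hence the race), varPrivate becomes a per-thread copy that is indeterminate until the thread assigns it, and varFirstPrivate becomes a per-thread copy initialized from the master's value. The sketch below is not part of the lab code; it only restates the expected behavior, assuming a conforming OpenMP implementation:

// Sketch only: expected data-sharing behavior, not a lab source file.
#include <omp.h>
#include <stdio.h>

int main(){
  int varShared = 5, varPrivate = 4, varFirstPrivate = 2;
  #pragma omp parallel private(varPrivate) firstprivate(varFirstPrivate)
  {
    varPrivate = 0;        // must be assigned before use: the private copy starts indeterminate
    varFirstPrivate += 1;  // every thread starts from 2 and sees 3 here
    // varShared is the one shared copy; unsynchronized writes to it would race
  }
  // The originals are untouched by the private/firstprivate copies:
  printf("%d %d %d\n", varShared, varPrivate, varFirstPrivate);  // prints 5 4 2
  return 0;
}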

B.3.2.7 labs/3/3.2-OpenMP/step-06/openmp.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file openmp.cpp, located at 3/3.2-OpenMP/step-06
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9


10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
14 const int n = 1000;
15 int sum = 0;
16 #pragma omp parallel for
17 for (int i=0; i<n; i++){
18 // Race condition
19 sum = sum + i;
20 }
21 printf("sum = %d (must be %d)\n", sum, ((n-1)*n)/2);
22 }

Back to Lab A.3.2.

B.3.2.8 labs/3/3.2-OpenMP/step-07/openmp.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.

2 file openmp.cpp, located at 3/3.2-OpenMP/step-07
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
14 const int n = 1000;
15 int sum = 0;
16 #pragma omp parallel for
17 for (int i=0; i<n; i++){
18 #pragma omp critical
19 sum = sum + i; // only one thread at a time can execute this section
20 }
21 printf("sum = %d (must be %d)\n", sum, ((n-1)*n)/2);
22 }

Back to Lab A.3.2.

B.3.2.9 labs/3/3.2-OpenMP/step-08/openmp.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file openmp.cpp, located at 3/3.2-OpenMP/step-08
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){


14 const int n = 1000;


15 int sum = 0;
16 #pragma omp parallel for
17 for (int i=0; i<n; i++){
18 #pragma omp atomic
19 sum += i; // lightweight synchronization
20 }
21 printf("sum = %d (must be %d)\n", sum, ((n-1)*n)/2);
22 }

Back to Lab A.3.2.

B.3.2.10 labs/3/3.2-OpenMP/step-09/openmp.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file openmp.cpp, located at 3/3.2-OpenMP/step-09
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8

6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
14 const int N=1000;
15 int* A = (int*) malloc(N*sizeof(int));
16 for (int i=0; i<N; i++) A[i]=i;
17 #pragma omp parallel
18 {
19 #pragma omp single
20 {
21 // Compute the sum on two threads
22 int sum1=0, sum2=0;
23 #pragma omp task shared(A, N, sum1)
24 for (int i=0; i<N/2; i++) sum1 += A[i];


25 #pragma omp task shared(A, N, sum2)
26 for (int i=N/2; i<N; i++) sum2 += A[i];
27 // Wait for tasks to complete
28 #pragma omp taskwait
29 printf("Result=%d (must be %d)\n", sum1+sum2, ((N-1)*N)/2);
30 }
31 }
32 free(A);
33 }

Back to Lab A.3.2.

B.3.2.11 labs/3/3.2-OpenMP/step-0a/openmp.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file openmp.cpp, located at 3/3.2-OpenMP/step-0a
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission


7 from Colfax International is prohibited.


8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
14 const int N=1000;
15 int sum = 0;
16 #pragma omp parallel
17 {
18 int sum_th = 0;
19 #pragma omp for
20 for (int i=0; i<N; i++) sum_th += i;
21 #pragma omp critical
22 sum += sum_th;
23 }
24 printf("sum = %d (must be %d)\n", sum, ((N-1)*N)/2);
25 }

Back to Lab A.3.2.

B.3.2.12 labs/3/3.2-OpenMP/step-0b/openmp.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file openmp.cpp, located at 3/3.2-OpenMP/step-0b
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 int main(){
14 const int N=1000;


15 int sum = 0;
16 #pragma omp parallel for reduction(+:sum)
17 for (int i=0; i<N; i++) sum += i;
18 printf("sum = %d (must be %d)\n", sum, ((N-1)*N)/2);
19 }

Back to Lab A.3.2.
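
The reduction(+:sum) clause in step-0b is, in effect, a compact spelling of the pattern that step-0a wrote out by hand: each thread accumulates into a private copy and the copies are combined once at the end. The sketch below is not taken from the labs; it only restates that equivalence explicitly:

// Sketch only: what reduction(+:sum) amounts to, written out by hand.
#include <omp.h>
#include <stdio.h>

int main(){
  const int N = 1000;
  int sum = 0;
  #pragma omp parallel
  {
    int partial = 0;          // per-thread accumulator (the "private copy" of sum)
    #pragma omp for
    for (int i = 0; i < N; i++)
      partial += i;           // no synchronization in the loop body
    #pragma omp critical
    sum += partial;           // combine once per thread
  }
  printf("sum = %d (must be %d)\n", sum, ((N-1)*N)/2);
  return 0;
}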

B.3.3 Complex Algorithms with Cilk Plus: Recursive Divide-and-Conquer


Back to Lab A.3.3.

B.3.3.1 labs/3/3.3-Cilk-Plus/step-00/cilk.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file cilk.cpp, located at 3/3.3-Cilk-Plus/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8


6 Redistribution or commercial usage without a written permission


7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <cilk/cilk.h>
12 #include <cilk/cilk_api_linux.h>
13
14 int main(){
15 const int nw=__cilkrts_get_nworkers();
16 printf("Cilk Plus with %d workers\n", nw);
17
18 _Cilk_for (int i=0; i<nw; i++) // Light workload: gets serialized
19 printf("Hello World from worker %d\n", __cilkrts_get_worker_number());
20
21 _Cilk_for (int i=0; i<nw; i++){
22 double f=1.0;
23 while (f<1.0e40) f*=2.0; // Extra workload: gets parallelized
24 printf("Hello again from worker %d (%f)\n", __cilkrts_get_worker_number(), f);
25 }
26 }

Back to Lab A.3.3.

B.3.3.2 labs/3/3.3-Cilk-Plus/step-01/cilk.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file cilk.cpp, located at 3/3.3-Cilk-Plus/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <cilk/cilk.h>
12 #include <cilk/cilk_api_linux.h>
13
14 int main(){
15 const int nw=__cilkrts_get_nworkers();
16 printf("Cilk Plus with %d workers\n", nw);
17
18 _Cilk_for (int i=0; i<nw; i++) // Light workload: gets serialized
19 printf("Hello World from worker %d\n", __cilkrts_get_worker_number());
20
21 #pragma cilk grainsize = 4
22 _Cilk_for (int i=0; i<nw; i++){
23 double f=1.0;
24 while (f<1.0e40) f*=2.0; // Extra workload: gets parallelized
25 printf("Hello again from worker %d (%f)\n", __cilkrts_get_worker_number(), f);
26 }
27 }

Back to Lab A.3.3.

B.3.3.3 labs/3/3.3-Cilk-Plus/step-02/cilk.cpp


1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file cilk.cpp, located at 3/3.3-Cilk-Plus/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <cilk/cilk_api_linux.h>
12
13 void Recurse(const int task){
14 if (task<10){
15 printf("Creating task %d...\n", task+1);
16 _Cilk_spawn Recurse(task+1);
17 long foo=0; for (long i=0; i<(1L<<30L); i++) foo+=i;
18 printf("Result of task %d in worker %d is %ld\n", task,
19 __cilkrts_get_worker_number(), foo);
20 }
21 }

22
23 int main(){
24 Recurse(0);
25 }

Back to Lab A.3.3.



B.3.3.4 labs/3/3.3-Cilk-Plus/step-03/cilk.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file cilk.cpp, located at 3/3.3-Cilk-Plus/step-03
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <stdlib.h>
12
13 void Sum(const int* A, const int start, const int end, int & result){
14 for (int j=start; j < end; j++) result += A[j];
15 }
16
17 int main(){
18 const int N=1000;
19 int* A = (int*) malloc(N*sizeof(int));
20 for (int i=0; i<N; i++) A[i]=i;
21
22 // Compute the sum with two tasks
23 int sum1=0, sum2=0;
24
25 _Cilk_spawn Sum(A, 0, N/2, sum1);
26 _Cilk_spawn Sum(A, N/2, N, sum2);
27
28 // Wait for forked off sums to complete
29 _Cilk_sync;
30


31 printf("Result=%d (must be %d)\n", sum1+sum2, ((N-1)*N)/2);


32
33 free(A);
34 }

Back to Lab A.3.3.

B.3.3.5 labs/3/3.3-Cilk-Plus/step-04/cilk.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file cilk.cpp, located at 3/3.3-Cilk-Plus/step-04
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>

11 #include <cilk/reducer_opadd.h>
12 #include <cilk/cilk_api_linux.h>
13
14 int main(){
15 const int N=20;
16 cilk::reducer_opadd<int> sum;
17 sum.set_value(0);
18 _Cilk_for (int i=0; i<N; i++){
19 printf("%d\t%d:%d\n", __cilkrts_get_worker_number(), i, sum.get_value());
20 sum = sum + i;
21 }
22 printf("Result=%d (must be %d)\n", sum.get_value(), ((N-1)*N)/2);
23 }

Back to Lab A.3.3.



B.3.3.6 labs/3/3.3-Cilk-Plus/step-05/cilk.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file cilk.cpp, located at 3/3.3-Cilk-Plus/step-05
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <cilk/cilk_api.h>
12 #include <cilk/cilk_api_linux.h>
13
14 const int N=100000;
15
16 class Scratch {
17 public:
18 int data[N];
19 Scratch(){ printf("Constructor called by worker %d\n",
20 __cilkrts_get_worker_number());}
21 };


22
23 int main(){
24 if (0 == __cilkrts_set_param("nworkers","2")){
25 _Cilk_for( int i=0; i<10; i++){
26 Scratch scratch;
27 scratch.data[0:N] = i;
28 int sum = 0;
29 for (int j=0; j<N; j++) sum += scratch.data[j];
30 printf("i=%d, worker=%d, sum=%d\n", i, __cilkrts_get_worker_number(), sum);
31 }
32 } else {
33 printf("Failed to set workers count!\n");
34 return 1;
35 }
36 }

Back to Lab A.3.3.

B.3.3.7 labs/3/3.3-Cilk-Plus/step-06/cilk.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file cilk.cpp, located at 3/3.3-Cilk-Plus/step-06
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <cilk/holder.h>
12 #include <cilk/cilk_api.h>
13 #include <cilk/cilk_api_linux.h>
14
15 const int N=1000000;
16
17 class Scratch {
18 public:
19 int data[N];
20 Scratch(){ printf("Constructor called by worker %d\n", __cilkrts_get_worker_number());}
21 };
22
23 int main(){
24 if (0 == __cilkrts_set_param("nworkers","2")){
25 cilk::holder<Scratch> scratch;
26 _Cilk_for( int i=0; i<10; i++){
27 scratch().data[0:N] = i; // Operator () is used for data access
28 int sum = 0;
29 for (int j=0; j<N; j++) sum += scratch().data[j];
30 printf("i=%d, worker=%d, sum=%d\n", i, __cilkrts_get_worker_number(), sum);
31 }
32 } else {
33 printf("Failed to set workers count!\n");
34 return 1;
35 }
36 }


B.3.4 Data Traffic with MPI


Back to Lab A.3.4.

B.3.4.1 labs/3/3.4-MPI/step-00/Makefile

CXX = mpiicpc
CXXFLAGS =

OBJECTS = mpi.o
MICOBJECTS = mpi.oMIC

.SUFFIXES: .o .cc .oMIC

.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cpp.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: runme-mpi runme-mpi.MIC

runme-mpi: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme-mpi $(OBJECTS)

runme-mpi.MIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runme-mpi.MIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme-mpi runme-mpi.MIC

run: runme-mpi runme-mpi.MIC
scp runme-mpi.MIC mic0:~/
scp runme-mpi.MIC mic1:~/
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun \
-host localhost -n 2 ./runme-mpi : \
-host mic0 -n 2 ~/runme-mpi.MIC : \
-host mic1 -n 2 ~/runme-mpi.MIC

Back to Lab A.3.4.

B.3.4.2 labs/3/3.4-MPI/step-00/mpi.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file mpi.cpp, located at 3/3.4-MPI/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <mpi.h>
11
12 int main(int argc, char** argv) {
13
14 // Set up MPI environment


15 int ret = MPI_Init(&argc,&argv);


16 if (ret != MPI_SUCCESS) {
17 printf("error: could not initialize MPI\n");
18 MPI_Abort(MPI_COMM_WORLD, ret);
19 }
20
21 int worldSize, myRank, myNameLength;
22 char myName[MPI_MAX_PROCESSOR_NAME];
23 MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
24 MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
25 MPI_Get_processor_name(myName, &myNameLength);
26
27 // Perform work
28 // Exchange messages with MPI_Send, MPI_Recv, etc.
29 // ...
30 printf ("Hello World from rank %d running on %s!\n", myRank, myName);
31 if (myRank == 0) printf("MPI World size = %d processes\n", worldSize);
32
33 // Terminate MPI environment
34 MPI_Finalize();
35 }

Back to Lab A.3.4.

B.3.4.3 labs/3/3.4-MPI/step-01/mpi.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file mpi.cpp, located at 3/3.4-MPI/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <mpi.h>
11 #include <stdio.h>
12
13 int main(int argc, char** argv) {
14
15 // Set up MPI environment
16 int ret = MPI_Init(&argc,&argv);
17 if (ret != MPI_SUCCESS) {
18 printf("error: could not initialize MPI\n");
19 MPI_Abort(MPI_COMM_WORLD, ret);
20 }
21
22 int worldSize, rank, irank, namelen;
23 char name[MPI_MAX_PROCESSOR_NAME];
24 MPI_Status stat;
25 MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
26 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
27 MPI_Get_processor_name(name, &namelen);
28
29 if (rank == 0) {
30 printf("I am the master process, rank %d of %d running on %s\n",
31 rank, worldSize, name);
32 for (int i = 1; i < worldSize; i++){
33 // Blocking receive operations in the master process
34 MPI_Recv (&irank, 1, MPI_INT, MPI_ANY_SOURCE, i, MPI_COMM_WORLD, &stat);


35 MPI_Recv (&namelen, 1, MPI_INT, MPI_ANY_SOURCE, i, MPI_COMM_WORLD, &stat);


36 MPI_Recv (name, namelen + 1, MPI_CHAR, MPI_ANY_SOURCE, i, MPI_COMM_WORLD, &stat);
37 printf("Received hello from rank %d running on %s\n", irank, name);
38 }
39 } else {
40 // Blocking send operations in all other processes
41 MPI_Send (&rank, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
42 MPI_Send (&namelen, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
43 MPI_Send (name, namelen + 1, MPI_CHAR, 0, rank, MPI_COMM_WORLD);
44 }
45 // Terminate MPI environment
46 MPI_Finalize();
47 }

Back to Lab A.3.4.

B.3.4.4 labs/3/3.4-MPI/step-02/mpi.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.

2 file mpi.cpp, located at 3/3.4-MPI/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <mpi.h>
11 #include <stdio.h>
12 #include <stdlib.h>
13
14 int main(int argc, char** argv) {
15
16 // Set up MPI environment
17 int ret = MPI_Init(&argc,&argv);
18 if (ret != MPI_SUCCESS) {
19 printf("error: could not initialize MPI\n");
20 MPI_Abort(MPI_COMM_WORLD, ret);
21 }
22
23 const int M = 100000, N = 200000;
24 float data1[M]; data1[:]=1.0f;
25 double data2[N]; data2[:]=2.0;
26 int size1, size2;
27 int worldSize, rank, namelen;
28 char name[MPI_MAX_PROCESSOR_NAME];
29 MPI_Status stat;
30 MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
31 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
32
33 if ((worldSize > 1) && ((worldSize % 2) == 0)){
34 if (rank % 2 == 0) {
35 // Sender side: allocate user-space buffer for asynchronous
36 // communication
37 MPI_Pack_size(M, MPI_FLOAT, MPI_COMM_WORLD, &size1);
38 MPI_Pack_size(N, MPI_DOUBLE, MPI_COMM_WORLD, &size2);
39 int bufsize = size1 + size2 + 2*MPI_BSEND_OVERHEAD;
40 printf("size1 = %d, size2 = %d, MPI_BSEND_OVERHEAD = %d, allocating %d bytes\n",
41 size1, size2, MPI_BSEND_OVERHEAD, bufsize);
42 void* buffer = malloc(bufsize);


43 MPI_Buffer_attach(buffer, bufsize);
44 MPI_Bsend(data1, M, MPI_FLOAT, rank+1, rank>>1, MPI_COMM_WORLD);
45 MPI_Bsend(data2, N, MPI_DOUBLE, rank+1, rank>>1, MPI_COMM_WORLD);
46 MPI_Buffer_detach(&buffer, &bufsize);
47 free(buffer);
48 } else {
49 // Receiver side does not have to do anything special
50 MPI_Recv(data1, M, MPI_FLOAT, rank-1, rank>>1, MPI_COMM_WORLD, &stat);
51 MPI_Recv(data2, N, MPI_DOUBLE, rank-1, rank>>1, MPI_COMM_WORLD, &stat);
52 }
53 } else
54 if (rank == 0) printf("Use even number of ranks.\n");
55
56 // Terminate MPI environment
57 MPI_Finalize();
58 }

Back to Lab A.3.4.

B.3.4.5 labs/3/3.4-MPI/step-03/mpi.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file mpi.cpp, located at 3/3.4-MPI/step-03
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <mpi.h>
11 #include <stdio.h>
12 #include <stdlib.h>
13
14 int main(int argc, char** argv) {
15
16 // Set up MPI environment
17 int ret = MPI_Init(&argc,&argv);
18 if (ret != MPI_SUCCESS) {
19 printf("error: could not initialize MPI\n");
20 MPI_Abort(MPI_COMM_WORLD, ret);
21 }
22
23 const int N = 100000;
24 float data1[N], data2[N]; data1[:]=0.0f;
25 int rank, worldSize, tag=1;
26
27 MPI_Request request;
28 MPI_Status stat;
29
30 MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
31 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
32
33 if (worldSize > 1){
34 if (rank == 0){
35 // Sender side: starting non-blocking send of data1
36 MPI_Isend(data1, N, MPI_FLOAT, 1, tag, MPI_COMM_WORLD, &request);
37 // Sender can perform some other work while data1 is in transit
38 for ( int i=0; i<N; i++) data2[i] = 1.0f;
39 // MPI_Wait will block until it safe to modify data1


40 MPI_Wait(&request, &stat);
41 } else if (rank == 1){
42 // Receiver side: blocking receive of data1
43 MPI_Recv(data1, N, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &stat);
44 // At the end of blocking MPI_Recv, it is safe to use data1
45 }
46 }
47 // Terminate MPI environment
48 MPI_Finalize();
49 }

Back to Lab A.3.4.

B.3.4.6 labs/3/3.4-MPI/step-04/mpi.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file mpi.cpp, located at 3/3.4-MPI/step-04
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"

5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include "mpi.h"
11 #include <stdio.h>
12 #define SIZE 6
13
14 int main(int argc, char *argv[]) {
15 int numtasks, rank, sendcount, recvcount, source;
16 float sendbuf[SIZE][SIZE] = {
17 {1.0, 2.0, 3.0, 4.0, 5.0, 6.0},
18 {7.0, 8.0, 9.0, 10.0, 11.0, 12.0},
19 {13.0, 14.0, 15.0, 16.0, 17.0, 18.0},
20 {19.0, 20.0, 21.0, 22.0, 23.0, 24.0},
21 {25.0, 26.0, 27.0, 28.0, 29.0, 30.0},
22 {31.0, 32.0, 33.0, 34.0, 35.0, 36.0}};
23 float recvbuf[SIZE];
24
25 MPI_Init(&argc,&argv);
26 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
27 MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
28
29 if (numtasks == SIZE) {
30 source = 1;
31 sendcount = SIZE;
32 recvcount = SIZE;
33 MPI_Scatter(sendbuf,sendcount,MPI_FLOAT,recvbuf,recvcount,
34 MPI_FLOAT,source,MPI_COMM_WORLD);
35 printf("rank= %d Results: %f\t%f\t%f\t%f\t%f\t%f\n",rank,recvbuf[0],
36 recvbuf[1],recvbuf[2],recvbuf[3],recvbuf[4],recvbuf[5]);
37 }
38 else
39 printf("Must use %d processes, using %d. Terminating.\n", SIZE, numtasks);
40
41 MPI_Finalize();
42 }

Back to Lab A.3.4.
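
In the step-04 listing, MPI_Scatter hands row r of sendbuf to rank r, so each rank prints one row of the matrix. The reverse collective is MPI_Gather, which is not used in this lab; the sketch below (buffer and variable names are illustrative, not from the lab sources) shows how one row per rank could be collected back onto rank 0:

/* Sketch only: gathering one row per rank onto rank 0 with MPI_Gather. */
#include <mpi.h>
#include <stdio.h>
#define SIZE 6

int main(int argc, char *argv[]) {
  float row[SIZE];               // this rank's contribution (e.g. filled after a scatter)
  float gathered[SIZE][SIZE];    // only significant on the root rank
  int rank, numtasks;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

  for (int i = 0; i < SIZE; i++) row[i] = rank*SIZE + i + 1;  // sample data

  if (numtasks == SIZE) {
    // Each rank sends SIZE floats; rank 0 receives them ordered by rank.
    MPI_Gather(row, SIZE, MPI_FLOAT, gathered, SIZE, MPI_FLOAT, 0, MPI_COMM_WORLD);
    if (rank == 0)
      printf("first element of last row: %f\n", gathered[SIZE-1][0]);
  } else if (rank == 0)
    printf("Must use %d processes, using %d.\n", SIZE, numtasks);

  MPI_Finalize();
}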


B.3.4.7 labs/3/3.4-MPI/step-05/mpi.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file mpi.cpp, located at 3/3.4-MPI/step-05
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <mpi.h>
11 #include <stdio.h>
12
13 int main(int argc, char** argv) {
14
15 // Set up MPI environment
16 int ret = MPI_Init(&argc,&argv);
17 if (ret != MPI_SUCCESS) {
18 printf("error: could not initialize MPI\n");
19 MPI_Abort(MPI_COMM_WORLD, ret);

20 }
21
22 int worldSize, rank, irank;
23 int mics, hosts, mic=0, host=0;
24 MPI_Status stat;
25
26 #ifdef __MIC__
27 mic++;
28 #else
29 host++;
30 #endif
31
32 MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
33 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
34
35 MPI_Allreduce(&mic, &mics, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
36 MPI_Allreduce(&host, &hosts, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
37

38 if (rank == 0)
39 printf("Of %d MPI processes, we have %d running on Xeon Phis and %d on\
40 the host.\n", worldSize, mics, hosts);
41
42 // Terminate MPI environment
43 MPI_Finalize();
44 }

B.4 Source Code for Chapters 4 and 5: Optimizing Applications


B.4.1 Using Intel® VTune™ Amplifier XE
Back to Lab A.4.1.

B.4.1.1 labs/4/4.1-vtune/step-00-xeon/Makefile

CXX = icpc
CXXFLAGS = -openmp -g -O3


all:
$(CXX) $(CXXFLAGS) -o host-workload host-workload.cpp

clean:
rm -f host-workload

Back to Lab A.4.1.

B.4.1.2 labs/4/4.1-vtune/step-00-xeon/host-workload.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file host-workload.cpp, located at 4/4.1-vtune/step-00-xeon
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9

10 #include <stdio.h>
11 #include <omp.h>
12
13 long MyCalculation(const int n) {
14
15 long sum = 0;
16
17 #pragma omp parallel for
18 for (long i = 0; i < n; i++){
19
20 // A terrible way to do reduction
21 #pragma omp critical
22 sum = sum + i; // only one thread at a time can execute this section
23
24 }
25
26 return sum;
27 }
28
29 int main(){
30
31 const long n = 1L<<20L;
32 for (int trial = 0; trial < 5; trial++) {
33
34 const double t0 = omp_get_wtime();
35 const long sum = MyCalculation(n);
36 const double t1 = omp_get_wtime();
37
38 printf("sum = %ld (must be %ld)\n", sum, ((n-1L)*n)/2L);
39 printf("Run time: %.3f seconds\n", t1-t0);
40 fflush(0);
41
42 }
43 }

Back to Lab A.4.1.
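
The critical section inside MyCalculation is intentionally wasteful: it is the hotspot this VTune exercise is designed to expose. For reference, here is a sketch of the straightforward repair, using the reduction clause shown in Section B.3.2.12; this variant is not part of the lab sources:

// Sketch only: the same sum with the critical section replaced by a reduction.
#include <stdio.h>
#include <omp.h>

long MyCalculationReduced(const long n) {
  long sum = 0;
  // Each thread accumulates privately; the partial sums are combined at the end,
  // so no thread waits on a critical section inside the loop.
  #pragma omp parallel for reduction(+:sum)
  for (long i = 0; i < n; i++)
    sum += i;
  return sum;
}

int main(){
  const long n = 1L<<20L;
  printf("sum = %ld (must be %ld)\n", MyCalculationReduced(n), ((n-1L)*n)/2L);
  return 0;
}

Re-profiling after a change of this kind is the natural follow-up to Lab A.4.1.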

B.4.1.3 labs/4/4.1-vtune/step-01-offload/Makefile


CXX = icpc
CXXFLAGS = -openmp -g -O3

all:
$(CXX) $(CXXFLAGS) -o offload-workload offload-workload.cpp

clean:
rm -f offload-workload

Back to Lab A.4.1.

B.4.1.4 labs/4/4.1-vtune/step-01-offload/offload-workload.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file offload-workload.cpp, located at 4/4.1-vtune/step-01-offload
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission

7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 __attribute__((target(mic)))
14 long MyCalculation(const int n) {
15
16 long sum = 0;
17
18 #pragma omp parallel for
19 for (long i = 0; i < n; i++){
20
21 // A terrible way to do reduction
22 #pragma omp critical
23 sum = sum + i; // only one thread at a time can execute this section
24
25 }
26
27 return sum;
28 }
29
30 int main(){
31
32 printf("Please be patient, it may take ~20 seconds before you get output...\n");
33
34 const long n = 1L<<20L;
35 for (int trial = 0; trial < 5; trial++) {
36
37 const double t0 = omp_get_wtime();
38
39 long sum;
40 #pragma offload target(mic)
41 {
42 sum = MyCalculation(n);
43 }
44
45 const double t1 = omp_get_wtime();
46
47 printf("sum = %ld (must be %ld)\n", sum, ((n-1L)*n)/2L);


48 printf("Run time: %.3f seconds\n", t1-t0);


49
50 }
51 }

Back to Lab A.4.1.

B.4.1.5 labs/4/4.1-vtune/step-02-native/Makefile

CXX = icpc
CXXFLAGS = -openmp -g -O3 -mmic

all:
$(CXX) $(CXXFLAGS) -o native-workload native-workload.cpp

clean:
rm -f native-workload

Back to Lab A.4.1.

B.4.1.6 labs/4/4.1-vtune/step-02-native/native-workload.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file native-workload.cpp, located at 4/4.1-vtune/step-02-native
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12
13 long MyCalculation(const int n) {
14
15 long sum = 0;
16
17 #pragma omp parallel for
18 for (long i = 0; i < n; i++){
19
20 // A terrible way to do reduction
21 #pragma omp critical
22 sum = sum + i; // only one thread at a time can execute this section
23
24 }
25
26 return sum;
27 }
28
29 int main(){
30
31 printf("Please be patient, it may take ~20 seconds before you get output...\n");
32 fflush(0);
33
34 const long n = 1L<<20L;
35 for (int trial = 0; trial < 5; trial++) {
36


37 const double t0 = omp_get_wtime();


38 const long sum = MyCalculation(n);
39 const double t1 = omp_get_wtime();
40
41 printf("sum = %ld (must be %ld)\n", sum, ((n-1L)*n)/2L);
42 printf("Run time: %.3f seconds\n", t1-t0);
43 fflush(0);
44
45 }
46 }

Back to Lab A.4.1.

B.4.2 Using Intel® Trace Analyzer and Collector
Back to Lab A.4.2.

B.4.2.1 labs/4/4.2-itac/step-00/Makefile

CXX = mpiicpc
CXXFLAGS = -mkl -vec-report3 -openmp

OBJECTS = mpi-pi.o
MICOBJECTS = mpi-pi.oMIC

.SUFFIXES: .o .cc .oMIC

.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cpp.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: runme-mpi runme-mpi.MIC

runme-mpi: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme-mpi $(OBJECTS)

runme-mpi.MIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runme-mpi.MIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme-mpi runme-mpi.MIC

run: runme-mpi runme-mpi.MIC


LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
scp runme-mpi.MIC mic0:~/
scp runme-mpi.MIC mic1:~/
I_MPI_MIC=1 \
VT_LOGFILE_FORMAT=stfsingle \
I_MPI_PIN_DOMAIN=omp \
mpirun -trace \
-np 2 -host localhost ${PWD}/runme-mpi : \
-np 1 -host mic0 ~/runme-mpi.MIC : \
-np 1 -host mic1 ~/runme-mpi.MIC

Back to Lab A.4.2.


B.4.2.2 labs/4/4.2-itac/step-00/mpi-pi.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file mpi-pi.cpp, located at 4/4.2-itac/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <mpi.h>
11 #include <omp.h>
12 #include <stdio.h>
13 #include <stdlib.h>
14 #include "mkl_vsl.h"
15 #include <assert.h>
16
17 #include <unistd.h>
18

19 #define BRNG VSL_BRNG_MT19937
20 #define METHOD 0
21 #define ALIGNED __attribute__((align(64)))
22
23 const int trials = 2; // How many times to compute pi
24 const long totalIterations=1L<<32L; // How many random number pairs to generate
25                                     // for the calculation of pi
26
27 const long blockSize = 1<<12; // A block of this many numbers is processed with SIMD
28 const long blocks = totalIterations/blockSize;
29
30
31 void PerformCalculationOfPi(const int begin, const int end, long & dUnderCurveComputed) {
32
33 // Uses the Monte Carlo method to compute the number of points
34 // under the curve x^2 + y^2 = 1, where 0 <= x,y <= 1
35
36 long dUnderCurve = 0;
37 #pragma omp parallel reduction (+:dUnderCurve)
38 {
39
40 const int nth = omp_get_thread_num();
41 long localCount=0;
42 // Initialize the random number generator, one in each thread
43 VSLStreamStatePtr stream;
44 vslNewStream( &stream, BRNG, begin+nth );
45 float r[blockSize*2] ALIGNED;
46
47 #pragma omp for schedule(dynamic)
48 for (long i = begin; i < end; i++){
49 // Generate random numbers for the block
50 vsRngUniform( METHOD, stream, blockSize*2, r, 0.0f, 1.0f );
51
52 // Compute the number of points under the curve
53 for (long j = 0; j < blockSize; j++) {// vectorized loop
54 const float x=r[j];
55 const float y=r[j + blockSize];
56 if (x*x + y*y < 1.0f) localCount++;
57 }
58 }
59 vslDeleteStream( &stream );


60 dUnderCurve += localCount;
61
62 }
63
64 dUnderCurveComputed += dUnderCurve;
65 }
66
67
68 int main(int argc, char *argv[]){
69
70 assert(totalIterations%blockSize == 0);
71
72 // Who am I in the MPI world
73 int rank, ranks, namelen;
74 MPI_Status stat;
75 MPI_Init(&argc, &argv);
76 MPI_Comm_size(MPI_COMM_WORLD, &ranks);
77 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
78
79 // Scheduling policy
80 const float portion = 0.5;

81
82 double totalTime = 0.0; // Timing statistics
83 double totalWaitTime = 0.0; // Timing statistics
84 long communication[4] = {0L}; // Number of MPI messages
85
86 for (int t = 0; t < trials; t++) {
87
88 long dUnderCurve=0, UnderCurveSum=0;
89
90 const double start_time = MPI_Wtime();
91
92 int msgInt[2]; // MPI message buffer
93 long msgLong[2]; // MPI message buffer
94
95 if (rank == 0) {
96
97 // "Boss"
98 int workerRank, nthreads;
99 long iter = portion*blocks/(ranks-1);
100 long i = 0;
101 while (i < blocks){
102 // Assign work to workers
103 for(long r = 0; r < ranks - 1; r++){
104 MPI_Recv(&msgInt, 2, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
105 MPI_COMM_WORLD, &stat);
106 workerRank = msgInt[0];
107 nthreads = msgInt[1];
108 communication[workerRank] ++;
109
110 long begin = i;
111 i += iter;
112 long end = i;
113 if (begin>blocks) begin = blocks;
114 if (end>blocks) end = blocks;
115
116 msgLong[0] = begin;
117 msgLong[1] = end;
118 MPI_Send(&msgLong, 2, MPI_LONG, workerRank, workerRank, MPI_COMM_WORLD);
119 }
120 iter *= portion;
121 if (iter < 32*nthreads) iter = 32*nthreads;


122 }
123 // Tell workers to stop
124 for(int i = 1; i < ranks; i++) {
125 MPI_Recv(&msgInt, 2, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
126 MPI_COMM_WORLD, &stat);
127 workerRank = msgInt[0];
128 nthreads = msgInt[1];
129 msgLong[0] = -1L;
130 MPI_Send(&msgLong, 2, MPI_LONG, workerRank, workerRank, MPI_COMM_WORLD);
131 }
132
133 } else {
134
135 // "Worker"
136
137 // Pi calculation counters
138 long begin=0, end;
139 int nthreads = omp_get_max_threads();
140 printf("Rank %d uses %d thread%s.\n", rank, nthreads, (nthreads==1?"":"s"));
141 while(begin>=0){
142
143 // Ask boss for work
144 msgInt[0] = rank;
145 msgInt[1] = nthreads;
146 MPI_Send(&msgInt, 2, MPI_INT, 0, rank, MPI_COMM_WORLD);
147
148 MPI_Recv(&msgLong, 2, MPI_LONG, 0, rank, MPI_COMM_WORLD, &stat);
149 begin = msgLong[0];
150 end = msgLong[1];
151
152 if (begin>=0){
153
154 // Perform work from "begin" to "end" assigned by "Boss"
155 PerformCalculationOfPi(begin, end, dUnderCurve);
156 }
157 }
158 }
159 double cTime = MPI_Wtime();
160 double stopTime = 0;
161
162 // Get results from all MPI processes using reduction
163 MPI_Reduce(&dUnderCurve, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
164 MPI_Barrier(MPI_COMM_WORLD);
165
166 // Get timing statistics from all MPI processes
167 cTime = MPI_Wtime() - cTime;
168 MPI_Reduce(&cTime, &stopTime, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
169
170 if (rank == 0){
171 // Output results
172 const double pi = (double)UnderCurveSum / (double) totalIterations * 4.0 ;
173 cTime = MPI_Wtime();
174 const double workTime = cTime-start_time;
175 const double pi_exact=3.141592653589793;
176 printf ("pi = %10.9f, rel. error = %12.9f, time = %8.6fs, load unbalance time = \
177 %8.6fs\n", pi, (pi-pi_exact)/pi_exact, workTime, stopTime);
178 switch (t){
179 case 0 : break;
180 case trials-1 : totalTime += workTime;
181 totalWaitTime += stopTime;
182 printf("Average time (s): %f\n", totalTime/(trials-1));break;
183 printf("%.4f\t%f\t%f\t", portion, totalTime/(trials-1),


184 totalWaitTime/(trials-1));
185 for(int i=1; i<ranks; i++) printf("%ld\t", communication[i]);
186 printf("\n");
187 break;
188 default: totalTime += workTime;
189 totalWaitTime += stopTime; break;
190 }
191 }
192 MPI_Barrier(MPI_COMM_WORLD);
193 }
194
195 MPI_Finalize();
196 return 0;
197 }
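
Note: in mpi-pi.cpp above, the "Boss" (rank 0) hands out ranges of blocks whose size shrinks geometrically: after every round the chunk size iter is multiplied by portion and clamped from below at 32*nthreads. The stand-alone sketch below (not part of the lab sources; the worker and thread counts are made-up example values) reproduces only that scheduling loop so the sequence of chunk sizes can be inspected without launching MPI:

/* Sketch only, not part of the lab sources: the chunk-size schedule used by the
   "Boss" rank in mpi-pi.cpp, with hypothetical worker and thread counts. */
#include <stdio.h>

int main() {
  const long totalIterations = 1L<<32L;
  const long blockSize = 1<<12;
  const long blocks = totalIterations/blockSize; // same totals as in the listing
  const float portion = 0.5f;                    // same scheduling policy
  const int workers = 3, nthreads = 240;         // assumed example values

  long iter = (long)(portion*blocks/workers);
  long assigned = 0;
  while (assigned < blocks) {
    for (int r = 0; r < workers && assigned < blocks; r++) {
      long end = assigned + iter;
      if (end > blocks) end = blocks;
      printf("chunk of %ld blocks: [%ld, %ld)\n", end - assigned, assigned, end);
      assigned = end;
    }
    iter = (long)(iter*portion);                  // shrink the chunk geometrically
    if (iter < 32L*nthreads) iter = 32L*nthreads; // but never below 32 blocks per thread
  }
}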

B.4.3 Serial Optimization: Precision Control, Eliminating Redundant Operations


Back to Lab A.4.3.

B.4.3.1 labs/4/4.3-serial-optimization/step-00/Makefile

CXX = icpc
CXXFLAGS = -vec-report3 -openmp

OBJECTS = erf.o main.o
MICOBJECTS = erf.oMIC main.oMIC

.SUFFIXES: .o .oMIC .cpp

.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cpp.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: runme

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

mic: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runme $(MICOBJECTS)
micnativeloadex ./runme

clean:
rm -f *.o* runme

Back to Lab A.4.3.

B.4.3.2 labs/4/4.3-serial-optimization/step-00/main.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cpp, located at 4/4.3-serial-optimization/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission


7 from Colfax International is prohibited.


8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12 #include <math.h>
13
14 __attribute__((vector)) float myerff(const float);
15
16 int main(int argc, char* argv[]){
17 const long lTotal = 1<<28;
18 const float fMin=-3.0, fMax=3.0;
19 float *fIn = (float*) _mm_malloc(sizeof(float)*lTotal, 64);
20 float *fOut = (float*) _mm_malloc(sizeof(float)*lTotal, 64);
21
22 for ( int i = 0; i<lTotal; i++)
23 fIn[i] = fMin + (fMax-fMin)/lTotal*i;
24
25 for ( int k = 0; k<10; k++){
26 const double start = omp_get_wtime();
27 #pragma simd
28 for ( int i = 0; i<lTotal; i++)
29 fOut[i] = myerff(fIn[i]);
30 const double stop = omp_get_wtime();
31
32 double err = 0.0;
33 for (int i = 0; i < lTotal; i++) {
34 const float dif = fOut[i] - erff(fIn[i]);
35 err += dif*dif;
36 }
37 err = sqrt(err/(double)lTotal);
38
39 printf("%f\t%f\n", fIn[0], fOut[0]);
40 printf("%f\t%f\n", fIn[lTotal-1], fOut[lTotal-1]);
41 printf("%f seconds used for calculations of %ld numbers.\n", stop-start, lTotal);
42 printf("Rel. error = %e\n", err);
43 fflush(0);
44 }
45 _mm_free(fIn);
46 _mm_free(fOut);
47 }

Back to Lab A.4.3.

B.4.3.3 labs/4/4.3-serial-optimization/step-00/erf.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file erf.cpp, located at 4/4.3-serial-optimization/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <math.h>
11
12 __attribute__((vector)) float myerff(const float inx){
13
14 //const float x = fabsf(inx);


15 float x = (inx < 0 ? -inx : inx);


16
17 const float p = 0.3275911;
18 const float t1 = 1/(1+p*x);
19 const float t2 = 1/(1+p*x)/(1+p*x);
20 const float t3 = 1/(1+p*x)/(1+p*x)/(1+p*x);
21 const float t4 = 1/(1+p*x)/(1+p*x)/(1+p*x)/(1+p*x);
22 const float t5 = 1/(1+p*x)/(1+p*x)/(1+p*x)/(1+p*x)/(1+p*x);
23
24 float res = 1.0 - (0.254829592*t1 - 0.284496736*t2 + 1.421413741*t3 -
25 1.453152027*t4 + 1.061405429*t5) * exp(-x*x);
26
27 return (inx<0 ? -res : res);
28 }

Back to Lab A.4.3.

B.4.3.4 labs/4/4.3-serial-optimization/step-01/erf.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file erf.cpp, located at 4/4.3-serial-optimization/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <math.h>
11
12 __attribute__((vector)) float myerff(const float inx){
13
14 //const float x = fabsf(inx);
15 float x = (inx < 0 ? -inx : inx);
16
17 const float p = 0.3275911;
18 const float t1 = 1/(1+p*x);

19 const float t2 = pow(t1, 2);


20 const float t3 = pow(t1, 3);
21 const float t4 = pow(t1, 4);
22 const float t5 = pow(t1, 5);
23
24 float res = 1.0 - (0.254829592*t1 - 0.284496736*t2 + 1.421413741*t3 -
25 1.453152027*t4 + 1.061405429*t5) * exp(-x*x);
26
27 return (inx<0 ? -res : res);
28 }

Back to Lab A.4.3.

B.4.3.5 labs/4/4.3-serial-optimization/step-02/erf.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file erf.cpp, located at 4/4.3-serial-optimization/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission


7 from Colfax International is prohibited.


8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <math.h>
11
12 __attribute__((vector)) float myerff(const float inx){
13
14 //const float x = fabsf(inx);
15 float x = (inx < 0 ? -inx : inx);
16
17 const float p = 0.3275911;
18 const float t1 = 1/(1+p*x);
19 const float t2 = t1*t1;
20 const float t3 = t2*t1;
21 const float t4 = t3*t1;
22 const float t5 = t4*t1;
23
24 float res = 1.0 - (0.254829592*t1 - 0.284496736*t2 + 1.421413741*t3 -
25 1.453152027*t4 + 1.061405429*t5) * exp(-x*x);
26
27 return (inx<0 ? -res : res);
28 }

Back to Lab A.4.3.

B.4.3.6 labs/4/4.3-serial-optimization/step-03/erf.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file erf.cpp, located at 4/4.3-serial-optimization/step-03
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */

9
10 #include <math.h>
11
12 __attribute__((vector)) float myerff(const float inx){
13
14 //const float x = fabsf(inx);
15 float x = (inx < 0.0f ? -inx : inx);
16
17 const float p = 0.3275911f;
18 const float t1 = 1.0f/(1.0f+p*x);
19 const float t2 = t1*t1;
20 const float t3 = t2*t1;
21 const float t4 = t3*t1;
22 const float t5 = t4*t1;
23
24 float res = 1.0f - (0.254829592f*t1 - 0.284496736f*t2 + 1.421413741f*t3 -
25 1.453152027f*t4 + 1.061405429f*t5) * exp(-x*x);
26
27 return (inx<0.0f ? -res : res);
28 }

Back to Lab A.4.3.


B.4.3.7 labs/4/4.3-serial-optimization/step-04/erf.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file erf.cpp, located at 4/4.3-serial-optimization/step-04
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <math.h>
11
12 __attribute__((vector)) float myerff(const float inx){
13
14 //const float x = fabsf(inx);
15 float x = (inx < 0.0f ? -inx : inx);
16
17 const float p = 0.3275911f;
18 const float t1 = 1.0f/(1.0f+p*x);
19 const float t2 = t1*t1;

20 const float t3 = t2*t1;
21 const float t4 = t3*t1;
22 const float t5 = t4*t1;
23
24 const float l2e = 1.442695040f; // log2f(expf(1.0f))
25 float res = 1.0f - (0.254829592f*t1 - 0.284496736f*t2 + 1.421413741f*t3 -
26 1.453152027f*t4 + 1.061405429f*t5) * exp2f(-x*x*l2e);
27
28 return (inx<0.0f ? -res : res);
29 }

Back to Lab A.4.3.


B.4.3.8 labs/4/4.3-serial-optimization/step-05/erf.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file erf.cpp, located at 4/4.3-serial-optimization/step-05
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <math.h>
11
12 __attribute__((vector)) float myerff(float x){
13
14 // architecture-specific trick to get absolute value and sign
15 // works for Intel Xeon and Intel Xeon Phi
16 unsigned int *xp = (unsigned int*) &x;
17 const unsigned int sign = (*xp) & 0x80000000;
18 *xp &= 0x7FFFFFFF;
19
20 //const float x = fabsf(inx);
21 //float x = (inx < 0.0f ? -inx : inx);
22
23 const float p = 0.3275911f;


24 const float t1 = 1.0f/(1.0f+p*x);


25 const float t2 = t1*t1;
26 const float t3 = t2*t1;
27 const float t4 = t3*t1;
28 const float t5 = t4*t1;
29
30 const float l2e = 1.442695040f; // log2f(expf(1.0f))
31 float res = 1.0f - (0.254829592f*t1 - 0.284496736f*t2 + 1.421413741f*t3 -
32 1.453152027f*t4 + 1.061405429f*t5) * exp2f(-x*x*l2e);
33 unsigned int *resp = (unsigned int*) &res;
34 *resp |= sign;
35 return res; //(inx<0.0f ? -res : res);
36 }

Back to Lab A.4.3.
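
Note: the pointer casts in step 05 rely on the IEEE-754 single-precision layout, in which the sign occupies the most significant bit of the 32-bit word; the listing's own comment marks the trick as architecture-specific. The sketch below (not part of the lab sources) expresses the same bit manipulation through memcpy, a common way to avoid the strict-aliasing concerns raised by casting a float* to an unsigned int*:

/* Sketch only, not part of the lab sources: the sign-bit manipulation from step 05
   written with memcpy-based type punning (assumes IEEE-754 binary32 floats). */
#include <stdio.h>
#include <string.h>

static inline float transplant_sign(const float from, float onto) {
  unsigned int f, o;
  memcpy(&f, &from, sizeof(f));                // reinterpret the bits of 'from'
  memcpy(&o, &onto, sizeof(o));
  o = (o & 0x7FFFFFFFu) | (f & 0x80000000u);   // clear the sign of 'onto', copy it from 'from'
  memcpy(&onto, &o, sizeof(onto));
  return onto;
}

int main() {
  printf("%f %f\n", transplant_sign(-2.0f, 0.5f), transplant_sign(3.0f, 0.5f));
  // expected output: -0.500000 0.500000
}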

B.4.3.9 labs/4/4.3-serial-optimization/step-0p/main.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.

2 file main.cpp, located at 4/4.3-serial-optimization/step-0p
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12 #include <math.h>
13
14 __attribute__((vector)) float myerff(const float);
15
16 int main(int argc, char* argv[]){
17 const long lTotal = 1<<28;
18 const float fMin=-3.0, fMax=3.0;
19 float *fIn = (float*) _mm_malloc(sizeof(float)*lTotal, 64);
20 float *fOut = (float*) _mm_malloc(sizeof(float)*lTotal, 64);
21
22 for ( int i = 0; i<lTotal; i++)
23 fIn[i] = fMin + (fMax-fMin)/lTotal*i;
24
25 for ( int k = 0; k<10; k++){
26 const double start = omp_get_wtime();
27 int stride = 512;
28 #pragma omp parallel for schedule(guided)
29 for ( int i = 0; i<lTotal; i+=stride)
30 #pragma simd
31 #pragma vector aligned
32 for ( int l = 0; l<stride; l++)
33 fOut[i+l] = myerff(fIn[i+l]);
34 const double stop = omp_get_wtime();
35
36 double err = 0.0;
37 for (int i = 0; i < lTotal; i++) {
38 const float dif = fOut[i] - erff(fIn[i]);
39 err += dif*dif;
40 }
41 err = sqrt(err/(double)lTotal);
42


43 printf("%f\t%f\n", fIn[0], fOut[0]);


44 printf("%f\t%f\n", fIn[lTotal-1], fOut[lTotal-1]);
45 printf("%f seconds used for calculations of %ld numbers.\n", stop-start, lTotal);
46 printf("Rel. error = %e\n", err);
47 fflush(0);
48 }
49 _mm_free(fIn);
50 _mm_free(fOut);
51 }

B.4.4 Vector Optimization: Unit-Stride Access, Data Alignment


Back to Lab A.4.4.

B.4.4.1 labs/4/4.4-vectorization-data-structure/step-00/Makefile

CXX = icpc
CXXFLAGS = -openmp -vec-report3

OBJECTS = main.o
MICOBJECTS = main.oMIC

.SUFFIXES: .o .cc .oMIC

.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: runme runmeMIC

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)
Ex

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC

Back to Lab A.4.4

B.4.4.2 labs/4/4.4-vectorization-data-structure/step-00/main.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cc, located at 4/4.4-vectorization-data-structure/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <cstdlib>
12 #include <omp.h>

Parallel Programming and Optimization with Intel Xeon Phi Coprocessors


B.4. SOURCE CODE FOR CHAPTERS 4 AND 5: OPTIMIZING APPLICATIONS 405

13 #include <math.h>
14
15 struct Charge { // Elegant, but ineffective data layout
16 float x, y, z, q; // Coordinates and value of this charge
17 };
18
19 // This version performs poorly, because data layout of class Charge
20 // does not allow efficient vectorization
21 void calculate_electric_potential(
22 const int m, // Number of charges
23 const Charge* chg, // Charge distribution (array of classes)
24 const float Rx, const float Ry, const float Rz, // Observation point
25 float & phi // Output: electric potential
26 ) {
27 phi=0.0f;
28 for (int i=0; i<m; i++) { // This loop will be auto-vectorized
29 // Non-unit stride: (&chg[i+1].x - &chg[i].x) == sizeof(Charge)
30 const float dx=chg[i].x - Rx;
31 const float dy=chg[i].y - Ry;
32 const float dz=chg[i].z - Rz;
33 phi -= chg[i].q / sqrtf(dx*dx+dy*dy+dz*dz); // Coulomb’s law
34 }
35 }
36
37 int main(int argv, char* argc[]){
38 const size_t n=1<<11;
39 const size_t m=1<<11;
40 const int nTrials=10;
41
42 Charge chg[m];
43 float* potential = (float*) malloc(sizeof(float)*n*n);
44
45 // Initializing array of charges
46 for (size_t i=0; i<n; i++) {
47 chg[i].x = (float)rand()/(float)RAND_MAX;
48 chg[i].y = (float)rand()/(float)RAND_MAX;
49 chg[i].z = (float)rand()/(float)RAND_MAX;
50 chg[i].q = (float)rand()/(float)RAND_MAX;
51 }
52 printf("Initialization complete.\n");
53
54 for (int t=0; t<nTrials; t++){
55 potential[0:n*n]=0.0f;
56 const double t0 = omp_get_wtime();
57 #pragma omp parallel for schedule(dynamic)
58 for (int j = 0; j < n*n; j++) {
59 const float Rx = (float)(j % n);
60 const float Ry = (float)(j / n);
61 const float Rz = 0.0f;
62 calculate_electric_potential(m, chg, Rx, Ry, Rz, potential[j]);
63 }
64 const double t1 = omp_get_wtime();
65
66 if ( t>= 2) { // First two iterations are slow on Xeon Phi; exclude them
67 printf("time: %.6f\n", t1-t0);
68 }
69 }
70 free(potential);
71 }

Back to Lab A.4.4
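
Note: the point of step 00 is in the comment inside calculate_electric_potential(): with the array-of-structures layout, consecutive values of chg[i].x are sizeof(Charge) bytes apart, which prevents unit-stride vector loads; the structure-of-arrays layout introduced in step 01 makes the same accesses contiguous. The tiny stand-alone sketch below (not part of the lab sources) prints the two strides:

/* Sketch only, not part of the lab sources: the stride between consecutive
   x-coordinates in the two data layouts compared in this lab. */
#include <stdio.h>

struct Charge { float x, y, z, q; }; // array-of-structures layout from step 00

int main() {
  Charge aos[2];   // AoS: x values are sizeof(Charge) bytes apart
  float x_soa[2];  // SoA: x values are sizeof(float) bytes apart
  printf("AoS stride between x[i] and x[i+1]: %ld bytes\n",
         (long)((char*)&aos[1].x - (char*)&aos[0].x));
  printf("SoA stride between x[i] and x[i+1]: %ld bytes\n",
         (long)((char*)&x_soa[1] - (char*)&x_soa[0]));
  return 0;
}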


B.4.4.3 labs/4/4.4-vectorization-data-structure/step-01/main.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cc, located at 4/4.4-vectorization-data-structure/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <cstdlib>
12 #include <omp.h>
13 #include <math.h>
14
15 struct Charge_Distribution {
16 // This data layout permits effective vectorization of Coulomb’s law application
17 const int m; // Number of charges
18 float * x; // Array of x-coordinates of charges

19 float * y; // ...y-coordinates...
20 float * z; // ...etc.
21 float * q; // These arrays are allocated in the constructor
22 };
23
24 // This version vectorizes better thanks to unit-stride data access
25 void calculate_electric_potential(
26 const int m, // Number of charges
27 const Charge_Distribution & chg, // Charge distribution (structure of arrays)
28 const float Rx, const float Ry, const float Rz, // Observation point
29 float & phi // Output: electric potential
30 ) {
31 phi=0.0f;
32 for (int i=0; i<chg.m; i++) {
33 // Unit stride: (&chg.x[i+1] - &chg.x[i]) == sizeof(float)
34 const float dx=chg.x[i] - Rx;
35 const float dy=chg.y[i] - Ry;
36 const float dz=chg.z[i] - Rz;
37 phi -= chg.q[i] / sqrtf(dx*dx+dy*dy+dz*dz);
38 }
39 }
40
41 int main(int argv, char* argc[]){
42 const size_t n=1<<11;
43 const size_t m=1<<11;
44 const int nTrials=10;
45
46 Charge_Distribution chg = { .m = m };
47 chg.x = (float*)malloc(sizeof(float)*m);
48 chg.y = (float*)malloc(sizeof(float)*m);
49 chg.z = (float*)malloc(sizeof(float)*m);
50 chg.q = (float*)malloc(sizeof(float)*m);
51 float* potential = (float*) malloc(sizeof(float)*n*n);
52
53 // Initializing array of charges
54 for (size_t i=0; i<n; i++) {
55 chg.x[i] = (float)rand()/(float)RAND_MAX;
56 chg.y[i] = (float)rand()/(float)RAND_MAX;
57 chg.z[i] = (float)rand()/(float)RAND_MAX;
58 chg.q[i] = (float)rand()/(float)RAND_MAX;
59 }


60 printf("Initialization complete.\n");
61
62 for (int t=0; t<nTrials; t++){
63 potential[0:n*n]=0.0f;
64 const double t0 = omp_get_wtime();
65 #pragma omp parallel for schedule(dynamic)
66 for (int j = 0; j < n*n; j++) {
67 const float Rx = (float)(j % n);
68 const float Ry = (float)(j / n);
69 const float Rz = 0.0f;
70 calculate_electric_potential(m, chg, Rx, Ry, Rz, potential[j]);
71 }
72 const double t1 = omp_get_wtime();
73
74 if ( t>= 2) { // First two iterations are slow on Xeon Phi; exclude them
75 printf("time: %.6f\n", t1-t0);
76 }
77 }
78 free(chg.x);
79 free(chg.y);
80 free(chg.z);
81 free(potential);
82 }
Back to Lab A.4.4 h en
un
Note: In this lab, the Makefile for step 01 is identical to the Makefile for step 00; however, the Makefile for step 02 is different.

B.4.4.4 labs/4/4.4-vectorization-data-structure/step-02/Makefile
ep
Pr

CXX = icpc
y
el

CXXFLAGS = -openmp -vec-report3 -fimf-domain-exclusion=8


u siv

OBJECTS = main.o
cl

MICOBJECTS = main.oMIC
Ex

.SUFFIXES: .o .cc .oMIC

.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: runme runmeMIC

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC

Back to Lab A.4.4


B.4.4.5 labs/4/4.4-vectorization-data-structure/step-02/main.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cc, located at 4/4.4-vectorization-data-structure/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <cstdlib>
12 #include <omp.h>
13 #include <math.h>
14
15 struct Charge_Distribution {
16 const int m;
17 float * x;
18 float * y;

19 float * z;
20 float * q;
21 };
22
23 void calculate_electric_potential(
24 const int m,
25 const Charge_Distribution & chg,
26 const float Rx, const float Ry, const float Rz,
27 float & phi
28 ) {
29 phi=0.0f;
30 for (int i=0; i<chg.m; i++) {
31 const float dx=chg.x[i] - Rx;
32 const float dy=chg.y[i] - Ry;
33 const float dz=chg.z[i] - Rz;
34 phi -= chg.q[i] / sqrtf(dx*dx+dy*dy+dz*dz);
35 }
36 }

37
38
39 int main(int argv, char* argc[]){
40 const size_t n=1<<11;
41 const size_t m=1<<11;
42 const int nTrials=10;
43
44 Charge_Distribution chg = { .m = m };
45 chg.x = (float*)malloc(sizeof(float)*m);
46 chg.y = (float*)malloc(sizeof(float)*m);
47 chg.z = (float*)malloc(sizeof(float)*m);
48 chg.q = (float*)malloc(sizeof(float)*m);
49 float* potential = (float*) malloc(sizeof(float)*n*n);
50
51 for (size_t i=0; i<n; i++) {
52 chg.x[i] = (float)rand()/(float)RAND_MAX;
53 chg.y[i] = (float)rand()/(float)RAND_MAX;
54 chg.z[i] = (float)rand()/(float)RAND_MAX;
55 chg.q[i] = (float)rand()/(float)RAND_MAX;
56 }
57 printf("Initialization complete.\n");
58
59 for (int t=0; t<nTrials; t++){


60 potential[0:n*n]=0.0f;
61 const double t0 = omp_get_wtime();
62 #pragma omp parallel for schedule(dynamic)
63 for (int j = 0; j < n*n; j++) {
64 const float Rx = (float)(j % n);
65 const float Ry = (float)(j / n);
66 const float Rz = 0.0f;
67 calculate_electric_potential(m, chg, Rx, Ry, Rz, potential[j]);
68 }
69 const double t1 = omp_get_wtime();
70
71 if ( t>= 2) {
72 printf("time: %.6f\n", t1-t0);
73 }
74 }
75 free(chg.x);
76 free(chg.y);
77 free(chg.z);
78 free(potential);
79 }

Back to Lab A.4.4

B.4.5 Vector Optimization: Assisting the Compiler
Back to Lab A.4.5.

B.4.5.1 labs/4/4.5-vectorization-compiler-hints/step-00/Makefile

CXX = icpc
CXXFLAGS = -openmp -vec-report3

OBJECTS = main.o worker.o
MICOBJECTS = main.oMIC worker.oMIC

.SUFFIXES: .o .cc .oMIC

.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: runme runmeMIC

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC

Back to Lab A.4.5.

B.4.5.2 labs/4/4.5-vectorization-compiler-hints/step-00/main.cc


1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cc, located at 4/4.5-vectorization-compiler-hints/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <cstdlib>
12 #include <omp.h>
13 #include <math.h>
14 #include "worker.h"
15
16 double CheckResult(const float* M, const float* A, const float* B, float* C,
17 const size_t m, const size_t n) {
18 C[0:m]=0.0f;
19 // Calculating the correct matrix-vector product
20 _Cilk_for (int i = 0; i < m; i++)
21 for (int j = 0; j < n; j++)
22 C[i] += M[i*n + j]*A[j];
23 double err = 0.0;
24 for (int i = 0; i < m; i++)
25 err += (C[i]-B[i])*(C[i]-B[i])/(double)m;
26
27 if (m*n<=256L) {
28 // For small matrix, output elements on the screen
29 for (int i = 0; i < m; i++) {
30 printf("(");
31 for (int j = 0; j < n; j++)
32 printf(" %5.3f", M[i*n+j]);
33 printf(") ");
34 if (i == m/2) { printf("x"); } else { printf(" "); }
35 printf(" (%5.3f)", A[i]);
36 if (i == m/2) { printf(" ="); } else { printf(" "); }
37 printf(" (%5.3f) (correct=%5.3f)\n", B[i], C[i]);
38 }
39 }
40
41 return sqrt(err);
42 }
43
44 void FillSparseMatrix(const int m, const int n, const int nb, const int bl,
45 float* const M) {
46 M[0:n*m] = 0.0f;
47 for (int b = 0; b < nb; b++) {
48 // Initializing a random sparse matrix
49 int i = rand()%m;
50 int blockStart = rand()%n;
51 // This expression gives a probability distribution that peaks at bl
52 int blockLength = 1 + (int)( (8.0f * (0.125f +
53 powf((float)rand()/(float)RAND_MAX - 0.5f, 3)))*(float)bl);
54 if (blockStart + blockLength > m-1)
55 blockLength = m-1 - blockStart;
56 for (int j = blockStart; j < blockStart + blockLength; j++)
57 M[i*n + j] = (float)rand()/(float)RAND_MAX;
58 }
59 }
60
61 void TestSparseMatrixTimesVector (const int nTrials, const int skip_it, const int m,
62 const int n, const float* M, PackedSparseMatrix* pM, float* A, float* B, float* C) {


63
64 double tAvg=0.0, dt=0.0;
65 for (int t=0; t<nTrials; t++){
66 for (int i = 0; i < n; i++)
67 A[i] = (float)rand()/(float)RAND_MAX;
68 const double t0 = omp_get_wtime();
69 pM->MultiplyByVector(A, B);
70 const double t1 = omp_get_wtime();
71 if (t==0) {
72 printf("iteration %d: time=%.2f ms, error=%f\n",
73 t, (t1-t0)*1e3, CheckResult(M, A, B, C, m, n));
74 } else {
75 printf("iteration %d: time=%.2f ms\n",
76 t, (t1-t0)*1e3);
77 }
78 if (t>=skip_it) {
79 tAvg += (t1-t0);
80 dt += (t1-t0)*(t1-t0);
81 }
82 }
83 tAvg /= (double)(nTrials-skip_it);
84 dt /= (double)(nTrials-skip_it);
85 dt = sqrt(dt-tAvg*tAvg);
86 printf("Average: %.2f +- %.2f ms per iteration.\n", tAvg*1e3, dt*1e3);
87 fflush(0);
88
89 }
90
91 int main(int argv, char* argc[]){
92
93 // Generating output to illustrate the algorithm
94 printf("Demonstration:\n");
95 float* M = (float*) malloc(sizeof(float)*16*16);
96 float* A = (float*) malloc(sizeof(float)*16);
97 float* B = (float*) malloc(sizeof(float)*16);
98 float* C = (float*) malloc(sizeof(float)*16);
99 FillSparseMatrix(16, 16, 10, 3, M);
100 PackedSparseMatrix pM(16, 16, M, true);
101 TestSparseMatrixTimesVector(1, 0, 16, 16, M, &pM, A, B, C);
102 free(M); free(A); free(B); free(C);
103
104 printf("\nPreparing for a benchmark...\n"); fflush(0);
105 const size_t n=20000;
106 const size_t m=20000;
107 const int nTrials=50;
108 M = (float*)malloc(sizeof(float)*n*m);
109 A = (float*)malloc(sizeof(float)*n);
110 B = (float*)malloc(sizeof(float)*n);
111 C = (float*)malloc(sizeof(float)*n);
112 FillSparseMatrix(m, n, (n*m/1000), 100, M);
113 PackedSparseMatrix pM2(m, n, M, true);
114
115 printf("\nBenchmark:\n"); fflush(stdout);
116 const int skip_it = 10; // First few iterations on the coprocessor are warm-up; skip them
117 TestSparseMatrixTimesVector(nTrials, skip_it, m, n, M, &pM2, A, B, C);
118
119 free(M); free(A); free(B); free(C);
120 }

Back to Lab A.4.5.


B.4.5.3 labs/4/4.5-vectorization-compiler-hints/step-00/worker.h

1 #ifndef __INCLUDE_WORKER_H__
2 #define __INCLUDE_WORKER_H__
3
4 class PackedSparseMatrix {
5 const int nRows; // Number of matrix rows
6 const int nCols; // Number of matrix columns
7 int nBlocks; // Number of contiguous non-zero blocks
8
9 float* packedData; // Non-zero elements of the matrix in packed form
10 int* blocksInRow; // The number of non-zero blocks in the respective row
11 int* blockFirstIdxInRow; // The index of the first non-zero blocks in the respective row
12 int* blockOffset; // Indices in the packedData array of the respective blocks
13 int* blockLen; // Lengths of the respective blocks
14 int* blockCol; // The column number of the first element in the respective block
15
16 public:
17
18 PackedSparseMatrix(const int m, const int n, const float* M, const bool verbose);
19 ~PackedSparseMatrix();

20 void MultiplyByVector(const float* inVector, float* outVector);
21
22 };
23
24 #endif

Back to Lab A.4.5.


B.4.5.4 labs/4/4.5-vectorization-compiler-hints/step-00/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file worker.cc, located at 4/4.5-vectorization-compiler-hints/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include "worker.h"
11 #include <stdlib.h>
12 #include <stdio.h>
13
14 PackedSparseMatrix::PackedSparseMatrix(const int m, const int n, const float* M,
15 const bool verbose) : nRows(m), nCols(n) {
16 // Calculating the number of non-zero blocks;
17 nBlocks = 0;
18 int nData = 0;
19 for (int i = 0; i < nRows; i++) {
20 int j = 0;
21 bool inBlock = false;
22 while (j < nCols) {
23 if (M[i*nCols + j] != 0) {
24 if (!inBlock) {
25 nBlocks++;
26 inBlock=true;
27 }
28 nData++;
29 } else {


30 if (inBlock) inBlock=false;
31 }
32 j++;
33 }
34 }
35
36 // Allocating data for packed storage
37 packedData = (float*)malloc(sizeof(float)*nData);
38 blocksInRow = (int*) malloc(sizeof(float)*nRows);
39 blockFirstIdxInRow = (int*) malloc(sizeof(float)*nRows);
40 blockOffset = (int*) malloc(sizeof(float)*nBlocks);
41 blockLen = (int*) malloc(sizeof(float)*nBlocks);
42 blockCol = (int*) malloc(sizeof(float)*nBlocks);
43
44 int pos = 0;
45 int idx = -1;
46 for (int i = 0; i < nRows; i++) {
47 blocksInRow[i] = 0;
48 int j = 0;
49 bool inBlock = false;
50 bool firstBlock = true;
while (j < nCols) {

g
51

an
52 if (M[i*nCols + j] != 0) {

W
53 if (!inBlock) {
// Begin block

g
54

en
55 idx++; h
56 inBlock=true;
un
57 blocksInRow[i]++;
rY

58 if (firstBlock) {
fo

59 blockFirstIdxInRow[i] = idx;
firstBlock = false;
ed

60
61 }
ar

62 blockOffset[idx] = pos;
ep

63 blockLen[idx] = 1;
Pr

64 blockCol[idx] = j;
y

65 } else {
el

66 // Continue block
siv

67 blockLen[idx]++;
u

}
cl

68
packedData[pos++] = M[i*nCols + j];
Ex

69
70 } else {
71 // End block
72 if (inBlock)
73 inBlock=false;
74 }
75 // Continue parsing
76 j++;
77 }
78 }
79
80 if (verbose) {
81 printf("Results of packing a sparse %d x %d matrix:\nContains %d non-zero blocks,\
82 a total of %d non-zero elements.\n", nRows, nCols, nBlocks, nData);
83 printf("Average number of non-zero blocks per row: %d\n",
84 (int)((float)nBlocks/(float)nRows));
85 printf("Average length of non-zero blocks: %d\n", (int)((float)(nData)/(float)(nBlocks)));
86 printf("Matrix fill factor: %.2f%%\n", (float)nData/(float)(nRows*nCols)*100.0f);
87 }
88
89 }
90
91 PackedSparseMatrix::~PackedSparseMatrix() {


92 free(packedData);
93 free(blocksInRow);
94 free(blockFirstIdxInRow);
95 free(blockOffset);
96 free(blockLen);
97 free(blockCol);
98 }
99
100 void PackedSparseMatrix::MultiplyByVector(const float* inVector, float* outVector) {
101 #pragma omp parallel for schedule(dynamic,30)
102 for (int i = 0; i < nRows; i++) {
103 outVector[i] = 0.0f;
104 for (int nb = 0; nb < blocksInRow[i]; nb++) {
105 const int idx = blockFirstIdxInRow[i]+nb;
106 const int offs = blockOffset[idx];
107 const int j0 = blockCol[idx];
108 // Variable sum is needed for more efficient automatic vectorization of reduction.
109 float sum = 0.0f;
110 for (int c = 0; c < blockLen[idx]; c++) {
111 sum += packedData[offs+c]*inVector[j0+c];
112 }
113 outVector[i] += sum;
114 }
115 }
116 }

Back to Lab A.4.5.


Note: in this lab, between steps 00 and 01, only the file worker.cc is changed.

B.4.5.5 labs/4/4.5-vectorization-compiler-hints/step-01/worker.cc
1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file worker.cc, located at 4/4.5-vectorization-compiler-hints/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include "worker.h"
11 #include <stdlib.h>
12 #include <stdio.h>
13
14 PackedSparseMatrix::PackedSparseMatrix(const int m, const int n, const float* M,
15 const bool verbose) : nRows(m), nCols(n) {
16 // Calculating the number of non-zero blocks;
17 nBlocks = 0;
18 int nData = 0;
19 for (int i = 0; i < nRows; i++) {
20 int j = 0;
21 bool inBlock = false;
22 while (j < nCols) {
23 if (M[i*nCols + j] != 0) {
24 if (!inBlock) {
25 nBlocks++;
26 inBlock=true;
27 }
28 nData++;
29 } else {


30 if (inBlock) inBlock=false;
31 }
32 j++;
33 }
34 }
35
36 // Allocating data for packed storage
37 packedData = (float*)malloc(sizeof(float)*nData);
38 blocksInRow = (int*) malloc(sizeof(float)*nRows);
39 blockFirstIdxInRow = (int*) malloc(sizeof(float)*nRows);
40 blockOffset = (int*) malloc(sizeof(float)*nBlocks);
41 blockLen = (int*) malloc(sizeof(float)*nBlocks);
42 blockCol = (int*) malloc(sizeof(float)*nBlocks);
43
44 int pos = 0;
45 int idx = -1;
46 for (int i = 0; i < nRows; i++) {
47 blocksInRow[i] = 0;
48 int j = 0;
49 bool inBlock = false;
50 bool firstBlock = true;
while (j < nCols) {

g
51

an
52 if (M[i*nCols + j] != 0) {

W
53 if (!inBlock) {
// Begin block

g
54

en
55 idx++; h
56 inBlock=true;
un
57 blocksInRow[i]++;
rY

58 if (firstBlock) {
fo

59 blockFirstIdxInRow[i] = idx;
firstBlock = false;
ed

60
61 }
ar

62 blockOffset[idx] = pos;
ep

63 blockLen[idx] = 1;
Pr

64 blockCol[idx] = j;
y

65 } else {
el

66 // Continue block
siv

67 blockLen[idx]++;
u

}
cl

68
packedData[pos++] = M[i*nCols + j];
Ex

69
70 } else {
71 // End block
72 if (inBlock)
73 inBlock=false;
74 }
75 // Continue parsing
76 j++;
77 }
78 }
79
80 if (verbose) {
81 printf("Results of packing a sparse %d x %d matrix:\nContains %d non-zero blocks,\
82 a total of %d non-zero elements.\n", nRows, nCols, nBlocks, nData);
83 printf("Average number of non-zero blocks per row: %d\n",
84 (int)((float)nBlocks/(float)nRows));
85 printf("Average length of non-zero blocks: %d\n", (int)((float)(nData)/(float)(nBlocks)));
86 printf("Matrix fill factor: %.2f%%\n", (float)nData/(float)(nRows*nCols)*100.0f);
87 }
88
89 }
90
91 PackedSparseMatrix::~PackedSparseMatrix() {


92 free(packedData);
93 free(blocksInRow);
94 free(blockFirstIdxInRow);
95 free(blockOffset);
96 free(blockLen);
97 free(blockCol);
98 }
99
100 void PackedSparseMatrix::MultiplyByVector(const float* inVector, float* outVector) {
101 #pragma omp parallel for schedule(dynamic,30)
102 for (int i = 0; i < nRows; i++) {
103 outVector[i] = 0.0f;
104 for (int nb = 0; nb < blocksInRow[i]; nb++) {
105 const int idx = blockFirstIdxInRow[i]+nb;
106 const int offs = blockOffset[idx];
107 const int j0 = blockCol[idx];
108 // Variable sum is needed for more efficient automatic vectorization of reduction.
109 float sum = 0.0f;
110 // Pragma loop count assists the application at runtime in choosing
111 // the optimal execution path. It only leads to an increase in performance
112 // when the actual loop count in the problem is in agreement
113 // with this compile-time prediction.
114 #pragma loop_count avg(100)
115 for (int c = 0; c < blockLen[idx]; c++) {
116 sum += packedData[offs+c]*inVector[j0+c];
117 }
118 outVector[i] += sum;
119 }
120 }
121 }

Back to Lab A.4.5.


Note: in this lab, between steps 01 and 02, files main.cc, worker.h and worker.cc are changed.

B.4.5.6 labs/4/4.5-vectorization-compiler-hints/step-02/main.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file main.cc, located at 4/4.5-vectorization-compiler-hints/step-02


3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <cstdlib>
12 #include <omp.h>
13 #include <math.h>
14 #include "worker.h"
15
16 double CheckResult(const float* M, const float* A, const float* B, float* C,
17 const size_t m, const size_t n) {
18 C[0:m]=0.0f;
19 // Calculating the correct matrix-vector product
20 _Cilk_for (int i = 0; i < m; i++)
21 for (int j = 0; j < n; j++)
22 C[i] += M[i*n + j]*A[j];
23 double err = 0.0;
24 for (int i = 0; i < m; i++)


25 err += (C[i]-B[i])*(C[i]-B[i])/(double)m;
26
27 if (m*n<=256L) {
28 // For small matrix, output elements on the screen
29 for (int i = 0; i < m; i++) {
30 printf("(");
31 for (int j = 0; j < n; j++)
32 printf(" %5.3f", M[i*n+j]);
33 printf(") ");
34 if (i == m/2) { printf("x"); } else { printf(" "); }
35 printf(" (%5.3f)", A[i]);
36 if (i == m/2) { printf(" ="); } else { printf(" "); }
37 printf(" (%5.3f) (correct=%5.3f)\n", B[i], C[i]);
38 }
39 }
40
41 return sqrt(err);
42 }
43
44 void FillSparseMatrix(const int m, const int n, const int nb, const int bl,
45 float* const M) {
M[0:n*m] = 0.0f;

g
46

an
47 for (int b = 0; b < nb; b++) {

W
48 // Initializing a random sparse matrix
int i = rand()%m;

g
49

en
50 int blockStart = rand()%n; h
51 // This expression gives a probability distribution that peaks at bl
un
52 int blockLength = 1 + (int)( (8.0f * (0.125f +
rY

53 powf((float)rand()/(float)RAND_MAX - 0.5f, 3)))*(float)bl);


fo

54 if (blockStart + blockLength > m-1)


blockLength = m-1 - blockStart;
ed

55
56 for (int j = blockStart; j < blockStart + blockLength; j++)
ar

57 M[i*n + j] = (float)rand()/(float)RAND_MAX;
ep

58 }
Pr

59 }
y

60
el

61 void TestSparseMatrixTimesVector (const int nTrials, const int skip_it, const int m,
siv

62 const int n, const float* M, PackedSparseMatrix* pM, float* A, float* B, float* C) {


u
cl

63
double tAvg=0.0, dt=0.0;
Ex

64
65 for (int t=0; t<nTrials; t++){
66 for (int i = 0; i < n; i++)
67 A[i] = (float)rand()/(float)RAND_MAX;
68 const double t0 = omp_get_wtime();
69 pM->MultiplyByVector(A, B);
70 const double t1 = omp_get_wtime();
71 if (t==0) {
72 printf("iteration %d: time=%.2f ms, error=%f\n",
73 t, (t1-t0)*1e3, CheckResult(M, A, B, C, m, n));
74 } else {
75 printf("iteration %d: time=%.2f ms\n",
76 t, (t1-t0)*1e3);
77 }
78 if (t>=skip_it) {
79 tAvg += (t1-t0);
80 dt += (t1-t0)*(t1-t0);
81 }
82 }
83 tAvg /= (double)(nTrials-skip_it);
84 dt /= (double)(nTrials-skip_it);
85 dt = sqrt(dt-tAvg*tAvg);
86 printf("Average: %.2f +- %.2f ms per iteration.\n", tAvg*1e3, dt*1e3);


87 fflush(0);
88
89 }
90
91 int main(int argv, char* argc[]){
92
93 // Generating output to illustrate the algorithm
94 printf("Demonstration:\n");
95 float* M = (float*) _mm_malloc(sizeof(float)*16*16, ALIGN_BYTES);
96 float* A = (float*) _mm_malloc(sizeof(float)*16, ALIGN_BYTES);
97 float* B = (float*) malloc(sizeof(float)*16);
98 float* C = (float*) malloc(sizeof(float)*16);
99 FillSparseMatrix(16, 16, 10, 3, M);
100 PackedSparseMatrix pM(16, 16, M, true);
101 TestSparseMatrixTimesVector(1, 0, 16, 16, M, &pM, A, B, C);
102 _mm_free(M); _mm_free(A); free(B); free(C);
103
104 printf("\nPreparing for a benchmark...\n"); fflush(0);
105 const size_t n=20000;
106 const size_t m=20000;
107 const int nTrials=50;
108 M = (float*)_mm_malloc(sizeof(float)*n*m, ALIGN_BYTES);
109 A = (float*)_mm_malloc(sizeof(float)*n, ALIGN_BYTES);
110 B = (float*)malloc(sizeof(float)*n);
111 C = (float*)malloc(sizeof(float)*n);
112 FillSparseMatrix(m, n, (n*m/1000), 100, M);
113 PackedSparseMatrix pM2(m, n, M, true);
114
115 printf("\nBenchmark:\n"); fflush(stdout);
116 const int skip_it = 10; // First few iterations on the coprocessor are warm-up; skip them
117 TestSparseMatrixTimesVector(nTrials, skip_it, m, n, M, &pM2, A, B, C);
118
119 _mm_free(M); _mm_free(A); free(B); free(C);
120 }

Back to Lab A.4.5.

B.4.5.7 labs/4/4.5-vectorization-compiler-hints/step-02/worker.h
Ex

1 #ifndef __INCLUDE_WORKER_H__
2 #define __INCLUDE_WORKER_H__
3
4 // The size of the cache line and also the size of the vector register on coprocessor
5 #define ALIGN_BYTES 64
6 // The number of 32-bit floats that fit in ALIGN_BYTES
7 #define ALIGN_FLOATS 16
8
9 class PackedSparseMatrix {
10 const int nRows; // Number of matrix rows
11 const int nCols; // Number of matrix columns
12 int nBlocks; // Number of contiguous non-zero blocks
13
14 float* packedData; // Non-zero elements of the matrix in packed form
15 int* blocksInRow; // The number of non-zero blocks in the respective row
16 int* blockFirstIdxInRow; // The index of the first non-zero blocks in the respective row
17 int* blockOffset; // Indices in the packedData array of the respective blocks
18 int* blockLen; // Lengths of the respective blocks
19 int* blockCol; // The column number of the first element in the respective block
20
21 public:

Parallel Programming and Optimization with Intel Xeon Phi Coprocessors


B.4. SOURCE CODE FOR CHAPTERS 4 AND 5: OPTIMIZING APPLICATIONS 419

22
23 PackedSparseMatrix(const int m, const int n, const float* M, const bool verbose);
24 ~PackedSparseMatrix();
25 void MultiplyByVector(const float* inVector, float* outVector);
26
27 };
28
29 #endif

Back to Lab A.4.5.

B.4.5.8 labs/4/4.5-vectorization-compiler-hints/step-02/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.5-vectorization-compiler-hints/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission

g
7 from Colfax International is prohibited.

an
8 Contact information can be found at http://colfax-intl.com/ */

W
9

g
10 #include "worker.h"
11 #include
#include
<stdlib.h>
<stdio.h> en
h
un
12
#include <assert.h>
rY

13
14
fo

15 PackedSparseMatrix::PackedSparseMatrix(const int m, const int n, const float* M,


ed

16 const bool verbose) : nRows(m), nCols(n) {


ar

17 // Calculating the number of non-zero blocks


ep

18 // Assuming that the number of columns is a multiple


// of ALIGN_FLOATS to avoid code complication
Pr

19
20 assert(nCols % ALIGN_FLOATS == 0);
y

nBlocks = 0;
el

21
siv

22 int nData = 0;
23 for (int i = 0; i < nRows; i++) {
u
cl

24 int j = 0;
Ex

25 bool inBlock = false;


26 while (j < nCols) {
27 float sum = 0.0f;
28 for (int jj = j; jj < j+ALIGN_FLOATS; jj++)
29 sum += M[i*nCols + jj];
30 // If one of consecutive ALIGN_FLOATS elements is non-zero,
31 // the whole block is packed
32 if (sum != 0) {
33 if (!inBlock) {
34 nBlocks++;
35 inBlock=true;
36 }
37 nData += ALIGN_FLOATS;
38 } else {
39 if (inBlock) inBlock=false;
40 }
41 j += ALIGN_FLOATS;
42 }
43 }
44
45 // Allocating data for packed storage
46 packedData = (float*)_mm_malloc(sizeof(float)*nData, ALIGN_BYTES);
47 blocksInRow = (int*) malloc(sizeof(float)*nRows);


48 blockFirstIdxInRow = (int*) malloc(sizeof(float)*nRows);


49 blockOffset = (int*) malloc(sizeof(float)*nBlocks);
50 blockLen = (int*) malloc(sizeof(float)*nBlocks);
51 blockCol = (int*) malloc(sizeof(float)*nBlocks);
52
53 int pos = 0;
54 int idx = -1;
55 for (int i = 0; i < nRows; i++) {
56 blocksInRow[i] = 0;
57 int j = 0;
58 bool inBlock = false;
59 bool firstBlock = true;
60 while (j < nCols) {
61 float sum = 0.0f;
62 for (int jj = j; jj < j+ALIGN_FLOATS; jj++)
63 sum += M[i*nCols + jj];
64 // If one of consecutive ALIGN_FLOATS elements is non-zero,
65 // the whole block is packed
66 if (sum != 0) {
67 if (!inBlock) {
68 // Begin block
69 idx++;
70 inBlock=true;
71 blocksInRow[i]++;
72 if (firstBlock) {
73 blockFirstIdxInRow[i] = idx;
74 firstBlock = false;
75 }
76 blockOffset[idx] = pos;
77 blockLen[idx] = 16;
78 blockCol[idx] = j;
79 } else {
80 // Continue block
81 blockLen[idx] += 16;
82 }
83 for (int jj = j; jj < j+ALIGN_FLOATS; jj++)
84 packedData[pos++] = M[i*nCols + jj];
85 } else {
86 // End block
87 if (inBlock)
88 inBlock=false;
89 }
90 // Continue parsing
91 j+=16;
92 }
93 }
94
95 if (verbose) {
96 printf("Results of packing a sparse %d x %d matrix:\nContains %d non-zero blocks,\
97 a total of %d non-zero elements.\n",
98 nRows, nCols, nBlocks, nData);
99 printf("Average number of non-zero blocks per row: %d\n",
100 (int)((float)nBlocks/(float)nRows));
101 printf("Average length of non-zero blocks: %d\n",
102 (int)((float)(nData)/(float)(nBlocks)));
103 printf("Matrix fill factor: %f%%\n",
104 (float)nData/(float)(nRows*nCols)*100.0f);
105 }
106
107 }
108
109 PackedSparseMatrix::~PackedSparseMatrix() {


110 _mm_free(packedData);
111 free(blocksInRow);
112 free(blockFirstIdxInRow);
113 free(blockOffset);
114 free(blockLen);
115 free(blockCol);
116 }
117
118 void PackedSparseMatrix::MultiplyByVector(const float* inVector, float* outVector) {
119 #pragma omp parallel for schedule(dynamic,30)
120 for (int i = 0; i < nRows; i++) {
121 outVector[i] = 0.0f;
122 for (int nb = 0; nb < blocksInRow[i]; nb++) {
123 const int idx = blockFirstIdxInRow[i]+nb;
124 const int offs = blockOffset[idx];
125 const int j0 = blockCol[idx];
126 // Variable sum is needed for more efficient automatic vectorization of
127 // reduction.
128 float sum = 0.0f;
129 // Pragma vector aligned makes a promise to the compiler that the elements of
130 // vectorized arrays accessed in the first iteration are aligned on a 64-byte
131 // boundary. Pragma loop count assists the application at runtime in choosing
132 // the optimal execution path. It only leads to an increase in performance
133 // when the actual loop count in the problem is in agreement
134 // with this compile-time prediction.
135 #pragma vector aligned
136 #pragma loop count avg(128) min(16)
137 for (int c = 0; c < blockLen[idx]; c++) {
138 sum += packedData[offs+c]*inVector[j0+c];
139 }
140 outVector[i] += sum;
141 }
142 }
143 }

Back to Lab A.4.5.
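
As a quick illustration of how this class is used, the following minimal sketch (not part of the lab) packs a dense matrix and multiplies it by a vector. The sizes, the fill pattern, and the driver itself are made up for illustration; ALIGN_FLOATS and ALIGN_BYTES are assumed to be defined in worker.h as in the header listing above.

// Hypothetical driver for PackedSparseMatrix (illustration only).
#include "worker.h"
#include <malloc.h>

int main() {
  const int m = 1024, n = 1024;     // n must be a multiple of ALIGN_FLOATS
  float* M = (float*)_mm_malloc(sizeof(float)*m*n, ALIGN_BYTES);
  float* x = (float*)_mm_malloc(sizeof(float)*n,   ALIGN_BYTES);
  float* y = (float*)_mm_malloc(sizeof(float)*m,   ALIGN_BYTES);
  for (int i = 0; i < m*n; i++) M[i] = (i % 97 == 0 ? 1.0f : 0.0f); // sparse fill
  for (int j = 0; j < n;   j++) x[j] = 1.0f;
  PackedSparseMatrix A(m, n, M, true);   // verbose=true prints the packing statistics
  A.MultiplyByVector(x, y);              // y = A*x over the packed non-zero blocks
  _mm_free(M); _mm_free(x); _mm_free(y);
}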



B.4.6 Vector Optimization: Branches in Auto-Vectorized Loops



Back to Lab A.4.6.

B.4.6.1 labs/4/4.6-vectorization-branches/step-00/Makefile

CXX = icpc
CXXFLAGS = -openmp -vec-report3 -g -O3

OBJECTS = main.o worker.o


MICOBJECTS = main.oMIC worker.oMIC

.SUFFIXES: .o .cc .oMIC

.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: runme runmeMIC


runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC

B.4.6.2 labs/4/4.6-vectorization-branches/step-00/main.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cc, located at 4/4.6-vectorization-branches/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */

9
10 #include <stdio.h>
11 #include <cstdlib>
12 #include <omp.h>
13 #include <math.h>
14
15 typedef void (*FunctionPtrType)(const int, const int, const int*, float*);
16
17 void MaskedOperations(const int m, const int n, const int* flag, float* data);
18 void NonMaskedOperations(const int m, const int n, const int* flag, float* data);
19
20 void Benchmark(const int m, const int n, const int* flag,
21 float* data, FunctionPtrType functionPtr) {
22 const int nTrials = 8; const int skip_it = 1;
23 double tAvg=0.0, dt=0.0;
24 for (int t=0; t<nTrials; t++){
25 const double t0 = omp_get_wtime();
26 (*functionPtr)(m, n, flag, data);
27 const double t1 = omp_get_wtime();
28 if (t>=skip_it) {
29 tAvg += (t1-t0);
30 dt += (t1-t0)*(t1-t0);
31 }
32 printf("Iteration = %d, time=%.3f ms\n", t, (t1-t0)*1e3);
33 }
34 tAvg /= (double)(nTrials-skip_it);
35 dt /= (double)(nTrials-skip_it);
36 dt = sqrt(dt-tAvg*tAvg);
37 printf("Average: %.2f +- %.2f ms per iteration.\n\n", tAvg*1e3, dt*1e3);
38 fflush(0);
39 }
40
41 int main(int argv, char* argc[]){
42
43 const size_t n = 1L<<15L;
44 const size_t m = 1L<<12L;
45 float* data = (float*) _mm_malloc(sizeof(float)*m*n, 64); data[0:m*n] = 1.0f;
46 int* flag = (int*) _mm_malloc(sizeof(float)*n, 64);
47
48 for (int c = 0; c < 5; c++) {
49 FunctionPtrType fp;


50 if (c==0) {
51 flag[0:n] = 0;
52 fp = &MaskedOperations;
53 printf("Masked calculation, all branches not taken, none of the elements \
54 computed:\n");
55 } else if (c==1) {
56 flag[0:n] = 1;
57 fp = &MaskedOperations;
58 printf("Masked calculation, all branches taken, all elements computed:\n");
59 } else if (c==2) {
60 flag[0:n/2:2] = 0;
61 flag[1:n/2:2] = 1;
62 fp = &MaskedOperations;
63 printf("Masked calculation, half of the branches taken, half of the elements\
64 computed (stride 2):\n");
65 } else if (c==3) {
66 for (int k = 0; k < 16; k++)
67 flag[k:n/32:32] = 0;
68 for (int k = 16; k < 32; k++)
69 flag[k:n/32:32] = 1;
70 fp = &MaskedOperations;
71 printf("Masked calculation, half of the branches taken, half of the elements\
72 computed (stride 16):\n");
73 } else if (c==4) {
74 flag[0:n] = 0;
75 fp = &NonMaskedOperations;
76 printf("Unmasked calculation, all elements computed:\n");
77 }
78 Benchmark(m, n, flag, data, fp);
79 }
80
81 _mm_free(data);
82 _mm_free(flag);
83
84 }

B.4.6.3 labs/4/4.6-vectorization-branches/step-00/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.6-vectorization-branches/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <math.h>
11
12 void MaskedOperations(const int m, const int n, const int* flag, float* data) {
13 #pragma omp parallel for schedule(dynamic)
14 for (int i = 0; i < m; i++)
15 for (int j = 0; j < n; j++) {
16 if (flag[j] == 1)
17 data[i*n+j] = sqrtf(data[i*n+j]);
18 }
19 }
20
21 void NonMaskedOperations(const int m, const int n, const int* flag, float* data) {
22 #pragma omp parallel for schedule(dynamic)


23 for (int i = 0; i < m; i++)


24 for (int j = 0; j < n; j++) {
25 data[i*n+j] = sqrtf(data[i*n+j]);
26 }
27 }
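
Step-01 below differs from this version only by a #pragma simd hint that asks the compiler to vectorize the loop with the branch intact, using masked vector operations. As an aside (not part of the lab), the same masked update can also be written without a branch; many compilers turn the conditional expression into a vector blend even without a hint, at the cost of rewriting every element:

// Hypothetical branchless variant of MaskedOperations (illustration only).
#include <math.h>

void MaskedOperationsBlend(const int m, const int n, const int* flag, float* data) {
#pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
      data[i*n+j] = (flag[j] == 1 ? sqrtf(data[i*n+j]) : data[i*n+j]);
}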

B.4.6.4 labs/4/4.6-vectorization-branches/step-01/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.6-vectorization-branches/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <math.h>
11
12 void MaskedOperations(const int m, const int n, const int* flag, float* data) {
13 #pragma omp parallel for schedule(dynamic)
14 for (int i = 0; i < m; i++)
15 #pragma simd
16 for (int j = 0; j < n; j++) {
17 if (flag[j] == 1)
18 data[i*n+j] = sqrtf(data[i*n+j]);
19 }
20 }
21
22 void NonMaskedOperations(const int m, const int n, const int* flag, float* data) {
23 #pragma omp parallel for schedule(dynamic)
24 for (int i = 0; i < m; i++)
25 #pragma simd
26 for (int j = 0; j < n; j++) {
27 data[i*n+j] = sqrtf(data[i*n+j]);
28 }
29 }

B.4.7 Shared-Memory Optimization: Reducing the Synchronization Cost


Back to Lab A.4.7.

B.4.7.1 labs/4/4.7-optimize-shared-mutexes/step-00/Makefile

CXX = icpc
CXXFLAGS = -openmp -mkl -vec-report2

OBJECTS = main.o worker.o


MICOBJECTS = main.oMIC worker.oMIC

.SUFFIXES: .o .cc .oMIC

.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"


all: runme runmeMIC

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC

Back to Lab A.4.7.

B.4.7.2 labs/4/4.7-optimize-shared-mutexes/step-00/main.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cc, located at 4/4.7-optimize-shared-mutexes/step-00
3 is a part of the practical supplement to the handbook

4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <cstdlib>
12 #include <omp.h>
13 #include <math.h>
14 #include <mkl_vsl.h>
15
16 void HistogramReference(const float* age, int* const group, const int n,
17 const float group_width){
18 // Plain (scalar, sequential) algorithm for computing the reference histogram
19 for (long i = 0; i < n; i++){
20 const int j = (int) floorf( age[i] / group_width );
21 group[j]++;
22 }
23 }
24
25 void Histogram(const float* age, int* const group, const int n, const float group_width,
26 const int m);
27
28 int main(int argv, char* argc[]){
29 const size_t n=1L<<30L;
30 const float max_age=99.999f;
31 const float group_width=20.0f;
32 const size_t m = (size_t) floorf(max_age/group_width + 0.1f);
33 const int nTrials=10;
34
35 float* age = (float*) _mm_malloc(sizeof(int)*n, 64);
36 int group[m];
37 int ref_group[m];
38
39 // Initializing array of ages
40 printf("Initialization..."); fflush(0);
41 VSLStreamStatePtr rnStream;
42 vslNewStream( &rnStream, VSL_BRNG_MT19937, 1 );
43 vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, rnStream, n, age, 0.0f, max_age);
44 for (int i = 0; i < n; i++)


45 age[i] = age[i]*age[i]/max_age;
46
47 // Computing the "correct" answer
48 ref_group[:]=0;
49 HistogramReference(age, ref_group, n, group_width);
50 printf("complete.\n"); fflush(0);
51
52 for (int t=0; t<nTrials; t++){
53 group[:] = 0;
54
55 printf("Iteration %d...", t); fflush(0);
56 const double t0 = omp_get_wtime();
57 Histogram(age, group, n, group_width, m);
58 const double t1 = omp_get_wtime();
59
60 for (int i=0; i<m; i++) {
61 if (fabs((double)(ref_group[i]-group[i])) > 1e-4*fabs((double)(ref_group[i]
62 +group[i]))) {
63 printf("Result is incorrect!\n");
64 for (int i=0; i<m; i++) printf(" (%d vs %d)", group[i], ref_group[i]);
65 }
66 }
67 printf(" time: %.3f sec\n", t1-t0);
68
69 printf("Result: ");
70 for (int i=0; i<m; i++) printf("\t%d", group[i]);
71 printf("\n");
72 fflush(0);
73 }
74
75 _mm_free(age);
76 }
re

Back to Lab A.4.7.


B.4.7.3 labs/4/4.7-optimize-shared-mutexes/step-00/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.7-optimize-shared-mutexes/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 void Histogram(const float* age, int* const hist, const int n, const float group_width,
11 const int m) {
12 for (int i = 0; i < n; i++) {
13 const int j = (int) ( age[i] / group_width );
14 hist[j]++;
15 }
16 }

Back to Lab A.4.7.

B.4.7.4 labs/4/4.7-optimize-shared-mutexes/step-01/worker.cc


1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.7-optimize-shared-mutexes/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 void Histogram(const float* age, int* const hist, const int n, const float group_width,
11 const int m) {
12
13 const int vecLen = 16; // Length of vectorized loop (lower is better,
14 // but a multiple of 64/sizeof(int))
15 const float invGroupWidth = 1.0f/group_width; // Pre-compute the reciprocal
16 const int nPrime = n - n%vecLen; // nPrime is a multiple of vecLen
17
18 // Strip-mining the loop in order to vectorize the inner short loop
19 for (int ii = 0; ii < nPrime; ii+=vecLen) {
20 // Temporary storage for vecLen indices. Necessary for vectorization
21 int histIdx[vecLen] __attribute__((aligned(64)));

22
23 // Vectorize the multiplication and rounding
24 #pragma vector aligned
25 for (int i = ii; i < ii+vecLen; i++)
26 histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
27
28 // Scattered memory access, does not get vectorized
29 for (int c = 0; c < vecLen; c++)
30 hist[histIdx[c]]++;
31 }
32
33 // Finish with the tail of the data (if n is not a multiple of vecLen)
34 for (int i = nPrime; i < n; i++)
35 hist[(int) ( age[i] * invGroupWidth )]++;
36 }

Back to Lab A.4.7.
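
A quick check of the numbers in the comments above: vecLen = 16 is 64 bytes (one cache line) divided by sizeof(int) = 4, and it also matches the 16 single-precision lanes of the coprocessor's 512-bit vector registers, so the aligned histIdx[] buffer fills exactly one cache line per strip.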



B.4.7.5 labs/4/4.7-optimize-shared-mutexes/step-02/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.7-optimize-shared-mutexes/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 void Histogram(const float* age, int* const hist, const int n, const float group_width,
11 const int m) {
12
13 const int vecLen = 16; // Length of vectorized loop (lower is better,
14 // but a multiple of 64/sizeof(int))
15 const float invGroupWidth = 1.0f/group_width; // Pre-compute the reciprocal
16 const int nPrime = n - n%vecLen; // nPrime is a multiple of vecLen
17
18 // Distribute work across threads
19 // Strip-mining the loop in order to vectorize the inner short loop


20 #pragma omp parallel for schedule(guided)


21 for (int ii = 0; ii < nPrime; ii+=vecLen) {
22 // Temporary storage for vecLen indices. Necessary for vectorization
23 int histIdx[vecLen] __attribute__((aligned(64)));
24
25 // Vectorize the multiplication and rounding
26 #pragma vector aligned
27 for (int i = ii; i < ii+vecLen; i++)
28 histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
29
30 // Scattered memory access, does not get vectorized
31 for (int c = 0; c < vecLen; c++)
32 // Protect the ++ operation with the atomic mutex (inefficient!)
33 #pragma omp atomic
34 hist[histIdx[c]]++;
35 }
36
37 // Finish with the tail of the data (if n is not a multiple of vecLen)
38 for (int i = nPrime; i < n; i++)
39 hist[(int) ( age[i] * invGroupWidth )]++;
40 }

Back to Lab A.4.7.

B.4.7.6 labs/4/4.7-optimize-shared-mutexes/step-03/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.7-optimize-shared-mutexes/step-03
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 void Histogram(const float* age, int* const hist, const int n, const float group_width,
11 const int m) {
12
13 const int vecLen = 16; // Length of vectorized loop (lower is better,
14 // but a multiple of 64/sizeof(int))
15 const float invGroupWidth = 1.0f/group_width; // Pre-compute the reciprocal
16 const int nPrime = n - n%vecLen; // nPrime is a multiple of vecLen
17
18 #pragma omp parallel
19 {
20 // Private variable to hold a copy of histogram in each thread
21 int hist_priv[m];
22 hist_priv[:] = 0;
23
24 // Temporary storage for vecLen indices. Necessary for vectorization
25 int histIdx[vecLen] __attribute__((aligned(64)));
26
27 // Distribute work across threads
28 // Strip-mining the loop in order to vectorize the inner short loop
29 #pragma omp for schedule(guided)
30 for (int ii = 0; ii < nPrime; ii+=vecLen) {
31 // Vectorize the multiplication and rounding
32 #pragma vector aligned
33 for (int i = ii; i < ii+vecLen; i++)
34 histIdx[i-ii] = (int) ( age[i] * invGroupWidth );


35
36 // Scattered memory access, does not get vectorized
37 for (int c = 0; c < vecLen; c++)
38 hist_priv[histIdx[c]]++;
39 }
40
41 // Finish with the tail of the data (if n is not a multiple of vecLen)
42 #pragma omp single
43 for (int i = nPrime; i < n; i++)
44 hist_priv[(int) ( age[i] * invGroupWidth )]++;
45
46 // Reduce private copies into global histogram
47 for (int c = 0; c < m; c++) {
48 // Protect the += operation with the lightweight atomic mutex
49 #pragma omp atomic
50 hist[c] += hist_priv[c];
51 }
52 }
53 }

Back to Lab A.4.7.

B.4.7.7 labs/4/4.7-optimize-shared-mutexes/step-04/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file worker.cc, located at 4/4.7-optimize-shared-mutexes/step-04
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <omp.h>
11
12 void Histogram(const float* age, int* const hist, const int n, const float group_width,
13 const int m) {
14
15 const int vecLen = 16; // Length of vectorized loop (lower is better,
16 // but a multiple of 64/sizeof(int))
17 const float invGroupWidth = 1.0f/group_width; // Pre-compute the reciprocal
18 const int nPrime = n - n%vecLen; // nPrime is a multiple of vecLen
19 const int nThreads = omp_get_max_threads();
20 // Shared histogram with a private section for each thread
21 int hist_thr[nThreads][m];
22 hist_thr[:][:] = 0;
23
24 // Strip-mining the loop in order to vectorize the inner short loop
25 #pragma omp parallel
26 {
27 // Get the number of this thread
28 const int iThread = omp_get_thread_num();
29 // Temporary storage for vecLen indices. Necessary for vectorization
30 int histIdx[vecLen] __attribute__((aligned(64)));
31
32 #pragma omp for schedule(guided)
33 for (int ii = 0; ii < nPrime; ii+=vecLen) {
34
35 // Vectorize the multiplication and rounding
36 #pragma vector aligned


37 for (int i = ii; i < ii+vecLen; i++)


38 histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
39
40 // Scattered memory access, does not get vectorized.
41 // There is no synchronization in this for-loop,
42 // however, false sharing occurs here and ruins the performance
43 for (int c = 0; c < vecLen; c++)
44 hist_thr[iThread][histIdx[c]]++;
45 }
46 }
47
48 // Finish with the tail of the data (if n is not a multiple of vecLen)
49 for (int i = nPrime; i < n; i++)
50 hist[(int) ( age[i] * invGroupWidth )]++;
51
52 // Reducing results from all threads to the common histogram hist
53 for (int iThread = 0; iThread < nThreads; iThread++)
54 hist[0:m] += hist_thr[iThread][0:m];
55
56 }

Back to Lab A.4.7.

B.4.7.8 labs/4/4.7-optimize-shared-mutexes/step-05/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file worker.cc, located at 4/4.7-optimize-shared-mutexes/step-05
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <omp.h>
11
12 void Histogram(const float* age, int* const hist, const int n, const float group_width,
13 const int m) {
14
15 const int vecLen = 16; // Length of vectorized loop (lower is better,
16 // but a multiple of 64/sizeof(int))
17 const float invGroupWidth = 1.0f/group_width; // Pre-compute the reciprocal
18 const int nPrime = n - n%vecLen; // nPrime is a multiple of vecLen
19 const int nThreads = omp_get_max_threads();
20 // Padding for hist_thr[][] in order to avoid a situation
21 // where two (or more) rows share a cache line.
22 const int paddingBytes = 64;
23 const int paddingElements = paddingBytes / sizeof(int);
24 const int mPadded = m + (paddingElements-m%paddingElements);
25 // Shared histogram with a private section for each thread
26 int hist_thr[nThreads][mPadded] __attribute__((aligned(64)));
27 hist_thr[:][:] = 0;
28
29 // Strip-mining the loop in order to vectorize the inner short loop
30 #pragma omp parallel
31 {
32 // Get the number of this thread
33 const int iThread = omp_get_thread_num();
34 // Temporary storage for vecLen indices. Necessary for vectorization
35 int histIdx[vecLen] __attribute__((aligned(64)));


36
37 #pragma omp for schedule(guided)
38 for (int ii = 0; ii < nPrime; ii+=vecLen) {
39
40 // Vectorize the multiplication and rounding
41 #pragma vector aligned
42 for (int i = ii; i < ii+vecLen; i++)
43 histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
44
45 // Scattered memory access, does not get vectorized.
46 // There is no synchronization in this for-loop,
47 // however, false sharing occurs here and ruins the performance
48 for (int c = 0; c < vecLen; c++)
49 hist_thr[iThread][histIdx[c]]++;
50 }
51 }
52
53 // Finish with the tail of the data (if n is not a multiple of vecLen)
54 for (int i = nPrime; i < n; i++)

55 hist[(int) ( age[i] * invGroupWidth )]++;
56
57 // Reducing results from all threads to the common histogram hist
58 for (int iThread = 0; iThread < nThreads; iThread++)
59 hist[0:m] += hist_thr[iThread][0:m];
60
61 }
Back to Lab A.4.7.
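
For the sizes used in main.cc of this lab, the padding arithmetic in step-05 works out as follows (assuming a 4-byte int): m = floorf(99.999/20 + 0.1) = 5 age groups, paddingElements = 64/sizeof(int) = 16, and mPadded = 5 + (16 - 5 % 16) = 16. Each thread's row of hist_thr[][] therefore occupies exactly one 64-byte cache line, so counters belonging to different threads can no longer share a line. (When m is already a multiple of 16 the formula still adds one full line of padding, which is harmless.)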

B.4.8 Shared-Memory Optimization: Resolving Load Imbalance



Back to Lab A.4.8.



B.4.8.1 labs/4/4.8-optimize-scheduling/step-00/Makefile
CXX = icpc
CXXFLAGS = -openmp -mkl

OBJECTS = main.o worker.o
MICOBJECTS = main.oMIC worker.oMIC

.SUFFIXES: .o .cc .oMIC

.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: runme runmeMIC

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC


Back to Lab A.4.8.

B.4.8.2 labs/4/4.8-optimize-scheduling/step-00/main.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cc, located at 4/4.8-optimize-scheduling/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <cstdlib>
12 #include <omp.h>
13 #include <math.h>
14 #include <mkl_vsl.h>
15
16 int IterativeSolver(const int n, const double* M, const double* b, double* x,

17 const double minAccuracy);
18
19 void InitializeMatrix(const int n, double* M) {
20 // "Good" matrix for Jacobi method
21 for (int i = 0; i < n; i++) {
22 double sum = 0;
23 for (int j = 0; j < n; j++) {
24 M[i*n+j] = (double)(i*n+j);
25 sum += M[i*n+j];
26 }
27 sum -= M[i*n+i];
28 M[i*n+i] = 2.0*sum;
29 }
30 }
31
32 int main(int argv, char* argc[]){
33 printf("Initialization..."); fflush(0);
34 const int n = 128;
Ex

35 const int nBVectors = 1<<14; // The number of b-vectors


36 double* M = (double*) _mm_malloc(sizeof(double)*n*n, 64);
37 double* x = (double*) _mm_malloc(sizeof(double)*n*nBVectors, 64);
38 double* b = (double*) _mm_malloc(sizeof(double)*n*nBVectors, 64);
39 double* accuracy = (double*) _mm_malloc(sizeof(double)*nBVectors, 64);
40 InitializeMatrix(n, M);
41 VSLStreamStatePtr rnStream;
42 vslNewStream( &rnStream, VSL_BRNG_MT19937, 1234 );
43 vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, rnStream, n*nBVectors, b, 0.0, 1.0);
44 vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, rnStream, nBVectors, accuracy, 0.0, 1.0);
45 accuracy[0:nBVectors]=exp(-28.0+26.0*accuracy[0:nBVectors]);
46 printf(" initialized %d vectors and a [%d x %d] matrix\n",
47 nBVectors, n, n); fflush(0);
48
49 const int nTrials=10;
50 const int itSkip=1;
51 double tAvg = 0.0;
52 double dtAvg = 0.0;
53 for (int t=0; t<nTrials; t++){
54 double itAvg = 0.0;
55 double dItAvg = 0.0;
56 printf("Iteration %d ", t);
57 const double t0 = omp_get_wtime();


58 #pragma omp parallel for reduction(+: itAvg, dItAvg) schedule(guided)


59 for (int c = 0; c < nBVectors; c++) {
60 const int it = IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
61 itAvg += (double)it;
62 dItAvg += (double)it*it;
63 }
64 const double t1 = omp_get_wtime();
65 itAvg /= (double)(nBVectors);
66 dItAvg /= (double)(nBVectors);
67 dItAvg = sqrt(dItAvg - itAvg*itAvg);
68 printf(" time: %.3f sec, Jacobi iterations per vector: %.1f +- %.1f\n", t1-t0,
69 itAvg, dItAvg);
70 if (t >= itSkip) {
71 tAvg += (t1-t0);
72 dtAvg += (t1-t0)*(t1-t0);
73 }
74 fflush(0);
75 }
76 tAvg /= (double)(nTrials-itSkip);
77 dtAvg /= (double)(nTrials-itSkip);
78 dtAvg = sqrt(dtAvg - tAvg*tAvg);
printf("Average: %.3f +- %.3f sec\n\n", tAvg, dtAvg); fflush(0);

g
79

an
80

W
81 _mm_free(M);
_mm_free(x);

g
82

en
83 _mm_free(b); h
84 _mm_free(accuracy);
un
85 }
rY
fo

Back to Lab A.4.8.


B.4.8.3 labs/4/4.8-optimize-scheduling/step-00/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file worker.cc, located at 4/4.8-optimize-scheduling/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8


6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <math.h>
11 #include <cstdio>
12
13 double RelativeNormOfDifference(const int n, const double* v1, const double* v2) {
14 // Calculates ||v1 - v2|| / (||v1|| + ||v2||)
15 double norm2 = 0.0;
16 double v1sq = 0.0;
17 double v2sq = 0.0;
18 #pragma vector aligned
19 for (int i = 0; i < n; i++) {
20 norm2 += (v1[i] - v2[i])*(v1[i] - v2[i]);
21 v1sq += v1[i]*v1[i];
22 v2sq += v2[i]*v2[i];
23 }
24 return sqrt(norm2/(v1sq+v2sq));
25 }
26
27 int IterativeSolver(const int n, const double* M, const double* b, double* x,


28 const double minAccuracy) {


29 // Iteratively solves the equation Mx=b with accuracy of at least minAccuracy
30 // using the Jacobi method
31 double accuracy;
32 double bTrial[n] __attribute__((align(64)));
33 x[0:n] = 0.0; // Initial guess
34 int iterations = 0;
35 do {
36 iterations++;
37 // Jacobi method
38 for (int i = 0; i < n; i++) {
39 double c = 0.0;
40 #pragma vector aligned
41 for (int j = 0; j < n; j++)
42 c += M[i*n+j]*x[j];
43 x[i] = x[i] + (b[i] - c)/M[i*n+i];
44 }
45
46 // Verification
47 bTrial[:] = 0.0;
48 for (int i = 0; i < n; i++) {
49 #pragma vector aligned
50 for (int j = 0; j < n; j++)
51 bTrial[i] += M[i*n+j]*x[j];
52 }
53 accuracy = RelativeNormOfDifference(n, b, bTrial);
54
55 } while (accuracy > minAccuracy);
56 return iterations;
57 }

Back to Lab A.4.8.
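
Written out, the loop at lines 38-44 of this listing performs the Jacobi update

    x_i  <-  x_i + ( b_i - sum_j M_ij x_j ) / M_ii ,      i = 0, ..., n-1,

and RelativeNormOfDifference() then evaluates the stopping criterion

    || b - M x || / sqrt( ||b||^2 + ||M x||^2 )  <=  minAccuracy .

Because InitializeMatrix() sets M_ii to twice the sum of the off-diagonal entries of row i, the matrix is strictly diagonally dominant and the iteration converges; the randomly drawn per-vector accuracy targets in main.cc then make the iteration count differ from one b-vector to the next, which is the source of the load imbalance studied in this lab.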



B.4.8.4 labs/4/4.8-optimize-scheduling/step-01/main.cc
1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file main.cc, located at 4/4.8-optimize-scheduling/step-01

3 is a part of the practical supplement to the handbook


4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <cstdlib>
12 #include <omp.h>
13 #include <cilk/reducer_opadd.h>
14 #include <math.h>
15 #include <mkl_vsl.h>
16
17 int IterativeSolver(const int n, const double* M, const double* b, double* x,
18 const double minAccuracy);
19
20 void InitializeMatrix(const int n, double* M) {
21 // "Good" matrix for Jacobi method
22 for (int i = 0; i < n; i++) {
23 double sum = 0;
24 for (int j = 0; j < n; j++) {
25 M[i*n+j] = (double)(i*n+j);


26 sum += M[i*n+j];
27 }
28 sum -= M[i*n+i];
29 M[i*n+i] = 2.0*sum;
30 }
31 }
32
33 int main(int argv, char* argc[]){
34 printf("Initialization..."); fflush(0);
35 const int n = 128;
36 const int nBVectors = 1<<14; // The number of b-vectors
37 double* M = (double*) _mm_malloc(sizeof(double)*n*n, 64);
38 double* x = (double*) _mm_malloc(sizeof(double)*n*nBVectors, 64);
39 double* b = (double*) _mm_malloc(sizeof(double)*n*nBVectors, 64);
40 double* accuracy = (double*) _mm_malloc(sizeof(double)*nBVectors, 64);
41 InitializeMatrix(n, M);
42 VSLStreamStatePtr rnStream;
43 vslNewStream( &rnStream, VSL_BRNG_MT19937, 1234 );
44 vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, rnStream, n*nBVectors, b, 0.0, 1.0);

g
45 vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, rnStream, nBVectors, accuracy, 0.0, 1.0);

an
46 accuracy[0:nBVectors]=exp(-28.0+26.0*accuracy[0:nBVectors]);

W
47 printf(" initialized %d vectors and a [%d x %d] matrix\n",
48 nBVectors, n, n); fflush(0);

ng
49

e
50 const int nTrials=10;
51 const int itSkip=1;
52 double tAvg = 0.0;
53 double dtAvg = 0.0;
54 for (int t=0; t<nTrials; t++){
55 cilk::reducer_opadd<double> itAvg;
56 cilk::reducer_opadd<double> dItAvg;
57 printf("Iteration %d ", t);
58 const double t0 = omp_get_wtime();
59 _Cilk_for (int c = 0; c < nBVectors; c++) {
60 const int it = IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
61 itAvg += (double)it;
62 dItAvg += (double)it*it;
63 }
64 const double t1 = omp_get_wtime();
65 double mitAvg = itAvg.get_value() / (double)(nBVectors);
66 double mdItAvg = dItAvg.get_value() / (double)(nBVectors);
67 mdItAvg = sqrt(mdItAvg - mitAvg*mitAvg);
68 printf(" time: %.3f sec, Jacobi iterations per vector: %.1f +- %.1f\n", t1-t0,
69 mitAvg, mdItAvg);
70 if (t >= itSkip) {
71 tAvg += (t1-t0);
72 dtAvg += (t1-t0)*(t1-t0);
73 }
74 fflush(0);
75 }
76 tAvg /= (double)(nTrials-itSkip);
77 dtAvg /= (double)(nTrials-itSkip);
78 dtAvg = sqrt(dtAvg - tAvg*tAvg);
79 printf("Average: %.3f +- %.3f sec\n\n", tAvg, dtAvg); fflush(0);
80
81 _mm_free(M);
82 _mm_free(x);
83 _mm_free(b);
84 _mm_free(accuracy);
85 }


Back to Lab A.4.8.

B.4.8.5 labs/4/4.8-optimize-scheduling/step-02/main.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cc, located at 4/4.8-optimize-scheduling/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <cstdlib>
12 #include <omp.h>
13 #include <math.h>
14 #include <mkl_vsl.h>
15
16 int IterativeSolver(const int n, const double* M, const double* b, double* x,

17 const double minAccuracy);
18
19 void InitializeMatrix(const int n, double* M) {
20 // "Good" matrix for Jacobi method
21 for (int i = 0; i < n; i++) {
22 double sum = 0;
23 for (int j = 0; j < n; j++) {
24 M[i*n+j] = (double)(i*n+j);
25 sum += M[i*n+j];
26 }
27 sum -= M[i*n+i];
28 M[i*n+i] = 2.0*sum;
29 }
30 }
31
32 int main(int argv, char* argc[]){
33 printf("Initialization..."); fflush(0);
34 const int n=128;
35 const int nBVectors = 1<<14; // The number of b-vectors


36 double* M = (double*) _mm_malloc(sizeof(double)*n*n, 64);
37 double* x = (double*) _mm_malloc(sizeof(double)*n*nBVectors, 64);
38 double* b = (double*) _mm_malloc(sizeof(double)*n*nBVectors, 64);
39 double* accuracy = (double*) _mm_malloc(sizeof(double)*nBVectors, 64);
40 InitializeMatrix(n, M);
41 VSLStreamStatePtr rnStream;
42 vslNewStream( &rnStream, VSL_BRNG_MT19937, 1234 );
43 vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, rnStream, n*nBVectors, b, 0.0, 1.0);
44 vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, rnStream, nBVectors, accuracy, 0.0, 1.0);
45 accuracy[0:nBVectors]=exp(-28.0+26.0*accuracy[0:nBVectors]);
46 printf(" initialized %d vectors and a [%d x %d] matrix\n",
47 nBVectors, n, n); fflush(0);
48
49 const int nTrials=10;
50 const int nMethods = 14;
51 const int itSkip = 1;
52 for (int iMethod = 0; iMethod < nMethods; iMethod++ ) {
53 double tAvg = 0.0;
54 double dtAvg = 0.0;
55 for (int t=0; t<nTrials; t++){
56 printf("Iteration %d ", t);
57 const double t0 = omp_get_wtime();


58 if (iMethod == 0) {
59 printf(" (_Cilk_for)..."); fflush(0);
60 //__cilkrts_set_param("nworkers","50");
61 _Cilk_for (int c = 0; c < nBVectors; c++)
62 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
63 }
64 if (iMethod == 1) {
65 printf(" (no scheduling)..."); fflush(0);
66 #pragma omp parallel for
67 for (int c = 0; c < nBVectors; c++)
68 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
69 }
70 if (iMethod == 2) {
71 printf(" (static, 1)..."); fflush(0);
72 #pragma omp parallel for schedule(static, 1)
73 for (int c = 0; c < nBVectors; c++)
74 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
75 }
76 if (iMethod == 3) {
77 printf(" (static, 4)..."); fflush(0);
78 #pragma omp parallel for schedule(static, 4)
79 for (int c = 0; c < nBVectors; c++)
80 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
81 }
82 if (iMethod == 4) {
83 printf(" (static, 32)..."); fflush(0);
84 #pragma omp parallel for schedule(static, 32)
85 for (int c = 0; c < nBVectors; c++)
86 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
87 }
88 if (iMethod == 5) {
89 printf(" (static, 256)..."); fflush(0);
90 #pragma omp parallel for schedule(static, 256)
91 for (int c = 0; c < nBVectors; c++)
92 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
93 }
94 if (iMethod == 6) {
95 printf(" (dynamic, 1)..."); fflush(0);
96 #pragma omp parallel for schedule(dynamic, 1)
97 for (int c = 0; c < nBVectors; c++)
98 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
99 }
100 if (iMethod == 7) {
101 printf(" (dynamic, 4)..."); fflush(0);
102 #pragma omp parallel for schedule(dynamic, 4)
103 for (int c = 0; c < nBVectors; c++)
104 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
105 }
106 if (iMethod == 8) {
107 printf(" (dynamic, 32)..."); fflush(0);
108 #pragma omp parallel for schedule(dynamic, 32)
109 for (int c = 0; c < nBVectors; c++)
110 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
111 }
112 if (iMethod == 9) {
113 printf(" (dynamic, 256)..."); fflush(0);
114 #pragma omp parallel for schedule(dynamic, 256)
115 for (int c = 0; c < nBVectors; c++)
116 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
117 }
118 if (iMethod == 10) {
119 printf(" (guided, 1)..."); fflush(0);


120 #pragma omp parallel for schedule(guided, 1)


121 for (int c = 0; c < nBVectors; c++)
122 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
123 }
124 if (iMethod == 11) {
125 printf(" (guided, 4)..."); fflush(0);
126 #pragma omp parallel for schedule(guided, 4)
127 for (int c = 0; c < nBVectors; c++)
128 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
129 }
130 if (iMethod == 12) {
131 printf(" (guided, 32)..."); fflush(0);
132 #pragma omp parallel for schedule(guided, 32)
133 for (int c = 0; c < nBVectors; c++)
134 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
135 }
136 if (iMethod == 13) {
137 printf(" (guided, 256)..."); fflush(0);
138 #pragma omp parallel for schedule(guided, 256)
139 for (int c = 0; c < nBVectors; c++)
140 IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
141 }
142 const double t1 = omp_get_wtime();
143 printf(" time: %.3f sec\n", t1-t0);
144 if (t >= itSkip) {
145 tAvg += (t1-t0);
146 dtAvg += (t1-t0)*(t1-t0);
147 }
148 fflush(0);
149 }
150 tAvg /= (double)(nTrials-itSkip);
151 dtAvg /= (double)(nTrials-itSkip);
152 dtAvg = sqrt(dtAvg - tAvg*tAvg);
153 printf("Average: %.3f +- %.3f sec\n\n", tAvg, dtAvg); fflush(0);
154 }
155 _mm_free(M);
156 _mm_free(x);
157 _mm_free(b);
158 _mm_free(accuracy);
159 }
Back to Lab A.4.8.
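
The step-02 driver above compiles one branch per scheduling clause. As an alternative (not part of the lab), OpenMP 3.0 also allows the schedule to be picked at run time with schedule(runtime), either through omp_set_schedule() or the OMP_SCHEDULE environment variable; a minimal sketch, with the wrapper name SolveAll made up for illustration:

// Hypothetical run-time scheduling variant (illustration only).
#include <omp.h>

int IterativeSolver(const int n, const double* M, const double* b, double* x,
                    const double minAccuracy);

void SolveAll(const int n, const int nBVectors, const double* M, const double* b,
              double* x, const double* accuracy) {
  omp_set_schedule(omp_sched_guided, 4);     // equivalent to schedule(guided, 4)
#pragma omp parallel for schedule(runtime)
  for (int c = 0; c < nBVectors; c++)
    IterativeSolver(n, M, &b[c*n], &x[c*n], accuracy[c]);
}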

B.4.9 Shared-Memory Optimization: Loop Collapse and Strip-Mining for Improved Parallel Scalability

Back to Lab A.4.9.

B.4.9.1 labs/4/4.9-insufficient-parallelism/step-00/Makefile

CXX = icpc
CXXFLAGS = -openmp

OBJECTS = main.o worker.o


MICOBJECTS = main.oMIC worker.oMIC

.SUFFIXES: .o .cc .oMIC

.cc.o:

$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: runme runmeMIC

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC

Back to Lab A.4.9.

B.4.9.2 labs/4/4.9-insufficient-parallelism/step-00/main.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file main.cc, located at 4/4.9-insufficient-parallelism/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */


9
10 #include <malloc.h>
11 #include <math.h>
12 #include <omp.h>
13 #include <stdio.h>
14
15 void SumColumns(const int m, const int n, long* M, long* s, char* method);
16
17 int main(){
18 const int n = 100000000, m = 4; // n is the number of columns (inner dimension),


19 // m is the number of rows (outer dimension)
20 long* matrix = (long*)_mm_malloc(sizeof(long)*m*n, 64);
21 long* sums = (long*)_mm_malloc(sizeof(long)*m, 64); // will contain sum of matrix rows
22 char method[100];
23
24 const int nTrials=10;
25 double t=0, dt=0;
26
27 printf("Problem size: %.3f GB, outer dimension: %d, threads: %d\n",
28 (double)(sizeof(long))*(double)(n)*(double)m/(double)(1<<30),
29 m, omp_get_max_threads());
30
31 // Initializing data
32 #pragma omp parallel for
33 for (int i = 0; i < m; i++)
34 for (int j = 0; j < n; j++)
35 matrix[i*n + j] = (long)i;
36
37 // Benchmarking SumColumns(...)
38 for (int l = 0; l < nTrials; l++) {
39 const double t0=omp_get_wtime();
40 SumColumns(m, n, matrix, sums, method);


41 const double t1=omp_get_wtime();


42
43 if ( l>= 2) { // First two iterations are slow on Xeon Phi; exclude them
44 t += (t1-t0)/(double)(nTrials-2);
45 dt += (t1-t0)*(t1-t0)/(double)(nTrials-2);
46 }
47
48 // Verifying that the result is correct
49 for (int i = 0; i < m; i++)
50 if (sums[i] != i*n)
51 printf("Results are incorrect!");
52
53 }
54 dt = sqrt(dt-t*t);
55 const float GBps = (double)(sizeof(long)*(size_t)n*(size_t)m)/t*1e-9;
56 printf("%s: %.3f +/- %.3f seconds (%.2f +/- %.2f GB/s)\n",
57 method, t, dt, GBps, GBps*(dt/t));
58
59 _mm_free(sums); _mm_free(matrix);
60 }

g
Back to Lab A.4.9.

B.4.9.3 labs/4/4.9-insufficient-parallelism/step-00/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


rY

2 file worker.cc, located at 4/4.9-insufficient-parallelism/step-00


is a part of the practical supplement to the handbook
fo

3
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
d
re

5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8


pa

6 Redistribution or commercial usage without a written permission


from Colfax International is prohibited.
re

7
Contact information can be found at http://colfax-intl.com/ */
yP

8
9
el

10 #include <omp.h>
siv

11 #include <cstring>
u

12
cl

13 void SumColumns(const int m, const int n, long* M, long* s, char* method){


Ex

14
15 // Distribute rows across threads
16 #pragma omp parallel for
17 for (int i = 0; i < m; i++) {
18 long sum = 0;
19
20 // In each row, use vectorized reduction
21 // to compute the sum of all columns
22 #pragma simd reduction(+: sum)
23 #pragma vector aligned
24 for (int j = 0; j < n; j++)
25 sum += M[i*n+j];
26
27 s[i] = sum;
28
29 }
30
31 strcpy(method, "Unoptimized");
32
33 }

Back to Lab A.4.9.


B.4.9.4 labs/4/4.9-insufficient-parallelism/step-01/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.9-insufficient-parallelism/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <omp.h>
11 #include <cstring>
12
13 void SumColumns(const int m, const int n, long* M, long* s, char* method){
14
15 for (int i = 0; i < m; i++) {
16 long sum = 0;
17
18 // Distribute rows across threads.
19 // At the same time, use vectorized reduction

g
// to compute the sum of all columns

an
20
21 #pragma omp parallel for schedule(guided) reduction(+: sum)

W
22 #pragma simd

g
#pragma vector aligned
en
23
24 for (int j = 0; j < n; j++) h
un
25 sum += M[i*n+j];
rY

26
27 s[i] = sum;
fo

28
ed

29 }
ar

30
ep

31 strcpy(method, "Inner loop parallelized");


Pr

32
33 }
y
el
siv

Back to Lab A.4.9.


u
cl
Ex

B.4.9.5 labs/4/4.9-insufficient-parallelism/step-02/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.9-insufficient-parallelism/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <omp.h>
11 #include <cstring>
12
13 void SumColumns(const int m, const int n, long* M, long* s, char* method){
14
15 s[0:m] = 0;
16
17 #pragma omp parallel
18 {
19 long s_thread[m]; // Private reduction array to avoid false sharing


20 s_thread[0:m] = 0;
21
22 // Note the absence of "parallel" in #pragma omp for, because it is already
23 // in a parallel region
24 #pragma omp for collapse(2) schedule(guided)
25 #pragma simd
26 #pragma vector aligned
27 for (int i = 0; i < m; i++) // Loop i will be collased with loop j
28 for (int j = 0; j < n; j++) // to form a single, greater iteration space
29 s_thread[i] += M[i*n+j];
30
31 // Arrays cannot be declared as reducers in pragma omp,
32 // and so the reduction must be programmed explicitly
33 for (int i = 0; i < m; i++)
34 #pragma omp atomic
35 s[i] += s_thread[i];
36 }
37
38 strcpy(method, "Collapse nested loops");
39
40 }

g
an
Back to Lab A.4.9.
W
ng
he

B.4.9.6 labs/4/4.9-insufficient-parallelism/step-03/worker.cc
un
rY

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


fo

2 file worker.cc, located at 4/4.9-insufficient-parallelism/step-03


d

3 is a part of the practical supplement to the handbook


re

4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"


pa

5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8


re

6 Redistribution or commercial usage without a written permission


yP

7 from Colfax International is prohibited.


Contact information can be found at http://colfax-intl.com/ */
el

8
siv

9
10 #include <omp.h>
u
cl

11 #include <assert.h>
Ex

12 #include <cstring>
13
14 void SumColumns(const int m, const int n, long* M, long* s, char* method){
15
16 // stripSize works best if it is
17 // (a) a multiple of the SIMD vector length, and
18 // (b) be much greater than the SIMD vector length
19 // (c) much smaller than n
20 const int stripSize = 10000;
21
22 // It is trivial to avoid this limitation by peeling off n%stripSize iterations
23 // at the end of the j-loop, and adding a second loop to process these elements.
24 assert(n % stripSize == 0);
25
26 s[0:m] = 0;
27
28 #pragma omp parallel
29 {
30 long s_thread[m]; // Private reduction array to avoid false sharing
31 s_thread[0:m] = 0;
32
33 // Note the absence of "parallel" in #pragma omp for, because already in a
34 // parallel region


35 #pragma omp for collapse(2) schedule(guided)


36 for (int i = 0; i < m; i++) // Loop i will be collased with loop jj
37 for (int jj = 0; jj < n; jj += stripSize) // to form a single, greater
38 // iteration space
39 #pragma simd reduction(+:s_thread[i])
40 #pragma vector aligned
41 for (int j = jj; j < jj + stripSize; j++) // This loop is auto-vectorized
42 s_thread[i] += M[i*n+j];
43
44 // Arrays cannot be declared as reducers in pragma omp,
45 // and so the reduction must be programmed explicitly
46 for (int i = 0; i < m; i++)
47 #pragma omp atomic
48 s[i] += s_thread[i];
49 }
50
51 strcpy(method, "Strip-mine and collapse");
52
53 }

Back to Lab A.4.9.
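
The assert(n % stripSize == 0) in this listing enforces the simplification mentioned in its comment. A sketch of the peeled variant that the comment alludes to is shown below (not part of the lab); the function name SumColumnsPeeled and the method string are made up, everything else mirrors step-03:

// Hypothetical variant of step-03 without the n % stripSize == 0 restriction.
#include <omp.h>
#include <cstring>

void SumColumnsPeeled(const int m, const int n, long* M, long* s, char* method) {
  const int stripSize = 10000;
  const int nFull = n - n % stripSize;       // last multiple of stripSize
  s[0:m] = 0;
#pragma omp parallel
  {
    long s_thread[m];                        // private partial sums, as in step-03
    s_thread[0:m] = 0;

    // Full strips: collapsed and distributed across threads as before
#pragma omp for collapse(2) schedule(guided)
    for (int i = 0; i < m; i++)
      for (int jj = 0; jj < nFull; jj += stripSize)
        for (int j = jj; j < jj + stripSize; j++)
          s_thread[i] += M[i*n+j];

    // Peeled remainder: at most stripSize-1 columns per row, done by one thread
#pragma omp single
    for (int i = 0; i < m; i++)
      for (int j = nFull; j < n; j++)
        s_thread[i] += M[i*n+j];

    // Explicit reduction of the private arrays into s[]
    for (int i = 0; i < m; i++)
#pragma omp atomic
      s[i] += s_thread[i];
  }
  strcpy(method, "Strip-mine, collapse, and peel");
}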

g
an
W
B.4.10 Shared-Memory Optimization: Core Affinity Control

g
Back to Lab A.4.10. en
h
un
rY

B.4.10.1 labs/4/4.a-affinity/step-00/Makefile
fo
ed
ar

CXX = icpc
ep

CXXFLAGS = -openmp
Pr

OBJECTS = main.o worker.o


y

MICOBJECTS = main.oMIC worker.oMIC


el
siv

.SUFFIXES: .o .cc .oMIC


u
cl
Ex

.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: runme runmeMIC

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC

Back to Lab A.4.10.

B.4.10.2 labs/4/4.a-affinity/step-00/main.cc


1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cc, located at 4/4.a-affinity/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <malloc.h>
11 #include <math.h>
12 #include <omp.h>
13 #include <stdio.h>
14
15 void SumColumns(const int m, const int n, long* M, long* s, char* method);
16
17 int main(){
18 const int n = 100000000, m = 4; // n is the number of columns (inner dimension),
19 // m is the number of rows (outer dimension)

g
20 long* matrix = (long*)_mm_malloc(sizeof(long)*m*n, 64);

an
21 long* sums = (long*)_mm_malloc(sizeof(long)*m, 64); // will contain sum of matrix rows

W
22 char method[100];
23

ng
24 const int nTrials=10;
25 double t=0, dt=0;
e
nh
26
printf("Problem size: %.3f GB, outer dimension: %d, threads: %d\n",
Yu

27
28 (double)(sizeof(long))*(double)(n)*(double)m/(double)(1<<30),
r

29 m, omp_get_max_threads());
fo

30
// Initializing data
ed

31
32 #pragma omp parallel for
ar

33 for (int i = 0; i < m; i++)


p

34 for (int j = 0; j < n; j++)


re

35 matrix[i*n + j] = (long)i;
yP

36
37 // Benchmarking SumColumns(...)
el

38 for (int l = 0; l < nTrials; l++) {


iv

39 const double t0=omp_get_wtime();


us

40 SumColumns(m, n, matrix, sums, method);


const double t1=omp_get_wtime();
cl

41
Ex

42
43 if ( l>= 2) { // First two iterations are slow on Xeon Phi; exclude them
44 t += (t1-t0)/(double)(nTrials-2);
45 dt += (t1-t0)*(t1-t0)/(double)(nTrials-2);
46 }
47
48 // Verifying that the result is correct
49 for (int i = 0; i < m; i++)
50 if (sums[i] != i*n)
51 printf("Results are incorrect!");
52
53 }
54 dt = sqrt(dt-t*t);
55 const float GBps = (double)(sizeof(long)*(size_t)n*(size_t)m)/t*1e-9;
56 printf("%s: %.3f +/- %.3f seconds (%.2f +/- %.2f GB/s)\n",
57 method, t, dt, GBps, GBps*(dt/t));
58
59 _mm_free(sums); _mm_free(matrix);
60 }


Back to Lab A.4.10.

B.4.10.3 labs/4/4.a-affinity/step-00/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.a-affinity/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <omp.h>
11 #include <assert.h>
12 #include <cstring>
13
14 void SumColumns(const int m, const int n, long* M, long* s, char* method){
15
// stripSize works best if it is

g
16

an
17 // (a) a multiple of the SIMD vector length, and

W
18 // (b) be much greater than the SIMD vector length
// (c) much smaller than n

g
19

en
20 const int stripSize = 10000; h
21
un
22 // It is trivial to avoid this limitation by peeling off n%stripSize iterations
rY

23 // at the end of the j-loop, and adding a second loop to process these elements.
fo

24 assert(n % stripSize == 0);


ed

25
s[0:m] = 0;
ar

26
27
ep

28 #pragma omp parallel


Pr

29 {
y

30 long s_thread[m]; // Private reduction array to avoid false sharing


el

31 s_thread[0:m] = 0;
siv

32
u

// Note the absence of "parallel" in #pragma omp for, because it is already


cl

33
Ex

34 // in a parallel region
35 #pragma omp for collapse(2) schedule(guided)
36 for (int i = 0; i < m; i++) // Loop i will be collased with loop jj
37 for (int jj = 0; jj < n; jj += stripSize) // to form a single, greater
38 // iteration space
39 #pragma simd reduction(+:s_thread[i])
40 #pragma vector aligned
41 for (int j = jj; j < jj + stripSize; j++) // This loop is auto-vectorized
42 s_thread[i] += M[i*n+j];
43
44 // Arrays cannot be declared as reducers in pragma omp,
45 // and so the reduction must be programmed explicitly
46 for (int i = 0; i < m; i++)
47 #pragma omp atomic
48 s[i] += s_thread[i];
49 }
50
51 strcpy(method, "Strip-mine and collapse");
52
53 }

Back to Lab A.4.10.


B.4.10.4 labs/4/4.a-affinity/step-01/Makefile

CXX = icpc
CXXFLAGS = -openmp -mkl

OBJECTS = affinity.o
MICOBJECTS = affinity.oMIC

.SUFFIXES: .o .cpp .oMIC

.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cpp.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: runme runmeMIC

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

g
runmeMIC: $(MICOBJECTS)

an
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)
W
ng
clean:
he

rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC


un
rY

Back to Lab A.4.10.


fo
d
re

B.4.10.5 labs/4/4.a-affinity/step-01/affinity.cpp
pa
re
yP

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


file affinity.cpp, located at 4/4.a-affinity/step-01
el

2
siv

3 is a part of the practical supplement to the handbook


4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
u
cl

5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8


Ex

6 Redistribution or commercial usage without a written permission


7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <mkl.h>
11 #include <stdio.h>
12 #include <omp.h>
13
14 int main(){
15 const int N = 10000, Nld = N+64;
16 const char tr='N'; const double v=1.0;
17 double* A = (double*)_mm_malloc(sizeof(double)*N*Nld, 64);
18 double* B = (double*)_mm_malloc(sizeof(double)*N*Nld, 64);
19 double* C = (double*)_mm_malloc(sizeof(double)*N*Nld, 64);
20 _Cilk_for (int i=0; i<N*Nld; i++)
21 A[i] = B[i] = C[i] = 0.0f;
22 int nIter = 10;
23 for (int k=0; k<nIter; k++){
24 double t1 = omp_get_wtime();
25 dgemm(&tr, &tr, &N, &N, &N, &v, A, &Nld, B, &Nld, &v, C, &N);
26 double t2 = omp_get_wtime();
27 double flopsNow = (2.0*N*N*N+1.0*N*N) * 1e-9/(t2-t1);
28 printf("Iteration %d: %.1f GFLOP/s\n", k+1, flopsNow);


29 fflush(0);
30 }
31 _mm_free(A); _mm_free(B); _mm_free(C);
32 }

Back to Lab A.4.10.

B.4.10.6 labs/4/4.a-affinity/step-02/Makefile

CXX = icpc
CXXFLAGS = -openmp -mkl

OBJECTS = affinity.o
MICOBJECTS = affinity.oMIC

.SUFFIXES: .o .cpp .oMIC

.cpp.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cpp.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: runme runmeMIC

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC

Back to Lab A.4.10.

B.4.10.7 labs/4/4.a-affinity/step-02/affinity.cpp

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file affinity.cpp, located at 4/4.a-affinity/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <math.h>
11 #include <stdio.h>
12 #include <omp.h>
13 #include "mkl_dfti.h"
14
15 int main(int argc, char** argv){
16 const size_t n = 1L<<30L;
17 const char* def = "(single instance)";
18 const char* inst = (argc < 2 ? def : argv[1]);
19 const double flopsPerTransfer = 2.5*log2((double)n)*n;
20 float* x = (float*)malloc(sizeof(float)*n);


21 _Cilk_for (int i=0; i<n; i++)


22 x[i] = 1.0f;
23 DFTI_DESCRIPTOR_HANDLE fftHandle;
24 MKL_LONG size = n;
25 DftiCreateDescriptor (&fftHandle, DFTI_SINGLE, DFTI_REAL, 1, size);
26 DftiCommitDescriptor (fftHandle);
27 const int nTrials = 5;
28 for (int t=0; t<nTrials; t++){
29 const double t1 = omp_get_wtime();
30 DftiComputeForward (fftHandle, x);
31 const double t2 = omp_get_wtime();
32 const double gflops = flopsPerTransfer*1e-9/(t2-t1);
33 printf("Instance %s, iteration %d: %.3f ms (%.1f GFLOP/s)\n",
34 inst, t+1, 1e3*(t2-t1), gflops);
35 }
36 DftiFreeDescriptor (&fftHandle);
37 free(x);
38 }

Back to Lab A.4.10.
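
The two affinity benchmarks above (dgemm in step-01, the MKL FFT in step-02) report timing only. As a quick way to check where the OpenMP threads actually land, a small stand-alone diagnostic along the following lines can be compiled with the same flags as the listings; this sketch is not part of the lab sources, sched_getcpu() is Linux-specific, and any file name chosen for it is hypothetical:

#include <omp.h>
#include <sched.h>
#include <cstdio>

int main() {
  // Each OpenMP thread reports the logical CPU it is currently running on.
#pragma omp parallel
  {
#pragma omp critical
    printf("Thread %2d of %2d is on logical CPU %3d\n",
           omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
  }
  return 0;
}

Running it under different affinity settings (for example, different values of the Intel OpenMP runtime variable KMP_AFFINITY) makes compact versus scattered thread placement directly visible.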

B.4.11 Cache Optimization: Loop Interchange and Tiling

Back to Lab A.4.11.

B.4.11.1 labs/4/4.b-tiling/step-00/Makefile

CXX = icpc
CXXFLAGS = -openmp

OBJECTS = main.o worker.o
MICOBJECTS = main.oMIC worker.oMIC

.SUFFIXES: .o .cc .oMIC

.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: runme runmeMIC

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC

Back to Lab A.4.11.

B.4.11.2 labs/4/4.b-tiling/step-00/main.cc


1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cc, located at 4/4.b-tiling/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <math.h>
11 #include <omp.h>
12 #include <stdio.h>
13
14 void ComputeEmissivity(const int wlBins,
15 double* emissivity,
16 const double* wavelength,
17 const int gsMax,
18 const double* grainSizeD,
19 const double* absorption,
20 const int tempBins,
21 const double* planckFunc,
22 const double* distribution
23 );
24
25 int main(){
26 const int wlBins = 128;
27 const int tempBins = 128;
28 const int gsMax = 200;
29 const int nCells = 1<<15;
30
31 double wavelength[wlBins], grainSizeD[gsMax], absorption[wlBins*gsMax],
32 planckFunc[tempBins*wlBins] __attribute__((aligned(64))), report[wlBins];
33
34 const int nTrials=6;
35 double t=0, dt=0;
36
37 // Initializing data with something not realistic, but non-trivial
38 for (int i = 0; i < wlBins; i++) {
39 wavelength[i] = exp(0.1*i);
40 for (int j = 0; j < gsMax; j++) {
41 absorption[j*wlBins + i] = (double)(i*j);
42 grainSizeD[j] = (double)j;
43 }
44 for (int k = 0; k < tempBins; k++)
45 planckFunc[i*tempBins + k] = (double)i*exp(-0.1*k);
46 }
47
48 for (int tr = 0; tr < nTrials; tr++) {
49 const double t0=omp_get_wtime();
50
51 #pragma omp parallel for schedule(guided)
52 for (int cell = 0; cell < nCells; cell++) {
53 double emissivity[wlBins];
54 double distribution[tempBins*gsMax] __attribute__((aligned(64)));
55 // In the practical application, this quantity is computed for every cell,
56 // but for benchmarking, we omit this calculation and use the same distribution
57 // for all cells
58 distribution[:] = 1.0;
59 ComputeEmissivity(wlBins,
60 emissivity,
61 wavelength,
62 gsMax,

Prepared for Yunheng Wang c Colfax International, 2013


450 APPENDIX B. SOURCE CODE FOR PRACTICAL EXERCISES

63 grainSizeD,
64 absorption,
65 tempBins,
66 planckFunc,
67 distribution
68 );
69 if ((tr == nTrials-1) && (cell == nCells-1))
70 report[:] = emissivity[:];
71 }
72 const double t1=omp_get_wtime();
73
74 if (tr >= 2) { // First two iterations are slow on Xeon Phi; exclude them
75 t += (t1-t0)/(double)(nTrials-2);
76 dt += (t1-t0)*(t1-t0)/(double)(nTrials-2);
77 }
78 printf("Trial %d: %.3f seconds\n", tr+1, t1-t0); fflush(0);
79 }
80 dt = sqrt(dt-t*t);
81 printf("Average: %.3f +- %.3f seconds.\nResult (for verification): ", t, dt);
82 for (int i = 0; i < wlBins; i++)
83 printf(" %.2e", report[i]);
84 printf("\n"); fflush(0);
85 }

Back to Lab A.4.11.

B.4.11.3 labs/4/4.b-tiling/step-00/worker.cc
1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file worker.cc, located at 4/4.b-tiling/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 void ComputeEmissivity(const int wlBins,
11 double* emissivity,
12 const double* wavelength,
13 const int gsMax,
14 const double* grainSizeD,
15 const double* absorption,
16 const int tempBins,
17 const double* planckFunc,
18 const double* distribution
19 ) {
20 // This function computes the emissivity of transient dust grains
21 for (int i = 0; i < wlBins; i++) {
22 double sum = 0;
23 for (int j = 0; j < gsMax; j++) {
24 const double gsd = grainSizeD[j];
25 const double crossSection = absorption[j*wlBins + i];
26 const double product = gsd*crossSection;
27 double result = 0;
28 #pragma vector aligned
29 for (int k = 0; k < tempBins; k++)
30 result += planckFunc[i*tempBins + k]*distribution[j*tempBins + k];
31 sum += result*product;
32 }


33 emissivity[i] = sum*wavelength[i];
34 }
35 }

Back to Lab A.4.11.

B.4.11.4 labs/4/4.b-tiling/step-01/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.b-tiling/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 void ComputeEmissivity(const int wlBins,
11 double* emissivity,

12 const double* wavelength,
13 const int gsMax,
14 const double* grainSizeD,
15 const double* absorption,
16 const int tempBins,
17 const double* planckFunc,
18 const double* distribution
19 ) {
20 // This function computes the emissivity of transient dust grains
21 // In this version, the i- and j-loops are permuted
22 // in order to reduce cache misses upon reading grainSizeD[]
23 // and improve the locality of access to absorption[]
24 emissivity[0:wlBins] = 0.0;
25 for (int j = 0; j < gsMax; j++) {
26 const double gsd = grainSizeD[j];
27 for (int i = 0; i < wlBins; i++) {
28 double result = 0;
29 #pragma vector aligned
30 for (int k = 0; k < tempBins; k++)
31 result += planckFunc[i*tempBins + k]*distribution[j*tempBins + k];
32
33 const double crossSection = absorption[j*wlBins + i];
34 const double product = gsd*crossSection;
35 emissivity[i] += result*product*wavelength[i];
36 }
37 }
38 }

Back to Lab A.4.11.

B.4.11.5 labs/4/4.b-tiling/step-02/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.b-tiling/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.


8 Contact information can be found at http://colfax-intl.com/ */


9
10 #include <cassert>
11
12 void ComputeEmissivity(const int wlBins,
13 double* emissivity,
14 const double* wavelength,
15 const int gsMax,
16 const double* grainSizeD,
17 const double* absorption,
18 const int tempBins,
19 const double* planckFunc,
20 const double* distribution
21 ) {
22 // This function computes the emissivity of transient dust grains
23 // In this version, loop tiling is implemented in the i-loop
24 const int iTile = 8; // Found empirically
25 assert(wlBins%iTile==0);
26 emissivity[0:wlBins] = 0.0;
27 for (int j = 0; j < gsMax; j++) {
28 const double gsd = grainSizeD[j];
29 for (int ii = 0; ii < wlBins; ii+=iTile) { // i-loop tiling
30 double result[iTile]; result[:] = 0.0;
31 #pragma vector aligned
32 #pragma simd
33 for (int k = 0; k < tempBins; k++)
34 #pragma novector
35 for (int i = ii; i < ii+iTile; i++)
36 result[i-ii] += planckFunc[i*tempBins + k]*distribution[j*tempBins + k];
37
38 for (int i = ii; i < ii+iTile; i++) {
39 const double crossSection = absorption[j*wlBins + i];
40 const double product = gsd*crossSection;
41 emissivity[i] += result[i-ii]*product*wavelength[i];
42 }
43 }
44 }
45 }

Back to Lab A.4.11.

B.4.11.6 labs/4/4.b-tiling/step-03/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.b-tiling/step-03
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <cassert>
11
12 void ComputeEmissivity(const int wlBins,
13 double* emissivity,
14 const double* wavelength,
15 const int gsMax,
16 const double* grainSizeD,
17 const double* absorption,


18 const int tempBins,


19 const double* planckFunc,
20 const double* distribution
21 ) {
22 // This function computes the emissivity of transient dust grains
23 // In this version, loop tiling is implemented in the i- and j-loops
24 #ifdef __MIC__
25 // Found empirically
26 const int iTile = 4;
27 const int jTile = 4;
28 #else
29 // Found empirically
30 const int iTile = 4;
31 const int jTile = 2;
32 #endif
33
34 assert(gsMax%jTile==0); assert(wlBins%iTile==0);
35
36 emissivity[0:wlBins] = 0.0;
37 for (int jj = 0; jj < gsMax; jj+=jTile) { // j-loop tiling
38 for (int ii = 0; ii < wlBins; ii+=iTile) { // i-loop tiling
39 double result[iTile*jTile]; result[:] = 0.0;
40 #pragma vector aligned
41 #pragma simd
42 for (int k = 0; k < tempBins; k++) {
43
44 // In an ideal world, the following code would be the body of the k-loop:
45 // for (int j = jj; j < jj+jTile; j++)
46 // for (int i = ii; i < ii+iTile; i++)
47 // result[(j-jj)*iTile + (i-ii)] += planckFunc[i*tempBins + k]*
48 // distribution[j*tempBins + k];
49 // However, #pragma simd fails to vectorize the k-loop when its body
50 // contains not one, but two nested loops.
51 // So, in order to preserve automatic vectorization of the k-loop, we
52 // will unroll it. The j-loop is unrolled explicitly.
53
54 for (int i = ii; i < ii+iTile; i++) {
55 // Start of j-loop unrolling
56 result[(0)*iTile + (i-ii)] += planckFunc[i*tempBins + k]*
57 distribution[(jj+0)*tempBins + k];
58 result[(1)*iTile + (i-ii)] += planckFunc[i*tempBins + k]*
59 distribution[(jj+1)*tempBins + k];
60 // on the host, the j-loop tile is 2, so the host code ends here
61 #ifdef __MIC__
62 // on the coprocessor, the j-loop tile is 4, so two more iterations
63 result[(2)*iTile + (i-ii)] += planckFunc[i*tempBins + k]*
64 distribution[(jj+2)*tempBins + k];
65 result[(3)*iTile + (i-ii)] += planckFunc[i*tempBins + k]*
66 distribution[(jj+3)*tempBins + k];
67 // end of coprocessor-only code
68 #endif
69 // End of j-loop unrolling
70 }
71 }
72
73 for (int j = jj; j < jj+jTile; j++) {
74 const double gsd = grainSizeD[j];
75 for (int i = ii; i < ii+iTile; i++) {
76 const double crossSection = absorption[j*wlBins + i];
77 const double product = gsd*crossSection;
78 emissivity[i] += result[(j-jj)*iTile + (i-ii)]*product*wavelength[i];
79 }


80 }
81 }
82 }
83 }

Back to Lab A.4.11.

B.4.11.7 labs/4/4.b-tiling/step-04/main.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cc, located at 4/4.b-tiling/step-04
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <math.h>

11 #include <omp.h>
12 #include <stdio.h>
13
14 void ComputeEmissivity00(const int, double*, const double*, const int, const double*,
15 const double*, const int, const double*, const double* );
16
17 void ComputeEmissivity01(const int, double*, const double*, const int, const double*,
18 const double*, const int, const double*, const double* );
19
20 void ComputeEmissivity02(const int, double*, const double*, const int, const double*,
21 const double*, const int, const double*, const double* );
22
23 void ComputeEmissivity03(const int, double*, const double*, const int, const double*,
24 const double*, const int, const double*, const double* );
25
26 typedef void (*CompFunc)(const int, double*, const double*, const int, const double*,
27 const double*, const int, const double*, const double* );
28
29 CompFunc Func[] = {ComputeEmissivity00, ComputeEmissivity01, ComputeEmissivity02,
30 ComputeEmissivity03};
31
32 int main(){
33 const int wlBins = 128;
34 const int tempBins = 128;
35 const int gsMax = 200;
36 const int nCells = 1<<15;
37
38 double wavelength[wlBins], grainSizeD[gsMax], absorption[wlBins*gsMax],
39 planckFunc[tempBins*wlBins] __attribute__((aligned(64))), report[wlBins];
40
41 const int nTrials=4;
42 double t=0, dt=0;
43
44 // Initializing data with something not realistic, but non-trivial
45 for (int i = 0; i < wlBins; i++) {
46 wavelength[i] = exp(0.1*i);
47 for (int j = 0; j < gsMax; j++) {
48 absorption[j*wlBins + i] = (double)(i*j);
49 grainSizeD[j] = (double)j;
50 }
51 for (int k = 0; k < tempBins; k++)


52 planckFunc[i*tempBins + k] = (double)i*exp(-0.1*k);
53 }
54
55 for (int tr = 0; tr < nTrials; tr++) {
56 const double t0=omp_get_wtime();
57
58 #pragma omp parallel for schedule(guided)
59 for (int cell = 0; cell < nCells; cell++) {
60 double emissivity[wlBins];
61 double distribution[tempBins*gsMax] __attribute__((aligned(64)));
62 // In the practical application, this quantity is computed for every cell, but
63 // for benchmarking, we omit this calculation and use the same distribution
64 // for all cells.
65 distribution[:] = 1.0;
66 Func[tr](wlBins,
67 emissivity,
68 wavelength,
69 gsMax,
70 grainSizeD,
71 absorption,
72 tempBins,
73 planckFunc,
74 distribution
75 );
76 if ((tr == nTrials-1) && (cell == nCells-1))
77 report[:] = emissivity[:];
78 }
79 const double t1=omp_get_wtime();
80
81 if (tr >= 2) { // First two iterations are slow on Xeon Phi; exclude them
82 t += (t1-t0)/(double)(nTrials-2);
83 dt += (t1-t0)*(t1-t0)/(double)(nTrials-2);
84 }
85 printf("Step %d: %.3f seconds\n", tr, t1-t0); fflush(0);
86 }
87 printf("\n"); fflush(0);
88 }


Ex

B.4.12 Memory Access: Cache-Oblivious Algorithms


Back to Lab A.4.12.

B.4.12.1 labs/4/4.c-cache-oblivious-recursion/step-00/Makefile

CXX = icpc
CXXFLAGS = -openmp -vec-report3

OBJECTS = main.o worker.o


MICOBJECTS = main.oMIC worker.oMIC

.SUFFIXES: .o .cc .oMIC

.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"


all: runme runmeMIC

runme: $(OBJECTS)
$(CXX) $(CXXFLAGS) -o runme $(OBJECTS)

runmeMIC: $(MICOBJECTS)
$(CXX) $(CXXFLAGS) -mmic -o runmeMIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC

Back to Lab A.4.12.

B.4.12.2 labs/4/4.c-cache-oblivious-recursion/step-00/main.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cc, located at 4/4.c-cache-oblivious-recursion/step-00
3 is a part of the practical supplement to the handbook

4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <omp.h>
12 #include <malloc.h>
13 #include <math.h>
14 #include <cilk/reducer_opadd.h>
15
16 void InitMatrix(float* A, const int n) {
17 _Cilk_for (int i = 0; i < n; i++)
18 for (int j = 0; j < n; j++)
19 A[i*n+j] = (float)(i*n+j);
20 }
21
22 float VerifyTransposed(float* A, const int n) {
23 cilk::reducer_opadd<float> err;
24 _Cilk_for (int i = 0; i < n; i++)
25 for (int j = 0; j < n; j++) {
26 const float diff = (A[i*n+j] - (float)(j*n+i));
27 err += diff*diff;
28 }
29 return sqrtf(err.get_value());
30 }
31
32 void Transpose(float* const A, const int n);
33
34 int main(){
35 const int n = 28000;
36 float *A=(float*)_mm_malloc(n*n*sizeof(float), 64);
37
38 const int nt = 20;
39 double t = 0.0, dt = 0.0;
40 for (int k = 0; k < nt; k++) {
41 if (k == 0) InitMatrix(A, n);
42
43 const double t0 = omp_get_wtime();
44 Transpose(A, n);


45 const double t1 = omp_get_wtime();


46
47 if (k == 0)
48 if (VerifyTransposed(A,n) > 1e-6)
49 printf("Result of transposition is incorrect!\n");
50
51 if (k >= 4) {
52 if (t == 0) printf("--- start timing statistics collection ---\n");
53 t += (t1-t0);
54 dt += (t1-t0)*(t1-t0);
55 }
56
57 printf("Iteration %d: %.3f ms\n", k+1, 1e3*(t1-t0)); fflush(0);
58 }
59 t /= (double)(nt-4);
60 dt = sqrt(dt/(double)(nt-4) - t*t);
61
62 printf("Result: %.1f GB matrix [%d x %d] transposed in %.3f +- %.3f ms\n",
63 (double)(n*n*sizeof(float))/(double)(1<<30), n, n, 1e3*t, 1e3*dt); fflush(0);
64 _mm_free(A);
65 }

Back to Lab A.4.12.

B.4.12.3 labs/4/4.c-cache-oblivious-recursion/step-00/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.c-cache-oblivious-recursion/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 void Transpose(float* const A, const int n){
11 _Cilk_for (int i = 0; i < n; i++) {
12 for (int j = 0; j < i; j++) {
13 const float c = A[i*n + j];
14 A[i*n + j] = A[j*n + i];
15 A[j*n + i] = c;
16 }
17 }
18 }

Back to Lab A.4.12.

B.4.12.4 labs/4/4.c-cache-oblivious-recursion/step-01/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.c-cache-oblivious-recursion/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9


10 #include <cassert>
11
12 void Transpose(float* const A, const int n){
13 // Tiled algorithm improves data locality by re-using data already in cache
14 #ifdef __MIC__
15 const int TILE = 16;
16 #else
17 const int TILE = 32;
18 #endif
19 assert(n%TILE == 0);
20 _Cilk_for (int ii = 0; ii < n; ii += TILE) {
21 const int iMax = (n < ii+TILE ? n : ii+TILE);
22 for (int jj = 0; jj <= ii; jj += TILE) {
23 for (int i = ii; i < iMax; i++) {
24 const int jMax = (i < jj+TILE ? i : jj+TILE);
25 for (int j = jj; j<jMax; j++) {
26 const float c = A[i*n + j];
27 A[i*n + j] = A[j*n + i];
28 A[j*n + i] = c;
29 }
30 }
31 }
32 }
33 }

Back to Lab A.4.12.



B.4.12.5 labs/4/4.c-cache-oblivious-recursion/step-02/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.c-cache-oblivious-recursion/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <cassert>
11
12 void Transpose(float* const A, const int n){
13 // Tiled algorithm improves data locality by re-using data already in cache
14 #ifdef __MIC__
15 const int TILE = 16;
16 #else
17 const int TILE = 32;
18 #endif
19 assert(n%TILE == 0);
20 _Cilk_for (int ii = 0; ii < n; ii += TILE) {
21 const int iMax = (n < ii+TILE ? n : ii+TILE);
22 for (int jj = 0; jj <= ii; jj += TILE) {
23 for (int i = ii; i < iMax; i++) {
24 const int jMax = (i < jj+TILE ? i : jj+TILE);
25 #pragma loop count avg(TILE)
26 #pragma simd
27 for (int j = jj; j<jMax; j++) {
28 const float c = A[i*n + j];
29 A[i*n + j] = A[j*n + i];
30 A[j*n + i] = c;
31 }


32 }
33 }
34 }
35 }

Back to Lab A.4.12.

B.4.12.6 labs/4/4.c-cache-oblivious-recursion/step-03/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.c-cache-oblivious-recursion/step-03
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 void transpose_cache_oblivious_thread(
11 const int iStart, const int iEnd,

12 const int jStart, const int jEnd,
13 float* A, const int n){
14 #ifdef __MIC__
15 const int RT = 64; // Recursion threshold on coprocessor
16 #else
17 const int RT = 32; // Recursion threshold on host
18 #endif
19 if ( ((iEnd - iStart) <= RT) && ((jEnd - jStart) <= RT) ) {
20 for (int i = iStart; i < iEnd; i++) {
21 int je = (jEnd < i ? jEnd : i);
22 #pragma simd
23 #pragma loop count avg(RT)
24 for (int j = jStart; j < je; j++) {
25 const float c = A[i*n + j];
26 A[i*n + j] = A[j*n + i];
27 A[j*n + i] = c;
28 }
29 }
30 return;
31 }
32
33 if ((jEnd - jStart) > (iEnd - iStart)) {
34 // Split into subtasks j-wise
35 const int jSplit = jStart + (jEnd - jStart)/2;
36 _Cilk_spawn transpose_cache_oblivious_thread(iStart, iEnd, jStart, jSplit, A, n);
37 transpose_cache_oblivious_thread(iStart, iEnd, jSplit, jEnd, A, n);
38 } else {
39 // Split into subtasks i-wise
40 const int iSplit = iStart + (iEnd - iStart)/2;
41 const int jMax = (jEnd < iSplit ? jEnd : iSplit);
42 _Cilk_spawn transpose_cache_oblivious_thread(iStart, iSplit, jStart, jMax, A, n);
43 transpose_cache_oblivious_thread(iSplit, iEnd, jStart, jEnd, A, n);
44 }
45 }
46
47 void Transpose(float* const A, const int n){
48 transpose_cache_oblivious_thread(0, n, 0, n, A, n);
49 }

Back to Lab A.4.12.


B.4.12.7 labs/4/4.c-cache-oblivious-recursion/step-04/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.c-cache-oblivious-recursion/step-04
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 void transpose_cache_oblivious_thread(
11 const int iStart, const int iEnd,
12 const int jStart, const int jEnd,
13 float* A, const int n){
14 #ifdef __MIC__
15 const int RT = 64; // Recursion threshold on coprocessor
16 #else

17 const int RT = 32; // Recursion threshold on host
18 #endif
19 if ( ((iEnd - iStart) <= RT) && ((jEnd - jStart) <= RT) ) {
20 for (int i = iStart; i < iEnd; i++) {
21 int je = (jEnd < i ? jEnd : i);
22 #pragma simd
23 #pragma loop_count avg(RT)
24 for (int j = jStart; j < je; j++) {
25 const float c = A[i*n + j];
26 A[i*n + j] = A[j*n + i];
27 A[j*n + i] = c;
28 }
29 }
30 return;
31 }
32
33 if ((jEnd - jStart) > (iEnd - iStart)) {
34 // Split into subtasks j-wise
35 int jSplit = jStart + (jEnd - jStart)/2;
36 // The following line slightly improves performance by splitting on aligned
37 // boundaries
38 if (jSplit - jSplit%16 > jStart) jSplit -= jSplit%16;
39 _Cilk_spawn transpose_cache_oblivious_thread(iStart, iEnd, jStart, jSplit, A, n);
40 transpose_cache_oblivious_thread(iStart, iEnd, jSplit, jEnd, A, n);
41 } else {
42 // Split into subtasks i-wise
43 int iSplit = iStart + (iEnd - iStart)/2;
44 // The following line slightly improves performance by splitting on aligned
45 // boundaries
46 if (iSplit - iSplit%16 > iStart) iSplit -= iSplit%16;
47 const int jMax = (jEnd < iSplit ? jEnd : iSplit);
48 _Cilk_spawn transpose_cache_oblivious_thread(iStart, iSplit, jStart, jMax, A, n);
49 transpose_cache_oblivious_thread(iSplit, iEnd, jStart, jEnd, A, n);
50 }
51 }
52
53 void Transpose(float* const A, const int n){
54 transpose_cache_oblivious_thread(0, n, 0, n, A, n);
55 }

Back to Lab A.4.12.


B.4.13 Memory Access: Loop Fusion


Back to Lab A.4.13.

B.4.13.1 labs/4/4.d-cache-loop-fusion/step-00/Makefile

CXX = icpc
CXXFLAGS = -openmp
LINKFLAGS = -openmp -mkl

OBJECTS = main.o worker.o


MICOBJECTS = main.oMIC worker.oMIC

.SUFFIXES: .o .cc .oMIC

.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: runme runmeMIC

runme: $(OBJECTS)
$(CXX) $(LINKFLAGS) -o runme $(OBJECTS)

runmeMIC: $(MICOBJECTS)
$(CXX) $(LINKFLAGS) -mmic -o runmeMIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC

Back to Lab A.4.13.

B.4.13.2 labs/4/4.d-cache-loop-fusion/step-00/main.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cc, located at 4/4.d-cache-loop-fusion/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <stdio.h>
11 #include <math.h>
12 #include <omp.h>
13
14 void RunStatistics(const int m, const int n, float* const mean, float* const stdev);
15
16 int main(){
17 const int m = 10000;
18 const int n = 50000;
19
20
21 // Allocating data for results
22 float resultMean[m];
23 float resultStdev[m];


24
25 const int nt = 8;
26 double t = 0.0, dt = 0.0;
27 for (int k = 0; k < nt; k++) {
28
29 const double t0 = omp_get_wtime();
30 RunStatistics(m, n, resultMean, resultStdev);
31 const double t1 = omp_get_wtime();
32
33 if (k >= 2) {
34 t += (t1-t0);
35 dt += (t1-t0)*(t1-t0);
36 }
37
38 printf("Iteration %d: %.3f ms\n", k+1, 1e3*(t1-t0)); fflush(0);
39 }
40 t /= (double)(nt-2);
41 dt = sqrt(dt/(double)(nt-2) - t*t);
42
43 printf("Some of the results:\n ...\n");
44 for (int i = 10; i < 14; i++)
45 printf(" i=%d: x = %.1f+-%.1f (expected = %.1f+-%.1f)\n",
46 i, resultMean[i], resultStdev[i], (float)i, 1.0f);
47 printf(" ...\n"); fflush(stdout);
48
49
50 printf("Average performance: m=%d arrays of size n=%d processed in %.3f +-\
51 %.3f ms\n", m, n, 1e3*t, 1e3*dt); fflush(0);
52 }
Back to Lab A.4.13.



B.4.13.3 labs/4/4.d-cache-loop-fusion/step-00/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.d-cache-loop-fusion/step-00
u
cl

3 is a part of the practical supplement to the handbook


Ex

4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"


5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <omp.h>
11 #include <malloc.h>
12 #include <mkl_vsl.h>
13 #include <math.h>
14
15 void GenerateRandomNumbers(const int m, const int n, float* const data) {
16 // Filling arrays with normally distributed random numbers
17 #pragma omp parallel
18 {
19 VSLStreamStatePtr rng;
20 const int seed = omp_get_thread_num();
21 int status = vslNewStream(&rng, VSL_BRNG_MT19937, omp_get_thread_num());
22
23 #pragma omp for schedule(guided)
24 for (int i = 0; i < m; i++) {
25 const float mean = (float)i;
26 const float stdev = 1.0f;


27 status = vsRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER,
28 rng, n, &data[i*n], mean, stdev);
29 }
30
31 vslDeleteStream(&rng);
32 }
33 }
34
35 void ComputeMeanAndStdev(const int m, const int n, const float* data,
36 float* const resultMean, float* const resultStdev) {
37
38 // Processing data to compute the mean and standard deviation
39 #pragma omp parallel for schedule(guided)
40 for (int i = 0; i < m; i++) {
41 float sumx=0.0f, sumx2=0.0f;
42 #pragma vector aligned
43 for (int j = 0; j < n; j++) {
44 sumx += data[i*n + j];
45 sumx2 += data[i*n + j]*data[i*n + j];
46 }
47 resultMean[i] = sumx/(float)n;
48 resultStdev[i] = sqrtf(sumx2/(float)n-resultMean[i]*resultMean[i]);
49 }
50 }
51
52
53 void RunStatistics(const int m, const int n,
54 float* const resultMean, float* const resultStdev) {
55
56 // Allocating memory for scratch space for the whole problem
57 // m*n elements on heap (does not fit on stack)
58 float* data = (float*) _mm_malloc((size_t)m*(size_t)n*sizeof(float), 64);
59
60 GenerateRandomNumbers(m, n, data);
61 ComputeMeanAndStdev(m, n, data, resultMean, resultStdev);
62
63 // Deallocating scratch space
64 _mm_free(data);
65
66 }

Back to Lab A.4.13.

B.4.13.4 labs/4/4.d-cache-loop-fusion/step-01/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.d-cache-loop-fusion/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <omp.h>
11 #include <mkl_vsl.h>
12 #include <math.h>
13
14 void RunStatistics(const int m, const int n,
15 float* const resultMean, float* const resultStdev) {


16
17 // Allocating memory for scratch space for the whole problem
18 // m*n elements on heap (does not fit on stack)
19 float* data = (float*) _mm_malloc((size_t)m*(size_t)n*sizeof(float), 64);
20
21 #pragma omp parallel
22 {
23 // Initializing a random number generator in each thread
24 VSLStreamStatePtr rng;
25 const int seed = omp_get_thread_num();
26 int status = vslNewStream(&rng, VSL_BRNG_MT19937, omp_get_thread_num());
27
28 #pragma omp for schedule(guided)
29 for (int i = 0; i < m; i++) {
30
31 // Filling arrays with normally distributed random numbers
32 const float seedMean = (float)i;
33 const float seedStdev = 1.0f;
34 status = vsRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER,
35 rng, n, &data[i*n], seedMean, seedStdev);
36
37 // Processing data to compute the mean and standard deviation
38 float sumx=0.0f, sumx2=0.0f;
39 #pragma vector aligned
40 for (int j = 0; j < n; j++) {
41 sumx += data[i*n + j];
42 sumx2 += data[i*n + j]*data[i*n + j];
43 }
44 resultMean[i] = sumx/(float)n;
45 resultStdev[i] = sqrtf(sumx2/(float)n-resultMean[i]*resultMean[i]);
46 }
47
48 vslDeleteStream(&rng);
49 }
50
51 // Deallocating scratch space
52 _mm_free(data);
53 }

Back to Lab A.4.13.
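
Step-01 above fuses the random number generation and the statistics calculation into a single i-loop, and step-02 additionally shrinks the scratch storage from an m-by-n array to one n-element row per thread. A serial sketch of that fused pattern, stripped of OpenMP and MKL (the names below are hypothetical, and the synthetic fill merely stands in for vsRngGaussian):

#include <vector>
#include <cmath>

// Produce one row of data and consume it immediately, while it is still in cache,
// instead of filling an m*n scratch array in one pass and reading it back in another.
void FusedRowStatsSketch(const int m, const int n, float* mean, float* stdev) {
  std::vector<float> row(n);                   // n elements of scratch, not m*n
  for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++)                // "produce" the row
      row[j] = (float)i + 0.001f*(float)j;
    float sx = 0.0f, sx2 = 0.0f;
    for (int j = 0; j < n; j++) {              // "consume" it in the same iteration
      sx  += row[j];
      sx2 += row[j]*row[j];
    }
    mean[i]  = sx/(float)n;
    stdev[i] = sqrtf(sx2/(float)n - mean[i]*mean[i]);
  }
}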

B.4.13.5 labs/4/4.d-cache-loop-fusion/step-02/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.d-cache-loop-fusion/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <omp.h>
11 #include <mkl_vsl.h>
12 #include <math.h>
13
14 void RunStatistics(const int m, const int n,
15 float* const resultMean, float* const resultStdev) {
16
17 #pragma omp parallel


18 {
19 // Allocating scratch data, n elements on stack for each thread
20 float data[n] __attribute__((aligned(64)));
21
22 // Initializing a random number generator in each thread
23 VSLStreamStatePtr rng;
24 const int seed = omp_get_thread_num();
25 int status = vslNewStream(&rng, VSL_BRNG_MT19937, omp_get_thread_num());
26
27 #pragma omp for schedule(guided)
28 for (int i = 0; i < m; i++) {
29
30 // Filling arrays with normally distributed random numbers
31 const float seedMean = (float)i;
32 const float seedStdev = 1.0f;
33 status = vsRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER,
34 rng, n, data, seedMean, seedStdev);
35
36 // Processing data to compute the mean and standard deviation
37 float sumx=0.0f, sumx2=0.0f;
38 #pragma vector aligned
39 for (int j = 0; j < n; j++) {
40 sumx += data[j];
41 sumx2 += data[j]*data[j];
42 }
43 resultMean[i] = sumx/(float)n;
44 resultStdev[i] = sqrtf(sumx2/(float)n-resultMean[i]*resultMean[i]);
45 }
46
47 vslDeleteStream(&rng);
48 }
49
50 }

Back to Lab A.4.13.

B.4.14 Offload Traffic Optimization



Back to Lab A.4.14.

B.4.14.1 labs/4/4.e-offload/step-00/Makefile

CXX = icpc
CXXFLAGS = -openmp -vec-report
LINKFLAGS = -openmp

OBJECTS = main.o worker.o


MICOBJECTS = mainMIC.o workerMIC.o

.SUFFIXES: .o .cc .oMIC

.cc.o:
$(CXX) -c $(CXXFLAGS) -o "$@" "$<"

.cc.oMIC:
$(CXX) -c -mmic $(CXXFLAGS) -o "$@" "$<"

all: runme


runme: $(OBJECTS)
$(CXX) $(LINKFLAGS) -o runme $(OBJECTS)

runmeMIC: $(MICOBJECTS)
$(CXX) $(LINKFLAGS) -mmic -o runmeMIC $(MICOBJECTS)

clean:
rm -f $(OBJECTS) $(MICOBJECTS) runme runmeMIC

Back to Lab A.4.14.

B.4.14.2 labs/4/4.e-offload/step-00/main.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file main.cc, located at 4/4.e-offload/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission

7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <omp.h>
11 #include <cstdlib>
12 #include <cmath>
13 #include <cstdio>
14
15 char* __attribute__((target(mic))) data;
16
17 void DefaultOffload(const size_t size, char* data);
18 void OffloadWithMemoryRetention(const size_t size, char* data, const int k,
19 const int nTrials);
20 void OffloadWithDataPersistence(const size_t size, char* data, const int k,
21 const int nTrials);
22
23 void PerformOffloadTransfer(const size_t size, double & t, double & dt);
24
25 int main(){
26 const size_t sizeMin = 1L<<10L;
27 const size_t sizeMax = 1L<<30L;
28 const size_t sizeFactor = 2;
29 const int skipTrials = 2;
30 const int dropTrials = 1;
31
32 size_t size = sizeMin;
33
34 printf("#%11s%19s%19s%19s%19s%19s\n", "Array", "Default offload",
35 "With memory", "With data", "Allocate+free", "Effective");
36 printf("#%11s%19s%19s%19s%19s%19s\n", "Size, kB", "time, ms",
37 "retention, ms", "persistence, ms", "overhead, ms", "bandwidth, GB/s");
38 fflush(stdout);
39 while (size <= sizeMax) {
40
41 const size_t nTrials = 8L*sqrtf(sqrtf((float)(1L<<30L)/size));
42
43 // Array to be transferred
44 data = (char*) _mm_malloc(size, 64);
45 data[0:size] = 0;
46
47 // Timing the default offload


48 double tDefault = 0.0, dtDefault = 0.0;


49 for (int k = 0; k < nTrials; k++) {
50 const double t0 = omp_get_wtime();
51 DefaultOffload(size, data);
52 const double t1 = omp_get_wtime();
53
54 if ((k >= skipTrials) && (k < nTrials-dropTrials))
55 { tDefault += (t1-t0); dtDefault += (t1-t0)*(t1-t0); }
56 }
57 tDefault /= (double)(nTrials-skipTrials-dropTrials);
58 dtDefault = sqrt(dtDefault/(double)(nTrials-skipTrials-dropTrials) -
59 tDefault*tDefault);
60
61 // Timing the offload with memory retention
62 double tMemR = 0.0, dtMemR = 0.0;
63 for (int k = 0; k < nTrials; k++) {
64 const double t0 = omp_get_wtime();
65 OffloadWithMemoryRetention(size, data, k, nTrials);
66 const double t1 = omp_get_wtime();
67
68 if ((k >= skipTrials) && (k < nTrials-1))
69 { tMemR += (t1-t0); dtMemR += (t1-t0)*(t1-t0); }
70
71 // printf("t=%.2f ms\n", (t1-t0)*1e3);
72 }
73 tMemR /= (double)(nTrials-skipTrials-1);
74 dtMemR = sqrt(dtMemR/(double)(nTrials-skipTrials-1) - tMemR*tMemR);
75
76 // Timing the offload with data persistence
77 double tDataP = 0.0, dtDataP = 0.0;
78 for (int k = 0; k < nTrials; k++) {
79 const double t0 = omp_get_wtime();
80 OffloadWithDataPersistence(size, data, k, nTrials);
81 const double t1 = omp_get_wtime();
82
83 if ((k >= skipTrials) && (k < nTrials-1))
84 { tDataP += (t1-t0); dtDataP += (t1-t0)*(t1-t0); }
85 }
86 tDataP /= (double)(nTrials-skipTrials-1);
87 dtDataP = sqrt(dtDataP/(double)(nTrials-skipTrials-1) - tDataP*tDataP);
88
89 // Bandwidth is computed from the transfer time with memory retention
90 const double bandwidth = (double)(size)/(double)(1L<<30L) / tMemR;
91 const double dBandwidth = bandwidth*(dtMemR/tMemR);
92
93 // The memory allocation latency is the default offload time
94 // minus the offload time with memory retention.
95 const double mallocLat = (tDefault - tMemR);
96 const double dMallocLat = sqrtf(dtDefault*dtDefault + dtMemR*dtMemR);
97
98 printf("%12ld %8.2f +/- %5.2f %8.2f +/- %5.2f %8.2f +/- %5.2f %8.2f +/- %5.2f\
99 %8.2f +/- %5.2f\n",
100 size/(1L<<10L),
101 tDefault*1e3, dtDefault*1e3,
102 tMemR*1e3, dtMemR*1e3,
103 tDataP*1e3, dtDataP*1e3,
104 mallocLat*1e3, dMallocLat*1e3,
105 bandwidth, dBandwidth);
106 fflush(stdout);
107
108 _mm_free(data);
109


110 size *= sizeFactor;


111 }
112 printf("#%11s%58s%38s\n", "",
113 "|<--------------------- measured ---------------------->|",
114 "|<----------- inferred ------------>|");
115 }

Back to Lab A.4.14.

B.4.14.3 labs/4/4.e-offload/step-00/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.e-offload/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <cstdlib>
11
12 void DefaultOffload(const size_t size, char* data) {
13
14 // Default offload procedure:
15 // allocate memory on coprocessor,
16 // transfer data,
17 // perform offload calculations
18 // deallocate memory on coprocessor
19
20 #pragma offload target(mic:1) \
21 in(data: length(size) align(64))
22 {
23 data[0] = 0;
24 }
25 }
26
27 void OffloadWithMemoryRetention(const size_t size, char* data, const int k,
28 const int nTrials) {
29
30 // Write the body of this function so that
31 // the memory container for the data is allocated during the first iteration,
32 // but this allocated memory is retained in the subsequent iterations
33 // and deallocated during the last iteration.
34
35 }
36
37 void OffloadWithDataPersistence(const size_t size, char* data, const int k,
38 const int nTrials) {
39
40 // Write the body of this function so that
41 // the data is transferred to the coprocessor during the first iteration,
42 // allocated memory is retained afterwards, and
43 // data is not transferred in subsequent iterations.
44
45 }

Back to Lab A.4.14.


B.4.14.4 labs/4/4.e-offload/step-01/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.e-offload/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <cstdlib>
11
12 void DefaultOffload(const size_t size, char* data) {
13
14 // Default offload procedure:
15 // allocate memory on coprocessor,
16 // transfer data,
17 // perform offload calculations
18 // deallocate memory on coprocessor
19

g
#pragma offload target(mic:1) \

an
20
21 in(data: length(size) align(64))

W
22 {

g
data[0] = 0;
en
23
24 } h
un
25 }
rY

26
27 void OffloadWithMemoryRetention(const size_t size, char* data, const int k,
fo

28 const int nTrials) {


ed

29
ar

30 // Allocate arrays on coprocessor during the first iteration;


ep

31 // retain allocated memory for subsequent iterations


Pr

32 #pragma offload target(mic:1) \


33 in(data: length(size) alloc_if(k==0) free_if(k==nTrials-1) align(64))
y
el

34 {
siv

35 data[0] = 0;
u

36 }
cl

37 }
Ex

38
39 void OffloadWithDataPersistence(const size_t size, char* data, const int k,
40 const int nTrials) {
41
42 // Write the body of this function so that
43 // the data is transferred to the coprocessor during the first iteration,
44 // allocated memory is retained afterwards, and
45 // data is not transferred in subsequent iterations.
46
47 }

Back to Lab A.4.14.

B.4.14.5 labs/4/4.e-offload/step-02/worker.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.


2 file worker.cc, located at 4/4.e-offload/step-02
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8


6 Redistribution or commercial usage without a written permission


7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <cstdlib>
11
12 void DefaultOffload(const size_t size, char* data) {
13
14 // Default offload procedure:
15 // allocate memory on coprocessor,
16 // transfer data,
17 // perform offload calculations
18 // deallocate memory on coprocessor
19
20 #pragma offload target(mic:1) \
21 in(data: length(size) align(64))
22 {
23 data[0] = 0;
24 }
25 }
26
27 void OffloadWithMemoryRetention(const size_t size, char* data, const int k,
28 const int nTrials) {
29
30 // Allocate arrays on coprocessor during the first iteration;
31 // retain allocated memory for subsequent iterations
32 #pragma offload target(mic:1) \
33 in(data: length(size) alloc_if(k==0) free_if(k==nTrials-1) align(64))
34 {
35 data[0] = 0;
36 }
37 }
38
39 void OffloadWithDataPersistence(const size_t size, char* data, const int k,
40 const int nTrials) {
41
42 // Transfer data during the first iteration;
43 // skip transfer for subsequent iterations
44 const size_t transferSize = ( k == 0 ? size : 0);
45 #pragma offload target(mic:1) \
46 in(data: length(transferSize) alloc_if(k==0) free_if(k==nTrials-1) align(64))
47 {
48 data[0] = 0;
49 }
50 }

Back to Lab A.4.14.

B.4.15 MPI: Load Balancing


Back to Lab A.4.15

B.4.15.1 labs/4/4.f-MPI-load-balance/step-00/Makefile

all:
mpiicpc -mkl -o pi-host pi.cc
mpiicpc -mkl -o pi-mic -mmic pi.cc
scp pi-mic mic0:~/


run: runhost runmic runboth

runhost:
mpirun -host localhost -np 32 ./pi-host

runmic:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host mic0 -np 240 ~/pi-mic

runboth:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 32 ./pi-host : -host mic0 -np 240 ~/pi-mic

clean:
rm -f pi-host pi-mic

Back to Lab A.4.15

B.4.15.2 labs/4/4.f-MPI-load-balance/step-00/pi.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file pi.cc, located at 4/4.f-MPI-load-balance/step-00
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <mpi.h>
11 #include <stdio.h>
12 #include <stdlib.h>
13 #include <mkl_vsl.h>
14
15 const long iter=1L<<32L, BLOCK_SIZE=4096L, nBlocks=iter/BLOCK_SIZE, nTrials = 10;
16
17 void RunMonteCarlo(const long firstBlock, const long lastBlock,
18 VSLStreamStatePtr & stream, long & dUnderCurve) {
19 // Performs the Monte Carlo calculation for blocks in the range [firstBlock; lastBlock)
20 // to count the number of random points inside of the quarter circle
21
22 long j, i;
23 float r[BLOCK_SIZE*2] __attribute__((align(64)));
24
25 for (j = firstBlock; j < lastBlock; j++) {
26
27 vsRngUniform( 0, stream, BLOCK_SIZE*2, r, 0.0f, 1.0f );
28 for (i = 0; i < BLOCK_SIZE; i++) {
29 const float x = r[i];
30 const float y = r[i+BLOCK_SIZE];
31 // Count points inside quarter circle
32 if (x*x + y*y < 1.0f) dUnderCurve++;
33 }
34 }
35
36 }
37
38 int main(int argc, char *argv[]){


39
40 int rank, nRanks, trial;
41
42 MPI_Init(&argc, &argv);
43 MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
44 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
45
46 // Work sharing: equal amount of work in each process
47 const double blocksPerProc = (double)nBlocks / (double)nRanks;
48
49 for (trial = 0; trial < nTrials; trial++) { // Multiple trials
50
51 const double start_time = MPI_Wtime();
52 long dUnderCurve=0, UnderCurveSum=0;
53
54 // Create and initialize a random number generator from MKL
55 VSLStreamStatePtr stream;
56 vslNewStream( &stream, VSL_BRNG_MT19937, trial*nRanks + rank );
57
58 // Range of blocks processed by this process
59 const long myFirstBlock = (long)(blocksPerProc*rank);
60 const long myLastBlock = (long)(blocksPerProc*(rank+1));
61
62 RunMonteCarlo(myFirstBlock, myLastBlock, stream, dUnderCurve);
63
64 vslDeleteStream( &stream );
65
66 // Compute pi
67 MPI_Reduce(&dUnderCurve, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
68 if (rank==0) {
69 const double pi = (double)UnderCurveSum / (double) iter * 4.0 ;
70 const double end_time = MPI_Wtime();
71 const double pi_exact=3.141592653589793;
72 if (trial == 0) printf("#%9s %8s %7s\n", "pi", "Rel.err", "Time, s");
73 printf ("%10.8f %8.1e %7.3f\n",
74 pi, (pi-pi_exact)/pi_exact, end_time-start_time);
75 fflush(0);
76 }
77
78 MPI_Barrier(MPI_COMM_WORLD);
79 }
80 MPI_Finalize();
81 }

Back to Lab A.4.15
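
The step-00 version above divides the blocks evenly among all MPI ranks regardless of whether a rank runs on the host or on the coprocessor; step-01 below introduces the ALPHA parameter to weight host ranks differently. A worked illustration of that formula with hypothetical rank counts (32 host ranks, 240 coprocessor ranks, ALPHA = 2) is sketched here; none of these numbers come from the lab itself:

#include <cstdio>

// Worked example of the ALPHA-weighted work sharing used in pi-static.cc (step-01).
int main() {
  const double nBlocks      = 1048576.0;   // total number of blocks, for illustration only
  const int    nProcsOnHost = 32;          // assumed host rank count
  const int    nProcsOnMIC  = 240;         // assumed coprocessor rank count
  const double alpha        = 2.0;         // one host rank gets alpha times the work of a coprocessor rank

  const double hostBlocksPerRank = alpha*nBlocks / (alpha*nProcsOnHost + nProcsOnMIC);
  const double micBlocksPerRank  =       nBlocks / (alpha*nProcsOnHost + nProcsOnMIC);

  // With these numbers: about 6900 blocks per host rank, about 3450 per coprocessor rank.
  printf("host rank: %.0f blocks, coprocessor rank: %.0f blocks\n",
         hostBlocksPerRank, micBlocksPerRank);
  return 0;
}

Tuning ALPHA shifts work between the host and the coprocessor until both finish their trials at roughly the same time.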

B.4.15.3 labs/4/4.f-MPI-load-balance/step-01/Makefile

all:
mpiicpc -mkl -o pi-static-host pi-static.cc
mpiicpc -mkl -o pi-static-mic -mmic pi-static.cc
scp pi-static-mic mic0:~/
scp pi-static-mic mic1:~/

run: runhost runmic runboth runall

runhost:
mpirun -host localhost -np 32 ./pi-static-host

runmic:


LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host mic0 -np 240 ~/pi-static-mic

runboth:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 32 ./pi-static-host : -host mic0 -np 240 ~/pi-static-mic

runall:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
I_MPI_MIC=1 \
mpirun -host localhost -np 32 ./pi-static-host : \
-host mic0 -np 240 ~/pi-static-mic \
-host mic1 -np 240 ~/pi-static-mic

clean:
rm -f pi-static-host pi-static-mic

Back to Lab A.4.15

B.4.15.4 labs/4/4.f-MPI-load-balance/step-01/pi-static.cc

1 /* Copyright (c) 2013, Colfax International. All Right Reserved.
2 file pi-static.cc, located at 4/4.f-MPI-load-balance/step-01
3 is a part of the practical supplement to the handbook
4 "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
5 (c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
6 Redistribution or commercial usage without a written permission
7 from Colfax International is prohibited.
8 Contact information can be found at http://colfax-intl.com/ */
9
10 #include <mpi.h>
11 #include <stdio.h>
12 #include <stdlib.h>
13 #include <mkl_vsl.h>
14
15 const long iter=1L<<32L, BLOCK_SIZE=4096L, nBlocks=iter/BLOCK_SIZE, nTrials = 10;


16
17 void RunMonteCarlo(const long firstBlock, const long lastBlock,
18 VSLStreamStatePtr & stream, long & dUnderCurve) {
19 // Performs the Monte Carlo calculation for blocks in the range [firstBlock; lastBlock)
20 // to count the number of random points inside of the quarter circle
21
22 long j, i;
23 float r[BLOCK_SIZE*2] __attribute__((align(64)));
24
25 for (j = firstBlock; j < lastBlock; j++) {
26
27 vsRngUniform( 0, stream, BLOCK_SIZE*2, r, 0.0f, 1.0f );
28 for (i = 0; i < BLOCK_SIZE; i++) {
29 const float x = r[i];
30 const float y = r[i+BLOCK_SIZE];
31 // Count points inside quarter circle
32 if (x*x + y*y < 1.0f) dUnderCurve++;
33 }
34 }
35
36 }
37

Prepared for Yunheng Wang c Colfax International, 2013


474 APPENDIX B. SOURCE CODE FOR PRACTICAL EXERCISES

38 int main(int argc, char *argv[]){


39 const long iter=1L<<32L, BLOCK_SIZE=4096, nBlocks=iter/BLOCK_SIZE, nTrials = 10;
40 int rank, size, nProcsOnMIC, nProcsOnHost, thisProcOnMIC=0, thisProcOnHost=0;
41 MPI_Init(&argc, &argv);
42 MPI_Comm_size(MPI_COMM_WORLD, &size); MPI_Comm_rank(MPI_COMM_WORLD, &rank);
43
44 // Count the number of processes running on the host and on coprocessors
45 #ifdef __MIC__
46 thisProcOnMIC++; // This process is running on an Intel Xeon Phi coprocessor
47 #else
48 thisProcOnHost++; // This process is running on an Intel Xeon processor
49 #endif
50 MPI_Allreduce(&thisProcOnMIC, &nProcsOnMIC, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
51 MPI_Allreduce(&thisProcOnHost, &nProcsOnHost, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
52
53 // Work sharing calculation
54 const char* alphast = getenv("ALPHA"); // Load balance parameter
55 if (alphast==NULL) { printf("ALPHA is undefined\n"); exit(1); }
56 const double alpha = atof(alphast);
57 #ifndef __MIC__
58 // Blocks per rank on host
const double blocksPerRank =

g
59

an
60 ( nProcsOnMIC > 0 ? alpha*nBlocks/(alpha*nProcsOnHost+nProcsOnMIC) :

W
61 (double)nBlocks/nProcsOnHost );
const long blockOffset = 0;
ng
62
63 const int rankOnDevice = rank;
he

64 #else
un

65 // Blocks per rank on coprocessor


rY

66 const double blocksPerRank = nBlocks / (alpha*nProcsOnHost + nProcsOnMIC);


fo

67 const long blockOffset = nProcsOnHost*alpha*nBlocks / (alpha*nProcsOnHost + nProcsOnMIC);


const int rankOnDevice = rank - nProcsOnHost;
d

68
re

69 #endif
pa

70
re

71 for (int t = 0; t < nTrials; t++) { // Multiple trials


yP

72 // Monte Carlo method


73 const double start_time = MPI_Wtime();
el

74 long dUnderCurve=0, UnderCurveSum=0;


siv

75
u

// Create and initialize a random number generator from MKL


cl

76
VSLStreamStatePtr stream;
Ex

77
78 vslNewStream(&stream, VSL_BRNG_MT19937, rank*nTrials + t);
79
80 // Range of blocks processed by this process
81 const long myFirstBlock = blockOffset + (long)(blocksPerRank*rankOnDevice);
82 const long myLastBlock = blockOffset + (long)(blocksPerRank*(rankOnDevice+1));
83
84 RunMonteCarlo(myFirstBlock, myLastBlock, stream, dUnderCurve);
85
86 vslDeleteStream( &stream );
87
88 // Reduction to collect the results of the Monte Carlo calculation
89 MPI_Reduce(&dUnderCurve, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
90
91 double unbalancedTime = MPI_Wtime();
92 MPI_Barrier(MPI_COMM_WORLD);
93 unbalancedTime = MPI_Wtime() - unbalancedTime;
94
95 // Timing collection
96 double hostUnbalancedTime = 0.0, MICUnbalancedTime = 0.0;
97 #ifdef __MIC__
98 MICUnbalancedTime += unbalancedTime;
99 #else

Parallel Programming and Optimization with Intel Xeon Phi Coprocessors


B.4. SOURCE CODE FOR CHAPTERS 4 AND 5: OPTIMIZING APPLICATIONS 475

100 hostUnbalancedTime += unbalancedTime;


101 #endif
102 double averageHostUnbalancedTime = 0.0, averageMICUnbalancedTime = 0.0;
103 MPI_Reduce(&hostUnbalancedTime, &averageHostUnbalancedTime, 1, MPI_DOUBLE, MPI_SUM,
104 0, MPI_COMM_WORLD);
105 MPI_Reduce(&MICUnbalancedTime, &averageMICUnbalancedTime, 1, MPI_DOUBLE, MPI_SUM,
106 0, MPI_COMM_WORLD);
107 if (nProcsOnHost > 0)
108 averageHostUnbalancedTime /= nProcsOnHost;
109 if (nProcsOnMIC > 0)
110 averageMICUnbalancedTime /= nProcsOnMIC;
111
112 // Compute pi
113 if (rank==0) {
114 const double pi = (double) UnderCurveSum / (double) iter * 4.0 ;
115 const double end_time = MPI_Wtime();
116 const double pi_exact=3.141592653589793;
117 if (t==0) printf("#%9s %8s %7s %9s %14s %14s\n",
118 "pi", "Rel.err", "Time, s", "Alpha", "Host unbal., s", "MIC unbal., s");
119 printf ("%10.8f %8.1e %7.3f %9.3f %14.3f %14.3f\n",
120 pi, (pi-pi_exact)/pi_exact, end_time-start_time, alpha,
averageHostUnbalancedTime, averageMICUnbalancedTime);

g
121

an
122 fflush(0);

W
123 }

g
124

en
125 MPI_Barrier(MPI_COMM_WORLD); h
126 }
un
127 MPI_Finalize();
rY

128 }
fo
ed
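
To see what the work sharing formulas above do in practice, the following standalone sketch (not part of the lab code; the rank counts and the ALPHA value are assumed purely for illustration) evaluates them for 32 host ranks and 240 coprocessor ranks with ALPHA=2, i.e., under the assumption that one host rank processes blocks twice as fast as one coprocessor rank:

#include <stdio.h>
int main() {
  const long nBlocks = 1048576;                   // as in the listing: 2^32 iterations / 4096 per block
  const int nProcsOnHost = 32, nProcsOnMIC = 240; // assumed example rank counts
  const double alpha = 2.0;                       // assumed example value of ALPHA
  // Same expressions as in pi-static.cc:
  const double hostBlocksPerRank = alpha*nBlocks/(alpha*nProcsOnHost + nProcsOnMIC);
  const double micBlocksPerRank  = nBlocks/(alpha*nProcsOnHost + nProcsOnMIC);
  const long   micBlockOffset    = nProcsOnHost*alpha*nBlocks/(alpha*nProcsOnHost + nProcsOnMIC);
  printf("host rank: %.1f blocks, MIC rank: %.1f blocks, MIC offset: %ld\n",
         hostBlocksPerRank, micBlocksPerRank, micBlockOffset);
  return 0;
}

With these numbers the sketch prints approximately 6898.5 blocks per host rank, 3449.3 blocks per coprocessor rank, and an offset of 220752: the host ranks jointly cover the first 220752 blocks, the coprocessor ranks cover the remainder, and each host rank receives ALPHA times the share of a coprocessor rank.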

Back to Lab A.4.15



B.4.15.5 labs/4/4.f-MPI-load-balance/step-02/pi-dynamic.cc
/* Copyright (c) 2013, Colfax International. All Right Reserved.
file pi-dynamic.cc, located at 4/4.f-MPI-load-balance/step-02
is a part of the practical supplement to the handbook
"Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
(c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
Redistribution or commercial usage without a written permission
from Colfax International is prohibited.
Contact information can be found at http://colfax-intl.com/ */

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <mkl_vsl.h>

const long iter=1L<<32L, BLOCK_SIZE=4096L, nBlocks=iter/BLOCK_SIZE, nTrials = 10;

void RunMonteCarlo(const long firstBlock, const long lastBlock, float* r,
                   VSLStreamStatePtr & stream, long & dUnderCurve) {

  // Performs the Monte Carlo calculation for blocks in the range [firstBlock; lastBlock)
  // to count the number of random points inside of the quarter circle

  long j, i;

  for (j = firstBlock; j < lastBlock; j++) {

    vsRngUniform( 0, stream, BLOCK_SIZE*2, r, 0.0f, 1.0f );
    for (i = 0; i < BLOCK_SIZE; i++) {
      const float x = r[i];
      const float y = r[i+BLOCK_SIZE];
      // Count points inside quarter circle
      if (x*x + y*y < 1.0f) dUnderCurve++;
    }
  }

}

int main(int argc, char *argv[]){

  int rank, nRanks, worker;
  long grainSize, msg[2]; // MPI message; msg[0] is blockStart, msg[1] is blockEnd
  MPI_Status stat;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nRanks); MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Count the number of processes running on the host and on coprocessors
  int nProcsOnMIC, nProcsOnHost, thisProcOnMIC=0, thisProcOnHost=0;
  if (rank != 0) {
#ifdef __MIC__
    thisProcOnMIC++; // This process is running on an Intel Xeon Phi coprocessor
#else
    thisProcOnHost++; // This process is running on an Intel Xeon processor
#endif
  }
  MPI_Allreduce(&thisProcOnMIC, &nProcsOnMIC, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  MPI_Allreduce(&thisProcOnHost, &nProcsOnHost, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

  for (int t = 0; t < nTrials; t++) { // Multiple trials

    long dUnderCurve = 0, UnderCurveSum = 0;
    double hostSchedulingWait = 0.0, MICSchedulingWait = 0.0;
    const double start_time = MPI_Wtime();

    if (rank == 0) {
      // Boss assigns work
      const char* grainSizeSt = getenv("GRAIN_SIZE");
      if (grainSizeSt == NULL) { printf("GRAIN_SIZE undefined\n"); exit(1); }
      grainSize = atof(grainSizeSt);
      long currentBlock = 0;
      while (currentBlock < nBlocks) {
        msg[0] = currentBlock; // First block for next worker
        msg[1] = currentBlock + grainSize; // Last block
        if (msg[1] > nBlocks) msg[1] = nBlocks; // Stay in range

        // Wait for next worker
        MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                 &stat);

        // Assign work to next worker
        MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);

        currentBlock = msg[1]; // Update position
      }

      // Terminate workers
      msg[0] = -1; msg[1] = -2;
      for (int i = 1; i < nRanks; i++) {
        MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                 &stat);
        MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
      }

    } else {
      // Worker performs the Monte Carlo calculation
      VSLStreamStatePtr stream; // Create & initialize a random number generator from MKL
      vslNewStream(&stream, VSL_BRNG_MT19937, rank*nTrials + t);
      float r[BLOCK_SIZE*2] __attribute__((align(64)));

      // Range of blocks processed by this worker
      msg[0] = 0;
      while (msg[0] >= 0) {
        double waitTime = MPI_Wtime();
        MPI_Send(&rank, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
        MPI_Recv(&msg, 2, MPI_LONG, 0, rank, MPI_COMM_WORLD, &stat);
        waitTime = MPI_Wtime() - waitTime;
#ifdef __MIC__
        MICSchedulingWait += waitTime;
#else
        hostSchedulingWait += waitTime;
#endif
        const long myFirstBlock = msg[0];
        const long myLastBlock = msg[1];

        RunMonteCarlo(myFirstBlock, myLastBlock, r, stream, dUnderCurve);
      }
      vslDeleteStream( &stream );
    }

    // Reduction to collect the results of the Monte Carlo calculation
    MPI_Reduce(&dUnderCurve, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    double unbalancedTime = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);
    unbalancedTime = (rank == 0 ? 0.0 : MPI_Wtime() - unbalancedTime);

    // Timing collection
    double hostUnbalancedTime = 0.0, MICUnbalancedTime = 0.0;
#ifdef __MIC__
    MICUnbalancedTime += unbalancedTime;
#else
    hostUnbalancedTime += unbalancedTime;
#endif
    double averageHostUnbalancedTime = 0.0, averageMICUnbalancedTime = 0.0;
    double averageHostSchedulingWait = 0.0, averageMICSchedulingWait = 0.0;
    MPI_Reduce(&hostUnbalancedTime, &averageHostUnbalancedTime, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    MPI_Reduce(&MICUnbalancedTime, &averageMICUnbalancedTime, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    MPI_Reduce(&hostSchedulingWait, &averageHostSchedulingWait, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    MPI_Reduce(&MICSchedulingWait, &averageMICSchedulingWait, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    if (nProcsOnHost > 0) {
      averageHostUnbalancedTime /= nProcsOnHost;
      averageHostSchedulingWait /= nProcsOnHost;
    }
    if (nProcsOnMIC > 0) {
      averageMICUnbalancedTime /= nProcsOnMIC;
      averageMICSchedulingWait /= nProcsOnMIC;
    }

    // Compute pi
    if (rank==0) {
      const double pi = (double) UnderCurveSum / (double) iter * 4.0 ;
      const double end_time = MPI_Wtime();
      const double pi_exact=3.141592653589793;
      if (t==0) printf("#%9s %8s %7s %9s %14s %14s %14s %14s\n",
        "pi", "Rel.err", "Time, s", "GrainSize", "Host unbal., s", "MIC unbal., s",
        "Host sched, s.", "MIC sched, s.");
      printf ("%10.8f %8.1e %7.3f %9ld %14.3f %14.3f %14.3f %14.3f\n",
        pi, (pi-pi_exact)/pi_exact, end_time-start_time, grainSize,
        averageHostUnbalancedTime, averageMICUnbalancedTime, averageHostSchedulingWait,
        averageMICSchedulingWait);
      fflush(0);
    }
  }
  MPI_Finalize();
}
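
Two details of the boss-worker protocol above deserve a comment. First, the scheduling traffic is modest: nBlocks = 2^32/4096 = 1048576 blocks, so a grain size of, say, 1000 blocks results in about 1049 receive-send round trips between the boss and the workers per trial, one per work assignment. Second, the termination message (-1, -2) exploits the structure of the worker loop: after receiving it, the worker still calls RunMonteCarlo once, but the block range [-1; -2) is empty, so the call returns immediately and the while (msg[0] >= 0) test then ends the loop.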

Back to Lab A.4.15

B.4.15.6 labs/4/4.f-MPI-load-balance/step-03/Makefile
all:
	mpiicpc -mkl -openmp -o pi-boss-dynamic pi-dynamic-hybrid.cc
	mpiicpc -mkl -openmp -o pi-worker-hybrid-host pi-dynamic-hybrid.cc
	mpiicpc -mkl -openmp -o pi-worker-hybrid-mic -mmic pi-dynamic-hybrid.cc
	scp pi-worker-hybrid-mic mic0:~/

run: runhost runmic runboth

runhost: runhost1 runhost4 runhost16 runhostall

runhost1:
	LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
	I_MPI_MIC=1 \
	mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
	-host localhost -np 32 -env OMP_NUM_THREADS 1 ./pi-worker-hybrid-host

runhost4:
	LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
	I_MPI_MIC=1 \
	mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
	-host localhost -np 8 -env OMP_NUM_THREADS 4 ./pi-worker-hybrid-host

runhost16:
	LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
	I_MPI_MIC=1 \
	mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
	-host localhost -np 2 -env OMP_NUM_THREADS 16 ./pi-worker-hybrid-host

runhostall:
	LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
	I_MPI_MIC=1 \
	mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
	-host localhost -np 1 -env OMP_NUM_THREADS 32 ./pi-worker-hybrid-host

runmic: runmic1 runmic4 runmic16 runmicall

runmic1:
	LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
	I_MPI_MIC=1 \
	mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
	-host mic0 -np 240 -env OMP_NUM_THREADS 1 ~/pi-worker-hybrid-mic

runmic4:
	LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
	I_MPI_MIC=1 \
	mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
	-host mic0 -np 60 -env OMP_NUM_THREADS 4 ~/pi-worker-hybrid-mic

runmic16:
	LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
	I_MPI_MIC=1 \
	mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
	-host mic0 -np 15 -env OMP_NUM_THREADS 16 ~/pi-worker-hybrid-mic

runmicall:
	LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
	I_MPI_MIC=1 \
	mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
	-host mic0 -np 1 -env OMP_NUM_THREADS 240 ~/pi-worker-hybrid-mic

runboth: runboth1 runboth4 runboth16 runbothall

runboth1:
	LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
	I_MPI_MIC=1 \
	mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
	-host localhost -np 32 -env OMP_NUM_THREADS 1 ./pi-worker-hybrid-host : \
	-host mic0 -np 240 -env OMP_NUM_THREADS 1 ~/pi-worker-hybrid-mic

runboth4:
	LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
	I_MPI_MIC=1 \
	mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
	-host localhost -np 8 -env OMP_NUM_THREADS 4 ./pi-worker-hybrid-host : \
	-host mic0 -np 60 -env OMP_NUM_THREADS 4 ~/pi-worker-hybrid-mic

runboth16:
	LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
	I_MPI_MIC=1 \
	mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
	-host localhost -np 2 -env OMP_NUM_THREADS 16 ./pi-worker-hybrid-host : \
	-host mic0 -np 15 -env OMP_NUM_THREADS 16 ~/pi-worker-hybrid-mic

runbothall:
	LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MIC_LD_LIBRARY_PATH} \
	I_MPI_MIC=1 \
	mpirun -host localhost -np 1 -env I_MPI_PIN 0 ./pi-boss-dynamic : \
	-host localhost -np 1 -env OMP_NUM_THREADS 32 ./pi-worker-hybrid-host : \
	-host mic0 -np 1 -env OMP_NUM_THREADS 240 ~/pi-worker-hybrid-mic

clean:
	rm -f pi-boss-dynamic pi-worker-hybrid-host pi-worker-hybrid-mic
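
Note that in every target above the product of the -np value and OMP_NUM_THREADS is held fixed: 32x1, 8x4, 2x16 and 1x32 all put 32 worker threads on the host, and 240x1 through 1x240 all put 240 on the coprocessor, so the run* targets compare different MPI-process/OpenMP-thread decompositions at the same total concurrency. The boss is started with -env I_MPI_PIN 0, which disables Intel MPI process pinning for that rank, so the lightweight scheduler process can float between cores instead of claiming one exclusively from the workers.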

Back to Lab A.4.15

B.4.15.7 labs/4/4.f-MPI-load-balance/step-03/pi-dynamic-hybrid.cc

/* Copyright (c) 2013, Colfax International. All Right Reserved.
file pi-dynamic-hybrid.cc, located at 4/4.f-MPI-load-balance/step-03
is a part of the practical supplement to the handbook
"Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
(c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
Redistribution or commercial usage without a written permission
from Colfax International is prohibited.
Contact information can be found at http://colfax-intl.com/ */

#include <omp.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <mkl_vsl.h>

const long iter=1L<<32L, BLOCK_SIZE=4096L, nBlocks=iter/BLOCK_SIZE, nTrials = 10;

void RunMonteCarlo(const long firstBlock, const long lastBlock,
                   VSLStreamStatePtr *stream, long & dUnderCurveExt) {

  // Performs the Monte Carlo calculation for blocks in the range [firstBlock; lastBlock)
  // to count the number of random points inside of the quarter circle

  long j;
  long dUnderCurve = 0;
#pragma omp parallel
  {
    float r[BLOCK_SIZE*2] __attribute__((align(64)));
    const int myThread = omp_get_thread_num();
#pragma omp for schedule(dynamic) reduction(+: dUnderCurve)
    for (j = firstBlock; j < lastBlock; j++) {
      vsRngUniform( 0, stream[myThread], BLOCK_SIZE*2, r, 0.0f, 1.0f );
      for (long i = 0; i < BLOCK_SIZE; i++) { // i is declared here so that each thread has a private copy
        const float x = r[i];
        const float y = r[i+BLOCK_SIZE];
        // Count points inside quarter circle
        if (x*x + y*y < 1.0f) dUnderCurve++;
      }
    }
  }

  dUnderCurveExt += dUnderCurve;

}

int main(int argc, char *argv[]){

  int rank, nRanks, worker;
  long grainSize, msg[2]; // MPI message; msg[0] is blockStart, msg[1] is blockEnd
  MPI_Status stat;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nRanks); MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Count the number of processes running on the host and on coprocessors
  int nProcsOnMIC, nProcsOnHost, thisProcOnMIC=0, thisProcOnHost=0;
  if (rank != 0) {
#ifdef __MIC__
    thisProcOnMIC++; // This process is running on an Intel Xeon Phi coprocessor
#else
    thisProcOnHost++; // This process is running on an Intel Xeon processor
#endif
  }
  MPI_Allreduce(&thisProcOnMIC, &nProcsOnMIC, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  MPI_Allreduce(&thisProcOnHost, &nProcsOnHost, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  for (int t = 0; t < nTrials; t++) { // Multiple trials

    long dUnderCurve = 0, UnderCurveSum = 0;
    double hostSchedulingWait = 0.0, MICSchedulingWait = 0.0;
    const double start_time = MPI_Wtime();

    if (rank == 0) {
      // Boss assigns work
      const char* grainSizeSt = getenv("GRAIN_SIZE");
      if (grainSizeSt == NULL) { printf("GRAIN_SIZE undefined\n"); exit(1); }
      grainSize = atof(grainSizeSt);
      long currentBlock = 0;
      while (currentBlock < nBlocks) {
        msg[0] = currentBlock; // First block for next worker
        msg[1] = currentBlock + grainSize; // Last block
        if (msg[1] > nBlocks) msg[1] = nBlocks; // Stay in range

        // Wait for next worker
        MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                 &stat);

        // Assign work to next worker
        MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);

        currentBlock = msg[1]; // Update position
      }

      // Terminate workers
      msg[0] = -1; msg[1] = -2;
      for (int i = 1; i < nRanks; i++) {
        MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                 &stat);
        MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
      }

    } else {
      // Worker performs the Monte Carlo calculation

      // Create and initialize a random number generator from MKL
      VSLStreamStatePtr stream[omp_get_max_threads()];
#pragma omp parallel
      {
        // Each thread gets its own random seed
        const int randomSeed = nTrials*omp_get_thread_num()*nRanks + nTrials*rank + t;
        vslNewStream(&stream[omp_get_thread_num()], VSL_BRNG_MT19937, randomSeed);
      }

      msg[0] = 0;
      while (msg[0] >= 0) {
        // Receive from boss the range of blocks to process
        double waitTime = MPI_Wtime();
        MPI_Send(&rank, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
        MPI_Recv(&msg, 2, MPI_LONG, 0, rank, MPI_COMM_WORLD, &stat);
        waitTime = MPI_Wtime() - waitTime;
#ifdef __MIC__
        MICSchedulingWait += waitTime;
#else
        hostSchedulingWait += waitTime;
#endif
        const long myFirstBlock = msg[0];
        const long myLastBlock = msg[1];

        RunMonteCarlo(myFirstBlock, myLastBlock, stream, dUnderCurve);

      }

#pragma omp parallel
      {
        vslDeleteStream( &stream[omp_get_thread_num()] );
      }
    }

    // Reduction to collect the results of the Monte Carlo calculation
    MPI_Reduce(&dUnderCurve, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    double unbalancedTime = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);
    unbalancedTime = (rank == 0 ? 0.0 : MPI_Wtime() - unbalancedTime);

    // Timing collection
    double hostUnbalancedTime = 0.0, MICUnbalancedTime = 0.0;
#ifdef __MIC__
    MICUnbalancedTime += unbalancedTime;
#else
    hostUnbalancedTime += unbalancedTime;
#endif
    double averageHostUnbalancedTime = 0.0, averageMICUnbalancedTime = 0.0;
    double averageHostSchedulingWait = 0.0, averageMICSchedulingWait = 0.0;
    MPI_Reduce(&hostUnbalancedTime, &averageHostUnbalancedTime, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    MPI_Reduce(&MICUnbalancedTime, &averageMICUnbalancedTime, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    MPI_Reduce(&hostSchedulingWait, &averageHostSchedulingWait, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    MPI_Reduce(&MICSchedulingWait, &averageMICSchedulingWait, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    if (nProcsOnHost > 0) {
      averageHostUnbalancedTime /= nProcsOnHost;
      averageHostSchedulingWait /= nProcsOnHost;
    }
    if (nProcsOnMIC > 0) {
      averageMICUnbalancedTime /= nProcsOnMIC;
      averageMICSchedulingWait /= nProcsOnMIC;
    }

    // Compute pi
    if (rank==0) {
      const double pi = (double) UnderCurveSum / (double) iter * 4.0 ;
      const double end_time = MPI_Wtime();
      const double pi_exact=3.141592653589793;
      if (t==0) printf("#%9s %8s %7s %9s %14s %14s %14s %14s\n",
        "pi", "Rel.err", "Time, s", "GrainSize", "Host unbal., s", "MIC unbal., s",
        "Host sched, s.", "MIC sched, s.");
      printf ("%10.8f %8.1e %7.3f %9ld %14.3f %14.3f %14.3f %14.3f\n",
        pi, (pi-pi_exact)/pi_exact, end_time-start_time, grainSize,
        averageHostUnbalancedTime, averageMICUnbalancedTime,
        averageHostSchedulingWait, averageMICSchedulingWait);
      fflush(0);
    }
  }
  MPI_Finalize();
}
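
The per-thread seed used above, nTrials*omp_get_thread_num()*nRanks + nTrials*rank + t, can be read as nTrials*(thread*nRanks + rank) + t; because t < nTrials and rank < nRanks, it is distinct for every (thread, rank, t) combination, so no two threads anywhere in the job ever share an MKL stream. A minimal standalone check of this property (not part of the lab code; the sizes are small assumed examples):

#include <stdio.h>
#include <set>
int main() {
  const int nTrials = 10, nRanks = 5, nThreads = 4; // small assumed example sizes
  std::set<int> seeds;
  for (int thread = 0; thread < nThreads; thread++)
    for (int rank = 0; rank < nRanks; rank++)
      for (int t = 0; t < nTrials; t++) {
        const int seed = nTrials*thread*nRanks + nTrials*rank + t;
        if (!seeds.insert(seed).second) { printf("seed collision!\n"); return 1; }
      }
  printf("all %d seeds are distinct\n", (int)seeds.size()); // prints: all 200 seeds are distinct
  return 0;
}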


Back to Lab A.4.15

B.4.15.8 labs/4/4.f-MPI-load-balance/step-04/pi-guided-hybrid.cc

/* Copyright (c) 2013, Colfax International. All Right Reserved.
file pi-guided-hybrid.cc, located at 4/4.f-MPI-load-balance/step-04
is a part of the practical supplement to the handbook
"Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"
(c) Colfax International, 2013, ISBN: 978-0-9885234-1-8
Redistribution or commercial usage without a written permission
from Colfax International is prohibited.
Contact information can be found at http://colfax-intl.com/ */

#include <omp.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <mkl_vsl.h>

const long iter=1L<<32L, BLOCK_SIZE=4096L, nBlocks=iter/BLOCK_SIZE, nTrials = 10;

void RunMonteCarlo(const long firstBlock, const long lastBlock,
                   VSLStreamStatePtr *stream, long & dUnderCurveExt) {

  // Performs the Monte Carlo calculation for blocks in the range [firstBlock; lastBlock)
  // to count the number of random points inside of the quarter circle

  long j;
  long dUnderCurve = 0;
#pragma omp parallel
  {
    float r[BLOCK_SIZE*2] __attribute__((align(64)));
    const int myThread = omp_get_thread_num();
#pragma omp for schedule(dynamic) reduction(+: dUnderCurve)
    for (j = firstBlock; j < lastBlock; j++) {
      vsRngUniform( 0, stream[myThread], BLOCK_SIZE*2, r, 0.0f, 1.0f );
      for (long i = 0; i < BLOCK_SIZE; i++) { // i is declared here so that each thread has a private copy
        const float x = r[i];
        const float y = r[i+BLOCK_SIZE];
        // Count points inside quarter circle
        if (x*x + y*y < 1.0f) dUnderCurve++;
      }
    }
  }

  dUnderCurveExt += dUnderCurve;

}

int main(int argc, char *argv[]){

  int rank, nRanks, worker;
  long grainSize, msg[2]; // MPI message; msg[0] is blockStart, msg[1] is blockEnd
  MPI_Status stat;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nRanks); MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Count the number of processes running on the host and on coprocessors
  int nProcsOnMIC, nProcsOnHost, thisProcOnMIC=0, thisProcOnHost=0;
  if (rank != 0) {
#ifdef __MIC__
    thisProcOnMIC++; // This process is running on an Intel Xeon Phi coprocessor
#else
    thisProcOnHost++; // This process is running on an Intel Xeon processor
#endif
  }
  MPI_Allreduce(&thisProcOnMIC, &nProcsOnMIC, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  MPI_Allreduce(&thisProcOnHost, &nProcsOnHost, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  for (int t = 0; t < nTrials; t++) { // Multiple trials

    long dUnderCurve = 0, UnderCurveSum = 0;
    double hostSchedulingWait = 0.0, MICSchedulingWait = 0.0;
    const double start_time = MPI_Wtime();

    if (rank == 0) {
      // Boss assigns work
      const char* grainSizeSt = getenv("GRAIN_SIZE");
      if (grainSizeSt == NULL) { printf("GRAIN_SIZE undefined\n"); exit(1); }
      grainSize = atof(grainSizeSt);
      long currentBlock = 0;
      while (currentBlock < nBlocks) {
        // Chunk size is proportional to the number of unassigned blocks
        // divided by the number of workers...
        long chunkSize = ((nBlocks-currentBlock)/(nRanks-1))/2;
        // ...but never smaller than GRAIN_SIZE
        if (chunkSize < grainSize) chunkSize = grainSize;
        msg[0] = currentBlock; // First block for next worker
        msg[1] = currentBlock + chunkSize; // Last block
        if (msg[1] > nBlocks) msg[1] = nBlocks; // Stay in range

        // Wait for next worker
        MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                 &stat);

        // Assign work to next worker
        MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);

        currentBlock = msg[1]; // Update position
      }

      // Terminate workers
      msg[0] = -1; msg[1] = -2;
      for (int i = 1; i < nRanks; i++) {
        MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                 &stat);
        MPI_Send(&msg, 2, MPI_LONG, worker, worker, MPI_COMM_WORLD);
      }

    } else {
      // Worker performs the Monte Carlo calculation

      // Create and initialize a random number generator from MKL
      VSLStreamStatePtr stream[omp_get_max_threads()];
#pragma omp parallel
      {
        // Each thread gets its own random seed
        const int randomSeed = nTrials*omp_get_thread_num()*nRanks + nTrials*rank + t;
        vslNewStream(&stream[omp_get_thread_num()], VSL_BRNG_MT19937, randomSeed);
      }

      msg[0] = 0;
      while (msg[0] >= 0) {
        // Receive from boss the range of blocks to process
        double waitTime = MPI_Wtime();
        MPI_Send(&rank, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
        MPI_Recv(&msg, 2, MPI_LONG, 0, rank, MPI_COMM_WORLD, &stat);
        waitTime = MPI_Wtime() - waitTime;
#ifdef __MIC__
        MICSchedulingWait += waitTime;
#else
        hostSchedulingWait += waitTime;
#endif
        const long myFirstBlock = msg[0];
        const long myLastBlock = msg[1];

        RunMonteCarlo(myFirstBlock, myLastBlock, stream, dUnderCurve);

      }

#pragma omp parallel
      {
        vslDeleteStream( &stream[omp_get_thread_num()] );
      }
    }

    // Reduction to collect the results of the Monte Carlo calculation
    MPI_Reduce(&dUnderCurve, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    double unbalancedTime = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);
    unbalancedTime = (rank == 0 ? 0.0 : MPI_Wtime() - unbalancedTime);

    // Timing collection
    double hostUnbalancedTime = 0.0, MICUnbalancedTime = 0.0;
#ifdef __MIC__
    MICUnbalancedTime += unbalancedTime;
#else
    hostUnbalancedTime += unbalancedTime;
#endif
    double averageHostUnbalancedTime = 0.0, averageMICUnbalancedTime = 0.0;
    double averageHostSchedulingWait = 0.0, averageMICSchedulingWait = 0.0;
    MPI_Reduce(&hostUnbalancedTime, &averageHostUnbalancedTime, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    MPI_Reduce(&MICUnbalancedTime, &averageMICUnbalancedTime, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    MPI_Reduce(&hostSchedulingWait, &averageHostSchedulingWait, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    MPI_Reduce(&MICSchedulingWait, &averageMICSchedulingWait, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    if (nProcsOnHost > 0) {
      averageHostUnbalancedTime /= nProcsOnHost;
      averageHostSchedulingWait /= nProcsOnHost;
    }
    if (nProcsOnMIC > 0) {
      averageMICUnbalancedTime /= nProcsOnMIC;
      averageMICSchedulingWait /= nProcsOnMIC;
    }

    // Compute pi
    if (rank==0) {
      const double pi = (double) UnderCurveSum / (double) iter * 4.0 ;
      const double end_time = MPI_Wtime();
      const double pi_exact=3.141592653589793;
      if (t==0) printf("#%9s %8s %7s %9s %14s %14s %14s %14s\n",
        "pi", "Rel.err", "Time, s", "GrainSize", "Host unbal., s", "MIC unbal., s",
        "Host sched, s.", "MIC sched, s.");
      printf ("%10.8f %8.1e %7.3f %9ld %14.3f %14.3f %14.3f %14.3f\n",
        pi, (pi-pi_exact)/pi_exact, end_time-start_time, grainSize,
        averageHostUnbalancedTime, averageMICUnbalancedTime,
        averageHostSchedulingWait, averageMICSchedulingWait);
      fflush(0);
    }
  }
  MPI_Finalize();
}
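
The boss above implements a guided schedule: each assignment hands out half of the remaining unassigned blocks divided by the number of workers, but never less than GRAIN_SIZE, so chunks start large and decay geometrically (by a factor of about 1 - 1/(2*(nRanks-1)) per assignment) until they bottom out at the grain size. The following standalone sketch (not part of the lab code; the worker count and grain size are assumed for illustration) traces this policy:

#include <stdio.h>
int main() {
  const long nBlocks = 1048576; // as in the listing: 2^32 iterations / 4096 per block
  const int nWorkers = 272;     // assumed example: 32 host + 240 coprocessor worker ranks
  const long grainSize = 100;   // assumed example value of GRAIN_SIZE
  long currentBlock = 0, firstChunk = 0;
  int assignments = 0;
  while (currentBlock < nBlocks) {
    long chunkSize = ((nBlocks - currentBlock)/nWorkers)/2; // the boss's formula
    if (chunkSize < grainSize) chunkSize = grainSize;       // ...but never below the grain size
    if (assignments == 0) firstChunk = chunkSize;
    currentBlock += chunkSize;
    if (currentBlock > nBlocks) currentBlock = nBlocks;
    assignments++;
  }
  printf("first chunk: %ld blocks, %d assignments in total\n", firstChunk, assignments);
  return 0;
}

With these numbers the first chunk is (1048576/272)/2 = 1927 blocks, and the chunk sizes shrink with every assignment; the large early chunks keep the scheduling overhead low, while the fine-grained tail lets fast and slow workers finish at nearly the same time.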

Back to Lab A.4.15


