
Parallel Programming and Optimization

with Intel Xeon Phi Coprocessors


Colfax Developer Boot Camp

Vadim Karpusenko, PhD and Andrey Vladimirov, PhD


Colfax International

July 2014, Rev. 12


MIC Developer Boot Camp Rev. 12 Welcome © Colfax International, 2013–2014
About This Document

This document represents the materials of a one-day training “Parallel
Programming Boot Camp” developed and run by Colfax International.

© Colfax International, 2013-2014

http://www.colfax-intl.com/nd/xeonphi/training.aspx

MIC Developer Boot Camp Rev. 12 About This Document © Colfax International, 2013–2014
Disclaimer

While best efforts have been used in preparing this training, Colfax International makes no
representations or warranties of any kind and assumes no liabilities of any kind with respect to
the accuracy or completeness of the contents and specifically disclaims any implied warranties
of merchantability or fitness of use for a particular purpose. The publisher shall not be held
liable or responsible to any person or entity with respect to any loss or incidental or
consequential damages caused, or alleged to have been caused, directly or indirectly, by the
information or programs contained herein. No warranty may be created or extended by sales
representatives or written sales materials.

MIC Developer Boot Camp Rev. 12 Disclaimer © Colfax International, 2013–2014


Sign In
Please sign in during any coffee break to receive an invitation to a
survey. Completing the survey earns you a free electronic copy of our
book “Parallel Programming and Optimization with Intel Xeon Phi
Coprocessors”.

MIC Developer Boot Camp Rev. 12 Disclaimer © Colfax International, 2013–2014


Supplementary Materials

MIC Developer Boot Camp Rev. 12 Supplementary Materials © Colfax International, 2013–2014
Supplementary Materials: Textbook
ISBN: 978-0-9885234-1-8 (520 pages)

Parallel Programming
and Optimization with
Intel® Xeon Phi™
Coprocessors
Handbook on the Development and
Optimization of Parallel Applications
for Intel® Xeon® Processors
and Intel® Xeon Phi™ Coprocessors

© Colfax International, 2013


http://www.colfax-intl.com/nd/xeonphi/book.aspx

MIC Developer Boot Camp Rev. 12 Supplementary Materials © Colfax International, 2013–2014
Research and Consulting

http://research.colfaxinternational.com/
http://nlreg.colfax-intl.com/
MIC Developer Boot Camp Rev. 12 Supplementary Materials © Colfax International, 2013–2014
Additional Reading
Learn more about this book: lotsofcores.com

“It all comes down to PARALLEL PROGRAMMING!” (applicable to both processors and Intel® Xeon Phi™ coprocessors)

Contents: Foreword, Preface. Chapters: 1. Introduction; 2. High Performance Closed Track Test Drive!; 3. A Friendly Country Road Race; 4. Driving Around Town: Optimizing A Real-World Code Example; 5. Lots of Data (Vectors); 6. Lots of Tasks (not Threads); 7. Offload; 8. Coprocessor Architecture; 9. Coprocessor System Software; 10. Linux on the Coprocessor; 11. Math Library; 12. MPI; 13. Profiling and Timing; 14. Summary. Glossary, Index. Available since February 2013.

“This book belongs on the bookshelf of every HPC professional. Not only does it successfully and accessibly teach us how to use and obtain high performance on the Intel MIC architecture, it is about much more than that. It takes us back to the universal fundamentals of high-performance computing including how to think and reason about the performance of algorithms mapped to modern architectures, and it puts into your hands powerful tools that will be useful for years to come.”
—Robert J. Harrison, Institute for Advanced Computational Science, Stony Brook University

Intel® Xeon Phi™ Coprocessor High Performance Programming,
Jim Jeffers, James Reinders, © 2013, publisher: Morgan Kaufmann

© 2013, James Reinders & Jim Jeffers, book image used with permission

MIC Developer Boot Camp Rev. 12 Supplementary Materials © Colfax International, 2013–2014
List of Topics

MIC Developer Boot Camp Rev. 12 List of Topics © Colfax International, 2013–2014
List of Topics

1. Introduction
   - Intel Xeon Phi Architecture from the Programmer’s Perspective
   - Software Tools for Intel Xeon Phi Coprocessors
   - Will Application X benefit from the MIC architecture?

2. Programming Models for Intel Xeon Phi Applications
   - Native Applications for Coprocessors and MPI
   - Offload Programming Models
   - Using Multiple Coprocessors
   - MPI Applications and Heterogeneous Clustering

MIC Developer Boot Camp Rev. 12 List of Topics © Colfax International, 2013–2014
List of Topics

3. Porting Applications to the MIC Architecture
   - Future-Proofing: Reliance on Compiler and Libraries
   - Choosing the Programming Model
   - Cross-Compilation of User Applications
   - Performance Expectations

4. Parallel Scalability on Intel Architectures
   - Vectorization (Single Instruction Multiple Data, SIMD, Parallelism)
   - Multi-threading: OpenMP, Intel Cilk Plus
   - Task Parallelism in Distributed Memory, MPI

MIC Developer Boot Camp Rev. 12 List of Topics © Colfax International, 2013–2014
List of Topics

5. Optimization for the Intel Xeon Product Family
   - Optimization Checklist
   - Finding Bottlenecks with Intel VTune Amplifier
   - MPI Diagnostics Using Intel Trace Analyzer and Collector
   - Intel Math Kernel Library (MKL)
   - Scalar Optimization Considerations
   - Automatic Vectorization and Data Structures
   - Optimization of Thread Parallelism

MIC Developer Boot Camp Rev. 12 List of Topics © Colfax International, 2013–2014
List of Topics

6. Advanced Optimization for the MIC Architecture
   - Memory Access and Cache Utilization
   - Data Persistence and PCIe Traffic
   - MPI Applications on Clusters with Coprocessors

7. Conclusion
   - Course Recap
   - Additional Resources: Reading, Guides, Support

MIC Developer Boot Camp Rev. 12 List of Topics © Colfax International, 2013–2014
§1. Introduction to the Intel Many
Integrated Core (MIC) Architecture

MIC Developer Boot Camp Rev. 12 Introduction to the Intel Many Integrated Core (MIC) Architecture © Colfax International, 2013–2014
MIC Architecture from the Programmer’s Perspective

MIC Developer Boot Camp Rev. 12 MIC Architecture from the Programmer’s Perspective © Colfax International, 2013–2014
Intel Xeon Phi Coprocessors and the MIC Architecture

- PCIe end-point device
- High power efficiency: ∼1 TFLOP/s in double precision (DP)
- Heterogeneous clustering
- For highly parallel applications which reach the scaling limits on Intel Xeon processors

MIC Developer Boot Camp Rev. 12 MIC Architecture from the Programmer’s Perspective © Colfax International, 2013–2014
Xeon Family Product Performance
Many-core coprocessors (Xeon Phi) vs. multi-core processors (Xeon): better performance per system and performance per watt for parallel applications.
Same programming methods, same development tools.
Source: “Intel Xeon Product Family: Performance Brief”

MIC Developer Boot Camp Rev. 12 MIC Architecture from the Programmer’s Perspective © Colfax International, 2013–2014
Intel Xeon Processors and the MIC Architecture

Multi-core Intel Xeon processor   | Many-core Intel Xeon Phi coprocessor
C/C++/Fortran; OpenMP/MPI         | C/C++/Fortran; OpenMP/MPI
Standard Linux OS                 | Special Linux µOS distribution
Up to 768 GB of DDR3 RAM          | 6–16 GB cached GDDR5 RAM
≤12 cores/socket at ≈3 GHz        | 57 to 61 cores at ≈1 GHz
2-way hyper-threading             | 4-way hyper-threading
256-bit AVX vectors               | 512-bit IMCI vectors
MIC Developer Boot Camp Rev. 12 MIC Architecture from the Programmer’s Perspective © Colfax International, 2013–2014
Examples of Solutions with the Intel MIC Architecture

Colfax’s CXP7450 workstation with two Intel Xeon Phi coprocessors; Colfax’s CXP9000 server with eight Intel Xeon Phi coprocessors.

MIC Developer Boot Camp Rev. 12 MIC Architecture from the Programmer’s Perspective © Colfax International, 2013–2014
Paper: research.colfaxinternational.com/post/2013/01/07/Nbody-Xeon-Phi.aspx
Demo: http://www.youtube.com/watch?v=KxaSEcmkGTo
MIC Developer Boot Camp Rev. 12 MIC Architecture from the Programmer’s Perspective © Colfax International, 2013–2014
Microarchitecture
[Block diagram: cores, each with a private L2 cache and a distributed tag directory (TD/DTD), attached to the Core Ring Interconnect (CRI), which carries separate data, address and coherence rings. GBOX memory controllers connect the ring to the GDDR5 memory channels; the SBOX contains the PCIe v2.0 controller and DMA engines.]
Source: “Intel Xeon Phi Coprocessor - the Architecture“


MIC Developer Boot Camp Rev. 12 MIC Architecture from the Programmer’s Perspective © Colfax International, 2013–2014
Core Topology

MIC Developer Boot Camp Rev. 12 MIC Architecture from the Programmer’s Perspective © Colfax International, 2013–2014
Cache Structure
The caches are 8-way associative and fully coherent, with an LRU (Least Recently Used) replacement policy.

Cache line size: 64 B
L1 size: 32 KB data, 32 KB code
L1 latency: 1 cycle
L2 size: 512 KB
L2 ways: 8
L2 latency: 11 cycles
Memory → L2 prefetching: hardware and software
L2 → L1 prefetching: software only
TLB (Translation Lookaside Buffer) coverage options (L1, data): 64 pages of 4 KB (256 KB coverage) or 8 pages of 2 MB (16 MB coverage)

MIC Developer Boot Camp Rev. 12 MIC Architecture from the Programmer’s Perspective © Colfax International, 2013–2014
Features of the IMCI Instruction Set
Intel IMCI is the instruction set supported by Intel Xeon Phi coprocessors.
512-bit wide registers
- can pack up to eight 64-bit elements (long int, double)
- or up to sixteen 32-bit elements (int, float) (a quick check follows this list)
Arithmetic instructions
- Addition, subtraction and multiplication
- Fused Multiply-Add instruction (FMA)
- Division and reciprocal calculation
- Error function, inverse error function
- Exponential functions (natural, base 2 and base 10) and the power function
- Logarithms (natural, base 2 and base 10)
- Square root, inverse square root, hypotenuse value and cube root
- Trigonometric functions (sin, cos, tan, sinh, cosh, tanh, asin, acos, ...)
- Rounding functions
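As a quick arithmetic check (added here for illustration, not part of the original slides), the packing above follows directly from the 512-bit register width:

#include <stdio.h>

int main() {
  const int register_bits = 512;   /* width of an IMCI vector register */
  /* 8 bits per byte; sizeof gives bytes per element */
  printf("double lanes per register: %d\n", register_bits / (8 * (int)sizeof(double))); /* 8  */
  printf("float  lanes per register: %d\n", register_bits / (8 * (int)sizeof(float)));  /* 16 */
  return 0;
}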
MIC Developer Boot Camp Rev. 12 MIC Architecture from the Programmer’s Perspective © Colfax International, 2013–2014
Features of the IMCI Instruction Set

Initialization, Load and Store, Gather and Scatter


Comparison
Conversion and type cast
Bitwise instructions: NOT, AND, OR, XOR, XAND
Reduction and minimum/maximum instructions
Vector mask instructions
Scalar instructions
Swizzle and permute

MIC Developer Boot Camp Rev. 12 MIC Architecture from the Programmer’s Perspective © Colfax International, 2013–2014
Interactions between Operating Systems

MIC Developer Boot Camp Rev. 12 Interactions between Operating Systems © Colfax International, 2013–2014
[Diagram: Linux host and Intel® Xeon Phi™ coprocessor connected over the PCIe bus; a virtual terminal session is available via SSH.
Host side: a host-side offload application (user code; offload libraries, user-level driver, user-accessible APIs and libraries) running on top of the Intel® Xeon Phi™ coprocessor support libraries, tools, and drivers in the Linux OS.
Coprocessor side: a target-side “native” application (user code; standard OS libraries plus any 3rd-party or Intel libraries) and a target-side offload application (user code; offload libraries, user-accessible APIs and libraries) running on top of the Intel® Xeon Phi™ coprocessor communication and application-launching support in the Linux µOS.]

MIC Developer Boot Camp Rev. 12 Interactions between Operating Systems © Colfax International, 2013–2014
Linux µOS on Intel Xeon Phi coprocessors (part of MPSS)
user@host% lspci | grep -i "co-processor"
06:00.0 Co-processor: Intel Corporation Device 2250 (rev 11)
82:00.0 Co-processor: Intel Corporation Device 2250 (rev 11)
user@host% sudo service mpss status
mpss is running
user@host% cat /etc/hosts | grep mic
172.31.1.1 host-mic0 mic0
172.31.2.1 host-mic1 mic1
user@host% ssh mic0
user@mic0% cat /proc/cpuinfo | grep proc | tail -n 3
processor : 237
processor : 238
processor : 239
user@mic0% ls /
amplxe dev home lib64 oldroot proc sbin sys usr
bin etc lib linuxrc opt root sep3.10 tmp var

MIC Developer Boot Camp Rev. 12 Interactions between Operating Systems © Colfax International, 2013–2014
Software Tools for Intel Xeon Phi Coprocessors

MIC Developer Boot Camp Rev. 12 Software Tools for Intel Xeon Phi Coprocessors © Colfax International, 2013–2014
Execute MIC Applications (all free):

Drivers: Intel MIC Platform Software Stack (Intel MPSS) — mandatory — detect, boot and manage coprocessors
Libraries: redistributable libraries — optional — run and distribute pre-built applications
OpenCL: Intel OpenCL SDK — optional
[Figure: monitoring MIC activity with micsmc (an MPSS tool)]

MIC Developer Boot Camp Rev. 12 Software Tools for Intel Xeon Phi Coprocessors © Colfax International, 2013–2014
MPSS Tools and Utilities
micinfo — a system information query tool
micsmc — a utility for monitoring and modifying physical parameters: temperature, power modes, core utilization, etc.
micctrl — a comprehensive configuration tool for the Intel Xeon Phi coprocessor operating system
miccheck — a set of diagnostic tests for the verification of the Intel Xeon Phi coprocessor configuration
micrasd — a host daemon logger of hardware errors reported by Intel Xeon Phi coprocessors
micflash — an Intel Xeon Phi flash memory agent
MIC Developer Boot Camp Rev. 12 Software Tools for Intel Xeon Phi Coprocessors © Colfax International, 2013–2014
Build Xeon Phi & Xeon CPU Applications (all licensed):

Compilers: Intel C Compiler, Intel C++ Compiler, and Intel Fortran Compiler — mandatory
Optimization tools: Intel VTune Amplifier XE and Intel Trace Analyzer and Collector (ITAC) — highly recommended
Mathematics support: Intel Math Kernel Library (MKL) — highly recommended
Development: Intel Inspector XE, Intel Advisor XE — optional
All-in-one bundles are available, common for CPU and MIC.
MIC Developer Boot Camp Rev. 12 Software Tools for Intel Xeon Phi Coprocessors © Colfax International, 2013–2014
Will Application X Benefit from the MIC architecture?

MIC Developer Boot Camp Rev. 12 Will Application X Benefit from the MIC architecture? © Colfax International, 2013–2014
Three Layers of Parallelism
[Figure, built up over three slides: the three layers of parallelism. (1) SIMD — a vector unit in each processing unit (PU) applies a single instruction from the instruction pool to multiple elements of the data pool. (2) Multi-threading — several processing units share the data pool within a node. (3) Distributed memory — MPI connects the host CPUs and the Xeon Phi coprocessors of one or more compute nodes (Compute Node 1, ...). A combined code sketch follows.]

MIC Developer Boot Camp Rev. 12 Will Application X Benefit from the MIC architecture? © Colfax International, 2013–2014
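To make the three layers concrete, here is a minimal sketch (added for illustration, not part of the original slides) that combines them in one C program: MPI for distributed memory, OpenMP for threads, and a unit-stride loop the compiler can auto-vectorize.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);                      /* layer 3: MPI across processes */
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  static float a[N], b[N];
  float sum = 0.0f;

#pragma omp parallel for reduction(+: sum)     /* layer 2: OpenMP threads */
  for (int i = 0; i < N; i++) {                /* layer 1: SIMD (auto-vectorized) */
    a[i] = (float)i;
    b[i] = 2.0f * i;
    sum += a[i] * b[i];
  }

  printf("Rank %d computed partial sum %e\n", rank, sum);
  MPI_Finalize();
  return 0;
}

Compiled with mpiicpc and -openmp (and, for native coprocessor execution, -mmic), each MPI rank runs its threads on a CPU or a coprocessor, so the same source exercises all three layers.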
Compute-Bound Application Performance
[Chart: performance of a compute-bound application on Intel Xeon (CPU) and Intel Xeon Phi, on a logarithmic scale from 1 to 10k (“More Performance” →), for four cases ordered by increasing parallelism (“More Parallel” ↓): scalar & single-threaded, vector & single-threaded, scalar & multi-threaded, vector & multi-threaded.]

MIC Developer Boot Camp Rev. 12 Will Application X Benefit from the MIC architecture? © Colfax International, 2013–2014
One Size Does Not Fit All

An application must reach scalability limits on the CPU in order to benefit from the MIC architecture.

Use Xeon Phi if the application:
- scales up to 100 threads
- is compute-bound and vectorized, or bandwidth-bound

Use Xeon if the application:
- is serial or scales to ≲10 threads
- is unvectorized or latency-bound

MIC Developer Boot Camp Rev. 12 Will Application X Benefit from the MIC architecture? © Colfax International, 2013–2014
Xeon + Xeon Phi Coprocessors = Xeon Family

Programming models allow a range of CPU+MIC coupling modes, spanning the breadth from multi-core centric (Xeon) to many-core centric (MIC):
- Multi-core hosted: general serial and parallel computing
- Offload: code with highly parallel phases
- Symmetric: codes with balanced needs
- Many-core hosted: highly parallel codes

MIC Developer Boot Camp Rev. 12 Will Application X Benefit from the MIC architecture? © Colfax International, 2013–2014
§2. Programming Models for Intel Xeon
Phi Applications

MIC Developer Boot Camp Rev. 12 Programming Models for Intel Xeon Phi Applications © Colfax International, 2013–2014
Native Execution
“Hello World” application:
1 #include <stdio.h>
2 #include <unistd.h>
3 int main(){
4 printf("Hello world! I have %ld logical cores.\n",
5 sysconf(_SC_NPROCESSORS_ONLN ));
6 }

Compile and run on host:


user@host% icc hello.c
user@host% ./a.out
Hello world! I have 32 logical cores.
user@host%

MIC Developer Boot Camp Rev. 12 Programming Models for Intel Xeon Phi Applications © Colfax International, 2013–2014
Native Execution
Compile and run the same code on the coprocessor in the native mode:
user@host% icc hello.c -mmic
user@host% scp a.out mic0:~/
a.out 100% 10KB 10.4KB/s 00:00
user@host% ssh mic0
user@mic0% pwd
/home/user
user@mic0% ls
a.out
user@mic0% ./a.out
Hello world! I have 240 logical cores.
user@mic0%

Use -mmic to produce executable for MIC architecture


Must transfer executable to coprocessor (or NFS-share) and run from shell
Native MPI applications work the same way (need Intel MPI library)
MIC Developer Boot Camp Rev. 12 Programming Models for Intel Xeon Phi Applications © Colfax International, 2013–2014
Native Applications for Coprocessors with MPI
“Hello World” in MPI:
1 #include "mpi.h"
2 #include <stdio.h>
3 #include <string.h>
4 int main (int argc, char *argv[]) {
5 int i, rank, size, namelen;
6 char name[MPI_MAX_PROCESSOR_NAME];
7 MPI_Init (&argc, &argv);
8 MPI_Comm_size (MPI_COMM_WORLD, &size);
9 MPI_Comm_rank (MPI_COMM_WORLD, &rank);
10 MPI_Get_processor_name (name, &namelen);
11 printf ("Hello World from rank %d running on %s!\n", rank, name);
12 if (rank == 0) printf("MPI World size = %d processes\n", size);
13 MPI_Finalize ();
14 }

MIC Developer Boot Camp Rev. 12 Programming Models for Intel Xeon Phi Applications © Colfax International, 2013–2014
Running MPI Applications on Host

user@host% source /opt/intel/impi/4.1.0/intel64/bin/mpivars.sh


user@host% export I_MPI_FABRICS=shm:tcp
user@host% mpiicpc -o HelloMPI.XEON HelloMPI.c
user@host% mpirun -host localhost -np 2 ./HelloMPI.XEON
Hello World from rank 1 running on host!
Hello World from rank 0 running on host!
MPI World size = 2 processes

Set up MPI environment variables


Use wrapper script mpiicpc to compile
Use automated tool mpirun to launch

MIC Developer Boot Camp Rev. 12 Programming Models for Intel Xeon Phi Applications © Colfax International, 2013–2014
Running Native MPI Applications on Coprocessors
user@host% source /opt/intel/impi/4.1.0/intel64/bin/mpivars.sh
user@host% export I_MPI_MIC=1
user@host% export I_MPI_FABRICS=shm:tcp
user@host% mpiicpc -mmic -o HelloMPI.MIC HelloMPI.c
user@host% scp HelloMPI.MIC mic0:~/
user@host% mpirun -host mic0 -np 2 ~/HelloMPI.MIC
Hello World from rank 1 running on host-mic0!
Hello World from rank 0 running on host-mic0!
MPI World size = 2 processes

Enable the MIC architecture in Intel MPI: I_MPI_MIC=1


Copy or NFS-share MPI library & executables with coprocessor
Use mpiicpc with -mmic to compile
Launch as if mic0 is a remote host
MIC Developer Boot Camp Rev. 12 Programming Models for Intel Xeon Phi Applications © Colfax International, 2013–2014
Paper: research.colfaxinternational.com/post/2013/10/17/Heterogeneous-Clustering.aspx
Demo: http://youtu.be/GffmChTcWf8
MIC Developer Boot Camp Rev. 12 Programming Models for Intel Xeon Phi Applications © Colfax International, 2013–2014
Offload Programming Models

MIC Developer Boot Camp Rev. 12 Offload Programming Models © Colfax International, 2013–2014
Explicit Offload: Pragma-based approach

“Hello World” in the explicit offload model:


1 #include <stdio.h>
2 int main(int argc, char * argv[] ) {
3 printf("Hello World from host!\n");
4 #pragma offload target(mic)
5 {
6 printf("Hello World from coprocessor!\n"); fflush(0);
7 }
8 printf("Bye\n");
9 }
The application runs on the host, but some parts of the code and data are moved (“offloaded”) to the coprocessor.

MIC Developer Boot Camp Rev. 12 Offload Programming Models © Colfax International, 2013–2014
Compiling and Running an Offload Application
user@host% icpc hello_offload.cpp -o hello_offload
user@host% ./hello_offload
Hello World from host!
Bye
Hello World from coprocessor!

No additional arguments if compiled with an Intel compiler


Run application on host as a regular application
Code inside of #pragma offload is offloaded automatically
Console output on Intel Xeon Phi coprocessor is buffered and
mirrored to the host console
If coprocessor is not installed, code inside #pragma offload runs
on the host system
MIC Developer Boot Camp Rev. 12 Offload Programming Models © Colfax International, 2013–2014
Offloading Functions
1 __attribute__((target(mic))) void MyFunction() {
2 // ... implement function as usual
3 }
4

5 int main(int argc, char * argv[] ) {


6 #pragma offload target(mic)
7 {
8 MyFunction();
9 }
10 }

Functions used on the coprocessor must be marked with the specifier __attribute__((target(mic)))
The compiler produces a host version and a coprocessor version of such functions (to enable fall-back to the host)
MIC Developer Boot Camp Rev. 12 Offload Programming Models © Colfax International, 2013–2014
Offloading Multiple Functions

1 #pragma offload_attribute(push, target(mic))


2 void MyFunctionOne() {
3 // ... implement function as usual
4 }
5

6 void MyFunctionTwo() {
7 // ... implement function as usual
8 }
9 #pragma offload_attribute(pop)

To mark a long block of code with the offload attribute, use #pragma
offload_attribute(push/pop)

MIC Developer Boot Camp Rev. 12 Offload Programming Models © Colfax International, 2013–2014
Offloading Data: Local Scalars and Arrays
1 void MyFunction() {
2 const int N = 1000;
3 int data[N];
4 #pragma offload target(mic)
5 {
6 for (int i = 0; i < N; i++)
7 data[i] = 0;
8 }

Scope-local scalars and known-size arrays offloaded automatically


Data is copied from host to coprocessor at the start of offload
Data is copied back from coprocessor to host at the end of offload
Bitwise-copyable data only (arrays of basic types and scalars)
C++ classes, etc. should use virtual-shared memory model
MIC Developer Boot Camp Rev. 12 Offload Programming Models © Colfax International, 2013–2014
Offloading Data: Global and Static Variables
1 int* __attribute__((target(mic))) data;
2

3 void MyFunction() {
4 static int __attribute__((target(mic))) N;
5 // ...
6 }
7

8 int main() {
9 // ...
10 }

Global and static variables must be marked with the offload attribute
#pragma offload_attribute(push/pop) may be used as well

MIC Developer Boot Camp Rev. 12 Offload Programming Models © Colfax International, 2013–2014
Data Marshalling for Dynamically Allocated Data
1 double *p1=(double*)malloc(sizeof(double)*N);
2 double *p2=(double*)malloc(sizeof(double)*N);
3

4 #pragma offload target(mic) in(p1 : length(N)) out(p2 : length(N))


5 {
6 // ... perform operations on p1[] and p2[]
7 }

#pragma offload recognizes clauses in, out, inout and nocopy


Data size (length), alignment, redirection, and other properties
may be specified
Marshalling is required for pointer-based data

MIC Developer Boot Camp Rev. 12 Offload Programming Models © Colfax International, 2013–2014
Memory retention and data persistence on coprocessor
1 #pragma offload target(mic) in(p : length(N) alloc_if(1) free_if(0) )
2 { /* allocate memory for array p on coprocessor, do not deallocate */ }
3

4 #pragma offload target(mic) in(p : length(0) alloc_if(0) free_if(0) )


5 { /* re-use previously allocated memory on coprocessor */ }
6

7 #pragma offload target(mic) out(p : length(N) alloc_if(0) free_if(1) )


8 { /* re-use memory and deallocate at the end of offload */ }

By default, memory on the coprocessor is allocated before and deallocated after each offload
The specifiers alloc_if and free_if make it possible to skip allocation and deallocation
They can be combined with length(0) to avoid data transfer
Why bother: data transfer across the PCIe bus is relatively slow (≈6 GB/s), and memory allocation on the coprocessor is even slower (≈0.5 GB/s)

MIC Developer Boot Camp Rev. 12 Offload Programming Models © Colfax International, 2013–2014
Precautions with persistent data
Use explicit zero-based coprocessor number
(e.g., mic:0 as shown below)
With multiple coprocessors, if target number is unspecified, any
coprocessor can be used, which will result in runtime errors if
persistent data cannot be found.
1 #pragma offload target(mic:0) in(p : length(N) alloc_if(1) free_if(0) )
2 { /* allocate memory for array p on coprocessor, do not deallocate */ }

Do not change the value of the host pointer to a persistent array: the runtime system finds the data on the coprocessor using the host pointer value, not the variable name.
MIC Developer Boot Camp Rev. 12 Offload Programming Models © Colfax International, 2013–2014
Virtual-shared Memory Model
1 _Cilk_shared int arr[N]; // This is a virtual-shared array
2

3 _Cilk_shared void Compute() { // This function may be offloaded


4 // ... function uses array arr[]
5 }
6

7 int main() {
8 // arr[] can be initialized on the host
9 _Cilk_offload Compute(); // and used on coprocessor
10 // and the values are returned to the host
11 }

Alternative to Explicit Offload


Data synced from host to coprocessor before the start of offload
Data synced from coprocessor to host at the end of offload
MIC Developer Boot Camp Rev. 12 Offload Programming Models © Colfax International, 2013–2014
Virtual-shared Memory Model

1 int* _Cilk_shared data; // Pointer to a virtual-shared array


2

3 int main() {
4 // Working with pointer-based data is illustrated below:
5 data = (_Cilk_shared int*)_Offload_shared_malloc(N*sizeof(float));
6 _Offload_shared_free(data);
7 }

Addresses of virtual-shared pointers identical on host and


coprocessors
Synchronized before and after offload, with page granularity

MIC Developer Boot Camp Rev. 12 Offload Programming Models © Colfax International, 2013–2014
Target-Specific Code

During compilation for the MIC architecture, the preprocessor macro __MIC__ is defined.
This allows fine-tuning application performance or output where necessary.

1 void __attribute__((target(mic))) MyFunction() {


2 #ifdef __MIC__
3 printf("I am running on a coprocessor.\n");
4 const int tuningParameter = 16;
5 #else
6 printf("I am running on the host.\n");
7 const int tuningParameter = 8;
8 #endif
9 // ... Proceed, using the variable tuningParameter
10 }

MIC Developer Boot Camp Rev. 12 Offload Programming Models © Colfax International, 2013–2014
Using Multiple Coprocessors

MIC Developer Boot Camp Rev. 12 Using Multiple Coprocessors © Colfax International, 2013–2014
Multiple Coprocessors with Explicit Offload

Querying the number of coprocessors:


1 const int numDevices = _Offload_number_of_devices();
2 printf("Number of available coprocessors: %d\n" , numDevices);

Specifying offload target:


1 #pragma offload target(mic: 0)
2 { /* ... */ }

MIC Developer Boot Camp Rev. 12 Using Multiple Coprocessors © Colfax International, 2013–2014
Multiple Blocking Offloads Using Host Threads
(Explicit Offload)
1 const int nDevices = _Offload_number_of_devices();
2 #pragma omp parallel for
3 for (int i = 0; i < nDevices; i++) {
4 #pragma offload target(mic: i)
5 {
6 MyFunction(/*...*/ );
7 }
8 }

Up to 8 coprocessors, up to 32 host threads


All offloads start simultaneously and block the respective thread

MIC Developer Boot Camp Rev. 12 Using Multiple Coprocessors © Colfax International, 2013–2014
Blocking Explicit Offloads Using Threads: Dynamic Work
Distribution Across Coprocessors
1 const int nDevices = _Offload_number_of_devices();
2 omp_set_num_threads(nDevices);
3 #pragma omp parallel for schedule(dynamic, 1)
4 for (int i = 0; i < nWorkItems; i++) {
5 const int iDevice = omp_get_thread_num();
6 #pragma offload target(mic: iDevice)
7 {
8 MyFunction(i);
9 }
10 }

Up to 8 coprocessors, up to 32 host threads


nWorkItems are dynamically scheduled on nDevices
MIC Developer Boot Camp Rev. 12 Using Multiple Coprocessors © Colfax International, 2013–2014
Asynchronous Offload
By default, #pragma offload blocks until offload completes
Use clause “signal” with any pointer to avoid blocking
Use #pragma offload_wait to block where needed
1 char* offload0;
2 #pragma offload target(mic:0) signal(offload0) in(data : length(N))
3 { /* ... will not block code execution because of clause "signal" */ }
4

5 DoSomethingElse();
6

7 /* Now block until offload signalled by pointer "offload0" completes */


8 #pragma offload_wait target(mic:0) wait(offload0)

Use the target number to avoid hanging


MIC Developer Boot Camp Rev. 12 Using Multiple Coprocessors © Colfax International, 2013–2014
Offload diagnostics
user@host% export OFFLOAD_REPORT=2
user@host% ./offload-application
Transferring some data to and from coprocessor...
Done. Bye!
[Offload] [MIC 0] [File] offload-application.cpp
[Offload] [MIC 0] [Line] 6
[Offload] [MIC 0] [CPU Time] 0.505982 (seconds)
[Offload] [MIC 0] [CPU->MIC Data] 1024 (bytes)
[Offload] [MIC 0] [MIC Time] 0.000409 (seconds)
[Offload] [MIC 0] [MIC->CPU Data] 1024 (bytes)
user@host%

Set environment variable OFFLOAD_REPORT to 1 or 2 for automatic


collection and output of offload information.
Unset or set OFFLOAD_REPORT=0 to disable offload diagnostics
MIC Developer Boot Camp Rev. 12 Using Multiple Coprocessors © Colfax International, 2013–2014
Environment variable forwarding with offload
By default, all environment variables on the host are copied to the coprocessor when an offload starts.
In order to have different values for an environment variable on the host and the coprocessor, set MIC_ENV_PREFIX.
The prefix is dropped when variables are copied to the coprocessor.
user@host% # This enables the environment variable prefix for the coprocessor:
user@host% export MIC_ENV_PREFIX=XEONPHI
user@host%
user@host% # This sets the value of OMP_NUM_THREADS on the host:
user@host% export OMP_NUM_THREADS=32
user@host%
user@host% # This sets the value of OMP_NUM_THREADS on the coprocessor:
user@host% export XEONPHI_OMP_NUM_THREADS=236

MIC Developer Boot Camp Rev. 12 Using Multiple Coprocessors © Colfax International, 2013–2014
Multiple Asynchronous Explicit Offloads From a Single
Thread
1 const int nDevices = _Offload_number_of_devices();
2 char sig[nDevices];
3 for (int i = 0; i < nDevices; i++) {
4 #pragma offload target(mic: i) signal(&sig[i])
5 {
6 MyFunction(/*...*/ );
7 }
8 }
9 for (int i = 0; i < nDevices; i++) {
10 #pragma offload_wait target(mic: i) wait(&sig[i])
11 }

Any pointer acts as a signal


Must wait for all signals
MIC Developer Boot Camp Rev. 12 Using Multiple Coprocessors © Colfax International, 2013–2014
MPI Applications and Heterogeneous Clustering

MIC Developer Boot Camp Rev. 12 MPI Applications and Heterogeneous Clustering © Colfax International, 2013–2014
Heterogeneous MPI Applications: Host + Coprocessors

user@host% mpirun -host mic0 -n 2 ~/Hello.MIC : -host mic1 -n 2 ~/Hello.MIC : \


% -host localhost -n 2 ~/Hello.XEON
Hello World from rank 5 running on localhost!
Hello World from rank 4 running on localhost!
Hello World from rank 2 running on mic1!
Hello World from rank 3 running on mic1!
Hello World from rank 1 running on mic0!
Hello World from rank 0 running on mic0!
MPI World size = 6 ranks

Specify Xeon executable for host processes


Specify Xeon Phi executable for coprocessor processes

MIC Developer Boot Camp Rev. 12 MPI Applications and Heterogeneous Clustering © Colfax International, 2013–2014
Heterogeneous Distributed Computing with Xeon Phi

Option 1: Hybrid MPI+OpenMP with Offload.


MPI processes are multi-threaded with OpenMP.
MPI processes run only on CPUs.
One or more OpenMP threads perform offload to coprocessor(s).
MIC Developer Boot Camp Rev. 12 MPI Applications and Heterogeneous Clustering © Colfax International, 2013–2014
Heterogeneous Distributed Computing with Xeon Phi

Option 2: Symmetric Pure MPI.


MPI processes are single-threaded.
Native MPI processes on the coprocessor.
E.g., 32 MPI processes on each CPU, 240 on each coprocessor.
MIC Developer Boot Camp Rev. 12 MPI Applications and Heterogeneous Clustering © Colfax International, 2013–2014
Heterogeneous Distributed Computing with Xeon Phi

Option 3: Symmetric Hybrid MPI+OpenMP.


MPI processes are multi-threaded with OpenMP.
Native MPI processes on the coprocessor.
E.g., one 32-thr MPI proc on each CPU, 240-thr on each coprocessor.
MIC Developer Boot Camp Rev. 12 MPI Applications and Heterogeneous Clustering © Colfax International, 2013–2014
File I/O in MPI Applications

MIC Developer Boot Camp Rev. 12 File I/O in MPI Applications © Colfax International, 2013–2014
RAM Filesystem

Files are stored in the coprocessor RAM
Does not survive an MPSS restart or a host reboot
Fastest method
Good for local pre-staged input or runtime scratch data
[Diagram: a native MPI process on the Xeon Phi µOS performs I/O to the coprocessor RAM filesystem; the host (OS, HDD, NIC, IB HCA) sits on the other side of the PCIe bus.]

MIC Developer Boot Camp Rev. 12 File I/O in MPI Applications © Colfax International, 2013–2014
Virtio Transfer to Local Host Drives

Files are stored on a physical or virtual drive on the host
Written data is persistent across host reboots
Fast method
A drive cannot be shared between coprocessors
Good for distributed checkpointing
[Diagram: a native MPI process on the Xeon Phi µOS mounts /mnt/dir via virtio; I/O crosses the PCIe bus to the host drive.]

MIC Developer Boot Camp Rev. 12 File I/O in MPI Applications © Colfax International, 2013–2014
Network Storage
Files are stored on a remote file server
A mount point can be shared across the cluster
Lustre has scalable performance
NFS is slow but easy to set up
[Diagram: a native MPI process on the Xeon Phi µOS mounts /mnt/dir over NFS or Lustre; I/O passes through the host’s NIC or IB HCA.]

MIC Developer Boot Camp Rev. 12 File I/O in MPI Applications © Colfax International, 2013–2014
Review: Programming Models

MIC Developer Boot Camp Rev. 12 Review: Programming Models © Colfax International, 2013–2014
Programming Models
1. Native coprocessor applications
   - Compile with -mmic
   - Run with micnativeloadex or scp+ssh
   - The way to go for MPI applications without offload
2. Explicit offload
   - Functions and global variables require __attribute__((target(mic)))
   - Initiate offload and data marshalling with #pragma offload
   - Only bitwise-copyable data can be shared
3. Clusters and multiple coprocessors
   - #pragma offload target(mic:i)
   - Use threads to offload to multiple coprocessors
   - Run native MPI applications

MIC Developer Boot Camp Rev. 12 Review: Programming Models © Colfax International, 2013–2014
§3. Porting Applications to the MIC
Architecture

MIC Developer Boot Camp Rev. 12 Porting Applications to the MIC Architecture © Colfax International, 2013–2014
Choosing the Programming Model

MIC Developer Boot Camp Rev. 12 Choosing the Programming Model © Colfax International, 2013–2014
To Offload or Not To Offload
For a “MIC-friendly” application:

Use offload if:
- the per-rank data set does not fit in the Xeon Phi onboard memory
- the CPU is needed for serial workload or intensive file I/O
- the MPI workload is bandwidth-bound or latency-bound
- some dependencies cannot be compiled for MIC

Use native/symmetric MPI if:
- parallel work-items are too small, so that data transfer overhead is significant
- peer-to-peer communication between workers is required
- it is difficult to instrument data movement or sharing with the coprocessor
MIC Developer Boot Camp Rev. 12 Choosing the Programming Model © Colfax International, 2013–2014
PCIe Bandwidth Considerations

When data is sent from the host to the coprocessor, the communication overhead must be considered:
- PCIe bandwidth ≈6 GB/s; theoretical peak arithmetic performance ≈1 TFLOP/s; practical memory bandwidth 150–170 GB/s
- Offload pays off if the MIC performs ≫1000 operations per transferred word
- Algorithms with strong complexity scaling (e.g., O(n²)) are likely less impacted by communication than those with weak scaling (e.g., O(n), O(n log n)); see the sketch below
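As a back-of-the-envelope check (added for illustration, not from the original slides; the numbers are the estimates quoted above), the break-even arithmetic intensity follows from the ratio of compute rate to PCIe bandwidth:

#include <stdio.h>

int main() {
  const double flops_per_sec = 1.0e12;       /* ~1 TFLOP/s peak in double precision */
  const double pcie_bytes_per_sec = 6.0e9;   /* ~6 GB/s across the PCIe bus */
  const double bytes_per_word = 8.0;         /* one double-precision word */

  /* Operations the coprocessor can perform in the time it takes to transfer one word: */
  const double breakeven = flops_per_sec / (pcie_bytes_per_sec / bytes_per_word);
  printf("Break-even: ~%.0f operations per transferred word\n", breakeven);  /* ~1300 */
  return 0;
}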

MIC Developer Boot Camp Rev. 12 Choosing the Programming Model © Colfax International, 2013–2014
Cross-Compilation of User Applications

MIC Developer Boot Camp Rev. 12 Cross-Compilation of User Applications © Colfax International, 2013–2014
Simple Applications, Native Execution

Simple CPU applications can be compiled for native execution on Xeon


Phi coprocessors by supplying the flag “-mmic” to the Intel compiler:
user@host% icpc -c myobject1.cc -mmic
user@host% icpc -c myobject2.cc -mmic
user@host% icpc -o myapplication myobject1.o myobject2.o -mmic

MIC Developer Boot Camp Rev. 12 Cross-Compilation of User Applications © Colfax International, 2013–2014
Native Applications with Autotools

Use the Intel compiler with the flag -mmic
Eliminate assembly and unnecessary dependencies
Use --host=x86_64 to avoid “program does not run” errors
Example: the GNU Multiple Precision Arithmetic Library (GMP):
user@host% wget https://ftp.gnu.org/gnu/gmp/gmp-5.1.3.tar.bz2
user@host% tar -xf gmp-5.1.3.tar.bz2
user@host% cd gmp-5.1.3
user@host% ./configure CC=icc CFLAGS="-mmic" --disable-assembly --host=x86_64
...
user@host% make
...

MIC Developer Boot Camp Rev. 12 Cross-Compilation of User Applications © Colfax International, 2013–2014
Static Libraries with Offload

In offload applications, additional object files are produced:


user@host% # Program in myobject.cc contains #pragma offload
user@host% icpc -c myobject.cc
user@host% ls
myobject.cc myobjectMIC.o myobject.o

In order to compile the *MIC.o files into a static library with offload, use
xiar -qoffload-build instead of ar.

See white paper for more details:


http://research.colfaxinternational.com/post/2013/05/03/Fast-Library-Xeon-Phi.aspx

MIC Developer Boot Camp Rev. 12 Cross-Compilation of User Applications © Colfax International, 2013–2014
Performance Expectations

MIC Developer Boot Camp Rev. 12 Performance Expectations © Colfax International, 2013–2014
Performance on MIC is a Function of Optimization Level
[Chart: performance relative to an unoptimized baseline (compiled with GCC, running on the host CPUs, 59 ms per spectrum) versus optimization step 0–8, shown for GCC on CPUs, Intel C++ on CPUs, and Intel C++ on Xeon Phi; the vertical scale spans roughly 10⁻¹ to 10³. The labeled optimization steps include: unoptimized with offload; algorithm optimization (pruning, recurrence); thread parallelism (fit all threads in memory); improved memory access (interpolation method, packed operations); scalar optimizations (precomputation, precision control); offload traffic (packed data, data persistence on coprocessor); vectorization (alignment, padding, hints, loop tiling); heterogeneous execution using the host plus two coprocessors.]
MIC Developer Boot Camp Rev. 12 Performance Expectations © Colfax International, 2013–2014
Performance on MIC is a Function of Optimization Level
Performance will be disappointing if the code is not optimized for multi-core CPUs
Optimized code runs better both on the MIC platform and on the multi-core CPU
Single code for two platforms + ease of porting = incremental optimization
Case study: http://research.colfaxinternational.com/post/2013/11/25/sc13-talk.aspx

MIC Developer Boot Camp Rev. 12 Performance Expectations © Colfax International, 2013–2014
Caution on Comparative Benchmarks
In most of our benchmarks, “Xeon Phi” = 5110P SKU (60 cores, TDP 225 W, ≈$2.7k) and “CPU” = dual Xeon E5-2680 (16 cores, TDP 260 W, ≈$3.4k + system cost)
Why a dual CPU vs. a single coprocessor? Approximately the same Thermal Design Power (TDP) and cost.
Case study: http://research.colfaxinternational.com/post/2013/11/25/sc13-talk.aspx

MIC Developer Boot Camp Rev. 12 Performance Expectations © Colfax International, 2013–2014
Future-Proofing: Reliance on Compiler and Libraries

MIC Developer Boot Camp Rev. 12 Future-Proofing: Reliance on Compiler and Libraries © Colfax International, 2013–2014
Future-Proofing: Reliance on Compiler and Libraries
[Diagram: threading and vector options arranged from ease of use (top) to fine control (bottom), with increasing depth of control.
Threading options: Intel® Math Kernel Library; Intel® Threading Building Blocks; Intel® Cilk™ Plus; OpenMP*; Pthreads*.
Vector options: Intel® Math Kernel Library API*; Array Notation (Intel® Cilk™ Plus); auto vectorization; semi-auto vectorization (#pragma vector, #pragma simd); OpenCL*; C/C++ vector classes (F32vec16, F64vec8).]
MIC Developer Boot Camp Rev. 12 Future-Proofing: Reliance on Compiler and Libraries © Colfax International, 2013–2014
Next Generation MIC: Knights Landing (KNL)
2nd generation MIC product: code
name Knights Landing (KNL)
Intel’s 14 nm manufacturing process
A processor (running the OS) or a
coprocessor (PCIe device)
On-package high-bandwidth
memory w/flexible memory models:
flat, cache, & hybrid
Intel Advanced Vector Extensions AVX-512 (public)
Source: Intel Newsroom

MIC Developer Boot Camp Rev. 12 Future-Proofing: Reliance on Compiler and Libraries © Colfax International, 2013–2014
Getting Ready for the Future

Porting HPC applications to today’s


MIC architecture makes them ready for
future architectures, such as KNL
Xeon, KNC and KNL are not binary
compatible, therefore assembly-level
tuning will not scale forward.
Reliance on compiler optimization and
using optimized libraries (such as Intel
MKL) ensures future-readiness.
Source: Intel Newsroom

MIC Developer Boot Camp Rev. 12 Future-Proofing: Reliance on Compiler and Libraries © Colfax International, 2013–2014
Intel® Xeon Phi™ Product Family Roadmap
The Faster Path to Discovery

[Roadmap: the Intel® Xeon Phi™ x100 product family (Knights Corner) is available today — 22 nm process, coprocessor form factor, over 1 TF DP peak, up to 61 cores, up to 16 GB GDDR5. The 2nd-generation Knights Landing (Intel® Xeon Phi™ x200 product family) is expected 2H’15* — 14 nm process, server processor and coprocessor, over 3 TF DP peak¹, 60+ cores. Knights Landing with integrated fabric: TBA. A 3rd generation is in planning.
* First commercial systems.
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
¹ Over 3 teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating-point operations per cycle. FLOPS = cores × clock frequency × floating-point operations per cycle.]
MIC Developer Boot Camp Rev. 12 Future-Proofing: Reliance on Compiler and Libraries © Colfax International, 2013–2014
Knights Landing: Next-Generation Intel® Xeon Phi™
Architectural Enhancements = ManyX Performance
[Slide: binary-compatible with Intel® Xeon® processors. Based on the Intel® Atom™ core (Silvermont microarchitecture) with enhancements for HPC: 4 threads/core, deep out-of-order buffers, gather/scatter, better branch prediction, higher cache bandwidth, and many more. 60+ cores, 3+ teraflops¹, 3× single-thread performance². 14 nm process technology; 2-D core mesh; cache coherency; NUMA. High-performance on-package memory: up to 16 GB at launch, over 5× STREAM vs. DDR4³; DDR4 capacity comparable to Intel® Xeon® processors. Integrated fabric support.
*Other logos, brands and names are the property of their respective owners.
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
¹ Over 3 teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating-point operations per cycle. FLOPS = cores × clock frequency × floating-point operations per cycle.
² Projected peak theoretical single-thread performance relative to the 1st-generation Intel® Xeon Phi™ coprocessor 7120P (formerly codenamed Knights Corner).
³ Projected result based on internal Intel analysis of the STREAM benchmark using a Knights Landing processor with 16 GB of ultra high-bandwidth memory versus DDR4.
Diagram is for conceptual purposes only and only illustrates a CPU, memory, integrated fabric and DDR memory; it is not to scale and does not include all functional areas of the CPU, nor does it represent actual component layout.]
MIC Developer Boot Camp Rev. 12 Future-Proofing: Reliance on Compiler and Libraries © Colfax International, 2013–2014
Today’s Parallel Investment Carries Forward
[Slide: the most significant code modernizations carry forward; additional tuning gains come from exploiting new features and structures (memory, architecture, bandwidth, etc.). Parallelization, threading, vectorization, cache-blocking, MPI+OpenMP hybridization and more, done for the Intel® Xeon Phi™ x100 product family, carry over to Knights Landing with a recompile. Knights Landing-enabled performance libraries and runtimes (MKL, MPI, TBB, OpenMP, Cilk™ Plus, OpenCL) and Knights Landing-enabled compilers (native Intel® AVX-512; high-bandwidth memory in cache mode) support native, symmetric, and offload execution.
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.]
MIC Developer Boot Camp Rev. 12 Future-Proofing: Reliance on Compiler and Libraries © Colfax International, 2013–2014
A Paradigm Shift for Highly-Parallel
Server Processor with Leadership Integration are Keys to Future
[Slide: compared to a coprocessor, the Knights Landing server processor with integrated fabric offers:
- Memory bandwidth: over 5× STREAM vs. DDR4¹
- Memory capacity: comparable to Intel® Xeon® processors²
- Resiliency: Intel server-class reliability
- Power efficiency: >25% better than a discrete card³
- I/O: highest bandwidth⁴
- Cost: less costly than discrete parts⁵
- Flexibility: extensive server configurations
- Density: 3+ KNL with fabric in 1U⁶
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. All projections are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
¹ Projected result based on internal Intel analysis of the STREAM benchmark using a Knights Landing processor with 16 GB of ultra high-bandwidth memory versus DDR4 memory only with all channels populated.
² Compared to the 1st-generation Intel® Xeon Phi™ 7120P coprocessor (formerly codenamed Knights Corner).
³ Projected result based on internal Intel analysis using estimated performance and power consumption of a rack-sized deployment of Intel® Xeon® processors and Knights Landing coprocessors as compared to a rack with KNL processors only.
⁴ Projected result based on internal Intel analysis comparing a discrete Knights Landing processor with integrated fabric to a discrete Intel fabric component card.
⁵ Projected result based on internal Intel analysis using estimated component pricing in the 2015 timeframe.
⁶ Theoretical density for an air-cooled system; other cooling solutions and configurations may enable either lower or higher densities.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.]
MIC Developer Boot Camp Rev. 12 Future-Proofing: Reliance on Compiler and Libraries © Colfax International, 2013–2014
§4. Parallel Scalability on Intel
Architectures

MIC Developer Boot Camp Rev. 12 Parallel Scalability on Intel Architectures © Colfax International, 2013–2014
Vectorization (Single Instruction Multiple Data, SIMD,
Parallelism)

MIC Developer Boot Camp Rev. 12 Vectorization (Single Instruction Multiple Data, SIMD, Parallelism) © Colfax International, 2013–2014
SIMD Operations
SIMD — Single Instruction Multiple Data
Scalar loop:
  for (i = 0; i < n; i++)
    A[i] = A[i] + B[i];

SIMD loop:
  for (i = 0; i < n; i += 4)
    A[i:(i+4)] = A[i:(i+4)] + B[i:(i+4)];

Each SIMD addition operator acts on 4 numbers at a time.
[Figure: a vector unit applies a single instruction from the instruction pool to multiple elements of the data pool (SIMD).]

MIC Developer Boot Camp Rev. 12 Vectorization (Single Instruction Multiple Data, SIMD, Parallelism) © Colfax International, 2013–2014
Instruction Sets in Intel Architectures

Instruction Set | Year and Intel Processor | Vector registers | Packed Data Types
MMX             | 1997, Pentium            | 64-bit           | 8-, 16- and 32-bit integers
SSE             | 1999, Pentium III        | 128-bit          | 32-bit single precision FP
SSE2            | 2001, Pentium 4          | 128-bit          | 8- to 64-bit integers; SP & DP FP
SSE3–SSE4.2     | 2004–2009                | 128-bit          | (additional instructions)
AVX             | 2011, Sandy Bridge       | 256-bit          | single and double precision FP
AVX2            | 2013, Haswell            | 256-bit          | integers, additional instructions
IMCI            | 2012, Knights Corner     | 512-bit          | 32- and 64-bit integers; single & double precision FP
AVX-512         | (future) Knights Landing | 512-bit          | 32- and 64-bit integers; single & double precision FP

MIC Developer Boot Camp Rev. 12 Vectorization (Single Instruction Multiple Data, SIMD, Parallelism) © Colfax International, 2013–2014
Explicit Vectorization: Compiler Intrinsics
SSE2 intrinsics:
  for (int i = 0; i < n; i += 4) {
    __m128 Avec = _mm_load_ps(A + i);
    __m128 Bvec = _mm_load_ps(B + i);
    Avec = _mm_add_ps(Avec, Bvec);
    _mm_store_ps(A + i, Avec);
  }

IMCI intrinsics:
  for (int i = 0; i < n; i += 16) {
    __m512 Avec = _mm512_load_ps(A + i);
    __m512 Bvec = _mm512_load_ps(B + i);
    Avec = _mm512_add_ps(Avec, Bvec);
    _mm512_store_ps(A + i, Avec);
  }

Assumptions:
- The arrays float A[n] and float B[n] are aligned on a 16-byte (SSE2) or 64-byte (IMCI) boundary
- n is a multiple of 4 for SSE2 and a multiple of 16 for IMCI
- Variables Avec and Bvec are 128 = 4 × sizeof(float) bits in size for SSE2 and 512 = 16 × sizeof(float) bits for the Intel Xeon Phi architecture
MIC Developer Boot Camp Rev. 12 Vectorization (Single Instruction Multiple Data, SIMD, Parallelism) © Colfax International, 2013–2014
Automatic Vectorization of Loops
Code (autovec.c):
 1  #include <stdio.h>
 2
 3  int main(){
 4    const int n=8;
 5    int i;
 6    int A[n] __attribute__((aligned(64)));
 7    int B[n] __attribute__((aligned(64)));
 8
 9    // Initialization
10    for (i=0; i<n; i++)
11      A[i]=B[i]=i;
12
13    // This loop will be auto-vectorized
14    for (i=0; i<n; i++)
15      A[i]+=B[i];
16
17    // Output
18    for (i=0; i<n; i++)
19      printf("%2d %2d %2d\n", i, A[i], B[i]);
20  }

Compilation and runtime output on the host:
user@host% icpc autovec.c -vec-report3
autovec.c(10): (col. 3) remark: loop was not vectorized: vectorization possible but seems inefficient.
autovec.c(14): (col. 3) remark: LOOP WAS VECTORIZED.
autovec.c(18): (col. 3) remark: loop was not vectorized: existence of vector dependence.
user@host% ./a.out
 0  0  0
 1  2  1
 2  4  2
 3  6  3
 4  8  4
 5 10  5
 6 12  6
 7 14  7

MIC Developer Boot Camp Rev. 12 Vectorization (Single Instruction Multiple Data, SIMD, Parallelism) © Colfax International, 2013–2014
Automatic Vectorization of Loops on MIC architecture
Compilation and runtime output of the code for Intel Xeon Phi execution

user@host% icpc autovec.c -vec-report3 -mmic


autovec.c(10): (col. 3) remark: LOOP WAS VECTORIZED.
autovec.c(14): (col. 3) remark: LOOP WAS VECTORIZED.
autovec.c(18): (col. 3) remark: loop was not vectorized:
existence of vector dependence.
user@host% micnativeloadex a.out
0 0 0
1 2 1
2 4 2
3 6 3
4 8 4
5 10 5
6 12 6
7 14 7

MIC Developer Boot Camp Rev. 12 Vectorization (Single Instruction Multiple Data, SIMD, Parallelism) © Colfax International, 2013–2014
Automatic Vectorization of Loops
Limitations:
Only for-loops can be auto-vectorized; the number of iterations must be known at compilation time or at runtime before the loop starts
Memory access in the loop must have a regular pattern, ideally with unit stride
Non-standard loops that cannot be automatically vectorized:
- loops with an irregular memory access pattern
- calculations with a vector dependence (see the sketch below)
- while-loops, and for-loops with an undetermined number of iterations
- outer loops (unless #pragma simd overrides this restriction)
- loops with complex branches (i.e., if-conditions)
- anything else that cannot be, or is very difficult to, vectorize
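A minimal sketch (added for illustration, not from the original slides) of the vector-dependence case versus an independent loop:

// Hypothetical example: the first loop carries a true dependence across
// iterations (a[i] needs a[i-1] computed in the previous iteration) and is
// not auto-vectorizable; the second loop has independent, unit-stride iterations.
void dependence_example(float *a, const float *b, int n) {
  for (int i = 1; i < n; i++)
    a[i] = a[i-1] + b[i];     // vector dependence: not auto-vectorized

  for (int i = 0; i < n; i++)
    a[i] = 2.0f * b[i];       // independent iterations: auto-vectorized
}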

MIC Developer Boot Camp Rev. 12 Vectorization (Single Instruction Multiple Data, SIMD, Parallelism) © Colfax International, 2013–2014
Multi-Threading: OpenMP

MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
Parallelism in Shared Memory: OpenMP and Intel Cilk Plus

Intel Cilk Plus
- Good performance “out of the box”
- Little freedom for fine-tuning
- The programmer should focus on exposing the parallelism
- Low-level optimization (thread creation, work distribution and data sharing) is performed by the Cilk Plus library
- Novel framework
(A minimal Cilk Plus example follows this list.)

OpenMP
- Easy to use for simple algorithms
- For complex parallelism, may require more tuning to perform well
- Allows more control over synchronization, work scheduling and distribution
- Well-established framework
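A minimal Cilk Plus sketch (added for illustration, not from the original slides): cilk_for lets the runtime distribute loop iterations across threads, and array notation expresses vector parallelism. It requires the Intel compiler (icpc) with Cilk Plus support.

#include <cilk/cilk.h>
#include <stdio.h>

#define N 1000

int main() {
  float a[N], b[N];
  a[0:N] = 1.0f;                     // array notation: whole-array assignment
  b[0:N] = 2.0f;

  cilk_for (int i = 0; i < N; i++)   // the Cilk Plus runtime distributes the work
    a[i] += b[i];

  printf("a[0] = %.1f\n", a[0]);     // prints 3.0
  return 0;
}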

MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
Program Structure in OpenMP
1 main() { // Begin serial execution.
2 ... // Only the initial thread executes
3 #pragma omp parallel // Begin a parallel construct and form
4 { // a team.
5 #pragma omp sections // Begin a work-sharing construct.
6 {
7 #pragma omp section // One unit of work.
8 {...}
9 #pragma omp section // Another unit of work.
10 {...}
11 } // Wait until both units of work complete.
12 ... // This code is executed by each team member.
13 #pragma omp for // Begin a work-sharing Construct
14 for(...)
15 { // Each iteration chunk is unit of work.
16 ... // Work is distributed among the team members.
17 } // End of work-sharing construct.

MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
Program Structure in OpenMP
18 #pragma omp critical // Begin a critical section.
19 {...} // Only one thread executes at a time.
20 #pragma omp task // Execute in another thread without blocking
21 {...}
22 ... // This code is executed by each team member.
23 #pragma omp barrier // Wait for all team members to arrive.
24 ... // This code is executed by each team member.
25 } // End of Parallel Construct
26 // Disband team and continue serial execution.
27 ... // Possibly more parallel constructs.
28 } // End serial execution.

1 Code outside #pragma omp parallel is serial, i.e., executed by only one thread
2 Code directly inside #pragma omp parallel is executed by each thread
3 Code inside work-sharing construct #pragma omp for is distributed across the
threads in the team
MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
“Hello World” OpenMP Programs

1 #include <omp.h>
2 #include <stdio.h>
3

4 int main(){
5 const int nt=omp_get_max_threads();
6 printf("OpenMP with %d threads\n", nt);
7

8 #pragma omp parallel


9 printf("Hello World from thread %d\n", omp_get_thread_num());
10 }

MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
“Hello World” OpenMP Programs
user@host% export OMP_NUM_THREADS=5
user@host% icpc -openmp hello_omp.cc
user@host% ./a.out
OpenMP with 5 threads
Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
Hello World from thread 2
Hello World from thread 4
user@host% icpc -openmp-stubs hello_omp.cc
hello_omp.cc(8): warning #161: unrecognized #pragma
#pragma omp parallel
^
user@host% ./a.out
OpenMP with 1 threads
Hello World from thread 0

MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
Loop-Centric Parallelism: For-Loops in OpenMP

Simultaneously launch multiple threads
The scheduler assigns loop iterations to threads
Each thread processes one iteration at a time
[Figure: the program flow forks into multiple threads, each picking up loop iterations — parallelizing a for-loop.]

MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
Loop-Centric Parallelism: For-Loops in OpenMP

The OpenMP library will distribute the iterations of the loop following the
#pragma omp parallel for across threads.

1 #pragma omp parallel for


2 for (int i=0; i<n; i++) {
3 printf("Iteration %d is processed by thread %d\n",
4 i, omp_get_thread_num());
5 // ... iterations will be distributed across available threads...
6 }

MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
Loop-Centric Parallelism: For-Loops in OpenMP

1 #pragma omp parallel


2 {
3 // Code placed here will be executed by all threads.
4 // Stack variables declared here will be private to each thread.
5 int private_number=0;
6 #pragma omp for schedule(dynamic, 4)
7 for (int i=0; i<n; i++) {
8 // ... iterations will be distributed across available threads...
9 }
10 // ... code placed here will be executed by all threads
11 }

MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
Fork-Join Model of Parallel Execution

Each thread can spawn daughter threads
Available threads pick up queued tasks
Expresses algorithms that cannot be expressed in the loop model (e.g., parallel recursion)

[Figure: fork-join model of parallel execution, showing elemental functions, forks and joins.]

(see the #pragma omp task functionality, sketched below)
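A minimal sketch of task-based parallel recursion (not from the original slides; the function name fib and the argument 30 are illustrative):

#include <omp.h>
#include <stdio.h>

// Recursive Fibonacci: each call may spawn an independent task (fork)
long fib(const int n) {
  if (n < 2) return n;
  long a, b;
#pragma omp task shared(a)   // fork: another thread may pick up this branch
  a = fib(n - 1);
  b = fib(n - 2);            // the current thread continues with this branch
#pragma omp taskwait         // join: wait for the spawned task to complete
  return a + b;
}

int main() {
  long result;
#pragma omp parallel         // form a team of threads
#pragma omp single           // one thread starts the recursion; tasks spread the work
  result = fib(30);
  printf("fib(30)=%ld\n", result);
}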
MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
Synchronization: Avoiding Unpredictable Program Behavior
1 #include <omp.h>
2 #include <stdio.h>
3 int main() {
4 const int n = 1000;
5 int total = 0;
6 #pragma omp parallel for
7 for (int i = 0; i < n; i++) {
8 // Race condition
9 total = total + i;
10 }
11 printf("total=%d (must be %d)\n", total, ((n-1)*n)/2);
12 }

user@host% icpc -o omp-race omp-race.cc -openmp


user@host% ./omp-race
total=208112 (must be 499500)

MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
Synchronization: Avoiding Unpredictable Program Behavior
1 #include <omp.h>
2 #include <stdio.h>
3 int main() {
4 const int n = 1000;
5 int total = 0;
6 #pragma omp parallel for
7 for (int i = 0; i < n; i++) {
8 #pragma omp critical
9 { // Only one thread at a time can execute this section
10 total = total + i;
11 }
12 }
13 printf("total=%d (must be %d)\n", total, ((n-1)*n)/2);
14 }

user@host% icpc -o omp-critical omp-critical.cc -openmp


user@host% ./omp-critical
total=499500 (must be 499500)

MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
Synchronization: Avoiding Unpredictable Program Behavior

This parallel fragment of code has predictable behavior, because the


race condition was eliminated with an atomic operation:

1 #pragma omp parallel for


2 for (int i = 0; i < n; i++) {
3 // Lightweight synchronization
4 #pragma omp atomic
5 sum += i;
6 }

MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
Synchronization: Avoiding Unpredictable Program Behavior
Read : operations in the form v = x
Write : operations in the form x = v
Update : operations in the form x++, x--, --x, ++x, x binop= expr
and x = x binop expr
Capture : operations in the form v = x++, v = x--, v = --x, v = ++x,
v = x binop expr

Here x and v are scalar variables


binop is one of +, *, -, /, &, ^, |, <<, >>.
No “trickery” is allowed for atomic operations:
Ï no operator overload,

Ï no non-scalar types,

Ï no complex expressions.
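A minimal sketch of the update and capture forms listed above (not from the original slides; the variable names are illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
  int x = 0;
#pragma omp parallel
  {
#pragma omp atomic update   // update form: x binop= expr
    x += 2;

    int v;
#pragma omp atomic capture  // capture form: v = x++ (read the old value, then increment)
    v = x++;
    printf("Thread %d captured %d\n", omp_get_thread_num(), v);
  }
  printf("x=%d\n", x);
}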

MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
Reduction: Avoiding Synchronization
1 #include <omp.h>
2 #include <stdio.h>
3

4 int main() {
5 const int n = 1000;
6 int sum = 0;
7 #pragma omp parallel for reduction(+: sum)
8 for (int i = 0; i < n; i++) {
9 sum = sum + i;
10 }
11 printf("sum=%d (must be %d)\n", sum, ((n-1)*n)/2);
12 }

user@host% icpc -o omp-reduction omp-reduction.cc -openmp


user@host% ./omp-reduction
sum=499500 (must be 499500)

MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
Implementation of Reduction using Private Variables
1 #include <omp.h>
2 #include <stdio.h>
3

4 int main() {
5 const int n = 1000;
6 int sum = 0;
7 #pragma omp parallel
8 {
9 int sum_th = 0;
10 #pragma omp for
11 for (int i = 0; i < n; i++)
12 sum_th = sum_th + i;
13 #pragma omp atomic
14 sum += sum_th;
15 }
16 printf("sum=%d (must be %d)\n", sum, ((n-1)*n)/2);
17 }

MIC Developer Boot Camp Rev. 12 Multi-Threading: OpenMP © Colfax International, 2013–2014
Task Parallelism in Distributed Memory, MPI

MIC Developer Boot Camp Rev. 12 Task Parallelism in Distributed Memory, MPI © Colfax International, 2013–2014
Task Parallelism in Distributed Memory, MPI

The most commonly used


framework for distributed
memory HPC calculations is
the Message Passing
Interface (MPI).

Intel MPI library implements


MPI for the x86 and for the
MIC architectures.

MIC Developer Boot Camp Rev. 12 Task Parallelism in Distributed Memory, MPI © Colfax International, 2013–2014
Compiling and Running MPI applications
1. Compile and link with the MPI wrapper of the compiler:
Ï mpiicc for C,
Ï mpiicpc for C++,
Ï mpiifort for Fortran 77 and Fortran 95.
2. Set up MPI environment variables and I_MPI_MIC=1
3. NFS-share or copy the MPI library and the application executable to the coprocessors
4. Launch with the tool mpirun (example below)
Ï Colon-separated list of executables and hosts (argument -host hostname),
Ï Alternatively, use the machine file to list hosts
Ï Coprocessors have hostnames defined in /etc/hosts

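A hedged example of the workflow above, modeled on the trace-collection session later in this training; the file names myapp.cc, myapp, myapp.mic and the rank counts are assumptions:

user@host% export I_MPI_MIC=1
user@host% mpiicpc -o myapp myapp.cc             # build for the host
user@host% mpiicpc -mmic -o myapp.mic myapp.cc   # build for the coprocessor
user@host% scp myapp.mic mic0:~/
user@host% mpirun -n 16 -host localhost ./myapp : \
%          -n 240 -host mic0 ~/myapp.mic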
MIC Developer Boot Camp Rev. 12 Task Parallelism in Distributed Memory, MPI © Colfax International, 2013–2014
Peer-to-Peer Communication between Coprocessors
[Figure: two systems, each with a CPU, system memory, a PCIe chipset and an Intel Xeon Phi coprocessor (a MIC device with its own memory). Left: the virtualized coprocessor network interface mic0 is bridged on br0 with the Ethernet NIC. Right: a virtualized InfiniBand HCA on each coprocessor provides RDMA access to coprocessor memory.]

Left: Gigabit Ethernet bridging on the host allows placing coprocessors on the same subnet as the hosts (I_MPI_FABRICS=tcp)
Right: Coprocessor Communication Link (CCL) – virtualization of an InfiniBand device on each coprocessor (I_MPI_FABRICS=dapl)
MIC Developer Boot Camp Rev. 12 Task Parallelism in Distributed Memory, MPI © Colfax International, 2013–2014
Structure of MPI Applications
1 #include "mpi.h"
2 int main(int argc, char** argv) {
3 int ret = MPI_Init(&argc,&argv); // Set up MPI environment
4 if (ret != MPI_SUCCESS) {
5 MyErrorLogger("...");
6 MPI_Abort(MPI_COMM_WORLD, ret);
7 }
8 int worldSize, myRank, myNameLength;
9 char myName[MPI_MAX_PROCESSOR_NAME];
10 MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
11 MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
12 MPI_Get_processor_name(myName, &myNameLength);
13 // ... Perform work, exchange messages with MPI_Send, MPI_Recv, etc. ...
14 // Terminate MPI environment
15 MPI_Finalize();
16 }

MIC Developer Boot Camp Rev. 12 Task Parallelism in Distributed Memory, MPI © Colfax International, 2013–2014
Point to Point Communication
1 if (rank == receiver) {
2

3 char incomingMsg[messageLength];
4 MPI_Recv (&incomingMsg, messageLength, MPI_CHAR, sender,
5 tag, MPI_COMM_WORLD, &stat);
6 printf ("Received message with tag %d: ’%s’\n", tag, incomingMsg);
7

8 } else if (rank == sender) {


9

10 char outgoingMsg[messageLength];
11 strcpy(outgoingMsg, "/Jenny");
12 MPI_Send(&outgoingMsg, messageLength, MPI_CHAR, receiver, tag, MPI_COMM_WORLD);
13

14 }

MIC Developer Boot Camp Rev. 12 Task Parallelism in Distributed Memory, MPI © Colfax International, 2013–2014
Collective Communication: Broadcast
1 int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype,
2 int root, MPI_Comm comm );

[Figure: broadcast. The sender (root) delivers the same data to every receiver.]

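A minimal sketch of MPI_Bcast in use (not from the original slides; the buffer contents and the choice of rank 0 as root are illustrative):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int config[4] = {0, 0, 0, 0};
  if (rank == 0) {            // only the root fills the buffer
    config[0] = 42; config[1] = 7; config[2] = 1; config[3] = 0;
  }
  // After the call, every rank in MPI_COMM_WORLD holds the root's copy of config[]
  MPI_Bcast(config, 4, MPI_INT, 0, MPI_COMM_WORLD);
  printf("Rank %d: config[0]=%d\n", rank, config[0]);
  MPI_Finalize();
}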
MIC Developer Boot Camp Rev. 12 Task Parallelism in Distributed Memory, MPI © Colfax International, 2013–2014
Collective Communication: Scatter
1 int MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf,
2 int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm);

[Figure: scatter. The sender (root) splits its buffer into chunks and delivers one chunk to each receiver.]

MIC Developer Boot Camp Rev. 12 Task Parallelism in Distributed Memory, MPI © Colfax International, 2013–2014
Collective Communication: Gather
1 int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
2 void *recvbuf, int recvcnt, MPI_Datatype recvtype,
3 int root, MPI_Comm comm);

[Figure: gather. Each sender contributes its data, and the receiver (root) assembles the pieces into one buffer.]

MIC Developer Boot Camp Rev. 12 Task Parallelism in Distributed Memory, MPI © Colfax International, 2013–2014
Collective Communication: Reduction
1 int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
2 MPI_Op op, int root, MPI_Comm comm);

[Figure: reduction. The values 1, 3, 5 and 7 from the senders are combined into 16 on the receiver.]

Available reducers: max/min, minloc/maxloc, sum, product,


AND, OR, XOR (logical or bitwise).
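A minimal sketch of MPI_Reduce with the sum reducer, reproducing the 1+3+5+7=16 example in the figure (not from the original slides):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  const int contribution = 2*rank + 1;  // ranks contribute 1, 3, 5, 7, ...
  int total = 0;
  // Combine the contributions of all ranks into 'total' on the root (rank 0)
  MPI_Reduce(&contribution, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("Sum over %d ranks = %d\n", size, total);
  MPI_Finalize();
}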
MIC Developer Boot Camp Rev. 12 Task Parallelism in Distributed Memory, MPI © Colfax International, 2013–2014
Review: Parallel Scalability

MIC Developer Boot Camp Rev. 12 Review: Parallel Scalability © Colfax International, 2013–2014
Expressing Parallelism
1. Data parallelism (vectorization)
Ï Automatic vectorization by the compiler: portable and convenient
Ï For-loops and array notation can be vectorized
Ï Compiler hints (#pragma simd, #pragma ivdep, etc.) to assist the compiler
2. Shared-memory parallelism with OpenMP and Intel Cilk Plus
Ï Parallel threads access common memory for reading and writing
Ï Parallel loops: #pragma omp parallel for and _Cilk_for — automatic work distribution
Ï In OpenMP: private and shared variables; synchronization, reduction.
3. Distributed-memory parallelism with MPI
Ï MPI processes do not share memory, but can send information to each other
Ï All MPI processes execute the same code; role is determined by its rank
Ï Point-to-point and collective communication patterns
MIC Developer Boot Camp Rev. 12 Review: Parallel Scalability © Colfax International, 2013–2014
§5. Optimization for the Intel Xeon
Product Family

MIC Developer Boot Camp Rev. 12 Optimization for the Intel Xeon Product Family © Colfax International, 2013–2014
Optimization Roadmap

MIC Developer Boot Camp Rev. 12 Optimization Roadmap © Colfax International, 2013–2014
Performance Expectations

One Intel Xeon Phi coprocessor vs. two Intel Xeon Sandy Bridge CPUs

Up to 2x-3x for linear algebraic workloads
Up to 2x-4x for bandwidth-bound and transcendental arithmetic
Why compare 1 coprocessor against 2 processors?
Same thermal design power (TDP).
See also "Intel Xeon Product Family: Performance Brief"
MIC Developer Boot Camp Rev. 12 Optimization Roadmap © Colfax International, 2013–2014
Optimization Checklist

1. Scalar optimization
2. Vectorization
3. Scale above 100 threads
4. Arithmetically intensive or bandwidth-limited
5. Efficient cooperation between the host and the coprocessor(s)

MIC Developer Boot Camp Rev. 12 Optimization Roadmap © Colfax International, 2013–2014
Finding Bottlenecks with Intel VTune Amplifier

MIC Developer Boot Camp Rev. 12 Finding Bottlenecks with Intel VTune Amplifier © Colfax International, 2013–2014
Intel VTune Parallel Amplifier XE

Hardware event-based
profiler for parallel
applications on Xeon CPUs
and Xeon Phi coprocessors.

Bottleneck detection down


to a single line of code,
hardware event collection,
minimal impact on
performance.

MIC Developer Boot Camp Rev. 12 Finding Bottlenecks with Intel VTune Amplifier © Colfax International, 2013–2014
Using VTune
Setting up a VTune project:

Results of profiling, bottom-up view:

MIC Developer Boot Camp Rev. 12 Finding Bottlenecks with Intel VTune Amplifier © Colfax International, 2013–2014
Using VTune
Locating hotspots down to a single line of code:

MIC Developer Boot Camp Rev. 12 Finding Bottlenecks with Intel VTune Amplifier © Colfax International, 2013–2014
Using VTune

Analyzing custom events

MIC Developer Boot Camp Rev. 12 Finding Bottlenecks with Intel VTune Amplifier © Colfax International, 2013–2014
MPI Diagnostics Using Intel Trace Analyzer and Collector

MIC Developer Boot Camp Rev. 12 MPI Diagnostics Using Intel Trace Analyzer and Collector © Colfax International, 2013–2014
Intel Trace Analyzer and Collector

Profiler for MPI Applications


on Xeon and Xeon Phi
architectures.

Graphical user interface,


visualization of computation
and communication.

MIC Developer Boot Camp Rev. 12 MPI Diagnostics Using Intel Trace Analyzer and Collector © Colfax International, 2013–2014
Using Intel Trace Analyzer and Collector

user@host% source /opt/intel/itac/8.1.0.024/bin/itacvars.sh


user@host% source /opt/intel/itac/8.1.0.024/mic/bin/itacvars.sh
user@host% mpiicpc -mkl -o pi_mpi pi_mpi.c
user@host% mpiicpc -mmic -mkl -o pi_mpi.mic pi_mpi.c
user@host% scp pi_mpi.mic mic0:~/
pi_mpi.mic 100% 433KB 432.5KB/s 00:00
user@host% export VT_LOGFILE_FORMAT=stfsingle
user@host% mpirun -trace -n 32 -host localhost ./pi_mpi : \
% -n 240 -host mic0 ~/pi_mpi.mic
Time, s: 0.36
[0] Intel(R) Trace Collector INFO: Writing tracefile pi_mpi.single.stf
in /home/user/pi
user@host% traceanalyzer pi_mpi.single.stf

MIC Developer Boot Camp Rev. 12 MPI Diagnostics Using Intel Trace Analyzer and Collector © Colfax International, 2013–2014
Using Intel Trace Analyzer and Collector

MIC Developer Boot Camp Rev. 12 MPI Diagnostics Using Intel Trace Analyzer and Collector © Colfax International, 2013–2014
Intel Math Kernel Library (MKL)

MIC Developer Boot Camp Rev. 12 Intel Math Kernel Library (MKL) © Colfax International, 2013–2014
Intel Math Kernel Library (MKL)

Linear algebra, fast Fourier


transforms, vector math,
parallel random numbers,
statistics, data fitting, sparse
solvers.

Intel MKL functions are


optimized for Xeon
Processors as well as for
Xeon Phi coprocessors.

MIC Developer Boot Camp Rev. 12 Intel Math Kernel Library (MKL) © Colfax International, 2013–2014
Using Intel MKL
Three modes of usage:
Automatic Offload
Ï No code change required to offload calculations to a Xeon Phi coprocessor
Ï Automatically uses both the CPU and the coprocessor
Ï The library takes care of data transfer and execution management
Compiler-Assisted Offload
Ï Programmer maintains explicit control of data transfer and remote execution
Ï Requires using compiler offload pragmas and directives
Native Execution
Ï Uses an Intel Xeon Phi coprocessor as an independent compute node.
Ï Data initialized & processed on the coprocessor, or communicated via MPI

MIC Developer Boot Camp Rev. 12 Intel Math Kernel Library (MKL) © Colfax International, 2013–2014
Using MKL in Automatic Offload Mode

Calling an MKL function from host code:


1 sgemm(&transa, &transb, &SIZE, &SIZE, &SIZE, &alpha,
2 A, &newLda, B, &newLda, &beta, C, &SIZE);

Compiling and running the code. Calculation will be offloaded to a Xeon Phi coproces-
sor, if one is available at runtime.
user@host% icpc mycode.cc -mkl -o mycode
user@host% export MKL_MIC_ENABLE=1
user@host% ./mycode

MIC Developer Boot Camp Rev. 12 Intel Math Kernel Library (MKL) © Colfax International, 2013–2014
Using MKL in Compiler-Assisted Offload Mode
Calling an MKL function from offloaded section:
1 #pragma offload target(mic) \
2 in(transa, transb, N, alpha, beta) \
3 in(A:length(matrix_elements)) \
4 in(B:length(matrix_elements)) \
5 out(C:length(matrix_elements) alloc_if(0))
6 {
7 sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
8 }

Compiling and running the code. If no coprocessor at runtime, MKL will fall back to
CPU calculation.
user@host% icpc mycode.cc -mkl -o mycode
user@host% ./mycode

MIC Developer Boot Camp Rev. 12 Intel Math Kernel Library (MKL) © Colfax International, 2013–2014
Using MKL Native Execution Mode
RNG with the C Standard Library (rand.cc):

1  #include <stdlib.h>
2  #include <stdio.h>
3
4  int main() {
5    const size_t N = 1<<29L;
6    const size_t F = sizeof(float);
7    float* A = (float*)malloc(N*F);
8    srand(0); // Initialize RNG
9    for (int i = 0; i < N; i++) {
10     A[i]=(float)rand() /
11       (float)RAND_MAX;
12   }
13   printf("Generated %ld random \
14   numbers\nA[0]=%e\n", N, A[0]);
15   free(A);
16 }

RNG with Intel MKL VSL (rand-mkl.cc):

1  #include <stdlib.h>
2  #include <stdio.h>
3  #include <mkl_vsl.h>
4  int main() {
5    const size_t N = 1<<29L;
6    const size_t F = sizeof(float);
7    float* A = (float*)malloc(N*F);
8    VSLStreamStatePtr rnStream;
9    vslNewStream( &rnStream, //Init RNG
10     VSL_BRNG_MT19937, 1 );
11   vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD,
12     rnStream, N, A, 0.0f, 1.0f);
13   printf("Generated %ld random \
14   numbers\nA[0]=%e\n", N, A[0]);
15   free(A);
16 }

MIC Developer Boot Camp Rev. 12 Intel Math Kernel Library (MKL) © Colfax International, 2013–2014
Using MKL in Native Execution Mode
user@host% icpc -mmic -o rand \
% rand.cc
user@host% # Run on coprocessor
user@host% # and benchmark
user@host% time micnativeloadex \
% rand
Generated 536870912 random numbers
A[0]=8.401877e-01
real 0m56.591s
user 0m0.002s
sys 0m0.011s

user@host% icpc -mkl -mmic -o \
% rand-mkl rand-mkl.cc
user@host% export SINK_LD_LIBRARY_PATH=\
% /opt/intel/composerxe/mkl/lib/mic:\
% /opt/intel/composerxe/lib/mic
user@host% time micnativeloadex rand-mkl
Generated 536870912 random numbers
A[0]=1.343642e-01
real 0m7.951s
user 0m0.053s
sys 0m0.168s

On Intel Xeon Phi coprocessor, random number generation with


Intel MKL is 7x faster than with the C standard Library.
MIC Developer Boot Camp Rev. 12 Intel Math Kernel Library (MKL) © Colfax International, 2013–2014
Scalar Optimization Considerations

MIC Developer Boot Camp Rev. 12 Scalar Optimization Considerations © Colfax International, 2013–2014
Optimization Level
user@host% icc -o mycode -O3 source.c

1 #pragma intel optimization_level 3
2 void my_function() {
3   //...
4 }

The default optimization level -O2: optimization for speed, automatic vectorization,
inlining, constant propagation, dead-code elimination, loop unrolling.
Optimization level -O3 enables more aggressive optimization: loop fusion,
block-unroll-and-jam, if-statement collapse.
MIC Developer Boot Camp Rev. 12 Scalar Optimization Considerations © Colfax International, 2013–2014
Using the const Qualifier
noconst.cc:

1  #include <stdio.h>
2  int main() {
3    const int N=1<<28;
4    double w = 0.5;
5    double T = (double)N;
6    double s = 0.0;
7    for (int i = 0; i < N; i++)
8      s += w*(double)i/T;
9    printf("%e\n", s);
10 }

user@host% icpc noconst.cc
user@host% time ./a.out
6.710886e+07
real 0m0.461s
user 0m0.460s
sys 0m0.001s

const.cc (w and T declared const):

1  #include <stdio.h>
2  int main() {
3    const int N=1<<28;
4    const double w = 0.5;
5    const double T = (double)N;
6    double s = 0.0;
7    for (int i = 0; i < N; i++)
8      s += w*(double)i/T;
9    printf("%e\n", s);
10 }

user@host% icpc const.cc
user@host% time ./a.out
6.710886e+07
real 0m0.097s
user 0m0.094s
sys 0m0.003s
MIC Developer Boot Camp Rev. 12 Scalar Optimization Considerations © Colfax International, 2013–2014
Array Reference by Index instead of Pointer Arithmetics
Pointer arithmetic (array_pointer.cc):

1 for (int i = 0; i < N; i++)
2   for (int j = 0; j < N; j++) {
3     float* cp = c + i*N + j;
4     for (int k = 0; k < N; k++)
5       *cp += a[i*N+k]*b[k*N+j];
6   }

user@host% icc array_pointer.cc
user@host% time ./a.out
real 0m1.110s
user 0m1.104s
sys 0m0.005s

Array reference by index (array_index.cc):

1 for (int i = 0; i < N; i++)
2   for (int j = 0; j < N; j++) {
3
4     for (int k = 0; k < N; k++)
5       c[i*N+j] += a[i*N+k]*b[k*N+j];
6   }

user@host% icpc array_index.cc
user@host% time ./a.out
real 0m0.228s
user 0m0.225s
sys 0m0.002s

With pointer arithmetic, the code is 5x slower than with reference to array elements by index.
MIC Developer Boot Camp Rev. 12 Scalar Optimization Considerations © Colfax International, 2013–2014
Common Subexpression Elimination
Before:

1 for (int i = 0; i < n; i++)
2 {
3   for (int j = 0; j < m; j++) {
4     const double r =
5       sin(A[i])*cos(B[j]);
6     // ...
7   }
8 }

After:

1 for (int i = 0; i < n; i++) {
2   const double sin_A = sin(A[i]);
3   for (int j = 0; j < m; j++) {
4     const double cos_B = cos(B[j]);
5     const double r = sin_A*cos_B;
6     // ...
7   }
8 }

The value of sin_A can be calculated once and re-used m times in


the j-loop
In some cases, at -O2 the compiler eliminates common subexpressions automatically
MIC Developer Boot Camp Rev. 12 Scalar Optimization Considerations © Colfax International, 2013–2014
Ternary if-operator Trap
Ternary if operator ( ? : ) is a short-hand for if ...else
Example: the min() function as a pre-processor expression

1 #define min(a, b) ( (a) < (b) ? (a) : (b) )


2 const float c = min(my_function(x), my_function(y));

Problem: line 2 calls my_function() 3 times


Optimization:

1 #define min(a, b) ( (a) < (b) ? (a) : (b) )


2 const float result_a = my_function(x);
3 const float result_b = my_function(y);
4 const float c = min(result_a, result_b);

MIC Developer Boot Camp Rev. 12 Scalar Optimization Considerations © Colfax International, 2013–2014
Strength Reduction

Replace expensive operations with a combination of fast operations.


Example 1: replacing division with multiplication by the precomputed reciprocal:
Before:
1 for (int i = 0; i < n; i++) {
2   A[i] /= n;
3 }

After:
1 const float rn = 1.0f/(float)n;
2 for (int i = 0; i < n; i++)
3   A[i] *= rn;

Example 2: algebraic transformations to replace two divisions with one


Before:
1 for (int i = 0; i < n; i++) {
2   A[i] = (B[i]/C[i])/D[i];
3   E[i] = A[i]/B[i] + C[i]/D[i];
4 }

After:
1 for (int i = 0; i < n; i++) {
2   A[i] = B[i]/(C[i]*D[i]);
3   E[i] = (A[i]*D[i] + B[i]*C[i])/
4          (B[i]*D[i]);
5 }

MIC Developer Boot Camp Rev. 12 Scalar Optimization Considerations © Colfax International, 2013–2014
Consistency of Precision: Constants
1. Operations on type float are faster than operations on type double.
Avoid type conversions and define single-precision literal constants
with the suffix f.
1 const double twoPi = 6.283185307179586;
2 const float phase = 0.3f; // single precision

2. Use 32-bit int values instead of 64-bit long where possible,
including for array indices. Avoid type conversions and define 64-bit
literal constants with the suffix L or UL
1 const long N2 = 1000000*1000000; // Overflow error
2 const long N3 = 1000000L*1000000L; // Correct

MIC Developer Boot Camp Rev. 12 Scalar Optimization Considerations © Colfax International, 2013–2014
Consistency of Precision: Functions

1. math.h contains fast single-precision versions of arithmetic
functions, ending with the suffix f
1 double sin(double x);
2 float sinf(float x);

2. math.h contains fast base-2 exponential and logarithmic functions:
1 double exp(double x); // Double precision, natural base
2 float expf(float x); // Single precision, natural base
3 double exp2(double x); // Double precision, base 2
4 float exp2f(float x); // Single precision, base 2

MIC Developer Boot Camp Rev. 12 Scalar Optimization Considerations © Colfax International, 2013–2014
Floating-Point Semantics
The Intel C++ Compiler may represent floating-point expressions in executable code
differently, depending on the floating-point semantics.
-fp-model strict Only value-safe optimizations
-fp-model precise calculations are reproducible from run to run
exceptions controlled using -fp-model except
-fp-model fast=1 (default) Value-unsafe optimizations are allowed
-fp-model fast=2 better performance at the cost of lower accuracy
-fp-model source Intermediate arithmetic results are rounded to
the precision defined in the source code.
-fp-model double Intermediate arithmetic results are rounded to
53-bit (double) precision.
-fp-model extended Intermediate arithmetic results are rounded to
64-bit (extended) precision.
-fp-model [no-]except controls floating-point exception semantics.
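A hedged example of selecting the floating-point semantics on the command line (the source file name is an assumption):

user@host% icpc -fp-model precise -fp-model except mycode.cc -o mycode   # reproducible, value-safe
user@host% icpc -fp-model fast=2 mycode.cc -o mycode                     # favor speed over accuracy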
MIC Developer Boot Camp Rev. 12 Scalar Optimization Considerations © Colfax International, 2013–2014
Precision Control for Transcendental Functions
-fimf-precision= value[:funclist] Defines the precision for math
functions. value is one of: high, medium or low
-fimf-max-error= ulps[:funclist] The maximum allowable error
expressed in ulps (units in last place)
-fimf-accuracy-bits= n[:funclist] The number of correct bits
required for mathematical function accuracy.
-fimf-domain-exclusion= n[:funclist] Defines a list of special-
value numbers that do not need to be handled.
int n derived by the bitwise OR of types:
extremes: 1, NaNs: 2, infinites: 4, denormals¹: 8, zeroes: 16.
¹ By default, on Intel Xeon Phi, denormals are flushed to zero in hardware, but supported in SVML
MIC Developer Boot Camp Rev. 12 Scalar Optimization Considerations © Colfax International, 2013–2014
Precision Control for Transcendental Functions
1 #include <stdio.h>
2 #include <math.h>
3

4 int main() {
5 const int N = 1000000;
6 const int P = 10;
7 double A[N];
8 const double startValue = 1.0;
9 A[:] = startValue;
10 for (int i = 0; i < P; i++)
11 #pragma simd
12 for (int r = 0; r < N; r++)
13 A[r] = exp(-A[r]);
14

15 printf("Result=%.17e\n", A[0]);
16 }

MIC Developer Boot Camp Rev. 12 Scalar Optimization Considerations © Colfax International, 2013–2014
Precision Control for Transcendental Functions

user@host% icpc -o precision-1 -mmic \
% -fimf-precision=low precision.cc
user@host% scp precision-1 mic0:~/
precision-1 100% 11KB 11.3KB/s
user@host% ssh mic0 time ./precision-1
Result=5.68428695201873779e-01
real 0m 0.08s
user 0m 0.06s
sys 0m 0.02s

user@host% icpc -o precision-2 -mmic \
% -fimf-precision=high precision.cc
user@host% scp precision-2 mic0:~/
precision-2 100% 19KB 19.4KB/s
user@host% ssh mic0 time ./precision-2
Result=5.68428725029060722e-01
real 0m 0.14s
user 0m 0.12s
sys 0m 0.02s

MIC Developer Boot Camp Rev. 12 Scalar Optimization Considerations © Colfax International, 2013–2014
Automatic Vectorization: Making it Happen and Tuning

MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Making it Happen and Tuning © Colfax International, 2013–2014
Challenges with Optimizing Vectorization on Xeon Phi
Must utilize 512-bit vector registers (16 float or 8 double)
Must convince compiler that vectorization is possible
Preferably unit-stride access to data
Preferably align data on 64-byte boundary
Avoid branches in vector loops
Guide compiler regarding expected iteration count, memory
alignment, outer loop vectorization, etc.

This section:
Ensuring that automatic vectorization succeeds where it must exist.

MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Making it Happen and Tuning © Colfax International, 2013–2014
Diagnosing the Utilization of Vector Instructions

When porting and optimizing an application:


Find performance-critical parts
Use -vec-report3 to get information about automatic vectorization
Use Intel VTune Amplifier XE to diagnose the executable
Benchmark regular compilation vs. the -no-vec -no-simd case (see the sketch below)
Provide additional information to the compiler about loops in the form of #pragmas

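A hedged sketch of the diagnostic and benchmarking steps above (the source file name kernel.cc is an assumption):

user@host% icpc -O3 -vec-report3 kernel.cc -o kernel        # report which loops were vectorized
user@host% icpc -O3 -no-vec -no-simd kernel.cc -o kernel-novec
user@host% time ./kernel        # compare against the scalar build
user@host% time ./kernel-novec  # to measure the benefit of vectorization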
MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Making it Happen and Tuning © Colfax International, 2013–2014
Assumed Vector Dependence. The restrict Keyword.
True vector dependence makes vectorization impossible:
1 float *a, *b; //...
2 for (int i = 1; i < n; i++)
3 a[i] += b[i]*a[i-1]; // dependence on the previous element

Assumed vector dependence: when the compiler cannot determine
whether a vector dependence exists, auto-vectorization fails:
1 void mycopy(int n,
2             float* a, float* b) {
3   for (int i = 0; i < n; i++)
4     a[i] = b[i];
5 }

user@host% icpc -vec-report3 -c vdep.cc
vdep.cc(2): (col. 3) remark: loop skipped: multiversioned.
vdep.cc(2): (col. 3) remark: loop was not vectorized: not inner loop.

MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Making it Happen and Tuning © Colfax International, 2013–2014
Ignoring Assumed Vector Dependence

To ignore an assumed vector dependence, use #pragma ivdep:

1 void mycopy(int n,
2             float* a, float* b) {
3   #pragma ivdep
4   for (int i = 0; i < n; i++)
5     a[i] = b[i];
6 }

user@host% icpc -vec-report3 -c vdep.cc
vdep.cc(3): (col. 3) remark: LOOP WAS VECTORIZED.
vdep.cc(3): (col. 3) remark: loop was not vectorized: not inner loop.

MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Making it Happen and Tuning © Colfax International, 2013–2014
Pointer Disambiguation (alternative to #pragma ivdep)
restrict keyword applies to each pointer variable qualified with it
The object accessed by the pointer is only accessed by that pointer
in the given scope
The compiler argument -restrict must be used.

1 void mycopy(int n, float* restrict a, float* restrict b) {


2 for (int i = 0; i < n; i++)
3 a[i] = b[i];
4 }

user@host% icpc -vec-report3 -restrict -c vdep.cc


vdep.cc(2): (col. 3) remark: LOOP WAS VECTORIZED.
vdep.cc(2): (col. 3) remark: loop was not vectorized: not inner loop.

MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Making it Happen and Tuning © Colfax International, 2013–2014
Automatic Vectorization: Data Structures

MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Data Structures © Colfax International, 2013–2014
Challenges with Optimizing Vectorization on Xeon Phi
Must utilize 512-bit vector registers (16 float or 8 double)
Must convince compiler that vectorization is possible
Preferably unit-stride access to data
Preferably align data on 64-byte boundary
Avoid branches in vector loops
Guide compiler regarding expected iteration count, memory
alignment, outer loop vectorization, etc.

The rule of thumb for achieving unit-stride access


Use structures of arrays (SoA) instead of arrays of structures (AoS)

MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Data Structures © Colfax International, 2013–2014
Example: Unit-Stride Access in Coulomb’s Law Application
Φ(R⃗_j) = − Σ_{i=1}^{m} q_i / |r⃗_i − R⃗_j|,                                      (1)

|r⃗_i − R⃗| = sqrt( (r_{i,x} − R_x)² + (r_{i,y} − R_y)² + (r_{i,z} − R_z)² ).     (2)

[Figure: left, a charge distribution of positive and negative charges in the unit square; right, the resulting electric potential Φ(x, y, z=0).]
White paper: research.colfaxinternational.com/post/2012/03/12/AVX.aspx


MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Data Structures © Colfax International, 2013–2014
Elegant, but Inefficient Solution: Array of Structures

1 struct Charge { // Elegant, but ineffective data layout


2 float x, y, z, q;
3 } chgs[m]; // Coordinates and value of this charge

1 for (int i=0; i<m; i++) { // This loop will be auto-vectorized


2 // Non-unit stride: (&chg[i+1].x - &chg[i].x) != sizeof(float)
3 const float dx=chg[i].x - Rx;
4 const float dy=chg[i].y - Ry;
5 const float dz=chg[i].z - Rz;
6 phi -= chg[i].q / sqrtf(dx*dx+dy*dy+dz*dz); // Coulomb’s law
7 }

MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Data Structures © Colfax International, 2013–2014
Arrays of Structures versus Structures of Arrays
Array of Structures (AoS)
1 struct Charge { // Elegant, but ineffective data layout
2 float x, y, z, q; // Coordinates and value of this charge
3 };
4 // The following line declares a set of m point charges:
5 Charge chg[m];

Structure of Arrays (SoA)


1 struct Charge_Distribution {
2 // Data layout permits effective vectorization of Coulomb’s law application
3 const int m; // Number of charges
4 float * x; // Array of x-coordinates of charges
5 float * y; // ...y-coordinates...
6 float * z; // ...etc.
7 float * q; // These arrays are allocated in the constructor
8 };
MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Data Structures © Colfax International, 2013–2014
Optimized Solution: Structure of Arrays, Unit-Stride Access
1 struct Charge_Distribution {
2 // Data layout permits effective vectorization of Coulomb’s law application
3 const int m; // Number of charges
4 float *x, *y, *z, *q; // Arrays of x-, y- and z-coordinates of charges
5 };

1 // This version vectorizes better thanks to unit-stride data access


2 for (int i=0; i<chg.m; i++) {
3 // Unit stride: (&chg.x[i+1] - &chg.x[i]) == sizeof(float)
4 const float dx=chg.x[i] - Rx;
5 const float dy=chg.y[i] - Ry;
6 const float dz=chg.z[i] - Rz;
7 phi -= chg.q[i] / sqrtf(dx*dx+dy*dy+dz*dz);
8 }

MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Data Structures © Colfax International, 2013–2014
Electric Potential Calculation with Coulomb’s Law
[Figure: electric potential calculation time in seconds (lower is better) on the host system and on the Intel Xeon Phi coprocessor for three versions: non-unit stride (array of structures), unit-stride (structure of arrays), and unit-stride with relaxed precision. The measured times fall from 0.90 s for the slowest case to 0.22 s for the fastest.]

MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Data Structures © Colfax International, 2013–2014
Automatic Vectorization: Data Alignment

MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Data Alignment © Colfax International, 2013–2014
Challenges with Optimizing Vectorization on Xeon Phi
Must utilize 512-bit vector registers (16 float or 8 double)
Must convince compiler that vectorization is possible
Preferably unit-stride access to data
Preferably align data on 64-byte boundary
Avoid branches in vector loops
Guide compiler regarding expected iteration count, memory
alignment, outer loop vectorization, etc.

This section:
Data alignment and compiler hints.

MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Data Alignment © Colfax International, 2013–2014
Data Alignment

char* p points to an address aligned on an n-byte


boundary if ((size_t)p%n==0).
128-bit SSE load and store instructions require 16-byte
alignment,
256-bit AVX load and store instructions do not require
alignment,
512-bit IMCI load and store instructions require 64-byte
alignment.
MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Data Alignment © Colfax International, 2013–2014
Data Alignment
Data alignment on the stack
1 float A[n] __attribute__((aligned(64))); // 64-byte alignment applied

Ï The address of A[0] is a multiple of 64, i.e., aligned on a 64-byte boundary.


Ï Setting a very high alignment value may lead to wasted virtual memory.
Alignment of memory blocks on the heap
1 #include <malloc.h>
2 // ...
3 float *A = (float*)_mm_malloc(n*sizeof(float), 64);
4 // ...
5 _mm_free(A);

Ï _mm_malloc and _mm_free are aligned version of malloc and free:


Ï the header file malloc.h must be included
MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Data Alignment © Colfax International, 2013–2014
Data Alignment Hints
Programmer may promise to the compiler (under penalty of a
segmentation fault) that alignment has been taken care of:
1 float* packedData = _mm_malloc(sizeof(float)*nData, 64);
2 float* inVector = _mm_malloc(sizeof(float)*nRows, 64);
3 // ... Pragma vector aligned promises to the compiler that elements of array
4 // used in the first iteration are 64-byte boundary aligned.
5 #pragma vector aligned
6 for (int c = 0; c < blockLen[idx]; c++) // blockLen[idx] are multiples of 64
7 sum += packedData[offs+c]*inVector[j0+c];
8 outVector[i] += sum;
9 // ...
10 _mm_free(packedData); _mm_free(inVector);

This can lead to significant speedups, because the compiler will not
generate runtime checks for unaligned data.
MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Data Alignment © Colfax International, 2013–2014
Data Alignment and Padding
Note: when relying on #pragma vector aligned, may need to pad the
inner dimension on data structures to a multiple of 16 (in single
precision) or 8 (double precision).
1 void GaussEl(const int n, const int m, const int start, float* const matrix) {
2 for (int i = start+1; i < n; i++) {
3 const float factor = matrix[(i-1)*m]/matrix[i*m];
4 #pragma vector aligned
5 for (int j = 0; j < m; j++)
6 matrix[i*m + j] += factor*matrix[(i-1)*m + j];
7 }
8 // ... Padding inner dimension and allocating matrix
9 if (m % 16 != 0) m += (16 - m%16);
10 matrix = (float*)_mm_malloc(n*m*sizeof(float), 64);
11 //...
12 GaussEl(n, m, 0, matrix);

MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Data Alignment © Colfax International, 2013–2014
Vectorization Pragmas, Keywords and Compiler Arguments
#pragma simd
#pragma vector always
#pragma vector aligned | unaligned
#pragma vector nontemporal | temporal
#pragma novector
#pragma ivdep
restrict qualifier and -restrict command-line argument
#pragma loop count
__assume_aligned keyword (sketch below)
-vec-report[n]
-O[n]
-x[code]
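A minimal sketch of the __assume_aligned keyword from the list above (not from the original slides; the function and array names are illustrative):

void scale(float* A, const int n, const float f) {
  // Promise that A was allocated with 64-byte alignment,
  // e.g., with _mm_malloc(n*sizeof(float), 64); a broken promise may crash
  __assume_aligned(A, 64);
  for (int i = 0; i < n; i++)
    A[i] *= f;
}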
MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Data Alignment © Colfax International, 2013–2014
Thread Parallelism: Reducing Synchronization

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Reducing Synchronization © Colfax International, 2013–2014
Challenges with Thread Parallelism on Xeon Phi

Multi-core CPU: 4–48 threads, Xeon Phi: 228–244 threads.


Must have enough parallelism to keep all cores busy
Must have less synchronization than on CPU
Must have lower per-thread memory overhead
Must access core-local data whenever possible
Must co-exist with vectorization in each core

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Reducing Synchronization © Colfax International, 2013–2014
Example: Dealing with Excessive Synchronization

Computing a histogram (m << n) with a serial code:


1 void Histogram(const float* age, int* const hist, const int n,
2 const float group_width, const int m) {
3 for (int i = 0; i < n; i++) {
4 const int j = (int) ( age[i] / group_width );
5 hist[j]++;
6 }
7 }

Code cannot be automatically vectorized


True vector dependence

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Reducing Synchronization © Colfax International, 2013–2014
The Same Calculation, Strip-Mined, Vectorized
1 void Histogram(const float* age, int* const hist, const int n,
2 const float group_width, const int m) {
3 const int vecLen = 16; // Length of vectorized loop
4 const float invGroupWidth = 1.0f/group_width; // Pre-compute the reciprocal
5 // Strip-mining the loop in order to vectorize the inner short loop
6 // Note: this algorithm assumes n%vecLen == 0.
7 for (int ii = 0; ii < n; ii += vecLen) { //Temporary store vecLen indices
8 int histIdx[vecLen] __attribute__((aligned(64)));
9 // Vectorize the multiplication and rounding
10 #pragma vector aligned
11 for (int i = ii; i < ii + vecLen; i++)
12 histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
13 // Scattered memory access, does not get vectorized
14 for (int c = 0; c < vecLen; c++)
15 hist[histIdx[c]]++;
16 }
17 }

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Reducing Synchronization © Colfax International, 2013–2014
Adding Thread Parallelism

1 #pragma omp parallel for schedule(guided)


2 for (int ii = 0; ii < n; ii += vecLen) {
3 int histIdx[vecLen] __attribute__((aligned(64)));
4 #pragma vector aligned
5 for (int i = ii; i < ii + vecLen; i++)
6 histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
7 for (int c = 0; c < vecLen; c++)
8 // Protect the ++ operation with the atomic mutex (inefficient!)
9 #pragma omp atomic
10 hist[histIdx[c]]++;
11 }
12 }

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Reducing Synchronization © Colfax International, 2013–2014
Improving Thread Parallelism
1 #pragma omp parallel
2 {
3 int hist_priv[m]; // Better idea: thread-private storage
4 hist_priv[:] = 0;
5 int histIdx[vecLen] __attribute__((aligned(64)));
6 #pragma omp for schedule(guided)
7 for (int ii = 0; ii < n; ii += vecLen) {
8 #pragma vector aligned
9 for (int i = ii; i < ii + vecLen; i++)
10 histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
11 for (int c = 0; c < vecLen; c++)
12 hist_priv[histIdx[c]]++;
13 }
14 for (int c = 0; c < m; c++) {
15 #pragma omp atomic
16 hist[c] += hist_priv[c];
17 } } }

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Reducing Synchronization © Colfax International, 2013–2014
Dealing with Excessive Synchronization
Computing a histogram: elimination of synchronization
[Figure: computation time in seconds (lower is better) on the host system and on the Intel Xeon Phi coprocessor for four versions: scalar serial code, vectorized serial code, vectorized parallel code with atomic operations, and vectorized parallel code with private variables. The times fall from 71.30 s for the scalar serial code to 0.07 s for the vectorized parallel code with private variables.]

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Reducing Synchronization © Colfax International, 2013–2014
Thread Parallelism: False Sharing

MIC Developer Boot Camp Rev. 12 Thread Parallelism: False Sharing © Colfax International, 2013–2014
False Sharing. Data Padding and Private Variables
False sharing is similar to a race condition
Threads accessing the same cache line
Caused by coherent caches
Cache line is 64 bytes wide (in modern Intel architectures)

[Figure: two CPUs, each running one thread with its own cache; both caches hold a copy of the same cache line backed by main memory.]

MIC Developer Boot Camp Rev. 12 Thread Parallelism: False Sharing © Colfax International, 2013–2014
False Sharing. Data Padding and Private Variables
1 const int m = 5;
2 int hist_thr[nThreads][m];
3 #pragma omp parallel for
4 for (int ii = 0; ii < n; ii += vecLen) {
5 // False sharing occurs here
6 for (int c = 0; c < vecLen; c++)
7 hist_thr[iThread][histIdx[c]]++;
8 }
9 // Reducing results from all threads to the common histogram hist
10 for (int iThread = 0; iThread < nThreads; iThread++)
11 hist[0:m] += hist_thr[iThread][0:m];

The value of m=5 is small


Array elements hist_thr[0][:] are within m*sizeof(int)=20
bytes of array elements hist_thr[1][:]
MIC Developer Boot Camp Rev. 12 Thread Parallelism: False Sharing © Colfax International, 2013–2014
Padding to Avoid False Sharing

1 // Padding for hist_thr[][] in order to avoid a situation


2 // where two (or more) rows share a cache line.
3 const int paddingBytes = 64;
4 const int paddingElements = paddingBytes / sizeof(int);
5 const int mPadded = m + (paddingElements-m%paddingElements);
6 // Shared histogram with a private section for each thread
7 int hist_thr[nThreads][mPadded];
8 hist_thr[:][:] = 0;

MIC Developer Boot Camp Rev. 12 Thread Parallelism: False Sharing © Colfax International, 2013–2014
Padding to Avoid False Sharing
Computing a histogram: elimination of false sharing
[Figure: computation time in seconds (lower is better) on the host system and on the Intel Xeon Phi coprocessor for the baseline parallel code with private variables, the version suffering from false sharing, and versions with padding to 64, 128 and 256 bytes. The times range from 1.60 s for the false-sharing case down to about 0.07 s with padding.]

MIC Developer Boot Camp Rev. 12 Thread Parallelism: False Sharing © Colfax International, 2013–2014
Thread Parallelism: Expanding Iteration Space

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Expanding Iteration Space © Colfax International, 2013–2014
Example: Dealing with Insufficient Parallelism
S_i = Σ_{j=0}^{n} M_{ij},   i = 0 . . . m.                     (3)

m is small, smaller than the number of threads in the system


n is large, large enough so that the matrix does not fit into cache
1 void sum_unoptimized(const int m, const int n, long* M, long* s){
2 #pragma omp parallel for
3 for (int i=0; i<m; i++) {
4 long sum=0;
5 #pragma simd
6 #pragma vector aligned
7 for (int j=0; j<n; j++)
8 sum+=M[i*n+j];
9 s[i]=sum; }}

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Expanding Iteration Space © Colfax International, 2013–2014
Dealing with Insufficient Parallelism
VTune Analysis: Row-Wise Reduction of a Short, Wide Matrix

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Expanding Iteration Space © Colfax International, 2013–2014
Strip-Mining: Simultaneous Thread and Data Parallelism
1 // Compiler may be able to simultaneously parallelize and auto-vectorize it
2 #pragma omp parallel for
3 #pragma simd
4 for (int i = 0; i < n; i++) {
5 // ... do work
6 }

1 // The strip-mining technique separates parallelization from vectorization


2 const int STRIP=1024;
3 #pragma omp parallel for
4 for (int ii = 0; ii < n; ii += STRIP)
5 #pragma simd
6 for (int i = ii; i < ii + STRIP; i++) {
7 // ... do work
8 }

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Expanding Iteration Space © Colfax International, 2013–2014
Exposing Parallelism: Strip-Mining and Loop Collapse
1 void sum_stripmine(const int m, const int n, long* M, long* s){
2 const int STRIP=1024;
3 assert(n%STRIP==0);
4 s[0:m]=0;
5 #pragma omp parallel
6 {
7 long sum[m]; sum[0:m]=0;
8 #pragma omp for collapse(2) schedule(guided)
9 for (int i=0; i<m; i++)
10 for (int jj=0; jj<n; jj+=STRIP)
11 #pragma simd
12 #pragma vector aligned
13 for (int j=jj; j<jj+STRIP; j++)
14 sum[i]+=M[i*n+j];
15 for (int i=0; i<m; i++) // Reduction
16 #pragma omp atomic
17 s[i]+=sum[i];
18 } }

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Expanding Iteration Space © Colfax International, 2013–2014
Exposing Parallelism: Strip-Mining and Loop Collapse

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Expanding Iteration Space © Colfax International, 2013–2014
Dealing with Insufficient Parallelism
Row-Wise Reduction of a Short, Wide Matrix
Parallel row-wise matrix reduction
[Figure: parallel row-wise matrix reduction bandwidth in GB/s (higher is better) on the host system and on the Intel Xeon Phi coprocessor for four versions: unoptimized, parallel inner loop, collapsed nested loops, and strip-mine and collapse. The bandwidth grows from about 6 GB/s for the unoptimized code to 131.6 GB/s for the best version.]

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Expanding Iteration Space © Colfax International, 2013–2014
Thread Parallelism: Affinity

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Affinity © Colfax International, 2013–2014
Setting Thread Affinity

OpenMP threads may migrate from one core to another
according to OS decisions.
Forbidding thread migration can increase performance.
Control: the environment variable KMP_AFFINITY

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Affinity © Colfax International, 2013–2014
Uses of Thread Affinity
Bandwidth-bound applications: 1 thread per core + prevent
migration. Optimizes utilization of memory controllers (example below).
Compute-bound applications: 2 (Xeon) or 4 (Xeon Phi) threads per
core + prevent migration. Ensures that threads consistently access
local L1 cache data (+L2 for Xeon Phi) (example below).
Offload applications: physical core 0 on Xeon Phi is used by µOS for
offload tasks. Prevent placing compute threads on that core.
Applications in multi-socket NUMA (Non-Uniform Memory Access)
systems: partition the system for two independent tasks, pin tasks to
respective CPUs.

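A hedged sketch of affinity settings for the first two cases above (the thread counts assume a 60-core coprocessor and are illustrative):

user@host% export MIC_ENV_PREFIX=MIC
user@host% # Bandwidth-bound: one thread per core, spread across the cores
user@host% export MIC_OMP_NUM_THREADS=60
user@host% export MIC_KMP_AFFINITY=scatter,granularity=fine
user@host% # Compute-bound: four threads per core, kept together on their cores
user@host% export MIC_OMP_NUM_THREADS=240
user@host% export MIC_KMP_AFFINITY=balanced,granularity=fine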
MIC Developer Boot Camp Rev. 12 Thread Parallelism: Affinity © Colfax International, 2013–2014
The KMP_AFFINITY Environment Variable
KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]

modifier:
  verbose/nonverbose
  respect/norespect
  warnings/nowarnings
  granularity=core or thread
type:
  compact, scatter or balanced
  explicit,proclist=[<proc_list>]
  disabled or none

user@host% export MIC_ENV_PREFIX=MIC


user@host% export KMP_AFFINITY=compact,granularity=fine
user@host% export MIC_KMP_AFFINITY=balanced,granularity=fine

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Affinity © Colfax International, 2013–2014
Bandwidth-bound, KMP_AFFINITY=scatter
user@host% export OMP_NUM_THREADS=32
user@host% export KMP_AFFINITY=none
user@host% for i in {1..4} ; do ./rowsum_stripmine | tail -1; done
Problem size: 2.980 GB, outer dimension: 4, threads: 32
Strip-mine and collapse: 0.061 +/- 0.002 seconds (52.89 +/- 1.31 GB/s)
Strip-mine and collapse: 0.059 +/- 0.002 seconds (54.11 +/- 1.56 GB/s)
Strip-mine and collapse: 0.077 +/- 0.001 seconds (41.71 +/- 0.69 GB/s)
Strip-mine and collapse: 0.070 +/- 0.005 seconds (45.59 +/- 3.14 GB/s)
user@host% export OMP_NUM_THREADS=16
user@host% export KMP_AFFINITY=scatter
user@host% for i in {1..4}; do ./rowsum_stripmine | tail -1 ; done
Problem size: 2.980 GB, outer dimension: 4, threads: 16
Strip-mine and collapse: 0.059 +/- 0.004 seconds (54.47 +/- 3.25 GB/s)
Strip-mine and collapse: 0.061 +/- 0.004 seconds (52.30 +/- 3.30 GB/s)
Strip-mine and collapse: 0.062 +/- 0.005 seconds (51.37 +/- 4.29 GB/s)
Strip-mine and collapse: 0.058 +/- 0.001 seconds (55.48 +/- 1.27 GB/s)

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Affinity © Colfax International, 2013–2014
Compute-Bound, KMP_AFFINITY=compact/balanced

1 double* A = (double*)_mm_malloc(sizeof(double)*N*Nld, 64);


2 double* B = (double*)_mm_malloc(sizeof(double)*N*Nld, 64);
3 double* C = (double*)_mm_malloc(sizeof(double)*N*Nld, 64);
4

5 for(int k = 0; k < nIter; k++) {


6

7 dgemm(&tr, &tr, &N, &N, &N, &v, A, &Nld, B, &Nld, &v, C, &N);
8

9 double flopsNow = (2.0*N*N*N+1.0*N*N)*1e-9/(t2-t1);


10 printf("Iteration %d: %.1f GFLOP/s\n", k+1, flopsNow);
11 }
12 _mm_free(A); _mm_free(B); _mm_free(C);

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Affinity © Colfax International, 2013–2014
Compute-Bound, KMP_AFFINITY=compact/balanced
user@host% icpc -o bench-dgemm -mkl -mmic bench-dgemm.cc
user@host% micnativeloadex ./bench-dgemm
Iteration 1: 312.7 GFLOP/s
Iteration 2: 346.5 GFLOP/s
Iteration 3: 348.5 GFLOP/s
Iteration 4: 347.2 GFLOP/s
Iteration 5: 348.3 GFLOP/s

user@host% micnativeloadex ./bench-dgemm -e "KMP_AFFINITY=compact"


Iteration 1: 626.8 GFLOP/s
Iteration 2: 769.1 GFLOP/s
Iteration 3: 769.4 GFLOP/s
Iteration 4: 769.3 GFLOP/s
Iteration 5: 769.4 GFLOP/s

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Affinity © Colfax International, 2013–2014
Other Optimization Topics for Thread Parallelism

Examples found in our 4-day training and in the book:


Avoiding excessive synchronization with reduction
Load balancing across threads
Using thread affinity to partition a multi-socket NUMA system

MIC Developer Boot Camp Rev. 12 Thread Parallelism: Affinity © Colfax International, 2013–2014
§6. Advanced Optimization for the MIC
Architecture

MIC Developer Boot Camp Rev. 12 Advanced Optimization for the MIC Architecture © Colfax International, 2013–2014
Memory Access and Cache Utilization

MIC Developer Boot Camp Rev. 12 Memory Access and Cache Utilization © Colfax International, 2013–2014
Challenges with Memory Access on Xeon Phi

More threads than CPU, same amount of Level-2 cache (~30 MB)
No hardware prefetching from Level-2 to Level-1
High penalty for data page walks
Dynamic memory allocation is serial → greater penalty than CPU
per Amdahl’s law
“Rule of Thumb” for memory optimization: locality of data access in
space and in time.
Spatial locality = data structures (packing, reordering).
Temporal locality = order of operations (e.g., loop tiling).

MIC Developer Boot Camp Rev. 12 Memory Access and Cache Utilization © Colfax International, 2013–2014
Loop Tiling (Blocking)

MIC Developer Boot Camp Rev. 12 Memory Access and Cache Utilization © Colfax International, 2013–2014
Loop Tiling (Blocking)
1 // Plain nested loops
2 for (int i = 0; i < m; i++)
3 for (int j = 0; j < n; j++)
4 compute(a[i], b[j]); // Memory access is unit-stride in j

1 // Tiled nested loops


2 for (int ii = 0; ii < m; ii += TILE)
3 for (int j = 0; j < n; j++)
4 for (int i = ii; i < ii + TILE; i++) //Re-use data for each j with several i
5 compute(a[i], b[j]); // Memory access is unit-stride in j

1 // Doubly tiled nested loops


2 for (int ii = 0; ii < m; ii += TILE)
3 for (int jj = 0; jj < n; jj += TILE)
4 for (int i = ii; i < ii + TILE; i++) //Re-use data for each j with several i
5 for (int j = jj; j < jj + TILE; j++)
6 compute(a[i], b[j]); // Memory access is unit-stride in j
MIC Developer Boot Camp Rev. 12 Memory Access and Cache Utilization © Colfax International, 2013–2014
Optimization Example: In-Place Square Matrix Transposition
1 #pragma omp parallel for
2 for (int i = 0; i < n; i++) { // Distribute across threads
3 for (int j = 0; j < i; j++) { // Employ vector load/stores
4 const double c = A[i*n + j]; // Swap elements
5 A[i*n + j] = A[j*n + i];
6 A[j*n + i] = c;
7 }
8 }

Unoptimized code:
Large-stride memory accesses
Inefficient cache use
Does not reach memory bandwidth limit

MIC Developer Boot Camp Rev. 12 Memory Access and Cache Utilization © Colfax International, 2013–2014
Tiling a Parallel For-Loop (Matrix Transposition)
1 #pragma omp parallel for
2 for (int ii = 0; ii < n; ii += TILE) { // Distribute across threads
3 const int iMax = (n < ii+TILE ? n : ii+TILE); // Adapt to matrix shape
4 for (int jj = 0; jj <= ii; jj += TILE) { // Tile the work
5 for (int i = ii; i < iMax; i++) { // Universal microkernel
6 const int jMax = (i < jj+TILE ? i : jj+TILE); // for whole matrix
7 #pragma loop count avg(TILE) // Vectorization tuning
8 #pragma simd // Vectorization hint
9 for (int j = jj; j<jMax; j++) { // Variable loop count (bad)
10 const double c = A[i*n + j]; // Swap elements
11 A[i*n + j] = A[j*n + i];
12 A[j*n + i] = c;
13 } } } }

Better (but not optimal) solution:


Loop tiling to improve locality of data access
Not enough outer loop iterations to keep 240 threads busy
MIC Developer Boot Camp Rev. 12 Memory Access and Cache Utilization © Colfax International, 2013–2014
Further Optimization of Matrix Transposition

Multi-versioned inner loop for diagonal, edges and body
Tuning pragma to enforce non-temporal stores
Expand parallel iteration space to occupy all threads
Control data alignment
OpenMP thread affinity for bandwidth optimization

MIC Developer Boot Camp Rev. 12 Memory Access and Cache Utilization © Colfax International, 2013–2014
Further Optimization: Code Snippet
1 #pragma omp parallel
2 {
3 #pragma omp for schedule(guided)
4 for (int k = 0; k < nTilesParallel; k++) { // Bulk of calculations here
5 const int ii = plan[HEADER_OFFSET + 2*k + 0]*TILE; // Planned order
6 const int jj = plan[HEADER_OFFSET + 2*k + 1]*TILE; // of operations
7 for (int j = jj; j < jj+TILE; j++) { // Simplified main microkernel
8 #pragma simd // Vectorization hint
9 #pragma vector nontemporal // Cache traffic hint
10 for (int i = ii; i < ii+TILE; i++) { // Constant loop count (good)
11 const double c = A[i*n + j]; // Swap elements
12 A[i*n + j] = A[j*n + i];
13 A[j*n + i] = c;
14 } } }
15 // Transposing the tiles along the main diagonal and edges...
16 // ...
Longer code but still in the C language; works for CPU and MIC
MIC Developer Boot Camp Rev. 12 Memory Access and Cache Utilization © Colfax International, 2013–2014
Arithmetic Intensity and Roofline Model
Theoretical estimates, Intel Xeon Phi coprocessor
Arithmetic Performance = 60 × 1.0 × (512/64) × 2 = 960 GFLOP/s.
Memory Bandwidth = η × 6.0 × 8 × 2 × 4 = η × 384 GB/s,
Peak performance for:
  60-core Intel Xeon Phi clocked at 1.0 GHz
  512-bit SIMD registers
  64-bit floating-point numbers
  fused multiply-add
The peak memory bandwidth:
  η ≈ 0.5 – practical efficiency
  6.0 GT/s (Transfers)
  8 memory controllers
  2 channels in each
  4 bytes per channel

MIC Developer Boot Camp Rev. 12 Memory Access and Cache Utilization © Colfax International, 2013–2014
Arithmetic Intensity and Roofline Model
Theoretical estimates, Intel Xeon Phi coprocessor
Arithmetic Performance = 60 × 1.0 × (512/64) × 2 = 960 GFLOP/s.
Memory Bandwidth = η × 6.0 × 8 × 2 × 4 = η × 384 GB/s,

To saturate Arithmetic and Logic Units (ALUs):


384/8 = 48 billion floating-point numbers per second should be delivered from
memory to the cores (double precision)

960/48 = 20 floating-point operations (multiplication/addition) must be performed on


every number fetched from the main memory

MIC Developer Boot Camp Rev. 12 Memory Access and Cache Utilization © Colfax International, 2013–2014
Arithmetic Intensity and Roofline Model
Theoretical estimates, 2x 8-core Intel Xeon E5 processors at 3.0 GHz
Arithmetic Performance = 2 sockets × 8 × 3.0 × (256/64) × 2 = 384 GFLOP/s,

Memory Bandwidth = 2 sockets × η × 6.4 × 8 = η × 102 GB/s,


Peak performance for:
  16 Intel Xeon cores clocked at 3.0 GHz
  256-bit SIMD registers
  64-bit floating-point numbers
  2 ALUs
The peak memory bandwidth:
  η ≈ 0.5 – practical efficiency
  6.4 GT/s (Transfers)
  8 bytes per transfer
MIC Developer Boot Camp Rev. 12 Memory Access and Cache Utilization © Colfax International, 2013–2014
Arithmetic Intensity and Roofline Model
Theoretical estimates, 2x 8-core Intel Xeon E5 processors at 3.0 GHz
Arithmetic Performance = 2 sockets × 8 × 3.0 × (256/64) × 2 = 384 GFLOP/s,

Memory Bandwidth = 2 sockets × η × 6.4 × 8 = η × 102 GB/s,

To saturate Arithmetic and Logic Units (ALUs):


102/8 ≈ 13 billion floating-point numbers per second should be delivered from
memory to the cores (double precision)

384/13 = 30 floating-point operations (multiplication/addition) must be performed on


every number fetched from the main memory

MIC Developer Boot Camp Rev. 12 Memory Access and Cache Utilization © Colfax International, 2013–2014
Arithmetic Intensity and Roofline Model
Roofline model: theoretical peak

[Figure: roofline model, performance in GFLOP/s vs. arithmetic intensity (log-log) for the host system and the coprocessor. Each curve rises along its theoretical maximum bandwidth at low arithmetic intensity and saturates at its theoretical maximum performance at high arithmetic intensity.]

More on roofline model: Williams et al.

MIC Developer Boot Camp Rev. 12 Memory Access and Cache Utilization © Colfax International, 2013–2014
Other Topics on Memory Traffic Optimization

Discussions found in our 4-day training and in the book:


Recursive cache-oblivious algorithms
Cross-procedural loop fusion
Software prefetching

MIC Developer Boot Camp Rev. 12 Memory Access and Cache Utilization © Colfax International, 2013–2014
Data Persistence and PCIe Traffic

MIC Developer Boot Camp Rev. 12 Data Persistence and PCIe Traffic © Colfax International, 2013–2014
Memory Retention Between Offloads
// Allocate arrays on coprocessor during the first iteration;
// retain allocated memory for subsequent iterations
#pragma offload target(mic:0) \
  in(data: length(size) alloc_if(k==0) free_if(k==nTrials-1) align(64))
{
  // offloaded code here...
}

Data transfer rate across the PCIe bus is approximately 6 GB/s
Memory allocation on the coprocessor proceeds at only about 0.5 GB/s
The memory allocation operation is serial and therefore slow
Memory retention reduces the offload latency by a factor of 10
For smaller arrays, the effect is even more dramatic
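For context, below is a minimal sketch of how the alloc_if/free_if clauses above are typically used across repeated offloads; the surrounding function, the reuse of the names data, size, k and nTrials from the fragment above, and the omission of any timing code are assumptions rather than the complete original benchmark.

void runTrials(float* data, const long size, const int nTrials) {
  for (int k = 0; k < nTrials; k++) {
    // Allocate coprocessor memory on the first trial, free it on the last one;
    // the array contents are still transferred across PCIe on every trial.
    #pragma offload target(mic:0) \
      in(data : length(size) alloc_if(k==0) free_if(k==nTrials-1) align(64))
    {
      // offloaded code here...
    }
  }
}

The "data persistence" case timed on the next slide additionally avoids the transfer itself; in the explicit offload model this is usually expressed with the nocopy clause, e.g. nocopy(data : length(size) alloc_if(0) free_if(0)), once the data has already been sent to the coprocessor.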
MIC Developer Boot Camp Rev. 12 Data Persistence and PCIe Traffic © Colfax International, 2013–2014
Offload Latency With and Without Memory/Data Retention
[Figure: offload latency in ms (logarithmic scale, 0.1 to 1000 ms) versus array size (1 kB to 1 GB) for three cases: default offload (memory allocation + data transfer + deallocation), offload with memory retention (data transfer only), and offload with data persistence (no memory allocation or data transfer).]

MIC Developer Boot Camp Rev. 12 Data Persistence and PCIe Traffic © Colfax International, 2013–2014
MPI Applications on Clusters with Coprocessors

MIC Developer Boot Camp Rev. 12 MPI Applications on Clusters with Coprocessors © Colfax International, 2013–2014
MPI: Fabrics

MIC Developer Boot Camp Rev. 12 MPI: Fabrics © Colfax International, 2013–2014
MPI Fabric Selection: Ethernet and InfiniBand
Ethernet+TCP between coprocessors is slower than the hardware limit
InfiniBand approaches the hardware limit from CPU to coprocessors

[Figure: MPI latency (µs) and bandwidth (GB/s) versus message size (4 B to 1 GB), from http://research.colfaxinternational.com/. One pair of panels (peak bandwidth near 0.12 GB/s, latencies up to about 500 µs) shows the CPU - remote CPU, CPU - mic0, mic0 - mic1, CPU - remote mic0 and mic0 - remote mic0 paths; the other pair (peak bandwidth near 7 GB/s, latencies of a few µs) shows the CPU - remote CPU, CPU - mic0, CPU - remote mic0 and CPU - remote mic2 paths.]
MIC Developer Boot Camp Rev. 12 MPI: Fabrics © Colfax International, 2013–2014
MPI Fabric Selection: Ethernet and InfiniBand

InfiniBand requires additional software on top of MPSS
Environment variable I_MPI_FABRICS selects the fabric
More information in the white paper

[Figure: block diagram of the host system (CPU, memory, PCIe chipset, RDMA-capable device) and the Intel Xeon Phi coprocessor with its own memory and a virtualized InfiniBand HCA.]
MIC Developer Boot Camp Rev. 12 MPI: Fabrics © Colfax International, 2013–2014
MPI Fabric Selection: Intra-Device Fabric
Part of CCL: virtual interface ibscif for communication between
coprocessors within a system
Default Combination: I_MPI_FABRICS=shm:dapl
shm provides better latency, dapl – greater bandwidth
[Figure: MPI latency (µs) and bandwidth (GB/s) versus message size (4 B to 1 GB) for mic0 - mic0 communication over dapl and over shm, from http://research.colfaxinternational.com/.]
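As an illustration of selecting the fabric combination above at run time, the environment variable can be passed to all ranks with the -genv option of mpirun; the command line below is only an example, and the binary name ./app is a placeholder rather than part of the original material:

user@host% mpirun -genv I_MPI_FABRICS shm:dapl \
           -np 32 -host localhost ./app : -np 240 -host mic0 ~/app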

MIC Developer Boot Camp Rev. 12 MPI: Fabrics © Colfax International, 2013–2014
Communication Efficiency with Symmetric Clustering
MPI communication between CPU and coprocessors as efficient as offload
Peer-to-peer communication not uniform, but better than with Gigabit Ethernet

[Figure: two nodes connected through an InfiniBand switch, each with two CPUs (CPU0, CPU1) and four coprocessors (mic0 to mic3) attached via PCIe; the diagram is annotated with measured point-to-point latencies (roughly 1 to 10 µs) and bandwidths (roughly 0.3 to 11 GB/s) between CPUs and coprocessors within and across the nodes. Source: http://research.colfaxinternational.com/]

White paper with details:
http://research.colfaxinternational.com/post/2014/03/11/InfiniBand-for-MIC.aspx

MIC Developer Boot Camp Rev. 12 MPI: Fabrics © Colfax International, 2013–2014
Process Parallelism: MPI Optimization Strategies

Dynamic scheduling
Load balancing
Communication-efficient algorithms
OpenMP/MPI hybrid

MIC Developer Boot Camp Rev. 12 MPI: Fabrics © Colfax International, 2013–2014
The Monte Carlo Method of Computing the Number π
A_quarter_circle = (1/4) π R²
A_square = L²

Sample N random points (Monte Carlo trials) uniformly in a unit square (L = 1) with an inscribed quarter circle (R = 1). The expected fraction of points falling inside the quarter circle is

⟨N_quarter_circle⟩ / N = A_quarter_circle / A_square = (π R²) / (4 L²) = π / 4

π ≈ 4 × N_quarter_circle / N
π = 3.141592653589793...

[Figure: unit square in the (x, y) plane with an inscribed quarter circle; each random point (Monte Carlo trial) falls either inside the quarter circle area or only inside the unit square area.]
MIC Developer Boot Camp Rev. 12 MPI: Fabrics © Colfax International, 2013–2014
The Monte Carlo Method of Computing the Number π
#include <mkl_vsl.h>
const long BLOCK_SIZE = 4096;

// Random number generator from MKL
VSLStreamStatePtr stream;
vslNewStream( &stream, VSL_BRNG_MT19937, seed );

// r[] holds BLOCK_SIZE (x, y) coordinate pairs; dUnderCurve counts hits inside the quarter circle
for (long j = 0; j < nBlocks; j++) {
  vsRngUniform( VSL_RNG_METHOD_UNIFORM_STD, stream, BLOCK_SIZE*2, r, 0.0f, 1.0f );
  for (long i = 0; i < BLOCK_SIZE; i++) {
    const float x = r[i];
    const float y = r[i+BLOCK_SIZE];
    if (x*x + y*y < 1.0f) dUnderCurve++;
  }
}
const double pi = (double)dUnderCurve / (double)iter * 4.0;  // iter = total number of trials

MIC Developer Boot Camp Rev. 12 MPI: Fabrics © Colfax International, 2013–2014
The Monte Carlo Method of Computing the Number π
int rank, nRanks, trial;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

// Distribute the blocks of trials evenly across all MPI ranks
const double blocksPerProc = (double)nBlocks / (double)nRanks;
const long myFirstBlock = (long)(blocksPerProc*rank);
const long myLastBlock  = (long)(blocksPerProc*(rank+1));

RunMonteCarlo(myFirstBlock, myLastBlock, stream, dUC);

// Compute pi: sum the per-rank hit counts on rank 0
MPI_Reduce(&dUC, &UnderCurveSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) {
  const double pi = (double)UnderCurveSum / (double)iter * 4.0;
  // ... report pi here
}

MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize();

MIC Developer Boot Camp Rev. 12 MPI: Fabrics © Colfax International, 2013–2014
The Monte Carlo Method of Computing the Number π
Host, coprocessor, heterogeneous
user@host% mpirun -np 32 -host localhost ./pi_mpi
Time, s: 0.84
user@host% mpirun -np 240 -host mic0 ~/pi_mpi
Time, s: 0.44
user@host% mpirun -np 32 -host localhost ./pi_mpi : -np 240 -host mic0 ~/pi_mpi
Time, s: 0.36

Coprocessor is 1.9x faster than the host system:
T_host ≈ 0.84 seconds, T_Phi ≈ 0.44 seconds
Expect T_both = 1/(1/T_host + 1/T_Phi) = 1/(1/0.84 + 1/0.44) ≈ 0.29 seconds
T_measured ≈ 0.36 seconds, which is about 25% worse than expected. Why?

MIC Developer Boot Camp Rev. 12 MPI: Fabrics © Colfax International, 2013–2014
Using Intel Trace Analyzer and Collector

CPU finishes its share of work faster than the coprocessors.

MIC Developer Boot Camp Rev. 12 MPI: Fabrics © Colfax International, 2013–2014
Load Balancing with Static Scheduling
Solution: assign more work to CPU ranks.

Let b_host and b_MIC be the number of work blocks assigned to each host rank and to each coprocessor rank, and P_host and P_MIC the number of MPI ranks on the host and on the coprocessor. Define the tuning parameter

α = b_host / b_MIC.

The total amount of work is

B_total = b_host × P_host + b_MIC × P_MIC,

so that

b_host = α × B_total / (α × P_host + P_MIC),
b_MIC = B_total / (α × P_host + P_MIC).

[Figure: effect of load balancing between host and coprocessor in the Monte Carlo calculation of π. Run time (s, lower is better) versus the parameter α from 0 to 8, with regions of load imbalance on the host and on the coprocessor on either side of the optimum; the baseline (no load balancing) and the theoretical best time are shown for reference.]
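To make the partitioning concrete, below is a minimal sketch of how the equal block split in the earlier pi_mpi listing could be replaced by an α-weighted split. The helpers isHostRank(), hostRanksBefore() and micRanksBefore(), and the variables nHostRanks and nMICRanks, are hypothetical names introduced for illustration; they are not part of the original code.

// Host ranks receive alpha blocks for every block given to a coprocessor rank.
const double alpha = 3.4;   // tuned value from the measurements above
const double wTotal = alpha*nHostRanks + (double)nMICRanks;

// Combined weight of all ranks that precede this rank in MPI_COMM_WORLD
const double wBefore = alpha*hostRanksBefore(rank) + (double)micRanksBefore(rank);
const double wMine   = isHostRank(rank) ? alpha : 1.0;

const long myFirstBlock = (long)(nBlocks * wBefore / wTotal);
const long myLastBlock  = (long)(nBlocks * (wBefore + wMine) / wTotal);

RunMonteCarlo(myFirstBlock, myLastBlock, stream, dUC);

With this split, each host rank processes approximately α×B_total/(α×P_host + P_MIC) blocks and each coprocessor rank approximately B_total/(α×P_host + P_MIC) blocks, as in the formulas above.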
MIC Developer Boot Camp Rev. 12 MPI: Fabrics © Colfax International, 2013–2014
Load Balancing with Static Scheduling
Load balance: execution times (seconds, lower is better)

- Xeon only (32 processes): 0.839
- Xeon Phi only (240 processes): 0.449
- Xeon + Xeon Phi, α = 1.0: 0.366
- Xeon + Xeon Phi, α = 3.4: 0.283
MIC Developer Boot Camp Rev. 12 MPI: Fabrics © Colfax International, 2013–2014
§7. Conclusion

MIC Developer Boot Camp Rev. 12 Conclusion © Colfax International, 2013–2014


Course Recap

MIC Developer Boot Camp Rev. 12 Course Recap © Colfax International, 2013–2014
Programming Models for Xeon Phi Coprocessors
1. Native coprocessor applications
   - Compile with -mmic
   - Run with micnativeloadex or scp+ssh
   - The way to go for MPI applications without offload
2. Explicit offload
   - Functions and global variables require __attribute__((target(mic)))
   - Initiate offload and data marshalling with #pragma offload
   - Only bitwise-copyable data can be shared
3. Clusters and multiple coprocessors
   - #pragma offload target(mic:i)
   - Use threads to offload to multiple coprocessors (see the sketch below)
   - Run native MPI applications
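A minimal sketch of item 3, offloading from concurrent host threads to two coprocessors, is shown below; the two-device setup, the OpenMP parallel region and the placeholder workload are illustrative assumptions rather than code from the course labs.

#include <omp.h>

// Split an array between two coprocessors, one host thread per device.
void processOnTwoCoprocessors(float* work, const long n) {
  const long half = n/2;                     // assume n is even for simplicity
  #pragma omp parallel num_threads(2)
  {
    const int dev = omp_get_thread_num();    // coprocessor index: 0 or 1
    float* part = work + dev*half;           // each device processes half of the data
    #pragma offload target(mic:dev) inout(part : length(half))
    {
      for (long i = 0; i < half; i++)
        part[i] *= 2.0f;                     // placeholder workload
    }
  }
}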

MIC Developer Boot Camp Rev. 12 Course Recap © Colfax International, 2013–2014
Optimization Checklist

1. Scalar optimization
2. Vectorization
3. Scale above 100 threads
4. Arithmetically intensive or bandwidth-limited
5. Efficient cooperation between the host and the coprocessor(s)

MIC Developer Boot Camp Rev. 12 Course Recap © Colfax International, 2013–2014
Additional Resources: Reading, Guides, Support

MIC Developer Boot Camp Rev. 12 Additional Resources: Reading, Guides, Support © Colfax International, 2013–2014
Reference Guides

Intel C++ Compiler 14.0 User and Reference Guide


Intel VTune Amplifier XE User’s Guide
Intel Trace Analyzer and Collector Reference Guide
Intel MPI Library for Linux* OS Reference Manual
Intel Math Kernel Library Reference Manual
Intel Software Documentation Library
MPI Routines on the ANL Web Site
OpenMP Specifications

MIC Developer Boot Camp Rev. 12 Additional Resources: Reading, Guides, Support © Colfax International, 2013–2014
Intel’s Top 10 List
1. Download programming books: “Intel Xeon Phi Coprocessor High Performance Programming” by Jeffers & Reinders, and “Parallel Programming and Optimization with Intel Xeon Phi Coprocessors” by Colfax.
2. Watch the parallel programming webinar.
3. Bookmark and browse the mic-developer website.
4. Bookmark and browse the two developer support forums: “Intel MIC Architecture” and “Threading on Intel Parallel Architectures”.
5. Consult the “Quick Start” guide to prepare your system for first use, learn about tools, and get C/C++ and Fortran-based programs up and running.
Link to TOP10 List for Starter Kit Developers
MIC Developer Boot Camp Rev. 12 Additional Resources: Reading, Guides, Support © Colfax International, 2013–2014
Intel’s Top 10 List (continued)
6. Try your hand at the beginning lab exercises.
7. Try your hand at the beginner/intermediate real-world app exercises.
8. Browse the case studies webpage to view examples from many segments.
9. Begin optimizing your application(s); consult your programming books, the ISA reference manual, and the support forums for assistance.
10. Hone your skills by watching more advanced video workshops.

Link to TOP10 List for Starter Kit Developers

MIC Developer Boot Camp Rev. 12 Additional Resources: Reading, Guides, Support © Colfax International, 2013–2014
Intel Xeon Phi Starter Kit

MIC Developer Boot Camp Rev. 12 Additional Resources: Reading, Guides, Support © Colfax International, 2013–2014
Workstations with Intel Xeon Phi Coprocessors (Jan 2014)

http://www.colfax-intl.com/nd/xeonphi/workstations.aspx
MIC Developer Boot Camp Rev. 12 Additional Resources: Reading, Guides, Support © Colfax International, 2013–2014
Servers with Intel Xeon Phi Coprocessors (Jan 2014)

http://www.colfax-intl.com/nd/xeonphi/servers.aspx
MIC Developer Boot Camp Rev. 12 Additional Resources: Reading, Guides, Support © Colfax International, 2013–2014
Research and Consulting

http://research.colfaxinternational.com/
http://nlreg.colfax-intl.com/
MIC Developer Boot Camp Rev. 12 Additional Resources: Reading, Guides, Support © Colfax International, 2013–2014
Research and Consulting
Colfax offers consulting services for Enterprises, Research Labs, and
Universities. We can help you to:
Optimize your existing application to take advantage of all levels of hardware parallelism.
Future-proof for upcoming innovations in computing solutions.
Accelerate your application using coprocessor technologies.
Investigate the potential system configurations that satisfy your cost, power and performance requirements.
Take a deep dive to develop a novel approach.
For more details, contact us at [email protected] to discuss what we can do together.
MIC Developer Boot Camp Rev. 12 Additional Resources: Reading, Guides, Support © Colfax International, 2013–2014
Intel® Xeon Phi™ Coprocessor
Remote Access and System Loaner Programs
Remote Access Systems:
- Intel-supported options for Academia:
  - Manycore Testing Lab through SSG (more info)
  - Intel Science & Technology Center (ISTC) and Intel Collaborative Research Institutes (ICRI) programs through Intel Labs (more info)
  - Texas Advanced Computing Center (TACC) and National Institute for Computational Sciences (NICS) both offer allocations through the NSF XSEDE program (more info)
- Colfax Code Treadmill:
  - Seven-day, 24/7 remote access to a personal HPC server at Colfax with training materials, Intel® Xeon® processors, Intel® Xeon Phi™ coprocessors and software development tools
  - More Information: HERE

Loaner Programs:
- Intel Demo Depot:
  - Contact your local Intel sales representative for requesting an Intel® Xeon Phi™ coprocessor-based system
- Colfax Loaner Program:
  - 30-day access to a loaner system, complete with Colfax hardware and software programming support
  - For more information, please send email to: [email protected]

Please contact your Intel BDM or local OEM representative for more remote access and system loaner options.
Intel® Xeon Phi™ Coprocessor Starter Kits

Go parallel today with a fully-configured system starting below $5K*

3120A or 5110P

software.intel.com/xeon-phi-starter-kit

*Pricing and starter kit configurations will vary. See software.intel.com/xeon-phi-starter-kit and provider websites for full details and disclaimers. Stated currency is US Dollars. Other brands and names are the property of their respective owners.
Thank you for tuning in,
and
have a wonderful journey
to the Parallel World!

MIC Developer Boot Camp Rev. 12 Additional Resources: Reading, Guides, Support © Colfax International, 2013–2014
