http://www.colfax-intl.com/nd/xeonphi/training.aspx
Disclaimer
While best efforts have been used in preparing this training, Colfax International makes no
representations or warranties of any kind and assumes no liabilities of any kind with respect to
the accuracy or completeness of the contents and specifically disclaims any implied warranties
of merchantability or fitness for a particular purpose. The publisher shall not be held
liable or responsible to any person or entity with respect to any loss or incidental or
consequential damages caused, or alleged to have been caused, directly or indirectly, by the
information or programs contained herein. No warranty may be created or extended by sales
representatives or written sales materials.
Supplementary Materials: Textbook
ISBN: 978-0-9885234-1-8 (520 pages)
Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors
Handbook on the Development and Optimization of Parallel Applications for Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors
Research and Consulting
http://research.colfaxinternational.com/
http://nlreg.colfax-intl.com/
Additional Reading
Learn more about this book: lotsofcores.com

"It all comes down to PARALLEL PROGRAMMING!" (applicable to both processors and Intel® Xeon Phi™ coprocessors)

Foreword, Preface
Chapters:
1. Introduction
2. High Performance Closed Track Test Drive!
3. A Friendly Country Road Race
4. Driving Around Town: Optimizing A Real-World Code Example
5. Lots of Data (Vectors)
6. Lots of Tasks (not Threads)
7. Offload
8. Coprocessor Architecture
9. Coprocessor System Software
10. Linux on the Coprocessor
11. Math Library
12. MPI

"This book belongs on the bookshelf of every HPC professional. Not only does it successfully and accessibly teach us how to use and obtain high performance on the Intel MIC architecture, it is about much more than that. It takes us back to the universal fundamentals of high-performance computing including how to think and reason about the performance of algorithms mapped to modern architectures, and it puts into your hands powerful tools that will be useful for years to come."
—Robert J. Harrison, Institute for Advanced Computational Science

© 2013, James Reinders & Jim Jeffers, book image used with permission
List of Topics
1. Introduction
   • Intel Xeon Phi Architecture from the Programmer’s Perspective
   • Software Tools for Intel Xeon Phi Coprocessors
   • Will Application X benefit from the MIC architecture?
2. Programming Models for Intel Xeon Phi Applications
   • Native Applications for Coprocessors and MPI
   • Offload Programming Models
   • Using Multiple Coprocessors
   • MPI Applications and Heterogeneous Clustering
3. Porting Applications to the MIC Architecture
   • Future-Proofing: Reliance on Compiler and Libraries
   • Choosing the Programming Model
   • Cross-Compilation of User Applications
   • Performance Expectations
4. Parallel Scalability on Intel Architectures
   • Vectorization (Single Instruction Multiple Data, SIMD, Parallelism)
   • Multi-threading: OpenMP, Intel Cilk Plus
   • Task Parallelism in Distributed Memory, MPI
5. Optimization for the Intel Xeon Product Family
   • Optimization Checklist
   • Finding Bottlenecks with Intel VTune Amplifier
   • MPI Diagnostics Using Intel Trace Analyzer and Collector
   • Intel Math Kernel Library (MKL)
   • Scalar Optimization Considerations
   • Automatic Vectorization and Data Structures
   • Optimization of Thread Parallelism
6. Advanced Optimization for the MIC Architecture
   • Memory Access and Cache Utilization
   • Data Persistence and PCIe Traffic
   • MPI Applications on Clusters with Coprocessors
7. Conclusion
   • Course Recap
   • Additional Resources: Reading, Guides, Support
§1. Introduction to the Intel Many
Integrated Core (MIC) Architecture
MIC Architecture from the Programmer’s Perspective
Intel Xeon Phi Coprocessors and the MIC Architecture
Xeon Family Product Performance
Many-core coprocessors (Xeon Phi) vs. multi-core processors (Xeon): better performance per system and per watt for parallel applications, with the same programming methods and the same development tools. (Source: "Intel Xeon Product Family: Performance Brief")
Intel Xeon Processors and the MIC Architecture
Paper: research.colfaxinternational.com/post/2013/01/07/Nbody-Xeon-Phi.aspx
Demo: http://www.youtube.com/watch?v=KxaSEcmkGTo
Microarchitecture
[Diagram: the Core Ring Interconnect (CRI) connects the cores over data, address, and coherence rings; each core has a private L2 cache with a tag directory (TD). GBOX memory controllers attach the GDDR5 memory, and the SBOX provides the PCIe v2.0 controller and the DMA engines.]
Cache Structure
The caches are 8-way set associative and fully coherent, with the LRU (Least Recently Used) replacement policy.

Cache line size                 64 B
L1 size                         32 KB data, 32 KB code
L1 latency                      1 cycle
L2 size                         512 KB
L2 ways                         8
L2 latency                      11 cycles
Memory → L2 prefetching         hardware and software
L2 → L1 prefetching             software only
TLB coverage options (L1, data) 64 pages of size 4 KB (256 KB coverage), or
                                8 pages of size 2 MB (16 MB coverage)
Features of the IMCI Instruction Set
Intel IMCI is the instruction set supported by Intel Xeon Phi coprocessors.

512-bit wide registers:
• can pack up to eight 64-bit elements (long int, double)
• or up to sixteen 32-bit elements (int, float)

Arithmetic instructions:
• addition, subtraction and multiplication
• Fused Multiply-Add instruction (FMA)
• division and reciprocal calculation
• error function, inverse error function
• exponential functions (natural, base 2 and base 10) and the power function
• logarithms (natural, base 2 and base 10)
• square root, inverse square root, hypotenuse value and cube root
• trigonometric functions (sin, cos, tan, sinh, cosh, tanh, asin, acos, ...)
• rounding functions
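For illustration, a minimal sketch (hypothetical function and array names, not from the slides) of the FMA instruction accessed through the 512-bit intrinsics, compiled with icc -mmic:

#include <immintrin.h>

// a[i] += b[i]*c[i], 16 single-precision elements per iteration.
// Assumes n is a multiple of 16 and the arrays are 64-byte aligned.
void fma_sketch(float* a, const float* b, const float* c, int n) {
  for (int i = 0; i < n; i += 16) {
    __m512 va = _mm512_load_ps(a + i);
    __m512 vb = _mm512_load_ps(b + i);
    __m512 vc = _mm512_load_ps(c + i);
    va = _mm512_fmadd_ps(vb, vc, va); // fused multiply-add: vb*vc + va
    _mm512_store_ps(a + i, va);
  }
}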
Interactions between Operating Systems
[Diagram: the Linux host runs host-side offload applications and opens virtual terminal sessions into the Intel® Xeon Phi™ coprocessor; the coprocessor runs target-side "native" applications and the target side of offload applications.]
Linux µOS on Intel Xeon Phi coprocessors (part of MPSS)
user@host% lspci | grep -i "co-processor"
06:00.0 Co-processor: Intel Corporation Device 2250 (rev 11)
82:00.0 Co-processor: Intel Corporation Device 2250 (rev 11)
user@host% sudo service mpss status
mpss is running
user@host% cat /etc/hosts | grep mic
172.31.1.1 host-mic0 mic0
172.31.2.1 host-mic1 mic1
user@host% ssh mic0
user@mic0% cat /proc/cpuinfo | grep proc | tail -n 3
processor : 237
processor : 238
processor : 239
user@mic0% ls /
amplxe dev home lib64 oldroot proc sbin sys usr
bin etc lib linuxrc opt root sep3.10 tmp var
Software Tools for Intel Xeon Phi Coprocessors
Execute MIC Applications (all free):
MPSS Tools and Utilities
micinfo    a system information query tool
micsmc     a utility for monitoring and modifying physical parameters:
           temperature, power modes, core utilization, etc.
micctrl    a comprehensive configuration tool for the Intel Xeon Phi
           coprocessor operating system
miccheck   a set of diagnostic tests for the verification of the Intel Xeon
           Phi coprocessor configuration
micrasd    a host daemon that logs hardware errors reported by Intel
           Xeon Phi coprocessors
micflash   an Intel Xeon Phi flash memory agent
Build Xeon Phi & Xeon CPU Applications (all licensed):
Will Application X Benefit from the MIC architecture?
Three Layers of Parallelism
[Diagram, built up over three slides: a vector unit applies one SIMD instruction to multiple data elements (data parallelism); several processing units (PUs) share an instruction pool and a data pool (thread parallelism); MPI connects multiple such nodes (process parallelism).]
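To make the three layers concrete, here is a minimal hybrid sketch (hypothetical file; compile with mpiicpc -openmp) that combines all three: MPI processes across nodes, OpenMP threads within a process, and SIMD within each thread:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);               // layer 3: distributed-memory processes
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int n = 1 << 20;
  float* x = new float[n];
  #pragma omp parallel for              // layer 2: threads in shared memory
  for (int i = 0; i < n; i++)           // layer 1: the compiler vectorizes this
    x[i] = 0.5f*i;                      //          unit-stride loop (SIMD)

  printf("Rank %d done\n", rank);
  delete[] x;
  MPI_Finalize();
}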
Compute-Bound Application Performance
[Plot: relative performance of an Intel Xeon processor and an Intel Xeon Phi coprocessor as a function of the application's degree of parallelism (1 to 10k); the more parallel the workload, the more the balance shifts toward the coprocessor.]
One Size Does Not Fit All
Xeon + Xeon Phi Coprocessors = Xeon Family
§2. Programming Models for Intel Xeon
Phi Applications
Native Execution
“Hello World” application:
#include <stdio.h>
#include <unistd.h>
int main(){
  printf("Hello world! I have %ld logical cores.\n",
         sysconf(_SC_NPROCESSORS_ONLN));
}
Native Execution
Compile and run the same code on the coprocessor in the native mode:
user@host% icc hello.c -mmic
user@host% scp a.out mic0:~/
a.out 100% 10KB 10.4KB/s 00:00
user@host% ssh mic0
user@mic0% pwd
/home/user
user@mic0% ls
a.out
user@mic0% ./a.out
Hello world! I have 240 logical cores.
user@mic0%
Running MPI Applications on Host
Running Native MPI Applications on Coprocessors
user@host% source /opt/intel/impi/4.1.0/intel64/bin/mpivars.sh
user@host% export I_MPI_MIC=1
user@host% export I_MPI_FABRICS=shm:tcp
user@host% mpiicpc -mmic -o HelloMPI.MIC HelloMPI.c
user@host% scp HelloMPI.MIC mic0:~/
user@host% mpirun -host mic0 -np 2 ~/HelloMPI.MIC
Hello World from rank 1 running on host-mic0!
Hello World from rank 0 running on host-mic0!
MPI World size = 2 processes
Explicit Offload: Pragma-based approach
Compiling and Running an Offload Application
user@host% icpc hello_offload.cpp -o hello_offload
user@host% ./hello_offload
Hello World from host!
Bye
Hello World from coprocessor!
To mark a long block of code with the offload attribute, use #pragma offload_attribute(push/pop). A sketch completing the slide's truncated listing (the push line is implied by the pop):

#pragma offload_attribute(push, target(mic))
void MyFunctionOne() {
  // ... implement function as usual
}
void MyFunctionTwo() {
  // ... implement function as usual
}
#pragma offload_attribute(pop)
Offloading Data: Local Scalars and Arrays
void MyFunction() {
  const int N = 1000;
  int data[N];
  // Local scalars and arrays of a known size are transferred automatically
  #pragma offload target(mic)
  {
    for (int i = 0; i < N; i++)
      data[i] = 0;
  }
}

Global and static variables must be marked with the offload attribute; #pragma offload_attribute(push/pop) may be used as well:

// (the first lines of the slide's listing are lost; a global variable
//  marked for offload would look like this)
int __attribute__((target(mic))) g;

void MyFunction() {
  static int __attribute__((target(mic))) N;
  // ...
}

int main() {
  // ...
}
Data Marshalling for Dynamically Allocated Data
double *p1 = (double*)malloc(sizeof(double)*N);
double *p2 = (double*)malloc(sizeof(double)*N);
// (the slide's listing is truncated here; dynamically allocated data must be
//  marshalled explicitly with length() clauses, e.g.:)
#pragma offload target(mic) in(p1 : length(N)) inout(p2 : length(N))
{
  // ... use p1[] and p2[] on the coprocessor ...
}
Memory retention and data persistence on coprocessor
#pragma offload target(mic) in(p : length(N) alloc_if(1) free_if(0))
{ /* allocate memory for array p on the coprocessor, do not deallocate */ }
// (truncated in the source; later offloads would reuse and eventually free
//  the buffer, e.g.:)
#pragma offload target(mic) in(p : length(N) alloc_if(0) free_if(0))
{ /* reuse the memory allocated above */ }
#pragma offload target(mic) nocopy(p : length(0) alloc_if(0) free_if(1))
{ /* deallocate the memory on the coprocessor */ }
Precautions with persistent data
• Use an explicit zero-based coprocessor number (e.g., mic:0 as shown below). With multiple coprocessors, if the target number is unspecified, any coprocessor may be used, which will result in runtime errors if the persistent data cannot be found.

#pragma offload target(mic:0) in(p : length(N) alloc_if(1) free_if(0))
{ /* allocate memory for array p on coprocessor, do not deallocate */ }

• Do not change the value of the host pointer to a persistent array: the runtime system finds the data on the coprocessor using the host pointer value, not the variable name.
Virtual-Shared Memory Model

_Cilk_shared int arr[N]; // This is a virtual-shared array

// ... (a _Cilk_shared function Compute() operating on arr[] is elided
//      in the source) ...

int main() {
  // arr[] can be initialized on the host,
  _Cilk_offload Compute(); // used on the coprocessor,
  // and the values are returned to the host
}

Working with pointer-based data is illustrated below:

_Cilk_shared int* data; // pointer to virtual-shared memory

int main() {
  data = (_Cilk_shared int*)_Offload_shared_malloc(N*sizeof(int));
  // ... use data[] on the host and on the coprocessor ...
  _Offload_shared_free(data);
}
Target-Specific Code
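The slide's example is not reproduced here. Target-specific code is typically guarded with the __MIC__ preprocessor macro, which the Intel compiler defines when compiling for the coprocessor; a minimal sketch:

#ifdef __MIC__
  // compiled only for the Intel Xeon Phi coprocessor
  printf("Running on the coprocessor\n");
#else
  // compiled only for the host processor
  printf("Running on the host\n");
#endif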
Using Multiple Coprocessors
Multiple Coprocessors with Explicit Offload
Multiple Blocking Offloads Using Host Threads (Explicit Offload)

const int nDevices = _Offload_number_of_devices();
#pragma omp parallel for
for (int i = 0; i < nDevices; i++) {
  #pragma offload target(mic: i)
  {
    MyFunction(/*...*/);
  }
}
Blocking Explicit Offloads Using Threads: Dynamic Work Distribution Across Coprocessors

const int nDevices = _Offload_number_of_devices();
omp_set_num_threads(nDevices);
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < nWorkItems; i++) {
  const int iDevice = omp_get_thread_num();
  #pragma offload target(mic: iDevice)
  {
    MyFunction(i);
  }
}
Multiple Asynchronous Explicit Offloads From a Single Thread

const int nDevices = _Offload_number_of_devices();
char sig[nDevices];
for (int i = 0; i < nDevices; i++) {
  #pragma offload target(mic: i) signal(&sig[i])
  {
    MyFunction(/*...*/);
  }
}
for (int i = 0; i < nDevices; i++) {
  #pragma offload_wait target(mic: i) wait(&sig[i])
}
Heterogeneous MPI Applications: Host + Coprocessors
Heterogeneous Distributed Computing with Xeon Phi
RAM Filesystem
[Diagram: file I/O against the coprocessor's RAM filesystem; the outside world is reached through the (virtual) NIC.]
Virtio Transfer to Local Host Drives
Network Storage
• Files are stored on a remote file server
• A mount point can be shared across the cluster
• Lustre has scalable performance
• NFS is slow but easy to set up
[Diagram: a native MPI process on the Xeon Phi (running its RAM-FS-based uOS) accesses /mnt/dir over the virtual network; I/O to NFS travels through the host OS and NIC, while I/O to Lustre goes through the InfiniBand HCA on the PCIe bus.]
Review: Programming Models
1. Native coprocessor applications
   • Compile with -mmic
   • Run with micnativeloadex or scp + ssh
   • The way to go for MPI applications without offload
2. Explicit offload
   • Functions and global variables require __attribute__((target(mic)))
   • Initiate offload and data marshalling with #pragma offload
   • Only bitwise-copyable data can be shared
3. Clusters and multiple coprocessors
   • #pragma offload target(mic:i)
   • Use threads to offload to multiple coprocessors
   • Run native MPI applications
§3. Porting Applications to the MIC
Architecture
Choosing the Programming Model
To Offload or Not To Offload
For a "MIC-friendly" application,

Use offload if:
• The per-rank data set does not fit in the Xeon Phi onboard memory
• The CPU is needed: serial workload, intensive file I/O
• The MPI workload is bandwidth-bound or latency-bound
• Some of the dependencies cannot be compiled for MIC

Use native/symmetric MPI if:
• Parallel work-items are too small, so data transfer overhead is significant
• Peer-to-peer communication between workers is required
• It is difficult to instrument data movement or sharing with the coprocessor
PCIe Bandwidth Considerations
Cross-Compilation of User Applications
Simple Applications, Native Execution
Native Applications with Autotools
Static Libraries with Offload
In order to compile the *MIC.o files into a static library with offload, use
xiar -qoffload-build instead of ar.
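A sketch of the workflow (hypothetical file names), mirroring the usual ar invocation:

user@host% icpc -c myfunc.cpp -o myfunc.o
user@host% xiar -qoffload-build rcs libmyfunc.a myfunc.o
user@host% icpc main.cpp -L. -lmyfunc -o myapp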
Performance Expectations
Performance on MIC is a Function of Optimization Level
[Chart: performance relative to baseline (log scale, up to ~10^3) over optimization steps 0 through 8: unoptimized with offload; thread parallelism (fit all threads in memory); scalar optimizations (precomputation, precision control); vectorization (alignment, padding, hints); heterogeneous execution (using host + two coprocessors). Curves for GCC on CPUs, Intel C++ on CPUs, and Intel C++ on Xeon Phi. Baseline: unoptimized, compiled with GCC, running on the host CPUs (59 ms per spectrum).]
• Performance will be disappointing if the code is not optimized for multi-core CPUs
• Optimized code runs better both on the MIC platform and on the multi-core CPU
• Single code for two platforms + ease of porting = incremental optimization
Case study: http://research.colfaxinternational.com/post/2013/11/25/sc13-talk.aspx
Caution on Comparative Benchmarks
In most of our benchmarks, "Xeon Phi" = 5110P SKU (60 cores, TDP 225 W, $2.7k) and "CPU" = dual Xeon E5-2680 (16 cores, TDP 260 W, $3.4k + system cost).
Why dual CPU vs. a single coprocessor? Approximately the same Thermal Design Power (TDP) and cost.
Case study: http://research.colfaxinternational.com/post/2013/11/25/sc13-talk.aspx
Future-Proofing: Reliance on Compiler and Libraries
[Chart: threading and vectorization options arranged by depth, from ease of use to fine control. Threading options: Intel® Cilk™ Plus, OpenMP*, OpenCL*. Vector options: auto-vectorization, then semi-auto vectorization with #pragma (vector, simd).]
Next Generation MIC: Knights Landing (KNL)
• 2nd-generation MIC product: code name Knights Landing (KNL)
• Intel's 14 nm manufacturing process
• A processor (running the OS) or a coprocessor (PCIe device)
• On-package high-bandwidth memory with flexible memory models: flat, cache, and hybrid
• Intel Advanced Vector Extensions AVX-512 (public)
(Source: Intel Newsroom)
Getting Ready for the Future
Intel® Xeon Phi™ Product Family Roadmap
The Faster Path to Discovery

• Available today: Knights Corner, the Intel® Xeon Phi™ x100 product family; 22 nm process; coprocessor; over 1 TF DP peak; up to 61 cores; up to 16 GB GDDR5.
• 2H'15*: Knights Landing, the Intel® Xeon Phi™ x200 product family; 2nd generation; 14 nm process; server processor & coprocessor; over 3 TF DP peak [1]; 60+ cores; and new details today…
• TBA: Knights Landing with Fabric.
• Future: 3rd generation (in planning).

[Diagram: moving between generations: recompile; tune (parallelization, threading, vectorization, cache-blocking, MPI+OpenMP hybridization & more); exploit NEW features and structures; common tools across the family: MKL, MPI, TBB.]

[1] FLOPS = cores × clock frequency × floating-point operations per cycle.
[2] Projected peak theoretical single-thread performance relative to 1st Generation Intel® Xeon Phi™ Coprocessor 7120P (formerly codenamed Knights Corner).
[3] Projected result based on internal Intel analysis of the STREAM benchmark using a Knights Landing processor with 16GB of ultra high-bandwidth memory versus DDR4.

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
A Paradigm Shift for Highly-Parallel Computing
Server processor with leadership integration: the keys to the future of Knights Landing.

• Memory bandwidth: over 5× STREAM vs. DDR4 [1]
• Memory capacity: comparable to Intel® Xeon® processors [2]
• Resiliency: Intel server-class reliability
• Power efficiency: >25% better than a discrete card [3]
• I/O: highest bandwidth [4]
• Cost: less costly than discrete parts [5]

[2] Compared to 1st Generation Intel® Xeon Phi™ 7120P Coprocessor (formerly codenamed Knights Corner).
[3] Projected result based on internal Intel analysis using estimated performance and power consumption of a rack sized deployment of Intel® Xeon® processors and Knights Landing coprocessors as compared to a rack with KNL processors only.
[4] Projected result based on internal Intel analysis comparing a discrete Knights Landing processor with integrated fabric to a discrete Intel fabric component card.
[5] Projected result based on internal Intel analysis using estimated component pricing in the 2015 timeframe.
[6] Theoretical density for air-cooled system; other cooling solutions and configurations may enable both lower or higher densities.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
§4. Parallel Scalability on Intel
Architectures
Vectorization (Single Instruction Multiple Data, SIMD,
Parallelism)
SIMD Operations
SIMD — Single Instruction Multiple Data

Scalar loop:

for (i = 0; i < n; i++)
  A[i] = A[i] + B[i];

SIMD loop (pseudocode, 4-wide):

for (i = 0; i < n; i += 4)
  A[i:(i+4)] = A[i:(i+4)] + B[i:(i+4)];

[Diagram: a vector unit applies one instruction to multiple elements of the data pool at once.]
Instruction Sets in Intel Architectures
Explicit Vectorization: Compiler Intrinsics

SSE2 intrinsics (128-bit registers, 4 floats):

for (int i = 0; i < n; i += 4) {
  __m128 Avec = _mm_load_ps(A+i);
  __m128 Bvec = _mm_load_ps(B+i);
  Avec = _mm_add_ps(Avec, Bvec);
  _mm_store_ps(A+i, Avec);
}

IMCI intrinsics (512-bit registers, 16 floats):

for (int i = 0; i < n; i += 16) {
  __m512 Avec = _mm512_load_ps(A+i);
  __m512 Bvec = _mm512_load_ps(B+i);
  Avec = _mm512_add_ps(Avec, Bvec);
  _mm512_store_ps(A+i, Avec);
}
Automatic Vectorization of Loops on MIC architecture
Compilation and runtime output of the code for Intel Xeon Phi execution
Automatic Vectorization of Loops
Limitations:
• Only for-loops can be auto-vectorized; the number of iterations must be known at compile time, or at runtime before the loop starts
• Memory access in the loop must have a regular pattern, ideally with unit stride

Non-standard loops that cannot be automatically vectorized:
• loops with an irregular memory access pattern
• calculations with vector dependence
• while-loops, or for-loops with an undetermined number of iterations
• outer loops (unless #pragma simd overrides this restriction)
• loops with complex branches (i.e., if-conditions)
• anything else that cannot be, or is very difficult to, vectorize
A vectorizable and a non-vectorizable loop are contrasted in the sketch below.
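A minimal illustration of the contrast (hypothetical arrays):

// Auto-vectorizable: a for-loop with a known trip count and unit-stride access
for (int i = 0; i < n; i++)
  a[i] = b[i] + c[i];

// Not auto-vectorizable: the number of iterations depends on the data
int i = 0;
while (i < n && a[i] > 0.0f)
  i++;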
Multi-Threading: OpenMP
Parallelism in Shared Memory: OpenMP and Intel Cilk Plus
Program Structure in OpenMP

main() {                      // Begin serial execution.
  ...                         // Only the initial thread executes.
  #pragma omp parallel        // Begin a parallel construct and form a team.
  {
    #pragma omp sections      // Begin a work-sharing construct.
    {
      #pragma omp section     // One unit of work.
      {...}
      #pragma omp section     // Another unit of work.
      {...}
    }                         // Wait until both units of work complete.
    ...                       // This code is executed by each team member.
    #pragma omp for           // Begin a work-sharing construct.
    for(...)
    {                         // Each iteration chunk is a unit of work.
      ...                     // Work is distributed among the team members.
    }                         // End of work-sharing construct.
    #pragma omp critical      // Begin a critical section.
    {...}                     // Only one thread executes at a time.
    #pragma omp task          // Execute in another thread without blocking.
    {...}
    ...                       // This code is executed by each team member.
    #pragma omp barrier       // Wait for all team members to arrive.
    ...                       // This code is executed by each team member.
  }                           // End of parallel construct:
                              // disband the team, continue serial execution.
  ...                         // Possibly more parallel constructs.
}                             // End serial execution.

Notes:
1. Code outside #pragma omp parallel is serial, i.e., executed by only one thread.
2. Code directly inside #pragma omp parallel is executed by each thread.
3. Code inside the work-sharing construct #pragma omp for is distributed across the threads in the team.
“Hello World” OpenMP Programs

#include <omp.h>
#include <stdio.h>

int main(){
  const int nt = omp_get_max_threads();
  printf("OpenMP with %d threads\n", nt);

  // (truncated in the source; the runtime output on the next slide implies:)
  #pragma omp parallel
  {
    printf("Hello World from thread %d\n", omp_get_thread_num());
  }
}
user@host% export OMP_NUM_THREADS=5
user@host% icpc -openmp hello_omp.cc
user@host% ./a.out
OpenMP with 5 threads
Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
Hello World from thread 2
Hello World from thread 4
user@host% icpc -openmp-stubs hello_omp.cc
hello_omp.cc(8): warning #161: unrecognized #pragma
#pragma omp parallel
^
user@host% ./a.out
OpenMP with 1 threads
Hello World from thread 0
Loop-Centric Parallelism: For-Loops in OpenMP
• Simultaneously launch multiple threads
• The scheduler assigns loop iterations to threads
• Each thread processes one iteration at a time
[Figure: parallelizing a for-loop; the program flow forks into threads that pick up loop iterations.]
Loop-Centric Parallelism: For-Loops in OpenMP
The OpenMP library will distribute the iterations of the loop following the
#pragma omp parallel for across threads.
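A minimal sketch:

#include <omp.h>
#include <stdio.h>

int main() {
  const int n = 1000;
  #pragma omp parallel for // the iterations are divided among the team
  for (int i = 0; i < n; i++)
    printf("Iteration %d processed by thread %d\n", i, omp_get_thread_num());
}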
Fork-Join Model of Parallel Execution
Synchronization: Avoiding Unpredictable Program Behavior
#include <omp.h>
#include <stdio.h>
int main() {
  const int n = 1000;
  int total = 0;
  #pragma omp parallel for
  for (int i = 0; i < n; i++) {
    #pragma omp critical
    { // Only one thread at a time can execute this section
      total = total + i;
    }
  }
} // (closing brace added; the slide's listing is truncated here)
Synchronization: Avoiding Unpredictable Program Behavior
#pragma omp atomic protects lightweight operations of the following forms:

Read: operations in the form v = x
Write: operations in the form x = v
Update: operations in the form x++, x--, --x, ++x, x binop= expr and x = x binop expr
Capture: operations in the form v = x++, v = x--, v = --x, v = ++x, v = x binop expr

Restrictions:
• no non-scalar types
• no complex expressions
Reduction: Avoiding Synchronization
#include <omp.h>
#include <stdio.h>

int main() {
  const int n = 1000;
  int sum = 0;
  #pragma omp parallel for reduction(+: sum)
  for (int i = 0; i < n; i++) {
    sum = sum + i;
  }
  printf("sum=%d (must be %d)\n", sum, ((n-1)*n)/2);
}
Implementation of Reduction using Private Variables
#include <omp.h>
#include <stdio.h>

int main() {
  const int n = 1000;
  int sum = 0;
  #pragma omp parallel
  {
    int sum_th = 0;
    #pragma omp for
    for (int i = 0; i < n; i++)
      sum_th = sum_th + i;
    #pragma omp atomic
    sum += sum_th;
  }
  printf("sum=%d (must be %d)\n", sum, ((n-1)*n)/2);
}
Task Parallelism in Distributed Memory, MPI
Compiling and Running MPI applications
1. Compile and link with the MPI wrapper of the compiler:
   • mpiicc for C
   • mpiicpc for C++
   • mpiifort for Fortran 77 and Fortran 95
2. Set up the MPI environment variables and I_MPI_MIC=1
3. NFS-share or copy the MPI library and the application executable to the coprocessors
4. Launch with the tool mpirun (example below):
   • colon-separated list of executables and hosts (argument -host hostname)
   • alternatively, use a machine file to list the hosts
   • coprocessors have hostnames defined in /etc/hosts
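For example, a symmetric launch on the host and one coprocessor in a single mpirun invocation might look like this (hypothetical executable names, following the HelloMPI example above):

user@host% mpirun -host localhost -np 4 ./HelloMPI : -host mic0 -np 8 ~/HelloMPI.MIC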
Peer-to-Peer Communication between Coprocessors
[Diagram: two systems, each with CPUs, system memory, a chipset, and a MIC coprocessor (with its own memory) on the PCIe bus. The coprocessors communicate either over Ethernet through the virtualized network interface mic0, bridged on br0 to the host NIC, or over RDMA through an InfiniBand HCA.]
Point-to-Point Communication

if (rank == receiver) {
  char incomingMsg[messageLength];
  MPI_Recv(&incomingMsg, messageLength, MPI_CHAR, sender,
           tag, MPI_COMM_WORLD, &stat);
  printf("Received message with tag %d: '%s'\n", tag, incomingMsg);
} else if (rank == sender) { // (lines elided in the source)
  char outgoingMsg[messageLength];
  strcpy(outgoingMsg, "/Jenny");
  MPI_Send(&outgoingMsg, messageLength, MPI_CHAR, receiver, tag, MPI_COMM_WORLD);
}
Collective Communication: Broadcast

int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype,
               int root, MPI_Comm comm );

[Diagram: the sender (root) broadcasts its data to all ranks.]
Collective Communication: Scatter

int MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf,
                int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm);

[Diagram: the sender (root) distributes consecutive chunks of its data, one to each rank.]
Collective Communication: Gather

int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
               void *recvbuf, int recvcnt, MPI_Datatype recvtype,
               int root, MPI_Comm comm);

[Diagram: the receiver (root) collects one chunk of data from every rank.]
Collective Communication: Reduction

int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
               MPI_Op op, int root, MPI_Comm comm);

[Diagram: the values 1, 3, 5 and 7 held by four ranks are reduced (here, summed) into 16 at the receiver (root).]
Expressing Parallelism
1. Data parallelism (vectorization)
   • Automatic vectorization by the compiler: portable and convenient
   • For-loops and array notation can be vectorized
   • Compiler hints (#pragma simd, #pragma ivdep, etc.) to assist the compiler
2. Shared-memory parallelism with OpenMP and Intel Cilk Plus
   • Parallel threads access common memory for reading and writing
   • Parallel loops: #pragma omp parallel for and _Cilk_for — automatic work distribution
   • In OpenMP: private and shared variables; synchronization, reduction
3. Distributed-memory parallelism with MPI
   • MPI processes do not share memory, but can send information to each other
   • All MPI processes execute the same code; a process's role is determined by its rank
   • Point-to-point and collective communication patterns
§5. Optimization for the Intel Xeon
Product Family
Optimization Roadmap
Performance Expectations
One Intel Xeon Phi coprocessor vs. two Intel Xeon Sandy Bridge CPUs:
1. Scalar optimization
2. Vectorization
3. Scaling above 100 threads
4. Arithmetically intensive or bandwidth-limited workloads
5. Efficient cooperation between the host and the coprocessor(s)
Finding Bottlenecks with Intel VTune Amplifier
Intel VTune Parallel Amplifier XE
A hardware event-based profiler for parallel applications on Xeon CPUs and Xeon Phi coprocessors.
Using VTune
Setting up a VTune project:
Using VTune
Locating hotspots down to a single line of code:
MPI Diagnostics Using Intel Trace Analyzer and Collector
Using Intel Trace Analyzer and Collector
Intel Math Kernel Library (MKL)
Using Intel MKL
Three modes of usage:

Automatic Offload
• No code change required to offload calculations to a Xeon Phi coprocessor
• Automatically uses both the CPU and the coprocessor
• The library takes care of data transfer and execution management

Compiler-Assisted Offload
• The programmer maintains explicit control of data transfer and remote execution
• Requires compiler offload pragmas and directives

Native Execution
• Uses an Intel Xeon Phi coprocessor as an independent compute node
• Data is initialized and processed on the coprocessor, or communicated via MPI
Using MKL in Automatic Offload Mode
Compiling and running the code. The calculation will be offloaded to a Xeon Phi coprocessor if one is available at runtime.

user@host% icpc mycode.cc -mkl -o mycode
user@host% export MKL_MIC_ENABLE=1
user@host% ./mycode
Using MKL in Compiler-Assisted Offload Mode
Calling an MKL function from an offloaded section:

#pragma offload target(mic) \
  in(transa, transb, N, alpha, beta) \
  in(A:length(matrix_elements)) \
  in(B:length(matrix_elements)) \
  out(C:length(matrix_elements) alloc_if(0))
{
  sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}

Compiling and running the code. If no coprocessor is present at runtime, MKL falls back to CPU calculation.

user@host% icpc mycode.cc -mkl -o mycode
user@host% ./mycode
Using MKL Native Execution Mode
With the standard C library RNG:

#include <stdlib.h>
#include <stdio.h>

int main() {
  const size_t N = 1<<29L;
  const size_t F = sizeof(float);
  float* A = (float*)malloc(N*F);
  srand(0); // Initialize RNG
  for (int i = 0; i < N; i++) {
    A[i] = (float)rand() / (float)RAND_MAX;
  }
  printf("Generated %ld random numbers\nA[0]=%e\n", N, A[0]);
  free(A);
}

With the MKL Vector Statistical Library RNG:

#include <stdlib.h>
#include <stdio.h>
#include <mkl_vsl.h>
int main() {
  const size_t N = 1<<29L;
  const size_t F = sizeof(float);
  float* A = (float*)malloc(N*F);
  VSLStreamStatePtr rnStream;
  vslNewStream( &rnStream, // Initialize RNG
                VSL_BRNG_MT19937, 1 );
  vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD,
               rnStream, N, A, 0.0f, 1.0f);
  printf("Generated %ld random numbers\nA[0]=%e\n", N, A[0]);
  free(A);
}
Using MKL in Native Execution Mode
user@host% icpc -mmic -o rand rand.cc
user@host% # Run on coprocessor and benchmark
user@host% time micnativeloadex rand
Generated 536870912 random numbers
A[0]=8.401877e-01

user@host% icpc -mkl -mmic -o rand-mkl rand-mkl.cc
user@host% export SINK_LD_LIBRARY_PATH=/opt/intel/composerxe/mkl/lib/mic:/opt/intel/composerxe/lib/mic
user@host% time micnativeloadex rand-mkl
Generated 536870912 random numbers
A[0]=1.343642e-01
Optimization Level

user@host% icc -o mycode -O3 source.c

#pragma intel optimization_level 3
void my_function() {
  //...
}

The default optimization level, -O2, is optimization for speed: automatic vectorization, inlining, constant propagation, dead-code elimination, loop unrolling.
Optimization level -O3 enables more aggressive optimization: loop fusion, block-unroll-and-jam, if-statement collapse.
Using the const Qualifier

Without const:

#include <stdio.h>
int main() {
  const int N = 1<<28;
  double w = 0.5;
  double T = (double)N;
  double s = 0.0;
  for (int i = 0; i < N; i++)
    s += w*(double)i/T;
  printf("%e\n", s);
}

With const:

#include <stdio.h>
int main() {
  const int N = 1<<28;
  const double w = 0.5;
  const double T = (double)N;
  double s = 0.0;
  for (int i = 0; i < N; i++)
    s += w*(double)i/T;
  printf("%e\n", s);
}
Strength Reduction
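The slide's example is not reproduced here. As a hypothetical illustration, strength reduction replaces an expensive operation with a cheaper equivalent, e.g. a division by a loop-invariant value with a multiplication by its precomputed reciprocal:

// Before: one division per iteration
for (int i = 0; i < n; i++)
  a[i] = b[i] / scale;

// After: the division is replaced with a cheaper multiplication
const float invScale = 1.0f / scale;
for (int i = 0; i < n; i++)
  a[i] = b[i] * invScale;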
Consistency of Precision: Constants
1. Operations on type float are faster than operations on type double. Avoid type conversions, and define single-precision literal constants with the suffix f:

const double twoPi = 6.283185307179586;
const float phase = 0.3f; // single precision

2. Use 32-bit int values instead of 64-bit long where possible, including for array indices. Avoid type conversions, and define 64-bit literal constants with the suffix L or UL:

const long N2 = 1000000*1000000;   // Overflow error
const long N3 = 1000000L*1000000L; // Correct
Consistency of Precision: Functions
1. math.h contains fast single-precision versions of the arithmetic functions, ending with the suffix f:

double sin(double x);
float sinf(float x);

2. math.h contains fast base-2 exponential and logarithmic functions:

double exp(double x);  // Double precision, natural base
float expf(float x);   // Single precision, natural base
double exp2(double x); // Double precision, base 2
float exp2f(float x);  // Single precision, base 2
Floating-Point Semantics
The Intel C++ Compiler may represent floating-point expressions in executable code differently, depending on the floating-point semantics:

-fp-model strict       Only value-safe optimizations
-fp-model precise      Calculations are reproducible from run to run;
                       exceptions controlled using -fp-model except
-fp-model fast=1       (default) Value-unsafe optimizations are allowed
-fp-model fast=2       Better performance at the cost of lower accuracy
-fp-model source       Intermediate arithmetic results are rounded to the
                       precision defined in the source code
-fp-model double       Intermediate arithmetic results are rounded to
                       53-bit (double) precision
-fp-model extended     Intermediate arithmetic results are rounded to
                       64-bit (extended) precision
-fp-model [no-]except  Controls floating-point exception semantics
Precision Control for Transcendental Functions
-fimf-precision=value[:funclist]     Defines the precision for math functions;
                                     value is one of high, medium or low
-fimf-max-error=ulps[:funclist]      The maximum allowable error expressed
                                     in ulps (units in the last place)
-fimf-accuracy-bits=n[:funclist]     The number of correct bits required for
                                     mathematical function accuracy
-fimf-domain-exclusion=n[:funclist]  Defines a list of special-value classes
                                     that do not need to be handled; int n is
                                     derived by the bitwise OR of the types:
                                     extremes: 1, NaNs: 2, infinities: 4,
                                     denormals[1]: 8, zeroes: 16

[1] By default, on Intel Xeon Phi, denormals are flushed to zero in hardware, but supported in SVML.
Precision Control for Transcendental Functions
#include <stdio.h>
#include <math.h>

int main() {
  const int N = 1000000;
  const int P = 10;
  double A[N];
  const double startValue = 1.0;
  A[:] = startValue; // Intel Cilk Plus array notation
  for (int i = 0; i < P; i++)
#pragma simd
    for (int r = 0; r < N; r++)
      A[r] = exp(-A[r]);

  printf("Result=%.17e\n", A[0]);
}
Automatic Vectorization: Making it Happen and Tuning
Challenges with Optimizing Vectorization on Xeon Phi
• Must utilize the 512-bit vector registers (16 float or 8 double)
• Must convince the compiler that vectorization is possible
• Preferably unit-stride access to data
• Preferably align data on a 64-byte boundary
• Avoid branches in vector loops
• Guide the compiler regarding the expected iteration count, memory alignment, outer loop vectorization, etc.

This section: ensuring that automatic vectorization succeeds where it is essential.
Diagnosing the Utilization of Vector Instructions
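The screenshots are not reproduced here. One way to diagnose vectorization with this compiler generation is the -vec-report flag; a sketch (hypothetical file name and line numbers, output abridged):

user@host% icpc -vec-report2 -c mycode.cc
mycode.cc(8): (col. 3) remark: LOOP WAS VECTORIZED.
mycode.cc(15): (col. 3) remark: loop was not vectorized: existence of vector dependence.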
Assumed Vector Dependence. The restrict Keyword.
True vector dependence makes vectorization impossible:

float *a, *b; // ...
for (int i = 1; i < n; i++)
  a[i] += b[i]*a[i-1]; // dependence on the previous element
Ignoring Assumed Vector Dependence
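The slide's example is not reproduced here; a minimal sketch of the idea (hypothetical function): when the programmer knows that an assumed dependence is false, e.g. that the pointers never alias, #pragma ivdep tells the compiler to ignore it:

void add(float* a, float* b, const int n) {
  #pragma ivdep // the programmer vouches: no vector dependence here
  for (int i = 0; i < n; i++)
    a[i] += b[i];
}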
Pointer Disambiguation (alternative to #pragma ivdep)
• The restrict keyword applies to each pointer variable qualified with it
• It promises that the object accessed by the pointer is accessed only by that pointer in the given scope
• The compiler argument -restrict must be used (see the sketch below)
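A minimal sketch (hypothetical function), compiled with icpc -restrict:

// restrict promises that a and b never point to overlapping memory,
// so the compiler may vectorize without run-time aliasing checks.
void add(float* restrict a, const float* restrict b, const int n) {
  for (int i = 0; i < n; i++)
    a[i] += b[i];
}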
Automatic Vectorization: Data Structures
Challenges with Optimizing Vectorization on Xeon Phi
• Must utilize the 512-bit vector registers (16 float or 8 double)
• Must convince the compiler that vectorization is possible
• Preferably unit-stride access to data
• Preferably align data on a 64-byte boundary
• Avoid branches in vector loops
• Guide the compiler regarding the expected iteration count, memory alignment, outer loop vectorization, etc.
Example: Unit-Stride Access in Coulomb’s Law Application

\Phi(\vec{R}_j) = -\sum_{i=1}^{m} \frac{q_i}{|\vec{r}_i - \vec{R}_j|},   (1)

|\vec{r}_i - \vec{R}| = \sqrt{(r_{i,x} - R_x)^2 + (r_{i,y} - R_y)^2 + (r_{i,z} - R_z)^2}.   (2)

[Figure: a charge distribution of positive and negative charges in the unit box, and the resulting electric potential Φ(x, y, z=0) with values roughly between −0.4 and 0.4.]
Arrays of Structures versus Structures of Arrays
Array of Structures (AoS)
1 struct Charge { // Elegant, but ineffective data layout
2 float x, y, z, q; // Coordinates and value of this charge
3 };
4 // The following line declares a set of m point charges:
5 Charge chg[m];
MIC Developer Boot Camp Rev. 12 Automatic Vectorization: Data Structures © Colfax International, 2013–2014
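For contrast, a sketch (hypothetical type name) of the Structure of Arrays (SoA) layout, which provides the unit-stride access the vectorizer needs:

struct ChargeDistribution { // Less elegant, but effective data layout
  float *x, *y, *z, *q; // Arrays of coordinates and values of all charges
};

Iterating over x[i], x[i+1], ... now touches consecutive memory locations, enabling unit-stride vector loads.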
Electric Potential Calculation with Coulomb's Law
[Chart: run time in seconds (lower is better) on the host system and on the Intel Xeon Phi coprocessor across the optimization variants; bars at 0.90 s, 0.73 s, 0.51 s, 0.51 s, 0.37 s and 0.22 s.]
Automatic Vectorization: Data Alignment
Challenges with Optimizing Vectorization on Xeon Phi
• Must utilize the 512-bit vector registers (16 float or 8 double)
• Must convince the compiler that vectorization is possible
• Preferably unit-stride access to data
• Preferably align data on a 64-byte boundary
• Avoid branches in vector loops
• Guide the compiler regarding the expected iteration count, memory alignment, outer loop vectorization, etc.

This section: data alignment and compiler hints.
Data Alignment
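The slide's details are not reproduced here; a minimal sketch of the common 64-byte alignment idioms (with the Intel compiler, _mm_malloc is declared in malloc.h):

#include <malloc.h>

// Static or automatic arrays: the alignment attribute
float B[1024] __attribute__((aligned(64)));

void example(const int n) {
  // Heap: allocate on a 64-byte boundary
  float* A = (float*)_mm_malloc(n*sizeof(float), 64);
  // ... with aligned data, hints such as #pragma vector aligned become safe ...
  _mm_free(A);
}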
Vectorization Pragmas, Keywords and Compiler Arguments
#pragma simd
#pragma vector always
#pragma vector aligned | unaligned
#pragma vector nontemporal | temporal
#pragma novector
#pragma ivdep
restrict qualifier and -restrict command-line argument
#pragma loop count
__assume_aligned keyword
-vec-report[n]
-O[n]
-x[code]
Thread Parallelism: Reducing Synchronization
Challenges with Thread Parallelism on Xeon Phi
Example: Dealing with Excessive Synchronization
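The slide's baseline code is not reproduced here; a sketch of the pattern under discussion, using the names of the optimized version that follows. Synchronizing every increment serializes the threads:

// Baseline histogram (sketch): an atomic on every increment kills scaling
#pragma omp parallel for
for (int i = 0; i < n; i++) {
  const int j = (int)(age[i] / group_width); // bin index
  #pragma omp atomic
  hist[j]++;
}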
The Same Calculation, Strip-Mined, Vectorized
void Histogram(const float* age, int* const hist, const int n,
               const float group_width, const int m) {
  const int vecLen = 16; // Length of vectorized loop
  const float invGroupWidth = 1.0f/group_width; // Pre-compute the reciprocal
  // Strip-mine the loop in order to vectorize the inner short loop.
  // Note: this algorithm assumes n%vecLen == 0.
  for (int ii = 0; ii < n; ii += vecLen) {
    // Temporarily store vecLen indices
    int histIdx[vecLen] __attribute__((aligned(64)));
    // Vectorize the multiplication and rounding
    #pragma vector aligned
    for (int i = ii; i < ii + vecLen; i++)
      histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
    // Scattered memory access, does not get vectorized
    for (int c = 0; c < vecLen; c++)
      hist[histIdx[c]]++;
  }
}
Adding Thread Parallelism
Improving Thread Parallelism
#pragma omp parallel
{
  int hist_priv[m]; // Better idea: thread-private storage
  hist_priv[:] = 0;
  int histIdx[vecLen] __attribute__((aligned(64)));
  #pragma omp for schedule(guided)
  for (int ii = 0; ii < n; ii += vecLen) {
    #pragma vector aligned
    for (int i = ii; i < ii + vecLen; i++)
      histIdx[i-ii] = (int) ( age[i] * invGroupWidth );
    for (int c = 0; c < vecLen; c++)
      hist_priv[histIdx[c]]++;
  }
  // Reduce the thread-private histograms into the shared one
  for (int c = 0; c < m; c++) {
    #pragma omp atomic
    hist[c] += hist_priv[c];
  }
}
Dealing with Excessive Synchronization
[Chart: computing a histogram, elimination of synchronization. Run time in seconds (lower is better) on the host system and on the Intel Xeon Phi coprocessor for four variants: scalar serial code, vectorized serial code, vectorized parallel code (atomic operations), and vectorized parallel code (private variables). Times fall from 71.30 s at worst down to 0.12 s and 0.07 s for the parallel code with private variables; intermediate bars: 37.70 s, 24.00 s, 9.23 s, 5.06 s, 1.27 s.]
Thread Parallelism: False Sharing
False Sharing. Data Padding and Private Variables
[Diagram: CPU 0 and CPU 1 each modify different variables that fall into the same cache line in memory, causing false sharing.]
const int m = 5;
int hist_thr[nThreads][m]; // Per-thread histograms, adjacent in memory
#pragma omp parallel for
for (int ii = 0; ii < n; ii += vecLen) {
  // False sharing occurs here: neighboring threads' rows share cache lines
  for (int c = 0; c < vecLen; c++)
    hist_thr[iThread][histIdx[c]]++;
}
// Reducing results from all threads to the common histogram hist
for (int iThread = 0; iThread < nThreads; iThread++)
  hist[0:m] += hist_thr[iThread][0:m];
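The padded fix itself is not shown above. A minimal sketch, assuming a 64-byte cache line and a 4-byte int, is to round each thread's row up to a whole number of cache lines so that no two threads ever write to the same line:

// Round m up to a multiple of 16 ints (16 x 4 B = 64 B, one cache line)
const int mPadded = ((m + 15) / 16) * 16;
int hist_thr[nThreads][mPadded]; // rows now start on distinct cache lines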
Padding to Avoid False Sharing
Computing a histogram: elimination of false sharing
[Bar chart: run time in seconds (lower is better) on the host system and on an Intel Xeon Phi coprocessor for five versions: baseline parallel code (private variables), poor performance due to false sharing, and padding to 64, 128, and 256 bytes. The measured times are 1.600, 0.720, 0.369, 0.270, 0.116, 0.114, 0.068, 0.073, 0.067, and 0.067 s.]
Thread Parallelism: Expanding Iteration Space
Example: Dealing with Insufficient Parallelism
\( S_i = \sum_{j=0}^{n} M_{ij}, \qquad i = 0 \ldots m. \)  (3)
Dealing with Insufficient Parallelism
VTune Analysis: Row-Wise Reduction of a Short, Wide Matrix
Strip-Mining: Simultaneous Thread and Data Parallelism
// Compiler may be able to simultaneously parallelize and auto-vectorize it
#pragma omp parallel for
#pragma simd
for (int i = 0; i < n; i++) {
  // ... do work
}
Exposing Parallelism: Strip-Mining and Loop Collapse
void sum_stripmine(const int m, const int n, long* M, long* s){
  const int STRIP = 1024;
  assert(n % STRIP == 0);
  s[0:m] = 0;
#pragma omp parallel
  {
    long sum[m]; sum[0:m] = 0;
#pragma omp for collapse(2) schedule(guided)
    for (int i = 0; i < m; i++)
      for (int jj = 0; jj < n; jj += STRIP)
#pragma simd
#pragma vector aligned
        for (int j = jj; j < jj + STRIP; j++)
          sum[i] += M[i*n + j];
    for (int i = 0; i < m; i++) // Reduction
#pragma omp atomic
      s[i] += sum[i];
  }
}
Dealing with Insufficient Parallelism
Row-Wise Reduction of a Short, Wide Matrix
Parallel row-wise matrix reduction
[Bar chart: performance in GB/s (higher is better) on the host system and on an Intel Xeon Phi coprocessor for four versions: unoptimized, parallel inner loop, collapsed nested loops, and strip-mine and collapse. The measured values are 5.9, 6.5, 28.3, 38.6, 47.5, 53.7, 84.9, and 131.6 GB/s.]
Thread Parallelism: Affinity
Setting Thread Affinity
Uses of Thread Affinity
• Bandwidth-bound applications: 1 thread per core + prevent migration. Optimizes the utilization of memory controllers.
• Compute-bound applications: 2 (Xeon) or 4 (Xeon Phi) threads per core + prevent migration. Ensures that threads consistently access local L1 cache data (+L2 for Xeon Phi).
• Offload applications: physical core 0 on Xeon Phi is used by the µOS for offload tasks. Avoid placing compute threads on that core.
• Applications in multi-socket NUMA (Non-Uniform Memory Access) systems: partition the system for two independent tasks, pin tasks to their respective CPUs.
The KMP_AFFINITY Environment Variable
KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]
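Typical settings (illustrative shell lines):

user@host% export KMP_AFFINITY=scatter    # spread threads across cores (bandwidth-bound)
user@host% export KMP_AFFINITY=compact    # pack threads onto adjacent cores (compute-bound)
user@host% export KMP_AFFINITY=balanced   # Xeon Phi only: spread across cores, keep consecutive threads together
user@host% export KMP_AFFINITY=granularity=fine,compact   # also pin to individual hardware threads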
Bandwidth-bound, KMP_AFFINITY=scatter
user@host% export OMP_NUM_THREADS=32
user@host% export KMP_AFFINITY=none
user@host% for i in {1..4} ; do ./rowsum_stripmine | tail -1; done
Problem size: 2.980 GB, outer dimension: 4, threads: 32
Strip-mine and collapse: 0.061 +/- 0.002 seconds (52.89 +/- 1.31 GB/s)
Strip-mine and collapse: 0.059 +/- 0.002 seconds (54.11 +/- 1.56 GB/s)
Strip-mine and collapse: 0.077 +/- 0.001 seconds (41.71 +/- 0.69 GB/s)
Strip-mine and collapse: 0.070 +/- 0.005 seconds (45.59 +/- 3.14 GB/s)
user@host% export OMP_NUM_THREADS=16
user@host% export KMP_AFFINITY=scatter
user@host% for i in {1..4}; do ./rowsum_stripmine | tail -1 ; done
Problem size: 2.980 GB, outer dimension: 4, threads: 16
Strip-mine and collapse: 0.059 +/- 0.004 seconds (54.47 +/- 3.25 GB/s)
Strip-mine and collapse: 0.061 +/- 0.004 seconds (52.30 +/- 3.30 GB/s)
Strip-mine and collapse: 0.062 +/- 0.005 seconds (51.37 +/- 4.29 GB/s)
Strip-mine and collapse: 0.058 +/- 0.001 seconds (55.48 +/- 1.27 GB/s)
Compute-Bound, KMP_AFFINITY=compact/balanced
// Fragment of the benchmark listing (bench-dgemm.cc):
dgemm(&tr, &tr, &N, &N, &N, &v, A, &Nld, B, &Nld, &v, C, &N);
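The listing above is only a fragment. A hedged reconstruction of a complete benchmark around that call (the sizes, padding, initialization, and timing are assumptions, not the original code) might look like:

#include <mkl.h>
#include <stdio.h>

int main() {
  const int N = 4096, Nld = N + 8;  // matrix size and padded leading dimension (assumptions)
  const char tr = 'N';
  const double v = 1.0;             // used as both alpha and beta
  double* A = (double*) mkl_malloc(sizeof(double)*(size_t)Nld*N, 64);
  double* B = (double*) mkl_malloc(sizeof(double)*(size_t)Nld*N, 64);
  double* C = (double*) mkl_malloc(sizeof(double)*(size_t)N*N,   64);
  for (size_t i = 0; i < (size_t)Nld*N; i++) { A[i] = 1.0; B[i] = 1.0; }
  for (size_t i = 0; i < (size_t)N*N;   i++) C[i] = 0.0;
  for (int it = 1; it <= 5; it++) {
    const double t0 = dsecnd();  // MKL timing routine
    dgemm(&tr, &tr, &N, &N, &N, &v, A, &Nld, B, &Nld, &v, C, &N);
    const double t1 = dsecnd();
    printf("Iteration %d: %.1f GFLOP/s\n", it, 2.0*N*N*N/(t1-t0)*1e-9);
  }
  mkl_free(A); mkl_free(B); mkl_free(C);
}

Building and running it natively on the coprocessor matches the session below.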
user@host% icpc -o bench-dgemm -mkl -mmic bench-dgemm.cc
user@host% micnativeloadex ./bench-dgemm
Iteration 1: 312.7 GFLOP/s
Iteration 2: 346.5 GFLOP/s
Iteration 3: 348.5 GFLOP/s
Iteration 4: 347.2 GFLOP/s
Iteration 5: 348.3 GFLOP/s
Other Optimization Topics for Thread Parallelism
§6. Advanced Optimization for the MIC Architecture
Memory Access and Cache Utilization
Challenges with Memory Access on Xeon Phi
• More threads than on the CPU, but the same total amount of Level-2 cache (~30 MB)
• No hardware prefetching from Level-2 to Level-1 cache
• High penalty for data page walks
• Dynamic memory allocation is serial → a greater penalty than on the CPU, per Amdahl's law
"Rule of thumb" for memory optimization: locality of data access in space and in time.
• Spatial locality = data structures (packing, reordering).
• Temporal locality = order of operations (e.g., loop tiling).
Loop Tiling (Blocking)
// Plain nested loops
for (int i = 0; i < m; i++)
  for (int j = 0; j < n; j++)
    compute(a[i], b[j]); // Memory access is unit-stride in j
Unoptimized code:
• Large-stride memory accesses
• Inefficient cache use
• Does not reach the memory bandwidth limit
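A minimal tiled sketch of the same loop nest (TILE is a tuning assumption, not from the original slide): processing j in blocks lets a block of b stay in cache while i varies.

const int TILE = 16; // tune so that a block of b fits in L1 cache
for (int jj = 0; jj < n; jj += TILE)
  for (int i = 0; i < m; i++)
    for (int j = jj; j < jj + TILE && j < n; j++)
      compute(a[i], b[j]); // b[jj..jj+TILE-1] is reused from cache for every i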
Tiling a Parallel For-Loop (Matrix Transposition)
#pragma omp parallel for
for (int ii = 0; ii < n; ii += TILE) {              // Distribute across threads
  const int iMax = (n < ii+TILE ? n : ii+TILE);     // Adapt to matrix shape
  for (int jj = 0; jj <= ii; jj += TILE) {          // Tile the work
    for (int i = ii; i < iMax; i++) {               // Universal microkernel
      const int jMax = (i < jj+TILE ? i : jj+TILE); // for whole matrix
#pragma loop count avg(TILE)                        // Vectorization tuning
#pragma simd                                        // Vectorization hint
      for (int j = jj; j < jMax; j++) {             // Variable loop count (bad)
        const double c = A[i*n + j];                // Swap elements
        A[i*n + j] = A[j*n + i];
        A[j*n + i] = c;
      }
    }
  }
}
Further Optimization: Code Snippet
#pragma omp parallel
{
#pragma omp for schedule(guided)
  for (int k = 0; k < nTilesParallel; k++) {           // Bulk of calculations here
    const int ii = plan[HEADER_OFFSET + 2*k + 0]*TILE; // Planned order
    const int jj = plan[HEADER_OFFSET + 2*k + 1]*TILE; // of operations
    for (int j = jj; j < jj+TILE; j++) {               // Simplified main microkernel
#pragma simd                                           // Vectorization hint
#pragma vector nontemporal                             // Cache traffic hint
      for (int i = ii; i < ii+TILE; i++) {             // Constant loop count (good)
        const double c = A[i*n + j];                   // Swap elements
        A[i*n + j] = A[j*n + i];
        A[j*n + i] = c;
      }
    }
  }
  // Transposing the tiles along the main diagonal and edges...
  // ...
Longer code, but still plain C; it works for both the CPU and the MIC.
Arithmetic Intensity and Roofline Model
Theoretical estimates, Intel Xeon Phi coprocessor
Arithmetic Performance = 60 × 1.0 × (512/64) × 2 = 960 GFLOP/s.
Memory Bandwidth = η × 6.0 × 8 × 2 × 4 = η × 384 GB/s.
Peak performance assumes: a 60-core Intel Xeon Phi clocked at 1.0 GHz, 512-bit SIMD registers, 64-bit floating-point numbers, and fused multiply-add.
Peak memory bandwidth assumes: 6.0 GT/s (transfers), 8 memory controllers, 2 channels in each, 4 bytes per channel, and η ≈ 0.5, the practical efficiency.
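The roofline model predicts attainable performance ≈ min(peak arithmetic performance, AI × bandwidth), where AI is the arithmetic intensity in FLOP/byte. A worked example under the assumptions above (η ≈ 0.5, i.e., about 192 GB/s attainable): a double-precision dot product performs 2 FLOPs (one multiply, one add) per 16 bytes loaded, so AI = 2/16 = 0.125 FLOP/byte, and min(960, 0.125 × 192) = 24 GFLOP/s. Such a kernel is firmly bandwidth-bound.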
Theoretical estimates, 2x 8-core Intel Xeon E5 processors at 3.0 GHz
Arithmetic Performance = 2 sockets × 8 × 3.0 × (256/64) × 2 = 384 GFLOP/s.
Roofline model: theoretical peak
[Chart: attainable performance in GFLOP/s versus arithmetic intensity (1 to 256, log scale); for each system, a sloped "theoretical max bandwidth" line meets a flat "theoretical max performance" ceiling at the machine balance point.]
Other Topics on Memory Traffic Optimization
Data Persistence and PCIe Traffic
Memory Retention Between Offloads
// Allocate arrays on the coprocessor during the first iteration;
// retain allocated memory for subsequent iterations
#pragma offload target(mic:0) \
  in(data: length(size) alloc_if(k==0) free_if(k==nTrials-1) align(64))
{
  // offloaded code here...
}
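The alloc_if/free_if conditions imply a surrounding trial loop. A minimal sketch of the assumed context (k and nTrials are not shown in the original fragment):

// Assumed driver loop: the buffer is allocated on the first offload,
// reused in between, and freed on the last one.
for (int k = 0; k < nTrials; k++) {
#pragma offload target(mic:0) \
  in(data: length(size) alloc_if(k==0) free_if(k==nTrials-1) align(64))
  {
    // offloaded code here...
  }
}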
[Chart: offload data transfer rate versus array size, from 1 kB to 1 GB, comparing the default offload data transfer with memory retention between offloads.]
MPI Applications on Clusters with Coprocessors
MPI: Fabrics
MPI Fabric Selection: Ethernet and InfiniBand
Ethernet+TCP between coprocessors is slower than the hardware limit
InfiniBand approaches the hardware limit from the CPU to the coprocessors
[Charts: bandwidth (GB/s) and latency (µs) for CPU-to-remote-CPU and CPU-to-mic0 transfers; http://research.colfaxinternational.com/]
InfiniBand requires additional software on top of MPSS: the coprocessor appears as a virtualized InfiniBand HCA.
The fabric is selected with the environment variable I_MPI_FABRICS.
More information in the white paper.
[Diagram: system with CPU and memory connected over PCIe through the chipset to the MIC device with its own memory and a virtualized InfiniBand HCA (RDMA).]
MPI Fabric Selection: Intra-Device Fabric
Part of CCL: virtual interface ibscif for communication between
coprocessors within a system
Default Combination: I_MPI_FABRICS=shm:dapl
shm provides better latency; dapl provides greater bandwidth
[Charts: latency (µs) and bandwidth (GB/s) versus message size, 4 B to 1 GB, for mic0-to-mic0 communication over dapl and shm; http://research.colfaxinternational.com/]
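A typical selection (illustrative shell lines; the application name is hypothetical):

user@host% export I_MPI_FABRICS=shm:dapl   # shm within a device, DAPL elsewhere
user@host% mpirun -np 240 -host mic0 ./myapp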
Communication Efficiency with Symmetric Clustering
http://research.colfaxinternational.com/
MPI communication between the CPU and coprocessors is as efficient as offload.
Peer-to-peer communication is not uniform, but better than with Gigabit Ethernet.
[Diagram: two nodes, each with two CPUs (CPU0, CPU1) and four coprocessors (mic0–mic3) on PCIe, connected through InfiniBand HCAs and an IB switch; annotated with measured latencies from 1.1 to 9.6 µs and bandwidths from 0.3 to 11 GB/s.]
Process Parallelism: MPI Optimization Strategies
Dynamic scheduling
Load balancing
Communication-efficient algorithms
OpenMP/MPI hybrid
The Monte Carlo Method of Computing the Number π
\( A_{\mathrm{quarter\;circle}} = \tfrac{1}{4}\pi R^2, \qquad A_{\mathrm{square}} = L^2, \qquad \pi = 3.141592653589793\ldots \)
\( \langle N_{\mathrm{quarter\;circle}} \rangle = \frac{A_{\mathrm{quarter\;circle}}}{A_{\mathrm{square}}}\, N, \qquad 4\,\frac{\langle N_{\mathrm{quarter\;circle}} \rangle}{N} = \frac{4\pi R^2}{4L^2} = \pi \quad (R = L = 1), \)
\( \text{so} \qquad \pi \approx 4\,\frac{N_{\mathrm{quarter\;circle}}}{N}. \)
[Diagram: a quarter circle of radius R = 1 inscribed in the unit square of side L = 1; each dot is a Monte Carlo trial, and the trials falling inside the quarter circle are counted.]
The Monte Carlo Method of Computing the Number π
#include <mkl_vsl.h>
const long BLOCK_SIZE=4096;
// ...
int rank, nRanks, trial;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
// ...
MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize();
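The listing elides the middle of the program. A hedged, self-contained sketch of the whole calculation (the block count, seeding, and reduction scheme are assumptions, not the original code):

#include <mkl_vsl.h>
#include <mpi.h>
#include <stdio.h>

const long BLOCK_SIZE = 4096;
const long N_BLOCKS   = 65536;  // total trials = N_BLOCKS * BLOCK_SIZE

int main(int argc, char** argv) {
  int rank, nRanks;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  VSLStreamStatePtr stream;
  vslNewStream(&stream, VSL_BRNG_MT19937, 77 + rank); // per-rank seed
  float r[2*BLOCK_SIZE];                              // (x, y) coordinate pairs
  long hits = 0;
  for (long b = rank; b < N_BLOCKS; b += nRanks) {    // static work distribution
    vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, 2*BLOCK_SIZE, r, 0.0f, 1.0f);
    for (long i = 0; i < BLOCK_SIZE; i++)
      if (r[2*i]*r[2*i] + r[2*i+1]*r[2*i+1] < 1.0f)
        hits++;                                       // point fell inside the quarter circle
  }
  long total = 0;
  MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("pi ~ %.6f\n", 4.0*total/((double)N_BLOCKS*BLOCK_SIZE));
  vslDeleteStream(&stream);
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Finalize();
}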
Host, coprocessor, heterogeneous
user@host% mpirun -np 32 -host localhost ./pi_mpi
Time, s: 0.84
user@host% mpirun -np 240 -host mic0 ~/pi_mpi
Time, s: 0.44
user@host% mpirun -np 32 -host localhost ./pi_mpi : -np 240 -host mic0 ~/pi_mpi
Time, s: 0.36
Using Intel Trace Analyzer and Collector
Load Balancing with Static Scheduling
Solution: assign more work to the CPU ranks.
\( \alpha = \frac{b_{\mathrm{host}}}{b_{\mathrm{MIC}}}, \qquad b_{\mathrm{host}} = \frac{\alpha B_{\mathrm{total}}}{\alpha P_{\mathrm{host}} + P_{\mathrm{MIC}}}, \qquad b_{\mathrm{MIC}} = \frac{B_{\mathrm{total}}}{\alpha P_{\mathrm{host}} + P_{\mathrm{MIC}}} \)
(b is the work per rank, B_total the total work, P the number of ranks of each kind.)
[Chart: effect of load balancing between host and coprocessor in the Monte Carlo calculation of π; run time (0.0 to 0.4 s) versus the parameter α from 0 to 8.]
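A worked check of these formulas (using the 32-process host and 240-process coprocessor configuration from the runs above as an assumption): with P_host = 32, P_MIC = 240 and α = 3.4, the host ranks receive α·P_host/(α·P_host + P_MIC) = 108.8/348.8 ≈ 31% of the work, even though they make up only about 12% of the ranks.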
Load balance: execution times
[Bar chart, time in seconds (lower is better): Xeon only (32 processes): 0.839 s; Xeon Phi only (240 processes): 0.449 s; Xeon + Xeon Phi, α = 1.0: 0.366 s; Xeon + Xeon Phi, α = 3.4: 0.283 s.]
§7. Conclusion
Programming Models for Xeon Phi Coprocessors
1. Native coprocessor applications
   • Compile with -mmic
   • Run with micnativeloadex or scp+ssh
   • The way to go for MPI applications without offload
2. Explicit offload
   • Functions and global variables require __attribute__((target(mic)))
   • Initiate offload and data marshalling with #pragma offload
   • Only bitwise-copyable data can be shared
3. Clusters and multiple coprocessors
   • #pragma offload target(mic:i)
   • Use threads to offload to multiple coprocessors
   • Run native MPI applications
Optimization Checklist
1. Scalar optimization
2. Vectorization
3. Scale above 100 threads
4. Arithmetically intensive or bandwidth-limited
5. Efficient cooperation between the host and the coprocessor(s)
Additional Resources: Reading, Guides, Support
Reference Guides
Intel’s Top 10 List
1. Download the programming books: "Intel Xeon Phi Coprocessor High Performance Programming" by Jeffers & Reinders, and "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors" by Colfax.
2. Watch the parallel programming webinar.
3. Bookmark and browse the mic-developer website.
4. Bookmark and browse the two developer support forums: "Intel MIC Architecture" and "Threading on Intel Parallel Architectures".
5. Consult the "Quick Start" guide to prepare your system for first use, learn about tools, and get C/C++ and Fortran-based programs up and running.
Link to TOP10 List for Starter Kit Developers
Intel’s Top 10 List (continued)
6. Try your hand at the beginning lab exercises.
7. Try your hand at the beginner/intermediate real-world app exercises.
8. Browse the case studies webpage to view examples from many segments.
9. Begin optimizing your application(s); consult your programming books, the ISA reference manual, and the support forums for assistance.
10. Hone your skills by watching more advanced video workshops.
Intel Xeon Phi Starter Kit
Workstations with Intel Xeon Phi Coprocessors (Jan 2014)
http://www.colfax-intl.com/nd/xeonphi/workstations.aspx
Servers with Intel Xeon Phi Coprocessors (Jan 2014)
http://www.colfax-intl.com/nd/xeonphi/servers.aspx
Research and Consulting
Colfax offers consulting services for enterprises, research labs, and universities. We can help you to:
• Optimize your existing application to take advantage of all levels of hardware parallelism
• Future-proof for upcoming innovations in computing solutions
• Accelerate your application using coprocessor technologies
• Investigate the potential system configurations that satisfy your cost, power and performance requirements
• Take a deep dive to develop a novel approach
For more details, contact us at [email protected] to discuss what we can do together.
Intel® Xeon Phi™ Coprocessor Remote Access and System Loaner Programs
Remote Access:
• Intel-supported options for academia: Manycore Testing Lab through SSG (more info); Intel Science & Technology Center (ISTC) and Intel Collaborative Research Institutes (ICRI) programs through Intel Labs (more info)
• Seven-day, 24/7 remote access to a personal HPC server at Colfax with training materials, Intel® Xeon® processors, Intel® Xeon Phi™ coprocessors and software development tools
• More information: HERE
Systems Loaner Programs:
• Intel Demo Depot: contact your local Intel sales representative to request an Intel® Xeon Phi™ coprocessor-based system
• 30-day access to a loaner system, complete with Colfax hardware and software programming support
• For more information, send email to [email protected]
Intel Xeon Phi 3120A or 5110P starter kits: software.intel.com/xeon-phi-starter-kit
Thank you for tuning in, and have a wonderful journey to the Parallel World!