Multicore Challenge in Vector Pascal: P Cockshott, Y Gdura
N-body Problem
Part 1 (Performance on Intel Nehalem)
Introduction (Vector Pascal, Machine specifications, N-body algorithm)
Data Structures (1D and 2D layouts)
Performance of single thread code (C and Vector Pascal)
Performance of multithread code (VP SIMD version)
Summary Performance on Nehalem
Vector Pascal
Extends Pascal's support for array operations
Designed to make use of SIMD instruction sets and multi-core processors
Xeon Specifications
Hardware
Year 2010
2 Intel Xeon Nehalem (E5620) - 8 cores
24 GB RAM, 12MB cache
16 threads
2.4 GHz
Software
Linux
Vector Pascal compiler
GCC version 4.1.2
Data Structures
The C implementation stores the information as an array of structures, each of which is
struct planet {
    double x, y, z;
    double vx, vy, vz;
    double mass;
};
This layout aligns the vectors with the cache lines and with the vector registers
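As a rough illustration of the difference between the record layout and a 2D layout, the sketch below contrasts the two in C. It is not taken from the authors' source: `planets_soa`, `aos_to_soa`, `NBODIES`, and `NDIMS` are hypothetical names, and the 2D form is an assumption inferred from the `v^[j,i]` indexing (dimension first, body second) used in the Pascal code.

```c
#include <assert.h>
#include <stddef.h>

#define NBODIES 1024   /* assumed problem size, for illustration only */
#define NDIMS   3

/* Array-of-structures layout, as in the C implementation. */
struct planet {
    double x, y, z;
    double vx, vy, vz;
    double mass;
};

/* Hypothetical 2D layout: one contiguous row per coordinate, so the same
   coordinate of neighbouring bodies is adjacent in memory. */
struct planets_soa {
    double pos[NDIMS][NBODIES];
    double vel[NDIMS][NBODIES];
    double mass[NBODIES];
};

/* Copy n bodies from the record layout into the 2D layout. */
void aos_to_soa(const struct planet *in, struct planets_soa *out, size_t n)
{
    assert(in != NULL && out != NULL && n <= NBODIES);
    for (size_t i = 0; i < n; i++) {
        out->pos[0][i] = in[i].x;
        out->pos[1][i] = in[i].y;
        out->pos[2][i] = in[i].z;
        out->vel[0][i] = in[i].vx;
        out->vel[1][i] = in[i].vy;
        out->vel[2][i] = in[i].vz;
        out->mass[i]   = in[i].mass;
    }
}
```

With rows laid out this way, a SIMD load of consecutive elements from `pos[0]` picks up the x coordinates of consecutive bodies, whereas in the record layout the same load would pick up x, y, z of a single body.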
Note that this version has side effects: the inner loop updates the velocities, so successive iterations of the outer loop cannot run in parallel.
parallel}
for i := 1 to N do                        { iterate on planets }
  for j := 1 to 3 do                      { iterate on dimensions }
    v^[j,i] := v^[j,i] + dv[i,1].pos[j];  { update velocities }
x^ := x^ + v^ * dt;                       { Finally update positions. }
end;
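For readers more familiar with C, the update step can be sketched as explicit loops. This is an illustrative translation, not the authors' code: `apply_updates`, `NDIMS`, and the flat row-major `dv` array (standing in for the `dv[i,1].pos[j]` record field) are assumptions. Because the velocity changes are read from a separate array, each element's update is independent of every other.

```c
#include <assert.h>
#include <stddef.h>

#define NDIMS 3

/* x, v and dv each hold NDIMS rows of n doubles, stored row-major:
   element (j, i) lives at index j * n + i. */
void apply_updates(double *x, double *v, const double *dv, size_t n, double dt)
{
    assert(x != NULL && v != NULL && dv != NULL);
    /* Dimension is the outer loop here so that memory is walked
       contiguously; the Pascal fragment nests the loops the other way. */
    for (size_t j = 0; j < NDIMS; j++) {
        for (size_t i = 0; i < n; i++) {
            size_t k = j * n + i;
            v[k] += dv[k];       /* update velocities */
            x[k] += v[k] * dt;   /* then update positions with the new velocity */
        }
    }
}
```

The Pascal whole-array statement `x^ := x^ + v^ * dt` expresses the second assignment for every element at once; the C loop spells out the same elementwise operation.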
[Figure: series V12 record, C version, V12 rec hyper, with power-law trend lines for V12 record and C version; hyperthreaded region marked.]
[Figure: series V12 record, v8simdhyper, non hyper, C version, V12 rec hyper, with power-law trend lines; fitted trend y = 0.0284x^-0.768.]
Performance in GFLOPS
We pick the 6-core versions because they give the peak flops, being just before the hyper-threading transition. This transition affects the 7- and 8-thread versions.
Op.s per body              Vector Pascal           C
compute displacement                   3           3
get distance                           6           6
compute mag                            5           3
evaluate dv                            6          18
total per inner loop                  20          30
times round inner loop              1024         512
times round outer loop              1024        1024
total per timestep              20971520    15728640
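The totals in the operation counts are simple products, and a GFLOPS figure follows by dividing by the elapsed time. A minimal C sketch of that arithmetic is given below; `ops_per_timestep` and `gflops` are hypothetical helper names, and taking the elapsed time in milliseconds is an assumption about the units.

```c
#include <assert.h>

/* Floating-point operations in one timestep:
   ops per inner-loop pass x inner-loop passes x outer-loop passes. */
long long ops_per_timestep(int ops_per_inner, int inner_iters, int outer_iters)
{
    assert(ops_per_inner > 0 && inner_iters > 0 && outer_iters > 0);
    return (long long)ops_per_inner * inner_iters * outer_iters;
}

/* GFLOPS for a given operation count and an elapsed time in milliseconds:
   ops / (msec * 1e-3 s) / 1e9 = ops / (msec * 1e6). */
double gflops(long long ops, double msec)
{
    assert(msec > 0.0);
    return (double)ops / (msec * 1.0e6);
}
```

For example, `ops_per_timestep(20, 1024, 1024)` reproduces the Vector Pascal total of 20971520 and `ops_per_timestep(30, 512, 1024)` the C total of 15728640; passing either to `gflops` with a measured time gives the corresponding rate.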
Language / version        Time (msec)
SIMD version Pascal       14.36
SIMD version Pascal       2.80
record version Pascal     23.50
record version Pascal     4.23
C version                 14.00
Performance on Cell
2 different instruction sets (2 different compilers)
Memory Flow Controller (MFC) on each SPE (DMA, mailbox, signals)
Alignment boundary (16 bytes, or 128 bytes for better performance)
Existing supported languages (C/C++ and Fortran)
Aimed at
Array expressions in data-intensive applications.
Built of
1. A PowerPC compiler
2. A Virtual SIMD Machine (VSM) model to access the SPEs.
VSM Interpreter
i. Checks the inbound mailbox for new messages
ii. On receiving a message, an SPE performs the required operation
iii. Sends an acknowledgment on completion (if needed)
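The three interpreter steps can be sketched as a small dispatch loop. The C code below is a simulation under stated assumptions, not the VSM implementation: it uses no Cell SDK calls, the mailbox is a plain array, and the message format (`vsm_msg`), the opcodes (`VSM_ADD`, `VSM_SCALE`, `VSM_STOP`), and `vsm_interpret` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes; the real VSM instruction encoding is not shown here. */
enum vsm_op { VSM_ADD, VSM_SCALE, VSM_STOP };

struct vsm_msg {
    enum vsm_op op;
    double operand;
    int want_ack;   /* non-zero if the sender expects an acknowledgment */
};

/* Simulated mailbox: an array standing in for the hardware queue. */
struct mailbox {
    struct vsm_msg in[16];
    size_t in_count;   /* messages queued */
    size_t in_next;    /* next message to read */
    int acks;          /* acknowledgments sent back */
};

/* Process queued messages against one simulated virtual register.
   A real SPE would block waiting for mail; this sketch returns when the
   queue is empty or a stop message arrives. */
void vsm_interpret(struct mailbox *mb, double *reg, size_t len)
{
    assert(mb != NULL && reg != NULL);
    while (mb->in_next < mb->in_count) {            /* i. check inbound mailbox */
        struct vsm_msg m = mb->in[mb->in_next++];
        if (m.op == VSM_STOP)
            break;
        for (size_t k = 0; k < len; k++) {          /* ii. perform the operation */
            if (m.op == VSM_ADD)
                reg[k] += m.operand;
            else if (m.op == VSM_SCALE)
                reg[k] *= m.operand;
        }
        if (m.want_ack)                             /* iii. acknowledge if needed */
            mb->acks++;
    }
}
```

Making the acknowledgment optional lets the sender issue several messages back to back and wait only on the ones whose completion it actually depends on.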
4. SPE Handles
1. Alignment (load & store)
2. Synchronization
Parts of the data might be being processed on the preceding SPE and the succeeding SPE
[Figure: store operation. SPE1, SPE2 and SPE3 each set a lock on a 128B block; 3 DMA transfers (1st DMA, 2nd DMA) are shown.]
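One plausible reading of the three DMA transfers in the store operation is that an unaligned store is split on 128-byte boundaries into a partial leading block, a run of whole blocks, and a partial trailing block. The C sketch below shows only that address arithmetic. It is an assumption, not the VSM code: `split_store`, `segment`, and `BLOCK` are hypothetical names, and the locking of the shared edge blocks is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 128u   /* alignment boundary in bytes */

struct segment {
    uint64_t addr;   /* start address of the piece */
    size_t   len;    /* length in bytes */
};

/* Split [addr, addr + len) into at most three pieces: an unaligned head,
   a run of whole 128-byte blocks, and an unaligned tail.
   Returns the number of non-empty pieces written to out[]. */
int split_store(uint64_t addr, size_t len, struct segment out[3])
{
    assert(out != NULL);
    if (len == 0)
        return 0;

    int n = 0;
    uint64_t end = addr + len;
    uint64_t first_aligned = (addr + BLOCK - 1) / BLOCK * BLOCK; /* round up */
    uint64_t last_aligned  = end / BLOCK * BLOCK;                /* round down */
    uint64_t cur = first_aligned < end ? first_aligned : end;

    if (cur > addr)                       /* head: up to the first boundary */
        out[n++] = (struct segment){ addr, (size_t)(cur - addr) };
    if (last_aligned > cur) {             /* body: whole aligned blocks */
        out[n++] = (struct segment){ cur, (size_t)(last_aligned - cur) };
        cur = last_aligned;
    }
    if (end > cur)                        /* tail: after the last boundary */
        out[n++] = (struct segment){ cur, (size_t)(end - cur) };
    return n;
}
```

Under this reading, only the head and tail pieces touch blocks that a neighbouring SPE may also be writing, which is where a lock on the 128-byte block would be needed.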
[Table: Vector Pascal on Cell. Columns: Pascal (PPE), Pascal (SPE), Pascal (SPEs), C (PPE, O3). Problem sizes: 1K, 4K, 8K, 16K.]
[Figure: Time in (secs), logarithmic axes, with power-law trend lines y = 97.9x^-0.9 and y = 20.5x^-0.813.]