Multicore Challenge in Vector Pascal: P Cockshott, Y Gdura

The document summarizes research on using the Vector Pascal programming language to optimize an N-body simulation problem for multicore CPUs and heterogeneous architectures like the IBM Cell processor. On Intel Nehalem CPUs, a SIMD-friendly Vector Pascal implementation achieved a speedup of over 4x compared to optimized C code by exploiting up to 7 cores. A new Cell-Vector Pascal compiler was also developed that used a virtual SIMD machine model to partition data and launch computations across the PowerPC core and 8 synergistic processors of the Cell architecture, achieving good scaling to 4 cores.


Multicore Challenge in Vector Pascal

P Cockshott, Y Gdura

N-body Problem
Part 1 (Performance on Intel Nehalem)

Introduction (Vector Pascal, machine specifications, N-body algorithm)
Data Structures (1D and 2D layouts)
Performance of single-threaded code (C and Vector Pascal)
Performance of multithreaded code (VP SIMD version)
Summary of performance on Nehalem

Part 2 (Performance on IBM Cell)


Introduction
New Cell-Vector Pascal (CellVP) Compiler
Performance on Cell (C and Vector Pascal)

Vector Pascal
Extends Pascal's support for array operations; designed to make use of SIMD instruction sets and multiple cores.
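As a point of comparison, the single whole-array statement used later in this deck, x^ := x^ + v^ * dt, expresses what a scalar C compiler would be handed as an explicit double loop. A minimal sketch, with illustrative array names and sizes rather than the benchmark's actual declarations:

#include <stddef.h>

#define DIMS 3
#define N    1024

/* What the Vector Pascal statement  x^ := x^ + v^ * dt  expresses in one line:
   an elementwise update over the whole 3 x N position matrix. */
static void update_positions(double x[DIMS][N], const double v[DIMS][N], double dt)
{
    for (size_t d = 0; d < DIMS; d++)
        for (size_t i = 0; i < N; i++)
            x[d][i] += v[d][i] * dt;   /* the compiler is free to vectorise this */
}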

Xeon Specifications
Hardware
Year: 2010
2 x Intel Xeon Nehalem (E5620): 8 cores, 16 threads, 2.4 GHz
24 GB RAM, 12 MB cache

Software
Linux
Vector Pascal compiler
GCC version 4.1.2

The N-Body Problem


For 1024 bodies, each time step:
  For each body B in 1024:
    Compute the force on it from each other body
    From these derive the partial accelerations
    Sum the partial accelerations
    Compute the new velocity of B
  For each body B in 1024:
    Compute the new position
(A C sketch of one time step follows.)
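A minimal, unoptimised C sketch of this two-phase time step (with G = 1, as in the reference code below; names are illustrative, not the benchmark source):

#include <math.h>

#define N 1024

typedef struct { double x[3], v[3], mass; } body;

/* One time step: first update all velocities from the summed partial
   accelerations, then update all positions from the new velocities. */
static void timestep(body b[N], double dt)
{
    for (int i = 0; i < N; i++) {                /* phase 1: velocities */
        double acc[3] = {0, 0, 0};
        for (int j = 0; j < N; j++) {            /* force from every other body */
            if (j == i) continue;
            double d[3], r2 = 0;
            for (int k = 0; k < 3; k++) { d[k] = b[j].x[k] - b[i].x[k]; r2 += d[k]*d[k]; }
            double r = sqrt(r2);
            double s = b[j].mass / (r * r * r);  /* partial acceleration magnitude */
            for (int k = 0; k < 3; k++) acc[k] += s * d[k];
        }
        for (int k = 0; k < 3; k++) b[i].v[k] += acc[k] * dt;
    }
    for (int i = 0; i < N; i++)                  /* phase 2: positions */
        for (int k = 0; k < 3; k++) b[i].x[k] += b[i].v[k] * dt;
}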

Data Structures
The C implementation stores the information as an array of structures, each of which is:

struct planet { double x, y, z; double vx, vy, vz; double mass; };

This array-of-structures layout does not align well with cache lines or SIMD registers.

Alternative Horizontal Structure

This layout aligns the vectors with the cache lines and with the vector registers
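A minimal C sketch of such a horizontal, structure-of-arrays layout (field names are illustrative): each coordinate row is contiguous, so it maps directly onto cache lines and SIMD registers.

#define N 1024

/* Structure of arrays: one contiguous row per component. */
struct planets {
    double x[3][N];    /* positions:  x[0] = x, x[1] = y, x[2] = z   */
    double v[3][N];    /* velocities: v[0] = vx, v[1] = vy, v[2] = vz */
    double mass[N];
};

A whole row such as x[0] can then be loaded in aligned, register-sized chunks, which is what the Vector Pascal array expressions exploit.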

The Reference C Version


for (i = 0; i < nbodies; i++) {
    struct planet * b = &(bodies[i]);
    for (j = i + 1; j < nbodies; j++) {
        struct planet * b2 = &(bodies[j]);
        double dx = b->x - b2->x;
        double dy = b->y - b2->y;
        double dz = b->z - b2->z;
        double distance = sqrt(dx * dx + dy * dy + dz * dz);
        double mag = dt / (distance * distance * distance);
        b->vx -= dx * b2->mass * mag;
        b->vy -= dy * b2->mass * mag;
        b->vz -= dz * b2->mass * mag;
        b2->vx += dx * b->mass * mag;
        b2->vy += dy * b->mass * mag;
        b2->vz += dz * b->mass * mag;
    }
}

Note that this version has side effects, so successive iterations of the outer loop cannot run in parallel: the inner loop updates the velocities of both bodies.

Equivalent Record Based Pascal


row := 0;
b := planets[i];
for j := 1 to n do begin
    b2 := planets[j];
    dx := b^.x - b2^.x;
    dy := b^.y - b2^.y;
    dz := b^.z - b2^.z;
    distance := sqrt(dx * dx + dy * dy + dz * dz);
    mag := dt * b2^.mass / (distance * distance * distance + epsilon);
    row[1] := row[1] - dx * mag;
    row[2] := row[2] - dy * mag;
    row[3] := row[3] - dz * mag;
end;

This is side-effect free, as the total change in the velocity of the ith planet is built up in a local row vector which is added to the planet velocities later.

Complexity and Performance Comparison


Timings below are for single-threaded code on the Xeon:

Language         Unoptimised    -O3
Vector Pascal    28.9 ms        23.5 ms
C                30 ms          14 ms

Note: the Pascal code performs N² operations while the C code does N²/2, since it exploits the symmetry of the force calculation.
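As a rough per-interaction check from the figures above: the optimised C code does N²/2 ≈ 524,288 interaction updates in 14 ms, about 27 ns each, while the -O3 Pascal code does N² ≈ 1,048,576 in 23.5 ms, about 22 ns each.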

SIMD-friendly version: no explicit inner loop


pure function computevelocitychange(start:integer):coord;
{ declarations: M: pointer to mass vector, x: pointer to position matrix,
  di: displacement matrix, distance: vector of distances }
begin
   row := x^[iota[0],i];
   { Compute the displacement vector between each planet and planet i. }
   di := row[iota[0]] - x^;
   { Next compute the euclidean distances }
   xp := @di[1,1]; yp := @di[2,1]; zp := @di[3,1];   { point at the rows }
   distance := sqrt(xp^*xp^ + yp^*yp^ + zp^*zp^) + epsilon;
   mag := dt/(distance*distance*distance);
   changes.pos := \+ (M^*mag*di);
end
The row summation operator \+ builds the x, y, z components of dv.
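Read operationally, the array expressions above amount to the following per-planet computation, sketched here in scalar C. Names mirror the Pascal; this only illustrates the semantics, it is not the generated code.

#include <math.h>

#define N 1024

/* x[3][N]: positions, M[N]: masses. Accumulates the change in velocity of
   planet i over all planets j, mirroring di, distance, mag and the \+ reduction. */
static void velocity_change(const double x[3][N], const double M[N],
                            int i, double dt, double eps, double dv[3])
{
    dv[0] = dv[1] = dv[2] = 0.0;
    for (int j = 0; j < N; j++) {
        double di[3];                              /* column j of the displacement matrix */
        for (int k = 0; k < 3; k++) di[k] = x[k][i] - x[k][j];
        double distance = sqrt(di[0]*di[0] + di[1]*di[1] + di[2]*di[2]) + eps;
        double mag = dt / (distance * distance * distance);
        for (int k = 0; k < 3; k++)                /* the row summation \+ */
            dv[k] += M[j] * mag * di[k];
    }
}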

Pack this up in Pure Function Applied in Parallel


procedure radvance(dt:real);
var dv: array[1..n,1..1] of coord;
    i,j: integer;

   pure function computevelocitychange(i:integer; dt:real):coord;
   begin
      { --- do the computation on the last slide }
      computevelocitychange := changes.pos;
   end;

begin
   { iota[0] is the 0th index vector; the elements of the left hand side
     can be evaluated in parallel }
   dv := computevelocitychange(iota[0], dt);
   for i := 1 to N do                           { iterate on planets }
      for j := 1 to 3 do                        { iterate on dimensions }
         v^[j,i] := v^[j,i] + dv[i,1].pos[j];   { update velocities }
   x^ := x^ + v^ * dt;                          { Finally update positions. }
end;

Now Compile with Multiple Cores


The program is unchanged, compiled with from 1 to 16 cores, for example: vpc V12 cpugnuP4 cores8
Log-log plot: X axis threads, Y axis time in seconds, 256 runs. Mean time for 7 cores = 5.2 ms.
[Chart: power-law fits: V12 record y = 0.0229x^-0.895, C version y = 0.0146x^-0.002, V12 rec hyper (hyperthreaded) y = 0.0284x^-0.768.]

Combined SIMD Multicore Performance


[Log-log chart of time against threads for the V12 record, v8 SIMD hyper, non-hyper, C and V12 rec hyper versions. Power-law fits include y = 0.0229x^-0.895 (V12 record), y = 0.0284x^-0.768 (V12 rec hyper), and y = 0.009x^-0.448 and y = 0.0135x^-0.842 for the SIMD versions.]

Summary Time per Iteration


Best performance on the Xeon was obtained using 7 cores.

Time per iteration:
C optimised, 1 core            14 ms
SIMD code Pascal, 1 core       16 ms
SIMD code Pascal, 7 cores      2.25 ms
Record code Pascal, 1 core     23 ms
Record code Pascal, 7 cores    3.75 ms

SIMD performance scales as c^0.84 and record performance scales as c^0.89, where c is the number of cores.

Performance in GFLOPS
We pick the 6-core versions as they give the peak flops, being just before the hyper-threading transition, which affects the 7- and 8-thread versions.
Op.s per body               Vector Pascal         C
compute displacement              3               3
get distance                      6               6
compute mag                       5               3
evaluate dv                       6              18
total per inner loop             20              30
times round inner loop         1024             512
times round outer loop         1024            1024
total per timestep         20971520        15728640

Language / version        Cores (Xeon)   Time (ms)   GFLOPS total   GFLOPS per core
SIMD version Pascal            1           14.36        1.460           1.460
SIMD version Pascal            6            2.80        7.490           1.248
record version Pascal          1           23.50        0.892           0.892
record version Pascal          6            4.23        4.958           0.826
C version                      1           14.00        1.123           1.123
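As a check on the arithmetic behind the table, each GFLOPS figure is simply operations per timestep divided by time. A minimal C sketch reproducing two of the entries:

#include <stdio.h>

int main(void)
{
    /* Operation counts per timestep from the table above. */
    double vp_ops = 20.0 * 1024 * 1024;   /* Vector Pascal: 20 ops over the full N*N inner loop   */
    double c_ops  = 30.0 *  512 * 1024;   /* C: 30 ops over the half (N*N/2) inner loop           */

    printf("VP ops/timestep: %.0f\n", vp_ops);                         /* 20971520 */
    printf("C  ops/timestep: %.0f\n", c_ops);                          /* 15728640 */
    printf("VP SIMD 6 cores: %.3f GFLOPS\n", vp_ops / 2.80e-3 / 1e9);  /* ~7.49    */
    printf("C  1 core:       %.3f GFLOPS\n", c_ops / 14.0e-3 / 1e9);   /* ~1.12    */
    return 0;
}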

Part 2 N-Body on Cell


The Cell Architecture

The CellVP Compiler using Virtual SIMD Machine


Alignment and Synchronization

Performance on Cell

The Cell Heterogeneous Architecture


Year: 2007

Processors:
1 PowerPC core (PPE), 3.2 GHz, 512 MB RAM, 512 KB L2 cache, 32 KB L1 cache
8 Synergistic Processing Elements (SPEs), 3.2 GHz, 256 KB local store each

2 different instruction sets (hence 2 different compilers)
Memory Flow Controller (MFC) on each SPE (DMA, mailbox, signals)
Alignment boundary of 16 bytes (or 128 bytes for better performance)
Existing supported languages: C/C++ and Fortran

The CellVP Compiler System


Objective
An automatic parallelizing compiler using a virtual machine model

Aimed at
Array expressions in data-intensive applications

Built of
1. A PowerPC compiler
2. A Virtual SIMD Machine (VSM) model to access the SPEs

The PowerPC Compiler

Transforms sequential VP code into PPE code

Converts large array expressions into VM instructions

Appends to the prologue code to launch threads on the SPEs

Appends to the epilogue code to terminate the SPE threads

Virtual SIMD Machine (VSM) Model

VSM Instructions

Register-to-register instructions operate on virtual SIMD registers (1 KB - 16 KB)

They support basic operations (+, -, /, *, sqrt, \+, rep, etc.)
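For example, a whole-array statement such as x^ := x^ + v^ * dt (from the earlier radvance procedure) would under this model be evaluated by loading slices of v^ into virtual SIMD registers spread across the SPEs, multiplying by the replicated scalar dt (the rep operation), adding the corresponding slices of x^, and storing the result back. The actual VSM encoding is not shown in these slides; this is only an illustration of the register-to-register style.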

VSM Interpreter

1. The PPE opcode dispatcher:
   i.   Chops the data equally across the SPEs in use
   ii.  Formats messages (opcode, registers to be used, starting address)
   iii. Writes the messages to the SPEs' inbound mailboxes
   iv.  Waits for a completion acknowledgement from the SPEs (blocking mode)

2. The SPE interpreter (a program running in the background):
   i.   Checks its inbound mailbox for new messages
   ii.  On receiving a message, the SPE performs the required operation
   iii. Sends an acknowledgement on completion (if needed)
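A rough C sketch of this dispatch protocol. The message layout and the mailbox helpers (mbox_write, mbox_read_blocking) are hypothetical stand-ins, since the slides do not give the actual encoding or the MFC calls used.

#include <stdint.h>

#define NSPES 4

/* Hypothetical VSM message: one per SPE, written to its inbound mailbox. */
typedef struct {
    uint32_t opcode;      /* e.g. add, mul, sqrt, reduction            */
    uint32_t dest_reg;    /* virtual SIMD registers involved           */
    uint32_t src_reg;
    uint64_t start_addr;  /* starting address of this SPE's data chunk */
    uint32_t length;      /* bytes handled by this SPE                 */
} vsm_msg;

/* Hypothetical mailbox primitives standing in for the MFC mailbox operations. */
void mbox_write(int spe, const vsm_msg *m);
uint32_t mbox_read_blocking(int spe);

/* PPE opcode dispatcher: chop the register evenly, message each SPE,
   then block until every SPE has acknowledged completion. */
static void dispatch(uint32_t op, uint32_t dst, uint32_t src,
                     uint64_t base, uint32_t total_len)
{
    uint32_t chunk = total_len / NSPES;
    for (int s = 0; s < NSPES; s++) {
        vsm_msg m = { op, dst, src, base + (uint64_t)s * chunk, chunk };
        mbox_write(s, &m);               /* steps i-iii: chop, format, send      */
    }
    for (int s = 0; s < NSPES; s++)
        (void)mbox_read_blocking(s);     /* step iv: wait for acknowledgements   */
}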

The CellVP Compiler System


1. Generates PowerPC machine instructions (sequential code)
2. Generates VSM instructions to evaluate large arrays on the SPEs
3. The PPE handles:
   1. Data partitioning across the SPEs
   2. Communication (mailboxes)
4. The SPE handles:
   1. Alignment (load & store)
   2. Synchronization: parts of a data block may simultaneously be processed by the preceding or succeeding SPE

Alignment & Synchronization


[Diagram: a store operation on a 4 KB virtual SIMD register, chopped into 1 KB data blocks across 4 SPEs (Block0-SPE0 ... Block3-SPE3). Because the actual starting address differs from the aligned address, each block needs up to 3 DMA transfers, and each SPE sets a lock on the 128 B boundary region it shares with its neighbour.]
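A minimal C sketch of the address arithmetic implied by the diagram: an SPE's share of a store is split into at most three DMA transfers around the 128-byte alignment boundaries. Function and type names are illustrative, not the CellVP code.

#include <stdint.h>
#include <stddef.h>

typedef struct { uintptr_t addr; size_t len; } dma_xfer;

/* Split one SPE's share of a virtual-register store into up to three DMA
   transfers: a leading partial line, an aligned middle, a trailing partial line. */
static int split_store(uintptr_t start, size_t len, dma_xfer out[3])
{
    const uintptr_t ALIGN = 128;
    uintptr_t first_aligned = (start + ALIGN - 1) & ~(ALIGN - 1);  /* round up   */
    uintptr_t last_aligned  = (start + len)       & ~(ALIGN - 1);  /* round down */
    int n = 0;

    if (first_aligned >= start + len || last_aligned <= start) {   /* no full line */
        out[n++] = (dma_xfer){ start, len };
        return n;
    }
    if (first_aligned > start)                                     /* leading partial line  */
        out[n++] = (dma_xfer){ start, first_aligned - start };
    if (last_aligned > first_aligned)                              /* aligned middle        */
        out[n++] = (dma_xfer){ first_aligned, last_aligned - first_aligned };
    if (start + len > last_aligned)                                /* trailing partial line */
        out[n++] = (dma_xfer){ last_aligned, (start + len) - last_aligned };
    return n;
}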

N-Body Problem on the Cell


Code: the same as the Xeon version, at large scale (4 KB virtual SIMD registers)
Data structure: horizontal structure
Machine: PS3 (only four SPEs used)
Compilers: GNU C/C++ compiler version 4.1.2; Vector Pascal CellVP

Performance of VP&C on Xeon & Cell (GFLOPS)


(Operation counts per body are as in the previous table: 20971520 ops per timestep for Vector Pascal, 15728640 for C.)

Language / version        Machine   Cores   Time (ms)   GFLOPS total   GFLOPS per core
SIMD version Pascal       Xeon        1       14.36        1.460           1.460
SIMD version Pascal       Xeon        6        2.80        7.490           1.248
record version Pascal     Xeon        1       23.50        0.892           0.892
record version Pascal     Xeon        6        4.23        4.958           0.826
C version                 Xeon        1       14.00        1.123           1.123
Pascal (PPE)              Cell        1      381           0.055           0.055
Pascal (SPE)              Cell        1      105           0.119           0.119
Pascal (SPEs)             Cell        4       48           0.436           0.109
C (PPE, O3)               Cell        1       45           0.349           0.349

VP Performance on Large Problems


N-body performance (seconds per iteration):

                 Vector Pascal                              C
Problem size     PPE       1 SPE     2 SPEs    4 SPEs       PPE
1K               0.381     0.105     0.065     0.048        0.045
4K               4.852     1.387     0.782     0.470        0.771
8K               20.355    5.715     3.334     2.056        3.232
16K              100.250   22.278    13.248    8.086        16.524

Log log chart of performance of the Cell


[Log-log chart: time in seconds per iteration against degree of FPU parallelism for the 8K and 16K problem sizes, with power-law fits y = 20.5x^-0.813 (8K) and y = 97.9x^-0.9 (16K).]

Thank You. Any questions?
