Multicore Challenge in Vector Pascal: P Cockshott, Y Gdura

The document summarizes research on using the Vector Pascal programming language to optimize an N-body simulation problem for multicore CPUs and heterogeneous architectures like the IBM Cell processor. On Intel Nehalem CPUs, a SIMD-friendly Vector Pascal implementation achieved a speedup of over 4x compared to optimized C code by exploiting up to 7 cores. A new Cell-Vector Pascal compiler was also developed that used a virtual SIMD machine model to partition data and launch computations across the PowerPC core and 8 synergistic processors of the Cell architecture, achieving good scaling to 4 cores.


Multicore Challenge in Vector Pascal

P Cockshott, Y Gdura

N-body Problem
Part 1 (Performance on Intel Nehalem)

Introduction (Vector Pascal, machine specifications, N-body algorithm)
Data Structures (1D and 2D layouts)
Performance of single-threaded code (C and Vector Pascal)
Performance of multithreaded code (VP SIMD version)
Summary of performance on Nehalem

Part 2 (Performance on IBM Cell)


Introduction
New Cell-Vector Pascal (CellVP) Compiler
Performance on Cell (C and Vector Pascal)

Vector Pascal
Extends Pascal's support for array operations; designed to make use of SIMD instruction sets and multiple cores.
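As a point of comparison, the single whole-array statement used later in this deck, x^ := x^ + v^ * dt, expresses what a scalar C compiler would be handed as an explicit double loop. A minimal sketch, with illustrative array names and sizes rather than the benchmark's actual declarations:

#include <stddef.h>

#define DIMS 3
#define N    1024

/* What the Vector Pascal statement  x^ := x^ + v^ * dt  expresses in one line:
   an elementwise update over the whole 3 x N position matrix. */
static void update_positions(double x[DIMS][N], const double v[DIMS][N], double dt)
{
    for (size_t d = 0; d < DIMS; d++)
        for (size_t i = 0; i < N; i++)
            x[d][i] += v[d][i] * dt;   /* the compiler is free to vectorise this */
}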

Xeon Specifications
Hardware
Year: 2010
2 x Intel Xeon Nehalem (E5620): 8 cores, 16 threads, 2.4 GHz
24 GB RAM, 12 MB cache

Software
Linux
Vector Pascal compiler
GCC version 4.1.2

The N-Body Problem


For 1024 bodies, each time step:
  For each body B in 1024:
    Compute the force on it from each other body
    From these derive the partial accelerations
    Sum the partial accelerations
    Compute the new velocity of B
  For each body B in 1024:
    Compute the new position
(A C sketch of one time step follows.)
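A minimal, unoptimised C sketch of this two-phase time step (with G = 1, as in the reference code below; names are illustrative, not the benchmark source):

#include <math.h>

#define N 1024

typedef struct { double x[3], v[3], mass; } body;

/* One time step: first update all velocities from the summed partial
   accelerations, then update all positions from the new velocities. */
static void timestep(body b[N], double dt)
{
    for (int i = 0; i < N; i++) {                /* phase 1: velocities */
        double acc[3] = {0, 0, 0};
        for (int j = 0; j < N; j++) {            /* force from every other body */
            if (j == i) continue;
            double d[3], r2 = 0;
            for (int k = 0; k < 3; k++) { d[k] = b[j].x[k] - b[i].x[k]; r2 += d[k]*d[k]; }
            double r = sqrt(r2);
            double s = b[j].mass / (r * r * r);  /* partial acceleration magnitude */
            for (int k = 0; k < 3; k++) acc[k] += s * d[k];
        }
        for (int k = 0; k < 3; k++) b[i].v[k] += acc[k] * dt;
    }
    for (int i = 0; i < N; i++)                  /* phase 2: positions */
        for (int k = 0; k < 3; k++) b[i].x[k] += b[i].v[k] * dt;
}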

Data Structures
The C implementation stores the information as an array of structures, each of which is:

struct planet { double x, y, z; double vx, vy, vz; double mass; };

This array-of-structures layout does not align well with cache lines or SIMD registers.

Alternative Horizontal Structure

This layout aligns the vectors with the cache lines and with the vector registers
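A minimal C sketch of such a horizontal, structure-of-arrays layout (field names are illustrative): each coordinate row is contiguous, so it maps directly onto cache lines and SIMD registers.

#define N 1024

/* Structure of arrays: one contiguous row per component. */
struct planets {
    double x[3][N];    /* positions:  x[0] = x, x[1] = y, x[2] = z   */
    double v[3][N];    /* velocities: v[0] = vx, v[1] = vy, v[2] = vz */
    double mass[N];
};

A whole row such as x[0] can then be loaded in aligned, register-sized chunks, which is what the Vector Pascal array expressions exploit.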

The Reference C Version


for (i = 0; i < nbodies; i++) {
    struct planet * b = &(bodies[i]);
    for (j = i + 1; j < nbodies; j++) {
        struct planet * b2 = &(bodies[j]);
        double dx = b->x - b2->x;
        double dy = b->y - b2->y;
        double dz = b->z - b2->z;
        double distance = sqrt(dx * dx + dy * dy + dz * dz);
        double mag = dt / (distance * distance * distance);
        b->vx -= dx * b2->mass * mag;
        b->vy -= dy * b2->mass * mag;
        b->vz -= dz * b2->mass * mag;
        b2->vx += dx * b->mass * mag;
        b2->vy += dy * b->mass * mag;
        b2->vz += dz * b->mass * mag;
    }
}

Note that this version has side effects, so successive iterations of the outer loop cannot run in parallel: the inner loop updates the velocities of both bodies.

Equivalent Record Based Pascal


row := 0;
b := planets[i];
for j := 1 to n do begin
    b2 := planets[j];
    dx := b^.x - b2^.x;
    dy := b^.y - b2^.y;
    dz := b^.z - b2^.z;
    distance := sqrt(dx * dx + dy * dy + dz * dz);
    mag := dt * b2^.mass / (distance * distance * distance + epsilon);
    row[1] := row[1] - dx * mag;
    row[2] := row[2] - dy * mag;
    row[3] := row[3] - dz * mag;
end;

This is side-effect free, as the total change in the velocity of the ith planet is built up in a local row vector which is added to the planet velocities later.

Complexity and Performance Comparison


Timings below are for single-threaded code on the Xeon:

Language         Unoptimised    -O3
Vector Pascal    28.9 ms        23.5 ms
C                30 ms          14 ms

Note: the Pascal code performs N² operations while the C code does N²/2, since it exploits the symmetry of the force calculation.
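As a rough per-interaction check from the figures above: the optimised C code does N²/2 ≈ 524,288 interaction updates in 14 ms, about 27 ns each, while the -O3 Pascal code does N² ≈ 1,048,576 in 23.5 ms, about 22 ns each.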

SIMD-friendly version: no explicit inner loop


pure function computevelocitychange(start:integer):coord;
{ declarations: M: pointer to mass vector, x: pointer to position matrix,
  di: displacement matrix, distance: vector of distances }
begin
   row := x^[iota[0],i];
   { Compute the displacement vector between each planet and planet i. }
   di := row[iota[0]] - x^;
   { Next compute the euclidean distances }
   xp := @di[1,1]; yp := @di[2,1]; zp := @di[3,1];   { point at the rows }
   distance := sqrt(xp^*xp^ + yp^*yp^ + zp^*zp^) + epsilon;
   mag := dt/(distance*distance*distance);
   changes.pos := \+ (M^*mag*di);
end
The row summation operator \+ builds the x, y, z components of dv.
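Read operationally, the array expressions above amount to the following per-planet computation, sketched here in scalar C. Names mirror the Pascal; this only illustrates the semantics, it is not the generated code.

#include <math.h>

#define N 1024

/* x[3][N]: positions, M[N]: masses. Accumulates the change in velocity of
   planet i over all planets j, mirroring di, distance, mag and the \+ reduction. */
static void velocity_change(const double x[3][N], const double M[N],
                            int i, double dt, double eps, double dv[3])
{
    dv[0] = dv[1] = dv[2] = 0.0;
    for (int j = 0; j < N; j++) {
        double di[3];                              /* column j of the displacement matrix */
        for (int k = 0; k < 3; k++) di[k] = x[k][i] - x[k][j];
        double distance = sqrt(di[0]*di[0] + di[1]*di[1] + di[2]*di[2]) + eps;
        double mag = dt / (distance * distance * distance);
        for (int k = 0; k < 3; k++)                /* the row summation \+ */
            dv[k] += M[j] * mag * di[k];
    }
}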

Pack this up in Pure Function Applied in Parallel


procedure radvance(dt:real);
var dv: array[1..n,1..1] of coord;
    i,j: integer;

   pure function computevelocitychange(i:integer; dt:real):coord;
   begin
      { --- do the computation on the last slide }
      computevelocitychange := changes.pos;
   end;

begin
   { iota[0] is the 0th index vector; the elements of the left hand side
     can be evaluated in parallel }
   dv := computevelocitychange(iota[0], dt);
   for i := 1 to N do                           { iterate on planets }
      for j := 1 to 3 do                        { iterate on dimensions }
         v^[j,i] := v^[j,i] + dv[i,1].pos[j];   { update velocities }
   x^ := x^ + v^ * dt;                          { Finally update positions. }
end;

Now Compile with Multiple Cores


The program is unchanged, compiled with from 1 to 16 cores, for example: vpc V12 cpugnuP4 cores8
Log-log plot: X axis threads, Y axis time in seconds, 256 runs. Mean time for 7 cores = 5.2 ms.
[Chart: power-law fits: V12 record y = 0.0229x^-0.895, C version y = 0.0146x^-0.002, V12 rec hyper (hyperthreaded) y = 0.0284x^-0.768.]

Combined SIMD Multicore Performance


[Log-log chart of time against threads for the V12 record, v8 SIMD hyper, non-hyper, C and V12 rec hyper versions. Power-law fits include y = 0.0229x^-0.895 (V12 record), y = 0.0284x^-0.768 (V12 rec hyper), and y = 0.009x^-0.448 and y = 0.0135x^-0.842 for the SIMD versions.]

Summary Time per Iteration


Best performance on the Xeon was obtained using 7 cores.

Time per iteration:
C optimised, 1 core            14 ms
SIMD code Pascal, 1 core       16 ms
SIMD code Pascal, 7 cores      2.25 ms
Record code Pascal, 1 core     23 ms
Record code Pascal, 7 cores    3.75 ms

SIMD performance scales as c^0.84 and record performance scales as c^0.89, where c is the number of cores.

Performance in GFLOPS
We pick the 6-core versions as they give the peak flops, being just before the hyper-threading transition, which affects the 7- and 8-thread versions.
Op.s per body               Vector Pascal         C
compute displacement              3               3
get distance                      6               6
compute mag                       5               3
evaluate dv                       6              18
total per inner loop             20              30
times round inner loop         1024             512
times round outer loop         1024            1024
total per timestep         20971520        15728640

Language / version        Cores (Xeon)   Time (ms)   GFLOPS total   GFLOPS per core
SIMD version Pascal            1           14.36        1.460           1.460
SIMD version Pascal            6            2.80        7.490           1.248
record version Pascal          1           23.50        0.892           0.892
record version Pascal          6            4.23        4.958           0.826
C version                      1           14.00        1.123           1.123
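As a check on the arithmetic behind the table, each GFLOPS figure is simply operations per timestep divided by time. A minimal C sketch reproducing two of the entries:

#include <stdio.h>

int main(void)
{
    /* Operation counts per timestep from the table above. */
    double vp_ops = 20.0 * 1024 * 1024;   /* Vector Pascal: 20 ops over the full N*N inner loop   */
    double c_ops  = 30.0 *  512 * 1024;   /* C: 30 ops over the half (N*N/2) inner loop           */

    printf("VP ops/timestep: %.0f\n", vp_ops);                         /* 20971520 */
    printf("C  ops/timestep: %.0f\n", c_ops);                          /* 15728640 */
    printf("VP SIMD 6 cores: %.3f GFLOPS\n", vp_ops / 2.80e-3 / 1e9);  /* ~7.49    */
    printf("C  1 core:       %.3f GFLOPS\n", c_ops / 14.0e-3 / 1e9);   /* ~1.12    */
    return 0;
}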

Part 2 N-Body on Cell


The Cell Architecture

The CellVP Compiler using Virtual SIMD Machine


Alignment and Synchronization

Performance on Cell

The Cell Heterogeneous Architecture


Year: 2007

Processors:
1 PowerPC core (PPE), 3.2 GHz, 512 MB RAM, 512 KB L2 cache, 32 KB L1 cache
8 Synergistic Processing Elements (SPEs), 3.2 GHz, 256 KB local store each

2 different instruction sets (hence 2 different compilers)
Memory Flow Controller (MFC) on each SPE (DMA, mailbox, signals)
Alignment boundary of 16 bytes (or 128 bytes for better performance)
Existing supported languages: C/C++ and Fortran

The CellVP Compiler System


Objective
An automatic parallelizing compiler using a virtual machine model

Aimed at
Array expressions in data-intensive applications

Built of
1. A PowerPC compiler
2. A Virtual SIMD Machine (VSM) model to access the SPEs

The PowerPC Compiler

Transforms sequential VP code into PPE code

Converts large array expressions into VM instructions

Appends to the prologue code to launch threads on the SPEs

Appends to the epilogue code to terminate the SPE threads

Virtual SIMD Machine (VSM) Model

VSM Instructions

Register-to-register instructions operate on virtual SIMD registers (1 KB - 16 KB)

They support basic operations (+, -, /, *, sqrt, \+, rep, etc.)
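For example, a whole-array statement such as x^ := x^ + v^ * dt (from the earlier radvance procedure) would under this model be evaluated by loading slices of v^ into virtual SIMD registers spread across the SPEs, multiplying by the replicated scalar dt (the rep operation), adding the corresponding slices of x^, and storing the result back. The actual VSM encoding is not shown in these slides; this is only an illustration of the register-to-register style.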

VSM Interpreter

1. The PPE opcode dispatcher:
   i.   Chops the data equally across the SPEs in use
   ii.  Formats messages (opcode, registers to be used, starting address)
   iii. Writes the messages to the SPEs' inbound mailboxes
   iv.  Waits for a completion acknowledgement from the SPEs (blocking mode)

2. The SPE interpreter (a program running in the background):
   i.   Checks its inbound mailbox for new messages
   ii.  On receiving a message, the SPE performs the required operation
   iii. Sends an acknowledgement on completion (if needed)
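A rough C sketch of this dispatch protocol. The message layout and the mailbox helpers (mbox_write, mbox_read_blocking) are hypothetical stand-ins, since the slides do not give the actual encoding or the MFC calls used.

#include <stdint.h>

#define NSPES 4

/* Hypothetical VSM message: one per SPE, written to its inbound mailbox. */
typedef struct {
    uint32_t opcode;      /* e.g. add, mul, sqrt, reduction            */
    uint32_t dest_reg;    /* virtual SIMD registers involved           */
    uint32_t src_reg;
    uint64_t start_addr;  /* starting address of this SPE's data chunk */
    uint32_t length;      /* bytes handled by this SPE                 */
} vsm_msg;

/* Hypothetical mailbox primitives standing in for the MFC mailbox operations. */
void mbox_write(int spe, const vsm_msg *m);
uint32_t mbox_read_blocking(int spe);

/* PPE opcode dispatcher: chop the register evenly, message each SPE,
   then block until every SPE has acknowledged completion. */
static void dispatch(uint32_t op, uint32_t dst, uint32_t src,
                     uint64_t base, uint32_t total_len)
{
    uint32_t chunk = total_len / NSPES;
    for (int s = 0; s < NSPES; s++) {
        vsm_msg m = { op, dst, src, base + (uint64_t)s * chunk, chunk };
        mbox_write(s, &m);               /* steps i-iii: chop, format, send      */
    }
    for (int s = 0; s < NSPES; s++)
        (void)mbox_read_blocking(s);     /* step iv: wait for acknowledgements   */
}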

The CellVP Compiler System


1. Generates PowerPC machine instructions (sequential code)
2. Generates VSM instructions to evaluate large arrays on the SPEs
3. The PPE handles:
   1. Data partitioning across the SPEs
   2. Communication (mailboxes)
4. The SPE handles:
   1. Alignment (load & store)
   2. Synchronization: parts of a data block may simultaneously be processed by the preceding or succeeding SPE

Alignment & Synchronization


[Diagram: a store operation on a 4 KB virtual SIMD register, chopped into 1 KB data blocks across 4 SPEs (Block0-SPE0 ... Block3-SPE3). Because the actual starting address differs from the aligned address, each block needs up to 3 DMA transfers, and each SPE sets a lock on the 128 B boundary region it shares with its neighbour.]
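A minimal C sketch of the address arithmetic implied by the diagram: an SPE's share of a store is split into at most three DMA transfers around the 128-byte alignment boundaries. Function and type names are illustrative, not the CellVP code.

#include <stdint.h>
#include <stddef.h>

typedef struct { uintptr_t addr; size_t len; } dma_xfer;

/* Split one SPE's share of a virtual-register store into up to three DMA
   transfers: a leading partial line, an aligned middle, a trailing partial line. */
static int split_store(uintptr_t start, size_t len, dma_xfer out[3])
{
    const uintptr_t ALIGN = 128;
    uintptr_t first_aligned = (start + ALIGN - 1) & ~(ALIGN - 1);  /* round up   */
    uintptr_t last_aligned  = (start + len)       & ~(ALIGN - 1);  /* round down */
    int n = 0;

    if (first_aligned >= start + len || last_aligned <= start) {   /* no full line */
        out[n++] = (dma_xfer){ start, len };
        return n;
    }
    if (first_aligned > start)                                     /* leading partial line  */
        out[n++] = (dma_xfer){ start, first_aligned - start };
    if (last_aligned > first_aligned)                              /* aligned middle        */
        out[n++] = (dma_xfer){ first_aligned, last_aligned - first_aligned };
    if (start + len > last_aligned)                                /* trailing partial line */
        out[n++] = (dma_xfer){ last_aligned, (start + len) - last_aligned };
    return n;
}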

N-Body Problem on the Cell


Code: the same as the Xeon version, at large scale (4 KB virtual SIMD registers)
Data structure: horizontal structure
Machine: PS3 (only four SPEs used)
Compilers: GNU C/C++ compiler version 4.1.2; Vector Pascal CellVP

Performance of VP&C on Xeon & Cell (GFLOPS)


(Operation counts per body are as in the previous table: 20971520 ops per timestep for Vector Pascal, 15728640 for C.)

Language / version        Machine   Cores   Time (ms)   GFLOPS total   GFLOPS per core
SIMD version Pascal       Xeon        1       14.36        1.460           1.460
SIMD version Pascal       Xeon        6        2.80        7.490           1.248
record version Pascal     Xeon        1       23.50        0.892           0.892
record version Pascal     Xeon        6        4.23        4.958           0.826
C version                 Xeon        1       14.00        1.123           1.123
Pascal (PPE)              Cell        1      381           0.055           0.055
Pascal (SPE)              Cell        1      105           0.119           0.119
Pascal (SPEs)             Cell        4       48           0.436           0.109
C (PPE, O3)               Cell        1       45           0.349           0.349

VP Performance on Large Problems


N-body performance (seconds per iteration):

                 Vector Pascal                              C
Problem size     PPE       1 SPE     2 SPEs    4 SPEs       PPE
1K               0.381     0.105     0.065     0.048        0.045
4K               4.852     1.387     0.782     0.470        0.771
8K               20.355    5.715     3.334     2.056        3.232
16K              100.250   22.278    13.248    8.086        16.524

Log log chart of performance of the Cell


[Log-log chart: time in seconds per iteration against degree of FPU parallelism for the 8K and 16K problem sizes, with power-law fits y = 20.5x^-0.813 (8K) and y = 97.9x^-0.9 (16K).]

Thank You. Any questions?
